Full-text search using PostgreSQL: simple, fast, cached

Let’s find out what can we achieve; using what we already have, without adding other layers, servers or applications: PostgreSQL. It will help us keep the architecture simple, please be patient 🙂

We need to prepare records for full-text search using 1-to-1 relationship for relevant records in other tables with search term vector from one or more text fields (the assumption is that each of those tables has a unique integer field):

CREATE TABLE public.search_index ( table_name VARCHAR(50) NOT NULL, id INTEGER NOT NULL, search_vector TSVECTOR, CONSTRAINT search_index_pkey PRIMARY KEY(table_name, id) ) WITH (oids = false);

Here is the infamous full-text search index on the search term vector:

CREATE INDEX search_index_idx_search_vector ON public.search_index USING gin (search_vector);

We need to insert and update records in our search index table; for example, when we add a user record in the users table using the text fields such as address details and technical skills.

INSERT INTO search_index (table_name, id, search_vector) SELECT 'users' AS table_name, p.user_id AS id, to_tsvector( COALESCE(u.name, '') || ' ' || COALESCE(u.surname, '') || ' ' || COALESCE(u.address_1, '') || ' ' || COALESCE(u.address_2, '') || ' ' || COALESCE(u.city, '') || ' ' || COALESCE(u.country, '') || ' ' || COALESCE(u.skills, '') ) AS search_vector FROM users u;

NOTE: It is a good idea to write trigger function(s) (after insert, after update and after delete) on source table(s) to take care of synchronizing the records in search_index table.

How can we use the search index table?
Very simple: using to_tsquery() function and powerful PostgreSQL operators:

SELECT i.id, ts_rank(i.search_vector, to_tsquery('london')) AS rank FROM search_index i WHERE i.table_name = 'users' AND i.search_vector @@ to_tsquery('london') LIMIT 10 OFFSET 0 ;

We can easily join this query with the source table and get the results straight away:

SELECT q.rank, u.* FROM ( SELECT i.id, ts_rank(i.search_vector, to_tsquery('london')) AS rank FROM search_index i WHERE i.table_name = 'users' AND i.search_vector @@ to_tsquery('london') LIMIT 10 OFFSET 0 ) q INNER JOIN users u ON q.id = u.id ;

However, we can go further and add a table to cache the search terms for each table that we want to enable full-text search:

CREATE TABLE public.search_cache ( search_cache_id SERIAL NOT NULL, table_name VARCHAR(50) NOT NULL, -- also used in search_index.table_name term VARCHAR(50) NOT NULL, created_at TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, updated_at TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, usage_count INTEGER DEFAULT 1 NOT NULL, PRIMARY KEY(search_cache_id) ) WITH (oids = false);

We don’t want to duplicate values for table name and term; let’s create a unique index:

CREATE UNIQUE INDEX search_cache_unq_table_name_and_term ON public.search_cache USING btree (table_name COLLATE pg_catalog."default", term COLLATE pg_catalog."default");

Here is an example to populate it:

INSERT INTO search_cache (table_name, term) VALUES ('users', 'london') RETURNING search_cache_id; -- ==> e.g. 1

Then, we can store row IDs from the result of a search cache:

CREATE TABLE public.search_cache_row ( search_cache_id INTEGER NOT NULL, -- refers to search_cache.search_cache_id id INTEGER NOT NULL, -- also used in search_index.id rank REAL NOT NULL, CONSTRAINT search_cache_row_pkey PRIMARY KEY(search_cache_id, id) ) WITH (oids = false);

This is how we can populate that table:

INSERT INTO search_cache_row (search_cache_id, id, rank) SELECT 1 AS search_cache_id, i.id, ts_rank(i.search_vector, to_tsquery('london')) AS rank FROM search_index i WHERE i.table_name = 'users' AND i.search_vector @@ to_tsquery('london') LIMIT 100 OFFSET 0 ;

As long as the source records and search term do not change, the final result will not change, so we can retrieve the results faster. We can also store recent search terms in memory by using Memcached.

Finally, the following code shows how to use the cache rows:

SELECT q.rank, u.* FROM ( SELECT c.id, c.rank FROM search_cache_row c WHERE c.search_cache_id = 1 LIMIT 10 OFFSET 0 ) q INNER JOIN users u ON q.id = u.id ;

Reference: http://www.postgresql.org/docs/9.4/static/textsearch-tables.html

Tags: Performance, PostgreSQL, Simplicity, SQL

Categories:

Technology

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Full-text search using PostgreSQL: simple, fast, cached

Leave a Reply

Categories

Archives