Using PostgreSQL for an Ajax Autocomplete Field
I recently implemented an Ajax based autocomplete field for a web application using a PostgreSQL database on the back-end. I had trouble finding appropriate PostgreSQL queries, so I’ve decided to share my approach in the hope that it will be useful to others.
The application uses the following PostgreSQL table to store the author data:
CREATE TABLE authors (
authors_id serial PRIMARY KEY,
author_name character varying UNIQUE NOT NULL
The author_name field represents the full name of an author. E.g. ‘jane austen’, ‘john smith’, ‘john q. public’, ‘william shakespeare’, etc.
Creating the initial version of the autocomplete field was relatively straightforward. To handle the Ajax we simply added the following method to our controller:
sub json_author_search : Local
my ( $self, $c, $dashboards_id ) = @_;
my $term = $c->req->param( 'term' ) || 0;
$term = '%' . $term . '%';
my $terms =
$c->dbis->query( "select authors_id, author_name as label from authors where author_name ilike ? ", $term )->hashes;
$c->res->body( encode_json( $terms ) );
This code was fully functional and worked with small data sets on a development machine. However, once we started testing on our production machine with a large dataset, it became clear that there were performance problems.
Using an explain showed that PostgreSQL was doing a sequential scan. I.e. it was reading through the entire database table for every query:
mediacloud=# explain select authors_id, author_name as label from authors where author_name ilike '%bla%';
Seq Scan on authors (cost=0.00..37.33 rows=1 width=22)
Filter: ((author_name)::text ~~* '%bla%'::text)
I did some web searches looking for a solution. I found Nikolay Samokhvalov’s slides on Using PostgresSQL In Web 2.0 Applications.(http://www.scribd.com/doc/4844027/Using-PostgreSQL-In-Web-20-Applications-)
These slides have useful information and are worth reading. However, their suggested approaches focus on key word search rather than autocomplete. The first approach that Samokhvalov gives is to use prefix search. This solution uses like ‘bla%’ queries and relies on text_pattern_ops indexes for quick performance. Samokhvalov’s other tip is to use lower instead of using ilike. So in our case we would add the following index:
CREATE INDEX author_name_prefix ON authors USING btree ( lower (author_name) text_pattern_ops);
and change our search query to:
$term = ‘%’ . $term;
$c->dbis->query( "select authors_id, author_name as label from authors where lower(author_name) like ? ", $term )->hashes;
This approach is fast but it doesn’t give us the behavior that we wanted. For example, if the user typed ‘sm’, we would want ‘john smith’ to be included in the search results. However, “john smith” would only be matched if the user instead typed ‘jo’.
Another solution that Samkhvalov suggests is using GIN indexes with full text search. This is an interesting approach but it didn’t seem like the right fit for an autocomplete service. As far as I was able to determine, PostgreSQL’s full text search is only designed to match entire words so ‘sm’ would not match ‘smith’. Finally Samkhvalov suggests an interesting hybrid approach that uses both prefix search and full text search with GIN indexes. This approach relies on having a separate table that contains all the distinct words in every string in the table that’s being searched. So in our case we would have an author words table that contained words such as ‘alice’,’bob’,’brown’,’david’,’smith’,etc. This approach might work well for relatively static data — creating the initial table of distinct words is easy. However, in our data set we are constantly adding new author strings. I wanted to avoid the additional complexity of managing and updating an author words table.
I added the following indexes to the authors table:
create index authors_name_varchar_pattern on authors(lower(author_name) varchar_pattern_ops);
create index authors_name_varchar_pattern_1 on authors(lower(split_part(author_name, ' ', 1)) varchar_pattern_ops);
create index authors_name_varchar_pattern_2 on authors(lower(split_part(author_name, ' ', 2)) varchar_pattern_ops);
create index authors_name_varchar_pattern_3 on authors(lower(split_part(author_name, ' ', 3)) varchar_pattern_ops);
Then I changed the query to:
$term = $term . '%';
my $terms =
$c->dbis->query( "select authors_id, author_name as label from authors where lower(author_name) like lower(?) OR " .
" lower(split_part(author_name, ' ', 1)) like lower(?) OR " .
" lower(split_part(author_name, ' ', 2)) like lower(?) OR " .
" lower(split_part(author_name, ' ', 3)) like lower(?) LIMIT 100 ",
$term, $term, $term, $term )->hashes
The above query performs prefix searches on each of the first three words within the author name string. Additionally, it performs a prefix search on the entire author name string. We OR these prefix searches so that we return the results that match any of the prefix searches. The nice thing about this search is that the user can start typing either the first name or the last name (or the middle name). For example, when the user starts typing ‘joh’, they’ll get a list that includes the ‘john’ names. When the user starts typing ‘smi’, they’ll get a list that includes the ‘smith’ names. Because we also do a prefix search on the entire string, the user could start typing ‘john’ to get a list of names starting with ‘john’ and then expand that to ‘john sm’ to get a list of ‘john smith’s and similar names.
(Note: it might have been safe to omit
lower(split_part(author_name, ' ', 1)) like lower(?) from the query since it will often be redundant with
lower(author_name) like lower(?). We decided to leave it in place to enhance readability and out of concern that it may be necessary to handle improperly formatted author strings.)
This approach does have its weakness. It won’t work for names involving more than 3 words. Fortunately those names are rare. It would also have been nice to be able to do full wild card searches such as like ‘%bla%’. Unfortunately, PostgreSQL does not have any index to optimize those types of searches.
In an ideal world, the database would have a set of indexes and operators that would perfectly handle the search queries necessary for autocomplete and abstract them away from the programmer. However, I’ve presented a solution that has performed well for me and has added only a minimal amount of complexity to the code and the database schema. I’d love to hear comments or suggestions for alternate approaches.