Using full text dictionaries in PostgreSQL

May 11, 2016 02:25

This is a technical post to inform FTS users about several improvements in working with full text dictionaries. Detailed information about full text dictionaries is available in the Postgres documentation.

First, a useful link from the Postgres Professional GitHub repository: the hunspell_dicts repo contains several extensions that facilitate installing Hunspell dictionaries. Follow the instructions in the README file to install them. The procedure of installing a Hunspell dictionary now reduces to a single CREATE EXTENSION command, for example:

CREATE EXTENSION hunspell_ru_ru; -- creates russian_hunspell dictionary
CREATE EXTENSION hunspell_en_us; -- creates english_hunspell dictionary
CREATE EXTENSION hunspell_nn_no; -- creates norwegian_hunspell dictionary
SELECT ts_lexize('english_hunspell', 'evening');
ts_lexize
----------------
{evening,even}
(1 row)

Time: 57.612 ms
SELECT ts_lexize('russian_hunspell', 'туши');
ts_lexize
------------------------
{туша,тушь,тушить,туш}
(1 row)

Time: 382.221 ms
SELECT ts_lexize('norwegian_hunspell','fotballklubber');
ts_lexize
--------------------------------
{fotball,klubb,fot,ball,klubb}
(1 row)

Time: 323.046 ms

Notice the horribly long time it takes to run ts_lexize(): the dictionary is loaded into the backend's memory on first use. Fortunately, the next time you run ts_lexize() in the same session, it is much faster.

SELECT ts_lexize('norwegian_hunspell','fotballklubber');
ts_lexize
--------------------------------
{fotball,klubb,fot,ball,klubb}
(1 row)

Time: 0.235 ms
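Once created, such a dictionary can be plugged into a text search configuration so that to_tsvector() and friends use it. A minimal sketch (the configuration name my_english is a hypothetical example, not from the repo):

```sql
-- Copy the built-in english configuration under a new name
-- (my_english is a hypothetical name for this sketch).
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);

-- Try the hunspell dictionary first; fall back to the snowball
-- stemmer for words the dictionary does not recognize.
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword, word
    WITH english_hunspell, english_stem;

SELECT to_tsvector('my_english', 'evening clubs');
```

Listing two dictionaries in the mapping is the usual pattern: ts_lexize() returning NULL for an unknown word makes Postgres consult the next dictionary in the chain.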

Unfortunately, every new session has to load the dictionary into memory again, so the first query is slow. This is easy to demonstrate: just exit psql and start a new session. The shared_ispell extension loads a full text dictionary into shared memory, where it stays persistent across sessions. Let me demonstrate the effect (assuming you have compiled and installed the shared_ispell extension).

CREATE EXTENSION shared_ispell;
CREATE TEXT SEARCH DICTIONARY english_shared (
    TEMPLATE = shared_ispell,
    DictFile = en_us,
    AffFile = en_us,
    StopWords = english
);
CREATE TEXT SEARCH DICTIONARY russian_shared (
    TEMPLATE = shared_ispell,
    DictFile = ru_ru,
    AffFile = ru_ru,
    StopWords = russian
);
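The shared dictionaries behave like regular ones from SQL, so they can be sanity-checked the same way; the first call anywhere in the cluster populates shared memory, and later sessions reuse it:

```sql
-- First call loads the dictionary into shared memory;
-- subsequent calls, even from other sessions, reuse it.
SELECT ts_lexize('english_shared', 'evening');
SELECT ts_lexize('russian_shared', 'туши');
```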

time for i in {1..10}; do echo $i; psql postgres -c "select ts_lexize('english_hunspell', 'evening')" > /dev/null; done
1
2
3
4
5
6
7
8
9
10

real 0m0.656s
user 0m0.015s
sys 0m0.031s
time for i in {1..10}; do echo $i; psql postgres -c "select ts_lexize('english_shared', 'evening')" > /dev/null; done
1
2
3
4
5
6
7
8
9
10

real 0m0.095s
user 0m0.015s
sys 0m0.025s

The benefit of using a shared dictionary is much bigger for the Russian dictionary.

time for i in {1..10}; do echo $i; psql postgres -c "select ts_lexize('russian_hunspell', 'туши')" > /dev/null; done
1
2
3
4
5
6
7
8
9
10

real 0m3.809s
user 0m0.015s
sys 0m0.029s

time for i in {1..10}; do echo $i; psql postgres -c "select ts_lexize('russian_shared', 'туши')" > /dev/null; done
1
2
3
4
5
6
7
8
9
10

real 0m0.170s
user 0m0.015s
sys 0m0.027s

The performance win is not the main benefit of a shared dictionary, since the load time can also be hidden with persistent connections and pooling. The main benefit is memory: dictionaries require a lot of it, and if you have many sessions, the total memory occupied by dictionaries can be very big. For example, the Russian dictionary occupies about 20 MB when loaded, for a single session. For a hundred sessions, the memory spent just to keep the dictionary loaded would be 2 GB; that is a waste of memory!
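Since shared_ispell stores dictionaries in a fixed-size shared memory segment reserved at server start, the library has to be preloaded and the segment sized in postgresql.conf. A sketch of the relevant settings (the GUC names follow the extension's README; the 32MB value is an arbitrary example, pick a size that fits your dictionaries):

```
shared_preload_libraries = 'shared_ispell'
shared_ispell.max_size = 32MB
```

A server restart is required for these settings to take effect.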
Kudos to Thomas Vondra, who wrote the original version of the extension, and to Arthur Zakirov, who added support for affixes that use full regular expressions.

Also, the 9.6 release brings a rather big improvement in Hunspell dictionary support:

commit f4ceed6ceba31a72ed7a726fef05d211641f283c
Author: Teodor Sigaev
Date: Thu Mar 17 17:23:38 2016 +0300

Improve support of Hunspell

- allow to use non-ascii characters as affix flag. Non-numeric affix flags now
are stored as string instead of numeric value of character.
- allow to use 0 as affix flag in numeric encoded affixes

That adds support for arabian, hungarian, turkish and
brazilian portuguese languages.

Author: Artur Zakirov with heavy editorization by me

