Full-Text Search of JSON[b] data in PostgreSQL

Apr 05, 2017 13:26

I always thought that full-text search in json is overestimated feature of Oracle, but new SQL/JSON specification in SQL Standard 2016 changed my mind and I started to think about adding this to postgres. Artur Zakirov made a quick prototype, but since he was quite busy with other projects, I asked Dmitry Dolgov to work on this feature before feature freeze. The implementation is quite straightforward and Depesz already blogged about it, but there is one interesting moment - support of phrase search, so we should avoid an accident matching of phrase between values of different keys. This was solved by adding an artificial extra position on the boundary between values of different keys. In the example it's clear seen, that 'moscow' has position of 5 instead of 4.

select to_tsvector(jb) from (values ('
{
"abstract": "It is a very long story about true and false",
"title": "Peace and War",
"publisher": "Moscow International house"
}
'::json)) foo(jb);
to_tsvector
------------------------------------------------------------------------------------------
'fals':10 'hous':18 'intern':17 'long':5 'moscow':16 'peac':12 'stori':6 'true':8 'war':14

Notice, also, that for jsonb we'll get quite a different result, because keys in jsonb are sorted.

select to_tsvector(jb) from (values ('
{
"abstract": "It is a very long story about true and false",
"title": "Peace and War",
"publisher": "Moscow International house"
}
'::jsonb)) foo(jb);
to_tsvector
------------------------------------------------------------------------------------------
'fals':14 'hous':18 'intern':17 'long':9 'moscow':16 'peac':1 'stori':10 'true':12 'war':3

Now, we can see, that phrase search is really works as expected:

select phraseto_tsquery('english','war moscow') @@ to_tsvector(jb) from (values ('
{
"abstract": "It is a very long story about true and false",
"title": "Peace and War",
"publisher": "Moscow International house"
}
'::jsonb)) foo(jb);
?column?
----------
f
(1 row)
select phraseto_tsquery('english','moscow international') @@ to_tsvector(jb) from (values ('
{
"abstract": "It is a very long story about true and false",
"title": "Peace and War",
"publisher": "Moscow International house"
}
'::jsonb)) foo(jb);
?column?
----------
t
(1 row)

I'd like to thank also Andrew Dunstan, who spent his time to review the patch during PGConf.US and committed to Postgres 10.

I think that for the first version this is ok, but in future versions I would like to see support of fts in specified part of json[b], kind of json_fts(json, 'jsonpath').

fts, pg, pgen

Previous post Next post
Up