This is a technical post to record my findings about the RUM index, which we introduced at PGCon-2016 in Ottawa. The original idea of the RUM index appeared in 2012, when Alexander Korotkov and I presented our milliseconds FTS in Prague. This year, inspired by Alexander's work on extensibility, committed to 9.6, I asked my colleagues at Postgres Professional, Teodor Sigaev and Artur Zakirov, to implement it as a full-fledged access method packaged as an extension. The name RUM was chosen to continue the GIN...VODKA range, but some people suggested "Really Useful Method" or "Russian Useful Method" :)
The public RUM repository is available at
https://github.com/postgrespro/rum. We know of several bugs in RUM and hope to fix them soon.
The table pglist contains about a million posts from the postgres mailing lists; fts is the full-text representation of a post - a tsvector built from subject, body_plain and author.
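For reference, the fts column can be populated with something like the following (the exact text search configuration and field concatenation used for pglist are my assumption here):
update pglist set fts =
    to_tsvector('english', coalesce(subject,'')    || ' ' ||
                           coalesce(body_plain,'') || ' ' ||
                           coalesce(author,''));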
I created several fts indices (creation statements are sketched after the list):
1. GIN index
2. RUM index - (GIN + positions as addinfo)
3. RUM index with sent time as addinfo
4. same as 3, but postings are sorted by sent time
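Reconstructed from the index definitions shown in \d below, the creation statements look roughly like this (the WITH options are the rum extension's syntax for attaching an additional column to the posting lists):
CREATE EXTENSION rum;
CREATE INDEX pglist_fts_gin_idx4 ON pglist USING gin (fts);
CREATE INDEX pglist_fts_rum_idx4 ON pglist USING rum (fts rum_tsvector_ops);
CREATE INDEX pglist_fts_ts_rum_idx4 ON pglist
    USING rum (fts rum_tsvector_timestamp_ops, sent)
    WITH (attach='sent', "to"='fts');
CREATE INDEX pglist_fts_ts_order_rum_idx4 ON pglist
    USING rum (fts rum_tsvector_timestamp_ops, sent)
    WITH (attach='sent', "to"='fts', order_by_attach='t');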
A RUM index is bigger than GIN, since it contains more information, which allows it to speed up certain types of queries (the numbers refer to the index list above; example queries are sketched below), namely:
2. ranked fts - the index returns results in the order of position-based relevance.
3. fts results ordered by sent time (latest fts results) - the index returns results in the order of the sent time stored in the postings.
4. even faster than 3, since postings are stored in sent-time order.
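Roughly, the queries that benefit look like this (a sketch using the distance operators provided by the rum extension; the search terms and the timestamp are just placeholders):
-- 2: ranked fts, results ordered by relevance
SELECT id, subject
  FROM pglist
 WHERE fts @@ to_tsquery('english', 'tuple & header')
 ORDER BY fts <=> to_tsquery('english', 'tuple & header')
 LIMIT 10;

-- 3 and 4: fts results ordered by sent time, closest to a given moment
SELECT id, sent, subject
  FROM pglist
 WHERE fts @@ to_tsquery('english', 'tuple & header')
 ORDER BY sent <=> '2016-05-01 00:00:00'
 LIMIT 10;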
The price for this speedup is the size of the index!
\d pglist
Table "public.pglist"
Column | Type | Modifiers
------------+-----------------------------+-----------
id | integer |
sent | timestamp without time zone |
subject | text |
author | text |
body_plain | text |
fts | tsvector |
Indexes:
"pglist_fts_gin_idx4" gin (fts)
"pglist_fts_rum_idx4" rum (fts rum_tsvector_ops)
"pglist_fts_ts_order_rum_idx4" rum (fts rum_tsvector_timestamp_ops, sent) WITH (attach=sent, "to"=fts, order_by_attach=t)
"pglist_fts_ts_rum_idx4" rum (fts rum_tsvector_timestamp_ops, sent) WITH (attach=sent, "to"=fts)
"pglist_sent_idx" btree (sent)
select pg_size_pretty(sum(pg_column_size(fts))) as fts,
       pg_size_pretty(pg_table_size('pglist_fts_gin_idx4')) as gin,
       pg_size_pretty(pg_table_size('pglist_fts_rum_idx4')) as rum,
       pg_size_pretty(pg_table_size('pglist_fts_ts_rum_idx4')) as rum_ts,
       pg_size_pretty(pg_table_size('pglist_fts_ts_order_rum_idx4')) as rum_ts_order
from pglist;
fts | gin | rum | rum_ts | rum_ts_order
---------+--------+--------+---------+--------------
1302 MB | 535 MB | 980 MB | 1531 MB | 1921 MB
(1 row)
The RUM access method is implemented as an extension using the
extensible access methods interface in 9.6. Generic WAL in its current state doesn't recognize the difference between pages whose content has merely shifted, so RUM can generate a lot of WAL traffic. For example, for the RUM index with sent time, ordered by sent time, CREATE INDEX generates about 186 GB of WAL (compare that to the 1921 MB index size), since postings are no longer arranged by their location on the page and almost every posting generates a new WAL record.
I calculated WAL traffic like this:
SELECT pg_current_xlog_location();
11/B54E6098
CREATE INDEX ....
SELECT pg_current_xlog_location();
15/6F2F8F18
SELECT pg_size_pretty(pg_xlog_location_diff('15/6F2F8F18','11/B54E6098'));
pg_size_pretty
----------------
15 GB
The number of postings can be obtained with this query:
select sum(length(fts)) from pglist;
sum
----------
84472729
(1 row)
So, if we assume an average page fullness of 75% and that each posting generates traffic of about half a page (4096 bytes), then we can estimate the WAL traffic as
84472729*4096/1024/1024/1024*0.75 ~ 242 GB, which is close to 186 GB :)
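The same estimate can be computed as a single query (just the arithmetic above folded into SQL; the 4096 bytes per posting and 75% fullness are the assumptions stated above):
select pg_size_pretty((sum(length(fts)) * 4096 * 0.75)::bigint) as estimated_wal
from pglist;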
Using some optimizations we were able to reduce the traffic to 15 GB, which is still high compared to the index size (1.9 GB). Finally, we got the idea of writing WAL records only after the index is built, and that brought WAL traffic down to 1.5 GB, which is even smaller than the index.
Notice that this is valid only for CREATE INDEX; insertions into a RUM index will still generate big traffic unless generic WAL becomes smarter.
I think the same optimization could be applied to GiST, GIN and SP-GiST. Saving WAL traffic on my SSD-based system had little impact on index creation time, but I suspect it should be noticeable on real hard drives.