Jsquery - now with optimizer and hinting. EDB, more benchmarks !: obartunov

obartunov

Jsquery - now with optimizer and hinting. EDB, more benchmarks !

Sep 28, 2014 22:57

EDB recently blogged new results from benchmarking PostgreSQL 9.4 and Mongodb

The newest round of performance comparisons of PostgreSQL and MongoDB produced a near repeat of the results from the first tests that proved PostgreSQL can outperform MongoDB. The advances Postgres has made with JSON and JSONB have transformed Postgres’ ability to support a document database.

This blog motivates me to write this post to point EDB for another set of benchmarks with more operators included provided by jsquery.

After PGCon-2014, where we presented first version of jsquery, we made several enhancements worth to mention (see my slides from Japan (PDF).

1). We added simple built-in jsquery optimizer, which recognizes non-selective part of a query and push it to recheck, so recheck works like a FILTER.
2) If you don't like, how optimizer works, you may use HINTING (well, jsquery is an extension after all).

We understand, that this is just a temporal solution for impatient people wante to use jsonb in 9.4, which, honestly, has rather primitive support. Yes, we just didn't have time to do all the best, we even missed several useful functions we did for nested hstore. Hope, we'll have contrib/jsonbx soon. Jsquery was our experiment to play with indexes and the set of operations was chosen especially from this point of view. We are working on better approach, where jsquery will be implemented on sql-level (see this post (in russian) and eventually, after someone implements statistics for jsonb, optimizer wil do its work !

More details are below.

1). Optimizer. Jsquery is opaque to optimizer, so original version had very distressed problem (we demonstrated this at PGCon-2014:

select count(*) from jr where jr @@ ' similar_product_ids && ["B000089778"]
AND product_sales_rank( $ > 10000 AND $ < 20000)';

runs 129.309 ms, while

select count(*) from jr where jr @@ ' similar_product_ids && ["B000089778"]';

took only 0.394 ms !

product_sales_rank( $ > 10000 AND $ < 20000) is non-selective and better not to use index for this part of query and this is exactly how MongoDB does:

db.reviews.find( { $and :[ {similar_product_ids: { $in:["B000089778"]}}, {product_sales_rank:{$gt:10000, $lt:20000}}] } )
.explain()
{
"n" : 45,
….................
"millis" : 7,
"indexBounds" : {
"similar_product_ids" : [
[
"B000089778",
"B000089778"
]
]
},
}

Notice, that if we rewrite our query to

select count(*) from jr where jr @@ ' similar_product_ids && ["B000089778"]'
and (jr->>'product_sales_rank')::int>10000 and (jr->>'product_sales_rank')::int<20000;
(push non-selective part of query up to sql-level, so optimizer could do something), the query runs 0.505 ms, which means, that postgres potentially (again) is faster MongoDB !

The plan of this query is:

Aggregate (actual time=0.479..0.479 rows=1 loops=1)
-> Bitmap Heap Scan on jr (actual time=0.079..0.472 rows=45 loops=1)
Recheck Cond: (jr @@ '"similar_product_ids" && ["B000089778"]'::jsquery)
Filter: ((((jr ->> 'product_sales_rank'::text))::integer > 10000) AND
(((jr ->> 'product_sales_rank'::text))::integer < 20000))
Rows Removed by Filter: 140
Heap Blocks: exact=107
-> Bitmap Index Scan on jr_path_value_idx (actual time=0.041..0.041 rows=185 loops=1)
Index Cond: (jr @@ '"similar_product_ids" && ["B000089778"]'::jsquery)
Execution time: 0.506 ms

Now, jsquery is wise enough:

explain (analyze, costs off) select count(*) from jr where jr @@ 'similar_product_ids &&["B000089778"]
AND product_sales_rank( $ > 10000 AND $ < 20000)'
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (actual time=0.422..0.422 rows=1 loops=1)
-> Bitmap Heap Scan on jr (actual time=0.099..0.416 rows=45 loops=1)
Recheck Cond: (jr @@ '("similar_product_ids" && ["B000089778"] AND "product_sales_rank"($ > 10000 AND $ < 20000))'::jsquery)
Rows Removed by Index Recheck: 140
Heap Blocks: exact=107
-> Bitmap Index Scan on jr_path_value_idx (actual time=0.060..0.060 rows=185 loops=1)
Index Cond: (jr @@ '("similar_product_ids" && ["B000089778"] AND "product_sales_rank"($ > 10000 AND $ < 20000))'::jsquery)
Execution time: 0.480 ms

Compare RECHECK in this plan and FILTER in plan of rewritten query above. Built-in optimizer analyzes the query tree and push non-selective parts to recheck. It operated with selectivity classes:

1) Equality (x = c)
2) Range (c1 < x < c2)
3) Inequality (c > c1)
4) Is (x is type)
5) Any (x = *)

Jsquery provides two debug functions gin_debug_query_path_value and gin_debug_query_value_path for each of opclasses to check how jsquery will process the query:

SELECT gin_debug_query_path_value('similar_product_ids && ["B000089778"]
AND product_sales_rank( $ > 10000 AND $ < 20000)');
gin_debug_query_path_value
-------------------------------------------------
similar_product_ids.# = "B000089778" , entry 0 +

Only the first part of query will be processed by index, created using opclass jsonb_path_value.

2). HINTING. Use /*-- noindex */ or /*-- index */ before operator to suppress using index, or force index. To illustrate hinting I'll use debug function gin_debug_query_path_value:

SELECT gin_debug_query_path_value('product_sales_rank > 10000');
gin_debug_query_path_value
---------------------------------------
product_sales_rank > 10000 , entry 0 +

SELECT gin_debug_query_path_value('product_sales_rank /*-- noindex */ > 10000');
gin_debug_query_path_value
----------------------------
NULL +

Jsquery is available from Git repository and it's compatible with 9.4. To install it just follow regular procedure of installation of extensions. Examples of usage are available from sql/ subdirectory.

jsquery, jsonb, pg, pgen