I'm trying to investigate why the performance of this query is so non-deterministic. It can take anywhere from 1 second to over 60 seconds. The nature of the query is to select a "time window", and get all rows from within that time window.
Here's the query in question, running on a table of approximately 1 billion rows:
SELECT CAST(extract(EPOCH from ts)*1000000 as bigint) as ts
     , ticks
     , quantity
     , side
FROM   order_book
WHERE  ts >= TO_TIMESTAMP(1618882633073383/1000000.0)
AND    ts <  TO_TIMESTAMP(1618969033073383/1000000.0)
AND    zx_prod_id = 0
ORDER  BY ts ASC, del desc;
This is how the table is created:

CREATE TABLE public.order_book (
    ts         timestamp with time zone NOT NULL,
    zx_prod_id smallint NOT NULL,
    ticks      integer NOT NULL,
    quantity   integer NOT NULL,
    side       boolean NOT NULL,
    del        boolean NOT NULL
)
The values within TO_TIMESTAMP will keep sliding forward as I walk the whole table. Here is the EXPLAIN ANALYZE output for the same query on two different time windows:
Gather Merge  (cost=105996.20..177498.48 rows=586308 width=18) (actual time=45196.559..45280.769 rows=539265 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=116386 read=42298
  ->  Sort  (cost=104996.11..105240.40 rows=97718 width=18) (actual time=45169.717..45176.775 rows=77038 loops=7)
        Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
        Sort Method: quicksort  Memory: 9327kB
        Worker 0:  Sort Method: quicksort  Memory: 8967kB
        Worker 1:  Sort Method: quicksort  Memory: 9121kB
        Worker 2:  Sort Method: quicksort  Memory: 9098kB
        Worker 3:  Sort Method: quicksort  Memory: 9075kB
        Worker 4:  Sort Method: quicksort  Memory: 9019kB
        Worker 5:  Sort Method: quicksort  Memory: 9031kB
        Buffers: shared hit=116386 read=42298
        ->  Result  (cost=0.57..96897.07 rows=97718 width=18) (actual time=7.475..45131.932 rows=77038 loops=7)
              Buffers: shared hit=116296 read=42298
              ->  Parallel Index Scan using _hyper_16_214_chunk_order_book_ts_idx on _hyper_16_214_chunk  (cost=0.57..95187.01 rows=97718 width=18) (actual time=7.455..45101.670 rows=77038 loops=7)
                    Index Cond: ((ts >= '2021-04-22 01:34:31.357179+00'::timestamp with time zone) AND (ts < '2021-04-22 02:34:31.357179+00'::timestamp with time zone))
                    Filter: (zx_prod_id = 0)
                    Rows Removed by Filter: 465513
                    Buffers: shared hit=116296 read=42298
Planning Time: 1.107 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 9.273 ms, Inlining 0.000 ms, Optimization 2.008 ms, Emission 36.235 ms, Total 47.517 ms
Execution Time: 45335.178 ms
Gather Merge  (cost=105095.94..170457.62 rows=535956 width=18) (actual time=172.723..240.628 rows=546367 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  Buffers: shared hit=158212
  ->  Sort  (cost=104095.84..104319.16 rows=89326 width=18) (actual time=146.702..152.849 rows=78052 loops=7)
        Sort Key: (((date_part('epoch'::text, _hyper_16_214_chunk.ts) * '1000000'::double precision))::bigint), _hyper_16_214_chunk.del DESC
        Sort Method: quicksort  Memory: 11366kB
        Worker 0:  Sort Method: quicksort  Memory: 8664kB
        Worker 1:  Sort Method: quicksort  Memory: 8986kB
        Worker 2:  Sort Method: quicksort  Memory: 9116kB
        Worker 3:  Sort Method: quicksort  Memory: 8858kB
        Worker 4:  Sort Method: quicksort  Memory: 9057kB
        Worker 5:  Sort Method: quicksort  Memory: 6611kB
        Buffers: shared hit=158212
        ->  Result  (cost=0.57..96750.21 rows=89326 width=18) (actual time=6.145..127.591 rows=78052 loops=7)
              Buffers: shared hit=158122
              ->  Parallel Index Scan using _hyper_16_214_chunk_order_book_ts_idx on _hyper_16_214_chunk  (cost=0.57..95187.01 rows=89326 width=18) (actual time=6.124..114.023 rows=78052 loops=7)
                    Index Cond: ((ts >= '2021-04-22 01:34:31.357179+00'::timestamp with time zone) AND (ts < '2021-04-22 02:34:31.357179+00'::timestamp with time zone))
                    Filter: (zx_prod_id = 4)
                    Rows Removed by Filter: 464498
                    Buffers: shared hit=158122
Planning Time: 0.419 ms
JIT:
  Functions: 49
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 10.405 ms, Inlining 0.000 ms, Optimization 2.185 ms, Emission 39.188 ms, Total 51.778 ms
Execution Time: 274.413 ms
I interpreted this output as placing most of the blame on the parallel index scan.
At first, I tried raising work_mem to 1 GB and shared_buffers to 24 GB, thinking that maybe it couldn't fit everything it needed in RAM, but that didn't seem to help.
Next, I tried creating an index on (zx_prod_id, ts), thinking that the filter on the parallel index scan might be taking a while, but that didn't seem to do anything either.
I'm no database expert, and so I've kind of exhausted the limits of my knowledge.
The all-important difference between the two query plans is the added read=xyz bit in multiple places of the slow version:
Slow: Buffers: shared hit=116296 read=42298
Fast: Buffers: shared hit=158122
This tells you that Postgres encountered data (or index) pages that were not cached yet. Repeat the slow query (possibly more than once, until the added read=xyz vanishes), and you'll see comparable performance on equally cached data pages.
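If predictable latency on cold data matters, the stock additional module pg_prewarm can load a relation into the buffer cache ahead of time. A minimal sketch; whether prewarming the whole table is feasible for a billion rows depends on your RAM, and on a TimescaleDB hypertable you may want to target recent chunks instead:

```sql
-- pg_prewarm ships with PostgreSQL as an "additional supplied module".
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load a relation into shared buffers; returns the number of blocks read.
SELECT pg_prewarm('order_book');
```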
A higher work_mem can actually hurt your situation, as work_mem and cache memory (both the operating system cache and Postgres' own shared buffers) compete for available RAM. Setting work_mem higher can lead to data being evicted from the (then smaller) cache sooner, so your "slow query" becomes more likely to occur. Don't forget that Postgres can use many times the amount set with work_mem. The manual:
Note that for a complex query, several sort or hash operations might be running in parallel; each operation will generally be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem.
Your query uses 6 parallel workers, so that's already 6x work_mem as the possible worst case; not every worker will use its maximum allowance.
In case of insufficient work_mem, you see disk mentioned in various contexts of the EXPLAIN output. That's when additional work_mem is in order.
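For illustration, this is what an undersized sort looks like in EXPLAIN ANALYZE output, and how to raise work_mem for the current session only instead of globally (the 256MB here is a made-up example value, not a recommendation):

```sql
-- A sort that spilled to disk shows up in EXPLAIN ANALYZE like:
--   Sort Method: external merge  Disk: 102400kB
-- instead of "Sort Method: quicksort  Memory: ...".

SET work_mem = '256MB';  -- session-local, does not affect other connections
-- ... run the big query ...
RESET work_mem;
```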
A higher shared_buffers can make more sense. Still, don't set it too high, either. The manual:

If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where even larger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount.
If you have plenty of RAM for everything (hard to imagine for a table with 1 billion rows), then all that's missing is that the data pages in question just were not cached yet.
Aside: the second query plan shows Filter: (zx_prod_id = 4), which does not match the displayed query (zx_prod_id = 0), but I think that's insignificant for the problem.
You already tried an index on (zx_prod_id, ts), but it ... "didn't seem to do anything." That index would typically look good for the query, though.
The multicolumn index is around 50 % larger than the index on just (ts): 4 + 8 + 16 = 28 bytes vs. 4 + 8 + 8 = 20 bytes per index tuple, and possibly less compression from "index deduplication". Since RAM seems to be a contested resource, that is a downside.
Rows Removed by Filter: 465513 means the query reads far more rows than it needs to. Depending on data distribution, the multicolumn index may or may not be an improvement. It would certainly be one if your table were physically clustered on that index: then most data pages would contain rows with the same zx_prod_id value, which makes the index much more effective.
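A sketch of physical clustering for a plain Postgres table. Caveats: CLUSTER takes an exclusive lock, rewrites the whole table, and is a one-time operation (newly inserted rows are not kept in order); on a TimescaleDB hypertable like yours, clustering would have to be applied per chunk, so treat this only as an illustration of the concept:

```sql
-- Multicolumn index matching the query's filter and sort needs.
CREATE INDEX order_book_prod_ts_idx ON order_book (zx_prod_id, ts);

-- Rewrite the table in index order, then refresh statistics.
CLUSTER order_book USING order_book_prod_ts_idx;
ANALYZE order_book;
```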
Your query seems to always filter for a single value of zx_prod_id. If that's focused around a few values, or if there are only few distinct values to begin with, a couple of partial indexes can make sense. Like, if only zx_prod_id IN (1, 4) are targeted:
CREATE INDEX order_book_ts_zx1_idx ON order_book (ts) WHERE zx_prod_id = 1;
CREATE INDEX order_book_ts_zx4_idx ON order_book (ts) WHERE zx_prod_id = 4;
Then the query can work with a (much) smaller index, and does not have to filter half of the rows.
ts obviously is type timestamptz. Still, you handle timestamps as bigint numbers representing microseconds. That split approach has downsides and pitfalls. It's typically best to work with timestamps across the whole food chain. Executing CAST(extract(EPOCH from ts)*1000000 as bigint) for every row is not free, and it adds up for half a million rows.
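A sketch of the same query returning and sorting the raw timestamptz instead, leaving the epoch conversion to the client. Note that the original ORDER BY ts sorts by the output column, i.e. the computed bigint expression; sorting by the plain column matches the index order on ts directly:

```sql
SELECT ts, ticks, quantity, side
FROM   order_book
WHERE  ts >= TO_TIMESTAMP(1618882633073383 / 1000000.0)
AND    ts <  TO_TIMESTAMP(1618969033073383 / 1000000.0)
AND    zx_prod_id = 0
ORDER  BY ts, del DESC;
```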