Database Administration
postgresql postgresql-performance spatial gist-index
Updated Wed, 21 Sep 2022 08:26:30 GMT

Optimizing geolocation query in Postgres with earth_box and earth_distance


I'm trying to optimize my geolocation query for a table of addresses of ~862k rows within a search radius using earth_box. My first initial query is not too terrible:

explain analyze SELECT
    id
FROM
   location
WHERE
    earth_box(ll_to_earth(40.65130101, -73.83367812), 25000) @> ll_to_earth(latitude, longitude)
  ;
                                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on location  (cost=46.97..3385.08 rows=863 width=16) (actual time=11.430..49.406 rows=29407 loops=1)
   Recheck Cond: ('(1322317.9173587903, -4672693.781456112, 4130081.708818594),(1372317.8853516562, -4622693.8134632455, 4180081.6768114604)'::cube @> (ll_to_earth(latitude, longitude))::cube)
   Heap Blocks: exact=22479
   ->  Bitmap Index Scan on location_gist_lat_lon_idx  (cost=0.00..46.76 rows=863 width=0) (actual time=7.942..7.943 rows=29407 loops=1)
         Index Cond: ((ll_to_earth(latitude, longitude))::cube <@ '(1322317.9173587903, -4672693.781456112, 4130081.708818594),(1372317.8853516562, -4622693.8134632455, 4180081.6768114604)'::cube)
 Planning Time: 0.711 ms
 Execution Time:  51.018 ms

However, when I want to refine my search to get more accurate results with earth_distance the execution time dramatically increases by 10x:

=> explain analyze SELECT
    id
FROM
   location
WHERE
    earth_box(ll_to_earth(40.65130101, -73.83367812), 25000) @> ll_to_earth(latitude, longitude)
  AND earth_distance(ll_to_earth(40.65130101, -73.83367812),
                     ll_to_earth(latitude, longitude)) < 25000 ;
                                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on location  (cost=46.83..3820.75 rows=288 width=16) (actual time=11.264..537.018 rows=24898 loops=1)
   Recheck Cond: ('(1322317.9173587903, -4672693.781456112, 4130081.708818594),(1372317.8853516562, -4622693.8134632455, 4180081.6768114604)'::cube @> (ll_to_earth(latitude, longitude))::cube)
   Filter: (sec_to_gc(cube_distance('(1347317.9013552233, -4647693.797459679, 4155081.692815027)'::cube, (ll_to_earth(latitude, longitude))::cube)) < '25000'::double precision)
   Rows Removed by Filter: 4509
   Heap Blocks: exact=22479
   ->  Bitmap Index Scan on location_gist_lat_lon_idx  (cost=0.00..46.76 rows=863 width=0) (actual time=7.358..7.358 rows=29407 loops=1)
         Index Cond: ((ll_to_earth(latitude, longitude))::cube <@ '(1322317.9173587903, -4672693.781456112, 4130081.708818594),(1372317.8853516562, -4622693.8134632455, 4180081.6768114604)'::cube)
 Planning Time: 0.901 ms
 Execution Time: 539.113 ms
(9 rows)

My table schema (excluded some columns):

                               Table "provider.location"
      Column      |           Type           | Collation | Nullable |      Default
------------------+--------------------------+-----------+----------+-------------------
 id               | uuid                     |           | not null |
 created_at       | timestamp with time zone |           | not null | CURRENT_TIMESTAMP
 updated_at       | timestamp with time zone |           | not null | CURRENT_TIMESTAMP
 can_delete       | boolean                  |           |          |
 name             | text                     |           |          |
 address          | text                     |           |          |
 address_line_1   | text                     |           |          |
 address_line_2   | text                     |           |          |
 city             | text                     |           |          |
 state            | text                     |           |          |
 street           | text                     |           |          |
 zip              | text                     |           |          |
 confidence       | int                      |           |          |
 google_maps_link | text                     |           |          |
 is_coe           | boolean                  |           |          |
 latitude         | double precision         |           |          |
 longitude        | double precision         |           |          |

I also have a GiST index created for the lat/lon:

  "location_gist_lat_lon_idx" gist (ll_to_earth(latitude, longitude))

I'm wondering what is it that I'm missing that's making the additional query execution time increase by 10x?

My postgres 13.5 instance has the following specs:

CPU: 4vCPU
Memory: 16GB
SDD: 250GB

This question is related to How can I speed-up my query on geo-location processes, however after following the suggested answer it didn't seem to improve my performance.




Solution

I don't know what you are missing, other than some perspective.

earth_distance is kind of slow. So computing it 29407 times is slow.

I don't know what from that linked post you tried to do. The suggestion was to create an index, which you already had. References are great, but you still need to tell us what you did.

earth_distance (the module) is mostly for demonstration purposes. Any serious work should be one with postgis. In my hands, postgis is about 5 times faster at the recheck/filter task than earth_distance is. If you were willing to do some C programming, you could probably make earth_distance quite a bit faster, but I would say that that is the overkill when postgis already exists.

Another thing you could do is get it to run in parallel. The only way to do this (that I could find) is to do alter table location set (parallel_workers = 4);. I don't know why the planner doesn't pick parallel on its own without doing that. You might also need to increase max_parallel_workers_per_gather as well.





Comments (3)

  • +0 – Thanks @jjanes for the reply. The only optimization I did was create a GiST index on thelocation table. Other than that I can't seem to figure out a way to get the query to be faster. I thought the postgis extension would be overkill because i'm only really just querying for locations within a given search radius. — Jul 29, 2022 at 04:30  
  • +0 – It should be much faster with the GIST index than without it. But you want it faster yet, which is not something addressed in the linked post. — Jul 30, 2022 at 00:28  
  • +0 – Using Postgis saved me a huge bunch of time. Thanks! — Aug 26, 2022 at 00:39  


Linked Articles

Local articles referenced by this article: