Software Engineering
caching distributed-system postgres
Updated Mon, 03 Oct 2022 18:54:32 GMT

How can I efficiently diff a CSV file against a database?


I have an inventory of products stored in Postgres. I need to be able to take a CSV file and get a list of changes: the things in the CSV file that are different to what is in the database. The CSV file has about 1.6 million rows.

The naive approach is to simply take each row, retrieve that product from the database using the key field, make the comparison, emit the changes (including updating the database), then move on to the next row. However, that many round trips causes the whole process to take a long time (upwards of two minutes). I've tried locally caching the inventory in an off-heap map (using MapDB), which improved the performance a lot, since I only needed to hit the database to write changed data, but I didn't figure out a way to make that scale. There will be many inventories for different customers. Perhaps some kind of sharding approach would be needed, but then I have to deal with nodes going on- and offline. Maybe Akka Cluster could help here too.

Are there some good approaches that I'm overlooking?




Solution

Since the round trips seem to be the issue, you could:

  • either opt for a local solution, with the scaling issue you mentioned. (You could still try to split the task across several local nodes, each responsible for a subrange of the index space.)
  • or opt for the DB solution: bulk-upload your CSV into a temporary table and let the DB server work very efficiently on (indexed) tables. The benefit of this approach is that you'd get the scalability of the DB itself. You could fine-tune the approach for any distribution scheme that would already be in place, if it's already a distributed database. (See the sketch below.)
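
For illustration, here's a minimal sketch of that second option using the pgJDBC COPY API: stream the whole CSV into a temporary table in one round trip, then let Postgres compute the diff with a single set-based query. The schema (an inventory table with sku, price, quantity columns), the connection details, and the CSV layout are assumptions standing in for the real ones.

    // Minimal sketch: bulk-load the CSV into a temp table with COPY, then diff in SQL.
    // Table/column names (inventory, sku, price, quantity) and the URL are assumptions.
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    import java.io.FileReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CsvDiff {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/inventory", "user", "secret")) {

                try (Statement st = conn.createStatement()) {
                    // Session-local table; dropped automatically when the connection closes.
                    st.execute("CREATE TEMP TABLE csv_upload "
                             + "(sku text PRIMARY KEY, price numeric, quantity int)");
                }

                // One streaming COPY instead of ~1.6 million individual round trips.
                CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
                try (Reader csv = new FileReader(args[0])) {
                    copy.copyIn("COPY csv_upload FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
                }

                // New rows, plus existing rows whose relevant fields differ.
                String diffSql =
                    "SELECT c.sku, c.price, c.quantity "
                  + "FROM csv_upload c "
                  + "LEFT JOIN inventory i ON i.sku = c.sku "
                  + "WHERE i.sku IS NULL "
                  + "   OR (i.price, i.quantity) IS DISTINCT FROM (c.price, c.quantity)";
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(diffSql)) {
                    while (rs.next()) {
                        System.out.println(rs.getString("sku") + " is new or changed");
                    }
                }
            }
        }
    }

With an index on the key column, the join over 1.6 million rows runs entirely inside the database, so the only network traffic is the initial COPY and the result set of changed rows.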

Some more thoughts:

  • if you have many columns/fields to compare, you may consider adding a hash code to each row, in the CSV as well as in the DB (updated at every row change). The hash code would be calculated from the fields that are relevant for the comparison. Finding the diff is then reduced to finding the new rows and the existing rows whose hash differs. (See the sketch after this list.)
  • ultimately, it would be more efficient to handle the problem at the source, i.e., intercepting the events that would cause the CSV to change, or using some kind of timestamp of the last change. But OK, this is not always possible.
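
And a sketch of the hash-code idea in the same vein, assuming Postgres 12+ (for generated columns) and the same hypothetical schema and csv_upload table as above: the DB keeps an md5 over the comparison-relevant fields up to date on every row change, so the diff query only compares keys and hashes.

    // Minimal sketch of the hash-column variant (Postgres 12+ for generated columns).
    // Assumes the same hypothetical schema and a csv_upload table loaded as above.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HashedDiff {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/inventory", "user", "secret");
                 Statement st = conn.createStatement()) {

                // md5 over the comparison-relevant fields, recomputed by Postgres
                // whenever the row changes. (If these fields can be NULL, wrap them
                // in coalesce so the hash stays non-null.)
                st.execute("ALTER TABLE inventory ADD COLUMN IF NOT EXISTS row_hash text "
                         + "GENERATED ALWAYS AS (md5(price::text || '|' || quantity::text)) STORED");

                // The diff reduces to: new keys, plus keys whose hash differs.
                String diffSql =
                    "SELECT c.sku FROM csv_upload c "
                  + "LEFT JOIN inventory i ON i.sku = c.sku "
                  + "WHERE i.sku IS NULL "
                  + "   OR i.row_hash <> md5(c.price::text || '|' || c.quantity::text)";
                try (ResultSet rs = st.executeQuery(diffSql)) {
                    while (rs.next()) {
                        System.out.println(rs.getString("sku") + " is new or changed");
                    }
                }
            }
        }
    }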




Comments (3)

  • +1 – Unfortunately it's not possible at this point to get those events; they happen in someone else's system. Using a combination of answers, pulling all the digests from the database and comparing those locally might be the way to go. — Aug 31, 2022 at 10:43
  • +4 – I'd personally let the DB compute the hashes during the loading of the CSV into a temporary table. Seems faster and makes the process simpler, since in that way it remains an internal detail inside the DB and doesn't require any kind of external change. — Aug 31, 2022 at 13:25  
  • +0 – @GACy20 I agree for the scenario of the temporary DB table, since it moreover allows sharing the same code for calculating the hash. In my answer I was more neutral, since this approach can also be used to accelerate the local scenario. Thanks for making it more explicit. — Sep 01, 2022 at 06:14