I have an inventory of products stored in Postgres. I need to be able to take a CSV file and get a list of changes: the rows in the CSV file that differ from what is in the database. The CSV file has about 1.6 million rows.
The naive approach is to take each row, retrieve the corresponding product from the database by its key field, compare them, emit any changes (including updating the database), and move on to the next row. However, that many round trips makes the whole process take a long time (upwards of two minutes). I tried caching the inventory locally in an off-heap map (using MapDB), which improved performance a lot, since I only needed to hit the database to write changed data, but I couldn't figure out how to make that scale: there will be many inventories for different customers. Perhaps some kind of sharding approach would be needed, but then I'd have to deal with nodes going on- and offline. Maybe Akka Cluster could help there too.
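For reference, the per-row comparison is essentially the following (schema and field names are placeholders, and the database read is simulated with an in-memory map here; in the real code each lookup is a network round trip):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NaiveDiff {
    public record Product(String sku, int qty) {}

    /** One lookup per CSV row; against a real database each get() is a round trip. */
    public static Map<String, Product> changes(Map<String, Product> db,
                                               Iterable<Product> csvRows) {
        Map<String, Product> changed = new LinkedHashMap<>();
        for (Product row : csvRows) {
            Product existing = db.get(row.sku());   // SELECT ... WHERE sku = ?
            if (existing == null || !existing.equals(row)) {
                changed.put(row.sku(), row);        // emit change / UPDATE
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Product> db = Map.of(
            "A", new Product("A", 1),
            "B", new Product("B", 2));
        var diff = changes(db, List.of(new Product("A", 1), new Product("B", 5)));
        System.out.println(diff.keySet()); // prints [B] -- only B changed
    }
}
```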
Are there some good approaches that I'm overlooking?
Since the round trips seem to be the issue, you could:

- Batch the lookups: fetch products in chunks of, say, 1,000 keys with a single `WHERE key = ANY(...)` query instead of one `SELECT` per row. That turns 1.6 million round trips into about 1,600.
- Better still, avoid per-row reads altogether: bulk-load the CSV into a temporary staging table with `COPY`, then compute the diff in one set-based query that joins the staging table against the inventory table.
- Write the changes back in bulk too, with a single `UPDATE ... FROM staging` statement rather than one `UPDATE` per changed row.
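One way to cut the round trips is to batch the lookups so each query fetches many rows at once. A rough sketch (table and column names are invented; the JDBC part is commented out since it needs a live connection):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedLookup {
    /** Split the CSV keys into fixed-size chunks, so each chunk becomes one query. */
    public static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("A1", "A2", "A3", "A4", "A5");
        // 1.6M keys in chunks of 1,000 => ~1,600 queries instead of 1.6M
        for (List<String> chunk : partition(keys, 2)) {
            // With a real connection, each chunk is a single round trip:
            // PreparedStatement ps = conn.prepareStatement(
            //     "SELECT sku, qty, price FROM products WHERE sku = ANY(?)");
            // ps.setArray(1, conn.createArrayOf("text", chunk.toArray()));
            System.out.println(chunk);
        }
    }
}
```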
Some more thoughts:

- Temporary tables are session-local, so concurrent imports for different customers won't interfere with each other, and nothing needs to be kept in sync across nodes.
- If the comparison covers many columns, storing a hash of each row's comparable fields makes change detection a single-column comparison.
- Once the diff runs inside Postgres, the off-heap cache (and any sharding or Akka Cluster machinery) may become unnecessary: the database is already a shared, persistent store built for exactly this kind of set operation.
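The staging-table variant avoids per-row reads entirely: stream the CSV into a temp table with `COPY`, then diff and update with set-based SQL. A sketch with assumed table and column names (your schema isn't shown); pgjdbc's `CopyManager` would drive the `COPY` from Java, as indicated in the comments:

```java
public class StagingDiff {
    // Session-local staging table with the same shape as the inventory table.
    public static final String CREATE_STAGING =
        "CREATE TEMP TABLE staging (LIKE products INCLUDING ALL) ON COMMIT DROP";

    // Rows whose key exists in both tables but whose payload differs.
    // IS DISTINCT FROM treats NULLs sanely, unlike <>.
    public static final String DIFF_QUERY =
        "SELECT s.sku, s.qty, s.price " +
        "FROM staging s JOIN products p USING (sku) " +
        "WHERE (s.qty, s.price) IS DISTINCT FROM (p.qty, p.price)";

    // Apply all changes in one statement instead of one UPDATE per changed row.
    public static final String BULK_UPDATE =
        "UPDATE products p SET qty = s.qty, price = s.price " +
        "FROM staging s " +
        "WHERE p.sku = s.sku " +
        "AND (s.qty, s.price) IS DISTINCT FROM (p.qty, p.price)";

    public static void main(String[] args) {
        // With a live connection, the CSV is streamed in without per-row statements:
        // CopyManager cm = conn.unwrap(org.postgresql.PGConnection.class).getCopyAPI();
        // cm.copyIn("COPY staging FROM STDIN WITH (FORMAT csv, HEADER)",
        //           new java.io.FileReader("inventory.csv"));
        System.out.println(DIFF_QUERY);
    }
}
```

`COPY` loads millions of rows in seconds, and the diff and update then run as two statements regardless of row count.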