Database Administration
postgresql database-design schema json postgresql-9.6
Updated Sun, 26 Jun 2022 15:55:37 GMT

Best way to map different JSON keys to same target columns


I have CSV import data that comes from multiple vendors with 50+ different columns, and each vendor has slightly different naming.

Instead of spending hours mapping all of the column names to a master table with tons of columns, I want to store the original CSV row in a JSON column, such as csv_data.

I also want to store a unique identifier that is shared by all of the vendors, like email.

So the columns in my `contacts` table would be:

vendor_id
email
csv_data
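
For concreteness, a minimal sketch of that table (the types and the primary key are my assumptions):

```sql
CREATE TABLE contacts (
  vendor_id int   NOT NULL
, email     text  NOT NULL
, csv_data  jsonb NOT NULL  -- the original CSV row, keyed by the vendor's own column names
, PRIMARY KEY (vendor_id, email)
);
```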

Then I was hoping I could create a view or materialized view which I could start with a small subset of columns that I know I will need (and will spend time mapping). Then, if I ever need the new columns later I could modify the view with new mappings.

I was hoping I could do some kind of case switch, like this:

SELECT
  CASE 
    WHEN vendor_id = 100 THEN
        csv_data->>'fullname' AS full_name,
        csv_data->>'age' AS age,
        ... etc ...
    WHEN vendor_id = 101 THEN
        csv_data->>'FULL NAME' AS full_name,
        csv_data->>'AGE' AS age,
        ... etc ...
    WHEN vendor_id = 102 THEN
        csv_data->>'full name' AS full_name,
        csv_data->>'cust_age' AS age,
        ... etc ...
  END
FROM contacts

Unfortunately this is a syntax error... is there a saner way to do this? I couldn't come up with good ideas. I really don't want to create some enormous mapping, since so many of the column names are messy and will probably never be used. I want to store the data in case I need it, but for my 'results' query I want a nice clean mapping that I only add to when I need it.

I'm open to better ways of doing this overall as well. I was trying to avoid the nuclear route of the massive table, but I guess I would go that way if it's really advised as the best option.




Solution

If you go the route of storing the original JSON data (I suggest jsonb rather than json: cleaner and faster to extract), then consider a translation table holding the key name for every relevant target column per vendor, conveniently as another single jsonb value. Like:

CREATE TABLE key2col (
  vendor_id int PRIMARY KEY
, keys jsonb NOT NULL
);
INSERT INTO key2col VALUES
  (100, '{"full_name":"fullname", "age":"age"}')
, (101, '{"full_name":"FULL NAME", "age":"AGE"}')
-- more ...
;

Then your query can simply be:

SELECT c.vendor_id, c.email
     , c.csv_data->>(k.keys->>'full_name') AS full_name
     , c.csv_data->>(k.keys->>'age') AS age
FROM   contacts c
LEFT   JOIN key2col k USING (vendor_id);
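
To illustrate, a couple of sample rows (data and email values invented for the demo; column names in csv_data follow each vendor's convention from key2col above). Running the query against them resolves each vendor's keys to the same full_name and age output columns:

```sql
INSERT INTO contacts (vendor_id, email, csv_data) VALUES
  (100, 'a@example.com', '{"fullname":"Alice A", "age":"34"}')
, (101, 'b@example.com', '{"FULL NAME":"Bob B", "AGE":"41"}');
```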


This even accommodates a different set of key --> column translations per vendor. Every output column that has no JSON key for the given vendor in key2col.keys comes out NULL. The same happens if the vendor did not actually provide the declared key in contacts.csv_data.
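
For example, a vendor whose mapping simply omits a column (vendor 102 here, with keys invented for the demo):

```sql
-- No "age" key registered for this vendor:
INSERT INTO key2col VALUES (102, '{"full_name":"full name"}');
-- Rows from vendor 102 now resolve full_name normally, while age is NULL.
```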

The LEFT JOIN in the query retains rows from contacts whose vendor has no registered mapping at all. You may want a plain [INNER] JOIN instead, to drop those rows.
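
Since the question asks for a view to extend over time, the query can be wrapped directly (the view name is my choice):

```sql
CREATE VIEW contacts_mapped AS
SELECT c.vendor_id, c.email
     , c.csv_data->>(k.keys->>'full_name') AS full_name
     , c.csv_data->>(k.keys->>'age')       AS age
FROM   contacts c
LEFT   JOIN key2col k USING (vendor_id);
```

To map another column later, add its key to key2col.keys for each vendor and re-create the view. Note that CREATE OR REPLACE VIEW can only append new columns at the end of the column list; anything more invasive requires DROP VIEW first.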





Comments (5)

  • +0 – Truly amazing, thank you — Mar 14, 2022 at 01:02  
  • +0 – That makes total sense — Mar 14, 2022 at 01:02  
  • +0 – That is so amazingly clever, I love it. How would I handle the case where vendor 101 doesn't provide age? — Mar 14, 2022 at 01:04  
  • +0 – Well, for starters, you wouldn't use "age" to begin with, that being a widespread antipattern. Use "birthday" instead, which doesn't bitrot as quickly. That aside, if a vendor does not provide a key, the result is NULL. Seems ok? — Mar 14, 2022 at 01:07  
  • +1 – Ha, then I'll tolerate age for the example! :) — Mar 14, 2022 at 01:19  

