I have CSV import data that comes from multiple vendors with 50+ different columns, and each vendor has slightly different naming.
Instead of spending hours mapping all of the column names to a master table with tons of columns, I want to store the original CSV row as a JSON column, such as `csv_data`. I also want to store a unique identifier that is shared by all of the vendors, like `email`.
So the columns in my `contacts` table would be:
- `vendor_id`
- `email`
- `csv_data`
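In DDL form it would be roughly this (just a sketch; the exact types are only what I have in mind so far):

```sql
CREATE TABLE contacts (
  vendor_id int  NOT NULL   -- which vendor the row came from
, email     text NOT NULL   -- shared identifier across all vendors
, csv_data  json NOT NULL   -- original CSV row, keyed by the vendor's own column names
);
```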
Then I was hoping I could create a view or materialized view that starts with a small subset of columns I know I will need (and will spend time mapping). If I ever need more columns later, I could modify the view with new mappings.
I was hoping I could do some kind of case switch, like this:
```sql
SELECT
  CASE
    WHEN vendor_id = 100 THEN
      csv_data->>'fullname'  AS full_name,
      csv_data->>'age'       AS age,
      ... etc ...
    WHEN vendor_id = 101 THEN
      csv_data->>'FULL NAME' AS full_name,
      csv_data->>'AGE'       AS age,
      ... etc ...
    WHEN vendor_id = 102 THEN
      csv_data->>'full name' AS full_name,
      csv_data->>'cust_age'  AS age,
      ... etc ...
  END
FROM contacts
```
Unfortunately this is a syntax error... is there a saner way to do this? I couldn't come up with good ideas. I really don't want to create some enormous mapping, as so many of the column names are messy and will probably never be used. I want to store the data in case I need it, but for my 'results' query I want a nice clean mapping that I only add to when I need it.
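The closest working form I could get is one CASE expression per output column, which runs but gets unwieldy fast (just a sketch of what I mean):

```sql
SELECT vendor_id, email
     , CASE vendor_id
         WHEN 100 THEN csv_data->>'fullname'
         WHEN 101 THEN csv_data->>'FULL NAME'
         WHEN 102 THEN csv_data->>'full name'
       END AS full_name
     , CASE vendor_id
         WHEN 100 THEN csv_data->>'age'
         WHEN 101 THEN csv_data->>'AGE'
         WHEN 102 THEN csv_data->>'cust_age'
       END AS age
FROM   contacts;
```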
I'm open to better ways of doing this overall as well. I was trying to avoid the nuclear route of one massive table, but I guess I would go that way if it's really advised as the best approach.
If you go that route, storing the original JSON data (I suggest `jsonb` rather than `json` - cleaner and faster to extract), then consider a translation table with key names for every relevant target column per vendor - conveniently as another single `jsonb` value. Like:
```sql
CREATE TABLE key2col (
  vendor_id int PRIMARY KEY
, keys      jsonb NOT NULL
);

INSERT INTO key2col VALUES
  (100, '{"full_name":"fullname", "age":"age"}')
, (101, '{"full_name":"FULL NAME", "age":"AGE"}')
-- more ...
;
```
Then your query can simply be:
```sql
SELECT c.vendor_id, c.email
     , c.csv_data->>(k.keys->>'full_name') AS full_name
     , c.csv_data->>(k.keys->>'age')       AS age
FROM   contacts c
LEFT   JOIN key2col k USING (vendor_id);
```
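And if you want the view from the question on top of that, you can simply wrap this query - a sketch reusing the tables above (the view name is just an example); extend the column list whenever you map a new field:

```sql
CREATE VIEW contacts_mapped AS   -- name is just an example
SELECT c.vendor_id, c.email
     , c.csv_data->>(k.keys->>'full_name') AS full_name
     , c.csv_data->>(k.keys->>'age')       AS age
FROM   contacts c
LEFT   JOIN key2col k USING (vendor_id);
```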
db<>fiddle here
This even works for a varying set of key --> column translations per vendor. Every field that has no JSON key for the given vendor in `key2col.keys` will be NULL. Same if the vendor did not actually provide the declared key in `contacts.csv_data`.
The `LEFT JOIN` in the query retains rows from `contacts` without any registered columns. You may want a plain `JOIN` instead ...
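To add a mapping later (say, for a hypothetical `city` field), you only touch the translation table and the view - a sketch, assuming the names used above:

```sql
-- register the vendor-specific source key; jsonb || jsonb merges keys,
-- overwriting any existing key of the same name
UPDATE key2col
SET    keys = keys || '{"city":"CITY"}'
WHERE  vendor_id = 101;

-- then add the matching expression to the view / query:
-- , c.csv_data->>(k.keys->>'city') AS city
```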