I receive a csv daily from an external company capturing feedback for us.
Unhelpfully, the csv headings vary: any combination of headers could appear. For example:
day1: heading1, heading2, heading3, heading4, heading5
day2: heading1, heading2, heading4, heading6
day3: heading1, heading2, heading7, heading8
day4: heading1, heading2, heading3, heading4, heading5
We have 18 months of backlog files (no one has done anything with this data until now), and so far I have found 22 different combinations...
I know what all the potential headings will be, so I can create an SQL table containing all of the appropriate columns, but I am struggling to handle the variance on import.
Is there a way to handle this in SSIS? I have looked at bcp and OPENROWSET, but I can't seem to make either work.
As this file is received daily, I want a SQL job that imports it into my table for reporting. I can only create an SSIS package that imports a fixed flat-file layout, but I need it to be dynamic. The worst case would be to build 22+ different SSIS packages and run the specific one matching the headings present, but then I would need a way of automatically reading the structure of the csv to decide which package to use.
I'm lost, so does anyone have any direction?
SQL 2016 (13.0.5622.0)
Consider a simplified scenario with two different files, each containing different subsets of data.
The first file, abc.csv:

    a,b,c
    1,a,a
    2,b,b
    3,c,c
    4,d,d

And the second, abd.csv:

    a,b,d
    1,a,1
    4,d,4
    5,e,5
    6,f,6
Now, we can use PowerShell to import each file, select the columns into a known, consistent number and order, then export it back to a csv with a predictable layout:
    Import-Csv abc.csv | Select a,b,c,d,e,f,g | Export-Csv abc_cleansed.csv -NoTypeInformation
    Import-Csv abd.csv | Select a,b,c,d,e,f,g | Export-Csv abd_cleansed.csv -NoTypeInformation
This will produce two new files that have a common, predictable file format.
"a","b","c","d","e","f","g" "1","a","a",,,, "2","b","b",,,, "3","c","c",,,, "4","d","d",,,,
"a","b","c","d","e","f","g" "1","a",,"1",,, "4","d",,"4",,, "5","e",,"5",,, "6","f",,"6",,,
Note: I omitted double-quotes in my initial csv files, and PowerShell has helpfully added them. You could remove them, but IMHO this is a beneficial feature for data quality.
Now you can build a data pipeline that takes an input file, cleanses it into the common format, then picks up the cleansed files and imports them. Depending on your process, you could do this all within a single SSIS package, or build separate data-cleansing and data-importing processes.
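If it helps to see what the cleansing step is doing under the hood (header-driven reshaping to a fixed column superset), here is a minimal sketch of the same idea in Python, which you could also run as a SQL Agent job step before the import. The `ALL_COLUMNS` list and the sample data are placeholders matching the simplified example above, not your real 22-combination headings:

```python
import csv
import io

# Placeholder for the full superset of headings your SQL table expects.
ALL_COLUMNS = ["a", "b", "c", "d", "e", "f", "g"]

def normalize_csv(src_lines):
    """Read csv text whose header is any subset of ALL_COLUMNS and
    return csv text reshaped to the full, fixed column order, with
    empty values for columns the source file lacks."""
    reader = csv.DictReader(src_lines)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=ALL_COLUMNS, restval="")
    writer.writeheader()
    for row in reader:
        # Missing headings simply come back as empty strings.
        writer.writerow({col: row.get(col, "") for col in ALL_COLUMNS})
    return out.getvalue()

# Example: a "day 2"-style file that only has headings a, b, d.
sample = ["a,b,d", "1,a,1", "4,d,4"]
print(normalize_csv(sample))
```

Whatever language you use, the key point is the same as in the PowerShell one-liner: drive the reshape off the header row, so one script handles all 22+ combinations without any per-layout packages.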
-UseQuotes AsNeeded? It's a game changer — Sep 29, 2020 at 15:53