views:

35

answers:

3

The company I work for has a lot of systems where the source data originates as a flat file or Excel workbook. These files are then imported using DTS into a SQL Server database.

Quite often these files contain various control characters (for instance I've just spent an hour finding that some records have a \0 in them which apparently SAP like to pad things with) which ideally we would like to strip out before they get into our database.

I've not used DTS/SSIS that heavily, but I thought that there must be something already in them, or at least a sanitisation best practice, that I should be using?

Any ideas on how to do this in a repeatable fashion across many imports would be appreciated.

A: 

If certain characters can always simply be replaced/removed, then we typically handle this pre-import with a custom C program. This obviously doesn't work for binary files or anything with variable length column length indicator columns or EBCDIC, etc, but when it can be used, it can be applied irrespective of import tool.

Cade Roux
Thanks for your answer. Unfortunately we let end users upload files, so running an extra process isn't always possible.
starskythehutch
+1  A: 

I can only speak coming from my experiences with SSIS, so I apologise if this is only marginally useful.

In terms of general sanitation, I haven't come across any sort of purpose-build component for this job. I assume that's due to the notion that different scenarios can have widely different definitions of "bad data", but arguably things like stripping out non-printable characters might be a common business requirement so this is somewhat surprising.

Naturally you can always fall back on the Derived Column Transformation or, in more complex cases, the Script Component, but these hardly make replication of the logic across packages painless. There are cases where we've used CozyRoc components to create scripts that we could share between packages, but in addition to having to actually have access to those components in the first place, it still involves a certain level of setup every time you need the functionality.

So, in cases like this, I personally feel like the best option is to consolidate your common sanitation routines into a Custom Data Flow Component. We developed one for ourselves to handle some very common data validation logic, and set the component editor up so that you could select checkboxes to enable whichever checks you wanted applied to a given column. Then it just becomes a matter of drag-and-drop and wiring up your data flow. At least for us, it was definitely worth taking the time to develop the component.

I don't know what DTS offers in this regard (I think it's fairly limited, or at least not friendly, but I could be wrong), so this is likely not useful for your legacy packages. I feel like Cade's solution would probably be most stress-free in that case, assuming it's feasible for your input data.

As far as general SSIS insight goes, I've found Jamie Thomson's blogs (previously here) very helpful, so you might want to check to see if he's discussed this particular topic before.

In any case, hopefully this was at least a little helpful, and good luck coming up with a solution.

On a side note: Kudos for wanting to sanitize your data before sticking it in your database. We get data from a particular vendor who doesn't ever clean their data, then tries to shove it in XML to send it to us via a web service. Needless to say, the XML parser isn't too fond of non-printable characters.

Tim Stone
Thanks for the answer. I was afraid that I wasn't alone in not being able to find a component to fitted the bill for either DTS and SSIS. I've not developed a custom data flow component before but at least I'll gain some knowledge doing that! I've up voted your answer.
starskythehutch
A: 

Depending on your needs, you could also do the following:

  1. Import the data (through SSIS / DTS) into a 'staging' table in your DB
  2. Have a 'sanitize' table in your DB, containing (per row) any string you wish to strip (or replace with something else). This table could have additional columns for different scenario's, of course
  3. Use cursors to step through the tables and sanitize.
Tobiasopdenbrouw