views:

344

answers:

2

I am looking for the best solution for custom file parsing for our enterprise import routines. I basically want to convert each incoming file format into a standard format, and have one routine that imports that standard data into the database. I need to be able to create custom scripts for each client, since it's difficult to get customers to comply with a standard or template format. I have looked at PowerShell and IronPython to do this so far, but I am not sure that's the route I want to go. I have also looked at tools such as Talend, a drag-and-drop style tool, which may or may not give me the flexibility I want. We are a .NET shop and have written custom code for this in the past, but I need something that is quicker to build than coding custom parsing functions each time we get a new file format in.

A: 

Python is wonderful for this kind of thing. That's why we use it. Each new customer transfer is a new adventure, and Python gives us the flexibility to respond quickly.


Edit: All Python scripts that read files are "custom file parsers". Without an actual example of the input, it's not possible to provide a detailed example.

with open("some file", "r") as source:
    for line in source:
        process(line)

That's about all there is to a "custom file parser". If you're parsing .csv or .xml files, then Python has modules for that. If you're parsing fixed-format files, you'd use string slicing operations. If you're parsing other files (X12? JSON? YAML?) you'll need appropriate parsers.
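For the delimited case, the standard-library csv module takes care of delimiters and quoting for you. A minimal sketch, with made-up sample data standing in for a client file:

```python
import csv
import io

# Hypothetical tab-delimited sample standing in for a client file.
sample = io.StringIO("Smith\tJohn\tA\nDoe\tJane\tB\n")

# csv handles the delimiter and any quoting; each row is a list of strings.
rows = list(csv.reader(sample, delimiter="\t"))
print(rows[0])  # ['Smith', 'John', 'A']
```

In production you would pass an open file object instead of the StringIO.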

Tab-Delim.

from collections import namedtuple
RecordLayout = namedtuple('RecordLayout', ['field1', 'field2', 'field3', ...])
def process(aLine):
    # The * unpacks the list of fields into separate namedtuple arguments.
    record = RecordLayout(*aLine.split('\t'))
    ...

Fixed Layout.

from collections import namedtuple
RecordLayout = namedtuple('RecordLayout', ['field1', 'field2', 'field3', ...])
def process(aLine):
    # Slice the fixed-width line into fields, then unpack them with *.
    fields = (aLine[:10], aLine[10:20], aLine[20:30], ...)
    record = RecordLayout(*fields)
    ...
S.Lott
Can you show me a sample script you use to do the parsing? I have not found any great samples which show custom file parsing.
Ryan
@Ryan: All Python programs that read a file are "custom file parsers". You have built-in parsers for XML, CSV, JSON, YAML and some others. You can download parsers for XLS spreadsheets. Since your question just says "custom", it's not sensible to attempt to provide any code. You'll have to provide an example file.
S.Lott
The files we get in are sometimes in strange formats and may require custom parsing to get the data we want out of them. They are tab-delimited for the most part, but sometimes they won't even load in SSIS with tab-delimited settings. Here is a sample of how we do import parsing now with C# on one of the files we load: string lname = line.Substring(10, 20).Trim(); string fname = line.Substring(30, 10).Trim(); string mi = line.Substring(40, 1).Trim();
Ryan
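For comparison, the C# Substring calls in the comment above map directly onto Python slices. Note that C#'s Substring takes (start, length) while a Python slice takes (start, end), so Substring(10, 20) becomes line[10:30]. The sample record below is made up to match those offsets:

```python
# Build a hypothetical fixed-width record: cols 10-29 last name,
# cols 30-39 first name, col 40 middle initial.
line = "#" * 10 + "Smith".ljust(20) + "John".ljust(10) + "A"

lname = line[10:30].strip()   # C# line.Substring(10, 20).Trim()
fname = line[30:40].strip()   # C# line.Substring(30, 10).Trim()
mi    = line[40:41].strip()   # C# line.Substring(40, 1).Trim()
print(lname, fname, mi)  # Smith John A
```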
+2  A: 

Depending on the complexity and variability of your work, you should consider an ETL tool like SSIS (SQL Server Integration Services).

John Saunders
Some of the files we get in are difficult to parse. Are there any special tools that come with SSIS we can use other than a standard import for files?
Ryan
Please give more detail about the formats. In general, SSIS will allow you to parse just about any format - fixed field, delimited, XML, etc. If necessary, it's fairly easy to write your own parsers that fit in with SSIS so that only the parse function need be custom.
John Saunders
I didn't realize you could use custom parsing functions inside SSIS. I will have to look into this more. Thanks! I was looking at buying a tool like Advanced ETL Processor Ent.
Ryan
You can create custom components for SSIS - either purely in code, or in a Script Component. In any case, write a custom Source Transform that does the parsing and returns the results as new output columns.
John Saunders
I agree with John: definitely check out SSIS. It's a great ETL tool that comes with SQL Server 2005 or 2008 Standard or Enterprise editions.
rfonn