tags:
views: 194
answers: 4

I have to read invoice ascii files that are structured in a really convoluted way, for example:

55651108 3090617.10.0806:46:32101639Example Company               Construction Company          Example Road. 9            9524 Example City

There's actually additional stuff in there, but I don't want to confuse you any further.

I know I'm doomed if the client can't offer a better structure. For instance, 30906 is an iterative number that grows, and 101639 is the CustomerId. The whitespace between "Example Company" and "Construction Company" is of variable length. The field "Example Company" could itself contain whitespace of variable length, for instance "Microsoft Corporation Redmond". Same with the other fields. So there's no clear way to extract data from the latter part.

But that's not the question; I got carried away. My question is as follows:

If the input were somewhat structured and well defined, how would you guard against future changes in its structure? How would you design and implement a reader?

I was thinking of using a simple EAV Model in my DB, and use text or xml templates that describe the input, the entity names, and their valuetypes. I would parse the invoice files according to the templates.
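
As a rough illustration of that idea, here is a minimal sketch of a template-driven reader in Python. The template format, field names, widths, and types below are all illustrative assumptions, not the actual invoice layout:

```python
# Hypothetical template: one field per line, "name:width:type".
# Everything here (names, widths, the format itself) is an assumption.
TEMPLATE = """\
invoice_no:8:int
filler:1:str
sequence:5:int
date:8:str
time:8:str
customer_id:6:int
company:30:str
"""

CONVERTERS = {"int": int, "str": str.strip}

def parse_template(text):
    """Turn the template text into a list of (name, width, converter)."""
    fields = []
    for entry in text.strip().splitlines():
        name, width, typ = entry.split(":")
        fields.append((name, int(width), CONVERTERS[typ]))
    return fields

def parse_record(line, fields):
    """Slice one fixed-width line according to the template."""
    record, offset = {}, 0
    for name, width, convert in fields:
        record[name] = convert(line[offset:offset + width])
        offset += width
    return record

fields = parse_template(TEMPLATE)
line = ("55651108 3090617.10.0806:46:32101639"
        + "Example Company".ljust(30))
rec = parse_record(line, fields)
```

When the file layout changes, only the template changes; the parsing code stays the same.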

A: 

I believe that a template describing the entity names and the value types is a good approach: something like a "schema" for a text file.

What I would try to do is separate the reader from the rest of the application as much as possible. So the question really is how to define an interface that can accommodate changes in the parameter list. This may not always be possible, but still, if you rely on an interface to read the data, you can change the implementation of the reader without affecting the rest of the system.
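
For example, a minimal sketch of such an interface in Python (the class names, field names, and layout here are hypothetical, not part of the question's actual format):

```python
from abc import ABC, abstractmethod

class InvoiceReader(ABC):
    """Abstract reader: the rest of the system depends only on this."""
    @abstractmethod
    def read(self, line):
        """Return a dict mapping field name -> value for one input line."""

class FixedWidthReader(InvoiceReader):
    """One concrete implementation; swap it out if the format changes."""
    def __init__(self, layout):
        # layout: list of (name, width) pairs -- illustrative
        self.layout = layout
    def read(self, line):
        record, offset = {}, 0
        for name, width in self.layout:
            record[name] = line[offset:offset + width].strip()
            offset += width
        return record

reader = FixedWidthReader([("invoice_no", 8), ("filler", 1), ("sequence", 5)])
record = reader.read("55651108 30906")
```

Code that consumes invoices only ever sees `InvoiceReader`, so a new file format means a new subclass, not changes throughout the system.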

kgiannakakis
A: 

Well, your file format looks much like the French protocol called Etebac, used between banks and their customers.

It's a fixed-width text format.

The best you can do is use some kind of unpack function:

$ perl -MData::Dumper -e 'print Dumper(unpack("A8 x A5 A8 A8 A6 A30 A30", "55651108 3090617.10.0806:46:32101639Example Company               Construction Company          Example Road. 9            9524 Example City"))'
$VAR1 = '55651108';
$VAR2 = '30906';
$VAR3 = '17.10.08';
$VAR4 = '06:46:32';
$VAR5 = '101639';
$VAR6 = 'Example Company';
$VAR7 = 'Construction Company';
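
The same idea translates to Python's struct module, if that's closer to your stack. The format string below mirrors the Perl template; note that unlike Perl's A specifier, Python's s does not strip trailing padding, so it is stripped manually (the widths are guesses from the sample line):

```python
import struct

# Sample record, padded to the assumed fixed widths.
line = ("55651108 3090617.10.0806:46:32101639"
        + "Example Company".ljust(30)
        + "Construction Company".ljust(30))

# Mirrors the Perl template "A8 x A5 A8 A8 A6 A30 A30":
# 8 bytes, skip 1, 5 bytes, 8, 8, 6, 30, 30.
fmt = "8sx5s8s8s6s30s30s"
parts = struct.unpack(fmt, line.encode("ascii"))

# struct's "s" keeps trailing spaces, unlike Perl's "A", so strip them.
values = [p.decode("ascii").rstrip() for p in parts]
```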

What you should do is check every field you extract against what it's supposed to look like (e.g. that it matches XX.XX.XX or YY:YY:YY, or that it does not start with a space), and abort if it doesn't.
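
A sketch of that kind of sanity check in Python, assuming the record has already been split into named fields (the field names and patterns are illustrative):

```python
import re

# Illustrative per-field shape checks, in the spirit of XX.XX.XX / YY:YY:YY.
VALIDATORS = {
    "date": re.compile(r"\d{2}\.\d{2}\.\d{2}$"),
    "time": re.compile(r"\d{2}:\d{2}:\d{2}$"),
    "customer_id": re.compile(r"\d{6}$"),
}

def validate(record):
    """Raise (i.e. abort) as soon as a field doesn't match its expected shape."""
    for name, pattern in VALIDATORS.items():
        value = record[name]
        if value.startswith(" ") or not pattern.match(value):
            raise ValueError("field %r malformed: %r" % (name, value))

validate({"date": "17.10.08", "time": "06:46:32", "customer_id": "101639"})
```

Failing loudly on the first malformed field is much safer than silently loading shifted columns into the database.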

mat
A: 

I'd have a database of invoice data, with tables such as Company, Invoices, Invoice_Items. It depends on complexity: do you wish to record your orders as well, and then link invoices to the orders, and so on? But I digress...

I'd have an in-memory model of the database model, but that's a given. If XML output and input was needed, I would have an XML serialisation of the model if I needed to supply the invoices as data elsewhere, and a SAX parser to read it in. Some APIs might make this trivial to do, or maybe you just want to expose a web service to your repository if you are going to have clients reading from you.

As for reading in the text files (there isn't much information about them: why would their format change? Where are they coming from? Are you replacing this system, or will it keep running, with you just a new backend that it feeds?): you say the number of spaces is variable. Is that just because the format uses fixed-width columns? I would create a reader that reads them into your model, and hence your database schema.

JeeBee
+1  A: 

"If the input was somewhat structured and well defined, how would you guard against future changes in its structure. How would you design and implement a reader?"

You must define the layout in a way you can flexibly pick it apart.

Here's a Python version:

class Field( object ):
    def __init__( self, name, size ):
        self.name = name
        self.size = size
        self.offset = None   # filled in by Record

class Record( object ):
    def __init__( self, *fieldList ):   # accepts Field(...), Field(...), ...
        self.fields = fieldList
        self.fieldMap = {}
        offset = 0
        for f in self.fields:
            f.offset = offset
            offset += f.size
            self.fieldMap[f.name] = f
    def parse( self, aLine ):
        self.buffer = aLine
    def get( self, aField ):
        fld = self.fieldMap[aField]
        return self.buffer[ fld.offset : fld.offset + fld.size ]
    def __getattr__( self, aField ):
        try:
            return self.get( aField )
        except KeyError:
            raise AttributeError( aField )

Now you can define records

myRecord= Record( 
    Field('aField',8), 
    Field('filler',1), 
    Field('another',5),
    Field('somethingElse',8),
)

This gives you a fighting chance of picking apart some input in a reasonably flexible way.

myRecord.parse(aLine)       # aLine: one fixed-width record from the file
myRecord.get('aField')      # or simply myRecord.aField

Once you can parse, adding conversions is a matter of subclassing Field to define the various types (dates, amounts, etc.)
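
For instance, a sketch of such subclasses (a minimal Field base with a convert hook is repeated here so the snippet stands alone; the hook itself and the cents-based amount encoding are assumptions, not part of the answer above):

```python
import datetime

class Field:
    """Minimal base, mirroring the class above, plus a convert hook."""
    def __init__(self, name, size):
        self.name = name
        self.size = size
        self.offset = None
    def convert(self, text):
        return text.strip()

class DateField(Field):
    """Parses a dd.mm.yy column (like '17.10.08') into a date object."""
    def convert(self, text):
        return datetime.datetime.strptime(text.strip(), "%d.%m.%y").date()

class AmountField(Field):
    """Assumes the amount column stores cents: '000012345' -> 123.45."""
    def convert(self, text):
        return int(text) / 100.0

d = DateField("date", 8)
value = d.convert("17.10.08")
```

Record.get would then call the field's convert method instead of returning the raw slice, so each column type carries its own parsing rule.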

S.Lott