We receive lots of data as flat files: delimited or just fixed-length records. It's sometimes hard to find out what the files actually contain.

Are there any well-established practices for embedding the schema at the beginning or the end of a file to make the file self-explanatory?

Just to get an idea, imagine something like this:

<data name="test" records="2" type="fixed">
   <field name="foo" start="0" length="2" type="numeric"/>
   <field name="bar" start="2" length="4" type="text"/>
</data>
11test
12ing 

We would parse the XML at the beginning and use it to read the records.
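To make the idea concrete, here is a minimal Python sketch of such a reader. It assumes the header is written as well-formed XML (quoted attribute values, self-closing field elements), that the header ends at the first `</data>`, and that each record sits on its own line. The function name `read_self_describing` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample file content: XML header followed by two fixed-length records.
SAMPLE = (
    '<data name="test" records="2" type="fixed">\n'
    '   <field name="foo" start="0" length="2" type="numeric"/>\n'
    '   <field name="bar" start="2" length="4" type="text"/>\n'
    '</data>\n'
    '11test\n'
    '12ing \n'
)

END_TAG = "</data>"  # assumption: the header ends at the first closing tag


def read_self_describing(text):
    """Split text into header and records, then slice each record by field."""
    header_end = text.index(END_TAG) + len(END_TAG)
    schema = ET.fromstring(text[:header_end])
    fields = [
        (f.get("name"), int(f.get("start")), int(f.get("length")), f.get("type"))
        for f in schema.findall("field")
    ]

    # Everything after the header is data, one record per line.
    lines = text[header_end:].lstrip("\r\n").splitlines()

    # Check the record count the sender declared in the header.
    expected = int(schema.get("records"))
    if len(lines) != expected:
        raise ValueError(f"expected {expected} records, found {len(lines)}")

    rows = []
    for line in lines:
        row = {}
        for name, start, length, ftype in fields:
            raw = line[start:start + length]
            row[name] = int(raw) if ftype == "numeric" else raw
        rows.append(row)
    return rows


rows = read_self_describing(SAMPLE)
print(rows)
```

For very large files the same approach would read the header first and then stream the records instead of holding the whole text in memory.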

+1  A: 

Have you looked at Protocol Buffers for inspiration?

Rob Fonseca-Ensor
PB is optimized for small records, to get faster and lower-latency communications. AFAICT, it can't describe existing schemas.
Javier
Javier is correct: we're trying to add metadata to existing schemas. +1 for an interesting link though.
Ville Koskinen
+1  A: 

So far as I'm aware, no - or at least nothing widely adopted.

The only thing I'm aware of (in terms of a widely accepted standard) is for the first row of the data file to hold the column names, at least for delimited records. For fixed length it's harder, especially if your data can contain multiple record types (which I've found to be far more likely with fixed length than with delimited).

From where I sit, I'd suggest that you can't really embed the definition into the file. I'm assuming you're getting data from external sources, so you're unlikely to get help from them, and even if you do, you immediately create challenges: you can't (for example) easily open the files with Excel if necessary.

Thinking a bit laterally, you could - if using XML - potentially embed the file into the definition (big lump of CDATA). This is a slightly more practical solution, as it's putting a wrapper round your external data rather than asking that the data itself be modified. Not sure how practical this is, but it feels better to me than the other way round.
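A rough Python sketch of the wrapper idea follows. The `payload` element name and the overall layout are assumptions made for this illustration, not an established format. Note that a CDATA section cannot contain the sequence `]]>`, so the wrapped data would need checking for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the definition comes first, and the untouched
# flat file is carried inside a CDATA section in a <payload> element.
WRAPPED = (
    '<data name="test" records="2" type="fixed">\n'
    '  <field name="foo" start="0" length="2" type="numeric"/>\n'
    '  <field name="bar" start="2" length="4" type="text"/>\n'
    '  <payload><![CDATA[11test\n'
    '12ing \n'
    ']]></payload>\n'
    '</data>\n'
)


def unwrap(xml_text):
    """Return (field definitions, raw record lines) from the wrapper."""
    root = ET.fromstring(xml_text)
    fields = [
        (f.get("name"), int(f.get("start")), int(f.get("length")))
        for f in root.findall("field")
    ]
    # The parser hands back CDATA content as ordinary element text.
    records = root.findtext("payload").splitlines()
    return fields, records


fields, records = unwrap(WRAPPED)
print(fields)
print(records)
```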

Murph
Thanks, +1. Would you prefer an additional file containing the metadata? Our goal is just to add some data to the files to ensure they are what we expect and that they've been created correctly, for example that the number of records is what the sender reports. The files are very large and it's not necessary for us to be able to open the files in Excel or other standard tools.
Ville Koskinen
How much control do you have over the created files? That's a fairly key question (-: When I've worked with EDI files (fixed length), they tended to have headers and footers that defined things like record counts and possibly totals, which is obviously helpful.
Murph
The files come from outside the organization. The content and the structure of a file are defined in a formal agreement, so we have some control. However, we are hesitant to make big changes to any of the files because we have lots of legacy code for further processing.
Ville Koskinen
Hmm... on that basis I think you had it in the first comment. What I think you want is a "manifest" file to go with the file you've been sent: the actual file remains unchanged, but you get an XML file alongside it that both defines the format and gives you some form of validation. If it's too hard for the supplier to change (been there...), you're no worse off. If it's not, you load the XML, which tells you about the file you're actually going to load and how to validate it, and off you go. Minimal pain. No idea if there are standards for manifest files though!
Murph
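A minimal Python sketch of the manifest approach, for illustration only: the manifest layout (a `records` attribute plus `field` elements with a `length`) and the function name `validate_against_manifest` are assumptions, not an existing standard. The data file is streamed line by line, so it should cope with very large files.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def validate_against_manifest(data_path, manifest_path):
    """Return a list of problems; an empty list means the file matches."""
    manifest = ET.parse(manifest_path).getroot()
    expected_records = int(manifest.get("records"))
    record_length = sum(int(f.get("length")) for f in manifest.findall("field"))

    problems = []
    count = 0
    with open(data_path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            record = line.rstrip("\n")
            count += 1
            if len(record) != record_length:
                problems.append(
                    f"record {number}: length {len(record)}, expected {record_length}"
                )
    if count != expected_records:
        problems.append(f"found {count} records, manifest says {expected_records}")
    return problems


MANIFEST = (
    '<data name="test" records="{n}" type="fixed">'
    '<field name="foo" length="2" type="numeric"/>'
    '<field name="bar" length="4" type="text"/>'
    '</data>'
)

# Demo: one data file, checked against a correct and an incorrect manifest.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "test.dat")
    good_path = os.path.join(tmp, "good.xml")
    bad_path = os.path.join(tmp, "bad.xml")
    with open(data_path, "w", encoding="utf-8") as fh:
        fh.write("11test\n12ing \n")
    with open(good_path, "w", encoding="utf-8") as fh:
        fh.write(MANIFEST.format(n=2))
    with open(bad_path, "w", encoding="utf-8") as fh:
        fh.write(MANIFEST.format(n=3))

    ok = validate_against_manifest(data_path, good_path)
    bad = validate_against_manifest(data_path, bad_path)

print(ok)
print(bad)
```

Because the data file itself is never modified, the legacy processing code can keep reading it as before.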