views: 166
answers: 4
I have a legacy binary file format containing records that we have to convert and serve to other parts of our system as XML. To give a sense of the data sizes, a single file may be up to 50 megs with 50,000 or more records in it. The XML conversion I have to work with blows this particular file up by a factor of 20 to nearly a gig.

(Unsurprisingly) compressing the XML with gzip brings it down to ~150 MB, so there is a lot of redundancy.

But what we have to serve out as XML is the individual records that make up the larger file. Each of these records is quite small. Random access to the records is a requirement. The records themselves contain a variety of different fields, so there is no mapping of elements to columns without a very large, mostly empty table.

As other parts of the system use a PostgreSQL database, we are considering storing each of the individual XML record nodes as a row in the database. But we are wondering how inefficient this would be storage-wise.

<xml>
<record><complex_other_xml_nodes>...</complex_other_xml_nodes></record>
<record>...</record>
<record>...</record>
<record>...</record>
<record>...</record>
</xml>

Or should we instead be evaluating an XML database (or something else)? Oh, and we don't need to update or change the XML after conversion; these legacy records are static.
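For concreteness, the kind of thing we are considering is roughly this (table and column names are just placeholders):

-- Roughly what we have in mind: one row per converted record
CREATE TABLE legacy_records (
    source_file  text     NOT NULL,   -- which binary file the record came from
    record_no    integer  NOT NULL,   -- position of the record within that file
    record_xml   text     NOT NULL,   -- the converted <record>...</record> fragment
    PRIMARY KEY (source_file, record_no)
);
-- Note: large text values (over roughly 2 kB) are TOASTed and compressed by
-- PostgreSQL automatically, which should claw back some of the per-record
-- redundancy, though not the cross-record redundancy that gzip exploits.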

+4  A: 

Storing data in a DB is more efficient because of one of the disadvantages of XML: each element carries its own metadata. A row that contains just one integer value might need over ten characters of XML just to describe that value; XML is very wordy. If you store the data in a DB, only the value itself is stored, with the metadata kept once, in the schema.
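For example (hypothetical field name):

-- XML repeats the field name in every record:  <customer_age>42</customer_age>
-- In a table, the name and type live once in the schema; each row stores only the value:
CREATE TABLE example_records (
    customer_age integer   -- 4 bytes per row, no per-row field name
);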

johnofcross
I should have been more clear in my question... the record format is a complex format with hundreds of different possible fields. So there is no easy mapping to a database.
Well, does it have an XML Schema? If so, then the DB transition will be easy. If it doesn't, I suggest creating a Schema that will allow for data integrity. If that's not possible, try and determine what data type each record contains, and map to a DB type.
johnofcross
Hmm... that strikes me as inefficient as well... either we would have lots of empty columns or the number of tables we would need would be fairly large. The XML is non-trivial, with record nodes that have a variable number of attributes and a variable number of nested child nodes, each with their own set of attributes. So mapping this all to a database doesn't feel right. Thanks for your continued feedback! We appreciate the pushback on our assumptions.
The thing is, XML in itself is inefficient because of its wordiness; there is no need to repeat metadata when we already know what the value is. If you map your XML elements into relational objects, then you can easily map your XML into a DB. If the relational mapping is efficient, you will never have duplicate or empty records. It will take a lot of time, granted, but the end result is better, because the solution will be a lot more scalable.
johnofcross
A 'proper' normalized database schema for the XML, though, would result in many, many tables. Reconstructing the XML to serve it out would require a corresponding number of disk seeks (if the data weren't cached in memory). It also seems like a messy query to go and find all the parts needed to rebuild the XML we have to send.
A proper database, if indexed correctly, will require very few disk seeks. It really goes back to the scalability point. Queries might be messy, but that's what stored procedures are for: you write them once and forget about them. With the right parameters, your results can be retrieved quickly and efficiently. And in a DB you can do a lot more, such as dynamic reporting.
johnofcross
@TB: Don't make assumptions about whether a particular DB can handle this. That's their job, and if there's a way to do it, likely they'll know about it.
John Saunders
I completely agree with johnofcross. The limitation that "T B" says he has because of the variety of XML formats can probably be overcome by mapping to relational objects. I wish this poster could give a small example.
djangofan
A: 

First of all, you'll probably want to store the values in a single column without the XML tags, to save a lot of space. Afterwards, you can build a simple view which selects '<record>' || column_name || '</record>' || chr(10). To get the entire XML document, I'd recommend using a concatenating function (I'm from an Oracle background, so I'm not sure how it's done in PostgreSQL) which takes a single-column cursor as input and outputs the entire result as one concatenated string. Then you can just concatenate the <xml> and </xml> tags, and it's good to go.
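For what it's worth, a rough PostgreSQL sketch of the same idea (hypothetical table and column names; assumes the record bodies are stored without their wrapping tags, and PostgreSQL 9.0+ for string_agg — older versions can use array_to_string(array_agg(...), '') instead):

-- View that wraps each stored record body back into a <record> fragment
CREATE VIEW record_fragments AS
SELECT source_file,
       record_no,
       '<record>' || record_body || '</record>' || chr(10) AS fragment
FROM   converted_records;

-- Reassemble one complete document on demand
SELECT '<xml>' || chr(10) || string_agg(fragment, '' ORDER BY record_no) || '</xml>'
FROM   record_fragments
WHERE  source_file = 'foo.dat';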

l0b0
Thanks, I should have been more clear in my question. The binary records contain hundreds of different possible fields. So there is no easy mapping to a database.
A: 

Database storage with proper indexes will most likely always be faster for random access, but it might take a lot of effort to split the records into their individual data elements. Maybe meet in the middle and store the entire record in a single data field, keyed by whatever unique identifier you would be using to query the data.
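Something along these lines, for example (hypothetical names; the single field could hold either the original binary record or the converted XML):

-- One row per record, the whole record in a single field, keyed by the lookup identifier
CREATE TABLE record_store (
    record_key   text  PRIMARY KEY,   -- whatever unique identifier you query by
    record_data  bytea NOT NULL       -- the entire record, stored as a single blob
);

-- Random access is then a single primary-key lookup
SELECT record_data FROM record_store WHERE record_key = 'ABC-12345';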

If you just want to try to 'trim' down the XML files and you have control over the schema, something as simple as making the node names one or two characters has saved me a lot of bandwidth/file size in the past; of course, the trade-off is that the readability of the XML goes away.

b

WillyCornbread
+1  A: 

Since the data is static and never (or rarely) changes, you are free to take a different approach and pre-generate the 50,000+ XML-formatted "records" into 50,000+ static files and then serve up this static content using Apache (or better: lighttpd or nginx). This is a very common technique for optimizing web sites. These static files can be regenerated as needed should the original data file be changed.

Note that you can get high availability and scalability by load balancing incoming HTTP requests to two or more static content server machines, each with its own copy of the data. You can also get scalability by using an HTTP reverse proxy cache in front of your web server(s).

But honestly, a gigabyte isn't what it used to be, and you can simply create a single PostgreSQL table that holds these 50,000+ chunks of pre-generated XML, keyed by whatever your row-index is.
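Something like this would do it (hypothetical names; assumes your conversion step can emit a tab-separated file with tabs/newlines in the XML escaped):

-- A single table of the pre-generated XML chunks, keyed by row index
CREATE TABLE pregen_xml (
    row_index  integer PRIMARY KEY,
    chunk      text    NOT NULL       -- one complete, pre-rendered <record>...</record>
);

-- Bulk-load the 50,000+ chunks from psql in one shot
\copy pregen_xml FROM 'converted_records.tsv'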

Jim Ferrans