views: 166
answers: 4
I have a legacy binary file format containing records that we have to convert and serve to other parts of our system as XML. To give a sense of the data sizes, a single file may be up to 50 megs with 50,000 or more records in it. The XML conversion I have to work with blows this particular file up by a factor of 20 to nearly a gig.

(Unsurprisingly) compressing the XML with gzip brings it down to ~150 MB, so there is a lot of redundancy.

But what we have to serve out as XML is the individual records that make up the larger file. Each of these records is quite small. Random access to the records is a requirement. The records themselves contain a variety of different fields, so there is no mapping of elements to columns without a very large, mostly empty table.

As other parts of the system use a PostgreSQL database, we are considering storing each of the individual XML record nodes as a row in the database. But we are wondering how inefficient this would be storage-wise.

<xml>
<record><complex_other_xml_nodes>...</complex_other_xml_nodes></record>
<record>...</record>
<record>...</record>
<record>...</record>
<record>...</record>
</xml>

Or should we instead be evaluating an XML database (or something else)? Oh, and we don't need to update or change the XML after conversion; these legacy records are static.
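For concreteness, the kind of thing we are considering is roughly this (table and column names are just placeholders):

-- Roughly what we have in mind: one row per converted record
CREATE TABLE legacy_records (
    source_file  text     NOT NULL,   -- which binary file the record came from
    record_no    integer  NOT NULL,   -- position of the record within that file
    record_xml   text     NOT NULL,   -- the converted <record>...</record> fragment
    PRIMARY KEY (source_file, record_no)
);
-- Note: large text values (over roughly 2 kB) are TOASTed and compressed by
-- PostgreSQL automatically, which should claw back some of the per-record
-- redundancy, though not the cross-record redundancy that gzip exploits.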

+4  A: 

Storing data in a DB is more efficient because of one of the disadvantages of XML: each element carries its own metadata. A row that contains just one integer value might need over ten characters of XML just to describe that value; XML is very wordy. If you store the data in a DB, only the value itself is stored, with the metadata kept once, in the schema.
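For example (hypothetical field name):

-- XML repeats the field name in every record:  <customer_age>42</customer_age>
-- In a table, the name and type live once in the schema; each row stores only the value:
CREATE TABLE example_records (
    customer_age integer   -- 4 bytes per row, no per-row field name
);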

johnofcross
I should have been more clear in my question... the record format is a complex format with hundreds of different possible fields. So there is no easy mapping to a database.
Well, does it have an XML Schema? If so, then the DB transition will be easy. If it doesn't, I suggest creating a Schema that will allow for data integrity. If that's not possible, try and determine what data type each record contains, and map to a DB type.
johnofcross
Hmm... that strikes me as inefficient as well... either we would have lots of empty columns or the number of tables we would need would be fairly large. The XML is non-trivial, with record nodes that have a variable number of attributes and a variable number of nested child nodes, each with their own set of attributes. So mapping this all to a database doesn't feel right. Thanks for your continued feedback! We appreciate the pushback on our assumptions.
The thing is, XML in itself is inefficient because of its wordiness; there is no need to repeat metadata when we already know what the value is. If you map your XML elements into relational objects, then you can easily map your XML into a DB. If the relational mapping is efficient, you will never have duplicate or empty records. It will take a lot of time, granted, but the end result is better, because the solution will be a lot more scalable.
johnofcross
A 'proper' normalized database schema for the XML, though, would result in many, many tables. Reconstructing the XML to serve it out would require a corresponding number of disk seeks (if the data weren't cached in memory). It also seems like a messy query to go and find all the parts needed to rebuild the XML we have to send.
A proper database, if indexed correctly, will require very few disk seeks. It really goes back to the scalability point. Queries might be messy, but that's what stored procedures are for: you write them once and forget about them. With the right parameters, your results can be retrieved quickly and efficiently. And in a DB you can do a lot more, such as dynamic reporting.
johnofcross
@TB: Don't make assumptions about whether a particular DB can handle this. That's their job, and if there's a way to do it, likely they'll know about it.
John Saunders
I completely agree with johnofcross. The limitation that "T B" says he has because of the variety of XML formats can probably be overcome by mapping to relational objects. I wish this poster could give a small example.
djangofan
A: 

First of all, you'll probably want to store the values in a single column without the XML tags, to save a lot of space. Afterwards, you can build a simple view which selects '<record>' || column_name || '</record>' || chr(10). To get the entire XML document, I'd recommend using a concatenating function (I'm from an Oracle background, so I'm not sure how it's done in PostgreSQL) which takes a single-column cursor as input and outputs the entire result as one concatenated string. Then you can just concatenate the <xml> and </xml> tags, and it's good to go.
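For what it's worth, a rough PostgreSQL sketch of the same idea (hypothetical table and column names; assumes the record bodies are stored without their wrapping tags, and PostgreSQL 9.0+ for string_agg — older versions can use array_to_string(array_agg(...), '') instead):

-- View that wraps each stored record body back into a <record> fragment
CREATE VIEW record_fragments AS
SELECT source_file,
       record_no,
       '<record>' || record_body || '</record>' || chr(10) AS fragment
FROM   converted_records;

-- Reassemble one complete document on demand
SELECT '<xml>' || chr(10) || string_agg(fragment, '' ORDER BY record_no) || '</xml>'
FROM   record_fragments
WHERE  source_file = 'foo.dat';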

l0b0
Thanks, I should have been more clear in my question. The binary records contain hundreds of different possible fields. So there is no easy mapping to a database.
A: 

Database storage with proper indexes will most likely always be faster for random access, but it might take a lot of effort to split the records into their individual data elements. Maybe meet in the middle and store the entire record in a single data field, keyed by whatever unique identifier you would be using to query the data.
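Something along these lines, for example (hypothetical names; the single field could hold either the original binary record or the converted XML):

-- One row per record, the whole record in a single field, keyed by the lookup identifier
CREATE TABLE record_store (
    record_key   text  PRIMARY KEY,   -- whatever unique identifier you query by
    record_data  bytea NOT NULL       -- the entire record, stored as a single blob
);

-- Random access is then a single primary-key lookup
SELECT record_data FROM record_store WHERE record_key = 'ABC-12345';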

If you just want to try to 'trim' down the XML files and you have control over the schema, something as simple as making the node names one or two characters has saved me a lot of bandwidth/file size in the past; of course, the trade-off is that the readability of the XML goes away.

b

WillyCornbread
+1  A: 

Since the data is static and never (or rarely) changes, you are free to take a different approach and pre-generate the 50,000+ XML-formatted "records" into 50,000+ static files and then serve up this static content using Apache (or better: lighttpd or nginx). This is a very common technique for optimizing web sites. These static files can be regenerated as needed should the original data file be changed.

Note that you can get high availability and scalability by load balancing incoming HTTP requests to two or more static content server machines, each with its own copy of the data. You can also get scalability by using an HTTP reverse proxy cache in front of your web server(s).

But honestly, a gigabyte isn't what it used to be, and you can simply create a single PostgreSQL table that holds these 50,000+ chunks of pre-generated XML, keyed by whatever your row-index is.
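Something like this would do it (hypothetical names; assumes your conversion step can emit a tab-separated file with tabs/newlines in the XML escaped):

-- A single table of the pre-generated XML chunks, keyed by row index
CREATE TABLE pregen_xml (
    row_index  integer PRIMARY KEY,
    chunk      text    NOT NULL       -- one complete, pre-rendered <record>...</record>
);

-- Bulk-load the 50,000+ chunks from psql in one shot
\copy pregen_xml FROM 'converted_records.tsv'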

Jim Ferrans