
Problem: We log things to a database. To keep disk space usage capped, we export from the database to files that can be copied off, or just plain deleted. Some power above me wants to see this as JSON.

I see a single JSON file as a single object. So in this case we'd create an object with a list of log messages. Problem is, this file could have several million log items in it, which I imagine would choke most parsers. So the only way to do it, I think, is for each log item to have its own JSON object.

This means that JSON parsers can't handle the file as is. But we could write a line parser to read in the file and push each line through a JSON parser.
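For concreteness, here is roughly the export side I have in mind, sketched in Python (the record fields and file name are just examples, not our real schema):

import json

# Rough sketch: write each log record as its own JSON object on its own line,
# so nothing ever has to parse more than one record at a time.
def export_logs(records, path):
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")  # json.dumps keeps each record on one line

export_logs(
    [{"level": "INFO", "msg": "started"}, {"level": "ERROR", "msg": "disk full"}],
    "logs.json",
)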

Does this sound correct?

I believe XML would have the same problem, but at least there we have SAX. Or we could do it as a bunch of mini-docs all prefixed by their length.

Thanks.

A: 

This means that JSON parsers can't handle the file as is. But we could write a line parser to read in the file and push each line through a JSON parser.

Does this sound correct?

That sounds reasonable... so you'd end up with a large file of lines delimited by line breaks, each line consisting of one JSON object.
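On the reading side, something like this Python sketch (the file name is just an example) only ever holds one record in memory at a time:

import json

def read_log_lines(path):
    # Stream the file line by line and hand each line to the JSON parser.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

for entry in read_log_lines("logs.json"):
    print(entry)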

Jason S
+2  A: 

That's correct. I've been unable to find a JSON parser that does not require the whole thing to be in memory at once, at least during some part of the process (I had a database dump in JSON format I needed to parse... it was a nightmare).

The common way this is currently done is either with an object style or a CSV style.

object style:

{"name":"bob","position":"ceo","start_date":"2007-08-10"}
{"name":"tom","position":"cfo","start_date":"2007-08-11"}

etc.

csv style:

["name","position","start_date"]
["bob","ceo","2007-08-10"]
["tom","cfo","2007-08-11"]

You waste a lot of disk space with the object style, but each line is self-contained.

You save disk space with the CSV style, but your data is more tightly coupled to the format, and unless you need nested data structures like:

["bill","cto","2007-08-12",{"projects":["foo","bar","baz"]}]

you might as well actually use the CSV format.
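For what it's worth, here is a rough Python sketch of reading the CSV style back: each row is zipped against the header line, which is exactly where the coupling to the format comes from (the file name is just an example):

import json

def read_csv_style(path):
    with open(path, "r", encoding="utf-8") as f:
        header = json.loads(next(f))  # first line is the header array
        for line in f:
            if line.strip():
                yield dict(zip(header, json.loads(line)))  # pair each value with its column name

for row in read_csv_style("people.json"):
    print(row["name"], row["position"])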

Freshhawk
+3  A: 

The whole idea of JSON doesn't exactly fit with storing several million entries in one file...

The whole point of JSON was to remove the overhead caused by XML. If you write each record as a JSON object, then you are back to storing overhead bits (the repeated field names) that carry no meaning. The next logical step is to write out a regular CSV file with a header record, which everything on the planet understands how to import.
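For example, Python's standard csv module will write the header record and handle quoting for you (the field and file names below are just examples):

import csv

def export_csv(records, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "position", "start_date"])
        writer.writeheader()  # header record up front
        writer.writerows(records)

export_csv(
    [{"name": "bob", "position": "ceo", "start_date": "2007-08-10"}],
    "logs.csv",
)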

If, for some reason, you have child records, then you should look at how regular EDI works.

Chris Lively
+1  A: 

Your strategy sounds right: keep each entry as a single JSON object, generate/parse them with standard JSON tools, and handle the grouping problem yourself outside JSON.

Besides dumping all the data into just one file, you may want to consider other strategies. For example, you can keep each object in a separate file, or (if that's excessive, since you say you have millions of objects) batch them up into files in reasonable groups, naming the files according to some identifier that you have for these objects: either just the primary key (so you get "0-10000", "10001-20000", etc.) or something else. E.g., for log entries, date/time would be appropriate. This way, should some poor soul need to use or examine this data in any shape someday, it's a bit more manageable. And to get these files into an archival format, just zip/compress them into one file; JSON, being text data, should compress quite well.
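As a rough Python sketch of the batching idea (the grouping and file-naming scheme here is just one possibility):

import gzip
import json

def export_batches(entries_by_date):
    # One gzip-compressed file of JSON lines per date, e.g. logs-2007-08-10.json.gz
    for date, entries in entries_by_date.items():
        with gzip.open(f"logs-{date}.json.gz", "wt", encoding="utf-8") as f:
            for entry in entries:
                f.write(json.dumps(entry) + "\n")

export_batches({
    "2007-08-10": [{"level": "INFO", "msg": "started"}],
    "2007-08-11": [{"level": "ERROR", "msg": "disk full"}],
})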

Jaanus