views:

100

answers:

8

I need to store easily parsable data in a file as an alternative to a database-backed solution (not up for debate). Since it's going to be storing lots of data, a lightweight syntax would be preferable. It does not necessarily need to be human readable, but it should be parsable. Note that there are going to be multiple types of fields/columns, some of which might be used and some of which won't.

From my limited experience without a database, I see several options, all with issues:

  • CSV - I could technically do this, and it is very light. However, parsing would be an issue, and it would be painful if I wanted to add a column. Multi-language support is iffy, mostly people's own custom parsers.
  • XML - This is the perfect solution on many fronts except for parsing and overhead. That's a lot of tags, it would generate a giant file, and parsing would be very resource-consuming. However, virtually every language supports XML.
  • JSON - This is the middle ground, but I don't really want to use it, as its syntax is awkward and parsing is non-trivial. Language support is iffy.

So all have their disadvantages. But what would be the best when trying to aim for language support AND somewhat small file size?

+3  A: 

How about sqlite? This would allow you to basically embed the "DB" in your application, but not require a separate DB backend.

Also, if you end up using a DB backend later, it should be fairly easy to switch over.

If that's not suitable, I'd suggest one of the DBM-like stores for key-value lookups, such as Berkeley DB or tdb.

Jeremy Kerr
SQLite is an option, but I really wanted flat-file storage, not just a DB in a file
TheLQ
+1  A: 

If you're just using the basics of all these formats, all of the parsers are trivial. If CSV is an option, then for XML and JSON you're talking blocks of name/value pairs, so there's not even a recursive structure involved. json.org has support for pretty much any language.

That said.

I don't see what the problem is with CSV. If people write bad parsers, too bad. If you're concerned about compatibility, adopt the default CSV dialect from Excel. Anyone who can't parse CSV from Excel isn't going to get far in this world. The weakest support you'll find in CSV is for embedded newlines and carriage returns. If your data doesn't have those, then it's not a problem. The only other issue is embedded quotation marks, and those are escaped in CSV. If you don't have those either, then it's even more trivial.
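To make the "CSV parsing is trivial" point concrete, here is a minimal sketch of an Excel-style field splitter, handling the two cases mentioned above: commas inside quoted fields, and embedded quotes escaped by doubling (`""`). The class and method names are my own; a real project would more likely use an existing CSV library.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal Excel-dialect CSV line parser: fields containing commas or quotes
// are wrapped in double quotes, and an embedded quote is written as "".
public class CsvLine {
    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field is a literal quote
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

Note this deliberately ignores embedded newlines, per the assumption above that the data doesn't contain them.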

As for "adding a column", you have that problem with all of these. If you add a column, you get to rewrite the entire file. I don't see this being a big issue either.

If space is your concern, CSV is the most compact, followed by JSON, followed by XML. None of the resulting files can be easily updated. They pretty much all would need to be rewritten for any change in the data. CSV has the advantage that it's easily appended to, as there's no closing element (like JSON and XML).

Will Hartung
If I used XML, I could add an element just by adding a new tag. I forgot, though, that CSV is importable into Excel spreadsheets.
TheLQ
Then your XML is not conforming. In theory, an XML file is one single element with potentially a zillion children of that root. I'm not saying it can't be done, and others do it; I'm just saying that what you end up with is a file with several XML elements, rather than a conforming XML document.
Will Hartung
<serverline><type>Mode</type><mode>T</mode></serverline>
TheLQ
You miss the point. If you have a file that looks like <serverline>...</serverline> \n <serverline>...</serverline>, you may have individual lines that happen to look like XML, but the file is not a conforming, well-formed XML document. A conforming XML document has a single root element. <html>....</html>, for example. Specifically, don't expect to take the log file and be able to read the entire thing with an off-the-shelf XML reader; it will likely read only the first line and then stop.
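One common workaround for this single-root-element problem, sketched below with the JDK's built-in parser: wrap the whole file's contents in a synthetic root element before parsing. The `<log>` wrapper name and the class are my own invention; this assumes each line is itself a well-formed fragment.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;

// Reads a file of one-XML-fragment-per-line log entries by wrapping the
// whole thing in a synthetic root, so an off-the-shelf parser accepts it.
public class LogReader {
    public static int countEntries(String fileContents) throws Exception {
        String wrapped = "<log>" + fileContents + "</log>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
        return doc.getDocumentElement()
                  .getElementsByTagName("serverline").getLength();
    }
}
```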
Will Hartung
Well after hearing this I think I'm going to use CSV, mainly because of its compactness and format. Thanks for the help
TheLQ
A: 

JSON is probably your best bet (it's lightish, fast to parse, and self-descriptive, so you can add your new columns as time goes by). You said parsable - do you mean using Java? There are JSON libraries for Java that take the pain out of most of the work. There are also various lightweight in-memory databases that can persist to a file (in case "not an option" means you don't want a big separate database).
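One way the "add columns as time goes by" property works in practice, sketched here with hand-rolled escaping (a library such as the one from json.org would normally do this): write one JSON object per line, so new records can be appended without rewriting the file, and each record can carry whatever fields it needs. The class and method names are illustrative only.

```java
import java.util.Map;

// Emits one flat JSON object per line ("JSON Lines" style): the file stays
// appendable, and new fields can be added record by record.
public class JsonRow {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static String toLine(Map<String, String> row) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
        }
        return sb.append("}").toString();
    }
}
```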

jowierun
A: 

If this is just for logging some data quickly to a file, I find tab-delimited files easier to parse than CSV, so if it's a flat text file you're looking for, I'd go with that (so long as you don't have tabs in the feed, of course). If you have fixed-size columns, you could use fixed-length fields. That is even quicker, because you can seek.
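The seek advantage of fixed-length fields is that row N starts at a computable byte offset, so no scanning is needed. A minimal sketch (single padded field per record, names and sizes chosen arbitrarily for illustration):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Fixed-length records: every row is RECORD_LEN bytes, so row N starts at
// byte N * RECORD_LEN and can be read with a single seek.
public class FixedFile {
    static final int FIELD_LEN = 16;             // one space-padded field
    static final int RECORD_LEN = FIELD_LEN + 1; // field + newline

    public static void append(File f, String value) throws Exception {
        String padded = String.format("%-" + FIELD_LEN + "s", value);
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.seek(raf.length());
            raf.write((padded + "\n").getBytes(StandardCharsets.US_ASCII));
        }
    }

    public static String read(File f, int row) throws Exception {
        byte[] buf = new byte[FIELD_LEN];
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek((long) row * RECORD_LEN);
            raf.readFully(buf);
        }
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }
}
```

The trade-off, as the comment below notes, is that any value longer than the fixed width simply doesn't fit.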

If it's unstructured data that might need some analysis, I'd go for JSON.

If it's structured data and you envision ever doing any querying on it... I'd go with sqlite.

WOPR
Tab-delimited seems horrible when you add a line that's one character longer than the rest of the column. And there might be tabs in the data.
TheLQ
A: 

When I needed a solution like this, I wrote up a simple representation of data prefixed with its length. For example, "Hi" would be represented (in hex) as 02 48 69.
To form rows, just nest this operation (first a number giving the number of fields, then the fields themselves). For example, if field 0 contains "Hi" and field 1 contains "abc", it would be:

Num of fields   Field Length   Data    Field Length   Data
02              02             48 69   03             61 62 63

You can also use first row as names for the columns. (I have to say this is kind of a DB backend).
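A sketch of this length-prefixed scheme with the JDK's data streams, matching the hex example above (single-byte lengths, so each field is assumed to be under 256 bytes; class and method names are illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Length-prefixed rows: one byte for the field count, then for each field
// one byte for its length followed by its raw bytes (as in the hex example).
public class LenPrefixed {
    public static byte[] writeRow(String... fields) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(fields.length);
        for (String f : fields) {
            byte[] b = f.getBytes(StandardCharsets.UTF_8);
            out.writeByte(b.length);   // assumes each field is < 256 bytes
            out.write(b);
        }
        return bos.toByteArray();
    }

    public static String[] readRow(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String[] fields = new String[in.readUnsignedByte()];
        for (int i = 0; i < fields.length; i++) {
            byte[] b = new byte[in.readUnsignedByte()];
            in.readFully(b);
            fields[i] = new String(b, StandardCharsets.UTF_8);
        }
        return fields;
    }
}
```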

Dani
A: 

You can use CSV, and if you only add columns at the end, this is simple to handle: if a row has fewer columns than you expect, use the default value for the "missing" fields.

If you want to be able to change the order/use of fields, you can add a heading row. i.e. the first row has the names of the columns. This can be useful when you are trying to read the data.
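A minimal sketch of both ideas together, assuming unquoted fields for brevity: the first line names the columns, fields are looked up by name, and a row written before a column existed simply reports blank for it. The class name is illustrative.

```java
import java.util.Arrays;
import java.util.List;

// CSV with a header row: fields are looked up by column name, and columns
// that an older, shorter row doesn't have come back as blank defaults.
// Assumes simple unquoted fields for brevity.
public class HeaderedCsv {
    private final List<String> columns;

    public HeaderedCsv(String headerLine) {
        columns = Arrays.asList(headerLine.split(","));
    }

    public String get(String row, String column) {
        String[] fields = row.split(",", -1);
        int idx = columns.indexOf(column);
        if (idx < 0 || idx >= fields.length) return ""; // missing -> blank
        return fields[idx];
    }
}
```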

Peter Lawrey
So I would have 3-4 blank columns at the end of each row?
TheLQ
I would suggest the parser assume that any fields it tries to read which are not present are treated as blank.
Peter Lawrey
A: 

If you are forced to use a flat file, why not develop your own format? You should be able to tweak overhead and customize as much as you want (which is good if you are parsing lots of data). Data entries will be either fixed or variable length; there are advantages to forcing some entries to a fixed length, but you will need a method of delimiting both. If you have different "types" of rows, write all the rows of each type in a chunk. Each chunk of rows will have a header: one header to describe the type of the chunk, and another to describe the columns and their sizes. Decide how you will use the headers to describe each chunk.

eg (H is header, C is column descriptions and D is data entry):

H Phone Numbers
C num(10) type
D 1234567890 Home
D 2223334444 Cell

H Addresses
C house(5) street postal(6) province
D 1234_ "some street" N1G5K6 Ontario
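A sketch of a reader for a format like the one above: H starts a chunk, D lines carry data, and the C column descriptions are skipped in this simplified version. Everything here (class name, splitting on whitespace-stripped line tags) is my own illustration, not part of the answer's proposal.

```java
import java.util.*;

// Parses the H/C/D chunk format sketched above: H names a chunk, D lines
// are data rows; C (column descriptions) is left unparsed in this sketch.
public class ChunkFile {
    public static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> chunks = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            char tag = line.charAt(0);
            String rest = line.substring(1).trim();
            if (tag == 'H') {
                current = new ArrayList<>();
                chunks.put(rest, current);
            } else if (tag == 'D' && current != null) {
                current.add(rest);
            }
        }
        return chunks;
    }
}
```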
Eric Coutu
That's doable, I guess, but I was looking for a standard storage format
TheLQ
A: 

I'd say that if you want to store rows and columns, you've got to use a DB. The reason is simple: modifying the structure with any approach except an RDBMS will require significant effort, and you mentioned that you want to change the structure in the future.

Eugene Mayevski 'EldoS Corp