views:

100

answers:

8

I need to store easily parsable data in a file as an alternative to a database-backed solution (not up for debate). Since it's going to be storing lots of data, a lightweight syntax would be preferable. It does not necessarily need to be human readable, but it should be parsable. Note that there are going to be multiple types of fields/columns, some of which might be used and some of which won't.

From my limited experience without a database, I see several options, all with issues:

  • CSV - I could technically do this, and it is very light. However, parsing would be an issue, and it would be painful if I wanted to add a column. Multi-language support is iffy, mostly people's own custom parsers.
  • XML - This is the perfect solution on many fronts except for parsing and overhead. That's a lot of tags, it would generate a giant file, and parsing would be very resource-consuming. However, virtually every language supports XML.
  • JSON - This is the middle ground, but I don't really want to use it, as its syntax is awkward and parsing is non-trivial. Language support is iffy.

So all have their disadvantages. But what would be the best when trying to aim for language support AND somewhat small file size?

+3  A: 

How about sqlite? This would allow you to basically embed the "DB" in your application, but not require a separate DB backend.

Also, if you end up using a DB backend later, it should be fairly easy to switch over.

If that's not suitable, I'd suggest one of the DBM-like stores for key-value lookups, such as Berkeley DB or tdb.

Jeremy Kerr
SQLite is an option, but I really wanted flat-file storage, not just a DB in a file
TheLQ
+1  A: 

If you're just using the basics of all these formats, all of the parsers are trivial. If CSV is an option, then for XML and JSON you're talking blocks of name/value pairs, so there's not even a recursive structure involved. json.org has support for pretty much any language.

That said.

I don't see what the problem is with CSV. If people write bad parsers, too bad. If you're concerned about compatibility, adopt the default CSV dialect from Excel. Anyone who can't parse CSV from Excel isn't going to get far in this world. The weakest support you'll find in CSV is for embedded newlines and carriage returns. If your data doesn't have those, then it's not a problem. The only other issue is embedded quotation marks, and those are escaped in CSV. If you don't have those either, then it's even more trivial.
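To make the "CSV parsing is trivial" point concrete, here is a minimal sketch of an Excel-style field splitter, handling the two cases mentioned above: commas inside quoted fields, and embedded quotes escaped by doubling (`""`). The class and method names are my own; a real project would more likely use an existing CSV library.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal Excel-dialect CSV line parser: fields containing commas or quotes
// are wrapped in double quotes, and an embedded quote is written as "".
public class CsvLine {
    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field is a literal quote
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

Note this deliberately ignores embedded newlines, per the assumption above that the data doesn't contain them.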

As for "adding a column", you have that problem with all of these. If you add a column, you get to rewrite the entire file. I don't see this being a big issue either.

If space is your concern, CSV is the most compact, followed by JSON, followed by XML. None of the resulting files can be easily updated. They pretty much all would need to be rewritten for any change in the data. CSV has the advantage that it's easily appended to, as there's no closing element (like JSON and XML).

Will Hartung
If I used XML, I could add an element just by adding a new tag. I forgot, though, that CSV is importable into Excel spreadsheets.
TheLQ
Then your XML is not conforming. In theory, an XML file is one single element with potentially a zillion children of that root. I'm not saying it can't be done, and others do it; I'm just saying that what you end up with is a file with several XML elements, rather than a conforming XML document.
Will Hartung
<serverline><type>Mode</type><mode>T</mode></serverline>
TheLQ
You miss the point. If you have a file that looks like <serverline>...</serverline> \n <serverline>...</serverline>, you may have individual lines that happen to look like XML, but the file is not a conforming, well-formed XML document. A conforming XML document has a single root element. <html>....</html>, for example. Specifically, don't expect to take the log file and be able to read the entire thing with an off-the-shelf XML reader; it will likely read only the first line and then stop.
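One common workaround for this single-root-element problem, sketched below with the JDK's built-in parser: wrap the whole file's contents in a synthetic root element before parsing. The `<log>` wrapper name and the class are my own invention; this assumes each line is itself a well-formed fragment.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;

// Reads a file of one-XML-fragment-per-line log entries by wrapping the
// whole thing in a synthetic root, so an off-the-shelf parser accepts it.
public class LogReader {
    public static int countEntries(String fileContents) throws Exception {
        String wrapped = "<log>" + fileContents + "</log>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
        return doc.getDocumentElement()
                  .getElementsByTagName("serverline").getLength();
    }
}
```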
Will Hartung
Well after hearing this I think I'm going to use CSV, mainly because of its compactness and format. Thanks for the help
TheLQ
A: 

JSON is probably your best bet (it's lightish, fast to parse, and self-descriptive, so you can add your new columns as time goes by). You said parsable - do you mean using Java? There are JSON libraries for Java that take the pain out of most of the work. There are also various lightweight in-memory databases that can persist to a file (in case "not an option" means you don't want a big separate database).
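One way the "add columns as time goes by" property works in practice, sketched here with hand-rolled escaping (a library such as the one from json.org would normally do this): write one JSON object per line, so new records can be appended without rewriting the file, and each record can carry whatever fields it needs. The class and method names are illustrative only.

```java
import java.util.Map;

// Emits one flat JSON object per line ("JSON Lines" style): the file stays
// appendable, and new fields can be added record by record.
public class JsonRow {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static String toLine(Map<String, String> row) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
        }
        return sb.append("}").toString();
    }
}
```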

jowierun
A: 

If this is just for logging some data quickly to a file, I find tab-delimited files easier to parse than CSV, so if it's a flat text file you're looking for, I'd go with that (so long as you don't have tabs in the feed, of course). If you have fixed-size columns, you could use fixed-length fields. That is even quicker, because you can seek.
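The seek advantage of fixed-length fields is that row N starts at a computable byte offset, so no scanning is needed. A minimal sketch (single padded field per record, names and sizes chosen arbitrarily for illustration):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Fixed-length records: every row is RECORD_LEN bytes, so row N starts at
// byte N * RECORD_LEN and can be read with a single seek.
public class FixedFile {
    static final int FIELD_LEN = 16;             // one space-padded field
    static final int RECORD_LEN = FIELD_LEN + 1; // field + newline

    public static void append(File f, String value) throws Exception {
        String padded = String.format("%-" + FIELD_LEN + "s", value);
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.seek(raf.length());
            raf.write((padded + "\n").getBytes(StandardCharsets.US_ASCII));
        }
    }

    public static String read(File f, int row) throws Exception {
        byte[] buf = new byte[FIELD_LEN];
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek((long) row * RECORD_LEN);
            raf.readFully(buf);
        }
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }
}
```

The trade-off, as the comment below notes, is that any value longer than the fixed width simply doesn't fit.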

If it's unstructured data that might need some analysis, I'd go for JSON.

If it's structured data and you envision ever doing any querying on it... I'd go with sqlite.

WOPR
Tab-delimited seems horrible when you add a line that's one character longer than the rest of the column. And there might be tabs in the data.
TheLQ
A: 

When I needed a solution like this, I wrote up a simple representation of data prefixed with its length. For example, "Hi" would be represented (in hex) as 02 48 69.
To form rows, just nest this operation (first a number giving the number of fields, then the fields themselves). For example, if field 0 contains "Hi" and field 1 contains "abc", it would be:

Num of fields   Field Length   Data    Field Length   Data
02              02             48 69   03             61 62 63

You can also use first row as names for the columns. (I have to say this is kind of a DB backend).
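A sketch of this length-prefixed scheme with the JDK's data streams, matching the hex example above (single-byte lengths, so each field is assumed to be under 256 bytes; class and method names are illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Length-prefixed rows: one byte for the field count, then for each field
// one byte for its length followed by its raw bytes (as in the hex example).
public class LenPrefixed {
    public static byte[] writeRow(String... fields) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(fields.length);
        for (String f : fields) {
            byte[] b = f.getBytes(StandardCharsets.UTF_8);
            out.writeByte(b.length);   // assumes each field is < 256 bytes
            out.write(b);
        }
        return bos.toByteArray();
    }

    public static String[] readRow(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String[] fields = new String[in.readUnsignedByte()];
        for (int i = 0; i < fields.length; i++) {
            byte[] b = new byte[in.readUnsignedByte()];
            in.readFully(b);
            fields[i] = new String(b, StandardCharsets.UTF_8);
        }
        return fields;
    }
}
```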

Dani
A: 

You can use CSV, and if you only add columns at the end, this is simple to handle: if a row has fewer columns than you expect, use the default value for the "missing" fields.

If you want to be able to change the order/use of fields, you can add a heading row. i.e. the first row has the names of the columns. This can be useful when you are trying to read the data.
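A minimal sketch of both ideas together, assuming unquoted fields for brevity: the first line names the columns, fields are looked up by name, and a row written before a column existed simply reports blank for it. The class name is illustrative.

```java
import java.util.Arrays;
import java.util.List;

// CSV with a header row: fields are looked up by column name, and columns
// that an older, shorter row doesn't have come back as blank defaults.
// Assumes simple unquoted fields for brevity.
public class HeaderedCsv {
    private final List<String> columns;

    public HeaderedCsv(String headerLine) {
        columns = Arrays.asList(headerLine.split(","));
    }

    public String get(String row, String column) {
        String[] fields = row.split(",", -1);
        int idx = columns.indexOf(column);
        if (idx < 0 || idx >= fields.length) return ""; // missing -> blank
        return fields[idx];
    }
}
```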

Peter Lawrey
So I would have 3-4 blank columns at the end of each row?
TheLQ
I would suggest the parser assume that any fields it tries to read which are not present are treated as blank.
Peter Lawrey
A: 

If you are forced to use a flat file, why not develop your own format? You should be able to tweak overhead and customize as much as you want (which is good if you are parsing lots of data). Data entries will be either fixed or variable length; there are advantages to forcing some entries to a fixed length, but you will need a method of delimiting both. If you have different "types" of rows, write all the rows of each type in a chunk. Each chunk of rows will have a header: one header to describe the type of the chunk, and another to describe the columns and their sizes. Decide how you will use the headers to describe each chunk.

eg (H is header, C is column descriptions and D is data entry):

H Phone Numbers
C num(10) type
D 1234567890 Home
D 2223334444 Cell

H Addresses
C house(5) street postal(6) province
D 1234_ "some street" N1G5K6 Ontario
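A sketch of a reader for a format like the one above: H starts a chunk, D lines carry data, and the C column descriptions are skipped in this simplified version. Everything here (class name, splitting on whitespace-stripped line tags) is my own illustration, not part of the answer's proposal.

```java
import java.util.*;

// Parses the H/C/D chunk format sketched above: H names a chunk, D lines
// are data rows; C (column descriptions) is left unparsed in this sketch.
public class ChunkFile {
    public static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> chunks = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            char tag = line.charAt(0);
            String rest = line.substring(1).trim();
            if (tag == 'H') {
                current = new ArrayList<>();
                chunks.put(rest, current);
            } else if (tag == 'D' && current != null) {
                current.add(rest);
            }
        }
        return chunks;
    }
}
```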
Eric Coutu
That's doable, I guess, but I was looking for a standard storage format
TheLQ
A: 

I'd say that if you want to store rows and columns, you've got to use a DB. The reason is simple: modifying the structure with any approach except an RDBMS will require significant effort, and you mentioned that you want to change the structure in the future.

Eugene Mayevski 'EldoS Corp