ansaurus

Question

UTF-8 tuple storage using lowest common technological denominator, append-only

Answer 1

+1 A:

Mostly thinking out loud here...

Really low tech would be to use (for example) null bytes as separators, and just "quote" all null bytes appearing in the output with an additional null.

Perhaps one could use SCSU along with that.

Or it might be worth to look at the gzip format, and maybe ape it, if not using it:

A gzip file consists of a series of "members" (compressed data sets).

[...]

The members simply appear one after another in the file, with no additional information before, between, or after them.

Each of these members can have an optional "filename", comment, or the like, and i believe you can just keep appending members.

Or you could use bencode, used in torrent-files. Or BSON.

See also Wikipedia's Comparison of data serialization formats.

Otherwise i think your idea of preceding each string with its length is probably the simplest one.

Christoffer Hammarström 2010-06-30 16:03:40

Answer 2

+2 A:

I would recommend tab delimiting each field and carriage-return delimiting each record.

Within each string, Replace all characters that would affect the field and record interpretation and rendering. This would include control characters (U+0000–U+001F, U+007F–U+009F), non-graphical line and paragraph separators (U+2028, U=2029), directional control characters (U+202A–U+202E), and the byte order mark (U+FEFF).

They should be replaced with escape sequences of constant length. The escape sequences should begin with a rare (for your application) character. The escape character itself should also be escaped.

This would allow you to append new records easily. It has the additional advantage of being able to load the file for visual inspection and modification into any spreadsheet or word processing program, which could be useful for debugging purposes.

This would also be easy to code, since the file will be a valid UTF-8 document, so standard text reading and writing routines may be used. This also allows you to convert easily to UTF-16BE or UTF-16LE if desired, without complications.

Example:

U+0009 CHARACTER TABULATION becomes ~TB
U+000A LINE FEED            becomes ~LF
U+000D CARRIAGE RETURN      becomes ~CR
U+007E TILDE                becomes ~~~
etc.

There are a couple of reasons why tabs would be better than commas as field delimiters. Commas appear more commonly within normal text strings (such as English text), and would have to be replaced more frequently. And spreadsheet programs (such as Microsoft Excel) tend to handle tab-delimited files much more naturally.

Jeffrey L Whitledge 2010-06-30 16:56:10

Could probably use url-encoding then, as that is already standard.

Christoffer Hammarström 2010-06-30 19:04:06

@Christoffer Hammarström - URL encoding also converts spaces and (I’m guessing) quotes, which I would prefer to keep intact, but since it’s likely to be available in the language libraries, using that turns this whole project into just plugging a few pieces together, so that’s a pretty good idea.

Jeffrey L Whitledge 2010-06-30 20:04:02

Yes, i agree, it wouldn't have to be full url encoding though, just for the problematic bytes. Otherwise the length of mostly non-ASCII-compatible strings would double or triple.

Christoffer Hammarström 2010-07-01 07:30:28

Or one could of course just use the good old escape sequences \\ \r \t \n \x00

Christoffer Hammarström 2010-07-01 07:47:34

ansaurus

tags:

views:

answers:

UTF-8 tuple storage using lowest common technological denominator, append-only

related questions