I've recently found out about protocol buffers and was wondering if they could be applied to my specific problem.

Basically, I have some CSV data that I need to convert to a more compact format for storage, as some of the files are several gigabytes in size.

Each field in the CSV has a header, and there are only two types: strings and decimals (sometimes there are a lot of significant digits, and I need to handle all numbers the same way). However, each file will have different column names for its fields.

As well as capturing the original CSV data, I need to be able to add extra information to the file before saving. I was also hoping to make this future-proof by handling different file versions.

So, is it possible to use protocol buffers to capture an arbitrary number of arbitrarily named columns of data, like a CSV file?

A: 

Well, protobuf-net (my version) is based on regular .NET types, so no (since it won't cope with different schemas all the time). But Jon's version might allow dynamic types. Personally, I'd just use CSV and run it through GZipStream - I expect that will be fine for the purpose.
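
To illustrate the "CSV through a gzip stream" suggestion, here is a minimal sketch. It uses Python's standard gzip and csv modules as a stand-in for .NET's GZipStream; the function names write_csv_gz and read_csv_gz are hypothetical, and values are kept as text so that decimals with many significant digits are not altered.

```python
import csv
import gzip


def write_csv_gz(path, header, rows):
    """Write a header and rows as gzip-compressed CSV text."""
    # newline="" lets the csv module control line endings itself.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv_gz(path):
    """Read the file back as (header, rows); every value is a string."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)
```

How much this saves depends on the data, so it is worth measuring on a representative file before committing to a format.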


Edit: actually, I forgot: protobuf-net does support extensible objects, but you need to be a bit careful... it would depend on the full context, I expect.

Plus Jon's approach of nested data would probably work too.

Marc Gravell
Sorry, not sure if I made it clear - I'm also adding extra data to the CSV, sometimes as extra columns and sometimes as header or footer data. I'd like this data to be version-proof. That's why I was thinking about other methods of storage.
Cameron MacFarland
+2  A: 

Well, it's certainly representable. Something like:

message CsvFile {
    repeated CsvHeader header = 1;
    repeated CsvRow row = 2;
}

message CsvHeader {
    required string name = 1;
    required ColumnType type = 2;
}

enum ColumnType {
    DECIMAL = 1;
    STRING = 2;
}

message CsvRow {
    repeated CsvValue value = 1;
}

// Note that the column is implicit based on position within row    
message CsvValue {
    optional string string_value = 1;
    optional Decimal decimal_value = 2;
}

message Decimal {
    // However you want to represent it (there are various options here)
}

I'm not sure how much benefit it will provide, mind you... You can certainly add more information (by adding fields to the CsvFile message), and future-proofing works the "normal PB way": only add optional fields, etc.
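
As a rough illustration of the positional row model and of one possible Decimal representation, here is a Python sketch. It does not use a protobuf library: the CsvValue dataclass only mirrors the message above, and the (unscaled, scale) pair is an assumption, just one of the "various options" and not a representation confirmed anywhere in this thread. The helper names decimal_to_parts, parts_to_decimal and encode_row are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional, Tuple


def decimal_to_parts(d: Decimal) -> Tuple[int, int]:
    """Split a finite Decimal into (unscaled, scale): d == unscaled / 10**scale.

    Assumed representation; the sign of negative zero is not preserved.
    """
    sign, digits, exponent = d.as_tuple()
    unscaled = int("".join(map(str, digits)))
    if sign:
        unscaled = -unscaled
    if exponent > 0:
        # Fold a positive exponent into the integer so scale is never negative.
        unscaled *= 10 ** exponent
        exponent = 0
    return unscaled, -exponent


def parts_to_decimal(unscaled: int, scale: int) -> Decimal:
    """Rebuild the Decimal exactly, without rounding to context precision."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(c) for c in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


@dataclass
class CsvValue:
    # Mirrors the CsvValue message: exactly one of the two is set.
    string_value: Optional[str] = None
    decimal_value: Optional[Tuple[int, int]] = None


def encode_row(column_types: List[str], fields: List[str]) -> List[CsvValue]:
    """Encode one CSV row; the column is implied by position in the row."""
    row = []
    for col_type, text in zip(column_types, fields):
        if col_type == "DECIMAL":
            row.append(CsvValue(decimal_value=decimal_to_parts(Decimal(text))))
        else:
            row.append(CsvValue(string_value=text))
    return row
```

For example, encode_row(["STRING", "DECIMAL"], ["widget", "19.990"]) stores the second field as (19990, 3), which keeps the trailing zero that a binary floating-point value would lose.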

Jon Skeet
Yeah, reading about the encoding of PBs didn't fill me with hope, as my data is mainly dense numbers. Still, I'll give it a shot and see what happens.
Cameron MacFarland
If you're interested in System.Decimal representations in PB, that probably deserves a separate question - or a post on the PB discussion group. Marc and I have discussed this before (and might do more tonight - Marc?).
Jon Skeet
@Jon - quite probably ;-p
Marc Gravell