I'm currently using JSON (compressed via gzip) in my Java project, in which I need to store a large number of objects (hundreds of millions) on disk. I have one JSON object per line, and disallow linebreaks within the JSON object. This way I can stream the data off disk line-by-line without having to read the entire file at once.
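For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the actual project code; the class name `GzipLineStream`, its method names, and the sample records are made up for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; names are illustrative only.
class GzipLineStream {

    // Write one record per line into a gzip-compressed file.
    static void writeLines(Path file, List<String> records) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String record : records) {
                out.write(record);
                out.write('\n');
            }
        }
    }

    // Decompress on the fly and pass each line to the handler.
    // Only the current line is held in memory. Returns the number of lines read.
    static long forEachLine(Path file, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                handler.accept(line); // the JSON parsing step would go here
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".jsonl.gz");
        writeLines(file, Arrays.asList("{\"id\":1}", "{\"id\":2}"));
        long n = forEachLine(file, System.out::println);
        System.out.println(n + " records read");
        Files.delete(file);
    }
}
```

The handler is where each line would be handed to the JSON parser, which is the step that dominates the running time.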

It turns out that parsing the JSON (using http://www.json.org/java/) costs more than either reading the raw data off disk or decompressing it (which I do on the fly).

Ideally what I'd like is a strongly-typed serialization format, where I can specify "this object field is a list of strings" (for example), and because the system knows what to expect, it can deserialize it quickly. I can also specify the format just by giving someone else its "type".
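To make the idea concrete, here is a hand-rolled sketch of what "the system knows what to expect" could look like, using only the JDK's `DataOutputStream` and `DataInputStream`. The schema (a `long` id plus a list of strings) and the names `TypedRecordDemo` and `Item` are assumptions made up for this example. Note that `writeUTF` uses a Java-specific encoding, so this illustrates why a fixed schema is cheap to decode but is not itself the cross-platform format being asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example; class and field names are illustrative only.
class TypedRecordDemo {

    // Assumed schema: a long id followed by a list of strings.
    static final class Item {
        final long id;
        final List<String> tags;

        Item(long id, List<String> tags) {
            this.id = id;
            this.tags = tags;
        }
    }

    // Fields are written in a fixed order, so no field names or
    // delimiters need to be stored or scanned for.
    static void write(DataOutputStream out, Item item) throws IOException {
        out.writeLong(item.id);
        out.writeInt(item.tags.size());
        for (String tag : item.tags) {
            out.writeUTF(tag);
        }
    }

    // Reads the next record, or returns null if the stream ends before a
    // complete id can be read (this sketch treats a truncated id as end of stream).
    static Item read(DataInputStream in) throws IOException {
        long id;
        try {
            id = in.readLong();
        } catch (EOFException e) {
            return null;
        }
        int n = in.readInt();
        List<String> tags = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            tags.add(in.readUTF());
        }
        return new Item(id, tags);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        write(out, new Item(1L, Arrays.asList("red", "green")));
        write(out, new Item(2L, Arrays.asList("blue")));
        out.flush();

        // Records are read back one at a time, so the whole stream never
        // has to be materialized as objects at once.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        Item item;
        while ((item = read(in)) != null) {
            System.out.println(item.id + " " + item.tags);
        }
    }
}
```

The reader never has to guess types or look for quotes and braces: it reads exactly the fields the schema promises, in order, which is where the speed-up over text parsing would come from.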

It would also need to be cross-platform. I use Java, but work with people using PHP, Python, and other languages.

So, to recap, it should be:

  • Strongly typed
  • Streamable (i.e. read a file bit by bit without having to load it all into RAM at once)
  • Cross platform (including Java and PHP)
  • Fast
  • Free (as in speech)

Any pointers?

+8  A: 

Have you looked at Google Protocol Buffers?

http://code.google.com/apis/protocolbuffers/

They're cross-platform (C++, Java, Python), with third-party bindings for PHP as well. They're fast, fairly compact, and strongly typed.

There's also a useful comparison between various formats here:

http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

You might want to consider Thrift or one of the others mentioned here as well.

Jon
...and, there's Google backing it.
Isaac Waller
+2  A: 

You could take a look at YAML: http://www.yaml.org/

It's a superset of JSON so the data file structure will be familiar to you. It supports some additional data types as well as the ability to use references that include a part of one data structure into another.

I don't have any idea whether it will be "fast enough", but the libyaml parser (written in C) seems pretty snappy.

Sharpie
While YAML is in no way a superset of JSON, I agree that it is one of the most readable/compact/typed formats I know.
gizmo
YAML is way more complex than JSON. I think most implementations are slower.
troelskn
AFAIK, yes, implementations are not very performant. YAML is geared towards somewhat different goals, maximum expressiveness and so on, not speed or simplicity.
StaxMan
+3  A: 

I've had very good results parsing JSON with Jackson.

Jackson is a:

  • Streaming (reading, writing)
  • FAST (measured to be faster than any other Java JSON parser and data binder)
  • Powerful (full data binding for common JDK classes as well as any Java bean class, Collection, Map or Enum)
  • Zero-dependency (does not rely on other packages beyond JDK)
  • Open Source (LGPL or AL)
  • Fully conformant

JSON processor (JSON parser + JSON generator) written in Java. Beyond basic JSON reading/writing (parsing, generating), it also offers full node-based Tree Model, as well as full OJM (Object/Json Mapper) data binding functionality.

Its performance is very good when compared to many other serialisation options.

Robert Munteanu
Use Jackson before trying anything else. The code on json.org isn't suitable for production use.
Kevin Peterson
A: 

I'd go with AMF.

Erik