I'm having a hard time choosing the format my server and my endpoints will use to communicate.
I am considering:

  • JSON
  • YAML (ruled out: too hard to parse)
  • CSV
  • Google Protobufs
  • Binary packing/unpacking (with no use of casting/memset/memcpy, to keep it portable; see the sketch after this list)
  • Some form of DSL
  • Any other suggestion you might have
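
To illustrate the binary packing option, here is a rough sketch of the kind of cast-free pack/unpack I have in mind. The 16-bit id / 32-bit value message is made up for the example; byte shifts pin down the wire byte order, so no casting, memset, or memcpy is needed:

#include <stdint.h>
#include <stddef.h>

/* Pack a hypothetical message: 16-bit id, 32-bit value, big-endian wire order. */
size_t pack_message(uint8_t *buf, uint16_t id, uint32_t value)
{
    buf[0] = (uint8_t)(id >> 8);
    buf[1] = (uint8_t)(id & 0xFF);
    buf[2] = (uint8_t)(value >> 24);
    buf[3] = (uint8_t)(value >> 16);
    buf[4] = (uint8_t)(value >> 8);
    buf[5] = (uint8_t)(value & 0xFF);
    return 6;                          /* bytes written */
}

/* Unpack the same message; shifts make this independent of host endianness. */
size_t unpack_message(const uint8_t *buf, uint16_t *id, uint32_t *value)
{
    *id    = (uint16_t)(((uint16_t)buf[0] << 8) | buf[1]);
    *value = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16)
           | ((uint32_t)buf[4] << 8)  |  (uint32_t)buf[5];
    return 6;                          /* bytes consumed */
}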

My criteria, ordered from most important to least:

  1. Which is the easiest to parse?
  2. Which is the fastest to parse?
  3. Which has the smallest message size in bytes?
  4. Which has the potential to have the most readable messages?
  5. Which has the potential to be encrypted more easily?
  6. Which has the potential to be compressed more easily?

EDIT to clarify:

  • Are the data transfers bi-directional? Yes.
  • What is the physical transport? Ethernet.
  • Is the data formatted as packets or streams? Both but usually packets.
  • How much RAM do the end-points have? As little as possible; it depends on the format I choose.
  • How big are your data? As big as it needs to be. I won't receive huge datasets though.
  • Does the end-point have an RTOS? No.
+2  A: 

Usually in these cases it pays to customize the data format for the device. For example, depending on the restrictions you face in terms of network bandwidth or storage size, you can go for streaming compression or prefer full compression. The type of data you want to store is also a big factor.

If your biggest problem really is ease of parsing, you should go for XML, but on an embedded device ease of parsing is usually much less of a concern than transfer speed, storage size, and CPU consumption. JSON and YAML, much like XML, are focused on parsing ease first and foremost. Protobuf might squeeze in there; binary packing is what people usually do. Encryption and compression you should rather do at the transport level, although functionally you should aim to put as little information as possible in a message.

I know I'm not giving you a clear-cut answer, but I think there is no such thing for such a generic question.

iwein
However, I'm thinking about OEMs that might come along, and in such cases I would prefer a readable format over a binary format. Should I just create a tool that converts the binary format to a readable format? How inefficient is JSON or YAML parsing compared to unpacking binary buffers?
the_drow
+1  A: 

I'm in the middle of doing a similar thing, reading data off an SD card to an embedded processor. I have to weigh the compactness and ease of translating the data on the card against the ability of our subsidiaries, and potentially customers, to read the data.

Conversion tools may give you the best compromise if the data isn't being read by humans very often, but if you need to provide conversion tools then this will mean a lot of extra support (what if they don't work on the latest version of Windows, Linux, etc.?).

For my situation CSV is proving a reasonable compromise, due to the number of easily available CSV editors around (like Excel) and only having to provide documentation on how to produce/edit the CSV files. CSV not being a fully defined standard is a pain, but RFC 4180 is a good CSV "standard" to aim for:

http://tools.ietf.org/html/rfc4180

As another answer said, I can't give you a clear-cut answer, but as you have identified it will be a compromise between maintainability of the system by everyone involved, and the speed and size of the embedded solution (i.e. it working!).

Good luck!

fluffyben
What about JSON/YAML as a human readable format?
the_drow
They appear quite human readable, but I can't really offer much of an opinion regarding these as I've never used JSON or YAML.
fluffyben
+2  A: 

CSV is going to meet your desires before an XML-based solution would. It is very easy to parse: one to two dozen lines of code (see the sketch below). On top of that you add documentation of what the terms/fields mean, which you would need for any solution anyway.

The overhead of CSV is very light, some commas and quotes, compared to an XML solution where you often find more XML tags and syntax than real meat/data; dozens to hundreds of bytes are often burned for single 8- or 32-bit values. Granted, CSV also has overhead if you consider that it takes three characters (bytes) to represent one 8-bit value (hexchar, hexchar, comma) compared to binary. Uncompressed, an XML solution with its bulk is going to consume considerably more transmission bandwidth and storage, on top of the bulky libraries used to create, parse, and possibly compress/decompress it.

CSV is going to be easier to read than binary, certainly, and often easier than XML, as XML is very verbose and you can't see all of the related data on one screen at one time. Everyone has access to a good spreadsheet tool (Gnumeric, OpenOffice, MS Office), so that makes CSV that much easier to read/use; the GUI is already there.
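
As a rough illustration of how small that parser can be, here is a minimal sketch that splits one line in place on commas. Quoting per RFC 4180 is deliberately ignored; add it if your fields can contain commas:

#include <stddef.h>

/* Split one CSV line in place; returns the number of fields found.
   No quote handling: fine for fixed, machine-written rows only. */
size_t csv_split(char *line, char *fields[], size_t max_fields)
{
    size_t n = 0;
    char *p = line;
    while (n < max_fields) {
        fields[n++] = p;             /* field starts here */
        while (*p && *p != ',')      /* scan to delimiter or end of line */
            p++;
        if (*p == '\0')
            break;
        *p++ = '\0';                 /* terminate field, step past comma */
    }
    return n;
}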

There is no generic answer though; you need to do your system engineering on this. You may very well want JSON/XML on the host or big-computer side and convert to some other format, like binary, for the transmission; then on the embedded side perhaps you do not need ASCII at all and have no need to waste the energy on it: take the binary data and just use it.

I also don't know your definition of embedded. I assume, since you are talking about ASCII formats, that this is not a resource-limited microcontroller but probably embedded Linux or another operating system. From a system-engineering perspective, what exactly does the embedded system need, and in what form? Up one level from that: what resources do you have and, as a result, in what form do you want to keep that data on the embedded system? Does the embedded system want to simply take preformatted binary and hand the bytes right on through to whatever peripheral that data was intended for? The embedded driver could be very dumb/simple/reliable in that case, and the bulk of the work and debugging is on the host side, where there are plenty of resources and horsepower to format the data. I would aim for minimal formatting and overhead; if I had to include a library to parse it, I would likely not use it. But I often work with resource-limited embedded systems without an operating system.
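
To make the pass-through idea concrete, here is a minimal sketch. The fixed 8-byte address/data record is just an example layout, and real code would validate the address before writing:

#include <stdint.h>

/* Read a 32-bit big-endian word from the wire. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Handle one 8-byte record: 32-bit register address, then 32-bit data.
   The host formats everything; the endpoint just performs the write.
   Real code must check the address against a whitelist of registers. */
void handle_record(const uint8_t rec[8])
{
    volatile uint32_t *reg = (volatile uint32_t *)(uintptr_t)get_be32(rec);
    *reg = get_be32(rec + 4);
}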

dwelch
The server already accepts JSON or YAML, which is why I wouldn't like to code the communication protocol twice. We're using an FPGA with a Xilinx MicroBlaze processor. It will be resource limited, but we can decide the limit. We need small endpoints, so we want to use the least hardware possible.
the_drow
That's the rub: coding twice isn't necessarily a big deal if we are talking about something simple, like 10-20 lines of code on each side; the pain of getting the same code working in the embedded environment could overshadow the new code in time and effort. Not sure what your FPGA is doing, but for example your embedded code and transfer interface could be as simple as address and data (for register or memory addresses within the FPGA). The embedded code is incredibly dumb and simple: all it does is pull the address and data off the bus and perform the write. The host code does all the rest of the work.
dwelch
@dwelch: The embedded device controls external hardware. All it has to do is act on the server's request and acknowledge with either success or an error code on failure.
the_drow
+2  A: 

The answer to your first question depends a lot on what you are trying to do. I gather from the tags attached to your question that your endpoints are embedded systems and your server is some type of PC. Parsing XML on a PC is easy, but on an embedded system it is a little more difficult. You also don't mention whether your communication is bi-directional. If the endpoints only pass data to the server, and not the other way around, XML might work well. If the server passes data to the endpoints, then CSV or a proprietary binary format would probably be easier to parse at the endpoint. Both CSV and XML are easily human-readable. Some questions that would help narrow it down:

  • Are the data transfers bi-directional?
  • What is the physical transport? (eg. RS-232, Ethernet, USB?)
  • Is the data formatted as packets or streams?
  • How much RAM do the end-points have? How big are your data?
  • Does the end-point have an RTOS?
mjh2007
See the edit to the question; I answered those there. I also deliberately didn't mention XML because it's too heavy. What about YAML/JSON/Protobufs/DSL?
the_drow
I'm not familiar with the other data formats. After looking them up they seem only slightly lighter than XML. I couldn't find any information on DSL.
mjh2007
@mjh2007: DSL = Domain Specific Language
the_drow
+2  A: 

First and foremost, see what kind of existing libraries you can find. Even if a format is difficult to parse, a pre-written library can make a format much more attractive. The easiest format to parse is the format that you already have a parser for.

Parsing speed is normally best with binary formats. One of the fastest methods is to use a "flat" binary format: you read in the buffer, cast a pointer to the buffer as a pointer to a data structure, and access the data in the buffer through that structure. No real "parsing" is needed, as you are transferring (essentially) a binary dump of a memory region.
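
A minimal sketch of that flat approach might look like this. The message layout and the packing pragma are examples; as the comments below point out, endianness and padding are the usual hazards:

#include <stdint.h>

/* Hypothetical message, packed so both ends agree on the layout.
   #pragma pack is compiler-specific (GCC/Clang/MSVC understand it). */
#pragma pack(push, 1)
typedef struct {
    uint16_t id;
    uint32_t value;
    uint8_t  flags;
} Message;
#pragma pack(pop)

/* "Parsing" a flat image: the buffer *is* the data structure.
   Both ends must also agree on endianness; see the comments below. */
const Message *parse_flat(const uint8_t *buf)
{
    return (const Message *)buf;
}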

To minimize payload size, create a custom binary format that is tailored to your specific needs. That way, you can adjust the various design tradeoffs to your biggest advantage.

"Readable" is subjective. Readable by whom? Plain-text formats like XML and CSV are easily readable by humans. Flat binary images are easily readable by machines.

Encryption routines typically treat the data to be encrypted as an opaque chunk of binary data (they don't attempt to interpret it at all), so encryption should apply equally well to data of any format.

Text-based formats (XML, CSV, etc) tend to be very compressible. Binary formats tend to be less compressible, but have fewer "wasted" bits to begin with.

In my experience, I have had the best results with the following:

  • CSV - Best when the data is in a predictable, consistent format. Also useful when communicating with a scripting language (where text-based I/O can be easier than binary I/O). Easily generated/interpreted by hand.
  • Flat binary - Best when you are transporting a data structure (POD) from one place to another. For best results, pack the structure to avoid problems with different compilers using different padding.
  • Custom format - Usually the best results since designing a custom format lets you balance flexibility, overhead, and readability. Unfortunately, designing a custom format from scratch can end up being a lot more work than it seems.
bta
Of course if you design your own format, *document it thoroughly*. If you are expecting other people to be using the code, provide some sample code that includes a simple parser and generator.
bta
Re "binary dump of a memory region"--beware of endianness and alignment issues.
Craig McQueen
+1 Having existing libraries for parsing for all devices and users is handy, especially if it is to a defined standard then it is more likely to be easy to support and better documented.
fluffyben
+1 to Craig for pointing out endianness issues -- also watch out for struct padding when porting your code to other platforms.
tomlogic
@Craig, tomlogic: "Packing" your structures can avoid padding issues, but endianness can always be a problem. If you are planning on migrating binary memory dumps between platforms with different endianness, I would recommend defining a standard endianness for dump files. Each platform would then be responsible for converting (if needed) before writing the data out or after reading it in.
bta
A: 

From the YAML website:

Both JSON and YAML aim to be human-readable data interchange formats. However, JSON and YAML have different priorities. JSON's foremost design goal is simplicity and universality. Thus, JSON is trivial to generate and parse, at the cost of reduced human readability. It also uses a lowest common denominator information model, ensuring any JSON data can be easily processed by every modern programming environment.

In contrast, YAML's foremost design goals are human readability and support for serializing arbitrary native data structures. Thus, YAML allows for extremely readable files, but is more complex to generate and parse. In addition, YAML ventures beyond the lowest common denominator data types, requiring more complex processing when crossing between different programming environments.

So JSON is much better, since it's human readable and more efficient than YAML.

the_drow
+1  A: 

I recently designed my own serialization scheme for communication with mobile devices, only to have my internal release coincide with the public announcement of Google protobufs. That was a bit of a disappointment as Google's protocol was quite a bit better. I'd advise looking into it.

For instance, take a look at simple numbers. Parsing JSON, XML, or CSV all requires parsing ASCII numbers. ASCII gets you about 3.3 bits per byte; protobuf gets you 7. Parsing ASCII requires looking for delimiters and doing math; protobuf takes just bit fiddling.

Messages won't be directly readable with protobuf, of course. But a visualizer is quickly hacked together; the hard work is already done by Google.
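
As a taste of that bit fiddling, here is a minimal sketch of the unsigned base-128 "varint" encoding protobuf uses; real protobuf adds field tags, wire types, and zigzag encoding for signed values:

#include <stdint.h>
#include <stddef.h>

/* Encode an unsigned integer as a base-128 varint: seven payload bits
   per byte, top bit set on every byte except the last.
   Returns the number of bytes written (1 to 10 for a 64-bit value). */
size_t varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)((v & 0x7F) | 0x80);  /* more bytes follow */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;                        /* final byte */
    return n;
}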

MSalters
@MSalters: So basically it's just the same as binary packing/unpacking, but with a readable API? What exactly does protobuf do under the hood?
the_drow
Very much so, yes. Basically protobuf (the protocol) describes an efficient binary representation of basic data structures. In that respect it's similar to ASN.1. For instance, unsigned integers are basically represented as base-128, with the top bit indicating "more bytes to follow". Signed integers are the same, except that the sign bit is now the LSB - with variable-length encodings the notion of an MSB is fuzzy. Important details for efficiency, but hidden by the API.
MSalters
+1  A: 

Key factors are:

  • What capabilities do your clients have? (e.g. can you pick an XML parser off the shelf, without ruling out most of them for performance reasons? Can you compress the packets on the fly?)
  • What is the complexity of your data ("flat" or deeply structured?)
  • Do you need high-frequency updates? Partial updates?

In my experience:

A simple text protocol (which would categorize itself as a DSL) with an interface of

string RunCommand(string commandAndParams)
// e.g. RunCommand("version") returns "1.23"

makes many aspects easier: debugging, logging and tracing, extension of the protocol, etc. Having a simple terminal/console for the device is invaluable in tracking down problems, running tests, etc.
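
As a sketch of how little code the dispatch side needs, here is a minimal version in C; the command table and the error convention are made up:

#include <stddef.h>
#include <string.h>

typedef const char *(*cmd_fn)(const char *params);

static const char *cmd_version(const char *params)
{
    (void)params;
    return "1.23";                   /* matches the example above */
}

/* Command table: name -> handler. Extending the protocol is one new row. */
static const struct { const char *name; cmd_fn fn; } commands[] = {
    { "version", cmd_version },
};

const char *run_command(char *line)
{
    char *params = strchr(line, ' ');
    if (params)
        *params++ = '\0';            /* split command word from parameters */
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++)
        if (strcmp(line, commands[i].name) == 0)
            return commands[i].fn(params ? params : "");
    return "ERR unknown command";    /* made-up error convention */
}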

Let's discuss the limitations in detail, as a point of reference for the other formats:

  • The client needs to run a micro parser. That's not as complex as it might sound (the core of my "micro parser library" is 10 functions with about 200 lines of code total), but basic string processing should be possible
  • A badly written parser is a big attack surface. If the devices are critical/sensitive, or are expected to run in a hostile environment, implementation requires utmost care. (that's true for other protocols, too, but a quickly hacked text parser is easy to get wrong)
  • Overhead. Can be limited by a mixed text/binary protocol, or base64 (about 33% overhead, since three payload bytes become four encoded characters; closer to 37% with MIME line breaks).
  • Latency. With typical network latency you will not want many small commands issued; some way of batching requests and their returns helps.
  • Encoding. If you have to transfer strings that aren't representable in ASCII, and can't use something like UTF-8 for that on both ends, the advantage of a text-based protocol drops rapidly.

I'd use a binary protocol only if it is required by the device, if the device's processing capabilities are insanely low (say, USB controllers with 256 bytes of RAM), or if your bandwidth is severely limited. Most of the protocols I've worked with use binary, and it's a pain.

Google protobuf is an approach to making a binary protocol somewhat easier. A good choice if you can run the libraries on both ends and have enough freedom to define the format.

CSV is a way to pack a lot of data into an easily parsed format, so it's an extension of the text format. It's very limited in structure, though. I'd use it only if you know your data fits.

XML/YAML/... I'd use only if processing power isn't an issue, bandwidth either isn't an issue or you can compress on the fly, and the data has a very complex structure. JSON seems to be a little lighter on overhead and parser requirements, and might be a good compromise.

peterchen
I am wondering: if I need JSON without arrays, would that make parsing a lot easier? Great answer. This is the accepted one for me.
the_drow