I have an application which will store a series of (float) values in an XML file. There could be upwards of 100,000 values so I am interested in keeping the size down, but I also want files to be readily accessible by third parties.

There seem to be various methods open to me as far as encoding the data within the XML:

1.

<data>
  <value>12.34</value>
  <value>56.78</value>
  ...
  <value>90.12</value>
</data>

2.

<data>
  <value v="12.34"/>
  <value v="56.78"/>
  ...
  <value v="90.12"/>
</data> 

3.

<data>12.34
56.78
  ...
90.12
</data> 

4.

<data>12.34, 56.78, ... 90.12</data> 

and there are probably more variations as well.

I'm just curious to know the drawbacks (if any) to each of these approaches. Some may not be compliant for example.

+2  A: 

I don't think there's a "better" way of doing it. Read my comment above for alternatives. But if you're hooked on XML, then go with whatever works for you. I personally prefer something like this

<data>
   <item key="somekey1" value="somevalue1" />
   <item key="somekey2" value="somevalue2" />
   <item key="somekey3" value="somevalue3" />
</data>

Simply because it's nice and easy to read, and keeps the tags smaller.

EDIT:

Remember, the fewer characters there are in your XML, the smaller the file will be (again, that's why I suggest JSON), so if you can get it nice and tight, by all means do it.

<d>
   <i k="somekey1" v="somevalue1" />
   <i k="somekey2" v="somevalue2" />
   <i k="somekey3" v="somevalue3" />
</d>

EDIT:

Also, I know you didn't ask, but I thought I'd show you what JSON would look like

   [{ "key": "somekey1", "value": "somevalue1"},
    { "key": "somekey2", "value": "somevalue2"}]
rockinthesixstring
I do not like your second form. Size is a consideration, but given the choice between a smaller size and a document with descriptive, meaningful names, I'll sacrifice the disk space.
Anthony Pegram
I totally agree... I would never use that second one either... just giving an example of how to strip tags. I still prefer JSON over XML.
rockinthesixstring
If the aim is to represent a time series of samples (i.e. an array), surely <item key="somekey1" value="somevalue1" /> is superfluous when <item value="somevalue1" /> (or <i v="somevalue1"/>) would do. The "descriptive meaningful" part can be in the enclosing tag, like so: <ArrayOfVoltageSamples> <i v="12.34"/>..<i v="56.78"/></ArrayOfVoltageSamples>. I think I'll take the decrease in size and still have readability.
you mean "increase" in size and still have readability. Again.. I agree, was just showing the alternative. If a human is never going to read the actual file and it's just for the machine to read... then the tags are irrelevant. You say you have 100,000 values. What human is ever going to want to crack open the raw XML and read that... certainly not me :-P
rockinthesixstring
sorry... after re-reading your note it looks like you're saying that you WOULD prefer to use the smaller tags.
rockinthesixstring
just for fun...JSON... `[{ "v": "12.34" },{ "v": "56.78" }]`
rockinthesixstring
I do wonder how accepting people (i.e. your average Joe ThirdPartyCodeHack Bloggs) might be of JSON compared to XML. Also - are the quotes necessary around numeric values?
+2  A: 

The first two forms are preferable to the final two, with the first being the best. The latter two would require reading the contents of the data and splitting it before you could use it. The first two, however, allow you to enumerate over the data and use only the piece or pieces you need at any given time. However, the second form embeds the value in yet another layer via an attribute, which makes it less desirable than the first (provided there aren't other elements/attributes for each particular data point).
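
As a rough illustration of "enumerate" versus "read and split", here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The two documents are the question's forms 1 and 4, cut down to the three values actually shown; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Form 1: one element per value (sample values taken from the question).
form1 = "<data><value>12.34</value><value>56.78</value><value>90.12</value></data>"
# Form 4: a single comma-delimited text blob.
form4 = "<data>12.34, 56.78, 90.12</data>"

# Form 1: the parser hands back each value as its own node.
values1 = [float(v.text) for v in ET.fromstring(form1).iter("value")]

# Form 4: the parser returns one string; splitting it is the caller's job.
values4 = [float(s) for s in ET.fromstring(form4).text.split(",")]
```

Both lists end up holding the same numbers; the difference is who does the splitting.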

Anthony Pegram
I agree with the part about the latter two. Though you might be able to get the file size smaller, you have to make the server work harder to extract the content.
rockinthesixstring
Is there really that much difference between `<element>text</element>` and `<element tag=value />` ? On .NET it's the difference of .Text (or is it .Value) versus .Attribute("tag"), so yes a few less characters, but no difference in access method.
drachenstern
@drachenstern - Yes, I'm thinking about it from a .NET perspective, particularly LINQ, where I could access a value as Element.Value (or (float)Element), or Element.Attribute(somename).Value (... (float)Element.Attribute(somename)). It's a preference thing, but I'd sacrifice the disk space if I did not have to embed the data in yet another layer.
Anthony Pegram
+3  A: 

Semantically, there's no "difference" between 1 and 2. Similarly there's no difference between 3 and 4, save that one is delimited. Also note that whitespace may be collapsed or ignored by XML tooling, so if you read #3, it may well come back as "one long line" without any newlines separating the values.

As for which is better, it's up to your application, and how you plan on using the data.

The serialized version (with each number in its own element) gives the user "direct" access to the individual numbers.

Using the delimited "blob" requires the users to parse it themselves, so it depends on what kind of interface you're wishing to provide.

Also, the "blob" technique tends to prevent the XML from being "streamed", since you'll have one, enormous element, rather than a bunch of little elements. That can have a large memory impact.
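
To make the streaming point concrete, here is a small Python sketch with `xml.etree.ElementTree.iterparse`, assuming the one-element-per-value layout (form 1). The helper name `stream_values` and the two-value document are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

def stream_values(source):
    """Yield floats from <value> elements one at a time as they are parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "value":
            yield float(elem.text)
            elem.clear()  # drop the element's content once it has been used

doc = io.BytesIO(b"<data><value>12.34</value><value>56.78</value></data>")
total = sum(stream_values(doc))
```

With the blob forms (3 and 4) the same loop would see a single "end" event for `<data>`, and the whole text would have to be in memory before any number could be used.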

As for the overall file size, it may help to know that if you actually compress this data, the final, compressed sizes will likely be very close to each other, regardless of the technique. Dunno if that property is important or not.
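
If you want to check that for your own data, something like the following Python sketch would do. It is only an assumption-laden experiment: the values are synthetic random numbers (not real samples), gzip stands in for whatever compressor you would actually use, and the labels in `forms` just refer to the four layouts from the question.

```python
import gzip
import random

random.seed(0)  # fixed seed so the synthetic data is repeatable
values = [f"{random.uniform(0, 100):.2f}" for _ in range(10_000)]

# Build the same data in each of the question's four layouts.
forms = {
    "1 (elements)": "<data>" + "".join(f"<value>{v}</value>" for v in values) + "</data>",
    "2 (attributes)": "<data>" + "".join(f'<value v="{v}"/>' for v in values) + "</data>",
    "3 (newlines)": "<data>" + "\n".join(values) + "</data>",
    "4 (commas)": "<data>" + ", ".join(values) + "</data>",
}

# Record (raw size, gzip size) in bytes for each layout.
sizes = {}
for name, xml in forms.items():
    raw = xml.encode("utf-8")
    sizes[name] = (len(raw), len(gzip.compress(raw)))
    print(f"form {name}: raw={sizes[name][0]} gzip={sizes[name][1]}")
```

How close the compressed sizes come out will depend on the data and the compressor, so treat the printed numbers as a measurement for that one input rather than a general result.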

Will Hartung
A good point about losing direct access in the blob approach.
A: 

If the file will only ever hold those float values, do not use XML. Use a plain text file with one value per line. It'll be many times faster to read and write, and it won't be even a little less self-descriptive than the XML samples you wrote.
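
For comparison, a plain-text round trip might look like this in Python. It is just a sketch: the file name `values.txt` and the temporary directory are placeholders, and the three numbers are the ones shown in the question.

```python
import os
import tempfile

values = [12.34, 56.78, 90.12]  # sample values from the question

path = os.path.join(tempfile.mkdtemp(), "values.txt")  # hypothetical file name

# Write one value per line; repr() round-trips Python floats exactly.
with open(path, "w", encoding="ascii") as f:
    f.writelines(f"{v!r}\n" for v in values)

# Read them back, skipping any blank lines.
with open(path, encoding="ascii") as f:
    loaded = [float(line) for line in f if line.strip()]
```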

XML may be a requirement, for example if the file will be used from different applications/systems/users with different cultures (TR, EN, FR). Some write floats with '.' (12.34) while some write them with ',' (12,34). An XML parser will handle all that stuff for you. So, if XML is a requirement, the 3rd and 4th samples you wrote are totally missing the point of XML. In practice they're no different from using a plain text file, except that a slow XML parser is doing the work.

The 1st and 2nd samples you wrote have only a subtle difference in meaning / interpretation. The first one implies that the actual data you'd like to present is 12.34, and it's a 'value'. The second implies that there's a 'value', and the 'v' data associated with it is 12.34.
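
In code the two readings look like this (a Python sketch with `xml.etree.ElementTree`; the one-value documents and variable names are made up for the example):

```python
import xml.etree.ElementTree as ET

as_text = ET.fromstring("<data><value>12.34</value></data>")
as_attr = ET.fromstring('<data><value v="12.34"/></data>')

# Form 1: the number is the element's own content.
x1 = float(as_text.find("value").text)

# Form 2: the number is a named property of an otherwise empty element.
x2 = float(as_attr.find("value").get("v"))
```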

Gorkem Pacaci