I'm very confused by this wee little problem I have. I have a non-indexed file format header (more specifically, the ID3 header). This header stores a string, or rather three bytes, to confirm that the data is actually an ID3 tag (the string is TAG, by the way). The point is that this TAG in the file format is not null-terminated. So there are two things that can be done:

  • Load the entire file with fread and for non-terminated string comparison, use strncmp. But:
    1. This sounds hacky
    2. What if someone opens it up and tries to manipulate the string w/o prior knowledge of this?
  • The other option is to load the file into a C struct that doesn't map exactly onto the file format but includes proper null terminators, with each member loaded by its own call. But this too feels hacky, and it is tedious.
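For reference, the second option can be sketched roughly as below. This is only an illustration under assumptions: the names `copy_fixed_field`, `id3v1_unpack` and `struct id3v1_info` are made up for this sketch, and the offsets assume the usual ID3v1 layout (3-byte `TAG`, then 30-byte title, artist and album fields inside a 128-byte block).

```c
#include <stddef.h>
#include <string.h>

enum { ID3V1_SIZE = 128, ID3V1_TEXT = 30 };

/* In-memory form (hypothetical): each text field gets one extra byte
 * so it can always be NUL-terminated, unlike the on-disk layout. */
struct id3v1_info {
    char title[ID3V1_TEXT + 1];
    char artist[ID3V1_TEXT + 1];
    char album[ID3V1_TEXT + 1];
};

/* Copy a fixed-width, possibly unterminated field and terminate it.
 * dst must have room for width + 1 bytes. */
static void copy_fixed_field(char *dst, const unsigned char *src, size_t width)
{
    memcpy(dst, src, width);
    dst[width] = '\0';
}

/* Returns 0 on success, -1 if the block does not start with "TAG". */
static int id3v1_unpack(const unsigned char block[ID3V1_SIZE],
                        struct id3v1_info *out)
{
    if (memcmp(block, "TAG", 3) != 0)
        return -1;
    copy_fixed_field(out->title,  block + 3,  ID3V1_TEXT);
    copy_fixed_field(out->artist, block + 33, ID3V1_TEXT);
    copy_fixed_field(out->album,  block + 63, ID3V1_TEXT);
    return 0;
}
```

The tedium is confined to one small helper, and anyone who later manipulates the strings gets ordinary NUL-terminated C strings.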

Help, especially from people who have practical experience with dealing with such stuff, is appreciated.

+1  A: 

Keep three bytes and compare each byte with the characters 'T', 'A' and 'G'. This may not be very smart, but gets the job done well and more importantly correctly.

dirkgently
And for parsing specific, well-known file formats, use a library.
Sinan Ünür
I was talking about a more general thing -- what if it was an arbitrarily long string?
Aviral Dasgupta
When you write a file format parser, you usually work with well-known tags/metadata. Also, from what I understood, your code is about detection and not so much about parsing. So, the above approach suffices. In case there is an arbitrarily long string, the header will most likely have a `length` field so that you can `malloc` that much beforehand and read in the data.
dirkgently
@dirkgently So, as I see it, there is no way to avoid manually writing each field of the data?
Aviral Dasgupta
@aviraldg: Um, that depends on what you are trying to achieve. You *have* to read it in; whether you save the information will depend on your requirements.
dirkgently
+2  A: 

If you are just learning something, you can find the ID3v1 tag in an MP3 file by reading the last 128 bytes of the file and checking whether the first 3 characters of the block are TAG.
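A minimal sketch of that check, assuming the file was opened in binary mode (`"rb"`); the function name `has_id3v1_tag` is just a placeholder for this example:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the stream ends with a block starting with "TAG",
 * 0 if it does not, and -1 on an I/O error (including a file that is
 * shorter than 128 bytes, where the seek fails). */
static int has_id3v1_tag(FILE *fp)
{
    unsigned char block[128];

    if (fseek(fp, -128L, SEEK_END) != 0)
        return -1;
    if (fread(block, 1, sizeof block, fp) != sizeof block)
        return -1;

    /* Compare raw bytes; the tag on disk is not NUL-terminated. */
    return memcmp(block, "TAG", 3) == 0;
}
```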

For a real application, use TagLib.

Lukáš Lalinský
The question was not about a specific application -- it was more like what-should-be-the-appropriate-approach-for kinda thing...
Aviral Dasgupta
"what-should-be-the-appropriate-approach-for" always depends on the specific file format. In most formats you know where to look for identifiers, or at least have a clearly defined way to scan the file to find one. Once you have identified the block for which you know the structure, you read it one field at a time (and then it doesn't matter if it's fixed-size, null terminated, etc.). You never parse files with C structs.
Lukáš Lalinský
@Lukáš Lalinský:"You never parse files with C structs." -- This statement is a bit too strong. I suggest you take a look at zlib.
dirkgently
Well, I did take a look at zlib and I can't find code that maps a memory buffer read from a file to a plain struct. All I can find are pointer operations, memcpy and functions like get_byte/getLong/putLong. :)
Lukáš Lalinský
+3  A: 

The first thing to consider when parsing anything is: Are the lengths of these fields either fixed in size, or prefixed by counts (that are themselves fixed in size, for example, nearly every graphics file has a fixed size/structure header followed by a variable sized sequence of the pixels)? Or, does the format have completely variable length fields that are delimited somehow (for example, MPEG4 frames are delimited by the bytes 0x00, 0x00, 0x01)? Usually the answer to this question will go a long way toward telling you how to parse it.
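For the delimited case, the scanning usually looks something like the sketch below. The function name `find_start_code` is made up here, and it only searches for the three-byte pattern mentioned above; a real parser for any particular format would need more than this.

```c
#include <stddef.h>

/* Return the offset of the first 0x00 0x00 0x01 sequence in buf at or
 * after `from`, or len if there is none. */
static size_t find_start_code(const unsigned char *buf, size_t len, size_t from)
{
    for (size_t i = from; i + 3 <= len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
            return i;
    }
    return len;
}
```

Calling it repeatedly, each time starting just past the previous match, gives the boundaries of each delimited unit; with fixed-size or count-prefixed fields no scan is needed at all, because the offsets can be computed directly.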

dicroce
+2  A: 

If the file format specification says a certain three bytes have the values corresponding to 'T', 'A', 'G' (84, 65, 71), then you should compare just those three bytes.

For this example, strncmp() is OK. In general, memcmp() is better because it doesn't have to worry about string termination, so even if the byte stream (tag) you are comparing contains ASCII NUL '\0' characters, memcmp() will work.
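To make the difference concrete, here is a small sketch; the wrapper names `same_bytes` and `same_as_strings` are invented for illustration only:

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise comparison: looks at all n bytes, NULs included. */
static int same_bytes(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}

/* String comparison: stops at the first NUL, even if n is larger. */
static int same_as_strings(const char *a, const char *b, size_t n)
{
    return strncmp(a, b, n) == 0;
}
```

Given the 4-byte sequences `{'A','B','\0','C'}` and `{'A','B','\0','D'}`, `same_as_strings` reports a match because it stops at the embedded NUL, while `same_bytes` correctly reports a mismatch.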

You also need to recognize whether the file format you are working with is primarily printable data or whether it is primarily binary data. The techniques you use for printable data can be different from the techniques used for binary data; the techniques used for binary data sometimes (but not always) translate for use with printable data. One big difference is that the lengths of values in binary data are known in advance, either because the length is embedded in the file or because the structure of the file is known. With printable data, you are often dealing with variable-length encodings with implicit boundaries on the fields - and no length encoding information ahead of it.

For example, the Unix password file format is a text encoding with variable length fields; it uses a ':' to separate fields. You can't tell how long a field is until you come across the next ':' or the end of the line. This requires different handling from a binary format encoded using ASN.1 [1], where fields can have a type indicator value (usually a byte) and a length (can be 1, 2 or 4 bytes, depending on type) before the actual data for the field.

[1] ASN.1 is (justifiably) regarded as very complex; I've given a very simple example of roughly how it is used that can be criticized on many levels. Nevertheless, the basic idea is valid - length (and with ASN.1, usually type too) precedes the (binary) data. This is also known as TLV - type, length, value - encoding.
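A deliberately simplified TLV reader might look like the sketch below. It assumes a made-up layout of one type byte followed by one length byte; this is not real ASN.1 encoding, and the names `struct tlv` and `tlv_next` are hypothetical.

```c
#include <stddef.h>

struct tlv {
    unsigned char type;
    size_t length;
    const unsigned char *value;   /* points into the input buffer */
};

/* Parse one record starting at buf[*pos]. On success, fills *out,
 * advances *pos past the record and returns 0. Returns -1 and leaves
 * *pos unchanged if the record is truncated. */
static int tlv_next(const unsigned char *buf, size_t len, size_t *pos,
                    struct tlv *out)
{
    size_t n;

    if (*pos > len || len - *pos < 2)
        return -1;                /* no room for type + length bytes */

    n = buf[*pos + 1];
    if (len - *pos - 2 < n)
        return -1;                /* declared length overruns the buffer */

    out->type = buf[*pos];
    out->length = n;
    out->value = buf + *pos + 2;
    *pos += 2 + n;
    return 0;
}
```

Because the length arrives before the value, the reader never has to search for a terminator; it only has to check that the declared length fits in what remains of the buffer.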

Jonathan Leffler
A: 

And don't forget that the genre has two different meanings in ID3v1 and ID3v1.1.

Arabcoder