I have read a few links on the topic of file formats and encoding, but how is it done?

If all data is binary, what splits data into different file formats? What exactly does encoding the data involve? How is it done?

+1  A: 

All data is binary, including this web page you are viewing right now. It's the interpretation of the data that matters.

For instance, pretend you have four bytes:

0xaa 0x00 0x00 0x55

That could be (in no particular order):

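  1. The number 43520 followed by the number 85 (two 16-bit integers, most significant byte first)
  2. The decimal number 170 followed by 21760 (two 16-bit integers, least significant byte first)
  3. The decimal number 2852126805 (one 32-bit integer, most significant byte first)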
  4. Hundreds of other interpretations

And this is only the unsigned numbers. Any of those bytes or bits could be markers, order indicators, strings, position indicators, etc.
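As a rough illustration of the list above, here is a small Python sketch that decodes the same four bytes in several ways. The choice of Python and of the struct module is mine, not part of the answer; the format strings pick the byte order (">" big-endian, "<" little-endian) and the field size ("H" 16-bit, "I"/"i" 32-bit).

```python
import struct

data = bytes([0xAA, 0x00, 0x00, 0x55])

# Two unsigned 16-bit integers, big-endian (most significant byte first).
print(struct.unpack(">HH", data))   # (43520, 85)

# The same bytes as two unsigned 16-bit integers, little-endian.
print(struct.unpack("<HH", data))   # (170, 21760)

# One unsigned 32-bit integer, big-endian.
print(struct.unpack(">I", data))    # (2852126805,)

# One signed 32-bit integer, big-endian: same bits, different meaning.
print(struct.unpack(">i", data))    # (-1442840491,)
```

Nothing in the bytes themselves says which of these readings is "right"; that is what a format specification is for.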

Yann Ramin
I get that, but am asking about encoding/decoding.
alJaree
I think it answers "what splits data into different file formats?" nicely.
Russell
+3  A: 

As per theatrus' response, it's all a matter of interpretation.

Typically the file extension (.txt, .jpg, .pdf etc.) provides enough information to determine which program should handle the file - and then the program will know how to handle the format it's given (or produce this format when saving to that particular file type).

Each file format has a (hopefully!) well-defined structure. For example, a PDF file will always start with a line that reads "%PDF-x.y", where x.y is the version number (e.g. 1.6). This enables the likes of Acrobat to determine that this 'is most likely a PDF file' and to decide how to handle it (different versions will have different internal structures).
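A minimal sketch of that header check in Python, assuming only the "%PDF-x.y" first line described above. The function name pdf_version is made up for illustration, and a real PDF reader does far more than this.

```python
import re

def pdf_version(first_bytes):
    """Return the 'x.y' from a leading '%PDF-x.y' header, or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", first_bytes)
    return match.group(1).decode("ascii") if match else None

print(pdf_version(b"%PDF-1.6\nrest of the file"))  # 1.6
print(pdf_version(b"MZ\x90\x00"))                  # None
```

With a file on disk you would pass in the first few bytes, e.g. the result of open(path, "rb").read(16).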

.txt files are usually just sequences of 'characters' encoded in a particular way - plain English text is easily encoded in one byte per character, while languages with thousands of characters require richer encodings (typically UTF-8 or UTF-16, which are two different byte encodings of the Unicode character set; UTF-8 uses a single byte for ASCII characters and two to four bytes for everything else).
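To make "encoded in a particular way" a bit more concrete, here is a Python sketch showing the bytes that a few characters turn into under different encodings. The sample characters are my own picks for illustration.

```python
# Sample characters chosen for illustration: ASCII, accented Latin, and CJK.
for ch in ("A", "é", "日"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    print(ch, utf8.hex(), utf16.hex())

# Plain ASCII has no byte for "é", so encoding fails outright.
try:
    "é".encode("ascii")
except UnicodeEncodeError as err:
    print("ascii cannot encode:", err.object)
```

The same character ends up as different bytes (and a different number of bytes) depending on the encoding, which is why a reader has to know which one was used.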

Try opening up a few non-critical files in a hex-editor and get your hands on some format specifications and see what you can find!

Will A
+2  A: 

File formats describe data in a specific representation. For example, jpeg, bmp, png and tiff all describe images whereas html and rtf describe text documents.

A file format usually starts with a header that describes the contained data (image dimensions, compressed file name, etc.). Headers usually contain an identifying signature that marks the file as being of a specific type:

  • Windows executables start with 'MZ'
  • jpeg images start with the bytes FF D8, and JFIF-style ones have JFIF at byte offset 6
  • HTML documents have <html (upper or lower case) near the start of the document

This is the concept behind the unix file command and libmagic API.
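A toy version of that idea in Python, using only the three signatures listed above. This is a sketch, not the libmagic API: the function name sniff and the returned labels are invented here, and the 1024-byte window for "near the start" is an arbitrary assumption.

```python
def sniff(head):
    """Guess a file type from its first bytes (toy signature check)."""
    if head.startswith(b"MZ"):
        return "windows-executable"
    # JFIF files: FF D8 start-of-image, then "JFIF" at byte offset 6.
    if head.startswith(b"\xff\xd8") and head[6:10] == b"JFIF":
        return "jpeg"
    # "<html" in either case, somewhere near the start (window is arbitrary).
    if b"<html" in head[:1024].lower():
        return "html"
    return "unknown"

print(sniff(b"MZ\x90\x00"))
print(sniff(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"))
print(sniff(b"<!DOCTYPE html>\n<HTML lang='en'>"))
```

The real file command works from a large database of such rules rather than a few hard-coded checks.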

A text encoding defines how the bytes of a text map to characters. It is needed because programs historically used arrays of single bytes (char * in C/C++) to represent strings, and one byte per character is not enough to represent most human languages. The text encoding says, in effect, "this text is Simplified Chinese" or "this text is Cyrillic".

How text encodings are selected depends on the file format being used. Plain text formats (text, html, xml) can have a "byte-order-mark" at the beginning that identifies that text as UTF-32 (little endian or big endian), UTF-16 (little endian or big endian), or UTF-8. These are different representations of Unicode characters.
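Here is roughly how a byte-order-mark check could look in Python. The BOM constants come from the standard codecs module; the function name detect_bom and the return shape are my own assumptions.

```python
import codecs

# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 ones, because
# BOM_UTF32_LE (FF FE 00 00) begins with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no mark is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

encoding, skip = detect_bom(b"\xef\xbb\xbfhello")
print(encoding, skip)
```

A file without a mark tells you nothing this way, so the caller still needs a fallback (a declared encoding, a default, or a guess).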

XML allows you to specify the encoding in the <?xml?> declaration -- e.g. <?xml version="1.0" encoding="Shift_JIS"?>. HTML allows you to specify the encoding in a <meta> tag -- e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
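A rough sketch of pulling the declared encoding out of an XML declaration in Python. It works on raw bytes and assumes the declaration itself is in an ASCII-compatible encoding with no byte-order-mark in front of it; a real XML parser handles more cases. The function name is made up.

```python
import re

# Matches the encoding pseudo-attribute inside a leading <?xml ... ?>.
XML_DECL = re.compile(
    rb"<\?xml[^>]*?\bencoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']"
)

def declared_xml_encoding(head):
    """Return the encoding named in the XML declaration, or None."""
    match = XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None

print(declared_xml_encoding(b'<?xml version="1.0" encoding="Shift_JIS"?><doc/>'))
print(declared_xml_encoding(b"<doc/>"))
```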

You can see examples where text is encoded in one form but decoded as another (so the text is mangled) in some emails and web pages. These will look like â€¢ (which is a bullet character, a middle black dot, encoded in UTF-8 and then decoded as a Western single-byte encoding). You can see this in Firefox by going to the View > Character encoding menu and changing the encoding to Western (ISO-8859-1); the effect is especially visible on non-Western characters.
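You can reproduce that kind of mangling in a few lines of Python. One assumption to flag: I use the cp1252 (Windows-1252) codec for the "Western" decoding, since that is what browsers actually apply when a page is labelled ISO-8859-1.

```python
bullet = "\u2022"                   # the bullet character
raw = bullet.encode("utf-8")        # three bytes: E2 80 A2
mangled = raw.decode("cp1252")      # each byte read as its own character
print(raw.hex(), mangled)

# As long as no bytes were lost, reversing the wrong step repairs the text.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == bullet)
```

One character became three because the decoder treated every byte of the UTF-8 sequence as a separate character.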

You can also have other types of encoding. For example, email can be wrapped in base64 during transport.
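For example, base64 turns arbitrary bytes into plain ASCII letters and digits so they survive text-only transports. A quick Python sketch with the standard base64 module (the four sample bytes are arbitrary):

```python
import base64

payload = bytes([0xAA, 0x00, 0x00, 0x55])   # arbitrary binary data

wrapped = base64.b64encode(payload)          # ASCII-only representation
print(wrapped)                               # b'qgAAVQ=='

unwrapped = base64.b64decode(wrapped)        # back to the original bytes
print(unwrapped == payload)                  # True
```

This is a different layer from text encoding: it does not care what the bytes mean, only that they arrive intact.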

reece
+1  A: 

The main ways to decide what format something is are by file extension or by MIME type - and less frequently by "magic numbers". The file extension will be checked by an OS or Application to decide what to do with it (which app to run it in, or which part of code to execute for it).

MIME types are used where an extension (or filename) isn't always applicable. For example, when downloading a file over HTTP, the URI for a file might be something like ~.php?id=12973. The file type cannot be determined from this alone, but the HTTP response will carry a "Content-Type" header saying what format the file is, and the browser will handle it accordingly. E.g. Content-Type: image/png would make the browser pass the data to some PNG decoding function.

When the application knows what the file format is, it'll pass the data to code which is written specifically for that format. If the program doesn't have code to read a format, it will fail to read it.
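The dispatch described in the last two paragraphs might be sketched like this in Python. The header parsing uses the standard email.message module; the handler functions and the HANDLERS table are hypothetical stand-ins for real decoders.

```python
from email.message import Message

def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()

# Hypothetical handlers; a real program would call actual decoders here.
def handle_png(body, charset):
    return "would decode %d bytes as PNG" % len(body)

def handle_html(body, charset):
    return body.decode(charset or "utf-8")

HANDLERS = {"image/png": handle_png, "text/html": handle_html}

def dispatch(header_value, body):
    media_type, charset = parse_content_type(header_value)
    handler = HANDLERS.get(media_type)
    if handler is None:
        # No code for this format, so the program cannot read it.
        raise ValueError("unsupported format: " + media_type)
    return handler(body, charset)

print(dispatch("text/html; charset=utf-8", b"<p>hi</p>"))
```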

How a file is encoded is specific to its format. Most standard formats have a specification describing their binary encoding, and any application reading that file type must implement code to match the specification (although this is usually done by using a library which already does the reading for you).

To give an example of how binary encodings work, consider an image. The specification might say that bytes 10-13 hold the width of the image and bytes 14-17 hold the height. To read those pieces of information from the file, the code must explicitly read the correct amount of data at the locations indicated by the spec, e.g. fseek(f, 10, SEEK_SET); fread(&width, 4, 1, f); // read 4 bytes at offset 10 into "width".

I think your confusion is "what separates pieces of data in binary files?" (in text files this can be done by new lines, spaces, comma-separated values (CSV), etc.). The answer is: usually the size of the data determines where it ends. A specification will say what the binary type of each field is (it may say int32, indicating 32 bits/4 bytes) and, for multi-byte fields, the byte order.
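The width/height example can be sketched in Python as follows. The layout (width at bytes 10-13, height at bytes 14-17) is the made-up one from the paragraph above; that both fields are little-endian unsigned 32-bit integers is an extra assumption of mine, as is the function name.

```python
import io
import struct

def read_dimensions(f):
    """Read width and height from a file in the imaginary format."""
    f.seek(10)                        # jump to byte offset 10
    raw = f.read(8)                   # 4 bytes of width + 4 bytes of height
    width, height = struct.unpack("<II", raw)
    return width, height

# Build a fake in-memory "file": 10 filler bytes, then the two fields.
fake_file = io.BytesIO(bytes(10) + struct.pack("<II", 640, 480))
print(read_dimensions(fake_file))     # (640, 480)
```

There is no separator between the two numbers; the reader knows where the width ends only because the spec says it is 4 bytes long.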

Other than that, there can be ambiguities in file formats, but this usually happens with text files, where the text inside has to be read to determine the format. That isn't always possible, because often a text file will simply have the extension ".txt", so the application cannot know what the character encoding of the text is. (This was, and still is, a problem for applications which do not use Unicode.)
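When nothing declares the encoding, about all an application can do is guess. One crude approach, sketched in Python below, is to try a list of candidate encodings in order and keep the first one that decodes without error. The candidate list and the function name are my assumptions, and a successful decode does not prove the guess is right.

```python
def decode_guess(data, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; return (text, encoding used)."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never fails
    # (which also means it can silently produce garbage).
    return data.decode("latin-1"), "latin-1"

print(decode_guess("café".encode("utf-8")))
print(decode_guess(b"caf\xe9"))
```

Trying UTF-8 first is a common choice because random non-UTF-8 bytes rarely happen to form valid UTF-8 sequences.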

Mark H