



Hello everyone, can someone please point me to some good references on encoding and decoding in communication, and on the different encoding techniques (Unicode, base64, UTF-7, etc.)?

Thanks in advance, Rupesh

+1  A: 

Wikipedia is always a good start.

Then there's always Joel Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Note that the three things you name operate on different levels.

  • Unicode is a character set: a mapping between characters and numbers (code points).
  • UTF-7 is a character encoding: it maps between code points and bytes.
  • base64 maps between bytes and bytes. (It mangles bytes so that they are represented by bytes in the ASCII range.)
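
To make the three levels concrete, here is a minimal Python sketch. The example character "é" and the variable names are my own choices for illustration, and UTF-8 is shown next to UTF-7 only for comparison.

```python
import base64

char = "é"  # example character, chosen only for illustration

# Level 1: Unicode assigns the character a number (its code point).
code_point = ord(char)             # 233, i.e. U+00E9

# Level 2: an encoding such as UTF-8 or UTF-7 turns code points into bytes.
utf8_bytes = char.encode("utf-8")  # b'\xc3\xa9'
utf7_bytes = char.encode("utf-7")  # bytes that all stay in the ASCII range

# Level 3: base64 turns arbitrary bytes into bytes in the ASCII range.
b64_bytes = base64.b64encode(utf8_bytes)  # b'w6k='

print(f"U+{code_point:04X}", utf8_bytes, utf7_bytes, b64_bytes)
```

Each step can be reversed with the matching call (`bytes.decode`, `base64.b64decode`, `chr`).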
- Good reference link and short description of each concept. Thanks
Rupesh Chavan

Regarding your Unicode, base64 and UTF-7 (hardly anyone uses UTF-7, so you may mean UTF-8): these are not just "encoding & decoding" in general, but encoding and decoding of text data.

Unicode is the way all real and possible characters are enumerated. It says nothing about encoding itself. The UTF-xx family is a set of encodings of Unicode (converting code points to actual bytes). The most popular are UTF-8 and UTF-16. Very basically, UTF-8 is ASCII compatible (characters with codes < 128 are represented the same way as in ASCII), while other characters are represented by 2 to 4 bytes. UTF-16 encodes most characters as 2 bytes and the rest as 4 bytes.
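
A small Python sketch of those byte counts, if it helps. The helper name `encoded_lengths` and the sample characters are hypothetical, picked only to cover each length.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Sample characters: ASCII letter, accented letter, euro sign, an emoji.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The ASCII letter takes 1 byte in UTF-8 and the emoji takes 4 bytes in both encodings; the two characters in between take 2 bytes in UTF-16 but 2 and 3 bytes in UTF-8.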

Base64 has nothing to do with text data specifically. It encodes generic binary data as text made up of 64 printable ASCII characters. It is usually used to transfer binary data (including UTF-8 or UTF-16 encoded text) via email.
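
As a rough illustration in Python, using the standard `base64` module (the four sample bytes are arbitrary and are not valid text in any particular encoding):

```python
import base64
import string

# Arbitrary binary data, including bytes outside the ASCII range.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

encoded = base64.b64encode(raw)  # b'AP8QgA=='

# The output uses only the 64 alphabet characters plus '=' for padding.
alphabet = (string.ascii_letters + string.digits + "+/=").encode("ascii")
only_printable = all(b in alphabet for b in encoded)

# Decoding gives back exactly the original bytes.
decoded = base64.b64decode(encoded)

print(encoded, only_printable, decoded == raw)
```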

+1  A: 

The definitions of encoding and decoding are somewhat subjective.

Both are forms of transliteration, that is, the process of converting from one alphabet to another. ASCII to UTF-8, ASCII to base64, etc. are all examples of this.

What distinguishes the two is that "encoding" is often used when transliterating from a usable format to a transmission or intermediate format of some kind and decoding is the reverse. This is where the "subjective" bit comes in. ASCII to UTF8 can be viewed as encoding or decoding depending on the context.

Other formats like base64 are used almost universally for transmission only (e.g. binary data in email), and as such converting to them is almost universally called "encoding" and converting from them "decoding".

The important point to take away from all this is that something like ASCII or UTF8 is not magical in any way. All these formats are simply an agreed-upon encoding of information into a binary format. So ASCII 65 is 'A' for no other reason than that's the standard.

Unicode formats get more interesting because they make the distinction between the code point and the encoding. Unicode defines the code points for each character. The binary data is different for each encoding format. For example, see Unicode Character 'EURO-CURRENCY SIGN' (U+20A0) to see all the different binary values for one code point.
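
A short Python sketch of that idea, assuming the big-endian variants so that no byte order mark is included (the variable names are mine):

```python
code_point = 0x20A0     # EURO-CURRENCY SIGN
char = chr(code_point)  # the character itself, independent of any encoding

# The same code point produces different bytes under each encoding.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: char.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
```

This prints `e2 82 a0` for UTF-8, `20 a0` for UTF-16 and `00 00 20 a0` for UTF-32, and decoding any of them with the matching codec gives back the same character.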
