views:

61

answers:

3

My app needs to process text files during a batch process. Occasionally I receive a file with some special character at the end of the file. I am not sure what that special character is. Is there any way I can find out what that character is, so that I can tell the other team that is producing the file?

I have used Mozilla's library to guess the file encoding, and it says UTF-8.

+1  A: 

Any hex editor ought to allow you to see each individual byte in a file. This ought to allow you to tell them what character it is.

Here's one I've used in the past: http://www.hexworkshop.com/

Anderson Imes
+1  A: 

On Unix, you can use the od utility to output several representations of byte data in a file or stream.

Marcelo Cantos
+3  A: 

First, whether the character is really "special" depends on what you call a "special character". As a side note, on Unix and OS X you can use, for example, the od, file and hexdump commands to easily examine files:

... $  hexdump -C example.txt 
00000530  6f 77 73 20 61 63 74 69  6f 6e 2e 0a 0a 0a 0a     |ows action.....|

Now if you know your file encoding is UTF-8, it means that every byte whose highest bit is zero corresponds to exactly one character (in the example above, the last byte is '0a', so that '0a' byte corresponds to one "character", here a line feed).

A file in UTF-8 also means that every byte whose highest bit is set to 1 is part of a multi-byte character. For example, in the following byte sequence:

75 20 5b e2 80 a6 5d 20  61 75 74 6f 72 69 73 61

the only three bytes that have their highest bit set are "e2 80 a6" (all the values from 0x80 to 0xFF have their leftmost/highest bit set) and they're part of the same character: you cannot have a non-ASCII character in UTF-8 made of only one byte whose highest bit is set, and the lead byte e2 (binary 1110 0010) announces a three-byte sequence, hence you know that these three bytes are part of the same character. The fact that every UTF-8 byte whose leftmost/highest bit is set is part of a multi-byte character is IMHO a truly beautiful feature of UTF-8.
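The high-bit rule can be sketched in a few lines of Python. This is only an illustration: the helper name `split_utf8` is made up here, and the sketch assumes the input is already valid UTF-8 (it does no validation).

```python
def split_utf8(data):
    """Group bytes into per-character chunks, assuming valid UTF-8 input."""
    chunks = []
    for b in data:
        if b & 0xC0 == 0x80 and chunks:
            # 10xxxxxx is a continuation byte: it belongs to the previous chunk.
            chunks[-1] += bytes([b])
        else:
            # 0xxxxxxx (ASCII) or 11xxxxxx (lead byte) starts a new character.
            chunks.append(bytes([b]))
    return chunks


sample = bytes.fromhex("75 20 5b e2 80 a6 5d 20")
print([chunk.hex() for chunk in split_utf8(sample)])
# ['75', '20', '5b', 'e280a6', '5d', '20']
```

Every byte below 0x80 ends up in its own chunk, while e2 80 a6 stays together as one character.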

Now you Google "e2 80 a6" and you see that it's the Unicode character named "horizontal ellipsis" (code point U+2026, which UTF-8 encodes as the hexadecimal bytes e2 80 a6).
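If Python is at hand, the lookup can also be done locally instead of searching the web. A minimal sketch using only the standard library, with the bytes taken from the example above:

```python
import unicodedata

raw = bytes.fromhex("e2 80 a6")   # the three bytes from the hex dump
ch = raw.decode("utf-8")          # decode them as one UTF-8 character

# Print the code point and the official Unicode name.
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+2026 HORIZONTAL ELLIPSIS
```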

So basically you have to do two things:

  • find which bytes make up that last "special" character (is it just one byte or several bytes?)

  • find which "special character" this/these byte(s) correspond(s) to
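Those two steps could be automated roughly as follows. This is a sketch under stated assumptions, not a definitive tool: the function names `last_character` and `describe` are hypothetical, the file is assumed to be valid UTF-8, and `decode` will raise `UnicodeDecodeError` if the trailing bytes are not.

```python
import os
import unicodedata


def last_character(path, window=8):
    """Return (raw_bytes, text) for the final character of a UTF-8 file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # A UTF-8 character is at most 4 bytes, so a small tail is enough.
        f.seek(max(0, size - window))
        tail = f.read()
    if not tail:
        raise ValueError("file is empty")
    # Step 1: walk back over continuation bytes (10xxxxxx) to find
    # the byte that starts the last character.
    start = len(tail) - 1
    while start > 0 and tail[start] & 0xC0 == 0x80:
        start -= 1
    raw = tail[start:]
    # Step 2: decode those bytes to get the character itself.
    return raw, raw.decode("utf-8")


def describe(path):
    """Format the last character as 'hex-bytes U+XXXX NAME'."""
    raw, ch = last_character(path)
    # Control characters have no Unicode name, so supply a fallback.
    name = unicodedata.name(ch, "<no Unicode name, e.g. a control character>")
    return f"{raw.hex()} U+{ord(ch):04X} {name}"
```

Calling `describe("example.txt")` on a file that ends with the ellipsis from the example would return `e280a6 U+2026 HORIZONTAL ELLIPSIS`. Note that the last character of many text files is simply a line feed, so the interesting byte may sit just before it.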

Webinator
And now youngsters start to realize why the knowledge of bits and hexadecimal and the low-level character encodings "details" is a great skill to have ; )
Webinator
OK, I ran od -c filename.txt and the output is below. The character that is causing the issues seems to be ASCII SUB. "A substitute character (␚) is a control character that is used in the place of a character that is recognized to be invalid or in error or that cannot be represented on a given device." I am planning to ask the other team how they are generating this file and on which OS. Am I on the right track, or do you guys have any other suggestions? Regards

0001340 . 0 0 0 0 0 | 0 1 0 0 \n 032 \n
Pangea
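The `032` in that od output is an octal escape, and octal 032 is 0x1A, which is indeed ASCII SUB (Ctrl-Z). A small Python sketch to confirm the conversion and to flag affected files; both helper names are made up for illustration, and the check assumes, as in the dump above, that SUB may be followed by a trailing newline:

```python
SUB = 0x1A  # ASCII SUB, also known as Ctrl-Z


def od_octal_to_byte(escape):
    """Convert a three-digit octal escape as printed by `od -c` to an int."""
    return int(escape, 8)


def ends_with_sub(data):
    """True if the last byte, ignoring trailing newlines, is SUB."""
    return data.rstrip(b"\r\n").endswith(bytes([SUB]))


print(hex(od_octal_to_byte("032")))       # 0x1a
print(ends_with_sub(b"0100\n\x1a\n"))     # True
```

Ctrl-Z was historically used as an end-of-file marker by CP/M and DOS tools, so asking the other team about their OS and toolchain is a reasonable next step; nothing in the thread confirms that this is the actual cause.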