I am writing a Java text component and am trying to load part of a large text file, starting somewhere in the middle (for speed reasons).

My question: if the text is in a multi-byte encoding such as UTF-8, Big5, or GBK, how can I align the bytes so that I can decode the text correctly?

+2  A: 

I can't speak for the other encodings, but UTF-8 shouldn't be too hard.

Just look at the first byte of the chunk you grabbed and work out where you are from its value.

Taken from Wikipedia:

Binary              Hex    Decimal  Meaning
00000000-01111111   00-7F  0-127    US-ASCII (single byte)
10000000-10111111   80-BF  128-191  2nd, 3rd, or 4th byte of a multi-byte sequence
11000000-11000001   C0-C1  192-193  Start of a 2-byte sequence, but code point <= 127 (overlong, so not valid UTF-8)
11000010-11011111   C2-DF  194-223  Start of a 2-byte sequence
11100000-11101111   E0-EF  224-239  Start of a 3-byte sequence
11110000-11110100   F0-F4  240-244  Start of a 4-byte sequence

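If the byte is in the 2nd group, you know you missed part of a character: skip forward until you reach a byte outside that group (at most three bytes in valid UTF-8). If it's in the 1st, 4th, 5th, or 6th group, you know you are at the start of a character. A byte in the 3rd group (C0-C1) never appears in valid UTF-8, so if you see one, the data is probably corrupt or not UTF-8 at all. Proceed accordingly from there.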
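
A minimal sketch of that check in Java, assuming the chunk has already been read into a byte array. The class and method names (`Utf8Align`, `isContinuation`, `alignToCharStart`) are made up for illustration. It only skips continuation bytes of the form 10xxxxxx and does not validate anything else, so treat it as a starting point rather than a full decoder.

```java
final class Utf8Align {
    /** True for bytes of the form 10xxxxxx (0x80-0xBF), i.e. continuation bytes. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * Returns the index of the first byte at or after 'offset' that is not a
     * continuation byte, or buf.length if every remaining byte is one.
     */
    static int alignToCharStart(byte[] buf, int offset) {
        int i = offset;
        while (i < buf.length && isContinuation(buf[i])) {
            i++; // still inside a character that started before 'offset'
        }
        return i;
    }
}
```

Decoding can then start at the returned index, for example with `new String(buf, start, buf.length - start, StandardCharsets.UTF_8)`. The end of the chunk may cut a character in half in the same way, so the last few bytes probably need similar care.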

Jeremy Wall
A: 

In addition to Jeremy's comments on UTF-8: for encodings such as UTF-16, you could use some common-sense heuristics to decide whether you've got the right alignment. For example, if you're expecting mostly Latin characters plus the odd exotic one, and half of your characters come out above 255, you've probably got the wrong alignment...
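
One rough way to sketch that heuristic in Java is to decode the chunk at both possible byte offsets and keep whichever produces fewer characters above 255. This assumes you already know the byte order (for example `StandardCharsets.UTF_16BE`) and that the text is mostly Latin. The names `Utf16Align`, `fractionAboveLatin1`, and `guessAlignment` are hypothetical.

```java
import java.nio.charset.Charset;

final class Utf16Align {
    /** Fraction of decoded chars above 0xFF when decoding starts at 'offset'. */
    static double fractionAboveLatin1(byte[] buf, int offset, Charset cs) {
        int len = (buf.length - offset) & ~1; // use whole 2-byte units only
        if (len <= 0) {
            return 0.0;
        }
        String s = new String(buf, offset, len, cs);
        int high = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                high++;
            }
        }
        return (double) high / s.length();
    }

    /** Returns 0 or 1: the starting offset that yields fewer chars above 0xFF. */
    static int guessAlignment(byte[] buf, Charset cs) {
        double atZero = fractionAboveLatin1(buf, 0, cs);
        double atOne = fractionAboveLatin1(buf, 1, cs);
        return atZero <= atOne ? 0 : 1;
    }
}
```

This is only a guess based on the expected character mix. For text that is mostly CJK, for instance, both alignments will look "exotic" and the comparison tells you little.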

Neil Coffey