ansaurus

Question

Partially load large text file with different encodings

Answer 1

+2 A:

I can't speak for the other formats, but UTF-8 shouldn't be too hard.

Just look at the first byte of the chunk you grabbed and figure it out from there.

Taken from Wikipedia:

Binary              Hex    Decimal  Meaning
00000000-01111111   00-7F  0-127    US-ASCII (single byte)
10000000-10111111   80-BF  128-191  2nd, 3rd, or 4th byte of a multi-byte sequence
11000000-11000001   C0-C1  192-193  Start of a 2-byte sequence, but code point <= 127
11000010-11011111   C2-DF  194-223  Start of a 2-byte sequence
11100000-11101111   E0-EF  224-239  Start of a 3-byte sequence
11110000-11110100   F0-F4  240-244  Start of a 4-byte sequence

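If the byte is in the 2nd group, you know you missed part of a character; skip forward until you reach a byte outside that group. If it's in the 1st, 4th, 5th, or 6th group, you know you are at the start of a character. Bytes in the 3rd group would encode a code point that fits in a single byte, so they never occur in valid UTF-8; treat them as a sign of corrupt data or a different encoding. Proceed accordingly from there.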
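
A minimal sketch of that check in Python, assuming the chunk has already been read into a bytes object. The function name skip_to_char_start is made up for illustration, and the sketch only aligns the start of the chunk; a character cut off at the end of the chunk would need separate handling.

```python
def skip_to_char_start(chunk: bytes) -> int:
    """Return the index of the first byte that is not a UTF-8 continuation byte.

    Hypothetical helper: it only finds where decoding can begin and does
    not validate the rest of the chunk.
    """
    for i, b in enumerate(chunk):
        # Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF).
        if b & 0xC0 != 0x80:
            return i
    # Every byte was a continuation byte; nothing decodable starts here.
    return len(chunk)


text = "naïve café".encode("utf-8")
chunk = text[3:]  # deliberately starts on the second byte of 'ï'
start = skip_to_char_start(chunk)
print(start, chunk[start:].decode("utf-8"))
```

Since a UTF-8 character is at most 4 bytes long, a valid stream never requires skipping more than 3 bytes.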

Jeremy Wall 2009-06-12 03:01:54

Answer 2

A:

In addition to Jeremy's comments for UTF-8: for encodings such as UTF-16, you could use some common-sense heuristics to decide whether you've got the right alignment. For example, if you're basically expecting Latin characters plus the odd exotic one, and half of your characters come out above 255, you've probably got the wrong alignment...
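
One way to sketch that heuristic in Python, assuming the byte order is already known to be little-endian and the text is mostly Latin. The function name guess_utf16le_offset is hypothetical, and the sketch ignores surrogate pairs.

```python
def guess_utf16le_offset(chunk: bytes) -> int:
    """Return 0 or 1: the byte offset that yields fewer code units above 0xFF.

    Hypothetical helper; assumes little-endian UTF-16 and mostly-Latin text.
    """
    def high_fraction(offset: int) -> float:
        data = chunk[offset:]
        data = data[: len(data) - len(data) % 2]  # drop a trailing odd byte
        units = [int.from_bytes(data[i:i + 2], "little")
                 for i in range(0, len(data), 2)]
        if not units:
            return 1.0
        # Fraction of 16-bit units that fall outside the Latin-1 range.
        return sum(u > 0xFF for u in units) / len(units)

    return 0 if high_fraction(0) <= high_fraction(1) else 1


encoded = "Hello, world".encode("utf-16-le")
misaligned = encoded[1:]  # chunk that starts one byte into a code unit
print(guess_utf16le_offset(encoded), guess_utf16le_offset(misaligned))
```

Comparing the two candidate offsets against each other avoids having to pick a fixed threshold, but the result is still only a guess for text that is not predominantly Latin.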

Neil Coffey 2009-06-12 03:34:49
