ansaurus

Question

Answer 1

A:

EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)

One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.

Jon Skeet 2008-10-15 10:07:32

Finding where the next character starts is easy - that's one of the very cool things about UTF-8's design. http://tools.ietf.org/html/rfc3629

Mike G. 2008-10-15 11:46:01

Easy only in well-formed UTF-8. Imagine a duplicate lead-byte being inserted - you would have to skip the first occurrance. If you simply followed the UTF-8 rules, the next character would seem to start with the last byte of the character whose lead byte was duplicated.

MSalters 2008-10-16 10:34:58

@MSalters: That's the kind of thing I was thinking about, but I wasn't sure enough of myself :)

Jon Skeet 2008-10-16 11:02:13

Answer 2

A:

RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).

But BKB is probably correct - the easiest answer is to let perl do it for you, although I'm not sure what Perl does when it detects the incorrect utf-8 with that filter in effect.

Mike G. 2008-10-15 11:44:35

Answer 3

A:

Perl script for find and replace the special characters in the CSV file

Thanks Sekar

2008-10-15 13:18:02

Answer 4

+2 A:

Consider Python. It allows to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.

import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')

This prints:

spam  eggs bacon

Constantin 2008-10-15 17:47:40

Answer 5

A:

Why are there malformed UTF-8 data there in the first place? Are you sure it's actually UTF-8?

Lasse V. Karlsen 2008-10-16 10:22:38

ansaurus

tags:

views:

answers:

Malformed UTF characters

related questions