views:

453

answers:

5

I want to detect and replace the Malformed UTF-8 characters with blank space using Perl script while loading the data using SQL*Loader. How to do?

A: 

EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)

One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.

Jon Skeet
Finding where the next character starts is easy - that's one of the very cool things about UTF-8's design. http://tools.ietf.org/html/rfc3629
Mike G.
Easy only in well-formed UTF-8. Imagine a duplicate lead-byte being inserted - you would have to skip the first occurrance. If you simply followed the UTF-8 rules, the next character would seem to start with the last byte of the character whose lead byte was duplicated.
MSalters
@MSalters: That's the kind of thing I was thinking about, but I wasn't sure enough of myself :)
Jon Skeet
A: 

RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).

But BKB is probably correct - the easiest answer is to let perl do it for you, although I'm not sure what Perl does when it detects the incorrect utf-8 with that filter in effect.

Mike G.
A: 

Perl script for find and replace the special characters in the CSV file

Thanks Sekar

+2  A: 

Consider Python. It allows to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.

import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')

This prints:

spam  eggs bacon
Constantin
A: 

Why are there malformed UTF-8 data there in the first place? Are you sure it's actually UTF-8?

Lasse V. Karlsen