ansaurus

Question

How to remove invalid UTF-8 characters from a JavaScript string?

Answer 1

+1 A:

Simple mistake, big effect:

strTest = strTest.replace(/your regex here/g, "$1");
// ----------------------------------------^

Without the "global" flag, the replacement is applied to the first match only.
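A minimal illustration of the difference, using a made-up sample string rather than the data from the question:

```javascript
// Hypothetical sample input; any string with repeated matches would do.
const sample = "a-b-c";

// No "g" flag: only the first "-" is replaced.
const firstOnly = sample.replace(/-/, "");

// With the "g" flag: every "-" is replaced.
const allMatches = sample.replace(/-/g, "");

console.log(firstOnly);  // "ab-c"
console.log(allMatches); // "abc"
```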

Side note: To remove any character that does not fulfill some kind of complex condition, such as falling into a set of certain Unicode character ranges, you can use a negative lookahead:

var re = /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})./g;
strTest = strTest.replace(re, "");

where re reads as

(?!      # negative look-ahead: a position *not followed by*:
  […]    #   any allowed character range from above
)        # end lookahead
.        # match this character (only if previous condition is met!)
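The same pattern can be tried with a much simpler allow-list. This is only a sketch: the allowed set (ASCII letters, digits and spaces) and the sample input are assumptions chosen for illustration, not the ranges from the question.

```javascript
// Assumed allow-list for illustration: ASCII letters, digits and spaces.
const notAllowed = /(?![A-Za-z0-9 ])./g;

// Escapes are used so the example does not depend on the file's encoding.
const input = "h\u00E9llo w\u00F6rld!";

// Every character that is NOT in the allow-list is matched and removed.
const cleaned = input.replace(notAllowed, "");

console.log(cleaned); // "hllo wrld"
```

One thing to keep in mind: the lookahead is re-evaluated at every position, so when an allowed alternative spans several characters, only the position where it starts is protected; the characters after it are tested on their own.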

Tomalak 2010-04-19 19:07:04

Thank you, that was a big flaw in my code. Unfortunately, with the global flag now in place, both of the regular expressions I posted seem to be filtering anything that's not ASCII. The "stress test" data's first test is some valid UTF-8 text which is being stripped, and if I take sample text from http://www.columbia.edu/kermit/utf8.html everything but ASCII gets removed.

msielski 2010-04-19 19:18:46

Answer 2

A:

Already discussed.

http://stackoverflow.com/questions/1401317/remove-non-uft8-characters-from-string

Kasturi 2010-04-19 19:09:34

I based my code on the PHP code you linked to. I couldn't find this discussed for JavaScript yet though.

msielski 2010-04-19 19:24:40

Answer 3

+2 A:

JavaScript strings are natively Unicode. They hold character sequences,* not byte sequences, so it is impossible for one to contain an invalid byte sequence.

(*Technically, they contain UTF-16 code unit sequences, which is not quite the same thing, but this probably isn't anything you need to worry about right now.)
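A small sketch of what "code units, not bytes" means in practice. The two sample characters are arbitrary choices for illustration:

```javascript
// U+00E9 (e with acute accent) is ONE UTF-16 code unit in a JavaScript
// string, even though its UTF-8 encoding takes two bytes.
const eAcute = "\u00E9";
console.log(eAcute.length);        // 1
console.log(eAcute.charCodeAt(0)); // 233 (0xE9)

// U+1F600 lies outside the Basic Multilingual Plane, so it is stored as
// two code units (a surrogate pair).
const emoji = "\uD83D\uDE00";
console.log(emoji.length);         // 2
console.log([...emoji].length);    // 1, because string iteration walks code points
```

This is also why a byte-oriented UTF-8 pattern applied directly to an ordinary string tends to treat non-ASCII characters as invalid: each one is a single code unit, not a lead byte followed by continuation bytes.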

You can, if you need to for some reason, create a string holding characters used as placeholders for bytes, i.e. using the character U+0080 ('\x80') to stand for the byte 0x80. This is what you would get if you encoded characters to bytes using UTF-8, then decoded them back to characters using ISO-8859-1 by mistake. There is a special JavaScript idiom for this:

var bytelike= unescape(encodeURIComponent(characters));

and to get back from UTF-8 pseudobytes to characters again:

var characters= decodeURIComponent(escape(bytelike));

(This is, notably, pretty much the only time the escape/unescape functions should ever be used. Their existence in any other program is almost always a bug.)
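A short round trip showing both directions on a single sample character (the input value is an arbitrary choice for illustration):

```javascript
// One character, U+00E9; its UTF-8 encoding is the two bytes 0xC3 0xA9.
const characters = "\u00E9";

// Characters -> UTF-8 pseudobytes: each code unit now stands for one byte.
const bytelike = unescape(encodeURIComponent(characters));
console.log(bytelike.length);          // 2
console.log(bytelike === "\xC3\xA9");  // true

// UTF-8 pseudobytes -> characters again.
const roundTripped = decodeURIComponent(escape(bytelike));
console.log(roundTripped === characters); // true
```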

decodeURIComponent(escape(bytelike)), since it behaves like a UTF-8 decoder, will throw an error (a URIError) if the sequence of code units fed into it would not be acceptable as UTF-8 bytes.
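That behaviour can be wrapped into a small validity check. This is a sketch only; the helper name isValidUtf8Bytelike is made up for illustration and is not part of any library:

```javascript
// Hypothetical helper: returns true if the byte-like string (one code unit
// per byte) would decode as UTF-8, false otherwise.
function isValidUtf8Bytelike(bytelike) {
  try {
    decodeURIComponent(escape(bytelike));
    return true;
  } catch (err) {
    // decodeURIComponent signals malformed input with a URIError.
    if (err instanceof URIError) return false;
    throw err; // anything else is unexpected, so re-throw it
  }
}

console.log(isValidUtf8Bytelike("\xC3\xA9")); // true: complete two-byte sequence
console.log(isValidUtf8Bytelike("\xC3"));     // false: lead byte without continuation
console.log(isValidUtf8Bytelike("\xA9"));     // false: stray continuation byte
```

Note that this only reports whether the whole string decodes; it does not by itself say where the bad sequence is or remove it.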

It is very rare for you to need to work on byte strings like this in JavaScript. Better to keep working natively in Unicode on the client side. The browser will take care of UTF-8-encoding the string on the wire (in a form submission or XMLHttpRequest).

bobince 2010-04-19 19:31:24

Thanks for an informative answer -- essentially that what I'm doing is difficult because I shouldn't be doing it. I'm having trouble with certain characters on the back-end, and need to address it there.

msielski 2010-04-19 20:20:24
