I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.

How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any platform/framework/library)

Edit: I understand that the ASCII representation will not look correct and will be larger (in terms of bytes) than its UTF-8 counterpart, since it's an encoded form of the UTF-8 original.

+1  A: 

UTF-8 can encode 2^20 different code points. ASCII can encode 128. You're not going to get reversible.

Before I spoonfeed you code that drops the non-ASCII characters, are you sure that's what you want?

kdgregory
Why the downvote? It's slightly incorrect (the Unicode codespace contains 17*2^16 codepoints, which is more than 2^20), but otherwise perfectly valid...
Christoph
+7  A: 

Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.

UTF-8 can represent any unicode character - ASCII cannot.

Neall
"ASCII cannot" - Of course it can! look at the accepted answer above.
Jenko
@Jeremy: Then state your question less sneakily! "UTF-8 to ASCII conversion" sounds like a character encoding conversion problem, while what you really want is a way to represent *Unicode* (that's not the same as UTF-8) characters using the ASCII charset and a known character escaping syntax.
Romulo A. Ceccon
@Pat That's one of the most common misconceptions about UTF-8. UTF-8 and UTF-16 actually have variable bit lengths and either one can represent any unicode character. http://en.wikipedia.org/wiki/UTF-8
Neall
I stand corrected! (Previous comment removed.)
Pat
+3  A: 

As others have said, you can't convert UTF-8 text/plain into ASCII text/plain without dropping data.

You could convert UTF-8 text/plain into ASCII someother/format. For instance, HTML lets any character in UTF-8 be represented in an ASCII data file using character references.

If we continue with that example, in JavaScript, charCodeAt could help with converting a string to a representation of it using HTML character references.
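A rough sketch of the character-reference idea, in case it helps. The function names `toCharRefs` and `fromCharRefs` are made up for illustration, and it assumes a modern engine with `codePointAt` and `String.fromCodePoint`. It also escapes `&` itself, otherwise the decoding step could not tell a literal `&#228;` in the input from an encoded character.

```javascript
// Hypothetical helpers, not part of any library.
// Encode everything outside printable ASCII (and '&') as a numeric
// character reference such as &#228;
function toCharRefs(text) {
  let out = '';
  for (const ch of text) {              // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 32 || cp > 126 || ch === '&') {
      out += '&#' + cp + ';';
    } else {
      out += ch;
    }
  }
  return out;
}

// Reverse the encoding: turn every &#NNN; back into its character.
function fromCharRefs(encoded) {
  return encoded.replace(/&#(\d+);/g, function (match, digits) {
    return String.fromCodePoint(Number(digits));
  });
}

console.log(toCharRefs('Hägar & co'));               // H&#228;gar &#38; co
console.log(fromCharRefs(toCharRefs('Hägar & co'))); // Hägar & co
```

This only decodes the decimal references it produced itself; it is not a general HTML entity decoder.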

Another approach is taken by URLs, and implemented in JS as encodeURIComponent.
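For the URL-style approach, a minimal sketch using only the built-in `encodeURIComponent` and `decodeURIComponent`. The wrapper names are hypothetical. Non-ASCII characters come out as percent-escaped UTF-8 bytes, so the result is pure ASCII and can be decoded again.

```javascript
// Hypothetical wrapper names around the built-in functions.
function toPercentEncoded(text) {
  // Throws URIError if the string contains a lone surrogate.
  return encodeURIComponent(text);
}

function fromPercentEncoded(encoded) {
  return decodeURIComponent(encoded);
}

var encoded = toPercentEncoded('Doppelgänger!');
console.log(encoded);                     // Doppelg%C3%A4nger!
console.log(fromPercentEncoded(encoded)); // Doppelgänger!
```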

David Dorward
"Without dropping data" - Of course you can! look at the accepted answer above.
Jenko
I think `encodeURI()` would be better than `encodeURIComponent()` - encoding reserved characters is unnecessary
Christoph
@Jeremy - I did say "text/plain"
David Dorward
@Christoph - That's a good point
David Dorward
+2  A: 

If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.

One way is to use base-64 encoding (example in C#):

    string original = "asdf";
    // encode the string into UTF-8 data:
    byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
    // format the data into base-64:
    string base64 = Convert.ToBase64String(encodedUtf8);

If you want the string encoded as ASCII data:

    // encode the base-64 string into ASCII data:
    byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
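Since the question asks for JavaScript, here is a sketch of the same base-64 idea there. It assumes an environment that provides `TextEncoder`, `TextDecoder`, `btoa` and `atob` (current browsers and recent Node.js versions do); the function names are made up for illustration.

```javascript
// Hypothetical helper names. Assumes TextEncoder/TextDecoder/btoa/atob exist.
function utf8ToBase64(text) {
  // encode the string into UTF-8 bytes:
  var bytes = new TextEncoder().encode(text);
  // btoa() wants a "binary string" with one char per byte:
  var binary = '';
  for (var i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

function base64ToUtf8(base64) {
  var binary = atob(base64);
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  // decode the UTF-8 bytes back into a string:
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64('asdf'));                // YXNkZg==
console.log(base64ToUtf8(utf8ToBase64('Hägar'))); // Hägar
```

Note that `TextEncoder` replaces lone surrogates with U+FFFD, so the round trip is only exact for well-formed strings.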
Guffa
Great idea, though I wanted JS. Thanks.
Jenko
+3  A: 

You could use an ASCII-only version of Douglas Crockford's json2.js quote function, which would look like this:

    var escapable = /[\\\"\x00-\x1f\x7f-\uffff]/g,
        meta = {    // table of character substitutions
            '\b': '\\b',
            '\t': '\\t',
            '\n': '\\n',
            '\f': '\\f',
            '\r': '\\r',
            '"' : '\\"',
            '\\': '\\\\'
        };

    function quote(string) {

// If the string contains no control characters, no quote characters, and no
// backslash characters, then we can safely slap some quotes around it.
// Otherwise we must also replace the offending characters with safe escape
// sequences.

        escapable.lastIndex = 0;
        return escapable.test(string) ?
            '"' + string.replace(escapable, function (a) {
                var c = meta[a];
                return typeof c === 'string' ? c :
                    '\\u' + ('0000' + a.charCodeAt(0).toString(16)).slice(-4);
            }) + '"' :
            '"' + string + '"';
    }

This will produce a valid ASCII-only, JavaScript-quoted version of the input string.

e.g. quote("Doppelgänger!") will be "Doppelg\u00e4nger!"

To revert the encoding you can just eval the result

    var encoded = quote("Doppelgänger!");
    var back = eval(encoded);
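If you would rather avoid eval, a possible alternative is sketched below. It assumes the built-in `JSON` object is available, and the function names are made up for illustration. The escaped output is a valid JSON string literal, so `JSON.parse` can decode it without executing anything.

```javascript
// Hypothetical helper names. Assumes native JSON support.
function toAsciiJson(text) {
  // JSON.stringify handles quotes, backslashes and control characters;
  // the replace() then escapes DEL and everything non-ASCII as \uXXXX.
  return JSON.stringify(text).replace(/[\u007f-\uffff]/g, function (c) {
    return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
  });
}

function fromAsciiJson(encoded) {
  // Parses data only, unlike eval(), which would run arbitrary code.
  return JSON.parse(encoded);
}

var encodedJson = toAsciiJson('Doppelgänger!');
console.log(encodedJson);                // "Doppelg\u00e4nger!"
console.log(fromAsciiJson(encodedJson)); // Doppelgänger!
```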
fforw
Why not use something *other than eval()* ? Like say, html entities?
Fowl
Mostly because you don't need to implement anything for reversion and it will be pretty fast. You could just as well use a regex-based unquote method very much like the quote function.
fforw
.. or you could secure the eval based unquote with regex validation like json2.js does for complete JSON.
fforw
Exactly what I was looking for. Thanks!
Jenko
Note that strictly speaking this is not "conversion to ASCII". You're actually implementing your own encoding scheme on top of ASCII. This may be perfectly ok for the requirements (and it seems to be for you), but it's not just a simple "conversion to ASCII".
Joachim Sauer
+1  A: 

An implementation of the quote() function might do what you want. My version can be found here

You can use eval() to reverse the encoding:

    var foo = 'Hägar';
    var quotedFoo = quote(foo);
    var unquotedFoo = eval(quotedFoo);
    alert(foo === unquotedFoo);
Christoph
Similar to the accepted answer above. Is yours better?
Jenko
@Jeremy: not really - same thing, different implementation; if I'd seen fforw's answer before posting my own, I wouldn't have bothered; my version has a few more options (choice between single or double quotes, optionally doesn't escape non-ascii characters), but most likely it will be slower
Christoph
+3  A: 

Your requirement is pretty strange.

Converting UTF-8 into ASCII would lose all information about Unicode codepoints > 127 (i.e. everything that's not in ASCII).

You could, however try to encode your Unicode data (no matter what source encoding) in an ASCII-compatible encoding, such as UTF-7. This would mean that the data that is produced could legally be interpreted as ASCII, but it is really UTF-7.

Joachim Sauer
"loose all information" - It can be lossless! look at the accepted answer above.
Jenko
Good idea about the UTF-7 though.
Jenko
@Jeremy: it can be lossless, but then you're no longer just "converting to ASCII", you're then converting to some encoding scheme implemented on top of the ASCII character set ...
Joachim Sauer
A: 

Do you want to strip all non-ASCII chars (or replace them with '?', etc.) or to store Unicode code points in a non-Unicode system?

The first can be done in a loop checking for values > 127 and replacing them.
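A minimal sketch of such a loop, taking ASCII to be the code unit values 0 to 127. The function name and the default replacement character are just for illustration. This is the lossy option: the original string cannot be recovered from the result.

```javascript
// Hypothetical helper: replaces every non-ASCII UTF-16 code unit.
function stripNonAscii(text, replacement) {
  if (replacement === undefined) {
    replacement = '?';
  }
  var out = '';
  for (var i = 0; i < text.length; i++) {
    // ASCII covers code unit values 0..127
    out += text.charCodeAt(i) > 127 ? replacement : text.charAt(i);
  }
  return out;
}

console.log(stripNonAscii('Hägar'));     // H?gar
console.log(stripNonAscii('Hägar', '')); // Hgar
```

Characters outside the Basic Multilingual Plane take two code units, so each of them becomes two replacement characters here.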

If you don't want to use "any platform/framework/library" then you will need to write your own encoder. Otherwise I'd just use jQuery's .html().

Fowl
A: 

Store both strings and use them in their respective setting.

That means that, since you can't do such a conversion, you can keep a record of the unconverted string. Instead of converting it back (which is not possible without loss if anything beyond the standard English characters has been used), you just retrieve the original when you need it.

tharkun
Huh. What's that supposed to mean?
Jenko
+1  A: 

It is impossible to convert a UTF-8 string into ASCII, but it is possible to encode Unicode as an ASCII-compatible string.

Probably you want to use Punycode - this is already a standard Unicode encoding that encodes all Unicode characters into ASCII. For JavaScript code check this question

Please edit your question title and description in order to prevent others from down-voting it: do not use the term "conversion", use "encoding".

Sorin Sbarnea