I'm looking for pseudocode, or sample code, to convert higher-bit ASCII characters (like Ü, which is extended ASCII 154) into U (which is ASCII 85).

My initial guess is that since there are only about 25 extended ASCII characters that are similar to 7-bit ASCII characters, a translation array would have to be used.

Let me know if you can think of anything else.

A: 

I think you already nailed it on the head. Given your limited domain, a conversion array or hash is your best bet. No sense creating anything complex to try to automagically do it.

jdewald
+6  A: 

I think you just can't.

I usually do something like this:

AccentString = 'ÀÂÄÉÈÊ[and all the other]'
ConvertString = 'AAAEEE[and all the other]'

Look for the char in AccentString and replace it with the char at the same index in ConvertString.
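
A minimal C sketch of that parallel-table idea, assuming single-byte Latin-1 style strings; the table contents and the function name are only illustrative:

#include <string.h>

/* Parallel tables: the accented char and its plain replacement share an index.
   (Only a few Latin-1 entries shown; extend both strings together.) */
static const char *accent  = "\xC0\xC2\xC4\xC9\xC8\xCA";   /* À Â Ä É È Ê */
static const char *convert = "AAAEEE";

void strip_accents(char *s)
{
    for (; *s; s++) {
        const char *p = strchr(accent, *s);
        if (p)
            *s = convert[p - accent];   /* replace with the same-index plain char */
    }
}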

HTH

vIceBerg
+1  A: 

You seem to have nailed it, I think. A 128-byte array, indexed by char & 127, containing the matching 7-bit character for each 8-bit character.
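
A rough sketch of that layout, again assuming a Latin-1 style single-byte encoding; the entries and names are only examples, and unmapped slots fall back to '?':

/* One byte per code point 128..255, indexed by (c & 127); 0 means "not mapped yet". */
static const unsigned char high_to_ascii[128] = {
    [0xDC & 127] = 'U',   /* Ü */
    [0xE9 & 127] = 'e',   /* é */
    /* ... fill in the rest ... */
};

unsigned char to_7bit(unsigned char c)
{
    if (c < 128)
        return c;
    return high_to_ascii[c & 127] ? high_to_ascii[c & 127] : '?';
}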

JeeBee
A: 

A lookup array is probably the simplest and fastest way to accomplish this. This is one way that you can convert, say, ASCII to EBCDIC.

Nighthawk
+1  A: 

Hm, why not just change the encoding of the string with iconv?

unexist
+1  A: 

It really depends on the nature of your source strings. If you know the string's encoding, and you know that it's an 8-bit encoding — for example, ISO Latin 1 or similar — then a simple static array is sufficient:

static const char xlate[256] = { /* ... */ [0xE9] = 'e', /* 'é' */ /* ... */ [0xDC] = 'U' /* 'Ü' */ /* ... */ };
...
new_c = xlate[(unsigned char) old_c];

On the other hand, if you have a different encoding, or if you're using UTF-8 encoded strings, you will probably find the functions in the ICU library very helpful.

Derek Clegg
+14  A: 

Most languages have a standard way to replace accented characters with plain ASCII, but it depends on the language, and it often involves replacing a single accented character with two ASCII ones; e.g. in German, ü becomes ue. So if you want to handle natural languages properly, it's a lot more complicated than you might think.
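
A minimal C sketch of that kind of one-to-many replacement, assuming Latin-1 input and German conventions; the mapping table and names are illustrative, and a real solution would need a table per language:

#include <string.h>

struct mapping { unsigned char from; const char *to; };

/* Illustrative German-style expansions: one accented byte becomes two ASCII letters. */
static const struct mapping map[] = {
    { 0xFC, "ue" },   /* ü */
    { 0xF6, "oe" },   /* ö */
    { 0xE4, "ae" },   /* ä */
    { 0xDF, "ss" },   /* ß */
};

/* Writes the transliterated form of 'in' into 'out'; 'out' must be large enough. */
void transliterate_de(const char *in, char *out)
{
    for (; *in; in++) {
        const char *rep = NULL;
        for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
            if ((unsigned char) *in == map[i].from)
                rep = map[i].to;
        if (rep) {
            strcpy(out, rep);
            out += strlen(rep);
        } else {
            *out++ = *in;
        }
    }
    *out = '\0';
}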

Mark Baker
A: 

The upper 128 characters do not have standard meanings. They can take different interpretations (code pages) depending on the user's language.

For example, compare the Portuguese and French Canadian code pages.

Unless you know the code page, your "translation" will be wrong sometimes.

If you are going to assume a certain code page (e.g. the original IBM code page) then a translation array will work, but for true international users, it will be wrong a lot.

This is one reason why Unicode is favored over the older system of code pages.

Strictly speaking, ASCII is only 7 bits.

Jamie
+4  A: 

Indeed, as proposed by unexist: the "iconv" function exists to handle all these weird conversions for you, is available in almost every programming language, and has a special option which tries to convert characters missing from the target set with approximations.

Use iconv to simply convert your input UTF-8 string to 7-bit ASCII.
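
For instance, here's a minimal C sketch using the POSIX iconv API; note that the "//TRANSLIT" suffix is a GNU/glibc extension and the exact approximations are locale-dependent. On the command line, the equivalent is roughly: iconv -f UTF-8 -t ASCII//TRANSLIT.

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "//TRANSLIT" asks iconv to approximate characters missing from the
       target charset (e.g. Ü -> U) instead of failing outright. */
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    char in[] = "\xC3\x9C" "ber";      /* "Über" encoded as UTF-8 */
    char out[64] = {0};
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror("iconv");

    printf("%s\n", out);               /* typically prints "Uber" */
    iconv_close(cd);
    return 0;
}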

Otherwise, you'll always end up hitting corner cases: an 8-bit input using a different codepage with a different set of characters (thus not working at all with your conversion table), one last accented character you forgot to map (you mapped all the grave/acute accents, but forgot the Czech caron or the Nordic '°'), etc.

Of course, if you want to apply the solution to a small, specific problem (making file-system-friendly filenames for your music collection), then look-up arrays are the way to go (either an array which, for each code number above 128, maps an approximation under 128 as proposed by JeeBee, or the source/target pairs proposed by vIceBerg, depending on which substitution functions are already available in your language of choice), because they're quickly hacked together and it's quick to check for missing elements.

DrYak
+8  A: 

Is converting Ü to U really what you would like to do? I don't know about other languages but in German Ü would become Ue, ö would become oe, etc.

+3  A: 

When you encode a string to code page 1251, accented Latin characters that don't exist in that code page are mapped to their closest unaccented equivalents (a "best fit" substitution), so when you then read the bytes back as ASCII, only the basic characters are kept.

public string RemoveDiacritics(string text)
{
    // Round-trip through code page 1251, whose best-fit fallback maps accented
    // Latin characters to their unaccented equivalents.
    return System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text));
}

From : http://www.clt-services.com/blog/post/Enlever-les-accents-dans-une-chaine-(proprement).aspx

Michel
A: 

There is an article on CodeProject that looks good.

The conversion using code page 1251 also caught my interest (see the other answer).

I don't like conversion tables, since the number of characters in Unicode is so large that you can easily miss one.

GvS
A: 

I use this function in VB6 to fix a variable with accents before passing it to a SOAP function:

Function FixAccents(ByVal Valor As String) As String

    Dim x As Long
    ' Escape the ampersand first so it is not re-escaped by the loop below
    Valor = Replace(Valor, Chr$(38), "&#" & 38 & ";")

    ' Replace every character from 127 to 255 with its numeric character reference
    For x = 127 To 255
        Valor = Replace(Valor, Chr$(x), "&#" & x & ";")
    Next

    FixAccents = Valor

End Function

And inside the SOAP function I do this (for the variable FileName):

FileName = HttpContext.Current.Server.HtmlDecode(FileName)
Gary
A: 

Try the uni2ascii program.

dan04