I'm looking for pseudocode, or sample code, to convert higher-bit ASCII characters (like Ü, which is extended ASCII 154) into U (which is ASCII 85).

My initial guess is that since there are only about 25 extended ASCII characters that are similar to 7-bit ASCII characters, a translation array would have to be used.

Let me know if you can think of anything else.

A: 

I think you already nailed it on the head. Given your limited domain, a conversion array or hash is your best bet. No sense creating anything complex to try to automagically do it.

jdewald
+6  A: 

I think you just can't.

I usually do something like this:

AccentString = 'ÀÂÄÉÈÊ[and all the other]'
ConvertString = 'AAAEEE[and all the other]'

Look for the char in AccentString and replace it with the char at the same index in ConvertString.
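
A minimal C sketch of that parallel-table idea, assuming single-byte Latin-1 style strings; the table contents and the function name are only illustrative:

#include <string.h>

/* Parallel tables: the accented char and its plain replacement share an index.
   (Only a few Latin-1 entries shown; extend both strings together.) */
static const char *accent  = "\xC0\xC2\xC4\xC9\xC8\xCA";   /* À Â Ä É È Ê */
static const char *convert = "AAAEEE";

void strip_accents(char *s)
{
    for (; *s; s++) {
        const char *p = strchr(accent, *s);
        if (p)
            *s = convert[p - accent];   /* replace with the same-index plain char */
    }
}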

HTH

vIceBerg
+1  A: 

You seem to have nailed it, I think. A 128-byte array, indexed by char & 127, containing the matching 7-bit character for each 8-bit character.
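
A rough sketch of that layout, again assuming a Latin-1 style single-byte encoding; the entries and names are only examples, and unmapped slots fall back to '?':

/* One byte per code point 128..255, indexed by (c & 127); 0 means "not mapped yet". */
static const unsigned char high_to_ascii[128] = {
    [0xDC & 127] = 'U',   /* Ü */
    [0xE9 & 127] = 'e',   /* é */
    /* ... fill in the rest ... */
};

unsigned char to_7bit(unsigned char c)
{
    if (c < 128)
        return c;
    return high_to_ascii[c & 127] ? high_to_ascii[c & 127] : '?';
}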

JeeBee
A: 

A lookup array is probably the simplest and fastest way to accomplish this. This is one way that you can convert, say, ASCII to EBCDIC.

Nighthawk
+1  A: 

Hm, why not just change the encoding of the string with iconv?

unexist
+1  A: 

It really depends on the nature of your source strings. If you know the string's encoding, and you know that it's an 8-bit encoding — for example, ISO Latin 1 or similar — then a simple static array is sufficient:

static const char xlate[256] = { /* ... */ [0xE9] = 'e', /* 'é' */ /* ... */ [0xDC] = 'U' /* 'Ü' */ /* ... */ };
...
new_c = xlate[(unsigned char) old_c];

On the other hand, if you have a different encoding, or if you're using UTF-8 encoded strings, you will probably find the functions in the ICU library very helpful.

Derek Clegg
+14  A: 

Most languages have a standard way to replace accented characters with plain ASCII, but it depends on the language, and it often involves replacing a single accented character with two ASCII ones; e.g. in German, ü becomes ue. So if you want to handle natural languages properly, it's a lot more complicated than you might think.
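
A minimal C sketch of that kind of one-to-many replacement, assuming Latin-1 input and German conventions; the mapping table and names are illustrative, and a real solution would need a table per language:

#include <string.h>

struct mapping { unsigned char from; const char *to; };

/* Illustrative German-style expansions: one accented byte becomes two ASCII letters. */
static const struct mapping map[] = {
    { 0xFC, "ue" },   /* ü */
    { 0xF6, "oe" },   /* ö */
    { 0xE4, "ae" },   /* ä */
    { 0xDF, "ss" },   /* ß */
};

/* Writes the transliterated form of 'in' into 'out'; 'out' must be large enough. */
void transliterate_de(const char *in, char *out)
{
    for (; *in; in++) {
        const char *rep = NULL;
        for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
            if ((unsigned char) *in == map[i].from)
                rep = map[i].to;
        if (rep) {
            strcpy(out, rep);
            out += strlen(rep);
        } else {
            *out++ = *in;
        }
    }
    *out = '\0';
}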

Mark Baker
A: 

The upper 128 characters do not have standard meanings. They can take different interpretations (code pages) depending on the user's language.

For example, compare the Portuguese and French Canadian code pages.

Unless you know the code page, your "translation" will be wrong sometimes.

If you are going to assume a certain code page (e.g. the original IBM code page) then a translation array will work, but for true international users, it will be wrong a lot.

This is one reason why Unicode is favored over the older system of code pages.

Strictly speaking, ASCII is only 7 bits.

Jamie
+4  A: 

Indeed, as proposed by unexist: the "iconv" function exists to handle all these weird conversions for you, is available in almost every programming language, and has a special option which tries to convert characters missing from the target set with approximations.

Use iconv to simply convert your input UTF-8 string to 7-bit ASCII.
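
For instance, here's a minimal C sketch using the POSIX iconv API; note that the "//TRANSLIT" suffix is a GNU/glibc extension and the exact approximations are locale-dependent. On the command line, the equivalent is roughly: iconv -f UTF-8 -t ASCII//TRANSLIT.

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "//TRANSLIT" asks iconv to approximate characters missing from the
       target charset (e.g. Ü -> U) instead of failing outright. */
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    char in[] = "\xC3\x9C" "ber";      /* "Über" encoded as UTF-8 */
    char out[64] = {0};
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror("iconv");

    printf("%s\n", out);               /* typically prints "Uber" */
    iconv_close(cd);
    return 0;
}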

Otherwise, you'll always end up hitting corner cases: an 8-bit input using a different codepage with a different set of characters (thus not working at all with your conversion table), one last accented character you forgot to map (you mapped all the grave/acute accents, but forgot the Czech caron or the Nordic '°'), etc.

Of course, if you want to apply the solution to a small, specific problem (making file-system-friendly filenames for your music collection), then look-up arrays are the way to go (either an array which, for each code number above 128, maps an approximation under 128 as proposed by JeeBee, or the source/target pairs proposed by vIceBerg, depending on which substitution functions are already available in your language of choice), because they're quickly hacked together and it's quick to check for missing elements.

DrYak
+8  A: 

Is converting Ü to U really what you would like to do? I don't know about other languages but in German Ü would become Ue, ö would become oe, etc.

+3  A: 

When you encode a string to code page 1251, accented Latin characters that don't exist in that code page are mapped to their closest unaccented equivalents (a "best fit" substitution), so when you then read the bytes back as ASCII, only the basic characters are kept.

public string RemoveDiacritics(string text)
{
    // Round-trip through code page 1251, whose best-fit fallback maps accented
    // Latin characters to their unaccented equivalents.
    return System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text));
}

From : http://www.clt-services.com/blog/post/Enlever-les-accents-dans-une-chaine-(proprement).aspx

Michel
A: 

There is an article on CodeProject that looks good.

The conversion using code page 1251 also caught my interest (see the other answer).

I don't like conversion tables, since the number of characters in Unicode is so large that you can easily miss one.

GvS
A: 

I use this function in VB6 to fix a variable with accents before passing it to a SOAP function:

Function FixAccents(ByVal Valor As String) As String

    Dim x As Long
    ' Escape the ampersand first so it is not re-escaped by the loop below
    Valor = Replace(Valor, Chr$(38), "&#" & 38 & ";")

    ' Replace every character from 127 to 255 with its numeric character reference
    For x = 127 To 255
        Valor = Replace(Valor, Chr$(x), "&#" & x & ";")
    Next

    FixAccents = Valor

End Function

And inside the SOAP function I do this (for the variable FileName):

FileName = HttpContext.Current.Server.HtmlDecode(FileName)
Gary
A: 

Try the uni2ascii program.

dan04