ansaurus

Question

How is this website fixing the encoding?

Answer 1

A:

Hi.
You can use the meta tag to set the proper encoding for your page. Here is an example of how you can do that:

I suppose this encoding would do the job.

Teddy 2010-05-15 12:08:48

Not the HTML tag, but the HTTP header.

Col. Shrapnel 2010-05-15 12:10:29

Thanks Teddy. I tried changing the file to Windows-1255 with Notepad++, and it didn't help.

Tal Galili 2010-05-15 12:27:54

@Col thanks for the correction.

Teddy 2010-05-15 12:31:12

Answer 2

+2 A:

If you look closely at the gibberish, you can tell that each Hebrew character is encoded as 2 characters - it appears that של is encoded as ×©×œ.

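This suggests that you are looking at UTF-8 bytes interpreted as a single-byte encoding (such as Windows-1252), so each two-byte Hebrew character shows up as two characters. Converting the file to UTF-8 will not help, because the converter treats the gibberish as the real text and preserves it.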
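To see how the gibberish arises, here is a minimal Python sketch. It assumes the page's UTF-8 bytes were decoded as Windows-1252, which matches the ×©×œ sample above but is not confirmed for the site in question.

```python
# Sketch: reproduce the mojibake, assuming UTF-8 bytes read as Windows-1252.
original = "של"

utf8_bytes = original.encode("utf-8")          # 2 bytes per Hebrew letter
gibberish = utf8_bytes.decode("windows-1252")  # each byte becomes one character

print(len(original), len(utf8_bytes), len(gibberish))
print(gibberish)
```

Two letters become four bytes, and each byte is then shown as its own character, which is why every Hebrew letter appears as a pair.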

You can read each pair of bytes and reconstruct the original UTF8 from them.

Here is some C# I came up with - this is very simplistic (doesn't fully work - too many assumptions), but I could see some of the characters converted properly:

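// Requires: using System.Text;
private string ToProperHebrew(string gibberish)
{
   // UTF-16LE bytes: two bytes per character, low byte first.
   byte[] orig = Encoding.Unicode.GetBytes(gibberish);
   byte[] heb = new byte[orig.Length / 2];

   // Keep only the low byte of each character. This recovers the original
   // byte only for characters below U+0100; others (such as œ) come out wrong.
   for (int i = 0; i < heb.Length; i++)
   {
     heb[i] = orig[i * 2];
   }

   // Reinterpret the collected bytes as UTF-8.
   return Encoding.UTF8.GetString(heb);
}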

It appears that each byte was re-encoded as two bytes - not sure what encoding was used for this, but discarding one byte seemed to be the right thing for most doubled-up characters.
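A possible reason the byte-dropping approach only works for some characters: keeping the low byte of each UTF-16 character recovers the original byte only when the character sits below U+0100. The Python sketch below, which assumes the gibberish came from a Windows-1252 misreading, compares the two approaches; low_bytes is a hypothetical helper that mirrors the C# loop.

```python
def low_bytes(gibberish: str) -> bytes:
    """Mirror the C# loop: keep the low byte of each UTF-16LE code unit."""
    return gibberish.encode("utf-16-le")[0::2]


gibberish = "×©×œ"

dropped = low_bytes(gibberish)             # œ is U+0153, so its low byte is 0x53
mapped = gibberish.encode("windows-1252")  # œ maps back to the byte 0x9C

print(dropped.hex(), dropped.decode("utf-8", "replace"))
print(mapped.hex(), mapped.decode("utf-8"))
```

Only the second byte sequence is valid UTF-8 for both letters, so mapping through the code page is the safer route when the code page is known.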

Oded 2010-05-15 12:14:35

Thanks Oded! How would you suggest I do this? (Is there a converter that can do it for a file?)

Tal Galili 2010-05-15 12:26:22

@Tal: Expand on his code sample and write a small utility. Not difficult.

Computer Guru 2010-05-15 15:45:58

Technically this won't solve the issue, as it just breaks the encoding further. I thought about using sed for a massive search and replace, but it isn't required, as we can force MySQL to export the data correctly.

Tomer Cohen 2010-05-25 22:19:46

Answer 3

+2 A:

You might want to look here - the accepted answer to this question shows a way to guess the encoding of a byte[]. All you then have to ensure is that you get the proper bytes from the gibberish. Guessing might always fail, of course...

the.duckman 2010-05-15 12:24:02

Answer 4

+3 A:

Since the issue was a MySQL fault with double-encoded UTF8 strings, MySQL is the right way to solve it.

Running the following commands will solve it -

mysqldump $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET --add-drop-table --default-character-set=latin1 > export.sql
   latin1 is used here to force MySQL not to split the characters, and should not be used otherwise.

cp export{,.utf8}.sql
   Makes a backup copy (the brace expansion is equivalent to cp export.sql export.utf8.sql).

sed -i -e 's/latin1/utf8/g' export.utf8.sql
   Replaces latin1 with utf8 in the file, in order to import it as UTF-8 instead of ISO-8859-1.

mysql $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET < export.utf8.sql
   Imports everything back into the database.

This will solve the issue in about ten minutes.
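Why the latin1 export helps can be illustrated outside MySQL. The Python sketch below is only a model: it assumes the table holds UTF-8 text that was stored over a latin1 connection, and it uses Python's latin-1 codec as a stand-in for MySQL's latin1 character set.

```python
original = "של"

# What the server believes it holds: one latin1 character per UTF-8 byte.
stored = original.encode("utf-8").decode("latin-1")

# A utf8 export converts those characters again: double-encoded UTF-8.
utf8_export = stored.encode("utf-8")

# A latin1 export writes the original bytes back out unchanged.
latin1_export = stored.encode("latin-1")

print(len(utf8_export), len(latin1_export))
print(latin1_export.decode("utf-8"))
```

Relabelling the dump as utf8 (the sed step) then tells MySQL to read those bytes as what they really are.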

Tomer Cohen 2010-05-16 09:13:52

There are giants among us, roaming in plain sight. Amazing job, Tomer!!!

Tal Galili 2010-05-16 10:06:29

By the way, this was the whole diff in the sed command -

diff export.*
10c10
< /*!40101 SET NAMES latin1 */;
---
> /*!40101 SET NAMES utf8 */;

Tomer Cohen 2010-05-16 10:06:45

Answer 5

A:

gibberish.encode('windows-1252').decode('utf-8', 'replace')
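The snippet above reads as Python (str.encode followed by bytes.decode). A slightly fuller sketch follows; fix_mojibake is a hypothetical helper name, and the fallback to latin-1 is an assumption to cover the few bytes that Python's windows-1252 codec leaves undefined (such as 0x90), which tend to appear in the gibberish as invisible control characters.

```python
def fix_mojibake(gibberish: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    raw = bytearray()
    for ch in gibberish:
        try:
            raw += ch.encode("windows-1252")
        except UnicodeEncodeError:
            # Assumption: the character is a C1 control that stands for a byte
            # with no Windows-1252 mapping. Anything outside latin-1 still
            # raises UnicodeEncodeError here.
            raw += ch.encode("latin-1")
    # Bytes that do not form valid UTF-8 become U+FFFD instead of raising.
    return raw.decode("utf-8", "replace")


print(fix_mojibake("×©×œ"))
```

To repair a whole file, the same function could be applied to its contents after reading the file as UTF-8 text.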

dan04 2010-05-26 13:41:17

What language is this? :)

Tal Galili 2010-05-26 16:19:33
