My client has an old MS SQL 2000 database that uses varchar(50) fields to store names. He tried to use this database to capture some data (via a web form). Some of the form-fillers are from other countries, and the varchar fields went nutty when some of these folks entered their names. Is it possible to recover the data somehow? Maybe by guessing what the character should be based on what it resolved to in ASCII/varchar and the country the person is from? Some of the data:

Name / Country / First or Last Name?
Jiří / CZE / F
Torbjörn / FIN / F
Huszár / HUN / L
Jürgen / DEU / F
Müller / CHE / L
Bumbálková / CZE / L
Doležal / CZE / L
Loïc / DEU / L

By the way, the web form specified this content-type:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
A: 

You basically need to poke it through libiconv, converting it to UTF-8.

A full list of appropriate character sets is going to depend on your application, but you can make some guesses based on the country code. Start with this page on Wikipedia.

Warning: You will need a human to verify each conversion.

staticsan
+6  A: 

Working from the 5th example.

à is ascii #195 (C3). ¼ is ascii #188 (BC).

I'd guess that Müller is meant to be Müller.

If this is UTF-8, based upon http://en.wikipedia.org/wiki/UTF-8#Description

We've got C3 BC = 1100 0011 1011 1100

Applying the UTF-8 mapping (drop the 110 and 10 prefix bits, then join the remaining bits):

(110) 00011 (10) 11 1100

0000 0000 1111 1100

00FC which is Unicode ü

U+00FC (see http://en.wikipedia.org/wiki/Latin_characters_in_Unicode)
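The bit arithmetic above can be written out as a small Python sketch. The function name `decode_two_byte_utf8` is made up for illustration, and it only handles the two-byte case worked through here (it does not reject overlong encodings or handle longer sequences).

```python
def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Return the code point encoded by a two-byte UTF-8 sequence."""
    # The lead byte must look like 110xxxxx, the continuation byte like 10xxxxxx.
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


code_point = decode_two_byte_utf8(0xC3, 0xBC)
print(f"U+{code_point:04X} {chr(code_point)}")  # prints "U+00FC ü"
```

In practice you would let the language's own UTF-8 decoder do this; the sketch is only there to mirror the hand calculation.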

Seems to me that you could work through this programmatically.
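As a rough sketch of what "programmatically" could look like in Python: turn each garbled character back into the byte it came from, then decode those bytes as UTF-8. This assumes the text was misread as Windows-1252 (`cp1252`); the database's actual code page is not stated in the question, so treat that as a guess. The helper name `repair_mojibake` is hypothetical.

```python
from typing import Optional


def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> Optional[str]:
    """Undo UTF-8 bytes that were misread as a single-byte code page.

    Returns None when the round trip fails, so that row can be checked by hand.
    """
    try:
        # Recover the original bytes, then read them as the UTF-8 they really were.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# The last sample is written with escapes because it ends in an invisible
# soft hyphen (U+00AD).
samples = ["Müller", "Torbjörn", "Doležal", "Ji\u00c5\u2122\u00c3\u00ad"]
for s in samples:
    print(repr(s), "->", repair_mojibake(s))
```

A `None` result means the string could not be round-tripped, for example because it was truncated in the middle of a character or was never garbled in the first place. Those rows are the ones worth a manual look.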

Now solving the first example:

Jiå™ã was actually Jiří (the final character is not shown).

Ignoring the Ji, which is correct,

C5 99 C3 AD

(110)0 0101 (10)01 1001 (110)0 0011 (10)10 1101

0159 00ED

ří

So the name is: Jiří. Wikipedia says that special r is Czech and so is the i. Furthermore if I google Jiří (http://www.google.com/search?q=Ji%C5%99%C3%AD&ie=utf-8&oe=utf-8) I get plenty of hits. We're on a winner here.

The second example, Torbjörn, maps nicely to Torbjörn which sounds convincing.

IMHO there's no great need for human checking of these, they seem to just work.

Richard A
Regarding "Jiå™ã": The actual name will be pasted below. For some reason, the As got lower-cased (they were originally uppercase) and the last character got truncated. Jiří
Chris
Thanks. I've updated the solution now. I'm just getting to grips with Unicode. Now, back to work :)
Richard A
Yup, utf-8. Added that info to the question.
Chris
+1  A: 

The Russian post office did it. Did anyone save the image before it disappeared?

http://forums.thedailywtf.com/forums/p/7156/133456.aspx

Windows programmer
A: 

Further to Richard's comments: if the web page containing the form specifies a character encoding (e.g. utf-8), then a standards-compliant browser should submit the form data using that encoding. Note that iso-8859-1 is not the same thing as Unicode; it only matches Unicode's first 256 code points. If your web pages specified UTF-8, then you shouldn't have to cope with random Microsoft codepages in the data: it should all be UTF-8.

Frentos
Ok, I added this information to the question.
Chris