I have a scraper that collects data from a source I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, e.g.

\u00e4

for a small 'a' with umlaut (sans the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.

Is there any realistic way to convert the Unicode escapes in the source into proper characters that doesn't involve me manually listing every single escape sequence and replacing them during the scrape?

*Here is a sample of the JSON that it spits out:

({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})
+2  A: 

Considering that \u00e4 is the JavaScript representation of a Unicode character, one possibility is to use PHP's json_decode() function to decode it to a PHP string...

A valid JSON string would be:

$json = '"\u00e4"';

And this:

header('Content-type: text/html; charset=UTF-8');
$php = json_decode($json);
var_dump($php);

would give you the right output:

string 'ä' (length=2)

(It's one character, but two bytes long in UTF-8)


Still, it feels a bit hackish ^^
And it might not work too well, depending on the kind of string you get as input...
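
For text that is not a complete JSON string (for example, a fragment without the surrounding double quotes), one option is to decode only the escape sequences themselves. The following is a minimal sketch rather than a tested solution; the helper name decode_unicode_escapes() is hypothetical, and it assumes the escapes always have the form \uXXXX with four hex digits:

```php
<?php
// Hypothetical helper (not part of the original answer): decodes \uXXXX
// escapes found in plain text that is not wrapped in JSON double quotes.
function decode_unicode_escapes(string $text): string
{
    // Match whole runs of escapes so that surrogate pairs stay together.
    $result = preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Wrap the run in double quotes to make it a valid JSON string.
            $decoded = json_decode('"' . $m[0] . '"');
            // Leave the text untouched if the run cannot be decoded.
            return is_string($decoded) ? $decoded : $m[0];
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo decode_unicode_escapes('Latest post by D\u00e4vid'), "\n";
```

Because each run of escapes still goes through json_decode(), the result is UTF-8, the same as in the example above. Other JSON escapes such as \" or \/ are left alone by this sketch.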

[Edit] I've just seen your comment where you seem to indicate that you get JSON as input? If so, json_decode() might really be the right tool for the job ;-)
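
If the whole response is JSON wrapped in parentheses, as the sample in the question suggests, a rough sketch of decoding it in one go could look like this. Note the assumptions: the payload below is the sample from the question completed with the closing quote and braces that the excerpt lacks, and the helper name extract_tab_content() is made up for illustration:

```php
<?php
// Hypothetical helper: pulls the HTML fragment out of the wrapped payload.
function extract_tab_content(string $raw): ?string
{
    // Strip whitespace and the wrapping parentheses around the JSON object.
    $json = trim($raw, " \t\r\n();");

    // Decode to an associative array; json_decode() returns null on invalid JSON.
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return null;
    }

    // The key names are taken from the sample in the question.
    $html = $data['content']['pagelet_tab_content'] ?? null;
    return is_string($html) ? $html : null;
}

// Assumed payload: the question's sample, completed so that it is valid JSON.
$raw = '({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n"}})';

echo extract_tab_content($raw);
```

Here json_decode() takes care of every escape in the string at once (\u00e4, \", \/ and \n), so no per-sequence replacement is needed. If the real payload is not valid JSON once the parentheses are removed, the function returns null and json_last_error_msg() can help show why.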

Pascal MARTIN
This is exactly what I need, yes! Unfortunately, the source JSON doesn't format the strings the way it's supposed to. I've edited the question to show the sort of formatting that's being used. As you can see, the double quotes are completely omitted, so json_decode can't recognise the string for what it is. Very frustrating.
hollsk