Hi,

I am trying to convert this into readable UTF-8 text in PHP:

Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv

Any ideas on how to do so?

I tried several methods I found online, but none of them worked.

In this case the Unicode text is Hebrew and Arabic.

+1  A: 

See this comment for a way to get a Unicode character from its numerical code. Then you could write a regex replace that replaces each \uXXXX pattern with the equivalent character.
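A minimal sketch of that first approach, under some assumptions: PHP 7.2+ with the mbstring extension (for `mb_chr`), and a hypothetical helper name `decode_unicode_escapes` that is not from the linked comment. It treats the four characters after `\u` as hexadecimal and does not combine surrogate pairs.

```php
<?php
// Hypothetical helper: turn literal \uXXXX escapes into UTF-8 characters.
// Assumes PHP 7.2+ with mbstring; surrogate pairs are left untouched.
function decode_unicode_escapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',   // a literal backslash, "u", four hex digits
        function (array $m): string {
            // hexdec() reads the captured digits as a hexadecimal code point.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for invalid code points; keep the escape then.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

// Single quotes keep \u as two literal characters in the input.
echo decode_unicode_escapes('Hebrew: \u05ea\u05dc'), "\n";
```

The input here is single-quoted so that PHP itself does not interpret any backslash sequences before the function sees them.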

Alternatively, you could replace each \uXXXX pattern with its matching &#xXXXX; HTML entity form (the x marks the digits as hexadecimal, which is what \uXXXX uses), and then use the following:

mb_convert_encoding($stringWithHtmlEntities, 'UTF-8', 'HTML-ENTITIES');

More complete example:

// The four \\\\ in the pattern here are necessary to match \u in the original string.
// \uXXXX escapes are hexadecimal, so capture hex digits and emit a hex entity (&#x...;).
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
Amber
Could you give me an example? I didn't understand the example in the link. Say I have the string "\u05ea" somewhere in the text: how would I change it to its HTML entity form (as it's not "&#05ea;"), or use the first option you mentioned? Thanks for the help.
Simon
Sure, I added a more complete example to my answer.
Amber
@Dav: Why `\\\\u`? Isn't `\\u` enough? I also think that `\d{2,4}` would make it more complete.
Alix Axel
Alix: `\u` would be interpreted by the regex engine as an escape-code u, sort of like how `\d` is the set of digits, and `\w` is the set of "word" characters. Thus you need to actually escape the slash in the *regex*, which means your regex needs to be `\\u`, and then you have to escape those slashes since they're within the string, thus you have \\\\ as the escaped form of \\.
Amber
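The two layers of escaping described in the comment above can be checked directly. This is only an illustrative sketch; the sample input string is made up for the check.

```php
<?php
// Layer 1: PHP string escaping. "\\\\u" holds three characters: \ \ u
var_dump(strlen("\\\\u"));                      // int(3)

// Layer 2: regex escaping. PCRE reads \\ as one literal backslash,
// so the pattern matches a backslash followed by the letter u.
var_dump(preg_match("/\\\\u/", 'x \u05ea y'));  // int(1)

// A pattern with no backslash+u in the subject does not match.
var_dump(preg_match("/\\\\u/", 'x u05ea y'));   // int(0)
```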
A: 

Hi,

I am trying this code:

function unicode_conv($originalString) {
  // The four \\\\ in the pattern here are necessary to match \u in the original string
  $replacedString = preg_replace("/\\\\u(\d{4})/", "&#$1;", $originalString);
  $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
  return $unicodeString;
}

echo unicode_conv("Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv, is the second largest city in Israel, with an estimated population of 393,900. The city is situated on the Israeli Mediterranean coast, with a land area of 51.8\u00a0square kilometres (20.0\u00a0sq\u00a0mi). It is the largest and most populous city in the metropolitan area of Gush Dan, home to 3.15\u00a0million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.\nTel Aviv was founded in 1909 on the outskirts of the ancient port city of Jaffa (Hebrew: \u05d9\u05b8\u05e4\u05d5\u05b9\u200e, Yafo; Arabic: \u064a\u0627\u0641\u0627\u200e, Yaffa). The growth of Tel Aviv soon outpaced Jaffa, which was largely Arab at the time. Tel Aviv and Jaffa were merged into a single municipality in 1950, two years after the establishment of the State of Israel. Tel Aviv's White City, designated a UNESCO World Heritage Site in 2003, comprises the world's largest concentration of Modernist-style buildings.\nTel Aviv is classified as a beta+...");

The result isn't correct: it doesn't make much of a difference. A few letters are changed to what look like Greek/Russian characters rather than Hebrew/Arabic.

It's as if the entity number is incorrect.

Simon
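A likely explanation for this symptom, offered as a diagnostic sketch rather than a confirmed answer: the `\d{4}` pattern in `unicode_conv` only matches escapes made up entirely of decimal digits, and `&#NNNN;` is then read as a decimal number, whereas the digits in `\uXXXX` are hexadecimal. The sample string and variable names below are made up for the check.

```php
<?php
// Three escapes: Hebrew tav (05ea), Arabic lam (0644), Arabic beh (0628).
$sample = '\u05ea \u0644 \u0628';

// Same pattern as in unicode_conv(): decimal digits only.
preg_match_all('/\\\\u(\d{4})/', $sample, $m);
print_r($m[1]); // contains 0644 and 0628; 05ea is skipped (it has hex letters)

// The same four digits give different code points in each reading.
$digits = '0644';
printf("as decimal: U+%04X\n", (int) $digits);   // U+0284
printf("as hex:     U+%04X\n", hexdec($digits)); // U+0644
```

If that is what is happening, capturing hexadecimal digits (`[0-9a-fA-F]{4}`) and emitting a hexadecimal entity (`&#x$1;`) should address both problems.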