I'm working on an IMDb data scraper for a site, and they seem to encode everything in a weird encoding I have never seen before.

<a href="/keyword/exploding-ship/">Exploding&#xA0;Ship</a>
A Bug&#x27;s Life

Is there a PHP function that will convert these to regular characters?

+1  A: 

Those are SGML character escapes (numeric character references). They can be either decimal (&#39;) or hexadecimal (&#xA0;) and refer directly to a Unicode code point.

html_entity_decode() should work in PHP 5, though I can't test it at the moment.

In the first comment on that reference page, the following code is given for older PHP versions:

// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
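    // replace numeric entities
    // (note: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7,
    // and chr() returns a single byte, so only code points below 256 are covered)
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);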
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}
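
On current PHP versions, where the /e modifier no longer exists, the same idea can be sketched with preg_replace_callback(). This is only a minimal sketch, not the manual's code: the function names codepoint_to_utf8 and decode_numeric_references are hypothetical, PHP 7 or later and UTF-8 output are assumed, and named entities such as &amp; are deliberately left untouched.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace decimal and hexadecimal character references.
function decode_numeric_references(string $text): string
{
    // Group 1 captures hex digits, group 2 captures decimal digits.
    // The digit limits keep every value well inside the integer range.
    return preg_replace_callback(
        '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i',
        function (array $m): string {
            $cp = $m[1] !== '' ? (int) hexdec($m[1]) : (int) $m[2];
            // Leave surrogates and out-of-range values exactly as they were.
            if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $m[0];
            }
            return codepoint_to_utf8($cp);
        },
        $text
    ) ?? $text;
}

echo decode_numeric_references('A Bug&#x27;s Life'), "\n"; // prints: A Bug's Life
```

Unlike the chr()-based version, this handles code points above 255 by emitting multi-byte UTF-8. Where html_entity_decode() is available, it is probably still the simpler choice.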
Joey
+5  A: 

This is not an encoding; these are HTML entities written as hexadecimal codes.

Try:

$converted = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
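
As a rough illustration, here is that call applied to the two strings from the question. The variable names are made up for this example. The flags argument matters for the apostrophe: ENT_COMPAT (the default before PHP 8.1) leaves single-quote references such as &#x27; undecoded, while ENT_QUOTES decodes them. Note also that &#xA0; decodes to a no-break space (U+00A0), not an ordinary space.

```php
<?php
// Sample strings taken from the question.
$title   = 'A Bug&#x27;s Life';
$keyword = 'Exploding&#xA0;Ship';

// ENT_QUOTES decodes both quote characters, including &#x27; and &#039;.
$decodedTitle = html_entity_decode($title, ENT_QUOTES, 'UTF-8');

// ENT_COMPAT decodes double quotes only, so the apostrophe reference stays.
$compatTitle = html_entity_decode($title, ENT_COMPAT, 'UTF-8');

// &#xA0; becomes U+00A0 (no-break space), which is the bytes C2 A0 in UTF-8.
$decodedKeyword = html_entity_decode($keyword, ENT_QUOTES, 'UTF-8');

// Optional: swap the no-break space for an ordinary space.
$plainKeyword = str_replace("\xC2\xA0", ' ', $decodedKeyword);

echo $decodedTitle, "\n"; // prints: A Bug's Life
```

If the apostrophe still comes through undecoded, it may be worth checking that ENT_QUOTES is really being passed and that the input was not entity-encoded twice.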
Sergei
In a way it *is* an encoding.
Joey
It's an *encoding* Jim, but not as we know it.
pavium
This works for the space, but not the apostrophe (nor ampersand).
Yegor