Hi. When I want to get the title of a remote website, I use this script:

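function get_remotetitle($urlpage) {
    // fopen() returns false if the page cannot be opened.
    $file = @fopen($urlpage, "r");
    if ($file === false) {
        return 'Title N/A';
    }
    // Read up to the first 16 KB; a single fread() on a network
    // stream may return less than the requested length.
    $text = stream_get_contents($file, 16384);
    fclose($file);
    if (preg_match('/<title>(.*?)<\/title>/is', $text, $found)) {
        $title = $found[1];
    } else {
        $title = 'Title N/A';
    }
    return $title;
}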


But when I parse a website title with accents, I get "�". But if I look in PHPMyAdmin, I see the accents correctly. What's happening?

A: 

The trouble is that the text has a different encoding from what you're using on the page you're displaying it on.

What you want to do is find out what encoding the data is in (for instance, by checking which encoding the source page uses) and convert it to the encoding you're using yourself.

For doing the actual conversion, you can use iconv (for the general case), utf8_decode (UTF8 -> ISO-8859-1), utf8_encode (ISO-8859-1 -> UTF8) or mb_convert_encoding.
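As a rough sketch of the conversion step, assuming the source page is known to be ISO-8859-1 and your own page is served as UTF-8 (the helper name `to_utf8` is made up for illustration and is not part of PHP):

```php
<?php
// Hypothetical helper: convert $text from $sourceCharset to UTF-8.
function to_utf8($text, $sourceCharset) {
    // Nothing to do if the text is already UTF-8.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $text;
    }
    // iconv() returns false when the charset is unknown or the input
    // contains bytes that are invalid in the source charset.
    $converted = @iconv($sourceCharset, 'UTF-8', $text);
    // Fall back to the untouched text rather than losing the title.
    return $converted === false ? $text : $converted;
}

// "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
$latin1Title = "caf\xE9";
$utf8Title = to_utf8($latin1Title, 'ISO-8859-1');
echo bin2hex($utf8Title), "\n"; // the é is now the two bytes c3 a9
```

Falling back to the original text on failure is a design choice for this sketch; you may prefer to reject the title instead.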

To help you find out what the encoding of the source page is, you could, for instance, run the website through the W3C Validator, which automatically detects the encoding.

If you want an automatic way to determine the encoding, you'll have to look at the HTML itself. The ways you can determine the selected charset can be found in the HTML 4 specification.
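One of the places a page can declare its charset is a `<meta>` tag, so a minimal sketch of reading it from the fetched markup might look like this. The function name `detect_meta_charset` and the sample markup are assumptions for illustration, and the regular expression is a simplification, not a full HTML parser:

```php
<?php
// Hypothetical helper: return the charset declared in a <meta> tag,
// or null if the markup does not declare one.
function detect_meta_charset($html) {
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)/i';
    if (preg_match($pattern, $html, $found)) {
        return strtoupper($found[1]);
    }
    return null;
}

// Sample markup standing in for the first bytes of a fetched page.
$sample = '<html><head><meta http-equiv="Content-Type" '
        . 'content="text/html; charset=iso-8859-1">'
        . '<title>Caf&eacute;</title></head></html>';
echo detect_meta_charset($sample), "\n"; // prints ISO-8859-1
```

The returned name could then be passed to iconv or mb_convert_encoding as the source encoding.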

In addition, it's worth having a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a bit more information on encoding.

Sebastian P.
But, I get the title from a user-form entered website. How can I do it?
Francesc
I added a bit about determining the encoding on the fly; maybe that helps?
Sebastian P.
What I don't understand is why it's correct in PHPMyAdmin.
Francesc
PHPMyAdmin is outputting in a different encoding to the one you're outputting in.
Sebastian P.
A: 

This is most likely a character encoding issue. You are probably getting the character correctly but the page that displays it has the wrong character encoding so it doesn't display right.

Brendan Heywood
A: 

Try this:

echo iconv('UTF-8', 'ASCII//TRANSLIT', $title);

Ronald D. Willis
It gives a PHP notice: "Notice: iconv() [function.iconv]: Detected an illegal character in input string".
Francesc
A: 

Check out PHP Simple HTML DOM Parser.

Use it something like this:

$html = file_get_html('http://www.google.com/');
$ret = $html->find('title', 0);
seengee
A: 

I'm a little late, I guess... Ronald already gave the answer I was going to give.

Cheers Ronald!!

Naveen Bhalla
A: 

I solved it. I added htmlentities($text) and now it displays the accents and so on.

Francesc
That sounds like a pretty fragile solution. My guess is that there will be some pages with odd character encodings where that will break. Look in the wrapper data (check the PHP docs) for the charset in the Content-Type header, and make sure to use that to work on the data. The most watertight thing to do would be to parse the data using PHP DOM and re-parse using the HTTP header charset if one is not set in the file.
Nicholas Wilson
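
The header lookup suggested in the comment above could be sketched roughly as follows. It assumes the header lines come from the 'wrapper_data' entry of stream_get_meta_data() on the opened http:// stream; the function name `charset_from_headers` and the sample headers are made up for illustration:

```php
<?php
// Hypothetical helper: pull the charset out of a list of raw HTTP
// header lines, such as the 'wrapper_data' array that
// stream_get_meta_data() returns for an http:// stream.
function charset_from_headers(array $headers) {
    $charset = null;
    foreach ($headers as $line) {
        // After redirects the list holds several responses,
        // so keep the last Content-Type charset seen.
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([A-Za-z0-9_.:-]+)/i', $line, $found)) {
            $charset = strtoupper($found[1]);
        }
    }
    return $charset;
}

// Sample header lines standing in for a real response.
$headers = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=ISO-8859-1',
);
echo charset_from_headers($headers), "\n"; // prints ISO-8859-1
```

If this returns null, the page did not send a charset in its headers and the markup itself would have to be inspected instead.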