Hey

I am scraping a list of RSS feeds by using cURL, and then I am reading and parsing the RSS data with SimpleXML. The sorted data is then inserted into a mySQL database.

However, as you can see at http://dansays.co.uk/research/MNA/rss.php, several characters are not displaying correctly.

Examples:

‘Guitar Hero: Van Halen’ Trailer And Tracklist Available

NV 10/10/09 – Salt Lake City, UT 10/11/09 – Denver, CO 10/13/09 –

I have tried using htmlentities and htmlspecialchars on the data before inserting it into the database, but that doesn't resolve the issue.

How can I resolve this?

Thanks for any advice.

Updated

I've tried what Greg suggested, and the issue is still here...

Here is the code I used to do SET NAMES in PDO:

$dbh = new PDO($dbstring, $username, $password); 

$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); 

$dbh->query('SET NAMES "utf8"');

I did a bit of echo'ing with the simplexml data before it is sorted and inserted into the database, and I now believe it is something to do with the cURL...

Here is what I have for cURL:

$ch = curl_init($url);

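curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body from curl_exec() instead of printing it

curl_setopt($ch, CURLOPT_HEADER, 0);

curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8'); // note: this sets the Accept-Encoding header (gzip, deflate), not the character set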

$data = curl_exec($ch);

curl_close($ch);

$doc = new SimpleXmlElement($data, LIBXML_NOCDATA);

Issue Resolved

I had to set the content charset in the RSS/HTML page to "UTF-8" to resolve this issue. I guess this isn't a real fix as the char problems are still there in the raw data. Looking forward to proper support for it in PHP6!

+2  A: 

Your page is being served as UTF-8 so I'd point my finger at the database.

Make sure the connection is set to UTF-8 before running any SELECTs or INSERTs. In MySQL:

SET NAMES "utf8"
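
In PHP with PDO, the same thing can be sketched roughly as below. This is only an illustration: buildUtf8Dsn() and connectUtf8() are hypothetical helper names, the host and database name are placeholders, and it assumes a PHP version whose MySQL driver honours the charset= part of the DSN (5.3.6 or later, as far as I know). The explicit SET NAMES call is kept as a fallback for older setups.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that asks for a UTF-8 connection.
function buildUtf8Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $dbname);
}

// Hypothetical helper: opens the connection and declares the client encoding.
function connectUtf8(string $host, string $dbname, string $user, string $pass): PDO
{
    $dbh = new PDO(buildUtf8Dsn($host, $dbname), $user, $pass);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // Explicit form of the same request, run before any SELECT or INSERT.
    $dbh->exec('SET NAMES utf8');
    return $dbh;
}

// Only the DSN is printed here; connectUtf8() needs a real server to run.
echo buildUtf8Dsn('localhost', 'feeds'), "\n";
```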
Greg
The OP is reading an RSS feed using cURL, nothing to do with a database
Marius
Please re-read the question: "The sorted data is then inserted into a mySQL database."
Greg
'utf8_general_ci' has already been set on the tables in phpMyAdmin.
Setting what encoding the tables use is a separate issue from what encoding you're using to INTERACT with those tables. You could very easily pass MySQL latin1 data, which it would convert to UTF-8 before storing. The SET NAMES utf8 command Greg recommends says, "I intend to do all my communication with MySQL using UTF-8." Or, another way: "The data I will be giving you, MySQL, is encoded as UTF-8 already."
VoteyDisciple
Greg
Greg is correct
Mark
Quick and simple solution to a problem I was trying to overcome yesterday for a couple of hours (had no internet connection... no access to SO... coding like a caveman ;)). THANK YOU GREG!
Peter Perháč
A: 

I've tried what Greg suggested, and the issue is still here...

Here is the code I used to do SET NAMES in PDO:

$dbh = new PDO($dbstring, $username, $password); 

$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); 

$dbh->query('SET NAMES utf8');
I think you need quotes around utf8
Greg
Also you should really edit your question with new info rather than posting it as an answer
Greg
Alright, thanks. I've added quotes around the utf8, unfortunately it doesn't help resolve the issue... :\
+1  A: 

Like all debugging, you start by isolating the problem:

"I am scraping a list of RSS feeds by using cURL," - look at the XML from the RSS feed that's giving the problem (there's more than one feed, so some feeds may be fine while the broken ones may be broken in different ways).

"and then I am reading and parsing the RSS data with SimpleXML." - print out the field that SimpleXML read out: is it OK, or does a problem show up?

"The sorted data is then inserted into a mySQL database." - print out hex(field), length(field), and char_length(field) for the piece of data that's giving the problem.
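
On the PHP side, before the data ever reaches MySQL, a similar check can be done with something like the sketch below. describeBytes() is a hypothetical helper name, the sample strings are made up, and the mbstring extension is assumed to be loaded. The idea is simply to look at the raw bytes instead of trusting what the browser shows.

```php
<?php
// Hypothetical helper for inspecting a suspicious string at each stage
// (after cURL, after SimpleXML, before the INSERT).
function describeBytes(string $s): array
{
    return [
        'hex'        => bin2hex($s),                    // raw bytes as hex
        'bytes'      => strlen($s),                     // length in bytes
        'chars'      => mb_strlen($s, 'UTF-8'),         // length in characters
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'), // well-formed UTF-8?
    ];
}

// Curly quotes encoded as real UTF-8.
print_r(describeBytes("\xE2\x80\x9CHi\xE2\x80\x9D"));

// The same quotes as single cp1252 bytes, which is not valid UTF-8.
print_r(describeBytes("\x93Hi\x94"));
```

If valid_utf8 is false, or the byte count is not what you expect for the characters you see, the damage happened at or before that stage.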

EDIT

Take the feed http://hangout.altsounds.com/external.php?type=RSS2 , put it into the validator http://validator.w3.org/feed/ . They're declaring their content type as iso-8859-1 but some of the actual content, such as the quotes, is in something like cp1252 - for example they're using the byte 0x93 to represent the left quote - http://www.fileformat.info/info/unicode/char/201C/charset_support.htm .

What's annoying about this is that this doesn't show up in some tools - Firefox seems to guess what's going on and show the quotes correctly, and more to the point, SimpleXML converts the 0x93 into utf8, so it comes out as 0xc293, which exacerbates the problem.
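
To see the effect in isolation, here is a small sketch (assuming the mbstring extension is available) that converts the single byte 0x93 to UTF-8 twice: once treating it as ISO-8859-1, as the feed declares, and once treating it as Windows-1252, which is what the byte appears to be in practice.

```php
<?php
$raw = "\x93"; // the byte the feed uses for the left double quote

// Interpreted as ISO-8859-1, 0x93 is the control character U+0093.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Interpreted as Windows-1252, 0x93 is U+201C, the left double quote.
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c293
echo bin2hex($asCp1252), "\n"; // e2809c
```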

EDIT 2

A workaround to get that feed to read a bit more correctly is to replace "ISO-8859-1" with "Windows-1252" in the XML declaration before passing the data to SimpleXML. It won't work 100% because it turns out that some parts of the feed are in UTF-8.
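
A rough sketch of that replacement is below. relabelAsCp1252() is a hypothetical helper name and the sample feed string is made up. It only touches the encoding attribute in the XML declaration at the very start of the document, so other occurrences of the text "ISO-8859-1" in the feed body are left alone.

```php
<?php
// Hypothetical helper: relabel a feed that claims ISO-8859-1 as Windows-1252.
function relabelAsCp1252(string $xml): string
{
    $fixed = preg_replace(
        '/^(<\?xml[^>]*?encoding=["\'])ISO-8859-1(["\'])/i',
        '$1Windows-1252$2',
        $xml,
        1 // at most one replacement: the declaration itself
    );
    // preg_replace() returns null on a regex error; fall back to the input.
    return $fixed ?? $xml;
}

$feed = '<?xml version="1.0" encoding="ISO-8859-1" ?>'
      . "<rss><title>\x93Hi\x94</title></rss>";

echo substr(relabelAsCp1252($feed), 0, 46), "\n";
```

The relabelled string can then be handed to new SimpleXMLElement(...) as before. Whether the parser accepts the Windows-1252 label depends on how libxml was built, so treat that part as something to verify on your own server.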

The general approach, assuming that you can't get everyone in the world to correct their feeds, is to isolate whatever workarounds you require to the interface with the external system that's emitting the malformed data, and to pass in pure clear utf8 to the hub of your system. Save a dated copy of the raw external feed so you can remember in future why the workaround was required, separate off and comment the code lines that implement the workaround so it's easy to get at and change if and when the external organisation corrects its feed (or breaks it in a different way), and check it again from time to time. Unfortunately instead of programming to a spec you're programming to the current state of a bug, so there's no permanent, clean solution - the best you can do is isolate, document, and monitor.

d__
I am up to the point where the actual encoding issue is coming directly from the xml data itself as you can see here on some of the titles retrieved... http://dansays.co.uk/research/MNA/fetch.raw.php
Thanks for your awesome advice! :)
A: 

It may have to do with the XML prologue, which looks like this for that particular feed you linked to:

<?xml version="1.0" encoding="ISO-8859-1" ?>

As far as I know, libxml, on which SimpleXML is based, looks for this kind of thing. I'm not sure about XML files, but I'm sure that with HTML strings it looks for META elements that specify the charset.

Try stripping the XML prologue (I solved a similar problem once by stripping the HTML META tags) and don't forget to utf8_encode() the data before feeding it to SimpleXMLElement.
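
A minimal sketch of that idea follows. toUtf8WithoutProlog() is a hypothetical helper name and the sample feed is made up. One swap to note: it uses mb_convert_encoding() from ISO-8859-1 where the text above says utf8_encode(), because utf8_encode() is deprecated as of PHP 8.2; both treat the input as ISO-8859-1. It assumes the whole document really is in one single-byte encoding, which may not hold for every feed.

```php
<?php
// Hypothetical helper: drop the XML declaration and convert the bytes to
// UTF-8, so the parser falls back to its UTF-8 default.
function toUtf8WithoutProlog(string $xml, string $sourceEncoding = 'ISO-8859-1'): string
{
    // Remove a leading <?xml ... ?> declaration, if there is one.
    $body = preg_replace('/^\s*<\?xml[^>]*\?>\s*/', '', $xml, 1) ?? $xml;

    return mb_convert_encoding($body, 'UTF-8', $sourceEncoding);
}

$feed = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n"
      . "<rss><title>Caf\xE9</title></rss>";

$doc = new SimpleXMLElement(toUtf8WithoutProlog($feed), LIBXML_NOCDATA);

echo bin2hex((string) $doc->title), "\n"; // 436166c3a9
```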

Ionuț G. Stan
Your help sure did point me in the right direction, but it has not cleared up all my issues yet. Thanks anyway!