For context: I'm doing a roundabout experiment where I pull data from tables on a remote page and turn it into an ICS file, so that I can find out when this sports team is playing (I can't find anywhere that offers the information more readily than this table).

I pull this data using cURL and parse it using DOMDocument, then extract the info I need. What's giving me trouble is the opposing team's name. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it seems to have the opposite effect: when I run the function on my data, even the output on the page (which had been displaying correctly) becomes incorrect, not just the separate ICS file (which was already wrong). As an example: it turns "Inđija" into "InÄija".

Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.

+1  A: 

utf8_encode converts a string from ISO 8859-1 to UTF-8. If you pass it text that is already UTF-8, it interprets each byte as an ISO 8859-1 character, and hence produces mojibake.
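A minimal sketch of that double encoding, assuming the scraped text is already UTF-8 (the variable names are only illustrative). In UTF-8, "đ" is the two bytes C4 91, and byte C4 on its own is "Ä" in ISO 8859-1, which would explain the Ä you see. Note that utf8_encode and utf8_decode are deprecated as of PHP 8.2, so newer versions will emit a deprecation notice here.

```php
<?php
// "Inđija" written out as raw UTF-8 bytes: "đ" (U+0111) is C4 91.
$utf8 = "In\xC4\x91ija";

// utf8_encode() assumes ISO 8859-1 input, so it re-encodes each byte
// separately: C4 becomes "Ä" (C3 84) and 91 becomes the control
// character U+0091 (C2 91).
$double = utf8_encode($utf8);

echo bin2hex($utf8), "\n";   // original bytes
echo bin2hex($double), "\n"; // double-encoded bytes

// utf8_decode() undoes exactly one round of utf8_encode().
$restored = utf8_decode($double);
var_dump($restored === $utf8);
```

The stray U+0091 is invisible in many viewers, which may be the "other character" that sometimes appears next to the Ä. The practical upshot is to leave already-UTF-8 data alone rather than passing it through utf8_encode.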

To help with your first problem, I'd want to know which "special" characters are being messed up in the original output, and in what way?

Jon Hanna
Like I said, đ (d with a line through it, "dj") is turning into Ä (and sometimes another character shows up next to it, depending on where I view it), whether or not I utf8_encode it.
Ben Saufley
Also č and Č are turning into Ä. It looks like every character with diacritics (this is Serbian in Latin script) is turning into Ä.
Ben Saufley
What are you reading the ICS file in? Have you tried writing a BOM at the beginning?
Jon Hanna
Reading the file in Mac's TextEdit, Google Calendar, and iCalendar. It actually seems to be working now, but it was working (or seemed to be) yesterday. I'm not sure if something changed on the source page, but I don't think so. I can't tell what Google's method for pulling this stuff is, but it seems to cause trouble. I haven't tried adding a BOM. What is the benefit?
Ben Saufley
Some programs that read files use the BOM to detect UTF-8 and assume another encoding otherwise. From what you've added, I'm also wondering whether the headers sent to Google Calendar could have been wrong while the file itself was fine.
Jon Hanna
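To make the two suggestions above concrete, here is a rough sketch of serving the calendar with an explicit charset header and, optionally, a leading BOM. The helper name withUtf8Bom and the sample event are hypothetical, and whether a BOM helps or hurts depends on the program reading the file, so treat it as something to experiment with rather than a fix.

```php
<?php
// Hypothetical helper: prepend the UTF-8 BOM (EF BB BF) unless already present.
function withUtf8Bom($text)
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? $text : $bom . $text;
}

// Sample calendar body; the team name is already UTF-8, so no utf8_encode().
$ics = implode("\r\n", array(
    'BEGIN:VCALENDAR',
    'VERSION:2.0',
    'PRODID:-//example//fixtures//EN',
    'BEGIN:VEVENT',
    'UID:match-1@example.invalid',
    'DTSTAMP:20100101T000000Z',
    'DTSTART:20100102T180000Z',
    "SUMMARY:vs. In\xC4\x91ija",
    'END:VEVENT',
    'END:VCALENDAR',
)) . "\r\n";

// When served over HTTP, state the charset explicitly in the header.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/calendar; charset=utf-8');
}

$output = withUtf8Bom($ics);
echo $output;
```

If the calendar is written to disk instead of echoed, the same $output string can be passed to file_put_contents, in which case the charset has to be declared by whatever serves the file.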