Is there a way to fix the characters that display improperly after running this HTML markup through phpQuery::newDocument? The original document has slanted (curly) double quotes around -Classics with modern Woman-, and they display improperly after the new document is created with phpQuery.

    // Original document is UTF-8 encoded
    $raw_html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body><p>Mr. Smith of Bangkok celebrated the “Classics with modern Woman”.</p></body></html>';
    print($raw_html);

    $aNew_document = phpQuery::newDocument($raw_html);
    print($aNew_document);

Original Output: Mr. Smith of Bangkok celebrated the “Classics with modern Woman”.

New Document Output: Mr. Smith of Bangkok celebrated the �Classics with modern Woman.

A: 

You already have this in the <head> element:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

Since the charset is already declared, the next thing to try is encoding these characters as HTML entities.
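
For what it's worth, here is a minimal sketch of the HTML-entity route, assuming the input really is valid UTF-8 and the mbstring extension is loaded. The helper name `encode_non_ascii_as_entities` is made up for illustration and is not part of phpQuery. The curly quotes are written as `\u{...}` escapes so the example does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper for illustration; not part of phpQuery.
// Rewrites every non-ASCII character as a decimal numeric entity.
function encode_non_ascii_as_entities(string $html): string
{
    // convmap is [first code point, last code point, offset, mask]:
    // cover everything from U+0080 up to U+10FFFF, leave ASCII alone.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

// U+201C and U+201D are the left and right curly double quotes.
$raw_html = "<p>Mr. Smith of Bangkok celebrated the \u{201C}Classics with modern Woman\u{201D}.</p>";

$ascii_html = encode_non_ascii_as_entities($raw_html);
print($ascii_html);
// <p>Mr. Smith of Bangkok celebrated the &#8220;Classics with modern Woman&#8221;.</p>
```

The result is plain ASCII, so it should not matter which charset the parser assumes. Whether phpQuery keeps the entities or decodes them on output has not been checked here.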

Cody Snider
This won't solve the problem if the file itself is not saved as UTF-8.
Yanick Rochon
+2  A: 
  1. Save the page as UTF-8 without a BOM.
  2. Add this header at the top of your script:

    header("Content-Type: text/html; charset=UTF-8");

[EDIT]: How to save files as UTF-8 without BOM:

At the OP's request, here's how to do it on Windows:

  1. Download Notepad++. It is an awesome text-editor that you should be using.
  2. Install it.
  3. Open the PHP script that contains this code in Notepad++. That is the file on your computer where you are doing all the coding.
  4. In Notepad++, from the Encoding menu at the top, select "Convert to UTF-8 without BOM".
  5. Save the file.
  6. Upload to your webserver by FTP or whatever you use.
  7. Now, run that script.
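
To check whether the save actually worked without opening an editor, something like the sketch below may help. Both function names are hypothetical, the mbstring extension is assumed, and the sample strings stand in for what `file_get_contents()` would return for the script file.

```php
<?php
// Hypothetical helpers for illustration; requires the mbstring extension.

// True if the bytes begin with the UTF-8 byte order mark (EF BB BF).
function starts_with_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// True if the bytes are valid UTF-8 and carry no BOM.
function is_utf8_without_bom(string $bytes): bool
{
    return !starts_with_utf8_bom($bytes) && mb_check_encoding($bytes, 'UTF-8');
}

// Sample inputs; in practice, read the script file with file_get_contents().
$saved_with_bom    = "\xEF\xBB\xBF" . "\u{201C}Classics\u{201D}";
$saved_without_bom = "\u{201C}Classics\u{201D}";
$saved_as_cp1252   = "\x93Classics\x94"; // curly quotes as single Windows-1252 bytes

var_dump(is_utf8_without_bom($saved_with_bom));    // bool(false)
var_dump(is_utf8_without_bom($saved_without_bom)); // bool(true)
var_dump(is_utf8_without_bom($saved_as_cp1252));   // bool(false)
```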
shamittomar
+1 because I've had this problem before, when I used to be on Windows. This is Windows saving files as CP1251 (or whatever the code page is). Everything should always be saved as UTF-8, and content should also be sent as UTF-8. Linux doesn't have this problem :)
Yanick Rochon
@Yanick, same here.
shamittomar
Tried adding `header("Content-Type: text/html; charset=UTF-8");` at the top of the script, but it didn't fix it. Can you explain what you mean by the page being saved in this example? I don't think the page is ever saved; it exists in memory on the Linux server before being recreated by phpQuery::newDocument(). If possible, can you show how to insert this code properly, or how to save the document with the correct encoding? I may be doing something wrong. Thanks.
JMC
@acidjazz, I have updated the answer with how to save as UTF-8 without BOM. By "save the page", I mean save the file that contains the code in UTF-8 without BOM mode.
shamittomar
@shamittomar - +1 Good tool, thanks. I can see using this as a companion to Eclipse. My real problem is that I'm using `$raw_html = file_get_contents($webpage_url);` to get the HTML, so in this case the file never gets saved. This is my fault for not specifying that in the original post. The special characters are fine in $raw_html. The encoding issue surfaces when I pass $raw_html into phpQuery::newDocument.
JMC
@acidjazz, you need to save the file that has the code `$raw_html = file_get_contents($webpage_url);` as UTF-8 without BOM and put `header("Content-Type: text/html; charset=UTF-8");` as its first line. Did you try this? file_get_contents returns the bytes unchanged, so it is safe for UTF-8.
shamittomar
@shamittomar, tried saving the PHP file as UTF-8 without BOM, but phpQuery still returns weird characters. Even tried passing the charset to phpQuery with `$aNew_document = phpQuery::newDocumentHTML($raw_event_html, $charset = 'utf-8');`, but the issue persists. This may be an issue in phpQuery. Thinking about using XML instead.
JMC
@acidjazz, try `phpQuery::newDocumentHTML(utf8_encode($raw_event_html), 'utf-8');` then. It must be the other page that is not properly UTF-8 encoded.
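
One caveat with `utf8_encode`: it assumes the input is ISO-8859-1, it double-encodes a string that is already UTF-8, and it is deprecated as of PHP 8.2. A more defensive sketch is to validate first and convert only when needed. The helper name `to_utf8` and the Windows-1252 fallback are assumptions for illustration; the real encoding of the fetched page is not known from this thread.

```php
<?php
// Hypothetical helper for illustration; requires the mbstring extension.
// $fallback is an ASSUMED source encoding, used only when the input
// is not already valid UTF-8.
function to_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        // Already valid UTF-8: converting again would double-encode it.
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// Curly quotes as single Windows-1252 bytes (0x93 and 0x94).
$cp1252_html = "<p>\x93Classics with modern Woman\x94</p>";
// The same text, already encoded as UTF-8.
$utf8_html = "<p>\u{201C}Classics with modern Woman\u{201D}</p>";

var_dump(to_utf8($cp1252_html) === $utf8_html); // bool(true)
var_dump(to_utf8($utf8_html) === $utf8_html);   // bool(true)
```

The converted string could then be passed along in place of the `utf8_encode(...)` call, e.g. `phpQuery::newDocumentHTML(to_utf8($raw_event_html), 'utf-8');`.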
Yanick Rochon