Hi,

I have a set of Word documents which I want to publish using a PHP tool I've written. I copy and paste the Word documents into a text box and then save them into MySQL using the PHP program. The problem I have arises from all the non-standard characters that Word documents contain, like curly quotes and ellipses ("..."). What I do at the moment is manually search and replace these kinds of things (and also foreign symbols such as e-acute) with either plain text or HTML entities (&eacute; etc.). Is there a function in PHP I can call that will take the output of a Word document and convert everything that should be an entity into an entity, and other symbols that don't display properly in Firefox into symbols that do?

Thanks!

+2  A: 

A better solution would be to ensure that your database is set up to support UTF-8. The additional characters available in the extended set should cover all the "non-standard" characters that you're talking about.

Otherwise, if you really must convert these characters into HTML entities, use htmlentities().
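As a minimal sketch of that call, assuming the pasted text has already arrived as UTF-8 (the sample string and variable names below are made up for illustration), passing the charset explicitly avoids relying on the server's default:

```php
<?php
// Sample text with the kind of characters Word produces.
// Assumption: the input is UTF-8 (curly quotes, e-acute, ellipsis).
$text = "\u{201C}Caf\u{00E9}\u{201D}\u{2026}";

// ENT_QUOTES also escapes single quotes; 'UTF-8' tells PHP how to
// decode the input bytes before looking up named entities.
$html = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $html, "\n"; // prints &ldquo;Caf&eacute;&rdquo;&hellip;
```

If the input is not actually UTF-8, the result can be empty or garbled, so it is worth checking the encoding first.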

Richard Turner
In my experience, even with all of the character encodings set right, some characters just get swallowed by the time they get to the browser. I don't know if this is a bug in PHP (the server language I use most) or what, but I've found conversion to entities more reliable.
eyelidlessness
Hi Richard, any advice on how to set MySQL up to support UTF-8? Thanks!
Ben
CREATE DATABASE db_name CHARACTER SET 'utf8' - see http://dev.mysql.com/doc/refman/5.0/en/charset-database.html and http://dev.mysql.com/doc/refman/5.0/en/charset-table.html. Note you'll have to do something like SET NAMES 'utf8'; when you connect to the DB to ensure you fetch data in UTF-8.
Richard Turner
A: 

htmlspecialchars() will get you a long way, but watch out because Word documents are messy.

acrosman
+2  A: 

This has served me well in the past:

$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
eyelidlessness
A: 

I think that all these answers miss one vital point. Windows itself uses a Windows flavour of latin1 (Windows-1252), so if you paste some special characters (like asymmetrical quotes) into a form on a Windows machine and that gets sent to a unix (or anything non-muckrosoft) box (be that to a database or whatever), some of the characters do not get matched to anything the unix system comprehends, hence the confused and garbled characters. What this means is that even if you have a UTF-8 database, and use htmlentities, some nasties are still going to get through, because the byte values Windows uses for them (0x80 to 0x9F) are not printable characters in real latin1 and are not valid UTF-8 on their own - those assignments are Microsoft-only inventions.

I would love to know of a slick solution. What I do is manually map the character codes of the Microsoft-only chars I have encountered to an (also manual) list of UTF-8 characters, do a str_replace for all of these, and THEN you can do whatever you want with the text - iconv, htmlentities, save straight into a utf8 database, it matters not anymore.

My grasp on all this is a little shaky - check out http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for an excellent explanation which I have mutilated into short form above. If someone has a better solution (surely there is one out there!) for how to PHPify what this article explains... I would love to hear it!
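One possible way to PHPify this, offered as a sketch rather than a definitive fix: the characters in question sit at bytes 0x80 to 0x9F in Windows-1252 and each has an ordinary Unicode code point, so a charset conversion can stand in for the hand-built list. The helper name word_to_utf8 is hypothetical, the code needs the mbstring extension, and it assumes that anything which is not valid UTF-8 was sent as Windows-1252.

```php
<?php
// Hypothetical helper: returns UTF-8 text whether the input arrived as
// UTF-8 already or as Windows-1252 (an assumption about the form data).
function word_to_utf8(string $input): string
{
    // Valid UTF-8 is left alone so it is not double-encoded.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise treat the bytes as Windows-1252, which assigns
    // 0x80-0x9F to curly quotes, dashes, the ellipsis and so on.
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// Bytes 0x93, 0x94 and 0x85 are the curly double quotes and the
// ellipsis in Windows-1252.
$pasted = "\x93Hello\x94\x85";
$utf8 = word_to_utf8($pasted);
```

After this step the string can go through htmlentities() with 'UTF-8' or straight into a utf8 database. The guess in the fallback branch can be wrong for other legacy encodings, so declaring the charset on the page and form is still worthwhile.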

Bheema