I'm a total Python noob, so please bear with me. I want to have Python scan a page of HTML and replace instances of Microsoft Word entities with something UTF-8 compatible.
My question is, how do you do that in Python (I've Googled this but haven't found a clear answer so far)? I want to dip my toe in the Python waters so I figure something simple like this is a good place to start. It seems that I would need to:
- load text pasted from MS Word into a variable
- run some sort of replace function on the contents
- output it
In PHP I would do it like this:
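$test = $_POST['pasted_from_Word']; // for example “Going Mobile”

function defangWord($string)
{
    // Byte sequences seen in the hex editor for the characters Word inserts
    $search = array(
        (chr(0xe2) . chr(0x80) . chr(0x98)), // left single quote
        (chr(0xe2) . chr(0x80) . chr(0x99)), // right single quote
        (chr(0xe2) . chr(0x80) . chr(0x9c)), // left double quote
        (chr(0xe2) . chr(0x80) . chr(0x9d)), // right double quote
        (chr(0xe2) . chr(0x80) . chr(0x93)), // en dash
        (chr(0xe2) . chr(0x80) . chr(0x94)), // em dash
        (chr(0x2d))                          // plain hyphen
    );
    // HTML entities to substitute, in the same order as $search
    $replace = array(
        "&lsquo;",
        "&rsquo;",
        "&ldquo;",
        "&rdquo;",
        "&ndash;",
        "&mdash;",
        "&ndash;"
    );
    return str_replace($search, $replace, $string);
}

echo defangWord($test);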
How would you do it in Python?
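For comparison, here is my rough guess at a Python 3 counterpart to the PHP function. Treat it as a sketch only: the names `WORD_TO_ENTITY` and `defang_word` are made up for illustration, and it assumes the pasted text has already been decoded into a `str`, so the table is keyed on characters rather than raw bytes.

```python
# Hypothetical counterpart to defangWord(); all names are illustrative.
# Keys are the characters whose UTF-8 bytes appear in the PHP $search array.
WORD_TO_ENTITY = {
    "\u2018": "&lsquo;",  # left single quote  (e2 80 98)
    "\u2019": "&rsquo;",  # right single quote (e2 80 99)
    "\u201c": "&ldquo;",  # left double quote  (e2 80 9c)
    "\u201d": "&rdquo;",  # right double quote (e2 80 9d)
    "\u2013": "&ndash;",  # en dash            (e2 80 93)
    "\u2014": "&mdash;",  # em dash            (e2 80 94)
    "-": "&ndash;",       # plain hyphen (0x2d), mirroring the PHP list
}

# str.maketrans accepts a dict of single characters mapped to strings.
_TABLE = str.maketrans(WORD_TO_ENTITY)


def defang_word(text: str) -> str:
    """Replace Word-style punctuation with HTML entities in one pass."""
    return text.translate(_TABLE)


pasted = "\u201cGoing Mobile\u201d"
print(defang_word(pasted))  # &ldquo;Going Mobile&rdquo;
```

Using `str.translate` instead of a chain of `replace` calls means each character is looked up once, so a replacement can never be re-replaced by a later rule.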
EDIT: Hmmm, ok ignore my confusion about UTF-8 and entities for the moment. The input contains text pasted from MS Word. Things like curly quotes are showing up as odd symbols. Various PHP functions I used to try and fix it were not giving me the results I wanted. By viewing those odd symbols in a hex editor I saw that they corresponded to the symbols I used above (0xe2, 0x80 etc.). So I simply swapped out the oddball characters with HTML entities. So if the bit I have above already IS UTF-8, what is being pasted in from MS Word that is causing the odd symbols?
EDIT2: So I set out to learn a bit about Python and found that I didn't really understand encoding. The problem I was trying to solve can be handled simply by keeping the encoding consistent from end to end. If the input form is UTF-8, the database that stores the input is UTF-8, and the page that outputs it is UTF-8, then pasting from Word works fine. No special functions needed. Now, about learning a little Python...
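To convince myself, I put together a small Python 3 sketch of what happens when the encodings do and don't match. The helper name `round_trip` is my own invention, and I'm assuming Windows-1252 (`cp1252`) as the "other" encoding, since that is the one usually associated with Word on Windows; your mismatch may involve a different pair.

```python
def round_trip(text, write_encoding, read_encoding):
    """Encode text one way and decode it another, like a mismatched pipeline."""
    raw = text.encode(write_encoding)
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(read_encoding, errors="replace")


pasted = "\u201cGoing Mobile\u201d"  # curly quotes, as pasted from Word

# Same encoding at both ends: the text survives untouched.
consistent = round_trip(pasted, "utf-8", "utf-8")

# UTF-8 bytes read as cp1252: each curly quote turns into several odd symbols.
mismatched = round_trip(pasted, "utf-8", "cp1252")

# cp1252 bytes read as UTF-8: the quote bytes are invalid and get replaced.
word_bytes_as_utf8 = round_trip(pasted, "cp1252", "utf-8")

print(consistent == pasted)      # True
print(repr(mismatched))
print(repr(word_bytes_as_utf8))
```

The middle case is the "odd symbols" situation: the three bytes `e2 80 9c` of a left curly quote get shown as three separate characters instead of one.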