I'm a total Python noob, so please bear with me. I want to have Python scan a page of HTML and replace instances of Microsoft Word entities with something UTF-8 compatible.

My question is, how do you do that in Python (I've Googled this but haven't found a clear answer so far)? I want to dip my toe in the Python waters so I figure something simple like this is a good place to start. It seems that I would need to:

  1. load text pasted from MS Word into a variable
  2. run some sort of replace function on the contents
  3. output it

In PHP I would do it like this:

$test = $_POST['pasted_from_Word']; //for example “Going Mobile”

function defangWord($string) 
{
    $search = array(
        (chr(0xe2) . chr(0x80) . chr(0x98)),
        (chr(0xe2) . chr(0x80) . chr(0x99)),
        (chr(0xe2) . chr(0x80) . chr(0x9c)), 
        (chr(0xe2) . chr(0x80) . chr(0x9d)), 
        (chr(0xe2) . chr(0x80) . chr(0x93)),
        (chr(0xe2) . chr(0x80) . chr(0x94)), 
        (chr(0x2d))
    ); 

    $replace = array(
        "&lsquo;",
        "&rsquo;",
        "&ldquo;",
        "&rdquo;",
        "&ndash;",
        "&mdash;",
        "&ndash;"
    );

    return str_replace($search, $replace, $string); 
} 

echo defangWord($test);

How would you do it in Python?

EDIT: Hmmm, ok ignore my confusion about UTF-8 and entities for the moment. The input contains text pasted from MS Word. Things like curly quotes are showing up as odd symbols. Various PHP functions I used to try and fix it were not giving me the results I wanted. By viewing those odd symbols in a hex editor I saw that they corresponded to the symbols I used above (0xe2, 0x80 etc.). So I simply swapped out the oddball characters with HTML entities. So if the bit I have above already IS UTF-8, what is being pasted in from MS Word that is causing the odd symbols?

EDIT2: So I set out to learn a bit about Python and found I don't really understand encoding. The problem I was trying to solve can be handled simply by having consistent encoding from end to end. If the input form is UTF-8, the database that stores the input is UTF-8 and the page that outputs it is UTF-8... pasting from Word works fine. No special functions needed. Now, about learning a little Python...

+3  A: 

The Python code has the same outline.

Just replace all of the PHP-isms with Python-isms.

Start by creating a File object. The result of a file.read() is a string object. Strings have a "replace" operation.
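
A minimal sketch of that outline, written in Python 3 syntax rather than the Python 2 used elsewhere in this thread. The sample text and the variable names are made up for illustration, and io.StringIO stands in for a real file so the snippet runs on its own:

```python
import io

# io.StringIO stands in for a real file so the sketch is self-contained;
# for a file on disk you would use open(path, encoding="utf-8") instead.
source = io.StringIO("It\u2019s a test")

contents = source.read()                         # read() returns one string
cleaned = contents.replace("\u2019", "&rsquo;")  # replace() returns a new string
print(cleaned)
```

Strings are immutable in Python, so replace() never changes contents in place; the result has to be assigned to a name.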

S.Lott
+2  A: 

Your best bet for cleaning Word HTML is using HTML Tidy which has a mode just for that. There are a few Python wrappers you can use if you need to do it programmatically.

Matt Good
+1  A: 

As S.Lott said, the Python code would be very, very similar—the only differences would essentially be the function calls/statements.

I don't think Python has a direct equivalent to file_get_contents(), but since you can obtain a list of the lines in the file, you can join them back together. Each line keeps its trailing newline, so join on the empty string rather than on '\n':

sample = ''.join(open(test, 'r').readlines())

EDIT: Never mind, there's a much easier way: sample = file(test).read()

String replacing is almost the same as str_replace(), except that replace() takes one search string and one replacement per call rather than arrays:

sample = sample.replace(search, replace)

And outputting is as simple as a print statement:

print defang_word(sample)

So as you can see, the two versions look almost exactly the same.
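
The print line above calls defang_word() without defining it. One way it could look is sketched here in Python 3 syntax; the SEARCH and REPLACE lists are hypothetical names that mirror the PHP arrays in the question, and the hyphen entry is left out. Because str.replace() handles a single pair per call, the lists are walked in step:

```python
# Parallel lists, mirroring the PHP $search and $replace arrays (hypothetical names).
SEARCH = ["\u2018", "\u2019", "\u201c", "\u201d", "\u2013", "\u2014"]
REPLACE = ["&lsquo;", "&rsquo;", "&ldquo;", "&rdquo;", "&ndash;", "&mdash;"]

def defang_word(text):
    """Replace each search string with the entity at the same position."""
    for old, new in zip(SEARCH, REPLACE):
        text = text.replace(old, new)
    return text

print(defang_word("\u201cGoing Mobile\u201d"))
```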

htw
file('foo.txt').read()
Justus
Good call—edited.
htw
+20  A: 

First of all, those aren't Microsoft Word entities—they are UTF-8. You're converting them to HTML entities.

The Pythonic way to write something like:

chr(0xe2) . chr(0x80) . chr(0x98)

would be:

'\xe2\x80\x98'

But Python already has built-in functionality for the type of conversion you want to do:

def defang(string):
    return string.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

This will replace the UTF-8 byte sequences in a string for characters like “ with numeric entities like &#8220;.

If you want to replace those numeric entities with named ones where possible:

import re
from htmlentitydefs import codepoint2name

def convert_match_to_named(match):
    num = int(match.group(1))
    if num in codepoint2name:
        return "&%s;" % codepoint2name[num]
    else:
        return match.group(0)

def defang_named(string):
    return re.sub(r'&#(\d+);', convert_match_to_named, defang(string))

And use it like so:

>>> defang_named('\xe2\x80\x9cHello, world!\xe2\x80\x9d')
'&ldquo;Hello, world!&rdquo;'


To complete the answer, the equivalent code to your example to process a file would look something like this:

# in Python, it's common to operate a line at a time on a file instead of
# reading the entire thing into memory

my_file = open("test100.html")
for line in my_file:
    print defang_named(line),  # trailing comma: each line already ends with a newline
my_file.close()

Note that this answer is targeted at Python 2.5; the Unicode situation is dramatically different for Python 3+.
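
For comparison, here is a rough Python 3 sketch of the same two functions. It assumes the input arrives as UTF-8 encoded bytes; in Python 3 the htmlentitydefs module is named html.entities, and str no longer has a decode() method, so the bytes/text split has to be explicit. The helper name _to_named is only illustrative:

```python
import re
from html.entities import codepoint2name

def defang(data):
    """Decode UTF-8 bytes and return ASCII text with numeric character references."""
    return data.decode("utf-8").encode("ascii", "xmlcharrefreplace").decode("ascii")

def _to_named(match):
    """Swap a numeric reference for a named entity when one exists."""
    num = int(match.group(1))
    if num in codepoint2name:
        return "&%s;" % codepoint2name[num]
    return match.group(0)

def defang_named(data):
    return re.sub(r"&#(\d+);", _to_named, defang(data))

print(defang(b"\xe2\x80\x9cHello, world!\xe2\x80\x9d"))
print(defang_named(b"\xe2\x80\x9cHello, world!\xe2\x80\x9d"))
```

If the text is already a str, the decode("utf-8") step is unnecessary and the encode call can be applied to it directly.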

I also agree with bobince's comment below: if you can just keep the text in UTF-8 format and send it with the correct content-type and charset, do that; if you need it to be in ASCII, then stick with the numeric entities—there's really no need to use the named ones.

Miles
+1 for xmlcharrefreplace — there is no need for HTML named entities today really. But really, leave the UTF-8 alone, smart-quotes intact. As long as you serve it with the correct ‘charset’ header/meta-tag there is no problem.
bobince
+1 for pointing out that the entities are UTF-8 and not some MS weirdness ;-) (and for a well-written answer overall, too)
David Zaslavsky
I'm confused. The document I am importing in the example is full of strange symbols that correspond to MS Word curly quotes. If I drop them straight into a page with UTF-8 encoding I get strange symbols. If I convert them using my example code they render fine. So, what are they before I convert?
gaoshan88
It's hard to tell what you mean when you say "drop them straight into a page with UTF-8 encoding". It sounds like you're opening the test100.html file in a text editor with the incorrect character set (probably Windows-1252)—make sure you open it as UTF-8.
Miles
Sorry, that wasn't clear. The PHP I wrote was created to handle people pasting directly from Word into a textarea. The pasted code would then appear with the garbled symbols (looking like “Inside Quotes†for example) and I could not find a good solution to clean it. My above code cleans it.
gaoshan88
Basically I want to clean text that was input by pasting from Word into a textarea. I shouldn't have used an html page for my example, in reality I was dealing with text input via a form. Does that make sense?
gaoshan88
Does the pasted content appear garbled *immediately* when it is pasted? Or after the form has been posted to the server and redisplayed on the following page? What you're describing is symptomatic of UTF-8 encoded text being interpreted as Latin-1 or Windows-1252.
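
A small Python 3 sketch of that symptom, assuming the text really is UTF-8 being read as Windows-1252 (the sample string is made up). It avoids the closing curly quote because the last byte of its UTF-8 form, 0x9d, is unassigned in Python's cp1252 codec and a strict decode would raise an error:

```python
text = "\u201cInside\u2013Quotes"      # opening curly quote, en dash

utf8_bytes = text.encode("utf-8")      # what a UTF-8 form would submit
garbled = utf8_bytes.decode("cp1252")  # the same bytes read with the wrong charset
print(garbled)

# Reversing the wrong decode recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == text)
```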
Miles
Does the page with the textarea have its charset set to UTF-8? (With the appropriate HTTP Content-Type header, or putting <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the <head>)
Miles
No. It appears correct. The pages handling the output are all UTF-8, though, so I suspect it is Windows-1252 that cannot be rendered properly. Maybe? I edited my example above to (hopefully) clarify.
gaoshan88
"Does the page with the textarea have its charset set to UTF-8?" Actually it is charset=iso-8859-1 on the input page and it WAS the same on the output page but I changed it to UTF-8 (on the output page). So it is a mess of 1252 being pasted into 8859-1 and viewed on utf-8. Ugh.
gaoshan88
A: 

Just for the record, that is not the way to do it in PHP.

$test = "test100.html";

$sample = file_get_contents($test);

echo htmlentities($sample, ENT_COMPAT, 'UTF-8');

Edit: for just these characters in a php file saved with UTF-8 encoding:

$search = array (
  '‘',
  '’',
  '“',
  '”',
  '–',
  '—',
  '-',
);

Of course, outputting the content as UTF-8 means you don't have to do this conversion at all.

OIS
Actually, for what I'm doing it is. The problem with your solution is that it converts everything to entities, causing correct markup to appear as &lt;, for example. My solution only alters the strange symbols introduced by Word, nothing else.
gaoshan88