ansaurus

Question

Some Basic Python Questions

Answer 1

+3 A:

The Python code has the same outline.

Just replace all of the PHP-isms with Python-isms.

Start by creating a File object. The result of a file.read() is a string object. Strings have a "replace" operation.
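A minimal sketch of that outline, written for Python 3 as an assumption (the rest of this thread targets Python 2). The names replace_all, clean_file and SMART_QUOTES are hypothetical, and the table is only an illustration, not the search and replace lists from the question.

```python
def replace_all(text, replacements):
    """Apply each (old, new) pair to text with str.replace, in order."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


def clean_file(path, replacements, encoding="utf-8"):
    """Read the whole file as one string, then run the replacements on it."""
    # open() returns a file object; read() returns its contents as a str.
    with open(path, encoding=encoding) as handle:
        return replace_all(handle.read(), replacements)


# Hypothetical table: curly quotes mapped to named HTML entities.
SMART_QUOTES = [
    ("\u2018", "&lsquo;"),
    ("\u2019", "&rsquo;"),
    ("\u201c", "&ldquo;"),
    ("\u201d", "&rdquo;"),
]
```

Calling clean_file("test100.html", SMART_QUOTES) would return the cleaned text, assuming the file is saved as UTF-8.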

S.Lott 2009-04-16 01:47:24

Answer 2

+2 A:

Your best bet for cleaning Word HTML is using HTML Tidy which has a mode just for that. There are a few Python wrappers you can use if you need to do it programmatically.

Matt Good 2009-04-16 01:53:12

Answer 3

+1 A:

As S.Lott said, the Python code would be very, very similar—the only differences would essentially be the function calls/statements.

I don't think Python has a direct equivalent to file_get_contents(), but since you can obtain a list of the lines in the file, you can then join them back together (each line keeps its trailing newline, so the separator is an empty string), like this:

sample = ''.join(open(test, 'r').readlines())

EDIT: Never mind, there's a much easier way: sample = file(test).read()

String replacing is almost exactly the same as str_replace():

sample = sample.replace(search, replace)

And outputting is as simple as a print statement:

print defang_word(sample)

So as you can see, the two versions look almost exactly the same.

htw 2009-04-16 01:54:55

file('foo.txt').read()

Justus 2009-04-16 02:09:04

Good call—edited.

htw 2009-04-16 02:11:25

Answer 4

+20 A:

First of all, those aren't Microsoft Word entities—they are UTF-8. You're converting them to HTML entities.

The Pythonic way to write something like:

chr(0xe2) . chr(0x80) . chr(0x98)

would be:

'\xe2\x80\x98'

But Python already has built-in functionality for the type of conversion you want to do:

def defang(string):
    return string.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

This will replace the UTF-8 byte sequences in a string for characters like ‘ with numeric entities like &#8216;.

If you want to replace those numeric entities with named ones where possible:

import re
from htmlentitydefs import codepoint2name

def convert_match_to_named(match):
    num = int(match.group(1))
    if num in codepoint2name:
        return "&%s;" % codepoint2name[num]
    else:
        return match.group(0)

def defang_named(string):
    return re.sub('&#(\d+);', convert_match_to_named, defang(string))

And use it like so:

>>> defang_named('\xe2\x80\x9cHello, world!\xe2\x80\x9d')
'&ldquo;Hello, world!&rdquo;'

To complete the answer, the equivalent code to your example to process a file would look something like this:

# in Python, it's common to operate a line at a time on a file instead of
# reading the entire thing into memory

my_file = open("test100.html")
for line in my_file:
    print defang_named(line)
my_file.close()

Note that this answer is targeted at Python 2.5; the Unicode situation is dramatically different for Python 3+.
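For comparison, a rough Python 3 counterpart might look like the sketch below. It is not part of the original answer; it assumes the input is already a str (for example, read with open(path, encoding="utf-8")), and the helper name _to_named is made up for illustration.

```python
import re
from html.entities import codepoint2name


def defang(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # In Python 3, str is already Unicode, so there is no decode step first.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def _to_named(match):
    """Swap a numeric reference for a named entity when one exists."""
    codepoint = int(match.group(1))
    if codepoint in codepoint2name:
        return "&%s;" % codepoint2name[codepoint]
    return match.group(0)


def defang_named(text):
    """Like defang(), but prefer named entities such as &ldquo; where defined."""
    return re.sub(r"&#(\d+);", _to_named, defang(text))
```

The main difference from the Python 2 version is that the bytes-to-text decoding happens when the file is opened rather than inside defang().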

I also agree with bobince's comment below: if you can just keep the text in UTF-8 format and send it with the correct content-type and charset, do that; if you need it to be in ASCII, then stick with the numeric entities—there's really no need to use the named ones.

Miles 2009-04-16 02:10:31

+1 for xmlcharrefreplace — there is no need for HTML named entities today really. But really, leave the UTF-8 alone, smart-quotes intact. As long as you serve it with the correct ‘charset’ header/meta-tag there is no problem.

bobince 2009-04-16 02:14:47

+1 for pointing out that the entities are UTF-8 and not some MS weirdness ;-) (and for a well-written answer overall, too)

David Zaslavsky 2009-04-16 02:48:02

I'm confused. The document I am importing in the example is full of strange symbols that correspond to MS Word curly quotes. If I drop them straight into a page with UTF-8 encoding I get strange symbols. If I convert them using my example code they render fine. So, what are they before I convert?

gaoshan88 2009-04-16 05:46:58

It's hard to tell what you mean when you say "drop them straight into a page with UTF-8 encoding". It sounds like you're opening the test100.html file in a text editor with the incorrect character set (probably Windows-1252)—make sure you open it as UTF-8.

Miles 2009-04-16 06:11:21

Sorry, that wasn't clear. The PHP I wrote was created to handle people pasting directly from Word into a textarea. The pasted code would then appear with the garbled symbols (looking like â€œInside Quotesâ€ for example) and I could not find a good solution to clean it. My above code cleans it.

gaoshan88 2009-04-16 06:18:10

Basically I want to clean text that was input by pasting from Word into a textarea. I shouldn't have used an html page for my example, in reality I was dealing with text input via a form. Does that make sense?

gaoshan88 2009-04-16 06:22:21

Does the pasted content appear garbled *immediately* when it is pasted? Or after the form has been posted to the server and redisplayed on the following page? What you're describing is symptomatic of UTF-8 encoded text being interpreted as Latin-1 or Windows-1252.
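That symptom can be reproduced in a few lines. This is a Python 3 sketch added for illustration, assuming the wrong codec in play is Windows-1252 (cp1252); the sample string is made up.

```python
# A left double quote and a right single quote around some text.
original = "\u201cInside Quotes\u2019"

# Encode correctly as UTF-8, then decode the bytes with the wrong codec.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# As long as no bytes were lost, the mistake can be undone by reversing it.
repaired = garbled.encode("cp1252").decode("utf-8")

# A right double quote (U+201D) is avoided in this sample because its last
# UTF-8 byte, 0x9D, has no assignment in Python's cp1252 codec, so the
# decode step would raise UnicodeDecodeError instead of producing text.
```

Each curly quote is three bytes in UTF-8, which is why every one of them shows up as three unrelated characters when the bytes are read as a single-byte encoding.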

Miles 2009-04-16 06:32:48

Does the page with the textarea have its charset set to UTF-8? (With the appropriate HTTP Content-Type header, or putting <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the <head>)

Miles 2009-04-16 06:35:02

No. It appears correct. The pages handling the output are all UTF-8, though, so I suspect it is Windows-1252 that cannot be rendered properly. Maybe? I edited my example above to (hopefully) clarify.

gaoshan88 2009-04-16 06:37:43

"Does the page with the textarea have its charset set to UTF-8?" Actually it is charset=iso-8859-1 on the input page and it WAS the same on the output page but I changed it to UTF-8 (on the output page). So it is a mess of 1252 being pasted into 8859-1 and viewed on utf-8. Ugh.

gaoshan88 2009-04-16 06:41:14

Answer 5

A:

Just for the record, that is not the way to do it in PHP.

$test = "test100.html";

$sample = file_get_contents($test);

echo htmlentities($sample, ENT_COMPAT, 'UTF-8');

Edit: for just these characters in a PHP file saved with UTF-8 encoding:

$search = array (
  '‘',
  '’',
  '“',
  '”',
  '–',
  '—',
  '-',
);

Of course, outputting the content as UTF-8 means you don't have to do this conversion at all.

OIS 2009-04-16 03:14:51

Actually, for what I'm doing it is. The problem with your solution is that it converts everything to entities, causing correct markup to appear as &lt;, for example. My solution only alters the strange symbols introduced by Word, nothing else.
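The distinction can be illustrated with a small Python 3 sketch, added here as an assumption-laden comparison rather than a port of either PHP snippet. Python's html.escape is not the same as PHP's htmlentities (it leaves curly quotes alone), but it shows the same side effect on markup.

```python
import html

sample = "<b>\u201cquoted\u201d</b>"

# Escaping the whole string turns the markup characters into entities too.
escaped = html.escape(sample, quote=False)

# xmlcharrefreplace only touches characters outside ASCII, so tags survive.
non_ascii_only = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)
print(non_ascii_only)
```

Only the second result keeps the b element usable as markup while still replacing the curly quotes.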

gaoshan88 2009-04-16 05:55:36
