ansaurus

Question

Answer 1

+2 A:

Look at the codecs standard library, specifically the encode and decode methods provided in the Codec base class.

There's also a good article here that puts it all together.
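As a rough illustration of the encode/decode round trip mentioned above, here is a minimal sketch. It assumes Python 3 and uses the module-level codecs.decode and codecs.encode helpers rather than a Codec subclass. The sample bytes and the choice of Latin-1 as the source encoding are assumptions made for the example.

```python
import codecs

# Bytes as they might arrive from a Latin-1 encoded page (assumed sample data);
# \xa0 is the non-breaking space.
raw = b"Hello,\xa0World"

# Decode bytes into text, naming the source encoding explicitly.
text = codecs.decode(raw, "latin-1")

# Encode the text back to bytes, this time as UTF-8.
utf8_bytes = codecs.encode(text, "utf-8")

print(repr(text))
print(repr(utf8_bytes))
```

The same round trip can be written with the bytes.decode and str.encode methods; the codecs functions are shown only because the answer points at that module.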

Wayne Koorts 2009-04-15 18:17:29

Thanks, great article. You are right, it does put a lot together.

PyNEwbie 2009-04-15 18:32:46

Answer 2

A:

Just a note regarding HTML cleaning: it is very, very hard, since

<
body
>

is a valid way to write HTML. Just an FYI.

Ólafur Waage 2009-04-15 18:18:02

Answer 3

A:

You can convert it to unicode in this way:

print u'Hello, \xa0World'  # prints: Hello,  World

jcoon 2009-04-15 18:18:07

Answer 4

+5 A:

Maybe you should be doing:

s=unicodestring.replace(u'\xa0',u'')
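The same idea as a small, hedged sketch in Python 3, where every str is already unicode and the u prefix is optional. The helper name strip_nbsp and its replacement parameter are hypothetical, introduced only for illustration.

```python
def strip_nbsp(text, replacement=""):
    """Replace every non-breaking space (U+00A0) in text.

    The default removes the character, as in the line above;
    pass " " to substitute an ordinary space instead.
    """
    return text.replace("\xa0", replacement)


removed = strip_nbsp("Hello,\xa0World")
spaced = strip_nbsp("Hello,\xa0World", " ")

print(repr(removed))
print(repr(spaced))
```

Substituting a regular space is often the better choice when the text is meant to be read, since removing the character outright joins the neighbouring words.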

z33m 2009-04-15 18:22:48

So how did you know to do this since I have not seen this in any example? Thanks

PyNEwbie 2009-04-15 18:30:11

I think string literals without the u prefix are byte strings that get decoded as ASCII, so '\xa0' will raise an exception.

z33m 2009-04-15 18:32:55

Answer 5

+3 A:

s=unicodestring.replace('\xa0','')

...is trying to use the character \xa0, which is not valid in an ASCII string (the default string type in Python until version 3.x).

The reason r'\xa0' did not raise an error is that escape sequences have no effect in a raw string. Rather than turning \xa0 into the unicode character, Python saw the string as a literal backslash, a literal x, and so on.

The following are the same:

>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'

This is resolved in Python 3, where the default string type is unicode, so you can just do:

>>> '\xa0'
'\xa0'
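To make the raw-string versus escape-sequence difference concrete, here is a small sketch, assuming Python 3. The variable names are illustrative only.

```python
raw = r"\xa0"      # four characters: backslash, x, a, 0
escaped = "\xa0"   # one character: U+00A0, the non-breaking space

print(len(raw), len(escaped))

# The raw string contains nothing to replace, so the text is unchanged.
text = "Hello,\xa0World"
unchanged = text.replace(raw, "")
cleaned = text.replace(escaped, "")

# U+00A0 lies outside ASCII, so encoding it as ASCII fails.
ascii_error = None
try:
    escaped.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

print(type(ascii_error).__name__)
```

This is why replacing r'\xa0' runs without error but also without effect: it searches for four literal characters that never occur in the text.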

I am trying to clean all of the HTML out of a string so the final output is a text file

I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job at both parsing HTML and dealing with Unicode:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>
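For the stated goal of ending up with plain text, here is a rough sketch that uses only the standard library's html.parser in place of BeautifulSoup, assuming Python 3. The names TextExtractor and html_to_text are hypothetical, and the sample markup is made up. It keeps all text nodes, including the contents of script and style elements, so it is a starting point, not a robust cleaner.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document and ignore the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # str.split() with no arguments splits on any Unicode whitespace,
    # including U+00A0, so this also normalises non-breaking spaces.
    return " ".join(" ".join(parser.chunks).split())


text = html_to_text(
    "<html><body><h1>Hi</h1><p>Hello,&nbsp;World</p></body></html>"
)
print(text)
```

BeautifulSoup remains the more forgiving option for badly formed markup; this sketch only shows that the tags can be dropped without knowing in advance which ones are present.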

dbr 2009-04-15 20:33:03

I appreciate this answer. I have used BS to extract data from tables and it is very useful. However, it seems to me that to remove the HTML using BS I have to know what is present. Am I wrong about that?

PyNEwbie 2009-04-15 23:11:34

I'm not sure what you mean. You can remove HTML in countless ways, from targeting the first table in a div to selecting by class or id, etc.

dbr 2009-04-16 14:01:54


How to work with unicode in Python
