How can I replace HTML entities in unicode strings with the proper Unicode characters?

u'"HAUS Kleider" - Über das Bekleiden und Entkleiden, das VerhŸllen und Veredeln'

to

u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln'

edit
Actually the entities are wrong. It seems like BeautifulSoup f...ed it up.

So the question is: how do I deal with UTF-8 encoded strings and BeautifulSoup?

from BeautifulSoup import BeautifulSoup

f = open('path_to_file', 'r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l = []
    for r in row.findAll('td'):
        l += [r.string]  # here things seem to go wrong
    allArticles += [l]

Ü -> Ÿ instead of Ü, but actually I don't want the encoding to be changed at all.

>>> soup.originalEncoding
'utf-8'

but I can't generate a proper unicode string from it.

+1  A: 

I think what you need are ICU transliterators, which should be able to transliterate HTML entities into Unicode.

Try the transliterator ID Hex/XML-Any; that should do what you want. On the demo page you can choose "Insert Sample: Compound", enter Hex/XML-Any into the "Compound 1" box, add some input data in the box and press "transform". Does this help?

There is a Python ICU binding, but I think it's not well maintained.

towi
+1  A: 

htmlentitydefs.entitydefs["quot"] returns '"'
That's a dictionary that translates entities to their actual character.
You should be able to continue easily from that point.
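
A minimal sketch of how one might continue from there, assuming Python 3, where the `htmlentitydefs` module is named `html.entities`. The helper name `replace_entities` and the regular expression are illustrative choices, not part of any library; unknown entities are left untouched.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches named (&Uuml;), decimal (&#220;) and hexadecimal (&#xDC;) entities.
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")


def replace_entities(text):
    """Return text with HTML entities replaced by their characters."""

    def _sub(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))  # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))      # decimal reference
            return chr(name2codepoint[body])   # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid number: keep the original text.
            return match.group(0)

    return _ENTITY_RE.sub(_sub, text)
```

For example, `replace_entities('&Uuml;ber')` gives `'Über'`. On Python 3.4 and later, the standard library's `html.unescape` covers the same ground without a hand-written helper.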

BlueTrance
That would work if BeautifulSoup gave me the right entities at all. See my edit.
vikingosegundo
A: 

OK, the problem was silly, I have to confess. I was working with an old version of rows in the interactive interpreter. I don't know what was wrong with its contents, but this is the correct code:

from BeautifulSoup import BeautifulSoup

f = open('path_to_file', 'r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
rows = soup.findAll('tr')  # rebuild rows from the freshly parsed soup
allArticles = []
for row in rows:
    l = []
    for r in row.findAll('td'):
        l += [r.string]
    allArticles += [l]

shame on me!
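
For comparison, here is a rough sketch of the same row/cell collection using only the Python 3 standard library instead of BeautifulSoup. This is a swapped-in technique, not the code I actually ran: the class `TableCells` and the function `extract_rows` are hypothetical names, and the input is assumed to be UTF-8 bytes that are decoded explicitly before parsing.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &Uuml; into characters.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._cell = None  # list of text chunks while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell))
            self._cell = None


def extract_rows(raw_bytes, encoding="utf-8"):
    """Decode the bytes explicitly, then return a list of rows of cell text."""
    parser = TableCells()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.rows
```

Decoding the bytes up front means every cell comes back as an ordinary Python 3 `str`, so no encoding guess is involved.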

vikingosegundo