ansaurus

Question

Python strategy for extracting text from malformed html pages

Answer 1

+5 A:

Try not to laugh, but:

class TextFormatter:
    def __init__(self,lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx, 
                      '-assume-charset=UTF-8', 
                      '-display-charset=UTF-8', 
                      '-dump', 
                      '-stdin'], 
                      stdin=PIPE, 
                      stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')

I hope you've got lynx!

Jonathan Feinberg 2009-10-23 18:26:02

I don't have lynx =( and they will not install it. They do have ELinks installed which they told me is suppose to be similar. Looking at the ELinks documentation to see it if will work. Good to know about lynx none the less.

Johnny4000 2009-10-23 18:54:40

ELinks and Lynx kick butt. Thanks for letting me know about them.

Johnny4000 2009-10-24 02:48:21

It was born out of desperation on my side, i can tell you. I'm glad it's useful to you!

Jonathan Feinberg 2009-10-24 14:06:16

Answer 2

A:

Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join( unicode( tag ) for tag in BeautifulSoup( oldhtml ).body.contents ))
# use html2text to turn it into text
text = html2text( newhtml )

and it worked out, but of course the documents could be so bad that even BS can't salvage much.

THC4k 2009-10-23 18:29:37

Answer 3

A:

BeautifulSoup will do bad with malformed HTML. What about some regex-fu?

>>> import re
>>> 
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>> 
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'

You can then assembly a list of valid tags from which you want to extract information.

hcalves 2009-10-24 06:51:08

ansaurus

tags:

views:

answers:

Python strategy for extracting text from malformed html pages

related questions