ansaurus

Question

Extracting text from HTML file using Python

Answer 1

+1 A:

http://pypi.python.org/pypi/webstemmer/0.5.0

http://atropine.sourceforge.net/documentation.html

Alternatively, I think you can drive lynx from Python; try searching for that.
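A minimal sketch of the lynx idea, assuming lynx is installed and on your PATH. The helper name lynx_dump is made up for illustration; the flags are the ones documented in the lynx man page.

```python
import shutil
import subprocess

# Flags from the lynx man page: -dump renders the page as plain text on stdout,
# -nolist drops the trailing list of link URLs, -stdin reads the page from
# standard input, and -force_html treats that input as HTML.
LYNX_ARGS = ["lynx", "-dump", "-nolist", "-stdin", "-force_html"]


def lynx_dump(html_text):
    """Return lynx's plain-text rendering of html_text (hypothetical helper)."""
    result = subprocess.run(
        LYNX_ARGS, input=html_text, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Only call lynx when it is actually available on this machine.
    if shutil.which("lynx"):
        print(lynx_dump("<html><body><h1>Title</h1><p>Some text.</p></body></html>"))
    else:
        print("lynx not found on PATH")
```

This needs Python 3.7 or later for capture_output and text. Shelling out costs a process per document, so it probably suits occasional conversions better than bulk work.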

Gene T 2008-11-30 02:57:12

Answer 2

+5 A:

html2text is a Python program that does a pretty good job at this.

RexE 2008-11-30 03:23:58

Answer 3

+1 A:

PyParsing does a great job. Paul McGuire has several scripts on the pyparsing wiki (http://pyparsing.wikispaces.com/Examples) that are easy to adapt for various uses. One reason to invest a little time in pyparsing is that he has also written a very brief, well-organized O'Reilly Short Cut manual, which is inexpensive.

Having said that, I use BeautifulSoup a lot, and the entities issue is not that hard to deal with: you can convert them before you run BeautifulSoup.

Good luck
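One way to do that entity conversion, as a sketch using only the Python 3 standard library (html.unescape). The function name decode_entities and the sample string are assumptions for illustration, not something from the answer.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with their characters."""
    return html.unescape(text)


sample = "Caf&eacute; &amp; cr&egrave;me &#169; 2008"
decoded = decode_entities(sample)
print(decoded)
```

Be careful doing this on a whole document before parsing: an escaped &lt; becomes a literal <, which can turn text into markup. Decoding the extracted text afterwards avoids that.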

PyNEwbie 2008-11-30 15:46:19

Answer 4

+3 A:

You can also use the html2text function from the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run: sudo easy_install stripogram

GeekTantra 2009-09-23 03:21:58

According to [its PyPI page](http://pypi.python.org/pypi/stripogram), this module is deprecated: "Unless you have some historical reason for using this package, I'd advise against it!"

intuited 2010-07-24 19:02:09

Answer 5

+2 A:

I found myself facing the same problem today. I wrote a very simple HTML parser that strips all markup from incoming content and returns the remaining text with only a minimum of formatting.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On a parse failure, log the traceback and fall back to the raw input.
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
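One thing to watch for: HTMLParser also passes the contents of script and style elements to handle_data, so a parser like this will include JavaScript and CSS in its output. Below is a sketch of one way to skip them, written for Python 3. The names TextOnlyParser and visible_text are made up for this example, and it simplifies the line-break handling to a single newline for both p and br.

```python
import re
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in ("p", "br"):
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace, then drop chunks that end up empty.
            text = re.sub(r"\s+", " ", data).strip()
            if text:
                self._chunks.append(text + " ")

    def text(self):
        return "".join(self._chunks).strip()


def visible_text(markup):
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>b { color: red }</style>"
    "<script>var x = 1;</script></head>"
    "<body><b>Project:</b> DeHTML<br>Plain text only.</body></html>"
)
result = visible_text(sample)
print(result)
```

The depth counter is there so that a stray closing tag cannot push the parser into a negative state; a plain boolean flag would work for well-formed input.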

xperroni 2010-10-21 13:14:38
