views: 118 · answers: 2

I have a piece of code that extracts text from a page. It uses BeautifulSoup to first remove script, style, and noscript tags, then collect all the remaining text on the page and return it. I don't want to do anything fancy; I just want all the text on a page.
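A minimal sketch of the approach described above (the function name is illustrative; the original code was not posted):

```python
from bs4 import BeautifulSoup

def extract_text(html):
    # Parse the page, drop non-content tags, then collect the text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove the tag and its contents from the tree
    return soup.get_text(separator=" ", strip=True)

html = "<html><body><script>var x=1;</script><p>Hello</p><p>world</p></body></html>"
print(extract_text(html))  # → Hello world
```

`decompose()` removes each unwanted subtree in place, and `get_text()` then walks whatever is left.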

However, BeautifulSoup turns out to be rather slow; this operation takes an appreciable amount of time. Does anyone know of a faster library I can use for this?

I'm also using BeautifulSoup for encoding detection, but I could just extract that part and keep using it. Does anyone know of a faster alternative for that too?
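For reference, the encoding-detection part of BeautifulSoup that can be used on its own is the `UnicodeDammit` class; a small sketch of what that looks like (the sample bytes are illustrative):

```python
from bs4 import UnicodeDammit

raw = "café".encode("utf-8")  # raw bytes whose encoding we pretend not to know
dammit = UnicodeDammit(raw)
print(dammit.unicode_markup)      # the decoded text
print(dammit.original_encoding)   # the encoding it detected
```

`UnicodeDammit` tries a series of candidate encodings (and any detector library that happens to be installed) and exposes both the decoded string and its guess.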

Thanks!

+3  A: 

The answers to this question might help you find alternatives to try out.

kenny.r
Thank you, I was just looking at lxml.etree. I didn't find that question before, thanks.
Stavros Korokithakis
FYI, lxml can also use the BeautifulSoup parser :)
Tim McNamara
Glad to have been of service! Don't be afraid to accept my answer ;D
kenny.r
+1  A: 

For the record, I ended up going with lxml.html for parsing, which is really fast, and with BeautifulSoup's UnicodeDammit class for encoding detection, which is also very fast. Thanks for the tip, kenny.r!
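A sketch of that combination, assuming the input arrives as raw bytes (the function name is illustrative):

```python
from bs4 import UnicodeDammit
from lxml import html as lxml_html

def extract_text(raw_bytes):
    # Detect the encoding with BeautifulSoup's UnicodeDammit, then
    # hand the decoded markup to lxml.html for fast parsing.
    decoded = UnicodeDammit(raw_bytes).unicode_markup
    tree = lxml_html.fromstring(decoded)
    # Drop script/style/noscript subtrees before collecting text.
    for el in tree.xpath("//script | //style | //noscript"):
        el.drop_tree()
    return tree.text_content()

page = b"<html><body><script>x=1</script><p>Hello world</p></body></html>"
print(extract_text(page))
```

`drop_tree()` removes an element and its children from the lxml tree, and `text_content()` concatenates all remaining text nodes in document order.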

Stavros Korokithakis