I have a piece of code that extracts the text from an HTML page. It uses BeautifulSoup to remove the script, style and noscript tags, then finds all the remaining text in the page and returns it. I don't need anything fancy, just all the text in a page.
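To make the operation concrete, here is a minimal sketch of the kind of extraction I mean. It is not my actual code: it uses the standard library's `html.parser` instead of BeautifulSoup, and the names `SKIPPED_TAGS`, `TextExtractor` and `extract_text` are made up for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents should not count as page text (illustrative name).
SKIPPED_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text(page_source)` on a decoded HTML string returns the text chunks joined by spaces. This sketch assumes the input is already a `str`, so it does not cover the encoding detection step.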
However, BeautifulSoup turns out to be rather slow: this operation takes an appreciable amount of time. Does anyone know of a faster library I can use for this?
I'm also using BeautifulSoup for encoding detection, but I can extract the relevant code and keep using just that part.
Thanks!