I have some data I need to extract from a collection of HTML files. I am not sure whether the data resides in a div element, a table element, or a combined element (where the div tag is an element of a table); I have seen all three cases. My files are large (as big as 2 MB) and I have tens of thousands of them. So far I have looked at the td elements in the tables and at the lonely div elements. It seems to me that most of the time goes to souping the file, upwards of 30 seconds.

I played around with creating a regular expression to find the data I am looking for and then looking for the next close tag (table, tr, td, or div) to determine what type of structure my text is contained in, finding the matching open tag, snipping that section, and then wrapping it all in open and close HTML tags. The text I am after sits in a structure like:

    stuff
    <div>
    stuff
    myText
    stuff
    </div>

So I create a string that looks like:

    s = '<div>stuffmyTextstuff</div>'

I then wrap the string:

    def stringWrapper(s):
        # wrap the snipped fragment in a minimal document for the parser
        newString = '<HTML>' + s + '</HTML>'
        return newString

And then use BeautifulSoup:

    littleSoup = BeautifulSoup(stringWrapper(s))

I can then access the power of BeautifulSoup to do what I want with littleSoup.

This runs much faster than the alternative, which is to first test all the cell contents of all the tables until I find my text and, if I can't find it there, to test all the div contents.
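
Roughly, the snipping step looks like this (a simplified sketch; the regex and the backwards scan are illustrative and ignore nested tags):

    import re

    # close tags that can end the structure containing the target text
    CLOSE_TAG = re.compile(r'</(table|tr|td|div)>', re.IGNORECASE)

    def snip(raw, target):
        # locate the target text in the raw file contents
        hit = raw.find(target)
        if hit == -1:
            return None
        # the next close tag after the hit tells me which structure I am in
        close = CLOSE_TAG.search(raw, hit)
        if close is None:
            return None
        tag = close.group(1).lower()
        # nearest matching open tag before the hit (nesting is ignored)
        start = raw.rfind('<' + tag, 0, hit)
        if start == -1:
            return None
        return raw[start:close.end()]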

Am I missing something here?

+3  A: 

Have you tried lxml? BeautifulSoup is good but not super-fast, and I believe lxml can offer the same quality but often better performance.
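
A minimal sketch of the same lookup with lxml (the file name is a placeholder):

    from lxml import html

    # parse one file and find every element whose text contains the target
    tree = html.parse('page.html')
    for el in tree.getroot().xpath('//*[contains(text(), "myText")]'):
        print(el.text_content())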

Alex Martelli
+3  A: 

BeautifulSoup uses regular expressions internally (its tolerance for broken markup is what separates it from stricter XML parsers), so you'll likely find yourself just repeating what it already does. If you want a faster option, use try/except to attempt an lxml or etree parse first, then fall back to BeautifulSoup and/or tidylib to parse the broken HTML if the strict parser fails.
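
A rough sketch of that fallback (the BeautifulSoup 3 import matches the question's usage; the caller has to cope with the two result types):

    from lxml import etree
    from BeautifulSoup import BeautifulSoup

    def parse_page(text):
        try:
            # fast, strict parse first
            return etree.fromstring(text)
        except etree.XMLSyntaxError:
            # tolerant parse for tag soup
            return BeautifulSoup(text)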

It seems that for what you are doing you really want to be using XPath or XSLT to find and retrieve your data; lxml can do both.
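
For instance, one XPath expression can return the nearest enclosing div or td around your text, whichever it happens to be (a sketch; the file name is a placeholder):

    from lxml import html

    tree = html.parse('page.html')
    # nearest div or td ancestor of any text node containing the target
    containers = tree.xpath(
        '//text()[contains(., "myText")]/ancestor::*[self::div or self::td][1]')
    for el in containers:
        print(el.tag)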

Finally, given the size of your files, you should probably parse from a path or file handle so the source can be read incrementally rather than held in memory for the parse.
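
With lxml that might look like the following sketch, which streams the document and discards elements once they have been inspected:

    from lxml import etree

    # iterparse reads the file incrementally; 'page.html' is a placeholder
    for event, elem in etree.iterparse('page.html', html=True):
        text = elem.text or ''
        if elem.tag in ('div', 'td') and 'myText' in text:
            print(elem.tag)
        elem.clear()   # free elements we have already inspected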

SpliFF
+1  A: 

I don't quite understand what you are trying to do. But I do know that you don't need to enclose your div string in <html> tags; BS will parse it just fine.
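
For example (using the BeautifulSoup 3 import, to match the question):

    from BeautifulSoup import BeautifulSoup

    # a bare fragment parses fine without any <html> wrapper
    soup = BeautifulSoup('<div>stuffmyTextstuff</div>')
    print(soup.div.string)   # -> stuffmyTextstuff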

Unknown
+1  A: 

I've found that even if lxml is faster than BeautifulSoup, for documents of that size it's usually best to reduce the input to a few kB via a regex (or direct stripping) and load that into BS, as you are doing now.

Vinko Vrsalovic