Hello,

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some large pages (> 400K), BeautifulSoup truncates the HTML content.

I use the following code to get the set of "div"s:

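from BeautifulSoup import BeautifulSoup, SoupStrainer  # BeautifulSoup 3

# Parse only the <div> elements of the page
findSet = SoupStrainer('div')
divs = BeautifulSoup(htmlSource, parseOnlyThese=findSet)  # avoid shadowing the built-in set
for it in divs:
    print it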

At a certain point, the output looks like:

correct string, correct string, incomplete/truncated string ("So, I")

even though htmlSource contains the full string "So, I am bored", along with many others. I should also mention that when I prettify() the tree, the HTML source is truncated there as well.

Do you have any idea how I can fix this issue?

Thanks!

+1  A: 

Try using lxml.html. It is a faster, better HTML parser, and it deals better with broken HTML than the latest BeautifulSoup. It works fine on your example page and parses the entire page.

import lxml.html

# Fetch and parse the page, then count every <div> in the tree
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print len(doc.findall('.//div'))

The code above prints 131, the number of divs found.
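
As a sanity check that does not depend on either parser, the div count and the presence of a given string can be verified with the standard library alone. This is only a minimal sketch, assuming Python 3 (the module is named HTMLParser in Python 2). DivCounter, summarize and the sample string are made-up names for illustration and are not part of lxml or BeautifulSoup.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Counts <div> start tags and collects text to help spot truncation."""

    def __init__(self):
        super().__init__()
        self.div_count = 0
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "div":
            self.div_count += 1

    def handle_data(self, data):
        # Keep every text chunk so a substring can be searched for later.
        self.text_parts.append(data)


def summarize(html_source):
    """Return (number of div start tags, all text joined) for html_source."""
    counter = DivCounter()
    counter.feed(html_source)
    counter.close()
    return counter.div_count, "".join(counter.text_parts)


# Hypothetical sample standing in for the real page source.
sample = "<html><body><div>first</div><div>So, I am bored</div></body></html>"
div_count, text = summarize(sample)
print(div_count)
print("So, I am bored" in text)
```

Feeding the real htmlSource to summarize() and comparing the count with what BeautifulSoup or lxml reports should indicate whether the truncation happens in the parser or already in the downloaded source.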

nosklo
Thanks for your answer.
Laurențiu Dascălu