ansaurus

Question

Making BeautifulSoup ignore contents inside script tags

Answer 1

A:

Did you try replacing the angle brackets < and > with &lt; and &gt; in all the HTML that is inside the JavaScript?
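A minimal sketch of this approach, assuming the goal is to turn < and > inside script bodies into the entities &lt; and &gt; before handing the HTML to BeautifulSoup. The helper name escape_script_bodies and the regular expression are illustrative assumptions, not part of any library, and the pattern will not cope with a > inside an attribute of the opening tag.

```python
import re

# Matches an opening <script ...> tag, its body, and the closing </script> tag.
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def escape_script_bodies(html):
    """Replace < and > inside script bodies with &lt; and &gt;."""
    def _escape(match):
        open_tag, body, close_tag = match.groups()
        body = body.replace("<", "&lt;").replace(">", "&gt;")
        return open_tag + body + close_tag

    return SCRIPT_RE.sub(_escape, html)


sample = '<p>hi</p><script>document.write("<b>x</b>");</script>'
escaped = escape_script_bodies(sample)
print(escaped)
```

The escaped markup is only meant for parsing; the script itself would no longer run correctly in a browser, since its string literals have been changed.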

Jim Garrison 2009-11-14 01:52:52

Answer 2

A:

I've faced this kind of problem before, and what I normally do is replace every occurrence of <script with <!--<script and every </script> with </script>-->. That way, all the <script></script> tags are commented out.
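A minimal sketch of this replacement, assuming the intent is to wrap each script element in an HTML comment (<!-- before <script, --> after </script>) so the parser skips it. The function name comment_out_scripts is hypothetical.

```python
import re


def comment_out_scripts(html):
    """Wrap every <script>...</script> element in an HTML comment."""
    # Put a comment opener in front of every opening script tag.
    html = re.sub(r"(<script\b)", r"<!--\1", html, flags=re.IGNORECASE)
    # Put a comment closer after every closing script tag.
    html = re.sub(r"(</script\s*>)", r"\1-->", html, flags=re.IGNORECASE)
    return html


sample = '<p>a</p><script type="text/javascript">var s = "<div>";</script><p>b</p>'
commented = comment_out_scripts(sample)
print(commented)
```

This naive version breaks if a script body already contains -->, because the comment would end early.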

blwy10 2009-11-14 02:50:16

Good point, but it would not work if the script tag itself contains -->

alphageek 2009-11-14 11:00:44

That is quite true. I suppose you could implement a sanity check such that when a <script is encountered, it sets a flag that removes --> until a </script> tag is encountered, or something like that.
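A rough sketch of such a sanity check, assuming that dropping --> from the script body is acceptable because the result is only used for parsing. Instead of a literal flag, it uses a regular expression that matches each whole script block; the name comment_out_scripts_safely is hypothetical.

```python
import re

# Matches one whole script element, from <script to the next </script>.
SCRIPT_BLOCK_RE = re.compile(
    r"<script\b.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def comment_out_scripts_safely(html):
    """Comment out script elements, removing any --> found inside them."""
    def _wrap(match):
        # Drop "-->" inside the block so it cannot close the comment early.
        block = match.group(0).replace("-->", "")
        return "<!--" + block + "-->"

    return SCRIPT_BLOCK_RE.sub(_wrap, html)


sample = "<script><!-- hide\nvar a = 1;\n// --></script><p>text</p>"
safe = comment_out_scripts_safely(sample)
print(safe)
```

Removing --> alters the script source (for example, an expression such as x-->0 would change), so the output should not be served back to a browser.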

blwy10 2009-11-14 11:29:55

Answer 3

+1 A:

Reverting to BeautifulSoup 3.0.7a solved this issue and many other HTML oddities that 3.1.0.1 choked on.

alphageek 2009-11-14 10:56:52

Answer 4

A:

That would work, but the point of BeautifulSoup is parsing whatever tag soup you throw at it, even if it's horribly ill-formed.

ddaa 2009-11-14 11:09:28
