ansaurus

Question

Python: remove everything between <div class="comment"> ... </div>

Answer 1

+2 A:

You cannot properly parse HTML with regular expressions. Use an HTML parser such as lxml or BeautifulSoup.

Ignacio Vazquez-Abrams 2010-04-15 23:56:22

I am trying to remove everything, including the div itself and everything in between. I can't seem to find any reference to that in BeautifulSoup.

Michelle Jun Lee 2010-04-16 00:08:00

Another example: say I want to remove <table> ... </table>, that is, all the tables in the HTML content. I am not sure how you do that in BeautifulSoup.

Michelle Jun Lee 2010-04-16 00:10:08

Not even in the "Removing elements" subsection of the "Modifying the Parse Tree" section of the documentation?

Ignacio Vazquez-Abrams 2010-04-16 00:10:57

Yes, I saw that, but it doesn't show how to remove only the tags that have a specific class or id.

Michelle Jun Lee 2010-04-16 00:13:34

Answer 2

A:

For the record, it is usually a bad idea to process XML with regular expressions. Nevertheless:

>>> import re
>>> re.sub('>[^<]*', '>', '<div class="comment> .. any… </div>')
'<div class="comment></div>'

David S. 2010-04-15 23:58:16

I wonder if the OP wishes to also remove the bookend items of the DIV tag itself in addition to the contents.

Jarret Hardie 2010-04-16 00:06:24

Yes, basically I am trying to remove everything from the start to the end of the div. As another example, say you want to remove certain tables within the HTML content, such as all <table id="1"> ... </table>.

Michelle Jun Lee 2010-04-16 00:11:25

ah, yeah, don't use a regex!

David S. 2010-04-16 00:22:29
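
To illustrate why a regex is the wrong tool once the whole div has to go, here is a minimal sketch. The sample markup is made up for the illustration. A non-greedy pattern stops at the first </div> it meets, so a nested div leaves debris behind:

```python
import re

# Hypothetical sample: the comment div contains another div.
html_doc = (
    '<p>before</p>'
    '<div class="comment">outer <div>inner</div> tail</div>'
    '<p>after</p>'
)

# Non-greedy match from the opening tag to the first closing </div>.
pattern = re.compile(r'<div class="comment">.*?</div>', re.DOTALL)
result = pattern.sub('', html_doc)

# The text " tail" and a stray </div> are left in the output.
print(result)
```

A greedy pattern has the opposite problem: it runs to the last </div> in the document and may swallow unrelated content.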

Answer 3

A:

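A non-regex way:

pat = '<div class="comment">'
# Assumption: the input is read from the sample "file" shown below.
htmlstring = open("file").read()
# Split on every closing </div>, then cut each chunk at the opening comment div.
for chunk in htmlstring.split("</div>"):
    m = chunk.find(pat)
    if m != -1:
        chunk = chunk[:m]
    print(chunk)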

output

$ cat file
one two <tag> ....</tag>
 adsfh asdf <div class="comment"> ....remove
all ....</div>s sdfds
<div class="blah" .......
.....
blah </div>

$ ./python.py
one two <tag> ....</tag>
 adsfh asdf
s sdfds
<div class="blah" .......
.....
blah

ghostdog74 2010-04-16 00:07:40
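
Splitting on "</div>" works for the sample above but assumes the comment div never contains another div. A rough sketch of a nesting-aware variant follows, using only the standard library's html.parser. The class and function names are made up for this example, and it is not a full HTML serializer: comments and the doctype are dropped, and unbalanced markup inside the removed div will confuse the depth counter.

```python
from html.parser import HTMLParser


class CommentDivStripper(HTMLParser):
    """Re-emit markup while skipping every <div class="comment"> subtree."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of markup to keep
        self.skip_depth = 0  # > 0 while inside a div that is being removed

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside a removed div
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "comment" in classes:
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)


def strip_comment_divs(markup):
    parser = CommentDivStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


sample = '<p>a &amp; b</p><div class="comment">x <div>y</div> z</div><p>c</p>'
cleaned = strip_comment_divs(sample)
print(cleaned)
```

For anything beyond a quick script, a real parser such as BeautifulSoup or lxml is still the safer choice.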

Answer 4

+4 A:

This can be done easily and reliably using an HTML parser like BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<body><div>1</div><div class="comment"><strong>2</strong></div></body>')
>>> for div in soup.findAll('div', 'comment'):
...   div.extract()
... 
<div class="comment"><strong>2</strong></div>
>>> soup
<body><div>1</div></body>

See this question for examples on why parsing HTML using regular expressions is a bad idea.

Ayman Hourieh 2010-04-16 00:26:05
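
The session above uses the BeautifulSoup 3 import. A sketch of the same idea with its successor, the bs4 package (assumed to be installed as beautifulsoup4), also covering the <table id="1"> case raised in the comments. The sample markup is made up for the illustration:

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Hypothetical sample markup.
html_doc = (
    '<body><p>keep</p>'
    '<div class="comment"><b>drop</b></div>'
    '<table id="1"><tr><td>drop</td></tr></table>'
    '<table id="2"><tr><td>keep</td></tr></table>'
    '</body>'
)

soup = BeautifulSoup(html_doc, "html.parser")

# Match on the class attribute, then remove each tag together with its children.
for div in soup.find_all("div", class_="comment"):
    div.decompose()

# Match on the id attribute in the same way.
for table in soup.find_all("table", id="1"):
    table.decompose()

cleaned = str(soup)
print(cleaned)
```

decompose() discards the tag entirely, whereas extract() removes it from the tree and returns it, which is useful if the removed content is still needed.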

Answer 5

A:

Use BeautifulSoup to find all of those elements, then remove each one:

from BeautifulSoup import BeautifulSoup

tomatosoup = BeautifulSoup(myhtml)

tomatochunks = tomatosoup.findAll("div", {"class": "comment"})

for chunk in tomatochunks:
    chunk.extract()  # remove the div and everything inside it

JiminyCricket 2010-04-16 00:43:03

Also, if it's XML and not HTML, use BeautifulStoneSoup: http://www.crummy.com/software/BeautifulSoup/documentation.html

JiminyCricket 2010-04-16 00:43:35
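
If the input really is well-formed XML, the standard library can do the job as well. Note that this sketch swaps in xml.etree.ElementTree instead of BeautifulStoneSoup, and the sample document is made up for the illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XML sample.
xml_doc = '<root><div>1</div><div class="comment"><strong>2</strong></div></root>'
root = ET.fromstring(xml_doc)

# ElementTree elements have no parent pointer, so walk every potential
# parent and remove the matching children from it.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "div" and child.get("class") == "comment":
            # Note: remove() also discards any text that follows the child.
            parent.remove(child)

result = ET.tostring(root, encoding="unicode")
print(result)
```

This only works when the document parses as XML; typical real-world HTML will raise a ParseError here.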

Answer 6

+2 A:

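With lxml.html:

from lxml import html

# html_string is assumed to hold the HTML source
doc = html.fromstring(html_string)
for el in doc.cssselect('div.comment'):
    el.drop_tree()  # removes the element and its children, keeps its tail text
result = html.tostring(doc)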

Ian Bicking 2010-04-16 02:56:14
