Hi

How do you use Python 2.6 to remove a <div class="comment"> ... </div> element, including the tags themselves and everything between them?

I tried various approaches using re.sub without any success.

Thank you

+2  A: 

You cannot properly parse HTML with regular expressions. Use an HTML parser such as lxml or BeautifulSoup.

Ignacio Vazquez-Abrams
I am trying to remove the div and everything in between. I can't seem to find any reference to that in BeautifulSoup.
Michelle Jun Lee
Another example: say I want to remove <table> ... </table>, i.e. all the tables in the HTML content. I am not sure how to do that in BeautifulSoup.
Michelle Jun Lee
Not even in the "Removing elements" subsection of the "Modifying the Parse Tree" section of the documentation?
Ignacio Vazquez-Abrams
Yes, I saw that, but I can't see how to remove only the tags with a specific class or id.
Michelle Jun Lee
A: 

For the record, it is usually a bad idea to process XML with regular expressions. Nevertheless:

>>> re.sub('>[^<]*', '>', '<div class="comment"> .. any… </div>')
'<div class="comment"></div>'
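To illustrate why this is fragile, here is a small sketch (not from the original answer) of a pattern that tries to remove the whole element. The sample markup and the pattern are assumptions made up for the demonstration. A non-greedy match stops at the first closing tag, so a div nested inside the comment leaves debris behind:

```python
import re

# Hypothetical sample input: the comment div contains a nested div.
html = ('<p>keep</p>'
        '<div class="comment">outer <div>inner</div> tail</div>'
        '<p>also keep</p>')

# Non-greedy match: ends at the FIRST </div>, which closes the inner div,
# not the comment div itself.
pattern = re.compile(r'<div class="comment">.*?</div>', re.DOTALL)
result = pattern.sub('', html)

print(result)  # the text " tail" and a stray </div> are left in the output
```

A greedy `.*` has the opposite problem: it runs to the last `</div>` in the document and can swallow unrelated content.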
David S.
I wonder if the OP wishes to also remove the bookend items of the DIV tag itself in addition to the contents.
Jarret Hardie
Yes, basically I am trying to remove everything from the start to the end of the div. Another example: say you want to remove a certain table within the HTML content, such as every <table id="1"> ... </table>.
Michelle Jun Lee
ah, yeah, don't use a regex!
David S.
A: 

A non-regex way:

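pat = '<div class="comment">'
# Split on every closing </div>, then cut each chunk at the opening comment div.
# Note: this drops all </div> tags and does not handle divs nested inside the comment.
for chunk in htmlstring.split("</div>"):
    m = chunk.find(pat)
    if m != -1:
        chunk = chunk[:m]
    print(chunk)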

output

$ cat file
one two <tag> ....</tag>
 adsfh asdf <div class="comment"> ....remove
all ....</div>s sdfds
<div class="blah" .......
.....
blah </div>

$ ./python.py
one two <tag> ....</tag>
 adsfh asdf
s sdfds
<div class="blah" .......
.....
blah
ghostdog74
+4  A: 

This can be done easily and reliably using an HTML parser like BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<body><div>1</div><div class="comment"><strong>2</strong></div></body>')
>>> for div in soup.findAll('div', 'comment'):
...   div.extract()
... 
<div class="comment"><strong>2</strong></div>
>>> soup
<body><div>1</div></body>

See this question for examples on why parsing HTML using regular expressions is a bad idea.
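If a third-party parser is not an option, the same idea can be sketched with the standard library alone. The following is a minimal illustration for Python 3 (not Python 2.6) using `html.parser`; the names `CommentDivStripper` and `strip_comment_divs` are hypothetical, and the sketch drops HTML comments and doctype declarations rather than re-emitting them. It counts nested divs so that the matching closing tag is the one that ends the skipped region:

```python
from html.parser import HTMLParser


class CommentDivStripper(HTMLParser):
    """Re-emit the input, skipping every <div class="comment"> subtree."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a div that is being removed

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside the removed one
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "comment" in classes:
            self.skip_depth = 1  # start skipping, do not emit the tag
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)


def strip_comment_divs(html):
    parser = CommentDivStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<body><div>1</div><div class="comment"><strong>2</strong>'
          '<div>nested</div></div><p>3 &amp; 4</p></body>')
print(strip_comment_divs(sample))
```

For real-world pages with malformed markup, BeautifulSoup or lxml remain the more forgiving choice.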

Ayman Hourieh
A: 

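Use BeautifulSoup and do something like this to get all of those elements, then remove each one:

from BeautifulSoup import BeautifulSoup

tomatosoup = BeautifulSoup(myhtml)

# findAll is case-sensitive; it returns every <div class="comment">
tomatochunks = tomatosoup.findAll("div", {"class": "comment"})

for chunk in tomatochunks:
    chunk.extract()  # detach the element and its contents from the tree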
JiminyCricket
Also, if it's XML and not HTML, use BeautifulStoneSoup: http://www.crummy.com/software/BeautifulSoup/documentation.html
JiminyCricket
+2  A: 

With lxml.html:

from lxml import html
doc = html.fromstring(input)
for el in doc.cssselect('div.comment'):
    el.drop_tree()
result = html.tostring(doc)
Ian Bicking