ansaurus

Question

Regex try and match until hitting end tag in python

Answer 1

+4 A:

Don't use a regular expression to parse HTML. Use lxml or BeautifulSoup.

Ignacio Vazquez-Abrams 2010-02-14 00:36:46

I've tried BeautifulSoup and it didn't handle the HTML that well. Perhaps I was doing something wrong (I'm parsing a ton of HTML), but compared to my regex parser it didn't do a great job. I figured this would be something pretty straightforward to answer.

Michael 2010-02-14 00:42:30

@Michael: You should mention attempts that you've tried in your question. You **must** have known that you were going to get this response if you didn't mention it?

Mark Byers 2010-02-14 00:45:21

@Michael, can you post some of the HTML that isn't working with BS? Perhaps someone can tell you how it should work

gnibbler 2010-02-14 01:22:43

I tried BeautifulSoup to remove all the HTML tagging in the past. Basically, I run through the HTML with the tags still in place, get the data I need, and then strip out all the tagging. BeautifulSoup tended to leave some of the tagging behind and my regex HTML cleaner didn't, so I decided not to use BeautifulSoup any longer.

Michael 2010-02-14 03:12:17

Answer 2

+3 A:

Don't use regular expressions to parse HTML -- use an HTML parser, such as BeautifulSoup.

Specifically, your situation is basically one of having to deal with "nested parentheses" (where an open "parens" is an opening <table> tag and the corresponding closed parens is the matching </table>) -- exactly the kind of parsing tasks that regular expressions can't perform well. Lots of the work in parsing HTML is exactly connected with this "matched parentheses" issue, which makes regular expressions a perfectly horrible choice for the purpose.
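To make the "matched parentheses" point concrete, here is a minimal sketch (not part of the original answer) of what tracking nesting involves. It uses Python 3's standard-library html.parser instead of BeautifulSoup, purely so that it is self-contained; the class name TableCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the markup of each top-level table, counting nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <table> tags are currently open
        self.parts = []   # pieces of the table being collected
        self.tables = []  # finished top-level tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        if self.depth:
            # Start tags are copied exactly as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, as written.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        # End tags are rebuilt from the (lowercased) tag name.
        self.parts.append("</%s>" % tag)
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.tables.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


sample = "<TABLE>outer <TABLE BGCOLOR='red'>inner</TABLE> tail</TABLE><table>x</table>"
collector = TableCollector()
collector.feed(sample)
collector.close()

# Keep only the top-level tables that mention BGCOLOR anywhere inside them.
hits = [t for t in collector.tables if "bgcolor" in t.lower()]
```

The depth counter is the part a plain regular expression cannot express: the outer table only ends when the count returns to zero, however many tables were opened inside it.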

You mention in a comment to another answer that you've had unspecified problems with BS -- I suspect you were trying the latest, 3.1 release (which has gone downhill) instead of the right one; try 3.0.8 instead, as BS's own docs recommend, and you could be better off.

If you've made some kind of pact with Evil never to use the right tool for the job, your task might not be totally impossible if you don't need to deal with nesting (just matching), i.e., there is never a table inside another table. In this case you can identify one table with r'<\s*TABLE(.*?)<\s*/\s*TABLE' (with suitable flags such as re.DOTALL and re.I); loop over all such matches with the finditer method of regular expressions; and in the loop's body check whether BGCOLOR (in a case-insensitive sense) happens to be inside the body of the current match. It's still going to be more fragile, and more work, than using an HTML parser, but while definitely an inferior choice it need not be a desperate situation.
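Spelled out, the loop described above might look roughly like this. It is only a sketch under the stated no-nesting assumption; the function name tables_with_bgcolor and the sample markup are invented for illustration.

```python
import re

# The pattern from the answer: lazily match from an opening TABLE tag
# up to the first closing TABLE tag that follows it.
TABLE_RE = re.compile(r'<\s*TABLE(.*?)<\s*/\s*TABLE', re.DOTALL | re.I)


def tables_with_bgcolor(html):
    """Return the captured text of every table that mentions BGCOLOR.

    The captured group holds the rest of the opening tag (its attributes)
    plus everything up to the closing tag. Assumes tables are never nested.
    """
    found = []
    for match in TABLE_RE.finditer(html):
        if 'bgcolor' in match.group(1).lower():
            found.append(match.group(1))
    return found


sample = """
<TABLE>
<B>Item 1.</B>
</TABLE>
some text
<table border="0">
blah
bgcolor
blah
</table>
<TABLE>
<B>Item 2.</B>
</TABLE>
"""

found = tables_with_bgcolor(sample)
```

Because the group starts right after the word TABLE, a BGCOLOR attribute on the table tag itself is also picked up, not just one that appears further inside the table.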

If you do have nested tables to contend with, then it is a desperate situation.
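A small demonstration of why nesting is the breaking point, using the same pattern on a made-up nested sample (the variable names are illustrative only):

```python
import re

TABLE_RE = re.compile(r'<\s*TABLE(.*?)<\s*/\s*TABLE', re.DOTALL | re.I)

nested = "<TABLE>outer <TABLE>inner BGCOLOR</TABLE> tail</TABLE>"

# The lazy match stops at the FIRST closing tag it sees, which belongs
# to the inner table, so the outer table is cut short.
matches = [m.group(0) for m in TABLE_RE.finditer(nested)]
```

Here the single match runs from the outer opening tag to the inner closing tag, and the rest of the outer table (" tail") is never captured. Making the match greedy does not help either, since it would then swallow everything up to the last closing tag in the whole document.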

Alex Martelli 2010-02-14 00:37:17

Thanks, I'll give the soup another try. To be fair, my knock on it was with its ability to remove all HTML tagging in a given document. On a side note, I am running 3.0.8.

Michael 2010-02-14 03:09:41

Answer 3

A:

If your task is just this simple, here's a way: split on </TABLE>, then iterate over the items and look for the pattern you want.

myhtml="""
<TABLE>
<B>Item 1.</B>
</TABLE>

some text1
some text2
some text3

<TABLE>
blah
BGCOLOR
blah
</TABLE>

some texet
<TABLE>
<B>Item 2.</B>
</TABLE>
"""

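# Each chunk ends where a table closes; keep chunks that open a table and mention BGCOLOR.
for tab in myhtml.split("</TABLE>"):
    if "<TABLE>" in tab and "BGCOLOR" in tab:
        # Drop whatever precedes the opening tag so only the table body is printed.
        print(''.join(tab.split("<TABLE>")[1:]))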

output

$ ./python.py

blah
BGCOLOR
blah

ghostdog74 2010-02-14 01:30:59

Good call, but the above is oversimplified, as the HTML doc has a lot of content between the tables.

Michael 2010-02-14 03:04:45

Then you do another split on <TABLE> and take element 1 onwards. See my edit.

ghostdog74 2010-02-14 03:30:13

Answer 4

A:

Here's the code that ended up working for me. It finds the correct table and adds more tagging around it so that it is identified from the group with open and close tags of 'realTable'.

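import re
from BeautifulSoup import BeautifulSoup, NavigableString, Tag  # BeautifulSoup 3.x

soup = BeautifulSoup(''.join(text))
for p in soup.findAll('table'):
    # Case-insensitive search, across newlines, for BGCOLOR anywhere in the table.
    if re.search('BGCOLOR', str(p), re.S|re.I):
        # Swap the table for a <realTable> wrapper, then put the table's markup back inside it.
        tags = Tag(soup, "realTable")
        p.replaceWith(tags)
        tags.insert(0, NavigableString(str(p)))
print(soup)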

prints this out:

<table><b>Item 1.</b></table>
<realTable><table>blah BGCOLOR blah</table></realTable>
<table><b>Item 2.</b></table>

Michael 2010-02-14 06:50:39
