Hi all, I'm looking for a bit of help with a regex in Python, and Google is failing me. I'm searching some HTML for a certain type of table: any table that includes a background color attribute (i.e. BGCOLOR). Some tables have this attribute and some do not. Could someone help me write a regex that finds the start of a table and then searches for BGCOLOR, but stops and moves on if it hits the end of the table first?

Here's a very simplified example that will serve the purpose:

`<TABLE>
<B>Item 1.</B>
</TABLE>

<TABLE>
BGCOLOR
</TABLE>

<TABLE>
<B>Item 2.</B>
</TABLE>`

So we have three tables, but I'm only interested in finding the middle table, the one that contains 'BGCOLOR'. The problem with my regex at the moment is that it searches for the starting table tag, then looks for 'BGCOLOR', and doesn't care if it passes the table end tag:

tables = re.findall('\<table.*?BGCOLOR=".*?".*?\<\/table\>', text, re.I|re.S)

So it would find the first two tables instead of just the second table. Let me know if anyone knows how to handle this situation.
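
A minimal sketch that reproduces the overshoot. The BGCOLOR value and the table contents are made up for illustration, and the pattern is the one above with the unneeded backslashes dropped:

```python
import re

# Hypothetical sample: three tables, only the middle one has BGCOLOR.
text = """<TABLE>
<B>Item 1.</B>
</TABLE>

<TABLE BGCOLOR="#CCCCCC">
<TR><TD>wanted</TD></TR>
</TABLE>

<TABLE>
<B>Item 2.</B>
</TABLE>"""

# Same pattern as above, minus the unneeded backslashes.
pattern = r'<table.*?BGCOLOR=".*?".*?</table>'
tables = re.findall(pattern, text, re.I | re.S)

# The lazy .*? after <table is still free to run past the first </TABLE>,
# so the single match starts at table one and ends at table two.
print(len(tables))                # 1
print(tables[0].count('<TABLE'))  # 2
```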

Thanks, Michael

+4  A: 

Don't use a regular expression to parse HTML. Use lxml or BeautifulSoup.

Ignacio Vazquez-Abrams
I've tried BSoup and it didn't handle the HTML that well. Perhaps I was doing something wrong (I'm parsing a ton of HTML), but compared to my regex parser it didn't do a great job. I figured this would be something pretty straightforward to answer.
Michael
@Michael: You should mention attempts that you've tried in your question. You **must** have known that you were going to get this response if you didn't mention it?
Mark Byers
@Michael, can you post some of the HTML that isn't working with BS? Perhaps someone can tell you how it should work
gnibbler
I tried BSoup to remove all the HTML tagging in the past. Basically, I run through the HTML with tags still in place, get the data I need, and then strip out all the tagging. BSoup tended to leave some of the tagging behind and my regex HTML cleaner didn't, so I decided not to use BSoup any longer.
Michael
+3  A: 

Don't use regular expressions to parse HTML -- use an HTML parser, such as BeautifulSoup.

Specifically, your situation is basically one of having to deal with "nested parentheses" (where an open "parens" is an opening <table> tag and the corresponding closed parens is the matching </table>) -- exactly the kind of parsing task that regular expressions can't perform well. Lots of the work in parsing HTML is exactly connected with this "matched parentheses" issue, which makes regular expressions a perfectly horrible choice for the purpose.
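
To illustrate the "matched parentheses" point, here is a minimal sketch of the depth counting a parser has to do. It uses Python 3's standard-library html.parser rather than BeautifulSoup or lxml, purely so that it runs without extra installs; the class name TableFinder and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """For each outermost table, record whether any tag in it has bgcolor."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <table> tags are currently open
        self.results = []   # one bool per outermost table

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == 'table':
            if self.depth == 0:
                self.results.append(False)
            self.depth += 1
        if self.depth > 0 and any(name == 'bgcolor' for name, _ in attrs):
            self.results[-1] = True

    def handle_endtag(self, tag):
        if tag == 'table' and self.depth > 0:
            self.depth -= 1


# Hypothetical sample: a nested table followed by a plain one.
sample = """<TABLE><TR><TD>
  <TABLE><TR><TD BGCOLOR="#CCCCCC">inner</TD></TR></TABLE>
</TD></TR></TABLE>
<TABLE><B>Item 2.</B></TABLE>"""

finder = TableFinder()
finder.feed(sample)
print(finder.results)  # [True, False]
```

The counter is what a single regular expression lacks: the inner </TABLE> only lowers the depth to 1, so the outer table is not considered closed until its own end tag arrives.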

You mention in a comment to another answer that you've had unspecified problems with BS -- I suspect you were trying the latest, 3.1 release (which has gone downhill) instead of the right one; try 3.0.8 instead, as BS's own docs recommend, and you could be better off.

If you've made some kind of pact with Evil never to use the right tool for the job, your task might not be totally impossible if you don't need to deal with nesting (just matching), i.e., there is never a table inside another table. In this case you can identify one table with r'<\s*TABLE(.*?)<\s*/\s*TABLE' (with suitable flags such as re.DOTALL and re.I); loop over all such matches with the finditer method of regular expressions; and in the loop's body check whether BGCOLOR (in a case-insensitive sense) happens to be inside the body of the current match. It's still going to be more fragile, and more work, than using an HTML parser, but while definitely an inferior choice it need not be a desperate situation.
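
The non-nested approach just described might look roughly like this. It is only a sketch: the function name tables_with_bgcolor and the sample markup are made up for illustration, and it assumes no table ever sits inside another.

```python
import re

# Pattern from the paragraph above; only valid when tables are never nested.
TABLE_RE = re.compile(r'<\s*TABLE(.*?)<\s*/\s*TABLE', re.DOTALL | re.I)


def tables_with_bgcolor(html):
    """Return the captured text of every table that mentions BGCOLOR."""
    found = []
    for match in TABLE_RE.finditer(html):
        body = match.group(1)  # attributes plus contents of one table
        if 'bgcolor' in body.lower():  # case-insensitive containment check
            found.append(body)
    return found


# Hypothetical sample: three tables, only the middle one has BGCOLOR.
sample = """<TABLE>
<B>Item 1.</B>
</TABLE>

<TABLE BGCOLOR="#CCCCCC">
<TR><TD>wanted</TD></TR>
</TABLE>

<TABLE>
<B>Item 2.</B>
</TABLE>"""

hits = tables_with_bgcolor(sample)
print(len(hits))  # 1
```

Because each match stops at the first closing tag, the BGCOLOR check can never see text from a neighbouring table, which is the property the original single-pattern attempt was missing.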

If you do have nested tables to contend with, then it is a desperate situation.

Alex Martelli
Thanks, I'll give the soup another try, and to be fair, my knock on it was with its ability to remove all HTML tagging in a given document. On a side note, I am running 3.0.8.
Michael
A: 

If your task is just this simple, here's a way: split on </TABLE>, then iterate over the pieces and look for the pattern you want.

myhtml="""
<TABLE>
<B>Item 1.</B>
</TABLE>

some text1
some text2
some text3

<TABLE>
blah
BGCOLOR
blah
</TABLE>

some texet
<TABLE>
<B>Item 2.</B>
</TABLE>
"""

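# Splitting on the closing tag leaves at most one opening <TABLE> per piece (no nesting assumed).
for tab in myhtml.split("</TABLE>"):
    if "<TABLE>" in tab and "BGCOLOR" in tab:
        # Drop whatever text precedes the opening tag.
        print(''.join(tab.split("<TABLE>")[1:]))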

output

$ ./python.py

blah
BGCOLOR
blah
ghostdog74
Good call, but the above is oversimplified, as the HTML doc has mucho content between the tables.
Michael
Then you do another split on <TABLE> and take element 1 onwards. See my edit.
ghostdog74
A: 

Here's the code that ended up working for me. It finds the correct table and wraps it in opening and closing 'realTable' tags so that it can be told apart from the other tables.

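import re
from BeautifulSoup import BeautifulSoup, Tag, NavigableString  # BeautifulSoup 3.x

soup = BeautifulSoup(''.join(text))
for p in soup.findAll('table'):
    pattern = '.*BGCOLOR.*'
    # Wrap any table whose markup mentions BGCOLOR in a <realTable> tag.
    if (re.match(pattern, str(p), re.S|re.I)):
        tags = Tag(soup, "realTable")
        p.replaceWith(tags)
        text = NavigableString(str(p))
        tags.insert(0, text)
print soup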

prints this out:

<table><b>Item 1.</b></table>
<realTable><table>blah BGCOLOR blah</table></realTable>
<table><b>Item 2.</b></table>
Michael