ansaurus

Question

Python regex on list

Answer 1

+3 A:

It's good that you're trying to use BeautifulSoup to parse HTML, but this won't work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

You're trying to parse a BeautifulSoup object using a regular expression. Instead you should be using the findAll method on the soup, like this:

import re

# Match <h1> tags whose class attribute starts with this string.
regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs={'class': regex}):
    print(tag.contents)
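
A self-contained version of the same idea might look like the sketch below. It assumes BeautifulSoup 4 (imported as bs4), where find_all is the current name for findAll; the sample markup and the helper name matching_headings are made up for illustration. Because bs4 treats class as a multi-valued attribute, the sketch matches on one class name instead of the full attribute string.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (the bs4 package) is installed

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<h1 class="title metadata_title content_perceived_text">First <b>title</b></h1>
<h1 class="unrelated">Skipped</h1>
"""

def matching_headings(html):
    """Return the <h1> tags that carry the metadata_title class."""
    soup = BeautifulSoup(html, "html.parser")
    # bs4 treats class as a multi-valued attribute, so the pattern is
    # tested against each class name of a tag.
    pattern = re.compile(r"^metadata_title$", re.IGNORECASE)
    return soup.find_all("h1", attrs={"class": pattern})

for tag in matching_headings(SAMPLE_HTML):
    print(tag.contents)    # list of child nodes: text and nested tags
    print(tag.get_text())  # all text inside the tag, as a single string
```

Here tag.contents keeps nested tags as objects, while get_text() flattens everything inside the heading into one string, which is usually what you want when you only care about the text.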

If you do actually want to parse the document as text with a regular expression, then don't use BeautifulSoup: just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works, as this is the preferred way to do it. See the documentation for more details.
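
For completeness, the plain-text route could be sketched as follows, using only the standard re module. The sample string is made up for illustration, and the pattern is an assumed cleaned-up form of the one in the question (it skips to the end of the opening tag before capturing); treat it as a rough sketch, since regular expressions are fragile against real-world HTML.

```python
import re

# Hypothetical page source, invented for this illustration; in practice this
# would be the text read from a file or an HTTP response.
html = '''<h1 class="title metadata_title content_perceived_text">First title</h1>
<p>body</p>
<H1 class="title metadata_title content_perceived_text">Second
title</H1>'''

# [^>]* skips the rest of the opening tag; the group captures everything up
# to the nearest closing </h1>. DOTALL lets "." span line breaks and
# IGNORECASE accepts <H1> as well as <h1>.
pattern = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE,
)

titles = pattern.findall(html)  # findall expects a string, not a soup object
print(titles)
```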

Mark Byers 2010-05-22 10:59:02

Ah yes, BUT it won't find the rest. I have real problems getting BS to find the contents from within the tags.

Peter Nielsen 2010-05-22 11:01:58

@Peter Nielsen: Can you explain what you mean by 'it won't find the rest'? Does my update answer your question?

Mark Byers 2010-05-22 11:36:42

Well, using BS rather than a regex leaves me with the problem of how to find the contents inside the tags, and not just the entire tag plus content. Ty for answering so speedily, by the way :-)

Peter Nielsen 2010-05-22 11:43:44

@Peter Nielsen: "how i find the contents inside the tags". Try this: `for tag in soup.findAll('h1'): print tag.contents`

Mark Byers 2010-05-22 11:56:43

Uhhhhh... very, very, very nice. I just got tingly all over ;-) Ty very much.

Peter Nielsen 2010-05-22 12:13:34

@Peter, since you like the answer you should upvote and accept it -- this is really fundamental SO etiquette!

Alex Martelli 2010-05-22 17:31:45

Ah, thank you.. Got it ..

Peter Nielsen 2010-05-24 11:32:05

The thing is, though, what to do if one of the tags that I am looking for with BS does not have an end tag? It would seem that BS fails in such a case.

Peter Nielsen 2010-05-24 11:34:01

@Peter Nielsen: I'm not exactly sure what the problem is. I know that BeautifulSoup can handle invalid HTML but I don't know all the details of how it handles missing end tags. It's rather difficult to go into details in comments due to length limits, lack of formatting, etc. I would suggest that you create a new question describing what new issue you have with some examples of how it fails and what you want, then I am sure that I or someone else on Stack Overflow will be able to help you.

Mark Byers 2010-05-24 11:50:10
