Consider the following:

<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>

How would you go about extracting the sitemap line with a regex in Python?

<a href="/sitemap">Sitemap</a>

The following can be used to pull out the anchor tags.

'/<a(.*?)a>/i'

However, there are multiple anchor tags. There are also multiple hotlink divs, so we can't really use those to pick out the line either?

+11  A: 

Don't use a regex. Use BeautifulSoup, an HTML parser.

from bs4 import BeautifulSoup  # Beautiful Soup 4 (the bs4 package)

html = \
"""
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""

soup = BeautifulSoup(html)
# Take the third <div id="hotlink"> (index 2), then its first <a> child.
print(soup.findAll("div", id="hotlink")[2].a)

# <a href="/sitemap">Sitemap</a>
Unknown
+3  A: 

Parsing HTML with regular expressions is a bad idea!

Think about the following piece of HTML:

<a></a > <!-- legal html, but won't pass your regex -->

<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>

There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.
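To make the two cases above concrete, here is a small sketch that runs the pattern from the question against them. It assumes the question's `/<a(.*?)a>/i` translates to Python as `re.compile(r'<a(.*?)a>', re.IGNORECASE)`; the variable names are only for illustration.

```python
import re

# The question's pattern, translated to Python (the /i flag becomes re.IGNORECASE).
naive = re.compile(r'<a(.*?)a>', re.IGNORECASE)

# Case 1: whitespace before the closing '>' is legal HTML,
# but the pattern requires a literal 'a>' and finds none.
spaced = '<a></a >'
spaced_match = naive.search(spaced)

# Case 2: the comment contains 'a>', so the lazy match stops inside the comment.
commented = '<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>'
commented_match = naive.search(commented)

print(spaced_match)              # no match at all
print(commented_match.group(0))  # truncated in the middle of the comment
```

The first input yields no match, and the second yields a match that is cut off before the real `</a>`.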

You should consider using the Beautiful Soup Python HTML parser.

Anyhow, an ad hoc solution using a regex is

import re

data = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

# Matches one anchor per line; '.' does not cross newlines by default.
e = re.compile(r'<a *[^>]*>.*</a *>')

print(e.findall(data))

Output:

>>> e.findall(data)
['<a href="foo1.com">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']
Elazar Leibovich
If you replace that `.*` with `(?:[^<]+|<(?!/a\b))*`, you'll get fewer false positives, without blowing up the regex engine with backtracking.
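As a rough illustration of the false positives in question, the sketch below compares the answer's greedy `.*` with a body that uses the negative lookahead `(?!/a\b)`. The one-line input with two anchors is an assumed example, not taken from the thread, and the comparison says nothing about performance.

```python
import re

# Assumed input: two anchors on the same line.
line = '<a href="/extract">Extract</a> | <a href="/sitemap">Sitemap</a>'

# Greedy body: '.*' runs to the last '</a>' on the line.
greedy = re.compile(r'<a *[^>]*>.*</a *>')

# Tempered body: consume text without '<', or a '<' that does not begin '</a'.
tempered = re.compile(r'<a *[^>]*>(?:[^<]+|<(?!/a\b))*</a *>')

greedy_matches = greedy.findall(line)
tempered_matches = tempered.findall(line)

print(greedy_matches)    # one match spanning both anchors
print(tempered_matches)  # one match per anchor
```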
Ben Blank
+1  A: 

Use BeautifulSoup or lxml if you need to parse HTML.

Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?

If you really have to use regular expressions, have a look at findall.
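For what it's worth, a minimal `re.findall` sketch might look like the following. It assumes every anchor has a double-quoted `href` as its only attribute and contains no nested tags, which holds for the snippet in the question but not for HTML in general.

```python
import re

html = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

# With two capture groups, findall returns a list of (href, text) tuples.
anchor = re.compile(r'<a\s+href="([^"]*)"\s*>([^<]*)</a\s*>', re.IGNORECASE)
links = anchor.findall(html)

last_link = links[-1]                                       # the last link
sitemap_links = [l for l in links if l[0] == "/sitemap"]    # links pointing to /sitemap

print(links)
print(last_link)
print(sitemap_links)
```

Which of these selections is appropriate depends on the answer to the questions above.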

Filip Salomonsson
A: 

To extract the contents of this tag:

    <a href="/sitemap">Sitemap</a>

... I would use:

    >>> import re
    >>> s = '''
    <div id=hotlinklist>
    <a href="foo1.com">Foo1</a>
      <div id=hotlink>
        <a href="/">Home</a>
      </div>
      <div id=hotlink>
        <a href="/extract">Extract</a>
      </div>
      <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
      </div>
    </div>'''
    >>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
    >>> m.group(1)
    'Sitemap'
Alex
Actually, replace sitemap with XYZ, as it really can be anything. I would only know that it is the third div within the hotlinklist div. The HTML pattern used can be repeated many times. Let's say I want to extract all the smartphone listings on eBay. I would know that the above pattern is repeated for each smartphone found; however, the <a href="XYZ">XYZ</a> can be an iPhone, BlackBerry, Nokia, or any other smartphone. There could be no items or hundreds. So I was looking for something that says: find the repeated pattern, then take the smartphone line out of each occurrence and build a list of smartphones.
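One way to sketch this "repeated pattern, then the Nth entry" idea without a regex is Python's standard-library `html.parser`. The class name `HotlinkCollector`, the helper `nth_links`, and the second `hotlinklist` block in the sample data are all hypothetical; the sketch also assumes the divs are properly closed, as in the snippet from the question.

```python
from html.parser import HTMLParser


class HotlinkCollector(HTMLParser):
    """Collect (href, text) for each <a> directly inside a <div id=hotlink>,
    grouped by the enclosing <div id=hotlinklist>."""

    def __init__(self):
        super().__init__()
        self.groups = []     # one list of (href, text) per hotlinklist div
        self._div_ids = []   # ids of the currently open divs, outermost first
        self._href = None    # href of the anchor being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self._div_ids.append(attrs.get("id"))
            if attrs.get("id") == "hotlinklist":
                self.groups.append([])
        elif (tag == "a" and self._div_ids
              and self._div_ids[-1] == "hotlink"
              and "hotlinklist" in self._div_ids):
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._div_ids:
            self._div_ids.pop()
        elif tag == "a" and self._href is not None:
            self.groups[-1].append((self._href, "".join(self._text).strip()))
            self._href = None


def nth_links(html, n):
    """Return the n-th (0-based) hotlink of every hotlinklist that has one."""
    parser = HotlinkCollector()
    parser.feed(html)
    parser.close()
    return [group[n] for group in parser.groups if len(group) > n]


sample = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink><a href="/">Home</a></div>
  <div id=hotlink><a href="/extract">Extract</a></div>
  <div id=hotlink><a href="/sitemap">Sitemap</a></div>
</div>
<div id=hotlinklist>
  <div id=hotlink><a href="/a">A</a></div>
  <div id=hotlink><a href="/b">B</a></div>
  <div id=hotlink><a href="/xyz">XYZ</a></div>
</div>
"""

third_links = nth_links(sample, 2)
print(third_links)
```

Lists with fewer than three hotlinks are skipped, so the result may be empty or contain hundreds of entries depending on the page.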
VN44CA