ansaurus

Question

Python -- Regex -- How to find a string between two sets of strings

Answer 1

+11 A:

Don't use a regex. Use BeautifulSoup, an HTML parser.

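from bs4 import BeautifulSoup  # BeautifulSoup 4 (package "beautifulsoup4")

html = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# Take the third <div id=hotlink> (index 2), then its first <a> child.
print(soup.find_all("div", id="hotlink")[2].a)

# <a href="/sitemap">Sitemap</a>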

Unknown 2009-05-11 20:32:41
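
If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is an illustration only, not the answer's method: the class name HotlinkCollector and its attributes are made up for this example, and it assumes the links of interest are `<a>` elements nested inside `<div id=hotlink>`.

```python
from html.parser import HTMLParser


class HotlinkCollector(HTMLParser):
    """Collect (href, text) for every <a> nested inside a <div id=hotlink>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: is it a hotlink div?
        self.current = None   # [href, text_parts] while inside a relevant <a>
        self.links = []       # finished (href, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_stack.append(attrs.get("id") == "hotlink")
        elif tag == "a" and any(self.div_stack):
            self.current = [attrs.get("href"), []]

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "a" and self.current is not None:
            href, parts = self.current
            self.links.append((href, "".join(parts).strip()))
            self.current = None


html_doc = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap<!-- a comment containing a>b --></a>
  </div>
</div>"""

collector = HotlinkCollector()
collector.feed(html_doc)
collector.close()

print(collector.links[2])  # ('/sitemap', 'Sitemap')
```

The Foo1 link is skipped because it sits directly in the hotlinklist div, and the HTML comment inside the last link never reaches handle_data, so it does not leak into the text.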

Answer 2

+3 A:

Parsing HTML with regular expressions is a bad idea!

Consider the following pieces of HTML:

<a></a > <!-- legal html, but won't pass your regex -->

<a href="/sitemap">Sitemap<!-- proof that a>b iff ab>1 --></a>

There are many more such examples. Regular expressions are good for many things, but not for parsing HTML.

You should consider using Beautiful Soup, a Python HTML parser.

Anyhow, an ad hoc solution using a regex is:

import re

data = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

e = re.compile('<a *[^>]*>.*</a *>')

print(e.findall(data))

Output:

>>> e.findall(data)
['<a href="foo1.com">Foo1</a>', '<a href="/">Home</a>', '<a href="/extract">Extract</a>', '<a href="/sitemap">Sitemap</a>']

Elazar Leibovich 2009-05-11 20:37:51

If you replace that `.*` with `(?:[^<]+|<(?!/a\b))*`, you'll get fewer false positives, without blowing up the regex engine with backtracking.

Ben Blank 2009-05-11 20:53:59
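
A small sketch of what that substitution changes, reading the `<(...)` part of the comment as a negative lookahead, `(?!/a\b)` (an assumption about the commenter's intent). The variable names and the one-line sample string are invented for the illustration.

```python
import re

# Pattern from the answer: the greedy .* runs to the LAST </a> on the line.
greedy = re.compile(r'<a *[^>]*>.*</a *>')

# Same pattern with the body replaced as suggested in the comment:
# consume non-"<" text, or a "<" that does not start a closing </a> tag.
refined = re.compile(r'<a *[^>]*>(?:[^<]+|<(?!/a\b))*</a *>')

# Two links on ONE line, which the multi-line sample data never exercises.
one_line = '<a href="/">Home</a> <a href="/extract">Extract</a>'

print(greedy.findall(one_line))
# ['<a href="/">Home</a> <a href="/extract">Extract</a>']

print(refined.findall(one_line))
# ['<a href="/">Home</a>', '<a href="/extract">Extract</a>']
```

The greedy version only looked correct on the original data because `.` does not match newlines and each link sat on its own line.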

Answer 3

+1 A:

Use BeautifulSoup or lxml if you need to parse HTML.

Also, what is it that you really need to do? Find the last link? Find the third link? Find the link that points to /sitemap? It's unclear from your question. What do you need to do with the data?

If you really have to use regular expressions, have a look at findall.

Filip Salomonsson 2009-05-11 20:43:23
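
As a rough illustration of the findall route: when the pattern contains capture groups, re.findall returns one tuple per match, so each of the three readings of the question becomes a simple list operation. The pattern and the names link_re and links are assumptions made for this sketch; it only copes with markup as regular as the sample.

```python
import re

data = """
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>
"""

# Group 1 captures the href value, group 2 the link text.
link_re = re.compile(r'<a\s+href="([^"]*)"[^>]*>([^<]*)</a\s*>')

links = link_re.findall(data)  # list of (href, text) tuples, in document order

last_link = links[-1]                                        # the last link
third_link = links[2]                                        # the third link overall
sitemap_text = [t for h, t in links if h == "/sitemap"]      # link pointing to /sitemap

print(last_link)     # ('/sitemap', 'Sitemap')
print(third_link)    # ('/extract', 'Extract')
print(sitemap_text)  # ['Sitemap']
```

Note that "third link" here counts the Foo1 link as well; counting only the links inside the hotlink divs needs structural information that a flat regex does not have.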

Answer 4

A:

To extract the contents of the tag:

    <a href="/sitemap">Sitemap</a>

... I would use:

    >>> import re
    >>> s = '''
    <div id=hotlinklist>
    <a href="foo1.com">Foo1</a>
      <div id=hotlink>
        <a href="/">Home</a>
      </div>
      <div id=hotlink>
        <a href="/extract">Extract</a>
      </div>
      <div id=hotlink>
        <a href="/sitemap">Sitemap</a>
      </div>
    </div>'''
    >>> m = re.compile(r'<a href="/sitemap">(.*?)</a>').search(s)
    >>> m.group(1)
    'Sitemap'

Alex 2009-05-12 07:37:33

Actually, replace sitemap with XYZ, as it really can be anything. I would only know that it is the 3rd div within the hotlinklist div. The HTML pattern shown can be repeated many times. Let's say I want to pull out all the smartphone listings on eBay. I would know that the above pattern is repeated for each smartphone found; however, the <a href="XYZ">XYZ</a> can be an iPhone, BlackBerry, Nokia, or any other smartphone. There could be no items or hundreds. So I was looking for something that says: find the repeated pattern, then take the smartphone line out and build a list of smartphones.

VN44CA 2009-05-14 02:03:42
