ansaurus

Question

Answer 1

+1 A:

Use parentheses to form a capturing group:

'<a href=".*">(.*)</a>'

You probably also want non-greedy quantifiers, so that each `.*` stops at the first possible closing quote or `</a>` instead of running on to the last one on the line.

'<a href=".*?">(.*?)</a>'

Result:

['2', 'nissan', 'all']
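To make the difference concrete, here is a minimal sketch that runs both patterns through `re.findall`. The sample string is an assumption modelled on the category links discussed in this thread (the original question text is not shown here), and the variable names are only illustrative.

```python
import re

# Assumed sample input: three links on a single line.
html = ('Categories: <a href="/car/2/page1.html">2</a>, '
        '<a href="/car/nissan/">nissan</a>,'
        '<a href="/car/all/page1.html">all</a>')

# Greedy: the first .* runs to the last '">' on the line,
# so only the final link text is captured.
greedy = re.findall(r'<a href=".*">(.*)</a>', html)

# Non-greedy: each .*? stops at the first possible match,
# so every link yields its own captured group.
non_greedy = re.findall(r'<a href=".*?">(.*?)</a>', html)

print(greedy)
print(non_greedy)
```

With a single capturing group, `re.findall` returns just the group contents rather than the whole match. On this input the greedy pattern gives `['all']` and the non-greedy one gives `['2', 'nissan', 'all']`.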

Or even better, consider using an HTML parser, such as BeautifulSoup.
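A rough sketch of the BeautifulSoup route, assuming the `beautifulsoup4` package is installed and using the same kind of sample markup as above (the sample string is an assumption, not taken from the question):

```python
from bs4 import BeautifulSoup

# Assumed sample input, modelled on the links in this thread.
html = ('Categories: <a href="/car/2/page1.html">2</a>, '
        '<a href="/car/nissan/">nissan</a>,'
        '<a href="/car/all/page1.html">all</a>')

# "html.parser" is Python's built-in parser backend; no extra dependency needed.
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
link_texts = [a.get_text() for a in soup.find_all("a", href=True)]

print(link_texts)
```

Because the parser works on tags and attributes rather than on raw characters, spacing and attribute order inside the `<a>` tag should not matter here.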

Mark Byers 2010-09-21 00:06:27

+1 to BeautifulSoup, you will not have to tackle utf-8 parsing and html-encoding.

bronzebeard 2010-09-21 04:55:59

Answer 2

+1 A:

Regex is never a good idea for parsing HTML. There are too many edge cases that make crafting a robust regular expression difficult. Consider the following perfectly browser-viewable links:

< a href="/car/all/page1.html">all</a>
<a  href="/car/all/page1.html">all</a>
<a href= "/car/all/page1.html">all</a>
<a id="foo" href="/car/all/page1.html">all</a>
<a
 href="/car/all/page1.html">all</a>

None of these will be matched by the given regular expression. I highly recommend an HTML parser, such as Beautiful Soup or lxml. Here's an lxml example:

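from lxml import etree

html = """
Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>
"""
# etree.HTML uses a lenient HTML parser and returns the root element
doc = etree.HTML(html)
# select the text nodes of every <a> element that has an href attribute
result = doc.xpath('//a[@href]/text()')
print(result)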

Result:

['2', 'nissan', 'all']

The result is the same whether the HTML is formatted differently or is even somewhat malformed.
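The contrast can be checked with a short, self-contained sketch. It uses Python's standard-library `html.parser` instead of lxml purely so that it runs without extra packages; the `LinkTextParser` class is a hypothetical helper written for this illustration. Four of the variants listed above are used. The one with a space directly after `<` is left out, since parsers that follow the HTML5 rules generally treat that as text rather than a tag.

```python
import re
from html.parser import HTMLParser

# Four of the link variants from above, one per line.
variants = """
<a  href="/car/all/page1.html">all</a>
<a href= "/car/all/page1.html">all</a>
<a id="foo" href="/car/all/page1.html">all</a>
<a
 href="/car/all/page1.html">all</a>
"""


class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects the text inside each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.link_texts = []
        self._buffer = None  # None while outside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.link_texts.append("".join(self._buffer))
            self._buffer = None


# The regular expression from the first answer finds none of the variants.
regex_hits = re.findall(r'<a href=".*?">(.*?)</a>', variants)

parser = LinkTextParser()
parser.feed(variants)
parser.close()

print(regex_hits)
print(parser.link_texts)
```

On this input the regular expression returns an empty list, while the parser reports the text `all` once for each of the four links.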

Mark Thomas 2010-09-21 01:41:35

I've also seen `<a>` tags in the wild with only single quotes or even *no* quotes around the href value.

Paul McGuire 2010-09-21 13:05:43

python regex retrieve only one group
