tags:

views:

48

answers:

2

Hi,

I have just a little experience with regex, and now I have a small problem.

I need to retrieve the strings between the <a> and </a> tags.

Here is a sample:

Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>

And this is my little regex:

re.findall("""<a href=".*">.*</a>""",string)

Well, it works, but I just want the strings between the <a> and </a> tags, not the href. How can I do this?

Thanks.

+1  A: 

Use parentheses to form a capturing group:

'<a href=".*">(.*)</a>'

You also probably want to use a non-greedy quantifier to avoid matching far more than you intended.

'<a href=".*?">(.*?)</a>'

Result:

['2', 'nissan', 'all']
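
To see why the non-greedy version matters, here is a small sketch that runs both patterns against the sample line from the question. The variable names `greedy` and `lazy` are only for illustration.

```python
import re

# Sample line from the question, all on one line.
string = ('Categories: <a href="/car/2/page1.html">2</a>, '
          '<a href="/car/nissan/">nissan</a>,'
          '<a href="/car/all/page1.html">all</a>')

# Greedy: the first .* runs to the last "> on the line,
# so only the final link text is captured.
greedy = re.findall(r'<a href=".*">(.*)</a>', string)

# Non-greedy: each .*? stops at the first possible match,
# so every link is captured separately.
lazy = re.findall(r'<a href=".*?">(.*?)</a>', string)

print(greedy)  # ['all']
print(lazy)    # ['2', 'nissan', 'all']
```

With a capturing group in the pattern, `re.findall` returns only the group contents rather than the whole match, which is why the href part drops out of the result.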

Or even better, consider using an HTML parser, such as BeautifulSoup.

Mark Byers
+1 for BeautifulSoup; you will not have to deal with UTF-8 decoding and HTML entity encoding yourself.
bronzebeard
+1  A: 

Regex is never a good idea for parsing HTML. There are too many edge cases that make crafting a robust regular expression difficult. Consider the following perfectly browser-viewable links:

< a href="/car/all/page1.html">all</a>
<a  href="/car/all/page1.html">all</a>
<a href= "/car/all/page1.html">all</a>
<a id="foo" href="/car/all/page1.html">all</a>
<a
 href="/car/all/page1.html">all</a>

None of these will be matched by the given regular expression. I highly recommend an HTML parser, such as Beautiful Soup or lxml. Here's an lxml example:

from lxml import etree

html = """
Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>
"""
doc = etree.HTML(html)
result = doc.xpath('//a[@href]/text()')

Result:

['2', 'nissan', 'all']

regardless of whether the HTML is formatted differently or is even somewhat malformed.

Mark Thomas
I've also seen `<a>` tags in the wild with only single quotes or even *no* quotes around the href value.
Paul McGuire
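
If installing a third-party parser is not an option, the standard library's `html.parser` can cope with several of the variations mentioned in this thread (extra whitespace, extra attributes, single quotes, no quotes). This is only a minimal sketch, not a replacement for Beautiful Soup or lxml; the class `LinkTextParser`, the helper `link_texts`, and the modified sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text inside every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._buffer = None  # None while outside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer))
            self._buffer = None


def link_texts(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical variant of the question's sample: double space, extra
# attribute with single-quoted href, and a line break with an unquoted href.
sample = """Categories: <a  href="/car/2/page1.html">2</a>, <a id="foo" href='/car/nissan/'>nissan</a>,<a
 href=/car/all/page1.html>all</a>"""

print(link_texts(sample))  # ['2', 'nissan', 'all']
```

The sketch assumes links are not nested and ignores any markup inside the link text, keeping only the text itself.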