I need a regex in Python to find a link's HTML in a larger chunk of HTML.

So if I have:

<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>

I would get back:

<a href="#" title="myurl">URL Text</a>

I'd like to do this with a regex rather than BeautifulSoup or something similar. Does anyone have a snippet lying around that I could use?

Thanks

+2  A: 

You really shouldn't use regexes to parse HTML. Ever.

Try BeautifulSoup or lxml.

But you asked, so a quick and naive version might look like this:

import re

html = """
<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>
"""

# Greedy match from the first "<a " to the last ">" on the same line
m = re.search(r'(<a .*>)', html)
if m:
    print(m.group(1))

I can think of a lot of ways this would break.
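
To make that concrete, here is a small sketch of two of the failure modes. Both inputs are made up for illustration (they are not from the question): two links on one line, and an opening tag that wraps across lines.

```python
import re

# Hypothetical input: two links on one line.
one_line = '<li><a href="/one">One</a> | <a href="/two">Two</a></li>'

# The greedy .* runs to the LAST ">" on the line, so both links
# and the trailing </li> come back as a single "match".
greedy = re.search(r'(<a .*>)', one_line).group(1)
print(greedy)

# Hypothetical input: the opening tag is split across two lines.
multi_line = '<a href="#"\n   title="myurl">URL Text</a>'

# "." does not match a newline by default, so no ">" is ever reached.
multiline_match = re.search(r'(<a .*>)', multi_line)
print(multiline_match)
```

Neither result is what you would want from a link extractor, which is the usual argument for a real parser.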

Corey Goldberg
Considering what he wants to get back, you probably want something more like `/(<a .*?</a>)/`. And yes, it breaks on pretty much everything.
Anon.
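
A rough sketch of the non-greedy idea from the comment above. The pattern is a small variation on the one suggested (it uses `\b` instead of a literal space), and the flags and the second link in the sample are my own additions for illustration. It is still a regex over HTML, so treat it as a quick hack, not a parser.

```python
import re

html = """
<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a> <a href="/other">Other</a>
</li>
</ul>
"""

# Non-greedy .*? stops at the first </a> after each opening <a ...>.
# DOTALL lets a link span several lines; IGNORECASE accepts <A HREF=...>.
LINK_RE = re.compile(r'<a\b.*?</a>', re.DOTALL | re.IGNORECASE)

links = LINK_RE.findall(html)
for link in links:
    print(link)
```

This still breaks on things like commented-out markup or a literal "</a>" inside an attribute value.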
+1  A: 

You can try this, since your requirement is simple. No need for BeautifulSoup or a regex.

>>> s="""
... <ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>
... """
>>> for item in s.split("</a>"):
...     if "<a href=" in item:
...         print(item[item.find("<a href="):] + "</a>")
...
<a href="#" title="myurl">URL Text</a>

You can include a check of '<li class="li_class">' in the if statement as desired.
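
One possible reading of that `<li>` check, as a sketch only. The function name `links_in_li` and its `li_open` parameter are made up for illustration. It assumes the `<li>` tag is written exactly as given, that lists are not nested, and that `href` is the first attribute of each link.

```python
def links_in_li(html, li_open='<li class="li_class">'):
    """Return the <a href=...>...</a> snippets found inside matching <li> blocks."""
    found = []
    # Cut the document into chunks that each end where an </li> was.
    for block in html.split("</li>"):
        start = block.find(li_open)
        if start == -1:
            continue  # this chunk does not contain the wanted <li>
        # Same split-on-"</a>" trick as above, limited to this chunk.
        for item in block[start:].split("</a>"):
            pos = item.find("<a href=")
            if pos != -1:
                found.append(item[pos:] + "</a>")
    return found


sample = '''<li class="li_class">
    <a href="#" title="myurl">URL Text</a>
    <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
    <a href="#" title="myurl3">URL Text3</a></li>'''

result = links_in_li(sample)
for link in result:
    print(link)
```

Any change in attribute order, quoting, or whitespace inside the tags defeats the string matching, so this only suits markup you control.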

ghostdog74
And of course lots of perfectly correct ways to write that HTML (even just switching the title and href attributes, for example!) will make this go down in flames. What a perfectly terrible "solution"!
Alex Martelli
I think you all should not jump too far ahead. What OP wants to do is supposedly very simple. You guys make it too complicated!
ghostdog74
+3  A: 

Soup is good for you:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>''')

There are many arguments you can pass to the findAll method; see the BeautifulSoup documentation for more. The one-liner below will get you started by returning a list of all links matching some conditions.

>>> soup.findAll(href='#', title='myurl')
[<a href="#" title="myurl">URL Text</a>]

Edit: more info added, based on the OP's comment.

So let's say you're interested only in <a> tags within list elements of a certain class, <li class="li_class">. You could do something like this:

>>> from pprint import pprint
>>> soup = BeautifulSoup('''<li class="li_class">
...     <a href="#" title="myurl">URL Text</a>
...     <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
...     <a href="#" title="myurl3">URL Text3</a></li>''')  # just some sample HTML

>>> for elem in soup.findAll("li", "li_class"):
...     pprint(elem.findAll('a'))
...
[<a href="#" title="myurl">URL Text</a>,
 <a href="#" title="myurl2">URL Text2</a>]

Soup recipe:

  1. Download the one file required.
  2. Place the downloaded file in your site-packages directory or similar.
  3. Enjoy your soup.
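
If installing anything at all is off the table, the same filtering can be roughed out with the standard library's `html.parser` (Python 3). This is a sketch under assumptions, not a BeautifulSoup replacement: the class name `LiLinkCollector` is made up for illustration, and it only keeps the plain text inside each link, dropping any nested tags.

```python
from html.parser import HTMLParser


class LiLinkCollector(HTMLParser):
    """Collect <a> elements that sit inside an <li> carrying a given class."""

    def __init__(self, li_class):
        super().__init__()
        self.li_class = li_class
        self.li_stack = []    # one bool per open <li>: does it have the class?
        self.current = None   # pieces of the link being collected, if any
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            classes = (dict(attrs).get("class") or "").split()
            self.li_stack.append(self.li_class in classes)
        elif tag == "a" and self.li_stack and self.li_stack[-1]:
            # get_starttag_text() returns the raw text of the tag just opened.
            self.current = [self.get_starttag_text()]

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.links.append("".join(self.current) + "</a>")
            self.current = None
        elif tag == "li" and self.li_stack:
            self.li_stack.pop()


sample = '''<li class="li_class">
    <a href="#" title="myurl">URL Text</a>
    <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
    <a href="#" title="myurl3">URL Text3</a></li>'''

collector = LiLinkCollector("li_class")
collector.feed(sample)
collector.close()
for link in collector.links:
    print(link)
```

Unlike the string-based approaches, this does not care about attribute order or extra whitespace inside the tags.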
Adam Bernier
OK, let's say I want to find only the a tags that are inside of <li class="li_class">. So, if the li tag doesn't have that class, I don't want to return the a tag. How do I do that?
Joe
@Joe: see edit with more info included.
Adam Bernier