What should be a fairly simple regex extraction is confounding me. Couldn't find a similar question on SO, so happy to be pointed to one if it exists. Given the following HTML:

<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>

<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>

(amongst a larger document - the extracts will most probably run across multiple lines)

How can I construct a regular expression that finds the text within the A tags, within the first P following an H1? The regex will go in a loop, such that I can pass in the header, in order to retrieve the items that follow.

<a[^>]*>([0-9.]+?)</a> obviously matches every value inside an A tag (and should be fine, as A tags cannot be nested), but I can't tie the matches to an H1.

.+Title One.+<a[^>]*>([0-9.]+?)</a></p> fails.

I had tried to use a look-behind, like so:

(?<=Title One.+)<a[^>]*>([0-9.]+?)</a></p> and some variations, but look-behind is only allowed for fixed-width patterns (which won't be the case here).

For context, this will be using Python's regex engine. I know regex isn't necessarily the best solution for this, so alternative suggestions using DOM or something else also gratefully received :)


Update

To clarify from the above, I'd like to get back the following:

{"Title One": ["40.5", "31.3"], "Title Two": ["12.1", "82.0"]}

(not that I need help composing the dictionary, but it does demonstrate how I need the values to be related to the title).

So far BeautifulSoup looks like the best shot. LXML will also probably work as the source HTML isn't really tag-soup - it's pretty well-structured, at least in the places I'm interested in.


A: 

Don't use regex to parse HTML. HTML is not a regular language, so a regex cannot parse it in the general case. Use an HTML parser instead. I suggest lxml.html.

lxml.html deals with badly formed html better than BeautifulSoup, is actively maintained (BeautifulSoup isn't) and is a lot faster since it uses libxml2 internally.
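
If pulling in lxml isn't an option, the same parse-it-properly idea can be sketched with the standard library's html.parser. This is only an illustration, not lxml's API: the class name TitleValueParser and its attributes are made up for the example, and it assumes each title's <p> comes after its <h1> as in the question's sample.

```python
from html.parser import HTMLParser


class TitleValueParser(HTMLParser):
    """Collect the text of <a> tags in the first <p> after each <h1>."""

    def __init__(self):
        super().__init__()
        self.results = {}          # title -> list of link texts
        self._title = None         # text of the most recent <h1>
        self._in_h1 = False
        self._in_first_p = False
        self._in_a = False
        self._p_done = True        # no <h1> seen yet, so nothing to collect

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._title = ""
            self._p_done = False   # start waiting for the next <p>
        elif tag == "p" and not self._in_h1 and not self._p_done:
            self._in_first_p = True
        elif tag == "a" and self._in_first_p:
            self._in_a = True
            self.results[self._title].append("")

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._title = self._title.strip()
            self.results.setdefault(self._title, [])
        elif tag == "p" and self._in_first_p:
            self._in_first_p = False
            self._p_done = True    # later <p> tags under this title are ignored
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so always append.
        if self._in_h1:
            self._title += data
        elif self._in_a:
            self.results[self._title][-1] += data


html = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

parser = TitleValueParser()
parser.feed(html)
parser.close()
print(parser.results)
```

A real parser library will cope with messier markup than this hand-rolled state machine does, which is the reason for recommending one.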

nosklo
+1  A: 

You're right, regex is absolutely the wrong tool for HTML matching.

Your question, however, sounds like exactly the kind of problem Beautiful Soup is for: it's an HTML parser that can deal with less-than-perfect HTML.

Piskvor
+1  A: 

The other obvious answer to solve this problem is BeautifulSoup -- I like that it handles the kind of crappy html that you often run into out in the wild as sensibly and gracefully as you can hope.

bgporter
A: 

Here's a way using just normal string manipulation

html='''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

for i in html.split("</a>"):
    if "<a href" in i:
        print(i.split("<a href")[-1].split(">")[-1])

output

$ python test.py
40.5
31.3
12.1
82.0

I don't actually understand what you want to get, but if your requirement is SIMPLE, then yes, a regex or a bit of string mangling can do it. You don't necessarily need a parser for that.
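
For example, the question's own pattern with re.findall gives the same flat list as the split loop above. A minimal sketch, assuming the values are always plain numbers inside a single A tag; like the loop, it does not group the values by title.

```python
import re

html = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

# Same pattern as in the question, minus the lazy "?", which is not needed
# because [0-9.] can never run past the closing tag.
values = re.findall(r'<a[^>]*>([0-9.]+)</a>', html)
print(values)
```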

ghostdog74
This is a nice simple solution, but doesn't match the values to the titles. I'll clarify the question.
majelbstoat
A: 

Is this the kind of thing you're after?

>>> from lxml import etree
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> d.xpath('//h1/following-sibling::p[1]/a/text()')
['40.5', '31.3', '12.1', '82.0']

This solution uses lxml.etree and an xpath expression.


Update

>>> from lxml import etree
>>> from pprint import pprint
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> #d.xpath('//h1[following-sibling::*[1][local-name()="p"]]') 
...
>>> results = {}
>>> for h in d.xpath('//h1[following-sibling::*[1][local-name()="p"]]'):
...   r = results.setdefault(str(h.text),[])
...   r += [ str(x) for x in h.xpath('./following-sibling::*[1][local-name()="p"]/a/text()') ]
...
>>> pprint(results)
{'Title One': ['40.5', '31.3'], 'Title Two': ['12.1', '82.0']}

This version uses predicates to look ahead, so it should iterate only over <h1> tags that are immediately followed by a <p> tag. (I cast tag.text to str explicitly because I have a recollection that the values lxml returns aren't normal strings, so you'd have trouble pickling them, etc.)
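
The "h1 immediately followed by p" pairing can also be sketched without lxml, using the standard library's xml.etree.ElementTree. ElementTree has no following-sibling axis, so the pairing is done by hand here. This swaps in a different tool from the one above and assumes the fragment is well-formed enough to parse as XML once wrapped in a made-up <root> element, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

data = """
<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
"""

# Wrap the fragment in a single root element so it parses as XML.
root = ET.fromstring("<root>" + data + "</root>")

results = {}
# Visit every element and look at its children, like //h1 in the XPath.
for parent in root.iter():
    children = list(parent)
    # Pair each child with the one right after it, mirroring
    # following-sibling::*[1] in the XPath above.
    for node, nxt in zip(children, children[1:]):
        if node.tag == "h1" and nxt.tag == "p":
            values = [a.text for a in nxt.findall("a")]
            results.setdefault(node.text, []).extend(values)

print(results)
```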

MattH
It is, although there will be other H1 elements on the page and I'll need to know which values go with which title. Hadn't considered XPATH, will investigate, thanks.
majelbstoat
If you post what you're parsing and what you need out of it, I'm sure someone can point you in the right direction with XPath.
MattH