What should be a fairly simple regex extraction is confounding me. I couldn't find a similar question on SO, so I'm happy to be pointed to one if it exists. Given the following HTML:
<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
(this sits within a larger document, and the extracts will most probably run across multiple lines)
How can I construct a regular expression that finds the text within the A tags inside the first P following an H1? The regex will run in a loop, so that I can pass in each header and retrieve the items that follow it.
<a[^>]*>([0-9.]+?)</a>
obviously matches all items in A tags (and should be fine, as A tags cannot be nested), but I can't tie them to an H1.
.+Title One.+<a[^>]*>([0-9.]+?)</a></p>
fails.
I also tried to use a lookbehind, like so:
(?<=Title One.+)<a[^>]*>([0-9.]+?)</a></p>
and some variations, but a lookbehind is only allowed for fixed-width patterns (which won't be the case here).
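One way around the fixed-width restriction is to split the job into two passes: first isolate the P block that follows the wanted H1, then pull the A values out of that block only. The following is a minimal sketch of that idea, not a robust HTML parser. The function name values_after_heading is hypothetical, and it assumes the title text has no nested markup, that P elements are not nested, and that every H1 of interest is actually followed by a P (otherwise the lazy match could run on into a later section). re.DOTALL is what lets the extract span multiple lines.

```python
import re

# Sample input taken from the question.
html = '''<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>'''


def values_after_heading(html, title):
    """Return the numeric A-tag texts from the first P after the H1 holding `title`.

    Hypothetical helper: assumes plain-text titles and non-nested P elements.
    """
    # Pass 1: capture the body of the first <p>...</p> after the matching <h1>.
    # re.DOTALL lets '.' cross newlines; the lazy '.*?' stops at the first match.
    block = re.search(
        r"<h1\b[^>]*>\s*" + re.escape(title) + r"\s*</h1>.*?<p\b[^>]*>(.*?)</p>",
        html,
        re.DOTALL,
    )
    if block is None:
        return []
    # Pass 2: the original A-tag pattern, applied to the captured block only.
    return re.findall(r"<a\b[^>]*>\s*([0-9.]+)\s*</a>", block.group(1))


for title in ("Title One", "Title Two"):
    print(title, values_after_heading(html, title))
```

Because the second pattern only ever sees the captured block, the A values stay tied to their H1 without needing any lookbehind.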
For context, this will be using Python's regex engine. I know regex isn't necessarily the best solution for this, so alternative suggestions using DOM or something else are also gratefully received :)
Update
To clarify from the above, I'd like to get back the following:
{"Title One": ["40.5", "31.3"], "Title Two": ["12.1", "82.0"]}
(not that I need help composing the dictionary, but it does demonstrate how I need the values to be related to the title).
So far BeautifulSoup looks like the best shot. lxml will also probably work, as the source HTML isn't really tag-soup - it's pretty well-structured, at least in the places I'm interested in.
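For comparison, here is a rough parser-based sketch that builds the dictionary shown above in one pass. To keep it dependency-free it uses the standard library's html.parser rather than BeautifulSoup or lxml, so it is a stand-in for those approaches, not an example of their APIs. The class name HeadingValues and the helper collect are made up for illustration, and the sketch assumes each P is closed with an explicit </p> tag, as in the sample.

```python
from html.parser import HTMLParser

# Sample input taken from the question.
html = '''<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>'''


class HeadingValues(HTMLParser):
    """Hypothetical helper: maps each H1 text to the A texts in the first P after it."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._title = None      # text of the most recent H1
        self._buffer = None     # collects text while inside an H1 or a wanted A
        self._p_state = "none"  # "waiting" -> "inside" -> "done" for each H1

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._buffer = []
        elif tag == "p" and self._p_state == "waiting":
            self._p_state = "inside"  # this is the first P after the H1
        elif tag == "a" and self._p_state == "inside":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._buffer is not None:
            self._title = "".join(self._buffer).strip()
            self.result[self._title] = []
            self._buffer = None
            self._p_state = "waiting"
        elif tag == "a" and self._buffer is not None:
            self.result[self._title].append("".join(self._buffer).strip())
            self._buffer = None
        elif tag == "p" and self._p_state == "inside":
            self._p_state = "done"  # later P elements under this H1 are ignored


def collect(html):
    parser = HeadingValues()
    parser.feed(html)
    parser.close()
    return parser.result


print(collect(html))
```

The small state machine is what restricts matches to the first P after each H1; any later P under the same heading is skipped until the next H1 resets it.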