ansaurus

Question

Regex matching items following a header in HTML

Answer 1

A:

Don't use regex to parse HTML. It can't be done in the general case: HTML is not a regular language, so a regular expression cannot follow arbitrary nesting. Use an HTML parser instead. I suggest lxml.html.

lxml.html deals with badly formed HTML better than BeautifulSoup, is actively maintained (BeautifulSoup isn't), and is a lot faster since it uses libxml2 internally.
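To illustrate the "use a parser" advice without any third-party install, here is a minimal sketch that swaps in Python 3's standard-library html.parser for lxml.html. The class name TitleValueParser, the helper group_links_by_h1, and the sample markup (borrowed from the other answers on this page) are illustrative assumptions, not part of the original question.

```python
from html.parser import HTMLParser

# Sample markup taken from the other answers; an assumption about the real page.
SAMPLE = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''


class TitleValueParser(HTMLParser):
    """Hypothetical helper: file each <a> text under the most recent <h1>."""

    def __init__(self):
        super().__init__()
        self.results = {}      # title -> list of link texts
        self._current = None   # text of the most recent <h1>
        self._in_h1 = False
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._current = ""
        elif tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
            # Register the title even if no links follow it.
            self.results.setdefault(self._current, [])
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_h1:
            self._current += data
        elif self._in_a and self._current in self.results:
            text = data.strip()
            if text:
                self.results[self._current].append(text)


def group_links_by_h1(html_text):
    """Return a dict mapping each <h1> text to the link texts that follow it."""
    parser = TitleValueParser()
    parser.feed(html_text)
    parser.close()
    return parser.results


if __name__ == "__main__":
    print(group_links_by_h1(SAMPLE))
```

On the sample this groups 40.5 and 31.3 under Title One, and 12.1 and 82.0 under Title Two. It is event-based, so it tolerates the line break between the first two links, but it makes no attempt to check that the links actually sit inside the <p> right after the heading.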

nosklo 2010-10-25 12:38:18

Answer 2

+1 A:

You're right, regex is absolutely the wrong tool for HTML matching.

Your question, however, sounds like exactly the kind of problem Beautiful Soup was made for: it is an HTML parser that can deal with less-than-perfect HTML.

Piskvor 2010-10-25 12:40:39

Answer 3

+1 A:

The other obvious answer to this problem is BeautifulSoup. I like that it handles the kind of crappy HTML you often run into out in the wild as sensibly and gracefully as you can hope.

bgporter 2010-10-25 12:44:22

Answer 4

A:

Here's a way using just normal string manipulation:

html='''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

for i in html.split("</a>"):
    if "<a href" in i:
        print(i.split("<a href")[-1].split(">")[-1])

output

$ python test.py
40.5
31.3
12.1
82.0

I don't actually understand what you want to get, but if your requirement is SIMPLE, then yes, a regex or a bit of string mangling can do it. You don't necessarily need a parser for that.
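For the regex route mentioned above, a minimal sketch might look like the following. It assumes every link is written exactly as in the sample (a single double-quoted href attribute and plain text inside the tag); the names LINK_TEXT and link_texts are made up for illustration, and anything less regular than this markup would break it.

```python
import re

# Same sample markup as in the string-splitting version above.
SAMPLE = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

# Capture the text between <a href="..."> and </a>.
# Assumption: no other attributes and no nested tags inside the link.
LINK_TEXT = re.compile(r'<a\s+href="[^"]*"\s*>([^<]*)</a>')


def link_texts(html_text):
    """Return the text of every simple <a href="...">...</a> in html_text."""
    return LINK_TEXT.findall(html_text)


if __name__ == "__main__":
    for value in link_texts(SAMPLE):
        print(value)
```

Like the split-based loop, this prints the four values in document order and does not record which heading each one belongs to.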

ghostdog74 2010-10-25 13:00:33

This is a nice simple solution, but doesn't match the values to the titles. I'll clarify the question.

majelbstoat 2010-10-25 21:45:21

Answer 5

A:

Is this the kind of thing you're after?

>>> from lxml import etree
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> d.xpath('//h1/following-sibling::p[1]/a/text()')
['40.5', '31.3', '12.1', '82.0']

This solution uses lxml.etree and an xpath expression.

Update

>>> from lxml import etree
>>> from pprint import pprint
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> #d.xpath('//h1[following-sibling::*[1][local-name()="p"]]') 
...
>>> results = {}
>>> for h in d.xpath('//h1[following-sibling::*[1][local-name()="p"]]'):
...   r = results.setdefault(str(h.text),[])
...   r += [ str(x) for x in h.xpath('./following-sibling::*[1][local-name()="p"]/a/text()') ]
...
>>> pprint(results)
{'Title One': ['40.5', '31.3'], 'Title Two': ['12.1', '82.0']}

This version uses predicates to look ahead: it iterates over the <h1> tags that are immediately followed by a <p> tag. (I'm casting tag.text to str explicitly because I recall that these values aren't normal strings, so you'd have trouble pickling them, etc.)

MattH 2010-10-25 14:11:56

It is, although there will be other H1 elements on the page and I'll need to know which values go with which title. Hadn't considered XPATH, will investigate, thanks.

majelbstoat 2010-10-25 21:44:20

If you post what you're parsing and what you need out of it, I'm sure someone can point you in the right direction with XPath.

MattH 2010-10-26 09:21:52
