ansaurus

Question

How to get all pieces from regular expression (Python)

Answer 1

+2 A:

Your pattern has only one capture group, which is why you get only one item in your result. If you want more things captured, put parentheses around each of them.
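To illustrate the point about capture groups, here is a minimal sketch. The input string and both patterns are made up for the example (the question's original text and pattern are not shown here), so treat them as assumptions rather than the asker's actual code.

```python
import re

# Hypothetical input; not the text from the original question.
text = "<a href='#' title='Title'>"

# One pair of parentheses: only one value is captured.
one = re.search(r"href='([^']*)'", text)
print(one.groups())   # ('#',)

# Two pairs of parentheses: two values, numbered left to right.
two = re.search(r"href='([^']*)' title='([^']*)'", text)
print(two.groups())   # ('#', 'Title')
```

Each pair of parentheses adds exactly one entry to the result, no matter how the rest of the pattern is written.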

However, I'd really recommend not using regex to parse HTML.

Use an actual parser, like BeautifulSoup, instead.

Amber 2010-08-13 07:45:37

Answer 2

+2 A:

Your problem is that you have a single group. Even though you've followed it with a *, it remains a single group and returns only the last thing it matched, rather than becoming multiple groups.

This behaviour is desirable because, if you have other groups in the pattern, you still know which number they are. If a group followed by a * became multiple groups, pulling out particular values would become very hard.
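A small sketch of that behaviour, using a made-up input string and pattern (both are assumptions for illustration, not the code from the question):

```python
import re

# Hypothetical input: three key=value pairs followed by a marker word.
text = "a=1 b=2 c=3 end"

# Group 1 is repeated with *, group 2 follows it.
m = re.match(r"(\w=\d )*(end)", text)

# Group 1 holds only its last repetition; group 2 is still group 2.
print(m.group(1))   # 'c=3 '
print(m.group(2))   # 'end'
print(m.groups())   # ('c=3 ', 'end')
```

The repeated group matched three times, but the numbering of the groups did not change, which is what keeps group 2 easy to find.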

In theory you could try using re.findall (but don't):

>>> re.findall(r"[a-zA-Z]+=[#a-zA-Z0-9_.'\"]+", text)
["href='#'", "title='Title"]

However, you'll soon find that creating a regular expression that works is very difficult.

If you're parsing HTML, you're much better off using an HTML parser rather than a regular expression.

Have a look at Beautiful Soup, lxml or sgmllib (the last is Python 2 only).
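As a rough sketch of the parser approach, the example below uses the standard library's html.parser rather than any of the libraries named above, simply so that it runs without installing anything. The AttrCollector class and the sample markup are hypothetical names and data invented for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.found.append((tag, dict(attrs)))


# Hypothetical input; note the attribute value containing spaces.
html = "<a href='#' title='Title Of Link'>link</a>"

collector = AttrCollector()
collector.feed(html)
print(collector.found)   # [('a', {'href': '#', 'title': 'Title Of Link'})]
```

The parser deals with quoting and whitespace inside attribute values, which is where a hand-written pattern tends to break.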

Dave Webb 2010-08-13 07:47:23

Answer 3

A:

What are you trying to achieve with the code? It seems a bit like you're falling into the old "treating XML as a big pile of text" pitfall. You should use a DOM API such as xml.dom.minidom to parse it instead of coming up with your own regex: http://docs.python.org/library/xml.dom.minidom.html

Of course, I'm assuming it's XHTML that you're dealing with.
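A minimal sketch of the minidom route, assuming the input really is well-formed XHTML (parseString raises an error otherwise). The sample markup is invented for the example.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed XHTML fragment.
xhtml = "<p><a href='#' title='Title Of Link'>link</a></p>"

doc = parseString(xhtml)

# Collect (href, title) for every <a> element in the document.
links = [
    (a.getAttribute("href"), a.getAttribute("title"))
    for a in doc.getElementsByTagName("a")
]
print(links)   # [('#', 'Title Of Link')]
```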

teukkam 2010-08-13 07:52:12
