Hello!

I want to get all matches from this expression:

import re

def my_handler(matches):
  return str(matches.groups())

text = "<a href='#' title='Title here'>"

print(re.sub(r"<[a-zA-Z]+( [a-zA-Z]+=[\#a-zA-Z0-9_.'\" ]+)*>", my_handler, text))

Actual result:

(" title='Title here'",)

Expected result:

("a", " href='#'", " title='Title here'",)

Please help me understand how to do this. Note that I have many tags in the text.

+2  A: 

Your pattern has only one capture group, which is why you get only one item in your result. If you want more things captured, put parentheses around them.
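As a rough sketch of what that means for the pattern in the question: wrapping the tag name in its own parentheses makes it group 1. The names `pattern`, `match` and `groups` are just illustrative. Note that the repeated attribute group still reports only its last repetition, so this alone does not yield every attribute.

```python
import re

text = "<a href='#' title='Title here'>"

# Same pattern as in the question, plus one extra pair of parentheses
# around the tag name so that it is captured as group 1.
pattern = re.compile(r"<([a-zA-Z]+)( [a-zA-Z]+=[#a-zA-Z0-9_.'\" ]+)*>")

match = pattern.match(text)
groups = match.groups()

# Group 1 is the tag name; group 2 holds only the last attribute matched.
print(groups)
```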

However, I'd really recommend not using regex to parse HTML.

Use an actual parser, like BeautifulSoup, instead.

Amber
+2  A: 

Your problem is that you have a single group. Even though you've followed it with a *, it remains a single group and returns only the last thing it matched rather than becoming multiple groups.

This behaviour is desirable because, if you have other groups in the pattern, you still know which number they are; if a group followed by a * became multiple groups, pulling out particular values would become very hard.
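A small illustration of that numbering rule, using a made-up pattern and input rather than the one from the question: the group count is fixed by the pattern, and the starred group keeps only its final repetition.

```python
import re

# Hypothetical pattern: a repeated digit group followed by a word group.
pattern = re.compile(r"(\d)*-(\w+)")
match = pattern.match("123-abc")

# The number of groups comes from the pattern, not from how often (\d) repeated.
group_count = pattern.groups

# Group 1 matched "1", "2" and "3" in turn but keeps only the last one,
# so group 2 is always the word, whatever the input looks like.
captured = match.groups()
print(group_count, captured)
```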

In theory you could try using re.findall (but don't):

>>> re.findall("[a-zA-Z]+=[\#a-zA-Z0-9_.'\"]+",text)
["href='#'", "title='Title"]

However, you'll soon find that creating a regular expression that actually works is very difficult.

If you're parsing HTML you're much better off using an HTML Parser rather than a regular expression.

Have a look at Beautiful Soup, lxml or sgmllib.
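To give an idea of what the parser route looks like, here is a minimal sketch. It assumes Python 3 and uses the standard library's html.parser rather than any of the libraries named above, only to avoid extra dependencies. The class name `TagCollector` and its `tags` list are made up for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects (tag name, attribute list) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed.
        self.tags.append((tag, attrs))


text = "<a href='#' title='Title here'>"

collector = TagCollector()
collector.feed(text)
collector.close()

print(collector.tags)
```

This works the same way when the text contains many tags, since handle_starttag is called once per start tag.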

Dave Webb
A: 

What are you trying to achieve with the code? It looks a bit like you're falling into the old "treating XML as a big pile of text" pitfall. Rather than coming up with your own regex, you should parse it with a DOM API such as xml.dom.minidom: http://docs.python.org/library/xml.dom.minidom.html

Of course, I'm assuming it's XHTML that you're dealing with.
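If it is XHTML, something along these lines might do. This is only a sketch: the input string is a made-up, well-formed variant of the tag from the question (minidom rejects an unclosed tag), and the variable names are illustrative.

```python
from xml.dom import minidom

# Hypothetical well-formed input; the original snippet has no closing tag.
xhtml = "<p><a href='#' title='Title here'>link</a></p>"

doc = minidom.parseString(xhtml)

result = []
for element in doc.getElementsByTagName("a"):
    # attributes.items() gives (name, value) pairs; sort for a stable order.
    attributes = sorted(element.attributes.items())
    result.append((element.tagName, attributes))

print(result)
```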

teukkam