tags:

views:

3198

answers:

6

I have the following regex expression to match html links:

<a\s*href=['|"](http:\/\/(.*?)\S['|"]>

it kind of works. Except not really. Because it grabs everything after the < a href... and just keeps going. I want to exclude the quote characters from that last \S match. Is there any way of doing that?

EDIT: This would make it grab only up to the quotes instead of everything after the < a href btw

+1  A: 

Why are you trying to match HTML links with a regex?

Depending on what you're trying to do the appropriate thing to do would vary.

You could try using an HTML Parser. There are several available, there's even one in the Python Library: http://www.python.org/doc/2.5.2/lib/module-HTMLParser.html

Hope this helps!

Marcos Lara
+1  A: 
>>> import re
>>> regex = '<a\s+href=["\'](http://(.*?))["\']&gt;'
>>> string = '<a href="http://google.com/test/this"&gt;'
>>> match = re.search(regex, string)
>>> match.group(1)
'http://google.com/test/this'
>>> match.group(2)
'google.com/test/this'

explanations:

 \s+   = match at least one white space (<ahref) is a bad link
 ["\'] = character class, | has no meaning within square brackets
         (it will match a literal pipe "|")
Owen
+4  A: 

I don't think your regex is doing what you want.

<a\s*href=['|"](http:\/\/(.*?)\S['|"]>

This captures anything non-greedily from http:// up to the first non-space character before a quote, single quote, or pipe. For that matter, I'm not sure how it parses, as it doesn't seem to have enough close parens.

If you are trying to capture the href, you might try something like this:

<a .*?+href=['"](http:\/\/.*?)['"].*?>

This uses the .*? (non-greedy match anything) to allow for other attributes (target, title, etc.). It matches an href that begins and ends with either a single or double quote (it does not distinguish, and allows the href to open with one and close with the other).

Ben Doom
All the regexes on display match mismatched single/double quotes (in question as well as answer). You'd have to capture the open quote and use it again in a \1 back-reference.
Jonathan Leffler
+1  A: 

\S matches any character that is not a whitespace character, just like [^\s]

Written like that, you can easily exclude quotes: [^\s"']

Note that you'll likely have to give the .*? in your regex the same treatment. The dot matches any character that is not a newline, just like [^\r\n]

Again, written like that, you can easily exclude quotes: [^\r\n'"]

Jan Goyvaerts
A: 

Read Jeff Friedl's "Mastering Regular Expressions" book.

As written:

<a\s*href=['|"](http:\/\/(.*?)\S['|"]>

You have unbalanced parentheses in the expression. Maybe the trouble is that the first match is being treated as "read to end of regex". Also, why would you not want the last non-space character of the URL?

The .*? (lazy greedy) operator is interesting. I must say, though, that I'd be more inclined to write:

<a\s+href=['|"]http://([^'"&gt;&lt;]+)\1&gt;

This distinguishes between "<ahref" (a non-existent HTML tag) and "<a href" (a valid HTML tag). It doesn't capture the 'http://' prefix. I'm not certain whether you have to escape the slashes -- in Perl, where I mainly work, I wouldn't need to. The capturing part uses the greedy match, but only on characters that might semi-legitimately appear in the URL. Specifically, it excludes both quotes and the end-tag (and, for good measure, the begin-tag too). If you really want the 'http://' prefix, shift the capturing parenthesis appropriately.

Jonathan Leffler
A: 

I ran into on issue with single quotes in some urls such as this one from Fox Sports. I made a slight adjustment that I think should take care of it.

http://msn.foxsports.com/mlb/story/9152594/Fehr:'Heightened'-concern-about-free-agent-market

/<a\s+href\s*=\s*["'](http:\/\/.*?)["'][>\s]/i

this requires that the closing quote be followed by a space or closing bracket.