views: 179

answers: 4

Taking this thread a step further, can someone tell me what the difference is between these two regular expressions? They both seem to accomplish the same thing: pulling a link out of HTML.

Expression 1:

'/(https?://)?(www.)?([a-zA-Z0-9_%]*)\b.[a-z]{2,4}(.[a-z]{2})?((/[a-zA-Z0-9_%])+)?(.[a-z])?/'

Expression 2:

'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'

Which one would be better to use? And how could I modify one of those expressions to match only links that contain certain words, and to ignore any matches that do not contain those words?

Thanks.

+1  A: 

In the majority of cases I'd strongly recommend using an HTML parser (such as this one) to get these links. Using regular expressions to parse HTML is going to be problematic since HTML isn't regular and you'll have no end of edge cases to consider.

See here for more info.
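For comparison, here is a minimal sketch of the parser route, using Python's standard-library html.parser in place of the linked PHP parser (the input HTML is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # handle_starttag is never called for tags inside comments;
        # <a> tags without an href are skipped here.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

parser = LinkExtractor()
parser.feed('<!-- <a href="notalink"> --> <a name="x">anchor</a> '
            '<a href="http://example.com/">real link</a>')
print(parser.links)  # ['http://example.com/']
```

The commented-out and href-less anchors are ignored for free, with no regex tuning.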

Brian Agnew
I don't agree; matching a well-formed link is not too difficult, and the dev time for doing it via regex is a fraction of what it would be with a parser. He's not even trying to parse HTML, he's parsing text which might have links.
Paul Creasey
Whilst I accept your point re. pragmatism (and I've amended my answer to reflect this), he does say in the above that the regexp is pulling a link out of HTML.
Brian Agnew
It's pulling out of a Wordpress post's content, so it would be HTML (right?), but pretty clean HTML at that. Using regex seems to be working fine for me, I just want to be sure I'm using the expression that will give me the best results. The parser is interesting though, thanks for the link.
rocky
In closed/known cases like yours, regexps aren't unreasonable. But it's worth looking at the parsers going forwards.
Brian Agnew
I don't know why people are so reluctant to use a real parser. It's the only way to handle HTML correctly, and it's not exactly difficult. (To *write* a parser, sure, hard; to use one, trivial.) Coming up with a regex to handle common variations of link syntax is very difficult, and in the general case impossible. You want to hope that the link format on the page that you're scraping will always stay exactly the same? That you won't get fooled by commented-out links or HTML in script block content? Link elements split over multiple lines?
bobince
A: 

At a brief glance the first one is rubbish, but it seems to be trying to match a link as plain text; the second one matches an HTML element.

Paul Creasey
#2 it is! Thanks.
rocky
+2  A: 

The difference is that expression 1 looks for valid, full URIs, following the specification. So you get every full URL that appears anywhere in the code. This is not really a way of getting all links, because it doesn't match relative URLs, which are used very often, and it matches every URL, not only the ones that are link targets.

The second looks for `<a>` tags and captures the content of the href attribute, so it will get you every link. Apart from one error* in that expression, it is quite safe to use and will work well enough – it allows for enough of the variations that can appear, such as whitespace or other attributes.

*There is one error in that expression: it does not look for the closing quote of the href attribute. You should add that, or you might match weird things:

/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>/si
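As a quick sanity check, the corrected pattern can be exercised like this (a Python sketch, with `re.S | re.I` standing in for PHP's `/si` modifiers; the sample HTML is made up):

```python
import re

# Made-up HTML snippet; the pattern is the corrected one from above.
html = '<p>See <a href="http://example.com/page">this page</a> for details.</p>'
pattern = re.compile(r'<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?</a>',
                     re.S | re.I)

print(pattern.findall(html))  # ['http://example.com/page']
```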

edit in response to the comment:

To look for `word` inside the link URL, use:

/<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?<\/a>/si

To look for `word` inside the link text, use:

/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?word.*?<\/a>/si
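A quick check of the URL-filtering variant (Python sketch with made-up input; `re.S | re.I` stands in for `/si`):

```python
import re

# Only the href containing "blue" should be captured.
html = ('<a href="http://example.com/blue-widget">one</a> '
        '<a href="http://example.com/plain">two</a>')
pattern = re.compile(
    r'<a.*?href\s*=\s*["\']([^"\'>]*blue[^"\'>]*)["\'][^>]*>.*?</a>',
    re.S | re.I)

print(pattern.findall(html))  # ['http://example.com/blue-widget']
```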
poke
Awesome, thanks for the explanation. Now say I wanted to modify that second expression to match links that contained the word blue, or red, or green (anywhere in the link), and ignore the links that didn't contain one of those words. Is that possible?
rocky
updated my answer to reflect that.
poke
That did it. One last question: what's the syntax for multiple words? Something that would function like this: `/<a.*?href\s*=\s*["\']([^"\'>]*red,green,blue[^"\'>]*)["\'][^>]*>.*?<\/a>/si`. Googling for regex is a nightmare. Thanks again poke.
rocky
If you want to match either "red" or "blue", do it like this: `(red|blue)`; if you don't want that part to form a capture group of its own, use `(?:red|blue)` instead.
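Dropped into the href-filtering pattern, the alternation looks like this (Python sketch with made-up input; the `(?:...)` group keeps the href as the only capture):

```python
import re

html = ('<a href="/red/1">r</a> <a href="/green/2">g</a> '
        '<a href="/gold/3">x</a>')
# (?:red|green|blue) accepts any of the three words without adding
# a second capture group, so findall still returns just the hrefs.
pattern = re.compile(
    r'<a.*?href\s*=\s*["\']([^"\'>]*(?:red|green|blue)[^"\'>]*)["\'][^>]*>.*?</a>',
    re.S | re.I)

print(pattern.findall(html))  # ['/red/1', '/green/2']
```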
poke
Got it! Thanks.
rocky
+1  A: 
/<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?<\/a>/si

You have to be very careful with `.*`, even in the non-greedy form. `.` easily matches more than you bargained for, especially in dotall mode. For example:

<a name="foo">anchor</a>
<a href="...">...</a>

This matches from the start of the first `<a` to the end of the second.
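The over-match is easy to reproduce (Python sketch of the pattern quoted above, with `re.S | re.I` for `/si`):

```python
import re

# The lazy .* crawls across the href-less first anchor to find "href"
# in the second tag, so one match spans both elements.
html = '<a name="foo">anchor</a>\n<a href="...">...</a>'
pattern = re.compile(r'<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?</a>',
                     re.S | re.I)

match = pattern.search(html)
print(match.group(0))  # spans from the first <a to the second </a>
```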

Not to mention cases like:

<a href="a"></a >
<a href="b"></a>

or:

<a href="a'b>c">

or:

<a data-href="a" title="b>c" href="realhref">

or:

<!-- <a href="notreallyalink"> -->

and many many more fun edge cases. You can try to refine your regex to catch more possibilities, but you'll never get them all, because HTML cannot be parsed with regex (tell your friends)!

HTML+regex is a fool's game. Do yourself a favour. Use an HTML parser.

bobince
+1. I note that the markup syntax highlighting itself gets confused by the above, and I'm willing to bet that a regexp is involved somewhere!
Brian Agnew
Yep! SO's syntax highlighting is a highly complex regex bodge. It can't really parse HTML or XML properly (even if it wanted to include a full HTML parser in JS) because it doesn't even know that the above code blocks are HTML! SO makes a good guess at it, and it's impressive it does as well as it does, but it can never really get it right. But that's OK, as it's only for some colouring, not anything vital.
bobince