I'd like to create a regex that will match an opening <a> tag that contains only an href attribute:

<a href="doesntmatter.com">

It should match the above, but not match when other attributes are added:

<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">

Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:

&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;

But not match this:

&#60;a href&#61;&#34;doesntmatter.com&#34; onmouseover&#61;&#34;alert&#40;&#39;do something evil with javascript.&#39;&#41;&#34; &#62;

Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).

Thanks!

+3  A: 

The initial regular expression that comes to mind is /<a href=".*?">/, where the lazy quantifier (.*?) matches the string between the quotes. However, as pointed out in the comments, the pattern has to end with a quote followed by >, so the engine keeps extending .*? until it finds one. It therefore matches the invalid tag as well, because a match can still be made.
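
To make the problem concrete, here is a minimal sketch in Python (chosen only for illustration; the question does not name a language) that runs the lazy pattern against both tags from the question. The variable names are hypothetical.

```python
import re

# The two sample tags from the question, in their unencoded form.
good = '<a href="doesntmatter.com">'
bad = '<a href="doesntmatter.com" onmouseover="alert(\'Do something evil with Javascript\')">'

# Lazy quantifier: .*? grows one character at a time until "> can match.
lazy = re.compile(r'<a href=".*?">')

good_match = lazy.search(good)
bad_match = lazy.search(bad)

print(good_match is not None)     # the href-only tag matches
print(bad_match is not None)      # the tag with onmouseover matches as well
print(bad_match.group(0) == bad)  # .*? was extended across the extra attribute
```

Being lazy only changes which match is tried first; it does not stop the engine from extending the match when the shorter attempt fails.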

To get around this problem, you can use atomic grouping. Atomic grouping tells the regular expression engine, "once you have found a match for this group, accept it". This stops the engine from going back into the group and extending the match into the second attribute after it fails to find a > at the end of the href. The regular expression with an atomic group would look like:

/<a (?>href=".*?")>/

Which would look like the following when replacing the characters with their HTML entities:

/&#60;a (?>href&#61;&#34;.*?&#34;)&#62;/
Daniel Vandersluis
But correct me if I'm wrong: when you use the .* even if you make it non-greedy with .*?, it will capture everything up until the last quote in the onmouseover attribute, matching both of the expressions. This is exactly the problem I am having!
James D
You're right; because of the `>` at the end of the expression, the regular expression engine will match the invalid tag, because there is a match from the first quote to the end of the string. The solution here is to use an atomic group; I will update my answer to explain.
Daniel Vandersluis
Ah yeah, using the atomic group is a great idea (I usually make use of possessive quantifiers; essentially the same thing)! I added it to my regex, because otherwise it would do a lot of unnecessary backtracking.
Blixt
Thanks guys. Great answers. I have no idea which one of these to mark as answered now. :)
James D
+1  A: 

I don't see how matching one is different from the other. You're just looking for exactly what you wrote, with doesntmatter.com as the part you capture. Matching anything up to &#34; (not &quot;?) can present a problem, but you can do it like this in regex:

(?:(?!&#34;).)*

It essentially means:

  • Match the following group 0 or more times
    • Fail match if the following string is "&#34;"
    • Match any character (except new line unless DOTALL is specified)
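
The steps above can be tried out with a short Python sketch (Python is used only for illustration). It wraps the construct in a capturing group between two &#34; entities; that wrapper and the variable names are assumptions added for the demonstration.

```python
import re

# Encoded tag with an extra attribute, copied from the question.
encoded = ('&#60;a href&#61;&#34;doesntmatter.com&#34; '
           'onmouseover&#61;&#34;alert&#40;&#39;do something evil with '
           'javascript.&#39;&#41;&#34; &#62;')

# (?:(?!&#34;).)* : before consuming each character, check that the
# entity &#34; does not start at this position; stop as soon as it does.
until_quote = re.compile(r'&#34;((?:(?!&#34;).)*)&#34;', re.DOTALL)

first = until_quote.search(encoded)
print(first.group(1))  # doesntmatter.com
```

Other entities such as &#61; are consumed normally; only the exact sequence &#34; ends the repetition.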

The complete regular expression would be:

/&#60;a href&#61;&#34;(?>(?:[^&]+|(?!&#34;).)*)&#34;&#62;/s

Because the atomic group never gives characters back, this fails quickly on a non-matching tag instead of backtracking the way a non-greedy expression does.

Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here for the sake of optimization (this pattern can never match if it has to backtrack).

I also threw in an additional [^&]+ group to avoid repeating the negative look-ahead so many times.

Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):

/&#60;a href&#61;&#34;(?:[^&]+|(?!&#34;).)*+&#34;&#62;/s

As you can see it's slightly shorter.

Blixt
That's what I was looking for - negative lookahead. Thanks~
James D
+2  A: 

Hey! I had to do a similar thing recently. I recommend decoding the HTML first and then grabbing the info you want. Here's my solution in C#:

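using System.Text.RegularExpressions;

// Returns "href,text" for every <a> element found in already-decoded HTML.
private string getAnchor(string data)
{
    // Named groups: "href" is the attribute value, "text" is the link text.
    string pattern = @"<a.*?href=[""'](?<href>.*?)[""'].*?>(?<text>.*?)</a>";
    // Multiline only changes ^ and $, which this pattern does not use.
    Regex myRegex = new Regex(pattern, RegexOptions.Multiline);
    string anchor = "";

    MatchCollection matches = myRegex.Matches(data);

    foreach (Match match in matches)
    {
        // Results are appended back to back, with no separator between anchors.
        anchor += match.Groups["href"].Value.Trim() + "," + match.Groups["text"].Value.Trim();
    }

    return anchor;
}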

I hope that helps!
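
For the original question, the decode-first idea could look roughly like the Python sketch below (Python rather than the C# above, for illustration only). The function name is_plain_anchor is hypothetical, and the strict pattern is an assumption about what "href only" should mean; it is not taken from the C# code.

```python
import html
import re

# After decoding, a plain character class is enough to stop at the closing quote.
PLAIN_ANCHOR = re.compile(r'<a href="[^"]*"\s*>')

def is_plain_anchor(encoded_tag: str) -> bool:
    """Decode the entities, then require an <a> tag whose only attribute is href."""
    decoded = html.unescape(encoded_tag)
    return PLAIN_ANCHOR.fullmatch(decoded) is not None

good_encoded = '&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;'
bad_encoded = ('&#60;a href&#61;&#34;doesntmatter.com&#34; '
               'onmouseover&#61;&#34;alert&#40;&#39;do something evil with '
               'javascript.&#39;&#41;&#34; &#62;')

print(is_plain_anchor(good_encoded))  # True
print(is_plain_anchor(bad_encoded))   # False
```

The check works on a decoded copy, so the encoded input itself is left untouched and does not need to be re-encoded.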

I upvoted this as this was going to be my 2nd solution but I'd prefer not to have to encode/decode and possibly re-encode the HTML if it has those extra attributes. Thanks for the help!
James D