the regexp

\<div class=g\>.*?\<a href=\"?(http:\/\/stackoverflow.com\/)\"?.*?\>.*?\<a href=\"?(.+?)\"?.*?\>.*?\<\/div\>

the target

<div class=g>
  <link rel=prefetch href="http://stackoverflow.com/">
  <h2 class=r>
    <a href="http://stackoverflow.com/" class=l onmousedown="return rwt(this,'','','dres','1','AFQjCNERidL9Hb6OvGW93_Y6MRj3aTdMVA','&amp;sig2=ybSqh-7yEKCGx_2MNIb7tA')">
      <em>Stack Overflow</em>
    </a>
  </h2>
  <table border=0 cellpadding=0 cellspacing=0>
    <tr>
      <td class=j>
        <font size=-1>
          <span class=f>Categoria: </span>
          <a href="/Top/Computers/Programming/Resources/Chats_and_Forums/?il=1">Computers&nbsp;&gt;&nbsp;Programming&nbsp;&gt;&nbsp;Resources&nbsp;&gt;&nbsp;Chats&nbsp;and&nbsp;Forums</a>
          <br>A language-independent collaboratively edited question and answer site for programmers. Questions and answers displayed by user votes and tags.<br>
          <span class=a><b>stackoverflow</b>.com/</span>
        </font>
      </td>
    </tr>
  </table>
</div>

It should match everything and capture http://stackoverflow.com/ and /Top/Computers/Programming/Resources/Chats_and_Forums/?il=1, but it matches everything and captures http://stackoverflow.com/ and just /

Why?

+1  A: 

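That is because the second group in your regex matches reluctantly (a.k.a. lazy or ungreedy matching): it takes as few characters as it can, and the \"? and .*? that follow it can absorb the rest of the attribute, so a single / is enough. For more info, see http://www.regular-expressions.info/repeat.html, especially the section "Laziness Instead of Greediness".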

That's why it doesn't work as you expected it to.
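
To make the lazy behaviour concrete, here is a minimal Python sketch. The one-line `tag` string is a hypothetical sample cut down from the question's target, and the patterns are simplified versions of the second half of the original regex, so treat it as an illustration rather than the asker's exact setup.

```python
import re

# Hypothetical one-line sample, cut down from the question's target HTML.
tag = '<a href="/Top/Computers/?il=1">Computers</a>'

# Lazy group: (.+?) stops after one character, because the optional quote
# and the lazy .*? that follow can consume the rest of the tag.
lazy = re.search(r'<a href="?(.+?)"?.*?>', tag)
print(lazy.group(1))  # /

# Greedy group: (.+) grabs as much as it can and only backs off far enough
# to leave a final ">" for the end of the pattern.
greedy = re.search(r'<a href="?(.+)"?.*?>', tag)
print(greedy.group(1))  # /Top/Computers/?il=1">Computers</a
```

Neither quantifier gives the wanted value here: the lazy one stops too early and the greedy one overshoots.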

Now, as to fixing your problem: use a proper parser for this or some existing tool to get attributes from html (jQuery can do this quite nicely, I heard). Don't try to do this with regex: you may get it working for this case, but next week you'll be here again because something else broke.

Best of luck!

Bart Kiers
+1  A: 

I'm definitely not one of those "omg, you said HTML and regex in the same sentence, you must die" types, but this is clearly not a situation where regex is the best tool for the job. (It isn't even a good tool, or a functioning one, here.)

Parse it with an XML/HTML parser, and save yourself a lot of hassle and abuse from your colleagues.
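
As a rough sketch of what the parser route could look like, the following uses Python's standard-library `html.parser`. The `HrefCollector` class name and the trimmed `snippet` are hypothetical, chosen only for illustration; any real HTML parser would do the same job.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; only <a> tags are wanted,
        # so the <link rel=prefetch ...> element is skipped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical snippet, trimmed from the target in the question.
snippet = """<div class=g>
  <link rel=prefetch href="http://stackoverflow.com/">
  <h2 class=r>
    <a href="http://stackoverflow.com/" class=l><em>Stack Overflow</em></a>
  </h2>
  <span class=f>Categoria: </span>
  <a href="/Top/Computers/Programming/Resources/Chats_and_Forums/?il=1">Computers</a>
</div>"""

collector = HrefCollector()
collector.feed(snippet)
print(collector.hrefs)
```

The parser copes with unquoted attributes such as class=g and does not depend on the order or spacing of attributes, which is where the regex approach tends to break.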

nickf
Exactly my thoughts. I am not one to jump on the bandwagon of the *Parser Police* (the name is not my invention!), but this is definitely not suited for a regex.
Bart Kiers
+1  A: 

The problem is this...

(.*?)

Why are you placing a question mark here? With that present, you're only getting the '/' in your search, because a ? after a quantifier makes it lazy, so the group matches as few characters as it can. If you replace it with the following...

([^"]+)

This matches every character that isn't a double quote, so you should get everything: the stackoverflow href and the other href you mentioned.
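
A quick way to check the replacement is the Python sketch below. It assumes a trimmed, hypothetical copy of the target and a simplified pattern (without the unnecessary backslashes), and it needs re.DOTALL because the target spans several lines.

```python
import re

# Hypothetical target, trimmed from the HTML in the question.
target = """<div class=g>
  <h2 class=r>
    <a href="http://stackoverflow.com/" class=l><em>Stack Overflow</em></a>
  </h2>
  <span class=f>Categoria: </span>
  <a href="/Top/Computers/Programming/Resources/Chats_and_Forums/?il=1">Computers</a>
</div>"""

# Same shape as the original regex; only the second group differs.
original = re.compile(
    r'<div class=g>.*?<a href="?(http://stackoverflow\.com/)"?.*?>'
    r'.*?<a href="?(.+?)"?.*?>.*?</div>',
    re.DOTALL,
)
fixed = re.compile(
    r'<div class=g>.*?<a href="?(http://stackoverflow\.com/)"?.*?>'
    r'.*?<a href="?([^"]+)"?.*?>.*?</div>',
    re.DOTALL,
)

before = original.search(target)
after = fixed.search(target)
print(before.group(2))  # /
print(after.group(1))   # http://stackoverflow.com/
print(after.group(2))   # /Top/Computers/Programming/Resources/Chats_and_Forums/?il=1
```

The negated class works because it cannot run past the closing quote, so it does not need to be lazy. It does assume the href value is quoted.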

I'm not entirely sure why you're doing this. It's possible that you're using regular expressions when you don't have to. What is the purpose of this regular expression? It seems like overkill.

MillsJROSS
I think you meant (.+?). Now it works without an HTML parser. Thanks!
Delirium tremens