ansaurus

Question

Regex that finds hyperlinks while excluding plain text.

Answer 1

A:

How about this? The last part matches any run of characters other than ', ", > or a space:

http://rapidshare.com/files/(\d+)/([^'"> ]+)
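As a rough illustration, here is how that pattern might behave in Python. The sample text and the variable names are made up for the example, and the dot in rapidshare.com is escaped here, which the pattern above does not do.

```python
import re

# Pattern from the answer above: digits for the file id, then any run of
# characters that are not a quote, '>' or a space.
# Assumption: the dot in the host name is escaped so it matches literally.
LINK_RE = re.compile(r"""http://rapidshare\.com/files/(\d+)/([^'"> ]+)""")

# Hypothetical input: one link inside an href, one in plain text.
sample = (
    '<a href="http://rapidshare.com/files/12345/archive.rar">download</a> '
    "or try http://rapidshare.com/files/67890/notes.txt today"
)

# findall returns one (file id, file name) tuple per link found.
matches = LINK_RE.findall(sample)
print(matches)
```

Both links are picked up, because the closing quote ends the first one and the space ends the second. Note that < is not in the excluded set, so a plain-text link immediately followed by a tag would run on into that tag.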

S.Mark 2010-01-13 14:34:25

Cool. I'll give this a go. The project is on the back burner atm :-( Cheers!

Conor H 2010-01-29 21:15:25

Answer 2

A:

To capture the inner text of an anchor tag, while ignoring all attribute text of the tag, you'd use the pattern:

<a href="http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>

The [^>]* part matches everything else in your tag up until the end of the start tag. The (.*?) performs a non-greedy capture of the inner text.
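A minimal sketch of the anchor-tag pattern in Python, assuming a single anchor per input string. The sample markup and variable names are invented for illustration, and the dot in the host name is escaped here.

```python
import re

# Anchor-tag pattern from the answer: file id, file name, extension,
# then the rest of the start tag, then the inner text (non-greedy).
ANCHOR_RE = re.compile(
    r'<a href="http://rapidshare\.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>'
)

# Hypothetical input with an extra attribute after the href.
sample = (
    '<a href="http://rapidshare.com/files/12345/archive.rar" class="dl">'
    "Get archive</a>"
)

m = ANCHOR_RE.search(sample)
file_id, name, ext, inner_text = m.groups()
print(file_id, name, ext, inner_text)
```

The extra class attribute is swallowed by [^>]* and the inner text lands in the last group. Because (.+) is greedy, this sketch can overrun when several links share one line, so a stricter character class may be safer in that case.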

If you want to capture anchor tag links and non-anchor tag links, then those are really two separate problems. There's probably a regex for it, but it would be terribly complicated. You're better off simply looking for non-anchor-tag links separately with the simple regex:

[^'"]http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})
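To show what the leading [^'"] buys you, here is a small Python sketch run over made-up input containing one href link and one plain-text link. The names and sample text are assumptions for the example, and the dot in the host name is escaped here.

```python
import re

# Plain-text pattern from the answer: the link must be preceded by a
# character that is not a single or double quote.
PLAIN_RE = re.compile(
    r"""[^'"]http://rapidshare\.com/files/(\d+)/(.+)\.(\w{3,4})"""
)

# Hypothetical input: the first link sits inside an href, the second does not.
sample = (
    '<a href="http://rapidshare.com/files/12345/archive.rar">get</a> '
    "also http://rapidshare.com/files/67890/notes.txt"
)

# Only the link that is not preceded by a quote should be reported.
plain_links = PLAIN_RE.findall(sample)
print(plain_links)
```

The quoted href link is skipped and only the plain-text one is returned. One side effect worth knowing: since the pattern consumes a character before http, a link at the very start of the input will not match.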

Chris S 2010-01-13 14:41:02

The OP states that "not all links are embedded in href tags", so your suggestion doesn't fit the bill.

Kitson 2010-01-13 18:53:48

Answer 3

A:

How about something like:

/http:\/\/rapidshare.com\/files\/\d+\/[^<&"\s]+\.\w{3,4}/

I removed the capturing groups, since I suspect you only included them to make sure the different parts grouped correctly. You can add them back in if you really want those pieces parsed out.

You can expand on the [^<&"\s] class. As written it excludes only whitespace, the < character (which could be the start of a tag), the & character (which starts &nbsp; and other HTML entities) and the " character (which would be the end of the href= value), but you could exclude any character that is not valid in a URI if you wanted. This should cover links in your inline text as well as those embedded in an href.
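Here is a hedged Python sketch of this approach, using the character class described above. The sample text is invented, the group names file_id, name and ext are hypothetical, and the dot in the host name is escaped here.

```python
import re

# Group-free version: stop at '<', '&', '"' or whitespace, then require
# a dot and a 3-4 character extension.
LINK_RE = re.compile(r'http://rapidshare\.com/files/\d+/[^<&"\s]+\.\w{3,4}')

# Same pattern with capturing groups added back in (names are hypothetical).
PARSED_RE = re.compile(
    r'http://rapidshare\.com/files/(?P<file_id>\d+)/'
    r'(?P<name>[^<&"\s]+)\.(?P<ext>\w{3,4})'
)

# Hypothetical input: an href link, then a plain link followed by an entity.
sample = (
    '<a href="http://rapidshare.com/files/12345/archive.rar">get</a> '
    "http://rapidshare.com/files/67890/notes.txt&nbsp;<br>"
)

links = LINK_RE.findall(sample)
parsed = [m.groupdict() for m in PARSED_RE.finditer(sample)]
print(links)
print(parsed)
```

Both the href link and the inline link are found, and the trailing entity and tag are left out of the second match because & is in the excluded set.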

Kitson 2010-01-13 19:07:45
