Hi There,

I'm looking to scrape rapidshare.com links from websites. I have the following regular expressions to find links:

<a href=\"(http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4}))\"

http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4})

How can I write a regex that excludes the text embedded in an <a href="..."> attribute and captures only the text in >here</a>?

I also have to bear in mind that not all links are embedded in href attributes. Some are just displayed in plain text.

Basically, is there a way to exclude patterns in a regex?

Thanks.

A: 

How about this? The last part matches any run of characters other than ', ", > or a space:

http://rapidshare.com/files/(\d+)/([^'"> ]+)
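
For what it's worth, here is a minimal Python sketch of that pattern in use. The sample string and variable names are made up for illustration, and the dot in the host name is escaped, which the pattern above does not do. Note that the character class does not exclude <, so a URL repeated as the anchor text would pick up a trailing </a.

```python
import re

# Negated character class: the file part stops at a quote, '>' or a space.
LINK_RE = re.compile(r"""http://rapidshare\.com/files/(\d+)/([^'"> ]+)""")

# Hypothetical input mixing an href link and a plain-text link.
sample = (
    '<a href="http://rapidshare.com/files/123/movie.avi">download</a> '
    "or plain http://rapidshare.com/files/456/song.mp3 here"
)

links = [m.group(0) for m in LINK_RE.finditer(sample)]
parts = [(m.group(1), m.group(2)) for m in LINK_RE.finditer(sample)]

for url in links:
    print(url)
```

Both the href link and the plain-text link are picked up, because the pattern never looks at what surrounds the URL.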
S.Mark
Cool. I'll give this a go. The project is on the back burner at the moment :-( Cheers!
Conor H
A: 

To capture the inner text of an anchor tag, while ignoring all attribute text of the tag, you'd use the pattern:

<a href="http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>

The [^>]* part matches everything else in your tag up until the end of the start tag. The (.*?) performs a non-greedy capture of the inner text.
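
As a rough illustration, this is how the anchor pattern could be applied in Python. The sample markup is invented, and the host-name dot is escaped here as a small tightening. Keep in mind that the greedy (.+) can run past the first link if one line holds several anchors, so treat this as a sketch rather than a robust HTML parser.

```python
import re

# Groups: file id, file name, extension, then the anchor's inner text.
ANCHOR_RE = re.compile(
    r'<a href="http://rapidshare\.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>'
)

# Hypothetical anchor with an extra attribute after the href.
sample = (
    '<a href="http://rapidshare.com/files/123/movie.avi" class="dl">'
    "Get the movie</a>"
)

match = ANCHOR_RE.search(sample)
file_id, name, ext, inner_text = match.groups()
print(inner_text)
```

The last group holds only the text between the start and end tags; the attribute text is consumed by [^>]* and never captured.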

If you want to capture anchor tag links and non-anchor tag links, then those are really two separate problems. There's probably a regex for it, but it would be terribly complicated. You're better off simply looking for non-anchor-tag links separately with the simple regex:

[^'"]http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})
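
One caveat with the leading [^'"] is that it consumes a character, so it cannot match a link at the very start of the input. A possible variant, sketched here in Python, swaps it for a negative lookbehind, which is one direct way to "exclude" a pattern. This is my own substitution rather than the pattern above, and the (.+) is also replaced with a narrower character class so that one match cannot swallow a second link. The sample text is invented.

```python
import re

# (?<!['"]) asserts that the URL is NOT preceded by a quote,
# which skips links sitting inside href="..." attributes.
PLAIN_RE = re.compile(
    r"""(?<!['"])http://rapidshare\.com/files/(\d+)/([^\s'"<>]+)\.(\w{3,4})"""
)

# Hypothetical input: one plain-text link, one link inside an href.
sample = (
    "http://rapidshare.com/files/456/song.mp3 is plain, "
    '<a href="http://rapidshare.com/files/123/movie.avi">x</a> is not'
)

plain_links = [m.group(0) for m in PLAIN_RE.finditer(sample)]
print(plain_links)
```

Only the unquoted link is returned; the quoted href link is rejected by the lookbehind.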
Chris S
The OP states that "not all links are embedded in href tags", so your suggestion doesn't fit the bill.
Kitson
A: 

How about something like:

/http:\/\/rapidshare.com\/files\/\d+\/[^<&"\s]+\.\w{3,4}/

I got rid of the capturing groups, because I suspect you only added them to make sure the different parts matched. You can add them back in if you really want those parts parsed out.

You can expand on the [^<&"\s] class. As written, it excludes only whitespace, the < character (which could be the start of a tag), the & character (which begins HTML entities such as &nbsp;) and the " character (which ends the href= value), but you could exclude any character that is not valid in a URI if you wanted. This should cover links in inline text as well as those embedded in an href.
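
A quick Python sketch of this single-pattern approach, assuming the character class excludes <, &, " and whitespace as described. The sample input is hypothetical, and the de-duplication step is my own addition, since the same URL often appears in both the href and the anchor text.

```python
import re

# No capturing groups, so findall returns the whole matched URL.
URL_RE = re.compile(r'http://rapidshare\.com/files/\d+/[^<&"\s]+\.\w{3,4}')

# Hypothetical input: the same link in an href and as anchor text,
# followed by an entity and a second, plain-text link.
sample = (
    '<a href="http://rapidshare.com/files/123/movie.avi">'
    "http://rapidshare.com/files/123/movie.avi</a>"
    "&nbsp;http://rapidshare.com/files/456/song.mp3"
)

all_links = URL_RE.findall(sample)
# dict.fromkeys keeps the first occurrence of each URL, in order.
unique_links = list(dict.fromkeys(all_links))
print(unique_links)
```

The href value ends at the closing quote and the anchor text ends at the <, so both copies come out clean before being collapsed into one entry.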

Kitson