Is there a way to gather all links with a specific domain from a string, including only those that appear in one of these two forms:

href="http://yahoo.com/media/news.html"

or

>http://yahoo.com/media/news.html<

So basically, links that are either prefixed by href=" and end with "

or

links that are surrounded by > and <.

I tried Regex ( "href=\"([^\"]*)\"></A>" ) but it didn't match anything.

+1  A: 

try:

http=\"(.+)\"
Anatoly G
Thanks. I tried Regex ( "http=\"(.+)\"" ) with pattern.Matches ( text ), but it didn't match anything.
Joan Venge
+4  A: 
(href="[^"]*")|(>[^<]*<)

Starts with href=", followed by characters that are not ", ending with "

or

Starts with >, followed by characters that are not <, ending with <

mbeckish
Thanks a lot. Do I have to escape these characters? It won't compile; the editor shows red underlines. I just wrapped it in "".
Joan Venge
Yes - if you surround this with " to make a string, you'll need to escape the " within the string.
mbeckish
Note, this just matches literally what you asked for. It doesn't take anything else into account, such as ensuring there is a valid URL being captured, looking at what kind of tag it is, etc.
mbeckish
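The escaping point above can be sketched roughly as follows. This is only an illustration: the sample `html` string is made up, and the variable names are hypothetical. It shows the same pattern written as a regular string literal and as a verbatim string literal, then lists what the pattern picks up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Made-up input for illustration; it is not from the thread.
string html = "<a href=\"http://yahoo.com/media/news.html\">news</a> <b>bold</b>";

// Regular string literal: each " inside the pattern is escaped as \".
string escaped = "(href=\"[^\"]*\")|(>[^<]*<)";

// Verbatim string literal: each " inside the pattern is doubled instead.
string verbatim = @"(href=""[^""]*"")|(>[^<]*<)";

// Both literals should yield the same pattern text.
Console.WriteLine(escaped == verbatim);

// Collect every match, delimiters included.
var found = new List<string>();
foreach (Match m in Regex.Matches(html, escaped))
{
    found.Add(m.Value);
    Console.WriteLine(m.Value);
}
```

On this input the second branch also picks up any text that sits between a > and a <, such as >news< and >bold<, because the pattern does not check for a URL.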
Thanks, got it. I get many (><) matches though. Is it because they fit the criteria too? I actually changed it to ( "(href=yahoo\"[^\"]*\")|(>[^<]*<)" ) but it still picks them up.
Joan Venge
Sorry, I meant: ( "(href=http://yahoo\"[^\"]*\")|(>[^<]*<)" )
Joan Venge
OK, I got it to do that now and it works. One last question: is it possible not to capture the >< along with the link?
Joan Venge
Now I get the links like >coollink<
Joan Venge
Yes, put the parts you don't want to capture in look-arounds.
mbeckish
+3  A: 

Try the following:

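// Requires: using System; using System.Text.RegularExpressions;
string[] inputs = { "href=\"http://yahoo.com/media/news.html\"", ">http://yahoo.com/media/news.html<" };

// Match either prefix without capturing it, then capture the URL lazily
// up to the first closing < or ".
string pattern = @"(?:href=""|>)(?<Url>http://.+?)[<""]";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        // The named group holds the URL without its delimiters.
        Console.WriteLine(m.Groups["Url"].Value);
    }
}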

EDIT: another approach is to use look-arounds, so that the surrounding delimiters are checked but not included in the match. This allows you to use Match.Value directly instead of using groups. Try this alternate approach below.

string pattern = @"(?<=href=""|>)http://.+?(?=<|"")";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        Console.WriteLine(m.Value);
    }
}

EDIT #2: per the request in the comments here is a pattern that will not match URLs that contain "..." in the text.

string pattern = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

The only change is the addition of (?!.*\.{3}), a negative look-ahead that lets the pattern match only if the specified text does not appear ahead of the current position. In this case it checks that "..." is absent. Since any run of more than 3 dots already contains 3 consecutive dots, longer runs are rejected as well, so {3,} is not needed here.
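One thing worth checking, as a rough sketch: the `.*` inside the look-ahead scans the rest of the line, so a "..." that appears after the URL would also block the match. The `urlOnly` variant below is a hypothetical adjustment, not part of the original answer; it assumes the check should stop at the closing < or ". Both input strings are made up.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from EDIT #2: the look-ahead scans the rest of the line.
string wholeLine = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

// Hypothetical variant: [^<""]* keeps the check inside the URL itself.
string urlOnly = @"(?<=href=""|>)http://(?![^<""]*\.{3}).+?(?=<|"")";

// Made-up inputs for illustration.
string truncated = ">http://yahoo.com/media/ne...<";
string trailing = ">http://yahoo.com/media/news.html< more...";

// The truncated URL is rejected by both patterns.
Console.WriteLine(Regex.IsMatch(truncated, wholeLine)); // False
Console.WriteLine(Regex.IsMatch(truncated, urlOnly));   // False

// Dots after the URL block only the whole-line version.
Console.WriteLine(Regex.IsMatch(trailing, wholeLine));  // False
Console.WriteLine(Regex.IsMatch(trailing, urlOnly));    // True
```

Whether the stricter or the looser check is the right one depends on the input; if each URL sits on its own line, the two behave the same.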

Ahmad Mageed
Thanks, this works, but is it possible to get just the links? I am getting the links like ">coollink<".
Joan Venge
@Joan: please show me what your input looks like. Is it different from the 2nd item I use in the `inputs` array?
Ahmad Mageed
Thanks Ahmad. My input is exactly the same, but I get match results like this: >http://yahoo.com/media/news.html< So it includes the >< too.
Joan Venge
By the way, I am getting the results using match.Value. Should I use the groups like you did?
Joan Venge
@Joan: yes, the group is crucial to the way I set up the regex. Using match.Value returns the entire match, which is why you see the >< characters.
Ahmad Mageed
Thanks Ahmad, now it works brilliantly.
Joan Venge
@Joan: great! I updated my post with an alternate approach that allows you to use `Match.Value` if you prefer that. For large documents the first approach will probably perform better.
Ahmad Mageed
Thanks Ahmad, so using groups is faster than using Value directly?
Joan Venge
@Joan: no, that's not what I meant. Patterns with look-arounds usually perform slower than patterns without them. I wouldn't avoid them without testing, though, since they are extremely useful and make pattern matching easier at times.
Ahmad Mageed
Thanks Ahmad, learnt a few things now :O
Joan Venge
I've actually finished the parsing, but I have one more question if you don't mind. If there is a substring that should disqualify a match whenever it appears, is that doable with regex? Or should I just filter those results out with standard logic, like I do now? Basically I am trying to avoid links with "..." in them.
Joan Venge
Also Ahmad, there seems to be a problem. When I checked some of the matches, only one of the two forms was returned. For instance, if a link on the website is coded as a real link, it appears both between >< and inside "", but I only get one of them, which in my case seems to be the >< version rather than the "" version that contains the full link.
Joan Venge
@Joan I'll have to get back to you on the earlier question. As for the last one, are you using `Regex.Matches`? Can you post a representative sample of the input so I'm sure I understand the issue?
Ahmad Mageed
@Joan: please see my 2nd edit. It handles the "..." scenario.
Ahmad Mageed
Thanks Ahmad, I have cropped the website stream that I am using. It shows all the cases: links inside ""s, links between ><s, and the ... case. Basically, if a link is inside href="", it's formatted as a link in the forum, so it has both the display text (the link between ><) and the link itself. But if it's just written as text, it appears as-is and there is only one link, the one between ><. Does that make sense? I only need the full link, which means always taking the href="link" and, if that's not available, taking the link between >link<.
Joan Venge
http://codepaste.net/zmpaz6
Joan Venge
Hi Ahmad, did the sample help? If it's not doable, don't worry.
Joan Venge
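One possible way to approach the last request (prefer the href link, fall back to the text between ><, and skip anything containing "...") is sketched below. It is only a guess at the intended behavior: the linked sample is not reproduced here, so the `html` string is made up, and the pattern and variable names are hypothetical. It relies on Regex.Matches to walk every match and on .NET allowing the same group name in both branches of an alternation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Made-up input: one real link with truncated display text, one plain-text URL.
string html =
    "<a href=\"http://yahoo.com/media/news.html\">http://yahoo.com/med...</a>" +
    "<p>http://yahoo.com/sports/scores.html</p>";

// The same group name in both branches, so either form fills "Url".
string pattern = @"href=""(?<Url>http://[^""]+)""|>(?<Url>http://[^<]+)<";

var seen = new HashSet<string>();
var links = new List<string>();

// Regex.Matches returns every match; Regex.Match returns only the first.
foreach (Match m in Regex.Matches(html, pattern))
{
    string url = m.Groups["Url"].Value;
    if (url.Contains("...")) continue;   // skip truncated display text
    if (seen.Add(url)) links.Add(url);   // keep each URL once, in order
}

foreach (string link in links)
{
    Console.WriteLine(link);
}
```

With this approach a link that appears both in href="" and as full display text is kept once, and a plain-text URL with no href is still collected.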