ansaurus

Question

Simple regex question to parse similar things in .NET?

Answer 1

+1 A:

try:

http=\"(.+)\"

Anatoly G 2010-03-04 21:56:44

Thanks, I tried Regex ( "http=\"(.+)\"" ) with pattern.Matches ( text ), but it didn't match anything.

Joan Venge 2010-03-04 21:59:13

Answer 2

+4 A:

(href="[^"]*")|(>[^<]*<)

Starts with href=", followed by characters that are not ", ending with "

or

Starts with >, followed by characters that are not <, ending with <

mbeckish 2010-03-04 22:03:05

Thanks a lot. Do I have to escape these characters? It doesn't let me compile because of red underlines in the editor. I just wrapped it in "".

Joan Venge 2010-03-04 22:04:53

Yes - if you surround this with " to make a string, you'll need to escape the " within the string.

mbeckish 2010-03-04 22:07:55
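A minimal sketch of the escaping described above, assuming the pattern from this answer. The class name `EscapingDemo` and the sample HTML are made up for illustration. C# offers two ways to write the quotes: `\"` in a regular string literal, or `""` in a verbatim (`@"..."`) literal.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapingDemo
{
    // Regular string literal: each " inside the pattern is written as \"
    public const string Regular = "(href=\"[^\"]*\")|(>[^<]*<)";

    // Verbatim string literal: each " inside the pattern is doubled as ""
    public const string Verbatim = @"(href=""[^""]*"")|(>[^<]*<)";

    static void Main()
    {
        // Both literals produce exactly the same pattern text.
        Console.WriteLine(Regular == Verbatim); // True

        // Hypothetical sample input, not taken from the original question.
        string text = "<a href=\"http://yahoo.com/media/news.html\">http://yahoo.com/media/news.html</a>";
        foreach (Match m in Regex.Matches(text, Verbatim))
        {
            // Prints each full match, delimiters included.
            Console.WriteLine(m.Value);
        }
    }
}
```

The verbatim form is usually easier to read for regexes, because backslashes in the pattern do not need to be doubled as well.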

Note, this just matches literally what you asked for. It doesn't take anything else into account, such as ensuring there is a valid URL being captured, looking at what kind of tag it is, etc.

mbeckish 2010-03-04 22:09:36

Thanks, got it. In the matches I get many (><) matches though. Is it because they fit the criteria too? I actually changed it to ( "(href=yahoo\"[^\"]*\")|(>[^<]*<)" ) but it still picks them up.

Joan Venge 2010-03-04 22:11:55

Sorry, I meant: ( "(href=http://yahoo\"[^\"]*\")|(>[^<]*<)" )

Joan Venge 2010-03-04 22:13:48

Ok, now I got it to do that. It works. One last thing I want to ask is whether it's possible not to capture the >< with the link?

Joan Venge 2010-03-04 22:18:47

Now I get the links like >coollink<

Joan Venge 2010-03-04 22:19:16

Yes, put the parts you don't want to capture in look-arounds.

mbeckish 2010-03-04 23:16:52
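One way the look-around suggestion could look when applied to the pattern from this answer. This is a sketch, not mbeckish's own code: the class and method names are hypothetical, and `*` has been changed to `+` (an assumption on my part) so that an empty `><` pair does not produce empty matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookAroundDemo
{
    // (?<=...) checks the text before the match, (?=...) checks the text after it.
    // Neither becomes part of Match.Value.
    public const string Pattern = @"(?<=href="")[^""]+(?="")|(?<=>)[^<]+(?=<)";

    public static List<string> Extract(string text)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        // Hypothetical sample input.
        string text = "<a href=\"http://yahoo.com/media/news.html\">coollink</a>";
        foreach (string s in Extract(text))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // http://yahoo.com/media/news.html
        // coollink
    }
}
```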

Answer 3

+3 A:

Try the following:

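using System;
using System.Text.RegularExpressions;

// Sample inputs: one URL inside href="...", one between > and <.
string[] inputs = { "href=\"http://yahoo.com/media/news.html\"", ">http://yahoo.com/media/news.html<" };

// Non-capturing prefix (href=" or >), then the URL in the named group "Url",
// ending at the first < or ".
string pattern = @"(?:href=""|>)(?<Url>http://.+?)[<""]";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        // Read the named group, not m.Value, to exclude the delimiters.
        Console.WriteLine(m.Groups["Url"].Value);
    }
}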

EDIT: another approach is to use look-arounds so that the surrounding delimiters are checked but not included in the match. This allows you to use Match.Value directly instead of using groups. Try this alternate approach below.

string pattern = @"(?<=href=""|>)http://.+?(?=<|"")";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        Console.WriteLine(m.Value);
    }
}

EDIT #2: per the request in the comments here is a pattern that will not match URLs that contain "..." in the text.

string pattern = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

The only change is the addition of (?!.*\.{3}), a negative look-ahead that lets the match proceed only when the given text is absent from what follows. In this case it checks that "..." does not appear anywhere in the rest of the input line. Since {3} already rejects any run of three or more dots, {3,} is not needed to cover longer runs.

Ahmad Mageed 2010-03-04 22:08:40
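A small sketch of how the negative look-ahead from EDIT #2 behaves. `Original` is the pattern from the answer. `Scoped` is a hypothetical variant of mine, not part of the answer, that limits the "..." check to the URL itself by replacing `.*` with `[^<"]*`. This may matter if unrelated text later on the same line contains "...". The class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class EllipsisDemo
{
    // Pattern from the answer: rejects if "..." appears anywhere later on the line.
    public const string Original = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

    // Assumed variant: only looks for "..." before the next < or ".
    public const string Scoped = @"(?<=href=""|>)http://(?![^<""]*\.{3}).+?(?=<|"")";

    // Returns the first matched URL, or an empty string when nothing matches.
    public static string FirstUrl(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        // Full link: matched by both patterns.
        Console.WriteLine(FirstUrl(">http://yahoo.com/media/news.html<", Original));

        // Truncated link containing "...": rejected by both patterns.
        Console.WriteLine(FirstUrl(">http://yahoo.com/media/...<", Original) == string.Empty);

        // "..." appears after the link: Original rejects it, Scoped still matches.
        string trailing = ">http://yahoo.com/a< wait...";
        Console.WriteLine(FirstUrl(trailing, Original) == string.Empty);
        Console.WriteLine(FirstUrl(trailing, Scoped));
    }
}
```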

Thanks this works, but is it possible to just get the links? I am getting the links like ">coollink<"

Joan Venge 2010-03-04 22:29:45

@Joan: please show me what your input looks like. Is it different from the 2nd item I use in the `inputs` array?

Ahmad Mageed 2010-03-04 22:39:39

Thanks Ahmad. My input is exactly the same, but I get match results like this: >http://yahoo.com/media/news.html< So it includes the >< too.

Joan Venge 2010-03-04 22:42:06

Btw, I am getting the results using match.Value. Should I use the groups like you did?

Joan Venge 2010-03-04 22:43:14

@Joan: yes, the group is crucial to the way I set up the regex. Using match.Value returns the entire match, which is why you see the >< characters.

Ahmad Mageed 2010-03-04 22:45:49
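To make the difference concrete, here is a short sketch comparing `Match.Value` with the named group, using the first pattern from the answer and its second sample input. The class name is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupVsValueDemo
{
    // First pattern from the answer: delimiters are matched, the URL is captured in "Url".
    public const string Pattern = @"(?:href=""|>)(?<Url>http://.+?)[<""]";

    static void Main()
    {
        Match m = Regex.Match(">http://yahoo.com/media/news.html<", Pattern);

        // The whole match, including the > and < delimiters.
        Console.WriteLine(m.Value);               // >http://yahoo.com/media/news.html<

        // Only the part captured by the named group.
        Console.WriteLine(m.Groups["Url"].Value); // http://yahoo.com/media/news.html
    }
}
```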

Thanks Ahmad, now it works brilliantly.

Joan Venge 2010-03-04 22:48:16

@Joan: great! I updated my post with an alternate approach that allows you to use `Match.Value` if you prefer that. For large documents the first approach will probably perform better.

Ahmad Mageed 2010-03-04 22:50:36

Thanks Ahmad, so using groups is faster than directly using Value?

Joan Venge 2010-03-04 22:52:10

@Joan: no, that's not what I meant. Patterns with look-arounds usually perform slower than patterns without them. I wouldn't avoid them without testing, though, since they are extremely useful and make pattern matching easier at times.

Ahmad Mageed 2010-03-04 22:54:18

Thanks Ahmad, learnt a few things now :O

Joan Venge 2010-03-04 22:55:52

Actually, I finished the parsing but have one more question, if you don't mind. If there is a substring that, when present, means you don't want the match, is that doable with regex? Or should I just filter those results out using standard logic like I do now? Basically I am trying to avoid links with "..." in them.

Joan Venge 2010-03-04 22:59:51

Also Ahmad, it looks like there is a problem. I checked some of the matches and they only contain one form or the other. For instance, if the website codes a link as an actual link, it will have both the >< version and the "" version, but I only get one of them, which in my case seems to be the >< one rather than the "" one that contains the full link.

Joan Venge 2010-03-04 23:35:10

@Joan I'll have to get back to you on the earlier question. As for the last one, are you using `Regex.Matches`? Can you post a representative sample of the input so I'm sure I understand the issue?

Ahmad Mageed 2010-03-04 23:38:46

@Joan: please see my 2nd edit. It handles the "..." scenario.

Ahmad Mageed 2010-03-05 01:17:35

Thanks Ahmad, I have cropped the website stream that I am using. It shows all cases, which are links in ""s and in ><s. It also shows the ... case. Basically, if a link is inside href="", then it's formatted as a link inside the forum, which will have both the display text (the link between ><) and the link itself. But if they just code it as text, then it will appear fine and there will be only one link, which is inside ><. Does that make sense? I only need the full link, which means always taking the href="link" one and, if that's not available, taking the link between >link<.

Joan Venge 2010-03-05 01:54:32
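The thread does not include an answer to this last request, so the following is only a rough sketch under assumptions. It assumes links use double-quoted href attributes and start with http://. The sample input is invented, since the posted sample is not reproduced here, and the class and method names are hypothetical. The idea is to let the first alternative consume the whole href="..." plus its display text, so the display text is never matched on its own. The second alternative then only picks up links that appear as plain text between > and <. .NET allows both alternatives to share the group name `Url`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PreferHrefDemo
{
    // Alternative 1: href="URL", the rest of the tag, and the display text up to the next <.
    // Alternative 2: a bare URL sitting between > and <.
    public const string Pattern =
        @"href=""(?<Url>http://[^""]+)""[^>]*>[^<]*<|>(?<Url>http://[^<]+)<";

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // Whichever alternative matched, the URL is in the "Url" group.
            urls.Add(m.Groups["Url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Invented sample: one real link with truncated display text, one plain-text link.
        string html = "<a href=\"http://yahoo.com/media/news.html\">http://yahoo.com/me...</a> "
                    + "<td>http://example.com/page</td>";
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
        // Prints:
        // http://yahoo.com/media/news.html
        // http://example.com/page
    }
}
```

Because the truncated display text is swallowed by the first alternative, the separate "..." filter may not be needed for links that have an href. For anything beyond simple cases, an HTML parser is likely a safer choice than a regex.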

http://codepaste.net/zmpaz6

Joan Venge 2010-03-05 01:56:44

Hi Ahmad, did the sample help? If it's not doable, don't worry.

Joan Venge 2010-03-05 21:53:03
