Hi, I'm having trouble getting my regex to work. I'm working with C# and ASP.NET. The code I'm using now is below; what I can't get to work is the second regex, which should return whatever is inside href="LINK".

Thanks in advance.

var textBody = "lorem ipsum... <a href='http://www.link.com'>link</a>";


var urlTagPattern = new Regex(@"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>", RegexOptions.IgnoreCase);

// THIS IS THE REGEX
var hrefPattern = new Regex(@"HREF={:q}\>", RegexOptions.IgnoreCase);

var urls = urlTagPattern.Matches(textBody);

foreach (Match url in urls)
{
    var hrefs = hrefPattern.Match(url.ToString());

    litStatus.Text = hrefs.ToString();
}
A: 

The following example searches an input string and prints out all the href="…" values and their locations in the string. It does this by constructing a compiled Regex object and then using a Match object to iterate through all the matches in the string. In this example, the metacharacter \s matches any white-space character, and \S matches any non-white-space character.

' VB

Sub DumpHrefs(inputString As String)
    Dim r As Regex
    Dim m As Match

    r = New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        RegexOptions.IgnoreCase Or RegexOptions.Compiled)

    m = r.Match(inputString)
    While m.Success
        Console.WriteLine("Found href " & m.Groups(1).Value _
            & " at " & m.Groups(1).Index.ToString())
        m = m.NextMatch()
    End While
End Sub

// C#

void DumpHrefs(String inputString)
{
    Regex r;
    Match m;

    r = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);
    for (m = r.Match(inputString); m.Success; m = m.NextMatch())
    {
        Console.WriteLine("Found href " + m.Groups[1] + " at "
            + m.Groups[1].Index);
    }
}
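
The pattern above only treats double-quoted values specially, while the markup in the question uses single quotes. Here is a minimal sketch of one way it might be extended, by adding a single-quote alternative and otherwise keeping the pattern as it is. The names `HrefDump` and `FindHrefs` are made up for illustration and do not come from the original sample.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class HrefDump
{
    // Same pattern as the sample, with one added alternative for '...' values.
    // The unquoted fallback (\S+) is kept exactly as in the original.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|'(?<1>[^']*)'|(?<1>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Collects the value of group 1 (the URL without quotes) for every match.
    public static List<string> FindHrefs(string input)
    {
        var found = new List<string>();
        for (Match m = HrefPattern.Match(input); m.Success; m = m.NextMatch())
        {
            // Groups[1].Value is the captured URL; m.Value would be the whole match.
            found.Add(m.Groups[1].Value);
        }
        return found;
    }
}
```

For the string in the question, `HrefDump.FindHrefs(textBody)` should yield a single entry, `http://www.link.com`.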

Romina
That doesn't work for me; it returns <a href="my link">link
Dejan.S
And that was from <a href="http://www.link.com">link</a>, so all it did was remove the </a>. I need to get just http://www.link.com
Dejan.S
A: 

The second regular expression should be:

href=['"](?<link>[^'"]*)
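
The URL itself sits in the named group `link`, so it has to be read from `Groups["link"]` rather than from the whole match. A minimal sketch of that, assuming the pattern above is used unchanged; `HrefGroupDemo` and `ExtractLink` are hypothetical names chosen for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class HrefGroupDemo
{
    // The pattern from this answer, written as a C# verbatim string
    // (a doubled "" stands for one literal double quote).
    private static readonly Regex HrefPattern =
        new Regex(@"href=['""](?<link>[^'""]*)", RegexOptions.IgnoreCase);

    // Returns the captured URL, or an empty string when no quoted href is found.
    public static string ExtractLink(string anchorTag)
    {
        Match m = HrefPattern.Match(anchorTag);
        // m.Value (or m.ToString()) is the entire match, including the leading href='.
        // m.Groups["link"].Value is only the text captured inside the named group.
        return m.Success ? m.Groups["link"].Value : string.Empty;
    }
}
```

With the tag from the question, `HrefGroupDemo.ExtractLink("<a href='http://www.link.com'>link</a>")` should return `http://www.link.com`. The first pattern in the question already defines a `url` group, so `url.Groups["url"].Value` inside the existing loop can be read the same way.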
Mijalko
That's closer, but with it I get href='http://www.link.com, Mijalko
Dejan.S
href='http://www.link.com
Dejan.S
Well, you get the idea; it is supposed to be http://www.
Dejan.S
+4  A: 

Welcome to your daily installment of Don't Use Regex To Parse HTML. In this edition of Don't Use Regex To Parse HTML, we'll be reminding you not to use regex to parse HTML because HTML cannot reliably be parsed by a regex and dozens of valid HTML constructs will break the naïve regex proposed. We won't be mentioning all the additional invalid ones in common use on the web in Don't Use Regex To Parse HTML today.

Also in Don't Use Regex To Parse HTML, we'll be linking to the Html Agility Pack, a .NET library you can use to parse HTML properly and subsequently extract link URLs reliably in just a couple of lines of code (a very similar example being present on that page).
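
For reference, a rough sketch of what that might look like, assuming the HtmlAgilityPack NuGet package is referenced in the project. `LinkExtractor` and `GetLinkUrls` are hypothetical names for this example, not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

// Hypothetical helper; the class and method names are illustrative only.
public static class LinkExtractor
{
    // Parses the markup and returns the href value of every <a> that has one.
    public static List<string> GetLinkUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }
}
```

Quote style, attribute order and extra attributes are handled by the parser here rather than by the pattern.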

We hope you have enjoyed today's Don't Use Regex To Parse HTML, and look forward to seeing you again tomorrow for another exciting edition of Don't Use Regex To Parse HTML, when someone posts another question about using regex to parse HTML. But that's all from Don't Use Regex To Parse HTML for now. Bye!

bobince
Was this a canned response you already used somewhere else or you wrote it explicitly? (+1)
Paolo Tedesco
Is it alright to rate answers +1 based on humor (assuming they're correct?) If not, consider me a rebel against the system!
Duroth
solved my issue with a regex
Dejan.S