+2  Q: 

regex expression

I am trying to get all the text between the following tags, and it is just not working:

If Not String.IsNullOrEmpty(_html) Then
    Dim regex As Regex = New Regex( _
        ".*<entry(?<link>.+)</entry>", _
        RegexOptions.IgnoreCase _
        Or RegexOptions.CultureInvariant _
        Or RegexOptions.Multiline _
        )

    Dim ms As MatchCollection = regex.Matches(_html)
    Dim url As String = String.Empty
    For Each m As Match In ms
        url = m.Groups("link").Value
        urls.Add(url)
    Next
    Return urls

I have already written my fetch functions to get the HTML as a string. I was looking at an example for the HTML Agility Pack, but I don't have files saved as HTML documents:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");
A: 

Obligatory "don't use regex to parse HTML" warning:

Using regex to parse HTML has been covered at length on SO. Please read the following post:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Would it be possible to convert your HTML to XHTML and parse it using XPath?

A tool like HTML Tidy or SGML can do this conversion. You could then use an XPath expression to extract the desired data: //entry/link
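
As a rough sketch of the XPath approach: assuming the markup has already been converted to well-formed XML with no DOCTYPE and no default namespace (real XHTML usually declares one, in which case an XmlNamespaceManager would be needed), the standard System.Xml classes can evaluate //entry/link. The module and function names below are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module XPathSketch
    ' Returns the text of every <link> element that is a direct child of an <entry>.
    ' Assumption: xhtml is well-formed XML without a default namespace.
    Function ExtractLinks(ByVal xhtml As String) As List(Of String)
        Dim links As New List(Of String)()
        Dim doc As New XmlDocument()
        doc.LoadXml(xhtml) ' throws XmlException if the markup is not well-formed

        ' SelectNodes returns an empty list (not Nothing) when nothing matches.
        For Each node As XmlNode In doc.SelectNodes("//entry/link")
            links.Add(node.InnerText)
        Next
        Return links
    End Function

    Sub Main()
        ' Hypothetical sample input, not taken from the question.
        Dim sample As String = _
            "<feed><entry><link>http://example.com/a</link></entry>" & _
            "<entry><link>http://example.com/b</link></entry></feed>"

        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```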

Abe Miessler
+4  A: 

I would use this software to help with your regexes.

Free RegExBuilder software.

Jimmie Clark
+1  A: 

The best way to do this in .NET is via the HTML Agility Pack. Using regular expressions on HTML is usually not a good idea.
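
Since the HTML is already held in a string, here is a minimal sketch of how the Agility Pack could be used without any file on disk. It assumes the HtmlAgilityPack assembly is referenced and that the elements of interest are named entry, as in the question's pattern. The module and function names are invented for illustration.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module AgilityPackSketch
    ' Returns the inner markup of every <entry> element found in the string.
    Function ExtractEntries(ByVal html As String) As List(Of String)
        Dim entries As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return entries

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html) ' parses markup from a string; Load(path) is the file variant

        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//entry")
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                entries.Add(node.InnerHtml)
            Next
        End If
        Return entries
    End Function
End Module
```

Calling ExtractEntries(_html) would then take the place of the regex loop in the question.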

The exceptions are situations where you can make certain assumptions about the structure of the HTML, such as one-off jobs (where you can study the actual input for your program) or when the HTML is generated by a trusted source. For example, can you assume that the HTML is well-formed, or that tags will not be nested beyond a certain depth? (Note that neither of those assumptions by itself is good enough to build an expression that won't fall down given some edge case or other.)

If you meet these criteria, we need to know exactly what assumptions you are allowed to make before we can write an accurate expression.
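
To show how much depends on those assumptions, here is a sketch of an expression that should work if (and only if) entry elements are never nested and every opening tag has a matching close. Two things in the question's pattern work against it. The leading .* and the greedy .+ consume as much as they can, so at most the last entry on a line is captured. RegexOptions.Multiline only changes how ^ and $ behave, while it is RegexOptions.Singleline that lets . match a newline. The module name, function name, and the group name body are illustrative.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexSketch
    ' Assumptions: <entry> elements are not nested and are always closed.
    Private ReadOnly EntryPattern As New Regex( _
        "<entry\b[^>]*>(?<body>.*?)</entry>", _
        RegexOptions.IgnoreCase _
        Or RegexOptions.CultureInvariant _
        Or RegexOptions.Singleline)

    ' Returns the text between each <entry ...> tag and the nearest </entry>.
    Function ExtractEntries(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return results

        ' .*? is lazy, so each match stops at the first closing tag it meets.
        For Each m As Match In EntryPattern.Matches(html)
            results.Add(m.Groups("body").Value)
        Next
        Return results
    End Function

    Sub Main()
        ' Hypothetical sample input spanning several lines.
        Dim sample As String = "<entry>first</entry>" & vbLf & _
                               "<entry id=""2"">second" & vbLf & "line</entry>"
        For Each body As String In ExtractEntries(sample)
            Console.WriteLine("[" & body & "]")
        Next
    End Sub
End Module
```

If entries can nest, or a closing tag can be missing, this pattern will silently return wrong results, which is the kind of edge case a parser handles and a regex does not.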

Joel Coehoorn