Hello, I need to extract some text from a string in Visual Basic, like this:

<div id="div">
<h2 id="id-date">09.09.2010</h2> <!-- extract the date here -->
<h3 id="nr">000</h3> <!-- extract the number here -->
</div>

I need to extract the date and the number, both from within the div. Also, this will be in a loop, meaning there are more div blocks that need to be parsed. Thank you! Adrian

A: 

Try this, taken from this link:

// C#; requires: using System.Text.RegularExpressions;
private string StripHTML(string htmlString)
{
    // This pattern matches everything found inside html tags.
    // (.|\n) -> look for any character or a newline
    // *?     -> 0 or more occurrences, as a non-greedy search, meaning
    // that the match stops at the first available '>' it sees, not at the last one
    // (if it stopped at the last one we could overlook
    // nested HTML tags inside a bigger HTML tag).
    // Thanks to Oisin and Hugh Brown for helping on this one...

    string pattern = @"<(.|\n)*?>";

    // Replace every tag with an empty string, leaving only the text between tags.
    return Regex.Replace(htmlString, pattern, string.Empty);
}
Sachin Shanbhag
"inside html tags" is not the same as "between html tags"
adf88
@adf88 - This is a function to which you can pass your HTML string, and it returns the value with the HTML tags removed. So the result depends on what the user passes to this function. In this case, the user needs to pass '<h3 id="nr">000</h3>' as input and it will return 000 as output. Why is this wrong?
Sachin Shanbhag
What if those tags are inside a <div id="div">...</div>?
Adrian
@Adrian - I agree that this is not a method for complete HTML parsing, but it suits the given question and gets the answer the user requires, right?
Sachin Shanbhag
+1  A: 

You should not parse HTML with regular expressions because HTML is not a regular language, as stated by Daniel Vandersluis. You can use the HTML Agility Pack instead.
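
For illustration, here is a minimal sketch of how the extraction might look with that library. It assumes the third-party HtmlAgilityPack package is referenced in the project and that the markup has the shape shown in the question; the module name is made up for this example.

```vb
Imports System
Imports HtmlAgilityPack ' third-party package, assumed to be referenced

Module AgilityPackSketch
    Sub Main()
        ' Sample input copied from the question.
        Dim html As String = "<div id=""div"">" & Environment.NewLine & _
                             "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
                             "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
                             "</div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim divs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div")
        If divs Is Nothing Then
            Console.WriteLine("No div blocks found.")
            Return
        End If

        For Each div As HtmlNode In divs
            ' The leading "." keeps each XPath query relative to the current div.
            Dim dateNode As HtmlNode = div.SelectSingleNode(".//h2[@id='id-date']")
            Dim numberNode As HtmlNode = div.SelectSingleNode(".//h3[@id='nr']")

            If dateNode IsNot Nothing AndAlso numberNode IsNot Nothing Then
                Console.WriteLine("Date: " & dateNode.InnerText.Trim())
                Console.WriteLine("Number: " & numberNode.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Because the loop visits every div in the document, the same code should cover input that contains several blocks.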

npinti
Can this library extract information from more than one tag nested within another tag?
Adrian
Apparently yes: http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=215674. I have never used this package; however, it is highly recommended here on SO.
npinti
+1  A: 

Why not just use the Html Agility Pack?

Giorgi
+1  A: 

Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect then using a regex is possible.

If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless because my answer makes a few assumptions. Some questions to consider:

  • Will the HTML be on one line or multiple lines, separated by newline characters?
  • Will the HTML always be in the form of <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be h1-h6 tags?
  • On top of the hN tags, will the date and number always be between the tags with id-date and nr values for the id attribute?

Depending on the answers to these questions the pattern can change. The following code assumes each HTML fragment follows the structure you shared, that it will have an h2 and h3 with date and number, respectively, and that each tag will be on a new line. If you feed it different input it will likely fail to match until you adjust the pattern to your input's structure.

' Requires: Imports System.Text.RegularExpressions
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
               "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
               "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
               "</div>"

Dim pattern As String = "<div[^>]+>.*?" & _
                 "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                 "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)

If m.Success Then
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Else
    Console.WriteLine("No match!")
End If

The pattern can be on one line but I broke it up for clarity. RegexOptions.Singleline makes the . metacharacter match every character, including \n, so that .*? can span the line breaks between the tags.
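
As a quick illustration of what that option changes, here is a small sketch. The module and function names are hypothetical, and the input is reduced to a bare div whose tags sit on different lines.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SinglelineSketch
    ' Hypothetical helper: reports whether "<div>.*?</div>" matches a div
    ' whose opening and closing tags are separated by a line feed.
    Function MatchesAcrossLines(options As RegexOptions) As Boolean
        Dim text As String = "<div>" & vbLf & "</div>"
        Return Regex.IsMatch(text, "<div>.*?</div>", options)
    End Function

    Sub Main()
        ' By default "." matches any character except the line feed,
        ' so the pattern cannot get from <div> to </div>.
        Console.WriteLine(MatchesAcrossLines(RegexOptions.None))       ' False
        ' With Singleline, "." also matches the line feed.
        Console.WriteLine(MatchesAcrossLines(RegexOptions.Singleline)) ' True
    End Sub
End Module
```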

You also said:

Also, this will be in a loop, meaning there are more div blocks that need to be parsed.

Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter you'll want to use Regex.Matches and treat each Match result similarly to the above piece of code.


EDIT: here is some sample code to demonstrate parsing multiple occurrences.

Dim input As String = "<div id=""div"">" & Environment.NewLine & _
               "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
               "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
               "</div>" & _
               "<div id=""div"">" & Environment.NewLine & _
               "<h2 id=""id-date"">09.14.2010</h2>" & Environment.NewLine & _
               "<h3 id=""nr"">123</h3>" & Environment.NewLine & _
               "</div>"

Dim pattern As String = "<div[^>]+>.*?" & _
                 "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                 "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Next
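
One caveat about the date handling in the samples: DateTime.Parse interprets the text using the current culture, so a value such as 09.14.2010 may fail to parse on a machine whose culture expects the day first. A possible alternative is sketched below. The "MM.dd.yyyy" format is an assumption based on the second sample date (swap it for "dd.MM.yyyy" if the day comes first in your data), and the helper name is made up.

```vb
Imports System
Imports System.Globalization

Module DateParsingSketch
    ' Hypothetical helper. The "MM.dd.yyyy" format is an assumption,
    ' not something the question confirms.
    Function ParseSampleDate(value As String) As DateTime
        ' ParseExact with the invariant culture does not depend on the machine's regional settings.
        Return DateTime.ParseExact(value, "MM.dd.yyyy", CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        Dim parsed As DateTime = ParseSampleDate("09.14.2010")
        Console.WriteLine(parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)) ' 2010-09-14
    End Sub
End Module
```

ParseExact throws a FormatException when the text does not match the format; DateTime.TryParseExact is the non-throwing counterpart if bad input is expected.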
Ahmad Mageed
Yes, it will be on separate lines; yes, it will contain only div, h2, and h3; yes, it will be formatted exactly like that; and yes, this is one large string that contains multiple blocks of similar information.
Adrian
@Adrian: I've updated my answer to show how to handle multiple occurrences using the `Regex.Matches` method and a `For Each` loop.
Ahmad Mageed