I have a string and need a RegEx Pattern for this, so I can extract only the date and the numbers from the tags:

Dim a as string= "<table id=table-1 > <tbody> <td align=right> <h2 id=date-one>12.09.2010</h2> </td> </tr> </tbody></table> <table id=table-2 border=0 cellspacing=0 cellpadding=0><tbody><tr><td align=center valign=middle><h3 id=nb-a>01</h3></td><td align=center valign=middle><h3 id=nb-a>>02</h3></td><td align=center valign=middle><h3 id=nb-a>03</h3></td></tr></tbody></table>"

The string will contain more than one block of similar data, so I need to extract the values in a loop. Thank you! Adrian

+1  A: 

An HTML parser (e.g., the HtmlAgilityPack) will be simpler in the long term, but as a guide to Regex, here's how to do it for your case:

  ' Requires: Imports System.Text.RegularExpressions
  Dim pattern As String = "" ' what goes here?
  ' The string is split with "& _" for viewing;
  ' the result is a single line of HTML
  Dim a As String = _
     "<table id=table-1 > <tbody> <td align=right> " & _
     "<h2 id=date-one>12.09.2010</h2> </td> </tr> </tbody></table> " & _
     "<table id=table-2 border=0 cellspacing=0 cellpadding=0>" & _
     "<tbody><tr><td align=center valign=middle><h3 id=nb-a>01" & _
     "</h3></td><td align=center valign=middle><h3 id=nb-a>>02" & _
     "</h3></td><td align=center valign=middle><h3 id=nb-a>03</h3>" & _
     "</td></tr></tbody></table>"
  ' end of the a variable declaration
  For Each match As Match In Regex.Matches(a, pattern)
     Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index)
  Next

As a naive first attempt, match any run of digits:

Dim pattern As String = "[\d]+"    ' \d matches any digit,
                                   ' + specifies one or more

This of course matches far too many items (including the digits in attributes such as id=table-1 and border=0) and splits the date into three separate matches instead of one. In your case each value is the text of an element, so it is preceded by a '>' and followed by a '<'.

Dim pattern As String = ">[.\d]+<" ' allow the '.' as well as digits
                                   ' capture any string that starts with '>'
                                   ' followed by one or more digits or '.'
                                   ' ending with '<'

This unfortunately includes the '>' and the '<' in your matches. To exclude them, we need positive lookbehind and positive lookahead:

Dim pattern As String = "(?<=>)[.\d]+(?=<)" 
                                   ' (?<=regex) is positive lookbehind for regex
                                   ' (?=regex) is positive lookahead for regex
                                   ' capture any string after '>' 
                                   ' made up of one or more digits or '.'
                                   ' before '<'
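
To see the difference the lookarounds make, here is a minimal sketch that runs both patterns side by side. The shortened html string and the module name are made up for illustration; only the two patterns come from the steps above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LookaroundDemo
    Sub Main()
        ' Shortened, hypothetical sample in the same shape as the question's HTML
        Dim html As String = "<h2 id=date-one>12.09.2010</h2><h3 id=nb-a>01</h3><h3 id=nb-a>02</h3>"

        ' Without lookarounds, the delimiters are part of each match
        For Each m As Match In Regex.Matches(html, ">[.\d]+<")
            Console.WriteLine("with delimiters:    {0}", m.Value)
        Next

        ' With lookarounds, only the text between '>' and '<' is returned
        For Each m As Match In Regex.Matches(html, "(?<=>)[.\d]+(?=<)")
            Console.WriteLine("without delimiters: {0}", m.Value)
        Next
    End Sub
End Module
```

The first loop should report values such as ">12.09.2010<", while the second reports "12.09.2010" on its own, because lookarounds check the surrounding characters without consuming them.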

Now things are looking good because we're only matching the date and the three numbers! However, what if the date were separated by '-' or '/' instead of '.'?

Dim pattern As String = "(?<=>)[-/.\d]+(?=<)" 
                                   ' add '-' and '/' to date separators

Easily handled. But what if there are spaces before or after the number or date within the element text?

Dim pattern As String = "(?<=>\s*)[-/.\d]+(?=\s*<)"
                                   ' the lookbehind regex ">\s*" matches
                                   '    the char '>' 
                                   '    followed by 0 or more whitespace chars
                                   ' the lookahead regex "\s*<" matches
                                   '    0 or more whitespace chars
                                   '    followed by the char '<' 
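
As a rough check of the whitespace handling, the sketch below compares the previous pattern with the whitespace-tolerant one on a padded sample. The html string and the names strict and tolerant are invented for this example. Note that the variable-length lookbehind ">\s*" works in .NET, but many other regex engines only accept fixed-width lookbehind, so this pattern may not port as-is.

```vb
Imports System
Imports System.Text.RegularExpressions

Module WhitespaceDemo
    Sub Main()
        ' Hypothetical sample: values padded with spaces inside the elements
        Dim html As String = "<h2 id=date-one> 12/09/2010 </h2><h3 id=nb-a>01 </h3><h3 id=nb-a> 02</h3>"

        Dim strict As String = "(?<=>)[-/.\d]+(?=<)"
        Dim tolerant As String = "(?<=>\s*)[-/.\d]+(?=\s*<)"

        ' The strict pattern needs '>' and '<' to touch the value directly
        Console.WriteLine("strict:   {0} matches", Regex.Matches(html, strict).Count)
        Console.WriteLine("tolerant: {0} matches", Regex.Matches(html, tolerant).Count)

        ' The padding is checked by the lookarounds but is not part of m.Value
        For Each m As Match In Regex.Matches(html, tolerant)
            Console.WriteLine("[{0}]", m.Value)
        Next
    End Sub
End Module
```

On this sample the strict pattern should find nothing, while the tolerant one finds all three values without the surrounding spaces, so no extra Trim call is needed.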

Not too bad. The only problem is that this method still takes more effort and breaks more easily than using an HTML parser to loop through all the elements, check whether each element's text is a valid number or date, and add the matching text to a list.

Consider, for example, altering the Regex method to handle currencies (where "$100.03.45" should not match), or commas in numbers, or ensuring that dates have exactly three groups, each with one, two, or four digits, where only one group can have four, and one of the two-digit groups cannot exceed 12, etc. Insanity lies down that road.
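
If you do stay with Regex for extraction, one way to avoid encoding those rules in the pattern is to let the framework's own parsers validate each candidate. The following is only a sketch under assumptions: the sample html is invented, and the day.month.year format "dd.MM.yyyy" is a guess about the data (12.09.2010 could equally be month.day.year).

```vb
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module ValidateDemo
    Sub Main()
        ' Hypothetical sample; "99.99.2010" looks like a date to the regex but is not one
        Dim html As String = "<h2>12.09.2010</h2><h3>01</h3><h3>99.99.2010</h3>"

        For Each m As Match In Regex.Matches(html, "(?<=>\s*)[-/.\d]+(?=\s*<)")
            Dim parsedDate As DateTime
            Dim parsedNumber As Integer

            ' Assumption: dates are day.month.year; adjust the format string if not
            If DateTime.TryParseExact(m.Value, "dd.MM.yyyy", CultureInfo.InvariantCulture,
                                      DateTimeStyles.None, parsedDate) Then
                Console.WriteLine("date:     {0}", parsedDate.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
            ElseIf Integer.TryParse(m.Value, NumberStyles.None,
                                    CultureInfo.InvariantCulture, parsedNumber) Then
                ' NumberStyles.None accepts digits only (no sign, separators, or spaces)
                Console.WriteLine("number:   {0}", parsedNumber)
            Else
                Console.WriteLine("rejected: {0}", m.Value)
            End If
        Next
    End Sub
End Module
```

The regex stays loose and only proposes candidates; DateTime.TryParseExact and Integer.TryParse decide what is actually a date or a number, so an impossible value like "99.99.2010" is rejected without any change to the pattern.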

jball
+1  A: 

Building on the example posted by jball: I felt it would be easier to use capture groups than to deal with lookbehind and lookahead. Here, I used parentheses to take advantage of Match.Groups.

m.Groups(0).Value = ">xxxxxx<"

m.Groups(1).Value = ">"

m.Groups(2).Value = "xxxxxx"

m.Groups(3).Value = "<"

   Dim input As String = "<table id=table-1 > <tbody> <td align=right> <h2 id=date-one>12.09.2010</h2> </td> </tr> </tbody></table> <table id=table-2 border=0 cellspacing=0 cellpadding=0><tbody><tr><td align=center valign=middle><h3 id=nb-a>01</h3></td><td align=center valign=middle><h3 id=nb-a>>02</h3></td><td align=center valign=middle><h3 id=nb-a>03</h3></td></tr></tbody></table>"

   ' Group 2 holds the text between '>' and '<'
   Dim regex1 As Regex = New Regex("(>)([\d.]+)(<)")
   Dim matches As MatchCollection = regex1.Matches(input)

   For Each m As Match In matches
       Console.WriteLine(String.Format("{1}{0}", m.Groups(2).Value, Environment.NewLine))
   Next
ntsdev1