Hi,

I am no regex expert. I need to extract certain numbers from an HTML table.
An example:

<td>13</td><td>
  </td><td align="right">29.543</td>
  <td align="right">1.777</td>
  <td align="right">2.588</td>
</tr><tr><td><a href="player.php?p=84668" >Caterdamus</a></td>
  <td>7</td><td>
  Meister</td><td align="right">9.874</td>
  <td align="right">1.716</td>
  <td align="right">5.791</td>
</tr><tr><td><a href="player.php?p=87216" >grappa</a></td>
  <td>2</td><td>
  </td><td align="right">1.044</td>
  <td align="right">21</td>
  <td align="right">146</td>
</tr></table>

The pattern looks like this:

<td>13</td><td>
<td>7</td><td>
<td>2</td><td>

How do I extract the numbers from the text and store them in a variable? Hint: the numbers are positive integers.

Thanks :)

+2  A: 
<td>(\d+)</td>

should do the job.

eWolf
Don't forget to escape the forward-slash...
Tenner
+3  A: 

I don't know Java regex exactly, but I'd suggest something like

/<td>(\d+)<\/td><td>/

since syntax of regex is quite similar in multiple languages.

Explanations

  • ( ... ) captures the text matched inside the parentheses as a group that you can read back after the match
  • \d represents a digit
  • + stands for one or more occurrences of the token on its left side

Since you only have positive integers, you don't have to care about signs or decimal points.
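In Java this could look roughly like the sketch below. It is only an illustration: the class and method names (TdNumberExtractor, extractNumbers) are made up, and it assumes the HTML is already available as a String. Note that in a Java string literal the backslash has to be doubled, while the forward slash needs no escaping because Java has no /.../ regex delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class TdNumberExtractor {
    // Same pattern as above, written as a Java string literal.
    private static final Pattern TD_NUMBER = Pattern.compile("<td>(\\d+)</td><td>");

    /** Returns the content of capture group 1 for every match, in document order. */
    public static List<Integer> extractNumbers(String html) {
        List<Integer> numbers = new ArrayList<>();
        Matcher matcher = TD_NUMBER.matcher(html);
        while (matcher.find()) {
            // group(1) is the text matched by (\d+);
            // Integer.parseInt would throw on values beyond the int range.
            numbers.add(Integer.parseInt(matcher.group(1)));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String html = "<td>13</td><td>\n"
                + "  </td><td align=\"right\">29.543</td>\n"
                + "</tr><tr><td><a href=\"player.php?p=84668\" >Caterdamus</a></td>\n"
                + "  <td>7</td><td>\n"
                + "  Meister</td><td align=\"right\">9.874</td>\n";
        System.out.println(extractNumbers(html)); // prints [13, 7]
    }
}
```

The cells with align="right" are skipped because the pattern requires a literal <td> without attributes.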

Etan
To be even safer, you could also allow whitespace on both sides and anchor the pattern to the line, which gives something like /^\s*<td>(\d+)<\/td><td>\s*$/
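A possible Java version of this anchored pattern is sketched below. It assumes the whole document is matched in one go, so Pattern.MULTILINE is needed to make ^ and $ apply to each line rather than to the whole input. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class AnchoredTdExtractor {
    // MULTILINE makes ^ and $ match at line boundaries inside the input.
    private static final Pattern TD_LINE =
            Pattern.compile("^\\s*<td>(\\d+)</td><td>\\s*$", Pattern.MULTILINE);

    /** Collects numbers only from lines that consist of <td>N</td><td> plus optional whitespace. */
    public static List<Integer> extractNumbers(String html) {
        List<Integer> numbers = new ArrayList<>();
        Matcher matcher = TD_LINE.matcher(html);
        while (matcher.find()) {
            numbers.add(Integer.parseInt(matcher.group(1)));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String html = "<td>13</td><td>\n"
                + "  </td><td align=\"right\">29.543</td>\n"
                + "  <td>7</td><td>\n"
                + "  Meister</td><td align=\"right\">9.874</td>\n";
        System.out.println(extractNumbers(html)); // prints [13, 7]
    }
}
```

The anchors make the match stricter: a line such as <td>5</td><td>text is ignored, at the price of depending on where the line breaks happen to fall in the HTML.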
Peter Kofler
+8  A: 

I wouldn't use regular expressions to parse HTML or XML. Instead, I would load the document into an HTML DOM parser - you can find several open source ones here. I can't vouch for any of these - I've never worked with anything other than XML in Java.
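To give an idea of the DOM approach, here is a rough sketch that uses only the XML parser and XPath support shipped with the JDK, not a real HTML parser. It therefore assumes the table is well-formed XML (every tag closed, attributes quoted), which real-world HTML often is not. It also assumes, based on the sample rows, that the wanted number is always the second cell of a row. The class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; works only on well-formed (XHTML-like) markup.
public class TableCellReader {
    /** Reads the second <td> of every <tr> and parses its text as an integer. */
    public static List<Integer> secondColumn(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // td[2] selects the second td child of each tr (XPath positions start at 1).
            NodeList cells = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//tr/td[2]", doc, XPathConstants.NODESET);
            List<Integer> numbers = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                numbers.add(Integer.parseInt(cells.item(i).getTextContent().trim()));
            }
            return numbers;
        } catch (Exception e) {
            // Malformed markup or a non-numeric cell ends up here.
            throw new IllegalArgumentException("Could not read table", e);
        }
    }

    public static void main(String[] args) {
        String table = "<table>"
                + "<tr><td><a href=\"player.php?p=84668\" >Caterdamus</a></td>"
                + "<td>7</td><td>Meister</td><td align=\"right\">9.874</td></tr>"
                + "<tr><td><a href=\"player.php?p=87216\" >grappa</a></td>"
                + "<td>2</td><td></td><td align=\"right\">1.044</td></tr>"
                + "</table>";
        System.out.println(secondColumn(table)); // prints [7, 2]
    }
}
```

Because the cell is selected by its position in the row, adding attributes such as align to the td would not break the extraction.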

Thomas Owens
This has the advantage of being robust against changes in the cells' attributes.
Ewan Todd
This game never seems to get old… Q: "How can I do HTML with regex" - A: "Don't". Amazing. :)
Tomalak
Of course an HTML parser is the more elegant way, and also the easier way if you want to process a lot of data from the HTML document (especially cool if you can use XPath). But for just a few numbers, it is a bit heavyweight.
eWolf
For quick scraping of just the numbers, I would always go with the regex because it's less code and less hassle. Sure, it's not robust, but it's much faster to implement for simple things.
Peter Kofler