ansaurus

Question

RegEx : Extract Number out of Source Code

Answer 1

+2 A:

<td>(\d+)</td>

should do the job.
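For reference, a minimal Java sketch of how that pattern could be applied. The class name, method name and sample row are made up for illustration, and it assumes the cells look exactly like <td>123</td> with no attributes or inner whitespace. Note that in a Java string literal the backslash in \d has to be doubled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TdNumberFinder {
    // The backslash in \d is doubled because this is a Java string literal.
    private static final Pattern CELL = Pattern.compile("<td>(\\d+)</td>");

    // Returns every number found between <td> and </td>, in document order.
    static List<Integer> findNumbers(String html) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            // group(1) is the text captured by the parentheses.
            numbers.add(Integer.parseInt(m.group(1)));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Hypothetical input row, made up for this example.
        String html = "<tr><td>Apples</td><td>12</td><td>7</td></tr>";
        System.out.println(findNumbers(html)); // prints [12, 7]
    }
}
```

Integer.parseInt throws a NumberFormatException for digit runs that do not fit into an int, so very long numbers would need Long or BigInteger instead.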

eWolf 2009-10-21 12:39:19

Don't forget to escape the forward-slash...

Tenner 2009-10-21 12:50:30

Answer 2

+3 A:

I don't know Java regex exactly, but I'd suggest something like

/<td>(\d+)<\/td><td>/

since regex syntax is quite similar across languages.

Explanations

( ... ) captures the content inside it into the regex's return variables (capture groups)
\d represents a digit
+ stands for one or more occurrences of the token on its left side

Since you only use positive integers, you don't have to care about signs or decimal points.
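To make the explanation concrete, here is a small Java sketch of the same pattern. The /.../ delimiters are dropped because Java does not use them, and the class and method names are made up for illustration. It also shows two consequences of the pattern as written: signs and decimal points do not match, and the trailing <td> means a cell only matches when another cell follows it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Same pattern as above, minus the /.../ delimiters; \d is written \\d in a Java string.
    private static final Pattern CELL = Pattern.compile("<td>(\\d+)</td><td>");

    // Returns the digits captured by group 1, or null if nothing matches.
    static String firstNumber(String html) {
        Matcher m = CELL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, made up for this example.
        System.out.println(firstNumber("<td>42</td><td>next</td>"));   // prints 42
        System.out.println(firstNumber("<td>-4.2</td><td>next</td>")); // prints null (sign and decimal point)
        System.out.println(firstNumber("<td>42</td></tr>"));           // prints null (no following <td>)
    }
}
```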

Etan 2009-10-21 12:39:41

To be even safer, you could allow whitespace on both sides and get something like /^\s*<td>(\d+)<\/td><td>\s*$/
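A rough Java sketch of that anchored variant, assuming the source has one cell pair per line. Pattern.MULTILINE is used so that ^ and $ apply to each line instead of the whole input; the class name, method name and sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredCellFinder {
    // MULTILINE makes ^ and $ match at line boundaries, not only at the ends of the input.
    private static final Pattern LINE =
            Pattern.compile("^\\s*<td>(\\d+)</td><td>\\s*$", Pattern.MULTILINE);

    // Returns the number from every line that consists only of <td>N</td><td> plus whitespace.
    static List<Integer> findNumbers(String source) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = LINE.matcher(source);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group(1)));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Hypothetical three-line input: the middle line has no digits and is skipped.
        String source = "  <td>15</td><td>  \n<td>x</td><td>\n\t<td>8</td><td>";
        System.out.println(findNumbers(source)); // prints [15, 8]
    }
}
```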

Peter Kofler 2009-10-22 07:26:52

Answer 3

+8 A:

I wouldn't use regular expressions to parse HTML or XML. Instead, I would load the document into an HTML DOM parser - you can find several open source ones here. I can't vouch for any of these - I've never worked with anything other than XML in Java.
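As a rough illustration of the parser approach, here is a sketch that uses the XML DOM parser shipped with the JDK. This is an assumption made for the example: it only works when the markup is well-formed XML (for example XHTML), and real-world HTML would need one of the dedicated HTML parsers instead. The class name, method name and sample table are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdNumberDom {
    // Parses well-formed markup and returns the numeric content of every <td>.
    static List<Integer> findNumbers(String xhtml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Covers parser setup problems and markup that is not well-formed.
            throw new IllegalArgumentException("could not parse markup", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            String text = cells.item(i).getTextContent().trim();
            if (text.matches("\\d+")) { // skip cells that are not plain positive integers
                numbers.add(Integer.parseInt(text));
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Hypothetical table: attributes and padding inside the cells do not affect the lookup.
        String xhtml = "<table><tr><td class=\"qty\">12</td><td>Apples</td><td> 7 </td></tr></table>";
        System.out.println(findNumbers(xhtml)); // prints [12, 7]
    }
}
```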

Thomas Owens 2009-10-21 12:41:13

This has the advantage of being robust against changes in the cells' attributes.

Ewan Todd 2009-10-21 12:44:20

This game never seems to get old… Q: "How can I do HTML with regex" - A: "Don't". Amazing. :)

Tomalak 2009-10-21 13:26:23

Of course an HTML parser is the more elegant way, and also the easier way if you want to process a lot of data from the HTML document (especially cool if you can use XPath). But for just a few numbers, it is a bit too heavyweight.

eWolf 2009-10-21 15:36:22

For quick scraping of just the numbers, I would always go with the regex because it's less code and less hassle. Sure, it's not robust, but it's much faster to implement for simple things.

Peter Kofler 2009-10-22 07:25:47
