I need to extract a value from a large body of text. I'm assuming the best way to do this would be to use a regular expression. If anyone thinks there's a better way to do it, feel free to offer up a suggestion.

The value I need to extract always appears in a string of the form:

[formatted_int_value] results across [the_integer_value_I_need_to_extract] pages

e.g: 3,342 results across 67 pages

In the example above, the value I'm trying to extract is 67. Also note that the words in the example may be separated by one or more whitespace and/or newline characters. And, as mentioned above, this text is part of a larger body of text (I'm screen scraping a web page).

Can someone help me with a regex to extract the int value I need (67 in my example above) that takes into consideration the conditions I've provided?

Thanks.

+1  A: 

The regex would be quite straightforward:

([\d,]+)\s+results\s+across\s+(\d+)\s+pages

The 67 would be in group 2, the other number (if you need it) in group 1.

var text = "some text here 3,342 results across 67 pages some more text here";
var regex = /([\d,]+)\s+results\s+across\s+(\d+)\s+pages/;

var matches = regex.exec(text);

/* matches will be this array:

["3,342 results across 67 pages", "3,342", "67"]
---- entire match --------------  --g1---  -g2-    
*/
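
Since `exec()` returns `null` when the text does not contain the pattern, it may be worth guarding against that before reading group 2. The following is only a minimal sketch: the helper name `extractPageCount` and the sample string are hypothetical, not part of the original answer.

```javascript
// Hypothetical helper; the name extractPageCount is an assumption for illustration.
function extractPageCount(text) {
  var regex = /([\d,]+)\s+results\s+across\s+(\d+)\s+pages/;
  var matches = regex.exec(text);

  // exec() returns null when the pattern is not found anywhere in the text
  if (matches === null) {
    return null;
  }

  // Group 2 holds the page count as a string; convert it to a number
  return parseInt(matches[2], 10);
}

// The words may be split across lines, because \s also matches \r and \n
var sample = "header\n3,342   results\r\nacross 67\n pages footer";
var pageCount = extractPageCount(sample);
var missing = extractPageCount("no counts in this text");
```

Returning `null` for a missing match leaves it to the caller to decide how to handle a page whose layout has changed.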
Tomalak
And to meet his whitespace requirements, replace the spaces with `\s+`
Michael Brewer-Davis
@Michael: Done, thanks. I've overlooked that part.
Tomalak
Refresh my memory pls, will `\s` handle newlines automagically? He said the answer could span lines.
Tony Ennis
@Tony Ennis: Yes, newlines are part of `\s`. Check for yourself: `/^\s+$/.test("\r\n");` returns `true`.
Tomalak
@Peter: It should take newlines into account. Did you notice that I changed my answer after @Michael's comment? If it does not work for you, you have not mentioned all the necessary details. Also, the group count does not change unless you are doing something other than what is described.
Tomalak
A: 
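// (?s) lets '.' match newlines; backslashes must be doubled inside a Java string literal
int theIntYouWantToExtract = Integer.parseInt(yourLongText.replaceAll(
        "(?s).*?([\\d,]+)\\s+results\\s+across\\s+(\\d+)\\s+pages.*",
        "$2"));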
Martijn Courteaux
This is way too underspecified--it'll grab integers not matching the "X results across Y" pattern.
Michael Brewer-Davis
@Michael: Oh! Yes, indeed. I didn't see.
Martijn Courteaux