I have to parse a bunch of stats from text, and they all are formatted as numbers.

For example, this paragraph:

A total of 81.8 percent of New York City students in grades 3 to 8 are meeting or exceeding grade-level math standards, compared to 88.9 percent of students in the rest of the State.

I want to match just the 81 and 88 numbers, not the ".8" and ".9" that follow.

How can I do this? I've heard the term back-reference or look-aheads or something. Will any of that help?

I am using C#.

Edit: It's required that I get the "3" and the "8" in the above example. It's just a simple example, but I need pretty much all numbers.

A: 
/(\d+)\.\d/g

This will match any number that has a decimal part following it (which I think is what you want), but will only capture the digits before the decimal point. \d matches only digits (much like [0-9]), so it makes this pretty simple.

Edit: If you want the three and the eight as well, you don't even need to check for the decimal.

Edit2: Sorry, fixed it so it will ignore all the decimal places.

/(\d+)(?:\.\d+)?/g
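
In C# the same idea could look roughly like the sketch below. The class and method names (`IntegerParts`, `Extract`) are made up for illustration and are not part of the original answer. The optional non-capturing group swallows the fraction, so its digits are never picked up as a separate match, and group 1 holds only the integer part.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class IntegerParts
{
    // Group 1 captures the integer digits; the optional non-capturing group
    // consumes ".digits" so the fraction is skipped rather than matched again.
    private static readonly Regex Number = new Regex(@"(\d+)(?:\.\d+)?");

    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Number.Matches(input))
        {
            // Groups[0] is the whole match (e.g. "81.8"); Groups[1] is "81".
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Calling `IntegerParts.Extract(text)` on the paragraph from the question should give the four integer parts in order, regardless of how many decimal places the data carries.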
tj111
Please see my edit - I need to get all numbers, but strip out the numbers after the decimal point (my actual data has crazy precision)
Jeff Meatball Yang
If I use your second one, I get the 9 and the 1, which I don't want.
Jeff Meatball Yang
+3  A: 
/[^.](\d+)[^.]/

As stated below, just use MatchObj.Groups[1] to get the digits.

Won't that also grab the digits following the decimal point? Might want to put a [^.] at the front of that.
Michael Myers
+1  A: 

Try:

[0-9]+(?=[.])

It uses a lookahead to match only numbers followed by a decimal point.

C# Code:

Regex regex = new Regex("[0-9]+(?=[.])");
MatchCollection matches = regex.Matches(input);
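
The quantifier matters with this lookahead. A small sketch to compare `*` and `+` side by side (the helper name `MatchValues` is hypothetical): with `*`, zero digits followed by a period also satisfies the pattern, so empty matches show up; with `+`, at least one digit is required.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class LookaheadQuantifier
{
    // Returns the value of every match, including empty ones, so the
    // difference between "[0-9]*(?=[.])" and "[0-9]+(?=[.])" is visible.
    public static List<string> MatchValues(string input, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }
}
```

Running `MatchValues(text, "[0-9]*(?=[.])")` and `MatchValues(text, "[0-9]+(?=[.])")` on the same input makes the extra empty entries easy to spot. Note that either form still matches digits that sit directly before a sentence-ending period.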
Stephan
You will get a blank entry at every period, because you match 0 or more digits instead of 1 or more.
Michael Myers
Thanks, was in a rush earlier and wasn't really paying attention
Stephan
+2  A: 

If you don't want to deal with groups, you can use a lookahead like you say; this pattern finds the integer part of all decimal numbers in the string:

Regex integers = new Regex(@"\d+(?=\.\d)");
MatchCollection matches = integers.Matches(str);

matches will contain 81 and 88. If you'd like to match the integer part of ANY number (decimal or not), you can instead search for runs of digits that aren't preceded by a .:

Regex integers = new Regex(@"(?<!\.)\d+");

This time, matches would contain 81, 3, 8 and 88.
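
One caveat worth sketching, since the question mentions data with a lot of precision: `(?<!\.)\d+` only rejects a match that starts right after the period, so with a fraction longer than one digit the regex can start again at the second fractional digit. A variant that also refuses to start in the middle of a digit run, `(?<![.\d])\d+`, avoids that. This is a suggested adjustment rather than part of the original answer, and the names below are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class LookbehindIntegers
{
    // Original form: a match may not start directly after a period.
    public const string Simple = @"(?<!\.)\d+";

    // Adjusted form: a match may not start after a period OR after a digit,
    // so it cannot begin partway through a fractional part.
    public const string Robust = @"(?<![.\d])\d+";

    public static List<string> Find(string input, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }
}
```

For input such as "81.856 percent", `Find(text, LookbehindIntegers.Simple)` also picks up the trailing fractional digits, while `Robust` returns only the integer parts. Neither form returns anything for a number written without a leading integer, like ".5".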

ojrac
In your first regex, you ought to put `\d+` before the final closing paren so that you don't get false positives at the ends of sentences.
Ben Blank
Excellent point. I went with `\d` since I don't care how many there are. Thanks for the correction.
ojrac
In your second code block, what kind of syntax is that? I don't know what ?<! means. Thanks.
Jeff Meatball Yang
(?<!pattern) is a negative lookbehind -- so, it prevents any matches that follow the pattern `\.`
ojrac
Link for more in-depth info: http://www.regular-expressions.info/lookaround.html#lookbehind
ojrac
I was able to use these code snips as starting points - thanks. It turns out that a lookahead for OR'ed patterns is what I was looking for.
Jeff Meatball Yang
A: 

Try using /(\d+)((\.\d+)?)/

This basically means match a sequence of digits and an optional decimal point with another sequence of digits. Then, use MatchObj.Groups[1] for the first group's value (the integer part), ignoring the second one.

Yuval F
+1  A: 
[^.](\d+)

From your example, this will match " 81", " 3", " 8", " 88"

You'll get an extra character before you get your number, but you can just trim that out in your code.

jimyi
A: 

This is not in the language you asked about, but it may help you think about the problem.

$ echo "A total of 81.8 percent of New York City students in grades 3 to 8 are meeting or exceeding grade-level math standards, compared to 88.9 percent of students in the rest of the State." \
| fmt -w 1 | sed -n -e '/^[0-9]/p' | sed -e 's,[^0-9].*,,' | fmt -w 72
81 3 8 88

The first fmt command puts each word on its own line, so the following commands consider each word separately. The "sed -n" command outputs only those words which start with a digit. The second sed command removes the first non-digit character in the word, and everything after it. The second fmt command combines everything back into one line.

$ echo "This tests notation like 6.022e+23 and 10e100 and 1e+100." \
| fmt -w 1 | sed -n -e '/^[0-9]/p' | sed -e 's,[^0-9].*,,' | fmt -w 72
6 10 1
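
For what it's worth, the same word-by-word idea can be sketched in C# without any regex. This is a rough translation of the pipeline above, not something from the original answer; `LeadingDigits` and `Extract` are made-up names, and only ASCII digits 0-9 are considered, as in the sed expressions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for this example only.
public static class LeadingDigits
{
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        // A null separator splits on whitespace, one word at a time
        // (the role played by `fmt -w 1`).
        foreach (string word in input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            // Keep only words that start with a digit (the first sed command).
            if (!IsAsciiDigit(word[0])) continue;
            // Cut the word at its first non-digit character (the second sed command).
            result.Add(new string(word.TakeWhile(c => IsAsciiDigit(c)).ToArray()));
        }
        return result;
    }

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';
}
```

Like the shell version, this only sees numbers that begin a whitespace-separated word, so a number glued to preceding punctuation would be skipped.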
Jason Catena
+2  A: 

Complete C# solution:

using System.Collections.Generic;
using System.Text.RegularExpressions;

/// <summary>
/// Uses the named group 'roundedDigit' and the word boundary '\b' for ease of
/// understanding
/// Adds the rounded percents to the roundedPercents list
/// Will work for any percent value
/// Will work for any number of percent values in the string
/// Will also give those numbers that are not in percentage (decimal) format
/// </summary>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetRoundedPercents(string digitSequence, out List<string> roundedPercents)
{
    roundedPercents = null;
    string pattern = @"(?<roundedDigit>\b\d{1,3})(\.\d{1,2}){0,1}\b";

    if (Regex.IsMatch(digitSequence, pattern))
    {
        roundedPercents = new List<string>();
        Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.ExplicitCapture);

        for (Match m = r.Match(digitSequence); m.Success; m = m.NextMatch())
            roundedPercents.Add(m.Groups["roundedDigit"].Value);

        return true;
    }
    else
        return false;
}

For your example, this returns 81, 3, 8 and 88.

Rashmi Pandit