I have the following string:

Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)

I want to capture:

2121,323.222
2–2.4
0.5

That is, I want the above three results from the string.
Can anyone help me with this regex?

A: 

Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?


Sure:

Fat mass loss was (.*) greater for GPLC \((.*)kg vs\. (.*)kg\)

Find your substrings in \1, \2 and \3. If for Emacs, swap all parentheses and escaped parentheses.
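For the C# case, here is a minimal sketch of the same fixed-template idea, with `kg` kept outside both range groups and the dot after `vs` escaped. The class and method names (`TemplateCapture`, `Extract`) are made up for illustration, and this only works when the sentence has exactly this wording.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplateCapture
{
    // Hypothetical helper: returns the three captured substrings,
    // or an empty array if the sentence does not fit the template.
    public static string[] Extract(string input)
    {
        var m = Regex.Match(input,
            @"Fat mass loss was (.*) greater for GPLC \((.*)kg vs\. (.*)kg\)");
        if (!m.Success) return Array.Empty<string>();
        // Groups[0] is the whole match; Groups[1..3] are the \1, \2, \3 captures.
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        // \u2013 is the en dash that appears in the question's string.
        var parts = Extract("Fat mass loss was 2121,323.222 greater for GPLC (2\u20132.4kg vs. 0.5kg)");
        Console.WriteLine(string.Join(" | ", parts));
    }
}
```

In C# the captures are read from `m.Groups[n].Value` rather than `\1`, `\2`, `\3`.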

Pascal Cuoq
A: 

How about something like this:

^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$

A little more general, I think. I'm a little concerned about .* being greedy.
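To make the greediness concern concrete, here is a small C# sketch that runs the pattern above against the question's string and returns the three groups. The names `GreedyDemo` and `Groups` are only for illustration. Because the leading `.*` grabs as much as it can, the groups end up as fragments from the tail of the string rather than the three numbers.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns groups 1..3 of the anchored pattern,
    // or an empty array if there is no match.
    public static string[] Groups(string input)
    {
        var pattern = @"^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$";
        var m = Regex.Match(input, pattern);
        if (!m.Success) return Array.Empty<string>();
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        var g = Groups("Fat mass loss was 2121,323.222 greater for GPLC (2\u20132.4kg vs. 0.5kg)");
        // The greedy ".*" pushes group 1 as far right as it can while the
        // remaining groups still find digits, so this prints "2 | 4 | 0.5".
        Console.WriteLine(string.Join(" | ", g));
    }
}
```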

mswebersd
* tested with echo "Fat mass loss was 2121,323.222 greater for GPLC (2.2.4kg vs. 0.5kg)" | \grep -rnP "^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$"
mswebersd
You shouldn't capture it from start to end... you can have a text with many numbers, not just this sample.
Kobi
Agreed, use Zen's instead.
mswebersd
A: 

Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)

a generalized extractor:

/\D+?([\d\,\.\-]+)/g

explanation:

/           # start pattern
 \D+        # 1 or more non-digits
  (         # capture group 1          
   [\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
  )         # end capture group 1
/g          # trailing regex g modifier (make regex continue after last match)

sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.

see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.
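A rough C# sketch of how this could plug in, using the greedy form from the explanation (`\D+` followed by the capture group). `ZenExtractor` and `Extract` are hypothetical names. `Regex.Matches` plays the role of the `/g` modifier, and the number is read from capture group 1 of each match. The sample below uses a plain hyphen-minus in the range, since the character class does not include the en dash.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ZenExtractor
{
    // Hypothetical helper: skips a run of non-digits, then captures a run of
    // digits, commas, periods and hyphens, repeating across the whole input.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\D+([\d,.-]+)"))
        {
            // Groups[1] is the captured number; Groups[0] would also
            // include the non-digit text in front of it.
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
        foreach (var s in Extract(input))
            Console.WriteLine(s);
        // Prints:
        // 2121,323.222
        // 2-2.4
        // 0.5
    }
}
```

Note that a number at the very start of the input would be skipped, because the pattern requires at least one non-digit in front of each capture.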

zen
You don't need to escape `,` and `.` inside a character class, nor the `-` in this case (because it's at the end). And `\D+` doesn't need to be non-greedy, since the first character after it is always a digit.
Gumbo
@Gumbo, Thank you!
zen
A: 

It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:

\d+(?:[,.-]\d+)*

From C# 3, you can use it like this:

var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = @"\d+(?:[,.-]\d+)*";

var matches = Regex.Matches(input, pattern);

// MatchCollection enumerates as object on older frameworks, so declare the element type explicitly.
foreach (Match match in matches)
  Console.WriteLine(match.Value);
Richard Beier
\d matches any digit. \D is the opposite, it matches any character that is not a digit. See http://msdn.microsoft.com/en-us/library/az24scfc.aspx. (?:) does mean cluster without capture, but we don't want to capture the individual parts within each match. E.g. we don't need "2121", "323", and "222" separately. We want the whole "2121,323.222", which will be one of the elements in the MatchCollection returned by Regex.Match().
Richard Beier
I now see it's only the digit extraction portion, but how will the regex progress through the unwanted characters from left to right? That's why I mentioned \D+
zen
Oops... I should have used Regex.Matches() rather than Regex.Match(). Let me fix my answer... thanks for catching that. Regex.Matches() will find all matching substrings and will return only the matches. So your regex doesn't need to match the unwanted parts.
Richard Beier
A: 

I came out with something like this atrocity:

-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?

Out of which -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with [–-] in the middle (note that the first character in that class is a long dash, not a hyphen).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.
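Here is a short C# sketch to try it out, with the pattern assembled from the repeated number part so the structure is easier to see. `RangeExtractor` and `Extract` are names invented for the example, and the long dash is written as `\u2013` in the pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RangeExtractor
{
    // One number: optional minus, digits with optional thousands commas,
    // optional decimal part.
    private const string Number = @"-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?";

    // A number, optionally followed by an en dash or hyphen and a second number.
    private static readonly Regex Pattern =
        new Regex(Number + @"(?:[\u2013-]" + Number + ")?");

    // Hypothetical helper: returns every number or range found in the input.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(input))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // Prints "23,45.2-7": the surrounding letters and the leading comma are left out.
        Console.WriteLine(string.Join(" | ", Extract("hello,23,45.2-7world")));
    }
}
```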

Kobi
A: 

I noticed that the hyphen in 2–2.4kg is not really a hyphen; it's the Unicode character U+2013 (EN DASH).

So, here is another regex in C#

@"[0-9]+([,.\u2013-][0-9]+)*"

Test

MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
    Console.WriteLine(m.Groups[0]);
}

Here are the results. My console does not support printing the Unicode character U+2013, so it shows "?", but it is properly matched.

2121,323.222
2?2.4
0.5
S.Mark
why the two hyphens? And does Matches() wrap the regex with .*?
zen
Try this; it's a Unicode dash: `javascript:alert("–".charCodeAt(0))`
S.Mark
Good catch! I was wondering why my regex test was failing with that hyphen... I started to question my sanity...
Richard Beier
For a more generic approach you could use 'dash punctuation': \p{Pd}
Huppie
A: 

Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:

\d+(?:[\u2013,.]\d+)*

A more generic approach would be:

\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*

which matches dash punctuation, connector punctuation and other punctuation between the digit groups. See here for more information about those.

An implementation in C# would look like this:

string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
    // Same pattern as described above: dash, connector and other punctuation between digit groups.
    Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match match = rx.Match(input);
    while (match.Success) {
        // matched text: match.Value
        // match start: match.Index
        // match length: match.Length
        match = match.NextMatch();
    } 
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
Huppie
A: 

Thanks, everybody, for your co-operation. I actually tried a bit more after having a cup of tea and finally got the solution to my problem :)

The following regex gave me the desired result:

(([0-9]+)([–.,-]*))+

Thanks a lot to everyone who helped me solve my problem.

Regards, Muhammad Waqas

Muhammad Waqas
A: 

Let's try this one:

(?=\d)([0-9,.-]+)(?<=\d)

It captures every expression that:

  • contains only "[0-9,.-]" characters,
  • starts with a digit "(?=\d)",
  • ends with a digit "(?<=\d)"

It works with a single digit expression and does not include beginning or trailing [.,-].
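A small C# sketch of how the two lookarounds behave (`LookaroundExtractor` and `Extract` are illustrative names, and the second input is a made-up sample). The greedy character class first grabs any trailing `,`, `.` or `-`, and the lookbehind then forces the engine to back off until the match ends on a digit.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundExtractor
{
    // Hypothetical helper: returns every run of [0-9,.-] that both
    // starts and ends with a digit.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?=\d)([0-9,.-]+)(?<=\d)"))
            results.Add(m.Groups[1].Value);
        return results;
    }

    public static void Main()
    {
        // Plain hyphen-minus in the range, since the class does not include the en dash.
        var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
        Console.WriteLine(string.Join(" | ", Extract(input)));
        // Prints: 2121,323.222 | 2-2.4 | 0.5

        // Trailing punctuation is dropped by the lookbehind.
        Console.WriteLine(string.Join(" | ", Extract("7, 8.5-, 9")));
        // Prints: 7 | 8.5 | 9
    }
}
```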

Hope this helps.

Arno