ansaurus

Question

Regex to parse international floating-point numbers

Answer 1

+1 A:

How about

/(\d{1,3}(?:,\d{3})*)(\.\d{2})?/

if you care about validating that the commas separate every 3 digits exactly, or

/(\d[\d,]*)(\.\d{2})?/

if you don't.
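
To see how the two patterns behave, here is a minimal Python sketch. The names `STRICT`, `LOOSE` and `parts`, and the choice of `re.fullmatch` (so the whole string has to match), are my own assumptions for illustration and not part of the answer.

```python
import re

# The two patterns from above, unchanged.
STRICT = re.compile(r'(\d{1,3}(?:,\d{3})*)(\.\d{2})?')
LOOSE = re.compile(r'(\d[\d,]*)(\.\d{2})?')

def parts(pattern, text):
    """Return (integer part, decimal part) if the whole string matches, else None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None

print(parts(STRICT, '111,111.11'))  # comma-grouped value with a dot decimal
print(parts(STRICT, '1,11,1.00'))   # misplaced commas
print(parts(LOOSE, '1,11,1.00'))    # the loose pattern does not check grouping
print(parts(STRICT, '111.111,11'))  # dot-grouped value with a comma decimal
```

Neither pattern accepts the `111.111,11` form as a whole string, since both hard-code the comma as thousands separator and the dot as decimal separator.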

Avi 2009-08-18 17:38:56

This won't validate his first example; 111.111,11

Håkon 2009-08-18 18:28:53

True. I didn't notice that one. Sorry.

Avi 2009-08-18 18:31:14

Answer 2

A:

If I'm interpreting your question correctly, and the result SHOULD look the way you say it "would" look, then I think you just need to leave the comma out of the character class, since it is used as a separator and is not part of what is to be matched.

So get rid of the "." first, then match the two parts.

$value = "111.111,11";
$value =~ s/\.//g;
$value =~ m/(\d+)(?:,(\d+))?/;

$1 = the leading integer digits, with the periods removed
$2 = the digits after the comma, or undef if there were none
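
For readers who don't use Perl, the same two steps can be sketched in Python. This is only an illustration: the function name `split_comma_decimal` is made up, and unlike the Perl snippet it requires the whole string to match (via `re.fullmatch`).

```python
import re

def split_comma_decimal(value):
    # Step 1: drop the "." thousands separators.
    stripped = value.replace('.', '')
    # Step 2: split into integer digits and optional post-comma digits.
    m = re.fullmatch(r'(\d+)(?:,(\d+))?', stripped)
    if m is None:
        return None
    return m.group(1), m.group(2)  # group 2 is None when there is no comma

print(split_comma_decimal('111.111,11'))
print(split_comma_decimal('111111'))
```

Note that this always treats the comma as the decimal separator, so a value written as `111,111.11` would be split in the wrong place.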

Devin Ceartas 2009-08-18 17:40:09

Answer 3

+2 A:

I would first use this regex to determine whether a comma or a dot is used as the decimal separator (it captures whichever of the two occurs last):

[0-9,\.]*([,\.])[0-9]*

I would then strip all occurrences of the other character (the one the previous regex didn't capture). If there was no match, you already have an integer and can skip the next steps. The removal can easily be done with a regex, but there are also many other functions that can do it faster/better.

You are then left with a number in the form of an integer, possibly followed by a comma or a dot and then the decimals. The integer and decimal parts can easily be separated from each other with the following regex.

([0-9]+)[,\.]?([0-9]*)

Good luck!

Edit:

Here is an example made in Python. I assume the code is self-explanatory; if it is not, just ask.

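import re

value = input()
# Captures the last separator in the string, which is taken to be the decimal separator.
delimiterRegex = re.compile(r'[0-9,.]*([,.])[0-9]*')
# Splits the cleaned-up number into its integer and decimal parts (not used below).
splitRegex = re.compile(r'([0-9]+)[,.]?([0-9]*)')

delimiter = delimiterRegex.findall(value)

# An empty result means there is no separator at all, so nothing needs stripping.
if delimiter and delimiter[0] == ',':
    value = value.replace('.', '')
elif delimiter and delimiter[0] == '.':
    value = value.replace(',', '')

print(value)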

With this code, the following inputs give these outputs:

111.111,11 -> 111111,11
111,111.11 -> 111111.11
111,111 -> 111,111

After this step, one can now easily modify the string to match your needs.
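
The remaining split step can be sketched as follows. This is a hedged illustration rather than the answer's own code: the function name `split_number` and the use of `re.fullmatch` are assumptions, while the two patterns are the ones given above.

```python
import re

DELIMITER_RE = re.compile(r'[0-9,.]*([,.])[0-9]*')
SPLIT_RE = re.compile(r'([0-9]+)[,.]?([0-9]*)')

def split_number(text):
    """Return (integer digits, decimal digits) using the last-separator rule."""
    found = DELIMITER_RE.fullmatch(text)
    if found:
        # The last separator is assumed to be the decimal separator,
        # so every occurrence of the other one is stripped.
        decimal_sep = found.group(1)
        other = '.' if decimal_sep == ',' else ','
        text = text.replace(other, '')
    parts = SPLIT_RE.fullmatch(text)
    if parts is None:
        raise ValueError('not a number: %r' % text)
    return parts.groups()

print(split_number('111.111,11'))
print(split_number('111,111.11'))
print(split_number('111,111'))
```

The last call shows the limitation of the rule: a lone separator is always read as a decimal separator, so `111,111` is split into `111` and `111` rather than being treated as one hundred eleven thousand.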

Håkon 2009-08-18 19:19:50

I'm pretty sure this answer is wrong, but I can't say for certain because you don't really say how you're using the regexes (but that's sufficient reason for a downvote right there). Can you explain how you're distinguishing the thousands separator from the decimal separator (with tested examples)?

Alan Moore 2009-08-18 22:49:14

The first regex determines which character is the decimal separator by finding the one that occurs last. You then strip the other separator from the number, and you are left with a number without thousands separators. The rest should be a piece of cake. Will post example code later.

Håkon 2009-08-19 10:20:35

According to the OP, the comma in `111,111` is a thousands separator (TS). A decimal separator (DS), if present, must be followed by exactly two digits (he cleared that up in the comments under the question). So your first regex would have to end with `([,.][0-9]{2})?` like the OP's did. But he's also trying to validate that the TS's are correctly distributed.

Alan Moore 2009-08-20 12:57:41

Answer 4

+5 A:

First Answer:

This matches #,###,##0.00:

^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$

And this matches #.###.##0,00:

^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$

Joining the two (there are smarter/shorter ways to write it, but it works):

(?:^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$)
|(?:^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$)

You can also add a capturing group around the last comma (or dot) to check which one was used.
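
A minimal Python sketch of that idea, assuming numbered groups (group 1 for a decimal dot, group 2 for a decimal comma); the names `JOINED` and `decimal_separator` are made up for the example.

```python
import re

# Both alternatives of the joined regex; the decimal separator is captured
# in group 1 (dot) or group 2 (comma).
JOINED = re.compile(
    r'^[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:(\.)[0-9]{2})?$'
    r'|^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:(,)[0-9]{2})?$'
)

def decimal_separator(text):
    """Return '.', ',' or '' (no decimals); None if the text does not match."""
    m = JOINED.match(text)
    if m is None:
        return None
    return m.group(1) or m.group(2) or ''

print(decimal_separator('1,234,567.89'))
print(decimal_separator('1.234.567,89'))
print(decimal_separator('1234567'))
```

Because the thousands separator is optional in each group of three digits, this version also accepts inconsistently grouped input such as `11,111111.00`.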

Second Answer:

As pointed out by Alan M, my previous solution could fail to reject a value like 11,111111.00, where one comma is missing but the other isn't. After some tests I arrived at the following regex, which avoids this problem:

^[+-]?[0-9]{1,3}
(?:(?<comma>\,?)[0-9]{3})?
(?:\k<comma>[0-9]{3})*
(?:\.[0-9]{2})?$

This deserves some explanation:

^[+-]?[0-9]{1,3} matches the first (1 to 3) digits;
(?:(?<comma>\,?)[0-9]{3})? matches an optional comma followed by 3 more digits, and captures the comma (or the absence of one) in a group called 'comma';
(?:\k<comma>[0-9]{3})* matches zero-to-any repetitions of the comma used before (if any) followed by 3 digits;
(?:\.[0-9]{2})?$ matches optional "cents" at the end of the string.

Of course, that will only cover #,###,##0.00 (not #.###.##0,00), but you can always join the regexes like I did above.
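
The regex above uses .NET-style named groups. As a rough sketch, the same pattern can be written for Python's `re` module, where the assumed equivalents are `(?P<comma>...)` for the named group and `(?P=comma)` for the backreference; the name `CONSISTENT` is only for illustration.

```python
import re

CONSISTENT = re.compile(r'''
    ^[+-]?[0-9]{1,3}
    (?:(?P<comma>,?)[0-9]{3})?     # first group decides: comma or no comma
    (?:(?P=comma)[0-9]{3})*        # later groups must repeat that choice
    (?:\.[0-9]{2})?$
''', re.VERBOSE)

for text in ['1,234,567.89', '1234567.89', '11,111111.00', '1,234567.00']:
    print(text, bool(CONSISTENT.match(text)))
```

The first two values are accepted and the last two are rejected, because once the first group of three digits is written with (or without) a comma, every following group has to do the same.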

Final Answer:

Now, a complete solution. Indentations and line breaks are there for readability only.

^[+-]?[0-9]{1,3}
(?:
    (?:\,[0-9]{3})*
    (?:\.[0-9]{2})?
|
    (?:\.[0-9]{3})*
    (?:\,[0-9]{2})?
|
    [0-9]*
    (?:[\.\,][0-9]{2})?
)$
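
As a quick check of the complete pattern, here is a hedged Python sketch. It uses `re.VERBOSE` so the layout can stay as written, escapes the decimal dot in the first alternative, and the names `NUMBER` and `is_valid` are assumptions made for the example.

```python
import re

NUMBER = re.compile(r'''
    ^[+-]?[0-9]{1,3}
    (?:
        (?:,[0-9]{3})*          # 1,234,567.89
        (?:\.[0-9]{2})?
    |
        (?:\.[0-9]{3})*         # 1.234.567,89
        (?:,[0-9]{2})?
    |
        [0-9]*                  # 1234567.89 or 1234567,89
        (?:[.,][0-9]{2})?
    )$
''', re.VERBOSE)

def is_valid(text):
    return NUMBER.match(text) is not None

for text in ['1,234,567.89', '1.234.567,89', '1234567,89',
             '11,111111.00', '1111.111,00']:
    print(text, is_valid(text))
```

The first three values are accepted; the last two, which mix grouped and ungrouped digits, are rejected.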

And this variation captures the separators used:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$

edit 1: "cents" are now optional; edit 2: text added; edit 3: second solution added; edit 4: complete solution added; edit 5: headings added; edit 6: capturing added; edit 7: last answer split into two versions;

jpbochi 2009-08-18 19:45:17

+1. I would move the anchors outside the alternation. You could move the common leading and trailing elements outside it as well, but that's not necessarily worth the tradeoff in readability

Alan Moore 2009-08-18 20:59:35

Readability is not a strong point of regular expressions, but I agree. Thanks for the vote :)

jpbochi 2009-08-18 21:47:22

Just noticed, the thousands separators should *not* be optional; e.g., `(?:\.?[0-9]{3})*` should be `(?:\.[0-9]{3})*`. Otherwise, you could match things like `11,111111.00` or `1111.111,00`.

Alan Moore 2009-08-18 23:36:41

Ok, but what if you want them to be optional?

jpbochi 2009-08-19 02:59:02

Now it's optional and doesn't have the problem you pointed out. :)

jpbochi 2009-08-19 03:25:51

Very nice! I wasn't even thinking about handling numbers with no thousands separators (since it's not in the question), but that's downright elegant.

Alan Moore 2009-08-19 05:58:54

I think the final answer is missing the backtracking...

LuRsT 2009-08-19 09:52:15

ok, now it has backtracking. :)

jpbochi 2009-08-19 12:09:30

Oh, you meant **capturing**! I couldn't figure out what you (@LuRsT) meant by *backtracking*, but now I see you meant **capturing** all along. And again (@jpbochi), nicely done! You capture each separator (if there is one) in its own named group, so later you can remove all the thousands separators and split on the decimal separator. Unfortunately, it will only work in the **.NET**, **JGSoft**, or **Perl 5.10+** regex flavors; as of now, no others permit group names to be reused within a regex (which is a damn shame--that's a killer feature).

Alan Moore 2009-08-19 13:12:23

Wow, great regex, but yes, I can't use it (I'm using PHP 5.3). Can you make a version for that? Even if I have to search through the results to find the groups correctly :)

LuRsT 2009-08-19 14:28:30

Also, backtrack in your final answer doesn't work for the first numbers :(

LuRsT 2009-08-19 15:20:00

I didn't get it. Which first numbers are you talking about?

jpbochi 2009-08-19 16:05:09

From the first line, or are they in the thousand group?

LuRsT 2009-08-19 17:26:34

The 'thousand' group will only capture the separator character (`','` or `'.'`). I realized later that you wanted to capture the numbers themselves. I'm not sure it's possible with a raw regex. You may use the regex that I wrote to validate the string and capture the separators. Then, in a second step, you may split the digits and remove the separators.

jpbochi 2009-08-19 19:26:09

You can replace the named groups with old-fashioned numbered groups, eg, `(\,)` instead of `(?<thousand>\,)`. Then, if group 1 or 3 matched anything, that's your thousands separator (TS); if group 2, 4 or 5 matched, that's the decimal separator (DS). Delete all the TS's and split on the DS, and Bob's your uncle. (If none of them participated in the match, the number's an integer--no post-processing required.)
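
The numbered-group approach described above could look roughly like this in Python. It is a sketch under assumptions, not a tested port for any particular flavor: the names `NUMBER` and `split_number` are made up, and `re.fullmatch` is used so the whole string has to match.

```python
import re

NUMBER = re.compile(r'''
    ^[+-]?[0-9]{1,3}
    (?:
        (?:(,)[0-9]{3})*        # group 1: thousands separator ","
        (?:(\.)[0-9]{2})?       # group 2: decimal separator "."
    |
        (?:(\.)[0-9]{3})*       # group 3: thousands separator "."
        (?:(,)[0-9]{2})?        # group 4: decimal separator ","
    |
        [0-9]*
        (?:([.,])[0-9]{2})?     # group 5: decimal separator, no thousands
    )$
''', re.VERBOSE)

def split_number(text):
    """Return (integer digits, decimal digits) or raise ValueError."""
    m = NUMBER.fullmatch(text)
    if m is None:
        raise ValueError('not a valid number: %r' % text)
    thousands = m.group(1) or m.group(3)
    decimal = m.group(2) or m.group(4) or m.group(5)
    if thousands:
        text = text.replace(thousands, '')   # delete all the TS's
    if decimal:
        whole, fraction = text.split(decimal)  # split on the DS
    else:
        whole, fraction = text, ''           # plain integer
    return whole, fraction

print(split_number('1.234.567,89'))
print(split_number('1,234,567.89'))
print(split_number('111,111'))
```

Here `111,111` comes back as the integer `111111` with no decimals, which is the reading the OP asked for.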

Alan Moore 2009-08-20 03:50:54

Answer 5

A:

See Perl's Regexp::Common::number.

Sinan Ünür 2009-08-19 13:55:28
