I need a regex to match numeric values that can look like any of these:

111.111,11

111,111.11

111,111

It should also separate the integer and decimal portions so I can store the value in a DB with the correct syntax.

I tried ([0-9]{1,3}[,.]?)+([,.][0-9]{2})? with no success, since it doesn't detect the second part :(

The result should look like:

111.111,11 -> $1 = 111111; $2 = 11
+1  A: 

How about

/(\d{1,3}(?:,\d{3})*)(\.\d{2})?/

if you care about validating that the commas separate every 3 digits exactly, or

/(\d[\d,]*)(\.\d{2})?/

if you don't.

Avi
This won't validate his first example; 111.111,11
Håkon
True. I didn't notice that one. Sorry.
Avi
A: 

If I'm reading your question correctly, the result you show is the output you want. In that case you just need to leave the thousands separator out of the match, since it is only a separator and not part of what you want to capture.

So get rid of the "." first, then match the two parts.

$value = "111.111,11";
$value =~ s/\.//g;
$value =~ m/(\d+)(?:,(\d+))?/;

$1 = the leading integer digits, with the periods removed.

$2 = the digits after the comma if they exist, or undef if they don't.
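
For readers who don't use Perl, the same idea could be sketched in Python 3 roughly as follows. This assumes the input always uses "." for thousands and "," for decimals; the function name split_european is made up for illustration, and unlike the Perl snippet it anchors the match to the whole string.

```python
import re

def split_european(value):
    """Split a '#.###,##' style number into (integer, decimal) strings."""
    # Drop the '.' thousands separators first
    stripped = value.replace(".", "")
    # Digits, then optionally a comma followed by the decimal digits
    match = re.fullmatch(r"(\d+)(?:,(\d+))?", stripped)
    if match is None:
        return None
    # The decimal part is None when there was no comma
    return match.group(1), match.group(2)

print(split_european("111.111,11"))  # ('111111', '11')
print(split_european("111.111"))     # ('111111', None)
```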

Devin Ceartas
+2  A: 

I would first use this regex to determine whether a comma or a dot is used as the decimal delimiter (it captures whichever of the two occurs last):

[0-9,\.]*([,\.])[0-9]*

I would then strip every occurrence of the other character (the one the previous regex did not capture). If there was no match, you already have an integer and can skip the next steps. The removal can easily be done with a regex, but many languages also have plain string functions that do this faster or more simply.

You are then left with a number in the form of an integer, possibly followed by a comma or a dot and then the decimals. The integer and decimal parts can easily be separated from each other with the following regex.

([0-9]+)[,\.]?([0-9]*)

Good luck!

Edit:

Here is an example written in Python. I assume the code is self-explanatory; if it is not, just ask.

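import re

# Read the number as text, e.g. "111.111,11" (Python 3)
value = input().strip()
# The greedy prefix makes the group capture the LAST ',' or '.'
delimiter_regex = re.compile(r'[0-9,.]*([,.])[0-9]*')
# For the later step: splits the cleaned string into integer and decimals
split_regex = re.compile(r'([0-9]+)[,.]?([0-9]*)')

match = delimiter_regex.search(value)

# No match means there is no separator at all: a plain integer
if match:
    if match.group(1) == ',':
        # ',' is the decimal delimiter, so remove every '.'
        value = value.replace('.', '')
    elif match.group(1) == '.':
        # '.' is the decimal delimiter, so remove every ','
        value = value.replace(',', '')

print(value)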

With this code, the following inputs give these results:

  • 111.111,11

    111111,11

  • 111,111.11

    111111.11

  • 111,111

    111,111

After this step, one can now easily modify the string to match your needs.
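
The remaining split step could be sketched like this in Python 3. The helper name split_parts is hypothetical; the regex is the second one given above. Note that a string such as 111,111 still comes out with 111 as its decimal part, because this approach always treats the last separator as the decimal one.

```python
import re

# Second regex from the answer: integer digits, optional separator, decimals
SPLIT_REGEX = re.compile(r"([0-9]+)[,.]?([0-9]*)")

def split_parts(cleaned):
    """Split a cleaned string such as '111111,11' into (integer, decimal)."""
    match = SPLIT_REGEX.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"not a cleaned number: {cleaned!r}")
    return match.group(1), match.group(2)

print(split_parts("111111,11"))  # ('111111', '11')
print(split_parts("111111.11"))  # ('111111', '11')
print(split_parts("111,111"))    # ('111', '111')
```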

Håkon
I'm pretty sure this answer is wrong, but I can't say for certain because you don't really say how you're using the regexes (but that's sufficient reason for a downvote right there). Can you explain how you're distinguishing the thousands separator from the decimal separator (with tested examples)?
Alan Moore
The first regex determines which character is the decimal separator by finding which of the two occurs last. You then strip the other separator from the number, and you are left with a number without thousands separators. The rest should be a piece of cake. Will post example code later.
Håkon
According to the OP, the comma in `111,111` is a thousands separator (TS). A decimal separator (DS), if present, must be followed by exactly two digits (he cleared that up in the comments under the question). So your first regex would have to end with `([,.][0-9]{2})?` like the OP's did. But he's also trying to validate that the TS's are correctly distributed.
Alan Moore
+5  A: 

First Answer:

This matches #,###,##0.00:

^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$

And this matches #.###.##0,00:

^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$

Joining the two (there are smarter/shorter ways to write it, but it works):

(?:^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$)
|(?:^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$)

You can also add a capturing group around the last comma (or dot) to check which one was used.


Second Answer:

As pointed out by Alan M, my previous solution could fail to reject a value like 11,111111.00, where one comma is present but another is missing. After some tests I arrived at the following regex, which avoids this problem:

^[+-]?[0-9]{1,3}
(?:(?<comma>\,?)[0-9]{3})?
(?:\k<comma>[0-9]{3})*
(?:\.[0-9]{2})?$

This deserves some explanation:

  • ^[+-]?[0-9]{1,3} matches the first (1 to 3) digits;

  • (?:(?<comma>\,?)[0-9]{3})? matches an optional comma followed by 3 more digits, and captures the comma (or the absence of one) in a group called 'comma';

  • (?:\k<comma>[0-9]{3})* matches zero-to-any repetitions of the comma used before (if any) followed by 3 digits;

  • (?:\.[0-9]{2})?$ matches optional "cents" at the end of the string.

Of course, that will only cover #,###,##0.00 (not #.###.##0,00), but you can always join the regexes like I did above.
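
As a rough check of the backreference idea, here is a Python 3 translation. Python's re module spells named groups (?P<name>...) and backreferences (?P=name), so this is a flavor-specific rewrite rather than the exact regex above; the function name accepts is made up for illustration.

```python
import re

PATTERN = re.compile(
    r"""
    ^[+-]?[0-9]{1,3}
    (?:(?P<comma>,?)[0-9]{3})?
    (?:(?P=comma)[0-9]{3})*
    (?:\.[0-9]{2})?$
    """,
    re.VERBOSE,  # whitespace and line breaks in the pattern are ignored
)

def accepts(text):
    """Return True if text is a valid #,###,##0.00 style number."""
    return PATTERN.match(text) is not None

print(accepts("1,111,111.00"))  # True: commas used consistently
print(accepts("1111111.00"))    # True: no commas at all
print(accepts("11,111111.00"))  # False: one comma present, one missing
```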


Final Answer:

Now, a complete solution. Indentations and line breaks are there for readability only.

^[+-]?[0-9]{1,3}
(?:
    (?:\,[0-9]{3})*
    (?:\.[0-9]{2})?
|
    (?:\.[0-9]{3})*
    (?:\,[0-9]{2})?
|
    [0-9]*
    (?:[\.\,][0-9]{2})?
)$
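
One way to try this out is a small Python 3 harness like the one below. It uses the pattern with the decimal point in the first branch written as \. (matching the capturing variation), and the sample values are the ones discussed in this thread. The name is_valid is hypothetical.

```python
import re

FINAL = re.compile(
    r"""
    ^[+-]?[0-9]{1,3}
    (?:
        (?:,[0-9]{3})*
        (?:\.[0-9]{2})?
    |
        (?:\.[0-9]{3})*
        (?:,[0-9]{2})?
    |
        [0-9]*
        (?:[.,][0-9]{2})?
    )$
    """,
    re.VERBOSE,  # lets the pattern keep its indentation and line breaks
)

def is_valid(text):
    """Return True if text matches one of the three accepted layouts."""
    return FINAL.match(text) is not None

for sample in ["111.111,11", "111,111.11", "111,111", "1111111.00",
               "11,111111.00", "1111.111,00", "111,111,11"]:
    print(sample, is_valid(sample))
```

The first four samples should be accepted and the last three rejected.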

And this variation captures the separators used:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$


edit 1: "cents" are now optional; edit 2: text added; edit 3: second solution added; edit 4: complete solution added; edit 5: headings added; edit 6: capturing added; edit 7: last answer broke in two versions;

jpbochi
+1. I would move the anchors outside the alternation. You could move the common leading and trailing elements outside it as well, but that's not necessarily worth the tradeoff in readability
Alan Moore
Readability is not a strong point of regular expressions, but I agree. Thanks for the vote :)
jpbochi
Just noticed, the thousands separators should *not* be optional; e.g., `(?:\.?[0-9]{3})*` should be `(?:\.[0-9]{3})*`. Otherwise, you could match things like `11,111111.00` or `1111.111,00`.
Alan Moore
Ok, but what if you want them to be optional?
jpbochi
Now, it's optional and doesn't have the problem you pointed. :)
jpbochi
Very nice! I wasn't even thinking about handling numbers with no thousands separators (since it's not in the question), but that's downright elegant.
Alan Moore
I think the final answer is missing the backtracking...
LuRsT
ok, now it has backtracking. :)
jpbochi
Oh, you meant **capturing**! I couldn't figure out what you (@LuRsT) meant by *backtracking*, but now I see you meant **capturing** all along. And again (@jpbochi), nicely done! You capture each separator (if there is one) in its own named group, so later you can remove all the thousands separators and split on the decimal separator. Unfortunately, it will only work in the **.NET**, **JGSoft**, or **Perl 5.10+** regex flavors; as of now, no others permit group names to be reused within a regex (which is a damn shame--that's a killer feature).
Alan Moore
Wow, great regex, but I can't use it (I'm using PHP 5.3). Can you make a version for that? Even if I have to search through the results to find the groups correctly :)
LuRsT
Also, backtrack in your final answer doesn't work for the first numbers :(
LuRsT
I didn't get it. Which first numbers are you talking about?
jpbochi
From the first line, or are they in the thousand group?
LuRsT
The 'thousand' group will only capture the separator character (`','` or `'.'`). I realized later that you wanted to capture the numbers themselves. I'm not sure that's possible with a raw regex. You may use the regex that I wrote to validate the string and capture the separators. Then, in a second step, you may split the digits and remove the separators.
jpbochi
You can replace the named groups with old-fashioned numbered groups, eg, `(\,)` instead of `(?<thousand>\,)`. Then, if group 1 or 3 matched anything, that's your thousands separator (TS); if group 2, 4 or 5 matched, that's the decimal separator (DS). Delete all the TS's and split on the DS, and Bob's your uncle. (If none of them participated in the match, the number's an integer--no post-processing required.)
Alan Moore
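
The numbered-group variant described in the last comment could be sketched in Python 3 as follows. Python's re module also rejects reused group names, so numbered groups are the natural fit there. The function name split_number is an assumption for illustration, and returning an empty string for a missing decimal part is a choice made here, not something the thread specifies.

```python
import re

NUMBER = re.compile(
    r"""
    ^[+-]?[0-9]{1,3}
    (?:
        (?:(,)[0-9]{3})*       # group 1: ',' as thousands separator
        (?:(\.)[0-9]{2})?      # group 2: '.' as decimal separator
    |
        (?:(\.)[0-9]{3})*      # group 3: '.' as thousands separator
        (?:(,)[0-9]{2})?       # group 4: ',' as decimal separator
    |
        [0-9]*
        (?:([.,])[0-9]{2})?    # group 5: decimal separator, no thousands
    )$
    """,
    re.VERBOSE,
)

def split_number(text):
    """Return (integer, decimal) digit strings, or None if text is invalid."""
    match = NUMBER.fullmatch(text)
    if match is None:
        return None
    thousands = match.group(1) or match.group(3)
    decimal = match.group(2) or match.group(4) or match.group(5)
    if thousands:
        # Delete every thousands separator
        text = text.replace(thousands, "")
    if decimal:
        # Split on the single decimal separator
        integer_part, decimal_part = text.split(decimal)
    else:
        integer_part, decimal_part = text, ""
    return integer_part, decimal_part

print(split_number("111.111,11"))  # ('111111', '11')
print(split_number("111,111.11"))  # ('111111', '11')
print(split_number("111,111"))     # ('111111', '')
```

A leading sign, if present, stays attached to the integer part.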
A: 

See Perl's Regexp::Common::number.

Sinan Ünür