I have a regex that matches comma-separated numbers with an optional two-digit decimal part in a given multiline text.

/(?<=\s|^)\d{1,3}(,\d{3})*(\.\d{2})?(?=\s|$)/m

It successfully matches strings like 1, 12, 12.34, and 12,345.67. How can I modify it to also match a number with only the decimal part, like .23?

EDIT: Just to clarify - I would like to modify the regex so that it matches 12, 12.34 and .34

I am also looking only for 'stand-alone' valid numbers, i.e., number strings whose boundaries are either white space or the start/end of a line or of the string.

+4  A: 

This:

\d{1,3}(,\d{3})*(\.\d\d)?|\.\d\d

matches all of the following numbers:

1
12
.99
12.34 
12,345.67
999,999,999,999,999.99

If you want to exclude numbers like 123a (street addresses for example), or 123.123 (numbers with more than 2 digits after the decimal point), try:

(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d\d)?|\.\d\d)(?=\s|$)

A little demo (I guessed you're using PHP):

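$text = "666a 1 fd 12 dfsa .99 fds 12.34 dfs 12,345.67 er 666.666 er 999,999,999,999,999.99";
// Single quotes keep the backslashes and the `$` literal, so PCRE sees the pattern unchanged.
$number_regex = '/(?<=\s|^)(?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?|\.\d\d)(?=\s|$)/';
// preg_match_all() returns the number of full matches, which is 0 (falsy) when nothing matched.
if (preg_match_all($number_regex, $text, $matches)) {
  print_r($matches);
}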

which will output:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 12
            [2] => .99
            [3] => 12.34
            [4] => 12,345.67
            [5] => 999,999,999,999,999.99
        )

)

Note that it ignores the strings 666a and 666.666
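
The asker mentions ActionScript in the comments, so the same check can also be sketched in an ECMAScript-style language. The TypeScript below is an illustrative port of the PHP demo, not code from the answer: `findNumbers` is a hypothetical helper name, and the sketch assumes a runtime whose regex engine supports lookbehind (ES2018 or later).

```typescript
// Hypothetical helper: returns every stand-alone number found in `input`.
// Assumes an ES2018+ regex engine, because the pattern relies on lookbehind.
function findNumbers(input: string): string[] {
  // Same pattern as the PHP demo; the `g` flag makes match() return all hits.
  const numberRegex = /(?<=\s|^)(?:\d{1,3}(?:,\d{3})*(?:\.\d\d)?|\.\d\d)(?=\s|$)/g;
  return input.match(numberRegex) || [];
}

const sample =
  "666a 1 fd 12 dfsa .99 fds 12.34 dfs 12,345.67 er 666.666 er 999,999,999,999,999.99";

// Prints the six numbers from the PHP output; 666a and 666.666 are skipped.
console.log(findNumbers(sample));
```

Because the two lookarounds wrap the whole alternation, every candidate has to be bounded by white space or by the start/end of the input on both sides.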

Bart Kiers
But that also matches `14` in `14.` or `145` in `145.2` and `1,344.12` in `1,344.123`
Amarghosh
See my comment about word boundaries.
Bart Kiers
I'm using ActionScript, and apparently `.` is considered a word boundary, so it still matches `14` in `asd 14. asd`. I'd already tried `\b` and found this issue; that's when I chose to use the lookaround that I got from another SO thread. And by the way, if you remember, I started this with the regex that you gave me in http://stackoverflow.com/questions/1547574/regex-for-prices/1547585#1547585
Amarghosh
Yes, you're right. Then use what you already posted yourself: replace the word boundaries by `(?<=\s|^)` and `(?=\s|$)`. Can your data contain strings like `123,123.12,456.45` (ie two successive numbers separated by a comma)? If so, could you then please adjust your original question and add all these corner cases?
Bart Kiers
The numbers from my previous reply are `123,123.12` and `456.45` by the way...
Bart Kiers
Nope. I am looking for 'stand alone' valid numbers. Boundaries would be either white space or start/end of line/string. Will add this to the question. Sorry for the misunderstanding.
Amarghosh
No problem Amarghosh. See the edit, I think that covers it now.
Bart Kiers
Your updated regex that changed the structure from `(fixed.decimal)|(.decimal)` to `(fixed.decimal|.decimal)` is working fine. What is the difference between the two?
Amarghosh
Like that, there is no difference, but when doing: `^a|b$` it matches either an `a` at the beginning of the string, OR a `b` at the end. While `^(a|b)$` means: either an `a` or `b`.
Bart Kiers
Example: `a(b|c)|(d|e)f` would match ab or df, but not abf, whereas `a((b|c)|(d|e))f` would match abf, but not df or ab
Mez
Eureka.. `(fix.dec)|(.dec)` fails on `14.` because it matches either `^` followed by a `fix.dec` where dec is optional (which is `14`) followed by whatever (`.` here), or whatever followed by `.dec`. Thanks a ton, especially for those last two comments.
Amarghosh
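
The precedence point worked out in the comments above can be checked with a small sketch. The TypeScript below is illustrative only: `ungrouped` and `grouped` are the two patterns quoted in this thread, `findAll` is a hypothetical helper name, and an ES2018+ regex engine (lookbehind support) is assumed.

```typescript
// Ungrouped: `|` has the lowest precedence, so it splits the WHOLE pattern.
// The lookbehind guards only the first branch, the lookahead only the second.
const ungrouped = /(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?)|(\.\d{2})(?=\s|$)/gm;

// Grouped: the alternation is confined to the parentheses,
// so both lookarounds apply to whichever branch matches.
const grouped = /(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?|\.\d{2})(?=\s|$)/gm;

// Hypothetical helper: all matches of `re` in `input` (empty array when none).
function findAll(re: RegExp, input: string): string[] {
  return input.match(re) || [];
}

const cases = ["asd 14. asd", "asd 145.2 asd", "asd 1,344.123 asd", "x.34", "12 12.34 .34"];
for (const c of cases) {
  console.log(JSON.stringify(c), "ungrouped:", findAll(ungrouped, c), "grouped:", findAll(grouped, c));
}
```

With the ungrouped pattern the first three inputs yield `14`, `145` and `1,344.12`, as reported in the comments, and `x.34` yields `.34` because that branch has no lookbehind. The grouped pattern rejects all four, and both patterns agree on `12 12.34 .34`.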
+1  A: 
/(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?|\.(\d{2}))(?=\s|$)/m

Or, taking into account countries where . is used as the thousands separator and , as the decimal separator:

/(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?|\d{1,3}(\.\d{3})*(,\d{2})?|\.(\d{2})|,(\d{2}))(?=\s|$)/m

Insane Regex for Internationalisation

/((?<=\s)|(?<=^))(((\d{1,3})((,\d{3})|(\.\d{3}))*(((?<=(,\d{3}))(\.\d{2}))|((?<=(\.\d{3}))(,\d{2}))|((?<!((,\d{3})|(\.\d{3})))([\.,]\d{2}))))|([\.,]\d{2}))(?=\s|$)/m

Matches

14.23
14,23
114,114,114.23
114.114.114,23

Doesn't match

14.
114,114,114,23
114.114.144.23
,
.
<empty line>
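
The two lists above can be checked mechanically. The TypeScript below is a rough sketch rather than part of the answer: it copies the 'insane' pattern verbatim, treats each listed string as its own line, and uses a hypothetical helper `isWholeLineNumber`. It assumes an ES2018+ regex engine with lookbehind support.

```typescript
// The 'insane' pattern from the answer, copied verbatim.
const intlRegex =
  /((?<=\s)|(?<=^))(((\d{1,3})((,\d{3})|(\.\d{3}))*(((?<=(,\d{3}))(\.\d{2}))|((?<=(\.\d{3}))(,\d{2}))|((?<!((,\d{3})|(\.\d{3})))([\.,]\d{2}))))|([\.,]\d{2}))(?=\s|$)/m;

const shouldMatch = ["14.23", "14,23", "114,114,114.23", "114.114.114,23"];
const shouldNotMatch = ["14.", "114,114,114,23", "114.114.144.23", ",", ".", ""];

// Hypothetical helper: true when the pattern matches the entire line.
function isWholeLineNumber(line: string): boolean {
  const m = intlRegex.exec(line); // no `g` flag, so exec() keeps no state between calls
  return m !== null && m[0] === line;
}

for (const s of shouldMatch) {
  console.log(JSON.stringify(s), isWholeLineNumber(s)); // expected: true
}
for (const s of shouldNotMatch) {
  console.log(JSON.stringify(s), isWholeLineNumber(s)); // expected: false
}
```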
Mez
To begin with, `([0-9,\.])` matches only a single character. Even if you add a `+`, it would match `,` etc.
Amarghosh
Good point, I've removed that from my answer
Mez
What is the difference between your 1st regex, which encloses the whole thing in parentheses like `(fixed.decimal|.decimal)`, and `/(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?)|(\.\d{2})(?=\s|$)/`, which puts them in separate parentheses like `(fixed.decimal)|(.decimal)`? (other than the fact that the second one matches `14` in `14.`)
Amarghosh
To rephrase the last comment, why doesn't `(fixed.decimal)|(.decimal)` work? Is there any operator precedence that I am missing?
Amarghosh
I don't think there is a difference..... I'd need to see the 2 alongside each other to spot the difference.
Mez
Ah, it depends on what you've got around it. `a(fixed.decimal)|(.decimal)b` would not be the same as `a((fixed.decimal)|(.decimal))b`
Mez
`/(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?)|(\.\d{2})(?=\s|$)/m` and the first regex in your answer.
Amarghosh
In this case, no, as the lookbehinds/lookaheads are non-matching. In a case where the lookbehind or lookahead were not matching, then the brackets are there to limit the boundaries of the or.
Mez
But `/(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?|\.\d{2})(?=\s|$)/m` and `/(?<=\s|^)(\d{1,3}(,\d{3})*(\.\d{2})?)|(\.\d{2})(?=\s|$)/m` are giving me different results. 1st one works fine, but the second one matches `14` in `asd 14. asd`, `145` in `asd 145.2 asd` and `1,344.12` in `asd 1,344.123 asd`
Amarghosh
So these are line separated?
Mez
Yeah. Thanks to your and @Bart's comments about `()`, now I understand the difference. I wish I could accept both answers :)
Amarghosh
I understood the second one. But I don't dare even try the 'insane' one today. I've just started learning and I believe I'm not ready for such 'insanities' yet. Will come back later and try to break it down. Thanks again. Unfortunately I can accept only one answer, but I'm gonna upvote you somewhere else :)
Amarghosh