I can't for the life of me understand why the following regex can't match 4 floats. There are a couple of rules for the way these floats can be written.

  • the float ranges from 0 to 1
  • you can skip the first digit if it's 0
  • there is an unlimited number of digits after the period.

These are valid floats

  • 1
  • 1.0
  • 0.0
  • .0
  • 0.123
  • .123

Now for the code. Among other things, I've tried:

string input = " 0 0 0 .4";
string regex = @"[0-1]*(\.[0-9]*)*\s[0-1]*(\.[0-9]*)*\s[0-1]*(\.[0-9]*)*\s[0-1]*(\.[0-9]*)*";
Regex r = new Regex(regex, RegexOptions.Compiled);
Match m = r.Match(input);

m.Value returns " 0 0 0", where I'd expect it to return "0 0 0 .4".

I've tried

[0-1]{0,1}(\.[0-9]*)*\s[0-1]{0,1}(\.[0-9]*)*\s[0-1]{0,1}(\.[0-9]*)*\s[0-1]{0,1}(\.[0-9]*)*

as well, but it looks like .NET does not cope well with the {0,1} syntax (or I am just using it wrong).

I've tried looking at http://www.regular-expressions.info/reference.html and the {0,1} should be valid, to my understanding at least.

I managed to make a regex that matched the string in the little regex matcher tool I have at my disposal, but that regex did not work with the .NET Regex class.

UPDATE

I'm using the regex in conjunction with a Tokenizer parsing a larger document.

Combining what Pavel Minaev and psasik wrote, the following regex made the expected match

([0,1]|([0,1]?\.[0-9]+))\s([0,1]|([0,1]?\.[0-9]+))\s([0,1]|([0,1]?\.[0-9]+))\s([0,1]|([0,1]?\.[0-9]+))

The following matches the actual float

([0,1]|([0,1]?\.[0-9]+))
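
Two things are worth checking about the pattern above. Inside a character class the comma is a literal, so [0,1] accepts "," as well as "0" and "1" ([01] is presumably what was meant). And because the single-digit alternative comes first and nothing anchors the end, a trailing "0.5" can be cut short to "0". A minimal sketch to observe both; the class and method names are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class CharClassCheck
{
    // The single-float pattern from the update, unchanged.
    const string FloatFromUpdate = @"([0,1]|([0,1]?\.[0-9]+))";

    // True when the whole token is accepted by the update's float pattern.
    public static bool AcceptedByUpdate(string token)
    {
        return Regex.IsMatch(token, "^" + FloatFromUpdate + "$");
    }

    // Runs the four-float pattern from the update and returns the matched text.
    public static string MatchFour(string input)
    {
        string four = FloatFromUpdate + @"\s" + FloatFromUpdate + @"\s"
                    + FloatFromUpdate + @"\s" + FloatFromUpdate;
        return Regex.Match(input, four).Value;
    }
}
```

CharClassCheck.AcceptedByUpdate(",") comes back true, and CharClassCheck.MatchFour("0 0 0 0.5") gives "0 0 0 0" rather than "0 0 0 0.5". For the input in the question, MatchFour(" 0 0 0 .4") does give "0 0 0 .4".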
+1  A: 
float [0-1]|([0-1]?\.[0-9]+)
ws [ \t]

{ws}*{float}{ws}+{float}{ws}+{float}{ws}+{float}{ws}*
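
The {ws}/{float} notation above looks like lex-style named definitions, which the .NET Regex class doesn't expand by itself. One way to get the same effect in C# is plain string concatenation. This is only a sketch: the class name, the wrapping of the float piece in a non-capturing group, and the ^/$ anchors are my additions, not part of the definitions above.

```csharp
using System.Text.RegularExpressions;

public static class FourFloats
{
    // Named pieces, mirroring the definitions above. The non-capturing
    // group keeps the alternation from leaking into its neighbours.
    const string Float = @"(?:[0-1]|[0-1]?\.[0-9]+)";
    const string Ws = @"[ \t]";

    // ^ and $ are an addition (assumption): they force the whole input to
    // be consumed, so a trailing "0.5" cannot be cut short to "0".
    static readonly Regex Pattern = new Regex(
        "^" + Ws + "*" + Float + Ws + "+" + Float + Ws + "+"
            + Float + Ws + "+" + Float + Ws + "*$");

    // True when the input is exactly four floats separated by spaces/tabs.
    public static bool IsFourFloats(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

FourFloats.IsFourFloats(" 0 0 0 .4") is true, while "0 0 0" (only three numbers) and "2 0 0 0" are rejected.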
DevDevDev
ended up using ([0,1]|([0,1]?\.[0-9]+))\s([0,1]|([0,1]?\.[0-9]+))\s([0,1]|([0,1]?\.[0-9]+))\s([0,1]|([0,1]?\.[0-9]+)) and getting a successful match :)
thmsn
Cool. Just thought I'd make it a bit easier to read/maintain for you, also maybe you want variable whitespace.
DevDevDev
+3  A: 

For starters, your regex is wrong in general - because of overuse of *, it will happily match something like 10101.10101.10101.

The reason for your peculiar match result is that your input string starts with a space " " character. Thus the match goes like this:

  • first [0-1]* matches empty string at the beginning
  • first (\.[0-9]*)* matches empty string "following" that empty string
  • first \s matches the starting space character in the input
  • second [0-1]* matches the first 0 in the input ...
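  • third \s matches the third space character in the input (the one preceding the third 0)
  • fourth [0-1]* matches the third 0, and the fourth (\.[0-9]*)* matches an empty string, so the match ends there, before " .4"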

No groups actually match anything (or rather they all match empty strings, because you use *).
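
The walk-through above can be checked by looking at where the match starts and how long it is. A small sketch, with the pattern copied from the question and a made-up class name:

```csharp
using System.Text.RegularExpressions;

public static class OriginalPatternTrace
{
    // The pattern from the question, unchanged.
    const string Original =
        @"[0-1]*(\.[0-9]*)*\s[0-1]*(\.[0-9]*)*\s[0-1]*(\.[0-9]*)*\s[0-1]*(\.[0-9]*)*";

    // Returns the first match so Index, Length and Value can be inspected.
    public static Match Run(string input)
    {
        return Regex.Match(input, Original);
    }
}
```

For " 0 0 0 .4" the match starts at index 0 and is 6 characters long, i.e. " 0 0 0": the leading space was consumed by the first \s. Without the leading space, Run("0 0 0 .4").Value is "0 0 0 .4", but that doesn't make the pattern right; it still accepts the junk mentioned at the top of this answer.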

Pavel Minaev
Could you go into more detail as to why they match as you've written? I thought * meant 0 or more matches
thmsn
It does, and "0 matches" will, quite obviously, match an empty string. It _has_ to match that space at the very beginning somehow, and the first thing that matches it in your regex is `\s`. So it tries to match everything preceding `\s` as well, and since you're using `*` everywhere, it matches it all against empty string "preceding" the space.
Pavel Minaev
A: 

I don't know about C#, but the following regex should meet your requirements:

(?:(?<=\s)\.\d+|0\.\d+|[01]|1\.0)(?=\s|$)

Edit: Oh and if you want to check if there are exactly 4 floats in the string it would be like this:

(?:(?:(?<=\s)\.\d+|0\.\d+|[01]|1\.0)(?:\s|$)){4}

A little explanation on the Expression:

The outer group (?: ) is just for repeating the whole thing 4 times. The first inner group is what actually matches the floats. There are four cases:

  • (?<=\s)\.\d+ This matches a dot followed by at least one digit, if it is preceded by a whitespace. Matches .123, .1 etc. The (?<=\s) is a positive lookbehind. The difference between just a simple \s and (?<=\s) is that in the second case the whitespace is not part of the match
  • 0\.\d+ This matches a zero followed by a dot followed by at least one digit, e.g. 0.1, 0.123, 0.88
  • [01] This matches 0 or 1
  • 1\.0 The last possibility, which is 1.0 and according to your requirements the upper boundary of the float

The second inner group matches either a whitespace character or the end of the string. So in English, the expression means 'match one of the first group followed by one of the second group, repeated four times'.
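
Since the answer wasn't written against C#, here is a small sketch for trying both expressions with the .NET Regex class, which supports the lookbehind and lookahead used above. The class and method names are invented for the example:

```csharp
using System.Text.RegularExpressions;

public static class LookaroundCheck
{
    // Both patterns are copied from the answer, unchanged.
    const string Single = @"(?:(?<=\s)\.\d+|0\.\d+|[01]|1\.0)(?=\s|$)";
    const string Four = @"(?:(?:(?<=\s)\.\d+|0\.\d+|[01]|1\.0)(?:\s|$)){4}";

    // Number of separate floats the single-float pattern finds in the input.
    public static int CountFloats(string input)
    {
        return Regex.Matches(input, Single).Count;
    }

    // Text matched by the four-float pattern (empty string when it fails).
    public static string MatchFour(string input)
    {
        return Regex.Match(input, Four).Value;
    }
}
```

CountFloats("1 1.0 0.0 .0 0.123 .123") finds all six valid floats from the question, and MatchFour(" 0 0 0 .4") gives "0 0 0 .4". One thing to be aware of: because of the (?<=\s) lookbehind, a float like ".5" at the very start of the input is not matched, so CountFloats(".5") is 0.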

Schtibe
A: 

Try this one:

[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?

From this great page: Regex Float Example
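
Note that this general-purpose pattern also accepts signs, exponents and values outside 0 to 1, so it doesn't enforce the rules in the question. If it is used anyway, one option is to check the range after parsing. A rough sketch, assuming each token has already been split out; the class name, the ^/$ anchors and the range check are my additions:

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnitFloat
{
    // The generic float pattern from above, anchored to a whole token.
    static readonly Regex GenericFloat =
        new Regex(@"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$");

    // True when the token looks like a float and its value is within [0, 1].
    public static bool IsUnitFloat(string token)
    {
        if (!GenericFloat.IsMatch(token)) return false;

        double value;
        // InvariantCulture so that '.' is always the decimal separator.
        return double.TryParse(token, NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value)
            && value >= 0.0 && value <= 1.0;
    }
}
```

This accepts ".4", "1" and "1.0" and rejects "1.5" and "-0.1". It also accepts forms like "1e-3", which the rules in the question don't mention, so it is looser than the hand-written patterns in the other answers.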

Paul Sasik
A: 

I would use

(?:0(?:\.\d+)?|1(?:\.0+)?|\.\d+)(?:\s+(?:0(?:\.\d+)?|1(?:\.0+)?|\.\d+)){3}

The regex for a single number being

0(?:\.\d+)?|1(?:\.0+)?|\.\d+

which matches:

  • a zero, optionally followed by a decimal point and one or more digits, or

  • a one, optionally followed by a decimal point and one or more zeroes, or

  • a decimal point followed by one or more digits.

It's not as compact as your latest core regex, ([01]|([01]?\.[0-9]+)), but it's much clearer, both to the regex engine and to the human reader. If you need to capture the numbers individually, you'll have to get rid of the {3} quantifier and spell the whole thing out. Don't be afraid to split a regex up into multiple lines for readability:

string regex = @"(0(?:\.\d+)?|1(?:\.0+)?|\.\d+)\s+"
             + @"(0(?:\.\d+)?|1(?:\.0+)?|\.\d+)\s+"
             + @"(0(?:\.\d+)?|1(?:\.0+)?|\.\d+)\s+"
             + @"(0(?:\.\d+)?|1(?:\.0+)?|\.\d+)";

EDIT: I don't speak C#, but I just read that verbatim strings can span multiple lines. That means you could also take advantage of free-spacing mode:

string regex = @"(?x)
                 (0(?:\.\d+)?|1(?:\.0+)?|\.\d+)\s+
                 (0(?:\.\d+)?|1(?:\.0+)?|\.\d+)\s+
                 (0(?:\.\d+)?|1(?:\.0+)?|\.\d+)\s+
                 (0(?:\.\d+)?|1(?:\.0+)?|\.\d+)
                ";

Or, instead of using the inline modifier, (?x), you can pass the appropriate flag to the constructor:

Regex r = new Regex(regex, RegexOptions.IgnorePatternWhitespace);

Either way, the regex compiler ignores all unescaped whitespace in the pattern.
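
A side note on the {3} quantifier: in most flavors a capturing group inside a repetition only keeps its last iteration, but .NET records every iteration in Group.Captures, so the compact form can still hand back all four numbers. A sketch combining that with free-spacing mode; the class and method names are invented for the example:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnitFloatCaptures
{
    // Free-spacing pattern: group 1 is the first number, group 2 is
    // repeated three times for the remaining numbers.
    const string Pattern = @"
        (0(?:\.\d+)?|1(?:\.0+)?|\.\d+)                 # first number
        (?: \s+ (0(?:\.\d+)?|1(?:\.0+)?|\.\d+) ){3}    # three more
    ";

    // Returns the four numbers as strings, or an empty array on no match.
    public static string[] Extract(string input)
    {
        Match m = Regex.Match(input, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!m.Success) return new string[0];

        var values = new List<string>();
        values.Add(m.Groups[1].Value);
        // Captures holds every iteration of group 2, not just the last one.
        foreach (Capture c in m.Groups[2].Captures)
        {
            values.Add(c.Value);
        }
        return values.ToArray();
    }
}
```

UnitFloatCaptures.Extract(" 0 0 0 .4") returns the four strings "0", "0", "0" and ".4".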

Alan Moore
A: 

Maybe? (\d|\.\d+|\d\.\d+)\s+(\d|\.\d+|\d\.\d+)\s+(\d|\.\d+|\d\.\d+)\s+(\d|\.\d+|\d\.\d+)

Havenard
A: 

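This captures a float with the shape your rules describe (at most one digit before the period), although it doesn't restrict the value to the 0 to 1 range, so "9.5" also passes: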

/^(\d?\.?\d+)$/

This captures things like "12.1", i.e. floats > 1:

/^(\d*\.?\d+)$/

Since the regexp is so short, I'd simply copy it four times and then put \s+ between the capturing parentheses:

/^(\d*\.?\d+)\s+(\d*\.?\d+)\s+(\d*\.?\d+)\s+(\d*\.?\d+)$/

In case you can use PCRE and wish to shorten the expression:

/^(?:(\d*\.?\d+)\s+){3}(\d*\.?\d+)$/

Check what happens to capturing parentheses inside a repetition, though; in many dialects only the last repetition's capture is kept. That depends on the regexp dialect of your language.

polemon
Oh, I forgot about negative values! Do you need them?
polemon