views:

434

answers:

4

What is a good regular expression for handling a floating point number (i.e. like Java's Float)

The answer must match against the following targets:

1) 1.
2) .2
3) 3.14
4) 5e6
5) 5e-6
6) 5E+6
7) 7.e8
8) 9.0E-10
9) .11e12

In summary, it should

  • ignore preceeding signs
  • require the first character to the left of the decimal to be non-zero
  • allow 0 or more digits on either side of the decimal
  • permit a number without a decimal
  • allow scientific notation
  • allow capital or lowercase 'e'
  • allow positive or negative exponents

For those who are wondering, yes this is a homework problem. We received this as an assignment in my graduate CS class on compilers. I've already turned in my answer for the class and will post it as an answer to this question.

[Epilogue] My solution didn't get full credit because it didn't handle more than 1 digit to the left of the decimal. The assignment did mention handling Java floats even though none of the examples had more than 1 digit to the left of the decimal. I'll post the accepted answer in it's own post.

+1  A: 

Just make both the decimal dot and the E-then-exponent part optional:

[1-9][0-9]*\.?[0-9]*([Ee][+-]?[0-9]+)?

I don't see why you don't want a leading [+-]? to capture a possible sign too, but, whatever!-)

Edit: there might in fact be no digits left of the decimal point (in which case I imagine there must be the decimal point and 1+ digits after it!), so a vertical-bar (alternative) is clearly needed:

(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?
Alex Martelli
Note that this doesn't match anything of the form `.x` or `0.x`.
Anon.
@Alex: He might not want to capture the sign in case it's part of an expression, as in "5-2.5". That's expected if you're tokenizing things, as you would be when writing a compiler.
John Feminella
@Anon, right: `0.x` must be rejected by the second rule.
Alex Martelli
You might want to note the explicit presence of `.2` in the list of examples that need to be matched, though. I guess this is somewhere a clear spec is required. :/
Anon.
Aha, good point. Then a vertical bar is definitely needed to express the fact that there might be no digits to the left of the decimal point, but, if there are, then the first one must be non-zero. Let me edit the A to show this.
Alex Martelli
And btw, @Anon, tx and +1 for pointing out this inconsistency in the question as expressed (an example contradicting a rule;-).
Alex Martelli
@Alex: I asked about the preceeding sign and was told not to bother with it for this assignment. It'd be simple enough to prepend [-+]? though.
Kelly French
What about #4 where there is no decimal? It's the reason I have a third alternative in my answer but I'm interested in other ways to achieve the same result.
Kelly French
Ah yes - I *said* the period had to be optional, made it optional in the first RE, but the question mark slipped away as I wrote the second one;-). Editing the A to fix!
Alex Martelli
A: 

Here is what I turned in.

(([1-9]+\.[0-9]*)|([1-9]*\.[0-9]+)|([1-9]+))([eE][-+]?[0-9]+)?

To make it easier to discuss, I'll label the sections

( ([1-9]+ \. [0-9]* ) | ( [1-9]* \. [0-9]+ ) | ([1-9]+))  ( [eE] [-+]? [0-9]+ )?     
--------------------------------------------------------  ----------------------
                           A                                        B

A: matches everything up to the 'e/E'
B: matches the scientific notation

Breaking down A we get three parts

( ([1-9]+ \. [0-9]* ) | ( [1-9]* \. [0-9]+ ) | ([1-9]+) )
  -------------------   --------------------   --------
           1                       2               3

Part 1: Allows 1 or more digits from 1-9, decimal, 0 or more digits after the decimal (target 1)
Part 2: Allows 0 or more digits from 1-9, decimal, 1 or more digits after the decimal (target 2)
Part 3: Allows 1 or more digits from 1-9 with no decimal (see #4 in target list)


Breaking down B we get 4 basic parts

 ( [eE] [-+]? [0-9]+ )?   
   ---- ----- ------ --  
     1    2     3     4  

Part 1: requires either upper or lowercase 'e' for scientific notation (e.g. targets 8 & 9)
Part 2: allows an optional positive or negative sign for the exponent (e.g. targets 4, 5, & 6)
Part 3: allows 1 or more digits for the exponent (target 8)
Part 4: allows the scientific notation to be optional as a group (target 3)

Kelly French
Your first part (1) of (A) does not allow `10.`.
tur1ng
Part (1) of (A) should probably be `([1-9][0-9]*\.[0-9]*)`. A similar change is needed to part (3).
Anon.
@tur1ng: true but blame the the test input! 8-)
Kelly French
A: 

[This is the answer from the professor]

Define:

N = [1-9]
D = 0 | N
E = [eE] [+-]? D+
L = 0 | ( N D* )

Then floating point numbers can be matched with:

( ( L . D* | . D+ ) E? ) | ( L E )

It was also acceptable to use D+ rather than L, and to prepend [+-]?.

A common mistake was to write D* . D*, but this can match just '.'.

Kelly French