tags:

views:

100

answers:

3

My homework is to write a regular expression representing the language of numerical literals from C programming language. I can use l for letter, d for digit, a for +, m for -, and p for point. Assume that there are no limits on the number of consecutive digits in any part of the expression.

Some of the examples of valid numerical literals were 13. , .328, 41.16, +45.80, -2.e+7, -.4E-7, 01E-06, +0

I came up with: (d+p+a+m)(d+p+E+e+a+m)*
update2: (l+d+p+a+m)(d+p+((E+e)(a+m+d)d*) )* im not sure how to prevent something like 1.0.0.0eee-e1.

+2  A: 

Your regular expression does not support the various suffixes (l, u, f, etc.), nor does it support hexadecimal or octal constants.

The leading signs (+ or - in front of the number) are not lexically part of the constant; they are the unary + and - operators. Effectively, all integer and floating constants are positive.

If you need to fully support C99 floating constants, you need to support hexadecimal exponents (p instead of e).

Your regular expression also accepts many invalid sequences of characters, like 1.0.0.0eee-e1.

A single regular expression to match all C integer and floating literals would be quite long.

James McNellis
...neither hex nor octal... if he's going that way.
dmckee
@dmckee: True; if special handling isn't done for octal constants, `019` would get matched incorrectly (though that would not be a valid C token anyway).
James McNellis
A: 

This Regex will match all your need:

  [+-]?(?P<Dot1>\.)?\d+(?(Dot1)(?#if_dot_exist_in_the_beginning__do_nothing)|(?#if_dot_not_exist_yet__we_accept_optional_dot_now)(?P<Dot2>\.)?)\d*(?P<Exp>[Ee]?)(?(Exp)[+-]?\d*)
Vantomex
This regular expression does not match all valid C numeric constants: the sign of the number is not lexically part of the constant, octal and hexadecimal constants must be accepted, suffixes must be accepted, and hexadecimal exponents must be accepted. Also, you don't need such complex handling of the decimal point. You can take advantage of the fact that a numeric constant must begin with either a digit or the decimal point, so you can match `([0-9.])` at the beginning.
James McNellis
Hexadecimal needs a special treatment. There is an ambiguity of "e" meaning there. I just want to help him to do his homework. If hexadecimal and octal must be taken into account, the regex should become too long, and I don't even think his teacher can do it. :-)
Vantomex
@James, a numeric constant can also be preceded by either a minus or an optional plus sign.
Vantomex
and also can be preceded by a dot.
Vantomex
No, the sign is not part of the constant; the `+` and `-` that may precede the constant are the unary `+` and unary `-` operators and are lexically distinct from the constant itself. The first character of a numeric constant must be a `.` or a digit.
James McNellis
@James, sometimes complexity cannot be avoided. I think you abuse the voting-down rights. -1 is intended for an answer which is *totally* unusable/incorrect, or not related with the question. We just try to help in our busy time.
Vantomex
@James, please give us your definition of a constant, also a numerical constant. And how do you assure that the questions is about a lexical meaning of constants, not a practical meaning of constants? I know there is a concept of a signed constant in real programming worlds. Are you talking in imaginary world?
Vantomex
You can find the definition of a constant in the ISO C standard, for which I provided a link and a paragraph citation in a comment to the question a while ago. The question asks about a regular expression to match C numeric literals; it doesn't ask about a regular expression to match a subset of valid C numeric literals or a set of inputs similar to but not the same as valid C numeric literals. It is, of course, quite possible that the problem is underspecified, in which case the OP should clarify his request.
James McNellis
+1  A: 

Untested, but this should be along the right lines for decimal at least. (Also, it accepts the string ".", or I think it does anyway; to fix that would eliminate the last of the common code between integer and FP, the leading [0-9]*.)

Please don't downvote if you see a flaw. Leave a constructive comment and I'll make this community wiki.

[0-9]*([0-9]([uU](ll?+LL?)+(ll?+LL?)?[uU]?)+(\.[0-9]*)?([eE][+-]?[0-9]+)[fFlL])
Potatoswatter