I want to write a regular expression for constants in the C language. I tried this:

Let

  • digit -> 0-9,
  • digit_oct -> 0-7,
  • digit_hex -> 0-9 | a-f | A-F

Then:

  • RE = digit+ U 0digit_oct+ U 0xdigit_hex+

Is this regular expression correct? Is there another way of writing it?

+2  A: 

The 'RE' makes sense if we interpret the 'U' as being similar to set union. However, it is more conventional to use a '|' symbol to denote alternatives.

First, you are only dealing with integer constants, not with floating point or character or string constants, let alone more complex constants.

Second, you have omitted '0X' as a valid hex prefix.

Third, you have omitted the various suffixes: U, L, LL, ULL (and their lower-case and mixed case synonyms and permutations).

Also, the C standard (§6.4.4.1) distinguishes between digits and non-zero digits in a decimal constant:

decimal-constant:
    nonzero-digit
    decimal-constant digit

Any integer constant starting with a zero is an octal constant, never a decimal constant. In particular, writing 0 is writing an octal constant.
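
Putting these points together, one possible pattern for integer constants can be sketched with POSIX extended regular expressions. This is only a sketch under assumptions, not text from the standard: the names `is_c_integer_constant`, `C_INT_PATTERN` and `INT_SUFFIX` are made up for illustration, `<regex.h>` is assumed to be available (a POSIX system), and only the C99/C11 forms are covered (no newer additions such as binary constants).

```c
#include <regex.h>
#include <stddef.h>

/* Optional suffix: u, l or ll, or u combined with l/ll in either order.
 * Mixed-case "lL" and "Ll" are deliberately not listed: the long long
 * suffix must be written "ll" or "LL". */
#define INT_SUFFIX "([uU](l|L|ll|LL)?|(l|L|ll|LL)[uU]?)?"

/* Hypothetical pattern; adjacent string literals are concatenated. */
static const char *const C_INT_PATTERN =
    "^("
    "[1-9][0-9]*"           /* decimal: starts with a nonzero digit   */
    "|0[0-7]*"              /* octal: includes the lone constant 0    */
    "|0[xX][0-9a-fA-F]+"    /* hexadecimal: 0x or 0X, then 1+ digits  */
    ")" INT_SUFFIX "$";

/* Returns 1 if s has the form of an integer constant, 0 if it does not,
 * and -1 if the pattern itself fails to compile. */
int is_c_integer_constant(const char *s)
{
    regex_t re;
    if (regcomp(&re, C_INT_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this pattern, `0`, `017`, `0X1f` and `10LU` are accepted, while `08`, `0x` and `10lL` are rejected. The pattern is recompiled on every call to keep the sketch short; real code would compile it once.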

Jonathan Leffler
+2  A: 

First, C does not support Unicode literals, so you can eliminate the last rule. You also only define integer literals, not floating-point literals and not string or character literals. For the sake of my convenience I assume that that is what you intended.

INT    := OCTINT | DECINT | HEXINT
DECINT := [1-9] [0-9]* [uU]? [lL]? [lL]?
OCTINT := 0 [0-7]* [uU]? [lL]? [lL]?
HEXINT := 0x [0-9a-fA-F]+ [uU]? [lL]? [lL]?

These only describe the form of the literals, not any logic such as maximum values.
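
To illustrate the difference between form and value, here is a minimal sketch of a separate range check using the standard `strtoull` function. The helper name `fits_unsigned_long_long` is hypothetical, and the input is assumed to have already been checked against the grammar above (`strtoull` on its own also accepts leading whitespace and a sign, which are not part of an integer constant).

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts text that already has the form of an
 * integer constant. Returns 1 and stores the value in *out if it fits in
 * unsigned long long, or 0 if no digits were read or the value is out of
 * range. Conversion stops at the first character that is not a digit of
 * the detected base, so a trailing suffix such as UL is simply not read. */
int fits_unsigned_long_long(const char *text, unsigned long long *out)
{
    char *end;
    errno = 0;
    /* Base 0 applies the C prefix rules: 0x/0X means hexadecimal,
     * a leading 0 means octal, anything else is decimal. */
    unsigned long long value = strtoull(text, &end, 0);
    if (end == text || errno == ERANGE)
        return 0;
    *out = value;
    return 1;
}
```

For example, `0x1F` yields 31 and `017` yields 15, while a decimal string far longer than the widest integer type can hold is reported as out of range even though it matches DECINT.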

wilhelmtell
Technically, the `+` and `-` are the unary plus and minus operators and are not part of the integer constant itself. I also _think_ that the literal `0` is considered as an octal constant (not that it matters...).
James McNellis
@James Ah. Good point.
wilhelmtell
don't forget aggregate literals such as `{ 0 }`.
Philip Potter
@Philip is this part of the formal definition of integer literals? I'm not sure ...
wilhelmtell
@Philip: That wouldn't be an _integer constant_ (or any type of _constant,_ for that matter).
James McNellis
@James @wilhelm: I wasn't suggesting aggregate literals *are* integer constants, just adding to the list "not floating-point literals and not string or character literals". I could have made myself clearer.
Philip Potter
@wilhelmtell - should that be `HEXINT := 0[xX]...` ? Also, OCTINT should not include the digits 8 and 9.
bstpierre
+7  A: 

There is another kind of integer constant, namely the integer character constant, such as 'a' or '\n'. In C99 these are constants and their type is just int.

The best regular expressions for all these are found in the standard, section 6.4, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
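
A quick way to see that a character constant has type int (when the code is compiled as C, not C++) is to compare sizes. This is a small illustrative sketch; the function name `char_constant_is_int_sized` is made up.

```c
#include <stddef.h>

/* Returns 1 if the character constant 'a' occupies as many bytes as an
 * int, which is what C's rule "character constants have type int"
 * implies; returns 0 otherwise (as a C++ compiler would report, since
 * there 'a' has type char). */
int char_constant_is_int_sized(void)
{
    return sizeof('a') == sizeof(int);
}
```

On an implementation where char and int happen to have the same size this check cannot tell the two apart, so it is a demonstration and not a proof.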

Jens Gustedt
+1 - good call.
Philip Potter