I am parsing documents which contain large amounts of formatted numbers, an example being:
Frc consts -- 1.4362 1.4362 5.4100
IR Inten -- 0.0000 0.0000 0.0000
Atom AN X Y Z X Y Z X Y Z
1 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 1 0.40 -0.20 0.23 -0.30 -0.18 0.36 0.06 0.42 0.26
These are separate lines, all with a significant leading space, and there may or may not be significant trailing whitespace. They consist of 72, 72, 78, 78, and 78 characters. I can deduce the boundaries between fields. These are describable using Fortran format notation (nx = n spaces, an = n alphanumeric characters, in = integer in n columns, fm.n = float of m characters with n places after the decimal point) by:
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
(1x,a4,a4,3(2x,3a7))
(1x,2i4,3(2x,3f7.2))
(1x,2i4,3(2x,3f7.2))
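The line lengths quoted above (72 and 78) follow directly from the descriptors. As a rough check, here is a minimal Java sketch that sums the field widths of a format string. The class name `FormatWidth` and its methods are hypothetical, and it only understands the x, a, i and f descriptors with repeat counts and parenthesised groups as used in this question.

```java
// Hypothetical helper: sums the column widths of a Fortran-style format
// restricted to the x, a, i and f descriptors used in this question.
class FormatWidth {
    private final String s;
    private int pos = 0;

    private FormatWidth(String fmt) {
        this.s = fmt.replace(" ", "").toLowerCase();
    }

    static int width(String fmt) {
        FormatWidth p = new FormatWidth(fmt);
        int w = p.list();
        if (p.pos != p.s.length()) {
            throw new IllegalArgumentException("unexpected text at " + p.pos);
        }
        return w;
    }

    // list := item (',' item)*
    private int list() {
        int sum = 0;
        while (true) {
            sum += item();
            if (pos < s.length() && s.charAt(pos) == ',') {
                pos++;
            } else {
                return sum;
            }
        }
    }

    // item := [repeat] '(' list ')' | count 'x' | [repeat] (a|i|f) width ['.' digits]
    private int item() {
        int repeat = number(1);
        if (pos >= s.length()) {
            throw new IllegalArgumentException("truncated format");
        }
        char c = s.charAt(pos++);
        if (c == '(') {
            int inner = list();
            expect(')');
            return repeat * inner;
        }
        if (c == 'x') {
            return repeat;              // nx is n blank columns
        }
        if (c == 'a' || c == 'i' || c == 'f') {
            int w = number(-1);
            if (w < 0) {
                throw new IllegalArgumentException("missing width after " + c);
            }
            if (pos < s.length() && s.charAt(pos) == '.') {
                pos++;
                number(0);              // decimals do not change the width
            }
            return repeat * w;
        }
        throw new IllegalArgumentException("unknown descriptor '" + c + "'");
    }

    // Reads an unsigned integer at the current position, or returns dflt.
    private int number(int dflt) {
        int start = pos;
        while (pos < s.length() && Character.isDigit(s.charAt(pos))) {
            pos++;
        }
        return start == pos ? dflt : Integer.parseInt(s.substring(start, pos));
    }

    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
        pos++;
    }
}
```

For example, `FormatWidth.width("(1x,2i4,3(2x,3f7.2))")` adds 1 + 2*4 + 3*(2 + 3*7) columns.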
I have potentially several thousand different formats (which I can autogenerate or farm out) and am describing them by regular expressions built from reusable components. Thus, if regf10_4 represents a regex for any string satisfying the f10.4 constraint, I can create a regex of the form:
COMMENTS mode (whitespace in the pattern is ignored)
(\s
.{14}
\s
regf10_4,
\s{13}
regf10_4,
\s{13}
regf10_4,
)
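To make the intended composition concrete, here is a minimal Java sketch of the first format, (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4), assembled from fragments and compiled with `Pattern.COMMENTS`. The class name and the placeholder `REG_F10_4` are assumptions: the placeholder only fixes the width and captures the raw ten characters, leaving validation of the number to a separate step. This is not a full f10.4 regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedLineRegex {
    // Hypothetical placeholder for the f10.4 fragment: exactly 10 characters,
    // captured as-is so they can be checked afterwards.
    static final String REG_F10_4 = "(.{10})";

    // Pattern.COMMENTS ignores whitespace in the pattern and treats '#' as the
    // start of a comment, so a literal blank is written as \x20.
    static final Pattern LINE = Pattern.compile(
          "^\\x20          # 1x    \n"
        + "(.{14})         # a14   \n"
        + "\\x20           # 1x    \n"
        + REG_F10_4 + "    # f10.4 \n"
        + "\\x20{13}       # 13x   \n"
        + REG_F10_4 + "    # f10.4 \n"
        + "\\x20{13}       # 13x   \n"
        + REG_F10_4 + "    # f10.4 \n"
        + "\\s*$           # optional trailing whitespace",
        Pattern.COMMENTS);

    // Returns the label and the three raw 10-character fields,
    // or null if the line does not have this layout.
    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```

Because every fragment consumes a fixed number of characters, the field boundaries do not depend on what the neighbouring fields contain.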
I would like to know whether there are existing regexes that can be re-used in this way. There is a wide variety in the way computers and humans create numbers that are compatible with, say, f10.4. I believe the following are all legal input and/or output for Fortran (I do not require suffixes of the form f or d as in 12.4f) [the formatting in SO should be read as no leading spaces for the first, one for the second, etc.]
-1234.5678
1234.5678
// missing number
12345678.
1.
1.0000000
1.0000
1.
0.
0.
.1234
-.1234
1E2
1.E2
1.E02
-1.0E-02
********** // number over/underflow
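As a starting point for discussion, here is a hedged Java sketch of a check for one 10-character field that accepts the variants listed above. The class `F10Field` and its patterns are assumptions, not a complete statement of Fortran's rules. In particular it does not handle blanks embedded inside a number, and it does not apply the implied-decimal rule that, as far as I understand it, Fortran uses for input fields without a decimal point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class F10Field {
    static final int WIDTH = 10;

    // Optional blanks, optional sign, digits with an optional decimal point
    // (or a leading point), optional E/D exponent, optional blanks.
    static final Pattern NUMBER = Pattern.compile(
        "\\x20*([+-]?(?:\\d+\\.?\\d*|\\.\\d+)(?:[EeDd][+-]?\\d+)?)\\x20*");
    static final Pattern BLANK = Pattern.compile("\\x20{10}");     // missing number
    static final Pattern OVERFLOW = Pattern.compile("\\*{10}");   // over/underflow

    // Returns the parsed value, or null for a blank or asterisk-filled field.
    // Throws IllegalArgumentException for anything else.
    static Double parse(String field) {
        if (field.length() != WIDTH) {
            throw new IllegalArgumentException("field must be 10 characters");
        }
        if (BLANK.matcher(field).matches() || OVERFLOW.matcher(field).matches()) {
            return null;
        }
        Matcher m = NUMBER.matcher(field);
        if (!m.matches()) {
            throw new IllegalArgumentException("not an f10.4 field: [" + field + "]");
        }
        // Java does not accept the Fortran D exponent letter, so map it to E.
        // Caveat: a field without a decimal point is parsed here as Java would
        // parse it, without Fortran's implied decimal places.
        String text = m.group(1).replace('D', 'E').replace('d', 'e');
        return Double.valueOf(text);
    }
}
```

The caller is expected to pass exactly the ten characters of the field, for instance a group captured by a width-only line regex.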
They have to be robust against the content of the neighbouring fields (e.g. only examine precisely 10 characters in a precise position). Thus the following are legal for (a1,f5.2,a1):
a-1.23b // -1.23
- 1.23. // 1.23
3 1.23- // 1.23
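One way to get this robustness, sketched below under assumptions, is to let the record regex fix only the column widths and then validate the captured five characters on their own, so that the characters in the neighbouring a1 fields can never be absorbed into the number. This two-step approach is a substitute for a single composed regex, and the names `NeighbourSafeField`, `RECORD` and `F_FIELD` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NeighbourSafeField {
    // (a1,f5.2,a1): widths only, the content of each column is not inspected here.
    static final Pattern RECORD = Pattern.compile("(.)(.{5})(.)");

    // Simplified fixed-point field: blanks, optional sign, digits and a point.
    static final Pattern F_FIELD = Pattern.compile(
        "\\x20*([+-]?(?:\\d+\\.?\\d*|\\.\\d+))\\x20*");

    // Returns the value of the middle f5.2 field of a 7-character record.
    static double middle(String record) {
        Matcher r = RECORD.matcher(record);
        if (!r.matches()) {
            throw new IllegalArgumentException("bad record width: [" + record + "]");
        }
        Matcher f = F_FIELD.matcher(r.group(2));
        if (!f.matches()) {
            throw new IllegalArgumentException("bad f5.2 field: [" + r.group(2) + "]");
        }
        return Double.parseDouble(f.group(1));
    }
}
```

With this split, the leading `-` in `- 1.23.` and the trailing `-` in `3 1.23-` stay in their own a1 columns and do not affect the parsed value.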
I am using Java, so I need regex constructs compatible with Java 1.6 (e.g. no Perl extensions).