As was mentioned, you need some structure in your regex. In refactoring your code, I made a couple of assumptions:
- You don't want to just print it out in a tab-delimited format
- The only reason for the $x variable is so that you only print one line (although a last statement at the end of the loop would have worked just fine).
Having assumed these things, I decided that, in addressing your question, I would:
- Show you how to make a good modifiable regex.
- Code very simple "semantic actions" which store the data and let you
use it as you please.
In addition, it should be noted that I changed the input to a __DATA__ section, and output is restricted to STDERR through the use of Smart::Comments comments, which help me inspect my structures.
First the code preamble.
use strict; # always in development!
use warnings; # always in development!
use English qw<$LIST_SEPARATOR>; # It's just helpful.
#use re 'debug';
#use Smart::Comments;
Note the commented-out use re 'debug' line. If you really want to see the way a regular
expression gets parsed, it will put out a lot of information that you probably
don't want to see (though you can make your way through it with a little knowledge
about regex parsing). It's commented out because it is just not newbie
friendly, and will monopolize your output. (For more about that, see re.)
Also commented out is the use Smart::Comments line. I recommend it, but you
can get by using Data::Dumper and print Dumper( \%hash ) lines. (See Smart::Comments.)
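If you go the Data::Dumper route, a minimal sketch might look like the following. The contents of %hash are made up for illustration (shaped like the structure built later in this answer), and the two configuration variables are just my preference, not something you need.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample structure, only here so there is something to dump.
my %hash = (
    ConfigAlarms => { first_arg => 'U1', second_arg => '0' },
);

$Data::Dumper::Sortkeys = 1;    # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;    # more compact indentation

my $dump = Dumper( \%hash );
print STDERR $dump;             # STDERR, to mirror where Smart::Comments writes
```

Passing a reference ( \%hash ) rather than the hash itself keeps the dump as one nested structure instead of a flat list of keys and values.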
Specifying the Expression
But on to the regex. I used an exploded form of regex so that the parts of the
whole are explained (see perlre). We want a single alphanumeric character OR a quoted string
(with allowed escapes).
We also use a list of modifier names, so that the "language" can progress.
The next regex we make in a "do block", or as I like to call it a "localization
block", so that I can localize $LIST_SEPARATOR (aka $") to be the regex
alternation character ('|'). Thus when I include the list to be interpolated,
it is interpolated as an alternation.
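To see the localization trick in isolation, here is a small sketch. The name $alternation_re is mine, purely for illustration; the list is the same one used in the real code.

```perl
use strict;
use warnings;
use English qw<$LIST_SEPARATOR>;

my @mod_names = qw<constant fixup private>;

# With the default separator (a single space) the list is useless in a regex.
my $default = "@mod_names";

# Inside the do block, $LIST_SEPARATOR is '|' only until the block exits.
my $alternation_re = do {
    local $LIST_SEPARATOR = '|';
    qr{ \A (?: @mod_names ) \z }msx;
};

print "$default\n";
for my $word (qw<fixup public>) {
    printf "%-7s %s\n", $word,
        ( $word =~ $alternation_re ? 'matches' : 'does not match' );
}
```

Because of the local, any list interpolated after the block goes back to being joined with spaces, so nothing else in the script is affected.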
I'll give you time to look at the second regex before talking about it.
# Modifiable list of modifiers
my @mod_names = qw<constant fixup private>;
# Break out the more complex chunks into separate expressions
my $arg2_regex
= qr{ \p{IsAlnum} # accept a single alphanumeric character
| # OR
" # Starts with a double quote
(?> # -> Atomic group: we just want to group, not capture;
# the '?>' also prevents backtracking into the group
[^\\"\P{IsPrint}]+ # any print character as long as it is not
# a backslash or a double quote
| \\" # but we will accept a backslash followed by
# a double quote
| (?: \\\\ )+ # OR any amount of doubled backslashes
)* # any number of these
"
}msx;
my $line_RE
= do { local $LIST_SEPARATOR = '|';
qr{ \A # the beginning
\s* # however much whitespace you need
# A sequence of modifier names followed by space
((?: (?: @mod_names ) \s+ )*)
( \p{IsAlnum}+ ) # at least one alphanumeric character
\s* # any amount of whitespace
= # an equals sign
\s* # any amount of whitespace
< # open angle bracket
(\p{IsAlnum}+) # Alphanumeric identifier
\s+ # required whitespace
( $arg2_regex ) # previously specified arg #2 expression
[^>]*?
> # close angle bracket
}msx
;
};
The regex just says that we want any number of recognized "modifiers" separated
by whitespace, followed by an alphanumeric identifier (I'm not sure why you don't
want underscores; I don't include them, regardless).
That is followed by any amount of whitespace and an equals sign. Since the sets
of alphanumeric characters, whitespace, and the equals sign are all disjoint,
there is no reason to require whitespace. On the other side of the equals sign,
the value is delimited by angle brackets, so I don't see any reason to require
whitespace on that side either. Before the equals all you've allowed is
alphanumerics and whitespace and on the other side, it all has to be in angle
brackets. Required whitespace gives you nothing, while not requiring it is more
fault-tolerant. Ignore all that and change each * to + if you are expecting
machine output.
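A quick way to see the difference is to compare the two choices side by side. The regexes below are simplified stand-ins for the left-hand part of the line regex, and the names $lenient_re and $strict_re are mine.

```perl
use strict;
use warnings;

# Simplified stand-ins: identifier, '=', then the opening '<'.
my $lenient_re = qr{ \A \s* ( \p{IsAlnum}+ ) \s* = \s* < }msx;   # whitespace optional
my $strict_re  = qr{ \A \s* ( \p{IsAlnum}+ ) \s+ = \s+ < }msx;   # whitespace required

my @samples = ( 'ConfigAlarms = <U1 0>', 'ConfigAlarms=<U1 0>' );
for my $sample (@samples) {
    printf "%-22s lenient=%d strict=%d\n",
        $sample,
        ( $sample =~ $lenient_re ? 1 : 0 ),
        ( $sample =~ $strict_re  ? 1 : 0 );
}
```

Both forms capture the same identifier when they match; the strict form simply refuses input that a human might type without spaces around the equals sign.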
On the other side of the equals sign, we require an angle bracket pair. The pair
consists of an alphanumeric argument, with the second argument being EITHER a
single alphanumeric character (based on your spec) OR a string which can contain
escaped escapes or quotes and even a closing angle bracket, as long as it appears
before the string's closing quote.
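If you want to poke at the second-argument rule on its own, something like this sketch can help. It uses the same pattern shape as $arg2_regex but anchored at both ends, so each sample has to match in full; the name $arg2_re and the sample strings are mine.

```perl
use strict;
use warnings;

# Same shape as the second-argument regex, anchored for whole-string matching.
my $arg2_re = qr{
    \A
    (?: \p{IsAlnum}                 # a single alphanumeric character
      | "                           # OR an opening double quote
        (?> [^\\"\P{IsPrint}]+      # printable, not a backslash or quote
          | \\"                     # an escaped double quote
          | (?: \\\\ )+             # doubled backslashes
        )*
        "                           # the closing double quote
    )
    \z
}msx;

# A non-interpolating heredoc keeps every backslash exactly as typed.
my @samples = split /\n/, <<'END_SAMPLES';
0
"C:\\TMP\\ALARM.LOG"
"a > b"
"say \"hi\""
"unterminated
AB
"C:\TMP"
END_SAMPLES

my %accepted;
for my $sample (@samples) {
    $accepted{$sample} = $sample =~ $arg2_re ? 1 : 0;
    printf "%-24s %s\n", $sample, $accepted{$sample} ? 'accepted' : 'rejected';
}
```

The first four samples are accepted and the last three are rejected. Note the last one: a single backslash followed by anything other than a quote is not part of this little language, so you would have to add an alternative for it if you need it.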
Storing the Data
Once the specification has been made, here's just one of the things you can do
with it. I don't know what you wanted to do with this besides print it out, which
I'm going to assume is not the whole purpose of the script, so I simply store
each parsed line in a hash.
### $line_RE
my %fixup_map;
while ( my $line = <DATA> ) {
### $line
my ( $mod_text, $identifier, $first_arg, $second_arg )
= ( $line =~ /$line_RE/ )
;
die 'Did not parse!' unless defined $identifier;
$fixup_map{$identifier}
= { modifiers_for => { map { $_ => 1 } split /\s+/, $mod_text }
, first_arg => $first_arg
, second_arg => $second_arg
};
### $fixup_map{$identifier} : $fixup_map{$identifier}
}
__DATA__
constant fixup ConfigAlarms = <U1 0>
constant fixup ConfigAlarms2 = <U1 2>
constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">
At the end you can see the DATA section. When you're at the beginning stage, as
you seem to be here, it's most convenient to dispense with IO logic and use the
builtin handle DATA as I do here.
I collect the modifiers in a hash, so that my semantic actions could be something like:
#...
my $data = $fixup_map{$id};
#...
if ( $data->{modifiers_for}{public} ) {
#...
}
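To make that a little more concrete, here is one possible action, sketched against a hand-built %fixup_map in the same shape the parsing loop produces. The sample values and the @exported name are made up for illustration; I have no idea what your real actions need to be.

```perl
use strict;
use warnings;

# Hand-built sample in the same shape the parsing loop produces.
my %fixup_map = (
    ConfigAlarms => {
        modifiers_for => { constant => 1, fixup => 1 },
        first_arg     => 'U1',
        second_arg    => '0',
    },
    AlarmFileName => {
        modifiers_for => { constant => 1, fixup => 1, private => 1 },
        first_arg     => 'A',
        second_arg    => '"ALARM.LOG"',
    },
);

# Hypothetical action: collect the identifiers that are not marked private.
my @exported
    = grep { !$fixup_map{$_}{modifiers_for}{private} } sort keys %fixup_map;

print "exported: @exported\n";
```

Storing the modifiers as hash keys means each check is a plain lookup, and a modifier that was never seen simply comes back false.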
Soap Box
The main problem, however, is that you don't seem to have a plan. For the second "argument" in the angle brackets, you have a regex that specifies only a single alphanumeric character, but want to expand it to allow escaped strings. I have to expect that you are implementing a small subset and gradually want to expand it to do other things. If you neglect a good design from the beginning, it's only going to become more and more of a headache to implement the full-featured "parser".
You may want to implement multi-line values at some point. If you don't understand how to get from a single alphanumeric to a quote-delimited argument, be aware that moving away from the line-by-line method, and the adjustments to the regex that go with it, dwarf that complexity gap.
So I advise you to use the code here only as a guideline for expanding complexity. I'm answering a question and indicating a direction, not designing or coding a project, so my regex code isn't as expandable as it probably should be.
If the parsing job were complex enough, I would specify a minimal lookahead grammar for Parse::RecDescent and stick to coding the semantic actions. That's another recommendation.