This sounds like a job for a regex and an array of hashes.
First, let's create a pattern that can find the information. You are looking for the constant string "The Scheme GUID: " followed by a contiguous run of alphanumeric and hyphen characters, then a space, then a contiguous run of alphanumeric characters surrounded by parentheses. In regex, this is /The Scheme GUID: [a-zA-Z0-9-]+ \([a-zA-Z0-9]+\)/. That will only match the string, and we want to pull out pieces of it, so we need to add captures to the regex and catch its return value:
my ($guid, $scheme) = /The Scheme GUID: ([a-zA-Z0-9-]+) \(([a-zA-Z0-9]+)\)/;
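The () mark the parts of the string we want to save and are called captures. In list context, a successful match returns the captured substrings, which is why they can be assigned straight to a list of variables.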
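To see the captures at work, here is a minimal sketch that runs the pattern against a single hard-coded line. The sample line and the $line variable are made up for illustration; the format is assumed from the description above.

```perl
use strict;
use warnings;

# Hypothetical sample line in the format described above.
my $line = "The Scheme GUID: 123-abc (Scheme1)";

# In list context a successful match returns the captured substrings.
my ($guid, $scheme) =
    $line =~ /The Scheme GUID: ([a-zA-Z0-9-]+) \(([a-zA-Z0-9]+)\)/;

print "guid: $guid, scheme: $scheme\n";
```

If the line does not match, the match returns an empty list and both variables are left undefined, so real code should check for that before using them.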
Now that we have the values, you want to create a record-like structure. In Perl, you commonly use a hash for this purpose:
my %record = (
    guid   => $guid,
    scheme => $scheme,
);
You can now access the guid by saying $record{guid}. To build an array of these records, just push a reference to the record onto an array:
my @records;
while (<>) {
    my ($guid, $scheme) = /The Scheme GUID: ([a-zA-Z0-9-]+) \(([a-zA-Z0-9]+)\)/;
    my %record = (
        guid   => $guid,
        scheme => $scheme,
    );
    push @records, \%record;
}
You can now access the third record's scheme like this: $records[2]{scheme}.
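As a small illustration of the array-of-hashes layout, the sketch below builds three records by hand (hypothetical values, not parsed from input) and shows both ways of reaching a field.

```perl
use strict;
use warnings;

# Hypothetical records, written out by hand rather than parsed.
my @records = (
    { guid => "123-abc", scheme => "Scheme1" },
    { guid => "456-def", scheme => "Scheme2" },
    { guid => "789-ghi", scheme => "Scheme3" },
);

# Each element is a hash reference, so use -> to reach a field.
for my $record (@records) {
    print "$record->{guid} uses $record->{scheme}\n";
}

# The arrow is optional between adjacent subscripts.
my $third_scheme = $records[2]{scheme};
my $count        = scalar @records;
print "third scheme: $third_scheme (of $count records)\n";
```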
Your last requirement requires a change to the regex. You need to look for that star and do something special if you see it. Unfortunately, the star means something to regexes, so you will need to escape it as you did with the parentheses. The star is also not always present, so you will need non-capturing parentheses (?:) and the ? quantifier to tell the regex that it is okay if that part of the string does not match:
my ($guid, $scheme, $star) = /The Scheme GUID: ([a-zA-Z0-9-]+) \(([a-zA-Z0-9]+)\)(?: (\*))?/;
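A minimal sketch of how the optional group behaves, using two made-up lines: when the star is missing, the match still succeeds and $star is simply left undefined. The qr// object and the @starred array are only there for the illustration.

```perl
use strict;
use warnings;

# Same pattern as above, precompiled so it can be reused.
my $re = qr/The Scheme GUID: ([a-zA-Z0-9-]+) \(([a-zA-Z0-9]+)\)(?: (\*))?/;

# Hypothetical sample lines: one starred, one not.
my @lines = (
    "The Scheme GUID: 123-abc (Scheme1) *",
    "The Scheme GUID: 456-def (Scheme2)",
);

my @starred;
for my $line (@lines) {
    my ($guid, $scheme, $star) = $line =~ $re;
    # A capture inside an optional group that did not match is undef.
    push @starred, defined $star ? 1 : 0;
    print "$guid is ", (defined $star ? "starred" : "not starred"), "\n";
}
```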
The regex has gotten very long and hard to read at this point, so it is probably a good idea to use the /x flag and add some whitespace and comments to the regex. Under /x, whitespace in the pattern is ignored, so each literal space has to be written as [ ]:
my ($guid, $scheme, $star) = m{
    The [ ] Scheme [ ] GUID: [ ]
    ([a-zA-Z0-9-]+)          #capture the guid
    [ ]
    \( ([a-zA-Z0-9]+) \)     #capture the scheme
    (?:
        [ ]
        (\*)                 #capture the star if it exists
    )?
}x;
The way I would write a program like this is:
#!/usr/bin/perl

use strict;
use warnings;

my $primary_record;
my @records;
while (<DATA>) {
    next unless my ($guid, $scheme, $star) = m{
        The [ ] Scheme [ ] GUID: [ ]
        ([a-zA-Z0-9-]+)          #capture the guid
        [ ]
        \( ([a-zA-Z0-9]+) \)     #capture the scheme
        (?:
            [ ]
            ([*])                #capture the star if it exists
        )?
    }x;

    my %record = (
        guid    => $guid,
        scheme  => $scheme,
        starred => defined $star ? 1 : 0
    );

    if ($record{starred}) {
        $primary_record = \%record;
    }

    push @records, \%record;
}

print "records:\n";
for my $record (@records) {
    print "\tguid: $record->{guid} scheme: $record->{scheme}\n";
}

print "primary record is $primary_record->{guid}\n";

__DATA__
The Scheme GUID: 123-abc (Scheme1) *
The Scheme GUID: 456-def (Scheme2)
The Scheme GUID: 789-ghi (Scheme3)
If you have the data in an array, you can replace the while loop with a for loop:
for my $line (@lines) {
    next unless my ($guid, $scheme, $star) = $line =~ m{
        The [ ] Scheme [ ] GUID: [ ]
        ([a-zA-Z0-9-]+)          #capture the guid
        [ ]
        \( ([a-zA-Z0-9]+) \)     #capture the scheme
        (?:
            [ ]
            ([*])                #capture the star if it exists
        )?
    }x;
    # ... build and store the record exactly as in the while loop ...
}
The next unless match idiom says to move on to the next line if this one doesn't match the regex. The m{regex} form is the generalized form of /regex/. I tend to use the generalized form when I stretch a regex across multiple lines because it makes matching the beginning and ending of the regex easier in my editor.
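Here is a short sketch of both points together, assuming the data is already in an array called @lines (a hypothetical name, with made-up contents that include one line in the wrong format). The pattern is cut down to the guid capture to keep the example small.

```perl
use strict;
use warnings;

# Hypothetical input: the middle line is not in the expected format.
my @lines = (
    "The Scheme GUID: 123-abc (Scheme1) *",
    "some unrelated line",
    "The Scheme GUID: 456-def (Scheme2)",
);

my @guids;
for my $line (@lines) {
    # m{...} is the same match operator as /.../ with different delimiters.
    # A failed match returns an empty list, so the loop skips that line.
    next unless my ($guid) = $line =~ m{The Scheme GUID: ([a-zA-Z0-9-]+)};
    push @guids, $guid;
}

print "@guids\n";
```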