I have a string in which different predefined keywords introduce different pieces of data. Is there a way to parse it with a clever regexp, or something similar? Here is an example:

Keywords can be "first name: " and "last name: ". Now I want to parse:

"character first name: Han last name: Solo"

into

{ "first name: " => "Han ", "last name: " => "Solo" }

Of course, the order of the keywords in the input string is not fixed. This should also work on:

"character last name: Solo first name: Han"

I understand there are issues to be raised with spaces and so on. I'll ignore them here.

I know how to solve this problem by looping over the different keywords, but I don't find that very pretty.

Split almost fits the bill. Its only problem is that it returns an array and not a hash, so I don't know which is the first name or the last name.

My example is somewhat misleading. Here is another one:

my @keywords = ("marker 1", "marker 2", "marker 3");
my $rawString = "beginning marker 1 one un marker 2 two deux marker 3 three trois and the rest";
my %result;
# <grind result>
print Dumper(\%result);

will print:

$VAR1 = {
      'marker 2' => ' two deux ',
      'marker 3' => ' three trois and the rest',
      'marker 1' => ' one un '
    };
+2  A: 
use strict;
use warnings;
use Data::Dump 'dump';   # dump allows you to see what %character 'looks' like

my $string = "character first name: Han last name: Solo";
my %character;
my $nameTag = qr{(?:first|last) name:\s*};

# Use a hash slice to populate the hash in one go
@character{ ($1, $3) } = ($2, $4) if $string =~ /($nameTag)(.+)($nameTag)(.+)/;

dump %character; # prints e.g. ("last name: ", "Solo", "first name: ", "Han "); key order may vary
Zaid
I couldn't make your example work. Please note that the keywords have common substrings only by accident; e.g., a third keyword could be `"hair color"`.
Jean-Denis Muys
@Jean-Denis Muys : Yeah, I had forgotten to make the nested grouping non-capturing. It should work now. This solves the original problem. Now for the more generic case...
Zaid
This is pretty slick (once I got it to work :)
brian d foy
A: 

Use Text::ParseWords. It probably doesn't do all of what you want, but you're much better building on it than trying to solve the whole problem from scratch.

Colin Fine
A: 

This is possible IF:

1) You can identify a small set of regexes that can pick out the tags.
2) The regex for extracting the value can be written so that it picks out only the value and ignores following extraneous data, if any, between the end of the value and the start of the next tag.

Here's a sample of how to do it with a very simple input string. This is a debug session:

  DB<14> $a = "a 13 b 55 c 45";
  DB<15> %$b = $a =~ /([abc])\s+(\d+)/g;
  DB<16> x $b
0  HASH(0x1080b5f0)
   'a' => 13
   'b' => 55
   'c' => 45
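If the values have no fixed shape and should simply run until the next keyword or the end of the string, one possible adaptation is a lazy capture followed by a lookahead. The following is only a sketch using the marker example from the question; the variable `$kw` is introduced here for illustration and is not part of the original answer:

```perl
use strict;
use warnings;
use Data::Dumper;

my @keywords  = ("marker 1", "marker 2", "marker 3");
my $rawString = "beginning marker 1 one un marker 2 two deux marker 3 three trois and the rest";

# Escape each keyword so it matches literally, then join into one alternation.
# ($kw is a name chosen for this sketch.)
my $kw = join '|', map { quotemeta } @keywords;

# Capture a keyword, then lazily capture everything up to (but not including)
# the next keyword or the end of the string. In list context, /g returns all
# ($1, $2) pairs, which are assigned straight into the hash.
my %result = $rawString =~ /($kw)(.*?)(?=$kw|\z)/gs;

$Data::Dumper::Sortkeys = 1;
print Dumper(\%result);
```

As with the example above, a keyword that occurs more than once keeps only its last value, because later pairs overwrite earlier ones in the hash.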
Jim Garrison
Condition 1 is yes: the set of keywords is determined in advance. Condition 2 is no: the data stops whenever a new keyword starts, or at the end of the string, whichever comes first. I had hoped the right kind of greediness might help.
Jean-Denis Muys
Why the downvote? This is a perfectly good approach and quite usable if you can write generic regexes for the tag and value.
Jim Garrison
+7  A: 

Here is a solution using split (with separator retention mode) that is extensible with other keys:

use warnings;
use strict;

my $str = "character first name: Han last name: Solo";

my @keys = ('first name:', 'last name:');

my $regex = join '|' => @keys;

my ($prefix, %hash) = split /($regex)\s*/ => $str;

print "$_ $hash{$_}\n" for keys %hash;

which prints:

last name: Solo
first name: Han 

To handle keys that contain regex metacharacters, replace the my $regex = ... line with:

 my $regex = join '|' => map {quotemeta} @keys;
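Two edge cases may be worth guarding against: a key that is a prefix of another key, and a key that appears last in the string with no value after it (without a LIMIT, `split` drops trailing empty fields, which leaves an odd number of elements for the hash). The sketch below is one way to handle both, under the assumption that sorting keys by length and passing a LIMIT of -1 is acceptable for your data. The key `hair|color:` and the input string are made up for illustration:

```perl
use strict;
use warnings;

# 'hair|color:' is a hypothetical key containing a regex metacharacter.
my @keys = ('first name:', 'last name:', 'hair|color:');
my $str  = "character hair|color: brown last name: Solo first name:";

# Longest keys first, each escaped so that characters such as "|" match literally.
my $regex = join '|',
            map  { quotemeta }
            sort { length($b) <=> length($a) } @keys;

# A LIMIT of -1 keeps the trailing empty field, so a key with no value
# still produces a complete (key, value) pair.
my ($prefix, %hash) = split /($regex)\s*/, $str, -1;

print "$_ => '$hash{$_}'\n" for sort keys %hash;
```

Here `first name:` ends up with an empty string as its value instead of triggering an "Odd number of elements in hash assignment" warning.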
Eric Strom
Thank you. It's perfect. I didn't know split could return a hash as you show here. Also surprising to me is your use of => as an argument separator. Is this a common idiom?
Jean-Denis Muys
`split` always returns a list. You can assign a list to a hash. `=>` is the "fat comma": It has the effect of automatically quoting a bareword preceding it.
Sinan Ünür
OK I got it, and now I also appreciate the elegance of the solution. Today is a good day: I learned two things.
Jean-Denis Muys
It's not just that split is returning a list, but that it's using separator retention mode.
brian d foy
Damn, it breaks when one of the keys contains a "|". I guess I will need to "quote" the $regex content. What's the best way to do that?
Jean-Denis Muys
@Jean-Denis Muys: Filter the keys through `quotemeta` before building the regex.
Michael Carman
`quotemeta` would work... if only it didn't quote individual bytes of multibyte characters in UTF8 strings. One of my actual keywords is `"approuvé le"`. The `é` is a two-byte char in utf8.
Jean-Denis Muys
@Jean-Denis Muys => at least in perl 5.12 (the only one I have installed on this box), the code `"approuvé le" =~ quotemeta "approuvé le"` returns true
Eric Strom
Yes it does. It messes with my logging text, but that doesn't matter much. Thanks.
Jean-Denis Muys
+1  A: 

This works.

use 5.010;
use Regexp::Grammars;
my $parser = qr{
        (?:
            <[Name]>{2}
        )
        <rule: Name>
            ((?:fir|la)st name: \w+)
}x;

while (<DATA>) {
    /$parser/;
    use Data::Dumper; say Dumper $/{Name};
}

__DATA__
character first name: Han last name: Solo
character last name: Solo first name: Han

Output:

$VAR1 = [
          ' first name: Han',
          ' last name: Solo'
        ];

$VAR1 = [
          ' last name: Solo',
          ' first name: Han'
        ];
daxim
Regexp::Grammars is the new black.
brian d foy
Ugh. Scary Damian ware. It's usually shiny to look at but the shine wears off with time. In the end, using another parser generator (Parse::Yapp/Eyapp are my favourite) is probably your best bet if you need one at all.
tsee
A Yapp is fine too. (While we are at snowcloning〜…)
daxim
+3  A: 

The following loops over the string once to find matches (after normalizing the string). The only way you can avoid the loop is if each keyword can only appear once in the text. If that were the case, you could write

my %matches = $string =~ /($re):\s+(\S+)/g;

and be done with it.
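As a self-contained sketch of that one-liner, assuming each keyword occurs at most once and each value is a single whitespace-free word (a plain alternation stands in for the `presuf` call used in the script below):

```perl
use strict;
use warnings;

my $string = "character last name: Solo first name: Han";

# Plain alternation standing in for presuf(); each keyword is escaped.
my $re = join '|', map { quotemeta } 'first name', 'last name';

# In list context, /g returns every ($1, $2) pair, which fills the hash directly.
my %matches = $string =~ /($re):\s+(\S+)/g;

print "$_ => $matches{$_}\n" for sort keys %matches;
```

If a keyword did appear twice, the later value would silently replace the earlier one.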

The script below deals with possible multiple occurrences.

#!/usr/bin/perl

use strict; use warnings;

use File::Slurp;
use Regex::PreSuf;

my $re = presuf( 'first name', 'last name' );

my $string = read_file \*DATA;
$string =~ s/\n+/ /g;

my %matches;

while ( $string =~ /($re):\s+(\S+)/g ) {
    push @{ $matches{ $1 } }, $2;
}

use Data::Dumper;
print Dumper \%matches;

__DATA__
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore character first name: Han last
name: Solo et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud character last name: Solo first name: Han exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute
irure dolor in reprehenderit in voluptate velit esse cillum
character last name: Solo first name: Han dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum
Sinan Ünür