ansaurus

Question

How do I match a list of things on regex?

Answer 1

+5 A:

You need two regexes, one to identify new categories and one to parse user records.

#!/usr/bin/perl

use strict;
use warnings;

my %users;
my $cur;
while (<DATA>) {
    if (my ($category) = /^(.*)--$/) {
        $cur = $category;
        next;
    }
    next unless my ($id, $user) = /([0-9]+): (\w+)/;
    die "no category found" unless defined $cur;
    $users{$user}{$cur} = $id;
}

use Data::Dumper;
print Dumper \%users;

__DATA__
CategoryA--
5: UserA
6: UserB
7: UserC
CategoryB--
4: UserA
5: UserB

Or, if you have Perl 5.10 or later, you can use named captures with one regex:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

my %users;
my $cur;
while (<DATA>) {
    next unless /^(?:(?<category>.*)--|(?<id>[0-9]+): (?<user>\w+))$/;
    if (exists $+{category}) {
        $cur = $+{category};
        next;
    }
    die "no category found" unless defined $cur;
    $users{$+{user}}{$cur} = $+{id};
}

use Data::Dumper;
print Dumper \%users;

__DATA__
CategoryA--
5: UserA
6: UserB
7: UserC
CategoryB--
4: UserA
5: UserB

Chas. Owens 2009-05-20 15:06:15

You could probably make this a fair bit simpler by changing the input record separator to "--\n".

Nic Gibson 2009-05-20 15:54:13

@newt: The "--\n" sequence doesn't mark the end of a record. Setting $/ to that would get no data for CategoryA and the data from CategoryA grouped with CategoryB.

Michael Carman 2009-05-20 16:38:32
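
To make the point in the comment above concrete, here is a minimal sketch that reads the sample data with $/ set to "--\n" and prints the records it gets back. The read_records helper and the in-memory filehandle are assumptions made for illustration; they are not part of any answer in this thread.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# The sample input from the question, as a single string.
my $text = "CategoryA--\n5: UserA\n6: UserB\n7: UserC\n"
         . "CategoryB--\n4: UserA\n5: UserB\n";

# read_records is a hypothetical helper: it splits the input on "--\n"
# by changing the input record separator for the duration of the read.
sub read_records {
    my ($input) = @_;
    local $/ = "--\n";
    open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @records = read_records($text);

# Print each record on one line, with newlines made visible.
for my $i (0 .. $#records) {
    (my $shown = $records[$i]) =~ s/\n/\\n/g;
    print "record $i: $shown\n";
}
```

The first record holds only the CategoryA header, and the second holds CategoryA's users followed by the CategoryB header, which is why some extra bookkeeping would be needed before the loop.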

@newt I considered doing that and priming the pump before the loop, but decided that the two regexes were easier to understand.

Chas. Owens 2009-05-20 16:42:46

@michael good point!

Nic Gibson 2009-05-20 16:48:05

I don't think two regexes are absolutely necessary; you can put each record type in an alternation and capture each separately. I've done that before.

Axeman 2009-05-20 19:44:56

@Axeman You can but it doesn't look as nice.

Chas. Owens 2009-05-20 20:15:11

@Axeman I keep forgetting about named captures. They make it easier to do this with one regex, but I still think the two-regex version is cleaner.

Chas. Owens 2009-05-20 20:41:11

Answer 2

A:

This will split it up for you.

prompt> ruby e.rb 
[["CategoryA--", nil, nil], [nil, "5", "UserA"], [nil, "6", "UserB"], [nil, "7", "UserC"], ["CategoryB--", nil, nil], [nil, "4", "UserA"], [nil, "5", "UserB"]]
prompt> cat e.rb 
s = <<TXT
CategoryA--
5: UserA
6: UserB
7: UserC
CategoryB--
4: UserA
5: UserB
TXT
p s.scan(/(^.*--$)|(\d+): (.*$)/)
prompt>

neoneye 2009-05-20 15:10:01

Answer 3

+3 A:

This Perl code seems to do what you're looking for (mostly, with one change). I laid out the data structure a bit differently, but not by much.

#!/usr/bin/perl

use strict;
use warnings;

my @array = (
    "CategoryA--",
    "5: UserA",
    "6: UserB",
    "7: UserC",
    "CategoryB--",
    "4: UserA",
    "5: UserB"
);

my ($dataFileContents, $currentCategory);

for (@array) {
    $currentCategory = $1 if (/(Category[A-Z])--/);
    if (/(\d+): (User[A-Z])/) {
        $dataFileContents->{$2}->{$currentCategory} = $1;
    }
}

Copas 2009-05-20 15:15:43

my $blah if TEST; will be deprecated in Perl 5.12 I believe and has always been discouraged. Either use a state variable (Perl 5.10) or declare a variable in a surrounding scope (earlier versions of Perl).

Chas. Owens 2009-05-20 16:09:26
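
The two alternatives named in the comment above can be sketched as follows. This is only an illustration of the general technique; the counter functions next_id and next_id_closure are hypothetical names and have nothing to do with the answer's parsing code.

```perl
#!/usr/bin/perl

use 5.010;    # enables the "state" feature
use strict;
use warnings;

# Perl 5.10 and later: a state variable is initialised once and
# keeps its value between calls.
sub next_id {
    state $n = 0;
    return ++$n;
}

# Earlier versions of Perl: declare the variable in a surrounding
# scope so that the sub closes over it.
{
    my $count = 0;
    sub next_id_closure {
        return ++$count;
    }
}

print next_id(), " ", next_id(), "\n";
print next_id_closure(), " ", next_id_closure(), "\n";
```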

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want).

Chas. Owens 2009-05-20 16:12:29
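
A small sketch of the difference described in the comment above, using the two code points it mentions plus an ordinary ASCII digit. The is_digit_class and is_ascii_digit helper names are assumptions made for this example.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helpers: test a single character against each pattern.
sub is_digit_class { return $_[0] =~ /^\d$/    ? 1 : 0 }
sub is_ascii_digit { return $_[0] =~ /^[0-9]$/ ? 1 : 0 }

# ASCII 5, FULLWIDTH DIGIT FIVE, MONGOLIAN DIGIT FIVE.
my @samples = ("5", "\x{FF15}", "\x{1815}");

for my $char (@samples) {
    printf "U+%04X  \\d: %s  [0-9]: %s\n",
        ord($char),
        (is_digit_class($char) ? "yes" : "no"),
        (is_ascii_digit($char) ? "yes" : "no");
}
```

On Perl 5.14 and later, the /a regex modifier is another way to restrict \d to ASCII digits, but that postdates the versions discussed here.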

Using the __DATA__ token (as Chas. did) makes this kind of testing easier, and easier to read.

glenn jackman 2009-05-20 16:13:28

@Chas: I don't think he's trying to emulate a state variable. $currentCategory is defined before the loop. The my() inside the loop is erroneous and should be deleted.

Michael Carman 2009-05-20 16:41:09

@Michael: Correct, it was typed in haste and has been removed, thanks. @Glenn: Thanks for the input; it appears I could learn a lot from Chas.

Copas 2009-05-20 18:22:38

Answer 4

+1 A:

Not exactly trying to golf here, but it can be done in a single alternation:

my ( %data, $category );
while ( <DATA> ) { 
    next unless /^(?:(Category\w+)|(\d+):\s*(User\w+))/;
    ( $1 ? $category = $1 : 0 ) or $data{$3}{$category} = $2;    
}

Data::Dumper (actually Smart::Comments) shows the output:

{
  UserA => {
             CategoryA => '5',
             CategoryB => '4'
           },
  UserB => {
             CategoryA => '6',
             CategoryB => '5'
           },
  UserC => {
             CategoryA => '7'
           }
}

Axeman 2009-05-20 19:56:48
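
For readers who find the ternary-plus-"or" line in Answer 4 hard to scan, the same single-alternation idea can be written with an explicit branch on which capture group matched. This is a sketch, not the answer author's code: the parse_lines helper name and the die on a record that precedes any category are assumptions added here.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# parse_lines is a hypothetical helper: it takes a list of lines and
# returns a hash reference of the form { user => { category => id } }.
sub parse_lines {
    my @lines = @_;
    my (%data, $category);
    for my $line (@lines) {
        next unless $line =~ /^(?:(Category\w+)|(\d+):\s*(User\w+))/;
        if (defined $1) {
            # First alternative matched: a category header.
            $category = $1;
        }
        else {
            # Second alternative matched: an "id: user" record.
            die "record before any category: $line" unless defined $category;
            $data{$3}{$category} = $2;
        }
    }
    return \%data;
}

my $data = parse_lines(
    "CategoryA--", "5: UserA", "6: UserB", "7: UserC",
    "CategoryB--", "4: UserA", "5: UserB",
);

# Print one "user category id" line per entry, in sorted order.
for my $user (sort keys %$data) {
    for my $cat (sort keys %{ $data->{$user} }) {
        print "$user $cat $data->{$user}{$cat}\n";
    }
}
```

Testing defined $1 rather than its truth value avoids treating a captured "0" as a non-match, although the Category\w+ pattern here could never capture that anyway.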

Answer 5

A:

#!/usr/bin/perl

use strict;
use Data::Dumper;

print "Content-type: text/html\n\n";

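my ($x, %data);
# Read first, then test: a do { } while loop would run the body once
# before any line had been read into $_.
while (<DATA>) {
    if (/^(Category\w+)/) {
        # A category header: remember it for the records that follow.
        $x = $1;
    } elsif (/^([0-9]+):\s*(User\w+)/) {
        # Autovivification creates the inner hash on first use.
        $data{$2}{$x} = int($1);
    }
}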

print Dumper \%data;


__DATA__
CategoryA--
5: UserA
6: UserB
7: UserC
CategoryB--
4: UserA
5: UserB

RESULT:

$VAR1 = {
    'UserC' => {
        'CategoryA' => 7
    },
    'UserA' => {
        'CategoryA' => 5,
        'CategoryB' => 4
    },
    'UserB' => {
        'CategoryA' => 6,
        'CategoryB' => 5
    }
};

Fran Corpier 2009-05-21 05:51:41
