Hi, I've got some data that I'm parsing in Perl, and I will be adding more and more differently formatted data in the near future. What I would like to do is write an easy-to-use function that takes a string and a regex and returns everything captured by the parenthesized groups. It would work something like this (pseudocode):

sub parse {
  $data = shift;
  $regex = shift;

  $data =~ eval ("m/$regex/")
  foreach $x ($1...$n)
  {
    push (@ra, $x); 
  }
  return \@ra;
}

Then, I could call it like this:

@subs = parse ($data, '^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');

As you can see, there are a couple of issues with this code: I don't know whether the eval would work, the 'foreach' definitely wouldn't, and without knowing how many parentheses there are, I don't know how many times to loop.

This is too complicated for split, so if there's another function or possibility that I'm overlooking, let me know.

Thanks for your help!

A: 

You are trying to parse a complex expression with a regex, which is an insufficient tool for the job. Recall that regular expressions, in the formal sense, can only recognize regular languages, not context-free or higher grammars. For intuition: any expression that can be nested to arbitrary depth cannot be parsed with a true regular expression.

Yuval A
Perl's regexen are irregular: you can use `(??{blah})`, though it's not exactly recommended practice.
sreservoir
Perl's regex engine also supports recursion (since 5.10), which allows it to match nested constructs easily.
Eric Strom
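To make the recursion comment concrete, here is a minimal sketch, assuming Perl 5.10 or later. The pattern name `$balanced` and the sample string are made up for illustration; `(?1)` re-enters capture group 1, so the group can contain copies of itself.

```perl
use strict;
use warnings;

# Hypothetical pattern: group 1 matches one balanced parenthesized chunk.
# [^()]++ consumes non-paren text possessively; (?1) recurses into group 1.
my $balanced = qr/ ( \( (?: [^()]++ | (?1) )* \) ) /x;

# Invented sample input with nested parentheses.
my $text = 'f(a, (b + c) * (d)) tail';

my ($chunk) = $text =~ $balanced;
print "$chunk\n";    # prints "(a, (b + c) * (d))"
```

This only shows that nesting can be matched; it says nothing about whether doing so is a good idea for a real grammar.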
True, many regex implementations can actually match more than the set of regular languages, but this is not consistent across implementations. If you need to parse a grammar, use a proper grammar parser.
Yuval A
+6  A: 

In list context, a regex match returns a list of the substrings captured by its parenthesized groups.

So all you have to do is:

my @matches = $string =~ /regex (with) (parens)/;

Assuming it matched, @matches will hold the two captured substrings; if the match fails, @matches will be empty.

So using your regex:

my @subs = $data =~ /^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/;

Also, when you have long regexes, Perl has the x modifier, which goes after the closing regex delimiter. The x modifier lets you put whitespace, newlines, and # comments inside the regex for readability; any literal whitespace you want to match must then be escaped or placed in a character class.
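As a rough illustration of the x modifier, here is the same regex spread over several lines. The sample record in `$data` is invented to fit the pattern; it is not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample record, invented to fit the pattern from the question.
my $data = '"42",widget:--:ABC12345,ID=7';

my @subs = $data =~ /
    ^ "([0-9]+)" ,          # quoted numeric field
    ([^:]*) :               # anything up to the first colon
    (\W+) :                 # one or more non-word characters
    ([A-Z]{3}[0-9]{5}) ,    # three capitals followed by five digits
    ID=([0-9]+)             # numeric ID
/x;

print join('|', @subs), "\n";    # prints "42|widget|--|ABC12345|7"
```

Keep the regex delimiter out of the comments, since Perl finds the closing delimiter before it looks at the x-mode comments.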

If you are worried about capturing groups that matched an empty string or did not take part in the match at all, you can pass the matches through @subs = grep {length} @subs to filter them out.
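Here is a small sketch of that filtering step, with an invented input and pattern. It spells the test out as `defined($_) && length($_)`, which is an explicit variant of the `grep {length}` above and avoids warnings about undef on older perls.

```perl
use strict;
use warnings;

# Invented example: group 2 matches an empty string and the optional
# group 3 does not take part in the match, so it comes back as undef.
my @raw = 'key=' =~ /^(\w+)=(\d*)(;.*)?$/;    # ('key', '', undef)

# Keep only captures that are defined and non-empty.
my @kept = grep { defined($_) && length($_) } @raw;

print scalar(@raw), " raw, ", scalar(@kept), " kept\n";    # prints "3 raw, 1 kept"
```

Note that filtering shifts the positions of the remaining captures, so only do it when you do not care which group a value came from.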

Eric Strom
If you don't know whether the regex has parens or not, and want to return nothing if it does not (instead of the default list `(1)` that a successful match without parens returns), add an extra set: `$string =~ /(regex)/` and discard the first element of the results.
ysth
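A sketch of the extra-parens trick, for what it's worth. The helper name `captures_only` is hypothetical. Without parens, a successful match in list context still returns a non-empty list (the list `(1)` when /g is not used), which is what the wrapper works around.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the caller's regex in one extra group and
# throw away that first capture, leaving only the caller's own groups.
sub captures_only {
    my ($string, $re) = @_;
    my ($whole, @captures) = $string =~ /($re)/;
    return @captures;
}

my @plain = 'abc 123' =~ /\d+/;                          # (1), a success flag
my @none  = captures_only('abc 123', qr/\d+/);           # empty list
my @one   = captures_only('abc 123', qr/([a-z]+) \d+/);  # ('abc')

print scalar(@plain), " ", scalar(@none), " ", scalar(@one), "\n";    # prints "1 0 1"
```

One caveat: the extra group shifts the numbering, so a regex that uses numbered backreferences like \1 internally would need adjusting.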
That grep will filter out parens not actually used in the match, but not zero-length ones (which will be defined and "")
ysth
@ysth => you're right, fixed.
Eric Strom
Thank you! I've been doing Perl for years, how did I never know that you can return matches in list context? Might have to go back and re-read my books.
coding_hero
A: 

When you want to find text inside of pairs of parentheses, you want to use Text::Balanced.

But, that is not what you want to do, so it will not help you.
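For anyone who lands here actually needing balanced parentheses, a minimal sketch of Text::Balanced follows. The input string is invented, and the text is assumed to start with the opening bracket (optionally after whitespace).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Invented input that begins with a nested, balanced group.
my $text = '(a (b) c) tail';

# In list context the first two return values are the extracted
# bracketed text and whatever follows it.
my ($extracted, $remainder) = extract_bracketed($text, '()');

print "extracted: $extracted\n";    # prints "extracted: (a (b) c)"
print "remainder:$remainder\n";     # prints "remainder: tail"
```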

Kevin Panko
despite the name of the question, it doesn't seem like the OP is actually looking to match nested parens, just to use a regex that could have any number of sequential capturing groups
Eric Strom
Sorry, I should have said 'parenthetical groupings' instead of 'parentheses'.
coding_hero
+1  A: 

Then, I could call it like this:

@subs = parse($data, 
          '^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)');

Instead, pass a precompiled regex object built with qr//, so that no eval or string interpolation is needed:

parse($data, 
    qr/^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/);
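With a qr// object, the function from the question could be sketched roughly like this. The sample record is invented, and returning a plain list (rather than an array reference) is a choice made here for brevity.

```perl
use strict;
use warnings;

# Sketch of the asker's parse(): match in list context and return
# whatever the capturing groups caught (empty list on failure).
sub parse {
    my ($data, $regex) = @_;
    my @captures = $data =~ $regex;
    return @captures;
}

# Hypothetical sample record, invented to fit the pattern.
my $data = '"42",widget:--:ABC12345,ID=7';

my @subs = parse($data,
    qr/^"([0-9]+)",([^:]*):(\W+):([A-Z]{3}[0-9]{5}),ID=([0-9]+)/);

print join('|', @subs), "\n";    # prints "42|widget|--|ABC12345|7"
```

There is no need to know the number of groups in advance: the list simply has one element per capturing group.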

Further, your task would be simpler if you can use named captures (available in Perl 5.10 and later), because the order of the fields in the input then no longer matters. Here is an example:

#!/usr/bin/perl

use strict; use warnings;

my %re = (
    id => '(?<id> [0-9]+ )',
    name => '(?<name> \w+ )',
    value => '(?<value> [0-9]+ )',
);

my @this = (
    '123,one:12',
    '456,two:21',
);

my @that = (
    'one:[12],123',
    'two:[21],456',
);

my $this_re = qr/$re{id}   ,   $re{name}    : $re{value}/x;
my $that_re = qr/$re{name} : \[$re{value}\] , $re{id}   /x;

use YAML;

for my $d ( @this ) {
    print Dump [ parse($d, $this_re) ];
}

for my $d ( @that ) {
    print Dump [ parse($d, $that_re) ];
}

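sub parse {
    my ($d, $re) = @_;
    # Return an empty list when the data does not match.
    return unless $d =~ $re;
    # %+ holds the named captures; the hash slice fixes the output order.
    return my @result = @+{qw(id name value)};
}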

Output:

---
- 123
- one
- 12
---
- 456
- two
- 21
---
- 123
- one
- 12
---
- 456
- two
- 21
Sinan Ünür
Thank you for this, it is good to know!
coding_hero