ansaurus

Question

Perl: extracting data from text using regex

Answer 1

+3 A:

I am not sure what benefit there would be in getting the values as backreferences: how would you wish to deal with the case of duplicated keys (like "C" in the second line)? I am also not sure what you wish to do with the values once they are extracted.

But I would start with something like:

use Data::Dumper;

while (<DATA>)
{
    my @a = m!\(Int "(.*?)" ([0-9]+)\)!g;
    print Dumper(\@a);
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

This gives you, for each line, a flat array of alternating keys and values, with repeated keys appearing once per occurrence.
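As an illustration that goes beyond the answer itself, here is a minimal sketch of one way to handle the duplicated keys, assuming you want to keep every value for a repeated key. The helper name `pairs_to_hash` and the hash `%values_for` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the flat (key, value, key, value, ...) list
# returned by m//g into a hash of array references, so repeated keys
# such as "B" and "C" keep all of their values in order.
sub pairs_to_hash {
    my @pairs = @_;
    my %values_for;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';
my $values_for = pairs_to_hash($line =~ m!\(Int "(.*?)" ([0-9]+)\)!g);

for my $key (sort keys %$values_for) {
    print "$key: @{ $values_for->{$key} }\n";
}
```

Whether a hash of arrays is the right shape depends on what you do with the values next; if order across keys matters, keeping the flat list may be simpler.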

Beano 2009-05-17 16:53:25

+1, Could you explain the m!\pattern!g; ?

Andomar 2009-05-17 17:02:42

\d is not the same as [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}", MONGOLIAN DIGIT FIVE). If you mean [0-9], you must either use [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want).

Chas. Owens 2009-05-17 17:39:31

How is this a problem?

2009-05-17 19:28:49

Explanation of the m!! regex: I tend to prefer the 'm!!' form of pattern match to the usual '//' because I have to escape the '/' character more often than the '!' character. You can use almost any character to delimit your pattern match (this also applies to sed). The regex itself matches the characters '(Int "', then captures the smallest possible run of any characters followed by '" ', then captures some digits followed by ')'. Use the 'g' modifier to match repeatedly and you have a solution. If this does not explain what you intended, please ask again.
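A small sketch of the delimiter point, using a shortened sample string and a made-up path purely for illustration:

```perl
use strict;
use warnings;

my $text = '(Int "A" 22)(Int "B" 1)';

# The same pattern written with three different delimiters.
my @with_bang   = $text =~ m!\(Int "(.*?)" ([0-9]+)\)!g;
my @with_slash  = $text =~ /\(Int "(.*?)" ([0-9]+)\)/g;
my @with_braces = $text =~ m{\(Int "(.*?)" ([0-9]+)\)}g;

print "@with_bang\n";

# Where an alternative delimiter pays off: a pattern that contains
# slashes needs no escaping when the delimiter is '!'.
my $path = '/usr/local/bin/perl';
my ($dir) = $path =~ m!^(/usr/local)/!;
print "$dir\n";
```

All three arrays hold the same key/value list; the choice of delimiter only affects which characters need escaping.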

Beano 2009-05-17 22:14:39

With regard to \d matching any Unicode character with the digit attribute, whereas [0-9] matches only the specific span of ASCII characters: I guess when considering this, you need to bear in mind what your input data is going to consist of. In the above example I assumed an ASCII data range (a reasonable assumption, I thought), as this illustrated the use of the regular expression. I would have thought that if my input data was Mongolian, then I would probably have been interested in "digit five" and therefore the regex would still have been valid.

Beano 2009-05-17 22:23:50

@Beano Assumptions such as "my data will always be ASCII" are the source of lots of bugs. Use [0-9] if that is what you are looking for; only use \d if you mean to match any digit character. "\x{1815}" is just a way-out-there example of the sort of character you don't want to match; others are more likely to show up, like "\x{FF15}" (FULLWIDTH DIGIT FIVE), which looks like a normal "\x{0035}" but which you can't do math with.
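The difference is easy to check. This minimal sketch tests the FULLWIDTH DIGIT FIVE character against both forms; the variable names are illustrative only.

```perl
use strict;
use warnings;

# FULLWIDTH DIGIT FIVE (U+FF15), the character mentioned above.
my $fullwidth_five = "\x{FF15}";

my $matches_backslash_d = ($fullwidth_five =~ /\d/)    ? 1 : 0;
my $matches_ascii_class = ($fullwidth_five =~ /[0-9]/) ? 1 : 0;

print "\\d matches: $matches_backslash_d\n";
print "[0-9] matches: $matches_ascii_class\n";
```

Here \d matches and [0-9] does not. On Perl 5.14 and later, which postdates this discussion, the /a modifier restricts \d to the ASCII digits.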

Chas. Owens 2009-05-17 23:09:30

I would argue that without the full context of the application, input data, etc. it is hard to say one way or the other what is the "correct" behavior.

Beano 2009-05-18 06:48:55

And re-reading your comment, I did not state that I would ALWAYS assume my data is ASCII, I said for "the above example" - without full context, who can say what is correct or not. There must be cases where '\d' is a valid use, otherwise the digit attribute is a bit of a waste of time.

Beano 2009-05-18 06:54:50

Answer 2

+1 A:

My initial thought was to use named captures and to get the values from %-:

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )+
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

Unfortunately, the (?:...) grouping doesn't trigger capturing multiple values for B and C. I suspect that this is a bug. Doing it explicitly does capture all the values but you would have to know the maximum number of instances ahead of time.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    # repeat (?:...) N times
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;
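A cut-down sketch of how the values might then be read back from %-, assuming only the B/C part of the pattern and a two-pair sample string. Each repeated name gets one slot per group in the pattern, so the slots for optional groups that did not take part in the match are undefined and need filtering out.

```perl
use strict;
use warnings;

my $text = '(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)';

my $pattern = qr/
  \(Int\s+"B"\s+(?<B>[0-9]+)\)
  \(Int\s+"C"\s+(?<C>[0-9]+)\)
  (?:
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
  )?
  (?:
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
  )?
/x;

my (@b, @c);
if ($text =~ $pattern) {
    # $-{B} is an array reference with one slot per (?<B>...) group;
    # the third, unused optional group leaves an undef behind.
    @b = grep { defined } @{ $-{B} };
    @c = grep { defined } @{ $-{C} };
}

print "B: @b\n";
print "C: @c\n";
```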

The simplest approach is to use m//g. You can either capture name/value pairs as Beano suggests or use multiple patterns to capture each value:

my @b = m/Int "B" ([0-9]+)/g;
my @c = m/Int "C" ([0-9]+)/g;
# etc.

Michael Carman 2009-05-17 17:10:30

Captures inside of quantified matches only return the last capture; this isn't really a bug or a feature, just the way they work. As far as I know, C# has the only implementation that captures multiple times out of a quantified match.
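A minimal sketch of that behaviour, using a made-up string of three "B" items: the capture group inside the quantified group keeps only its final value, while the same group applied with m//g returns every value.

```perl
use strict;
use warnings;

my $text = '(Int "B" 1)(Int "B" 3)(Int "B" 5)';

# One capture group inside a quantified group: only the last repetition survives.
my ($last_only) = $text =~ /^(?:\(Int "B" ([0-9]+)\))+$/;
print "quantified group: $last_only\n";

# The same capture group applied repeatedly with /g: every repetition is returned.
my @all = $text =~ /\(Int "B" ([0-9]+)\)/g;
print "m//g: @all\n";
```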

Chas. Owens 2009-05-17 17:38:26

Answer 3

+8 A:

Don't try to use one regex; a set of regexes and splits is easier to understand:

#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
    next unless my ($data) = /\("Data" (.*)\)/;
    print "on line $., I saw:\n";
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        print "\ttype $type var $var num $num\n";
    }
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

If your data can stretch across lines, I would suggest using a parser instead of a regex.
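What such a parser could look like is sketched below. This is an illustration under stated assumptions rather than the approach the answer had in mind: it assumes the input consists only of parentheses, double-quoted strings without embedded quotes, and bare words or numbers. The function names `tokenize` and `parse_list` are hypothetical.

```perl
use strict;
use warnings;

# Split the input into tokens: parentheses, quoted strings, bare words.
# Whitespace, including newlines, is skipped between tokens.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /\G\s*(?:([()])|"([^"]*)"|([^\s()"]+))/gc) {
        push @tokens, defined($1) ? [ paren => $1 ]
                    : defined($2) ? [ value => $2 ]
                    :               [ value => $3 ];
    }
    return @tokens;
}

# Parse one parenthesised list; nested lists become nested array references.
sub parse_list {
    my ($tokens) = @_;
    my $open = shift @$tokens;
    die "expected '('" unless $open && $open->[0] eq 'paren' && $open->[1] eq '(';
    my @items;
    while (@$tokens) {
        my $token = $tokens->[0];
        if ($token->[0] eq 'paren' && $token->[1] eq ')') {
            shift @$tokens;
            return \@items;
        }
        if ($token->[0] eq 'paren') {    # an opening paren: recurse
            push @items, parse_list($tokens);
        }
        else {
            push @items, (shift @$tokens)->[1];
        }
    }
    die "missing ')'";
}

# One record spread over several lines.
my $input = <<'END';
("Data" (Int "A" 22)(Int "B" 1)
        (Int "C" 2)(Int "D" 34896)
        (Int "E" 38046))
END

my @tokens = tokenize($input);
my $tree   = parse_list(\@tokens);

my ($label, @records) = @$tree;
for my $record (@records) {
    my ($type, $name, $number) = @$record;
    print "$label: type $type var $name num $number\n";
}
```

Because the tokenizer works on the whole string rather than line by line, line breaks inside a record make no difference to the result.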

Chas. Owens 2009-05-17 17:28:55

Answer 4

A:

Hi Guys,

Great answers, Thanks. I used a bit of both.

I have a question about matching the B and C parts. I want to get a backreference for all the B and C items. Hence using this reference I can do the array search. What is the best way to do this?

Thanks again.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\) 
    (
      \(Int\s+"B"\s+[0-9]+\)
      \(Int\s+"C"\s+[0-9]+\)
    )*
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)/x;

My Perl script is:

while (<DATA>){
    my $data=$_;
    print $data;

    if($data=~/\("Data" \(Int "A" (?<A>[0-9]+)\)(\(Int "B" [0-9]+\)\(Int "C" [0-9]+\))+\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {

        print "matched\n";
        print "1: $1\n";
        print "2: $2\n";
        print "3: $3\n";
        print "4: $4\n";

    }
    print "\n\n"
}
__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))

It is not capturing all the B/C items, as shown by the output below. Any pointers would be great.

("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
matched
1: 22
2: (Int "B" 1)(Int "C" 2)
3: 34896
4: 38046


("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))
matched
1: 22
2: (Int "B" 3)(Int "C" 4)
3: 34896
4: 38046


("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
matched
1: 22
2: (Int "B" 5)(Int "C" 6)
3: 34896
4: 38046

Donald 2009-05-17 20:17:06

It would be better to update your question so that people could provide answers to this. To use backreferences ($1, $2, etc.) you need to have that many pairs of capturing parentheses. In other words, you need to know how many things you want to capture. I had hoped that the named capture mechanism would bypass this but alas, it doesn't. (See Chas. Owens's comment on my answer.) Your best bet is using m/<pattern>/g to get a list of captured values (as in Beano's answer).

Michael Carman 2009-05-17 20:23:45

The B/C grouping in your pattern is optional here due to the "*" (zero-or-more) quantifier. To require at least one instance, use "+" instead. This would return all the B/C data in $2, which you would have to parse further to get the individual values.
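One way to make this work, sketched here as an assumption about the intent: a quantified group on its own keeps only its last repetition, so put an outer capturing group around the whole quantified B/C run and then pull the individual values out of that capture with m//g. The group name `BC` and the hash `%captured` are illustrative.

```perl
use strict;
use warnings;

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    (?<BC>                       # outer group: the whole run of B and C items
      (?:
        \(Int\s+"B"\s+[0-9]+\)
        \(Int\s+"C"\s+[0-9]+\)
      )+
    )
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

my (%captured, @b, @c);
if ($line =~ $pattern) {
    # Copy the named captures first: the next successful match resets %+.
    %captured = %+;
    @b = $captured{BC} =~ /\(Int\s+"B"\s+([0-9]+)\)/g;
    @c = $captured{BC} =~ /\(Int\s+"C"\s+([0-9]+)\)/g;
    print "A=$captured{A} B=@b C=@c D=$captured{D} E=$captured{E}\n";
}
```

The first match validates the overall shape of the line; the second pass does the extraction, so the number of B/C pairs does not need to be known in advance.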

Michael Carman 2009-05-17 20:27:15

I've added my script to the post; it is not working. Any pointers would be great.

Donald 2009-05-17 21:01:11

Can you please explain why you would want backreferences, as you do not seem to be using them?

Beano 2009-05-17 22:26:39
