I am using Perl to do text processing with regex. I have no control over the input. I have shown some examples of the input below.

As you can see, the items B and C can appear in the string n times with different values. I need to get all the values as backreferences. If you know of a different way, I am all ears.

I am trying to use the branch reset pattern (as outlined in perldoc under "Extended Patterns"), but I am not having much luck matching the string.

("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

My Perl is below, any help would be great. Thanks for any help you can give.

if($inputString =~/\("Data" \(Int "A" ([0-9]+)\)(?:\(Int "B" ([0-9]+)\)\(Int "C" ([0-9]+)\))+\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {

    print "\n\nmatched\n";

    print "1: $1\n";
    print "2: $2\n";
    print "3: $3\n";
    print "4: $4\n";
    print "5: $5\n";
    print "6: $6\n";
    print "7: $7\n";
    print "8: $8\n";
    print "9: $9\n";

}
+3  A: 

I am not sure what benefit there would be in getting the values as backreferences. How would you wish to deal with the case of duplicated keys (like "C" in the second line)? Also, I am not sure what you wish to do with the values once they are extracted.

But I would start with something like:

use Data::Dumper;

while (<DATA>)
{
    my @a = m!\(Int "(.*?)" ([0-9]+)\)!g;
    print Dumper(\@a);
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

This gives you a flat array of alternating keys and values, with repeated keys appearing once per occurrence.
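
If the flat list is awkward to work with, the pairs can be folded into a hash of arrays keyed by name. This is only a sketch under that assumption; the names $line, @pairs and %values are illustrative and not part of the code above.

```perl
use strict;
use warnings;

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';

# In list context, m//g returns every capture from every match:
# ("A", 22, "B", 1, "C", 2, "B", 3, ...)
my @pairs = $line =~ m!\(Int "(.*?)" ([0-9]+)\)!g;

# Fold the flat list into a hash of arrays so repeated keys keep all values.
my %values;
while (my ($key, $num) = splice(@pairs, 0, 2)) {
    push @{ $values{$key} }, $num;
}

print "$_: @{ $values{$_} }\n" for sort keys %values;
```

With this layout, all values for a key are available as @{ $values{B} }, in the order they appeared in the input.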

Beano
+1, Could you explain the m!\pattern!g; ?
Andomar
\d is not the same as [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit property (including "\x{1815}", MONGOLIAN DIGIT FIVE). If you mean [0-9], you must either use [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want).
Chas. Owens
How is this a problem?
Explanation of the m!! regex: I tend to prefer the 'm!!' form of pattern match to the usual '//' because I have to escape the '/' character more often than the '!' character. You can use almost any character to delimit your pattern match (this also applies to sed). The regex itself matches the characters '(Int "', then captures the fewest possible characters that are followed by '" ', then captures some digits followed by ')'. Use the 'g' modifier to match repeatedly and you have a solution. If this does not explain what you intended, please ask again.
Beano
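To illustrate the delimiter point, here is a small sketch (the sample string and the variable names are made up for illustration). The same pattern written with '/' and with '!' delimiters returns the same list.

```perl
use strict;
use warnings;

my $text = '(Int "A" 22)(Int "B" 1)';

# Same pattern, two delimiters; with m!...! a literal "/" in the
# pattern would not need escaping.
my @with_slash = $text =~ /\(Int "(.*?)" ([0-9]+)\)/g;
my @with_bang  = $text =~ m!\(Int "(.*?)" ([0-9]+)\)!g;

print "@with_slash\n";   # A 22 B 1
print "@with_bang\n";    # A 22 B 1
```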
With regard to \d matching any Unicode character with the digit property, whereas [0-9] matches the specific span of ASCII characters: I guess when considering this, you need to bear in mind what your input data is going to consist of. In the above example I made the assumption of an ASCII data range (a reasonable assumption, I thought), as this illustrated the use of the regular expression. I would have thought that if my input data were Mongolian, then I would probably have been interested in "digit five", and therefore the regex would still have been valid.
Beano
@Beano Assumptions such as "my data will always be ASCII" are the source of lots of bugs. Use [0-9] if that is what you are looking for; only use \d if you mean to match any digit character ("\x{1815}" is just a way-out-there example of the sort of character you don't want to match; there are others that are more likely to show up, like "\x{FF15}" (FULLWIDTH DIGIT FIVE), which looks like a normal "\x{0035}", but you can't do math with it).
Chas. Owens
I would argue that without the full context of the application, input data, etc. it is hard to say one way or the other what is the "correct" behavior.
Beano
And re-reading your comment, I did not state that I would ALWAYS assume my data is ASCII, I said for "the above example" - without full context, who can say what is correct or not. There must be cases where '\d' is a valid use, otherwise the digit attribute is a bit of a waste of time.
Beano
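The difference discussed in these comments can be checked directly. A minimal sketch, using the FULLWIDTH DIGIT FIVE character mentioned above (the variable names are illustrative):

```perl
use strict;
use warnings;

# U+FF15 FULLWIDTH DIGIT FIVE: a Unicode digit that is not ASCII "5".
my $fullwidth_five = "\x{FF15}";

my $matches_d     = $fullwidth_five =~ /^\d$/    ? 1 : 0;
my $matches_ascii = $fullwidth_five =~ /^[0-9]$/ ? 1 : 0;

print "\\d matches:    $matches_d\n";      # 1
print "[0-9] matches: $matches_ascii\n";  # 0
```

On Perl 5.14 and later, the /a modifier is another way to restrict \d to ASCII digits.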
+1  A: 

My initial thought was to use named captures and to get the values from %-:

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )+
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

Unfortunately, the (?:...) grouping doesn't trigger capturing multiple values for B and C. I suspect that this is a bug. Doing it explicitly does capture all the values but you would have to know the maximum number of instances ahead of time.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    # repeat (?:...) N times
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;
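
As a sketch of how the values could then be read back from %- (Perl 5.10 or later): each name maps to an array ref with one slot per group of that name, and groups that did not take part in the match are undef. The helper $pair and the sample $line are assumptions made to keep the example short.

```perl
use strict;
use warnings;

# One B/C pair; interpolated several times below, so the names B and C
# each appear in three separate capture groups.
my $pair = qr/\(Int\s+"B"\s+(?<B>[0-9]+)\)\(Int\s+"C"\s+(?<C>[0-9]+)\)/;

my $pattern = qr/
  \("Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    $pair (?:$pair)? (?:$pair)?
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';

my (@b, @c);
if ($line =~ $pattern) {
    # Drop the undef slots left by optional groups that did not match.
    @b = grep { defined } @{ $-{B} };
    @c = grep { defined } @{ $-{C} };
    print "B: @b\n";   # B: 1 3
    print "C: @c\n";   # C: 2 4
}
```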

The simplest approach is to use m//g. You can either capture name/value pairs as Beano suggests or use multiple patterns to capture each value:

my @b = m/Int "B" ([0-9]+)/g;
my @c = m/Int "C" ([0-9]+)/g;
# etc.
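
The per-key patterns can also be generated in a loop. A minimal sketch, assuming the key names are known in advance; %values and $line are illustrative names.

```perl
use strict;
use warnings;

my $line = '("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))';

my %values;
for my $key (qw(A B C D E)) {
    # \Q...\E quotes the key in case it ever contains regex metacharacters.
    $values{$key} = [ $line =~ /\(Int "\Q$key\E" ([0-9]+)\)/g ];
}

print "$_: @{ $values{$_} }\n" for sort keys %values;
```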
Michael Carman
Captures inside of quantified matches only return the last capture; this isn't really a bug or a feature, just the way they work. As far as I know, C# has the only implementation that captures multiple times out of a quantified match.
Chas. Owens
+8  A: 

Don't try to use one regex; a set of regexes and splits is easier to understand:

#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
    next unless my ($data) = /\("Data" (.*)\)/;
    print "on line $., I saw:\n";
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        print "\ttype $type var $var num $num\n";
    }
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

If your data can stretch across lines, I would suggest using a parser instead of a regex.
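
The answer does not name a particular parser. As one possible sketch, a small hand-rolled recursive reader built on \G and /gc can handle records that span lines. The multi-line layout of $input and the function name parse_items are assumptions for illustration, not something taken from the question.

```perl
use strict;
use warnings;

# Hypothetical multi-line input; the layout is assumed for illustration.
my $input = <<'END';
("Data" (Int "A" 22)
        (Int "B" 1)(Int "C" 2)
        (Int "D" 34896))
END

# Read items up to the matching ")". The caller has already consumed
# the opening "(". Returns an array ref of atoms and nested lists.
sub parse_items {
    my ($text_ref) = @_;
    my @items;
    while (1) {
        if    ($$text_ref =~ /\G\s*\)/gc)          { return \@items }
        elsif ($$text_ref =~ /\G\s*\(/gc)          { push @items, parse_items($text_ref) }
        elsif ($$text_ref =~ /\G\s*"([^"]*)"/gc)   { push @items, $1 }
        elsif ($$text_ref =~ /\G\s*([^\s()"]+)/gc) { push @items, $1 }
        else  { die "unexpected input at offset " . (pos($$text_ref) // 0) . "\n" }
    }
}

$input =~ /\G\s*\(/gc or die "expected '(' at start of input\n";
my $tree = parse_items(\$input);

# The first item is the "Data" label; the rest are (type, var, num) lists.
my ($label, @records) = @$tree;
for my $rec (@records) {
    my ($type, $var, $num) = @$rec;
    print "type $type var $var num $num\n";
}
```

Because /gc keeps the match position on failure, each branch tries its token at the same offset, and newlines are skipped like any other whitespace.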

Chas. Owens
A: 

Hi Guys,

Great answers, Thanks. I used a bit of both.

I have a question about matching the B and C parts. I want to get a backreference for all the B and C items, so that I can then search those values as an array. What is the best way to do this?

Thanks again.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\) 
    (
      \(Int\s+"B"\s+[0-9]+\)
      \(Int\s+"C"\s+[0-9]+\)
    )*
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)/x;

My Perl script is:

while (<DATA>){
    my $data=$_;
    print $data;

    if($data=~/\("Data" \(Int "A" (?<A>[0-9]+)\)(\(Int "B" [0-9]+\)\(Int "C" [0-9]+\))+\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {

        print "matched\n";
        print "1: $1\n";
        print "2: $2\n";
        print "3: $3\n";
        print "4: $4\n";

    }
    print "\n\n"
}
__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))

It is not capturing all the B/C items, as shown by the output below. Any pointers would be great.

("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
matched
1: 22
2: (Int "B" 1)(Int "C" 2)
3: 34896
4: 38046


("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))
matched
1: 22
2: (Int "B" 3)(Int "C" 4)
3: 34896
4: 38046


("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
matched
1: 22
2: (Int "B" 5)(Int "C" 6)
3: 34896
4: 38046
Donald
It would be better to update your question so that people could provide answers to this. To use backreferences ($1, $2, etc.) you need to have that many pairs of capturing parentheses. In other words, you need to know how many things you want to capture. I had hoped that the named capture mechanism would bypass this but, alas, it doesn't. (See Chas. Owens's comment on my answer.) Your best bet is using m/<pattern>/g to get a list of captured values (as in Beano's answer).
Michael Carman
The B/C grouping in your pattern is optional here due to the "*" (zero-or-more) quantifier. To require at least one instance use "+" instead. Note that a quantified capture group keeps only its last iteration; to get all the B/C data in $2, wrap the repetition in an outer capture, ((?:...)+), and then parse $2 further to get the individual values.
Michael Carman
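One way to get every B/C value while still validating the overall record shape is a two-step match: capture the whole run of B/C pairs in a single group, then pull the numbers out of that string with m//g. This is a sketch based on the pattern from the post; the names $bc_run, @b and @c are illustrative.

```perl
use strict;
use warnings;

my $data = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))';

my (@b, @c);
# The outer (...) around the quantified (?:...)+ captures the whole run
# of pairs, not just the last iteration.
if ($data =~ /\("Data" \(Int "A" ([0-9]+)\)((?:\(Int "B" [0-9]+\)\(Int "C" [0-9]+\))+)\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {
    my $bc_run = $2;   # copy before the next match overwrites $2
    @b = $bc_run =~ /\(Int "B" ([0-9]+)\)/g;
    @c = $bc_run =~ /\(Int "C" ([0-9]+)\)/g;
    print "B: @b\n";   # B: 1 3 5
    print "C: @c\n";   # C: 2 4 6
}
```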
I've added my script to the post, but it is not working. Any pointers would be great.
Donald
Can you please explain why you would want backreferences, as you do not seem to be using them?
Beano