I have the following input to a Perl script, and I want to get the first occurrence of a NAME="..." string in each of the <TABLE>...</TABLE> structures.

The entire file is read into a single string and the regex acts on that input.

However, the regex always returns the last occurrence of a NAME="..." string. Can anyone explain what is going on and how this can be fixed?

Input file: 
ADSDF
<TABLE>
NAME="ORDERSAA"
line1
line2
NAME="ORDERSA"
line3
NAME="ORDERSAB"
</TABLE>
<TABLE>
line1
line2
NAME="ORDERSB"
line3
</TABLE>
<TABLE>
line1
line2
NAME="ORDERSC"
line3
</TABLE>
<TABLE>
line1
line2
NAME="ORDERSD"
line3
line3
line3
</TABLE>
<TABLE>
line1
line2
NAME="QUOTES2"
line3
NAME="QUOTES3"
NAME="QUOTES4"
line3
NAME="QUOTES5"
line3
</TABLE>
<TABLE>
line1
line2
NAME="QUOTES6"
NAME="QUOTES7"
NAME="QUOTES8"
NAME="QUOTES9"
line3
line3
</TABLE>
<TABLE>
NAME="MyName IsKhan"
</TABLE>

Perl Code starts here:

use warnings;
use strict;

my $nameRegExp = '(<table>((NAME="(.+)")|(.*|\n))*</table>)';

sub extractNames($$){
 my ($ifh, $ofh) = @_;
 my $fullFile;
 read ($ifh, $fullFile, 1024);#Hardcoded to read just 1024 bytes.
 while( $fullFile =~ m#$nameRegExp#gi){
  print "found: ".$4."\n";
 }
}

sub main(){
 if( ($#ARGV + 1 )!= 1){
  die("Usage: extractNames infile\n");
 }
 my $infileName = $ARGV[0];
 my $outfileName = $ARGV[1];
 open my $inFile, "<$infileName" or die("Could not open log file $infileName");
 my $outFile;
 #open my $outFile, ">$outfileName" or die("Could not open log file $outfileName");
 extractNames( $inFile, $outFile );
 close( $inFile );
 #close( $outFile );
}

#call 
main();
+1  A: 

Try making your regex non-greedy:

my $nameRegExp = '(<table>((NAME="(.+?)")|(.*?|\n))*</table>)';

Even the above regex will not list all the NAME lines in the file. It will list just one NAME line (the last one) from each <TABLE>...</TABLE> block.

To list all the NAME lines you can do:

my $nameRegExp = 'NAME="(.+?)"';

and print $1;
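
As a minimal sketch of that "list every NAME line" approach (the sample input is shortened from the question, and the variable names are only illustrative), a global match in list context returns all the captures at once:

```perl
use strict;
use warnings;

# Shortened sample input, modelled on the file in the question.
my $input = <<'END';
<TABLE>
NAME="ORDERSAA"
line1
NAME="ORDERSA"
</TABLE>
<TABLE>
line1
NAME="ORDERSB"
</TABLE>
END

# In list context, m//g returns the captured group of every match.
my @names = $input =~ /NAME="(.+?)"/g;

print "found: $_\n" for @names;
# found: ORDERSAA
# found: ORDERSA
# found: ORDERSB
```

Note that this lists every NAME value, not just the first one per table.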

codaddict
I had tried the non-greedy option too. I was wondering why it does not find the first occurrence and ALWAYS finds the last. Any thoughts on that?
Sai Charan
+1  A: 

First of all, it's a bad idea to parse XML with regular expressions. Second, you need to change your regex to the following:

my $nameRegExp = '(<table>((NAME="(.+?)")|(.*?|\n))*?</table>)';

This way the regex becomes non-greedy and should return the first occurrence.

Aurril
I had already tried the non-greedy option; it does NOT work. Also, this is not an XML file. It just happens to have a table-like structure, hence I opted to work with regular expressions.
Sai Charan
+1  A: 
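use strict;
use warnings;

# Treat </TABLE> as the record separator, so each read returns one table.
local $/ = '</TABLE>';
while (my $record = <>) {
    chomp $record;    # strip the trailing </TABLE>
    # Scan the record line by line and stop at the first NAME= line.
    for my $line (split /\n/, $record) {
        if ($line =~ /^NAME=(.*)/) {
            print "$1\n";    # the value keeps its surrounding quotes
            last;
        }
    }
}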

output

$ perl myscript.pl file
"ORDERSAA"
"ORDERSB"
"ORDERSC"
"ORDERSD"
"QUOTES2"
"QUOTES6"
"MyName IsKhan"

The gist of it: use </TABLE> as the record separator and newline as the field separator. Go through each field looking for NAME=. If one is found, take the string after the = sign and stop.

ghostdog74
I don't understand what this script is doing, but it looks good.
Aurril
I had considered this alternative, but was pretty sure it could be handled by a single regex since the input has some structure.
Sai Charan
+3  A: 

Try this:

'(?><TABLE>\n+(?:(?!</TABLE>|NAME=).*\n+)*)NAME="([^"]+)"'

The (?:.*\n+)* consumes any unwanted lines, while the embedded lookahead -- (?!</TABLE>|NAME=) -- keeps it from overrunning the first NAME field or the end of the TABLE record. Just in case there's a record with no NAME field, I wrapped most of the expression in an atomic group -- (?>...) -- to prevent pointless backtracking.

Notice that there's only one capturing group now. It's good practice to use them only when you really need to capture something; otherwise, use the non-capturing variety: (?:...).
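
Here is a small sketch of how that regex could be used in a loop. The sample input is shortened from the question, with one NAME-less table added to exercise the atomic group; the variable names are my own.

```perl
use strict;
use warnings;

# Shortened sample input; the second table deliberately has no NAME line.
my $input = <<'END';
ADSDF
<TABLE>
NAME="ORDERSAA"
line1
NAME="ORDERSA"
</TABLE>
<TABLE>
line1
line2
</TABLE>
<TABLE>
line1
NAME="QUOTES2"
NAME="QUOTES3"
</TABLE>
END

# Skip lines that are neither </TABLE> nor NAME=, then capture the NAME value.
my $first_name_re = qr{(?><TABLE>\n+(?:(?!</TABLE>|NAME=).*\n+)*)NAME="([^"]+)"};

my @first_names;
while ($input =~ /$first_name_re/g) {
    push @first_names, $1;
}

print "found: $_\n" for @first_names;
# found: ORDERSAA
# found: QUOTES2
```

The table without a NAME line simply produces no match, and the later NAME lines in each table are never reached.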


EDIT: As to why your regex didn't work, the short answer is greediness. After matching the opening tag, this part takes over:

((NAME="(.+)")|(.*|\n))*

The part in the outermost parens can match anything: tags, NAME= lines, linefeeds--even empty lines. Wrap that in a group controlled by a greedy *, and now it matches everything. There's nothing in there to make it stop matching at the first NAME field, or even at the end of a record.

So it's actually "finding" every occurrence of NAME="..." strings, but it's doing it in a single match attempt that consumes the entire input at once. With each iteration of the enclosing *, the capture groups are overwritten; when it's done, the final NAME value -- MyName IsKhan -- is what happens to be left in group 4.
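
The overwriting effect is easy to reproduce in isolation. This is a stripped-down sketch, not the original regex: a capturing group sits inside a repeated group, and a single match walks over all three lines.

```perl
use strict;
use warnings;

my $text = <<'END';
NAME="A"
NAME="B"
NAME="C"
END

my ($last, $consumed_all);
# The outer group repeats three times within ONE match attempt;
# every repetition overwrites capture group 1.
if ($text =~ /^(?:NAME="([^"]+)"\n)*/) {
    $last         = $1;
    $consumed_all = ($+[0] == length($text)) ? 1 : 0;
}

print "group 1 holds: $last\n";          # group 1 holds: C
print "whole input consumed: $consumed_all\n";   # whole input consumed: 1
```

Only the value from the final repetition survives, which is the same thing that happens to group 4 in the original regex.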

I used a negative lookahead to check the greediness, but you can also do that more directly, by using a non-greedy quantifier. Here's how my regex would look with a reluctant * in place of the negative lookahead:

'<TABLE>\n+(?:.*\n+)*?NAME="([^"]+)"'

Simply switching to a non-greedy quantifier wouldn't help with your regex though; you'd have to make some structural changes as well.
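
One difference between the two variants is worth seeing, sketched here with a hypothetical helper, first_match, and a made-up input whose first table has no NAME line. Both regexes capture the same value, but the reluctant one gets there by matching across the </TABLE> boundary, while the lookahead version starts its match at the second table.

```perl
use strict;
use warnings;

# Made-up input: the first table has no NAME line.
my $input = <<'END';
<TABLE>
line1
</TABLE>
<TABLE>
line2
NAME="ORDERSB"
NAME="ORDERSX"
</TABLE>
END

my $lazy_re    = qr{<TABLE>\n+(?:.*\n+)*?NAME="([^"]+)"};
my $guarded_re = qr{(?><TABLE>\n+(?:(?!</TABLE>|NAME=).*\n+)*)NAME="([^"]+)"};

# Hypothetical helper: returns the captured name and a flag telling
# whether the matched text contains a closing </TABLE> tag.
sub first_match {
    my ($re, $text) = @_;
    return unless $text =~ $re;
    my $matched = substr($text, $-[0], $+[0] - $-[0]);
    return ($1, index($matched, '</TABLE>') >= 0 ? 1 : 0);
}

my ($lazy_name, $lazy_crossed)       = first_match($lazy_re, $input);
my ($guarded_name, $guarded_crossed) = first_match($guarded_re, $input);

print "lazy:    $lazy_name (crossed </TABLE>: $lazy_crossed)\n";
print "guarded: $guarded_name (crossed </TABLE>: $guarded_crossed)\n";
# lazy:    ORDERSB (crossed </TABLE>: 1)
# guarded: ORDERSB (crossed </TABLE>: 0)
```

If every table is guaranteed to contain a NAME line, the two behave the same way.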

Alan Moore
Thanks! That worked. I was unaware of independent sub-expressions - learnt something new.
Sai Charan
Can you explain why the original regex did not work? In particular, why was it always finding the LAST occurrence?
Sai Charan
@scharan: No problem; see my edit.
Alan Moore
A: 

Charan and I also tried this regex:

<table>.*?(NAME="(.*)").*?</table>

but it didn't work. I thought it should work -- find a <table>, then match as few characters as possible, and then a NAME= section. What am I missing here? Thanks.

Kartick Vaddadi
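You would need to use the `s` modifier, so the `.` can match newlines. The inner `(.*)` should also become `([^"]*)`, otherwise with `s` it can run past the closing quote: `/<table>.*?(NAME="([^"]*)").*?<\/table>/si`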
Alan Moore
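
A quick sketch of the effect of the `s` modifier, on a one-table sample of my own. I am assuming `([^"]*)` for the value rather than `(.*)`, because a greedy `.*` under `s` would run on to a later quote, and `i` because the input uses upper-case tags.

```perl
use strict;
use warnings;

my $input = <<'END';
<TABLE>
line1
NAME="ORDERSB"
NAME="ORDERSX"
</TABLE>
END

# Without /s, "." cannot cross the newline after <TABLE>, so nothing matches.
my $without_s = $input =~ m{<table>.*?NAME="([^"]*)".*?</table>}i  ? $1 : undef;

# With /s, "." also matches newlines, and the lazy .*? stops at the first NAME.
my $with_s    = $input =~ m{<table>.*?NAME="([^"]*)".*?</table>}is ? $1 : undef;

print "without /s: ", $without_s // '(no match)', "\n";   # without /s: (no match)
print "with /s:    ", $with_s    // '(no match)', "\n";   # with /s:    ORDERSB
```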