I'm writing a Perl script and I've come to a point where I need to parse a Java source file line by line, checking for references to a fully qualified Java class name. I know up front both the class I'm looking for and the fully qualified name of the source file being searched (derived from its path).

For example, find all valid references to foo.bar.Baz inside the com/bob/is/YourUncle.java file.

At this moment the cases I can think of that it needs to account for are:

  1. The file being parsed is in the same package as the search class.

    find foo.bar.Baz references in foo/bar/Boing.java

  2. It should ignore comments.

    // this is a comment saying this method returns a foo.bar.Baz or Baz instance 
    // it shouldn't count
    
    
    /*   a multiline comment as well
     this shouldn't count
     if I put foo.bar.Baz or Baz in here either */
    
  3. In-line fully qualified references.

    foo.bar.Baz fb = new foo.bar.Baz();
    
  4. References based off an import statement.

    import foo.bar.Baz;
    ...
    Baz b = new Baz();
    

What would be the most efficient way to do this in Perl 5.8? Some fancy regex perhaps?

open F, $File::Find::name or die;
# these three things are already known
# $classToFind    looking for references of this class
# $pkgToFind      the package of the class you're finding references of
# $currentPkg     package name of the file being parsed
while(<F>){
  # ... do work here   
}
close F;
# the results are available here in some form
+5  A: 

A regex is probably the best solution for this, although I did find the following module on CPAN that you might be able to use:

  • Java::JVM::Classfile - Parses compiled class files and returns info about them. You would have to compile the files before you could use this.

Also, remember that it can be tricky to catch all possible variants of a multi-line comment with a regex.

Paul Wicks
+5  A: 

You also need to skip quoted strings (you can't even skip comments correctly if you don't also deal with quoted strings).
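
To illustrate why strings matter, here is a small sketch (not from the answer itself; the sample Java line is made up). A string literal containing // is truncated by a naive comment strip, while a pattern that matches string literals first leaves it alone:

```perl
use strict;
use warnings;

# Hypothetical line of Java source: the string literal contains "//".
my $java = 'String url = "http://foo.bar/Baz"; // returns a foo.bar.Baz';

# Naive approach: treat everything after the first // as a comment.
(my $naive = $java) =~ s{//.*}{};

# String-aware approach: match either a string literal or a // comment,
# put the literal back unchanged, and drop the comment.
(my $aware = $java) =~ s{ ( " (?: [^"\\]+ | \\. )* " ) | //.* }
                        { defined $1 ? $1 : '' }gex;

print "naive: $naive\n";
print "aware: $aware\n";
```

The naive version cuts the line off inside the URL, so any class name later on the line would be lost or misread.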

I'd probably write a fairly simple, efficient, and incomplete tokenizer very similar to the one I wrote in node 566467.

Based on that code I'd probably just dig through the non-comment/non-string chunks looking for \bimport\b and \b\Q$toFind\E\b matches. Perhaps similar to:

if( m[
        \G
        (?:
            [^'"/]+
          | /(?![/*])
        )+
    ]xgc
) {
    my $code = substr( $_, $-[0], $+[0] - $-[0] );
    my $imported = 0;
    while( $code =~ /\b(import\s+)?\Q$package\E\b/g ) {
        if( $1 ) {
            ... # Found importing of package
            while( $code =~ /\b\Q$class\E\b/g ) {
                ... # Found mention of imported class
            }
            last;
        }
        ... # Found a package reference
    }
} elsif( m[ \G ' (?: [^'\\]+ | \\. )* ' ]xgc
    ||   m[ \G " (?: [^"\\]+ | \\. )* " ]xgc
) {
    # skip quoted strings
} elsif(  m[\G//.*]gc  ) {
    # skip C++ comments
}
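
The fragment above shows only some of the branches and not the loop around them. A self-contained sketch of the same \G-anchored idea might look like the following. The helper name strip_noncode, the /* ... */ branch, and the fallback branch are assumptions added for illustration, not part of the original node:

```perl
use strict;
use warnings;

# Return the parts of $src that are neither comments nor quoted literals.
# strip_noncode() is a hypothetical helper name.
sub strip_noncode {
    my ($src) = @_;
    my $out = '';
    pos($src) = 0;
    while ( pos($src) < length($src) ) {
        if ( $src =~ m[ \G ( (?: [^'"/]+ | /(?![/*]) )+ ) ]xgc ) {
            $out .= $1;               # plain code: keep it
        } elsif ( $src =~ m[ \G ' (?: [^'\\]+ | \\. )* ' ]xgc
               || $src =~ m[ \G " (?: [^"\\]+ | \\. )* " ]xgc ) {
            $out .= ' ';              # quoted literal: replace with a space
        } elsif ( $src =~ m[ \G // [^\n]* ]xgc ) {
            # // comment: skipped up to (not including) the newline
        } elsif ( $src =~ m[ \G /\* .*? \*/ ]xgcs ) {
            $out .= ' ';              # /* ... */ comment: replace with a space
        } else {
            # Unterminated literal or comment: step over one character.
            pos($src) = pos($src) + 1;
        }
    }
    return $out;
}

my $java = <<'END';
import foo.bar.Baz; // not foo.bar.Qux
/* Baz in a comment */
String s = "Baz in a string";
Baz b = new Baz();
END

my $code = strip_noncode($java);
my @hits = $code =~ /\bBaz\b/g;
print scalar(@hits), " references\n";
```

Here the whole file is tokenized as one string rather than line by line, which is what lets the /* ... */ branch span lines. The search for the package and class names can then run over the returned text.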
tye
+2  A: 

This is really just a straight grep for Baz (or for /(foo.bar.| )Baz/ if you're concerned about false positives from some.other.Baz), but ignoring comments, isn't it?

If so, I'd knock together a state engine to track whether you're in a multiline comment or not. The regexes needed aren't anything special. Something along the lines of (untested code):

my $in_comment;
my %matches;
my $line_num = 0;
my $full_target = 'foo.bar.Baz';
my $short_target = (split /\./, $full_target)[-1];  # segment after last . (Baz)

while (my $line = <F>) {
    $line_num++;
    if ($in_comment) {
        next unless $line =~ m|\*/|;  # ignore line unless it ends the comment
        $line =~ s|.*?\*/||;          # delete everything prior to end of comment
        $in_comment = 0;              # the comment has ended
    } elsif ($line =~ m|/\*|) {
        if ($line =~ m|\*/|) {        # catch /* and */ on same line
            $line =~ s|/\*.*\*/||;
        } else {
            $in_comment = 1;
            $line =~ s|/\*.*||;       # clear from start of comment to end of line
        }
    }

    $line =~ s|//.*||;     # remove single-line comments
    $matches{$line_num} = $line if $line =~ /\Q$full_target\E| \Q$short_target\E/;
}

for my $key (sort { $a <=> $b } keys %matches) {   # numeric sort on line number
    print $key, ': ', $matches{$key}, "\n";
}

It's not perfect and the in/out of comment state can be messed up by nested multiline comments or if there are multiple multiline comments on the same line, but that's probably good enough for most real-world cases.

To do it without the state engine, you'd need to slurp the file into a single string, delete the /* ... */ comments, split it back into separate lines, and grep those for non-//-comment hits. But you wouldn't be able to include line numbers in the output that way.
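
A rough sketch of that slurp-based approach follows. The function name find_refs and its arguments are hypothetical, and it uses a word-boundary match on the short name, which is looser than the pattern above. One variation, not part of the approach as described, is that each deleted comment leaves its newlines behind, so the split lines still line up with the original line numbers. Like the state engine, it does not handle string literals:

```perl
use strict;
use warnings;

# find_refs() and its arguments are hypothetical names for illustration.
sub find_refs {
    my ($source, $full_target) = @_;
    my $short_target = (split /\./, $full_target)[-1];

    # Delete /* ... */ comments across line boundaries (/s lets "."
    # match newlines), keeping only the newlines they contained.
    $source =~ s{/\*(.*?)\*/}{ my $c = $1; $c =~ tr/\n//cd; $c }gse;

    my @hits;
    my $line_num = 0;
    for my $line (split /\n/, $source, -1) {
        $line_num++;
        $line =~ s{//.*}{};    # drop single-line comments
        push @hits, [$line_num, $line]
            if $line =~ /\Q$full_target\E|\b\Q$short_target\E\b/;
    }
    return @hits;
}

my $src = "/* foo.bar.Baz\n   Baz */\nBaz b; // Baz\nint x;\nfoo.bar.Baz y;\n";
for my $hit (find_refs($src, 'foo.bar.Baz')) {
    print "$hit->[0]: $hit->[1]\n";
}
```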

Dave Sherohman
+2  A: 

This is what I came up with; it works for all the different cases I've thrown at it. I'm still a Perl noob and it's probably not the fastest thing in the world, but it should work for what I need. Thanks for all the answers; they helped me look at it in different ways.

  my $className = 'Baz';
  my $searchPkg = 'foo.bar';
  my (@potentialRefs, @confirmedRefs);
  my $samePkg = 0;
  my $imported = 0;
  my $currentPkg = 'com.bob';
  $currentPkg =~ s/\//\./g;
  if($currentPkg eq $searchPkg){
    $samePkg = 1;  
  }
  my $inMultiLineComment = 0;
  open F, $_ or die;
  my $lineNum = 0;
  while(<F>){
    $lineNum++;
    if($inMultiLineComment){
      if(m|^.*?\*/|){
        s|^.*?\*/||; #get rid of the closing part of the multiline comment we're in
        $inMultiLineComment = 0;
      }else{
        next;
      }
    }
    if(length($_) > 0){
      s|"([^"\\]*(\\.[^"\\]*)*)"||g; #remove strings first since java cannot have multiline string literals
      s|/\*.*?\*/||g;  #remove any multiline comments that start and end on the same line
      s|//.*$||;  #remove the // comments from what's left
      if (m|/\*.*$|){
        $inMultiLineComment = 1; # any remaining /* means at least part of the next line is inside the multiline comment
        s|/\*.*$||g;
      }
    }else{
      next; #no sense continuing to process a blank string
    }

    if (/^\s*(import )?(\Q$searchPkg\E)?(.*)?\b\Q$className\E\b/){  # \Q..\E so the dots in the package match literally
      if($imported || $samePkg){
        push(@confirmedRefs, $lineNum);
      }else {
        push(@potentialRefs, $lineNum);
      }
      if($1){
        $imported = 1;
      } elsif($2){
        push(@confirmedRefs, $lineNum);
      }
    }
  }
  close F;      
  if($imported){
    push(@confirmedRefs,@potentialRefs);
  }

  for (@confirmedRefs){
    print "$_\n";
  }
polarbear
+1  A: 

If you are feeling adventurous enough you could have a look at Parse::RecDescent.

dsm