ansaurus

Question

Is there an easy way to chunk a text file into brace-balanced sections?

Answer 1

+5 A:

Show what you are feeding Parse::RecDescent; it may be possible to make it much better.

Or you could try using Text::Balanced to parse the { ... }.

ysth 2009-06-15 23:20:20

The grammar is pretty big, and I'm not sure if I should share it.Tried Text::Balanced, only to find that it runs on RecDescent, so is still slow. Thanks anyways..

Marty 2009-06-15 23:25:37

RecDescent isn't necessarily slow; depends what you do with it. I hope you actually did give it a try.

ysth 2009-06-15 23:32:50

Ok, I was surprised when you said that Text::Balanced "runs on" RecDescent, because that wasn't my memory of it. Looking at it again shows you are wrong; Parse::RecDescent uses Text::Balanced to parse grammars, but Text::Balanced does not use Parse::RecDescent at all.

ysth 2009-06-15 23:39:00

Oops. I don't know where I got that notion from. Thanks for setting me straight.

Marty 2009-06-16 00:01:53

Here's the meat of my Text::Balanced version<PRE><CODE>#### Seperate cells into their own files##do { $libtext =~ m/cell\s*$\s*(\w+)\s*$\s*/g; my $cellname = $1; print "Found cell $cellname\n"; my $cellbody = extract_bracketed( $libtext, "{" ); open( CELL, ">./lib_split/$libname/$cellname.txt" ) or die "oops!\n"; print CELL "/* $cellname */\n"; print CELL "$cellbody\n"; close(CELL);} while ( pos($libtext) > 0 );</CODE></PRE>BTW, input files are ~8MB

Marty 2009-06-16 00:09:00

Hmm. It occurs to me that I'm not using extract_brackete() properly within a loop.

Marty 2009-06-16 00:31:57

Answer 2

+2 A:

Why does RecDescent take so long? Is it because your grammar is complex? If that's the case, you could two a bi-level pass using Parse::RecDescent. The idea is that you would define a simple grammar that parses cell ... { ... } and then passes parsed output from the first parser into a call to Parse::RecDescent with your more complex grammar. This is guessing about the reason for RecDescent being slow on your data.

Another option is to write your own simple parser that matches on the cell entries, counts the number of braces it's seen so far, and then finds the matching brace when the closing brace count is equal to the opening brace count. That should be fast, but the suggestion above might be faster to implement and easier to maintain.

Edit: You should definitely try Parse::RecDescent with a simplified grammar. The algorithmic complexity of recursive descent parsing is proportional to the number of possible parse trees, which should be something like is B ^ N, where B is the number of branching points in your grammar, and N is the number of nodes.

If you'd like to try rolling your own simple parser for a first pass over your input, the following code can get you started.

#!/usr/bin/perl -w

use strict;

my $input_file = "input";
open FILE, "<$input_file" or die $!;

my $in_block = 0;
my $current_block = '';
my $open_bracket_count = 0;
while( my $line = <FILE> ) {
    if ( $line =~ /cell/ ) {
     $in_block = 1;
    }

    if ( $in_block ) {
     while ( $line =~ /([\{\}]{1})/g ) {
      my $token = $1;
      if ( $token eq '{' ) {
       $open_bracket_count++;
      } elsif ( $token eq '}' ) {
       $open_bracket_count--;
      }
     }

     $current_block .= $line;
    }

    if ( $open_bracket_count == 0 && $current_block ne '' ) {
     print '-' x 80, "\n";
     print $current_block, "\n";
     $in_block = 0;
     $current_block = '';
    }
}
close FILE or die $!;

Edit: changed code to avoid slurping the entire file into memory. While this is trivial for an 8MB file, it's cleaner to just read the file in line-by-line.

James Thompson 2009-06-15 23:30:40

Answer 3

+1 A:

Use yapp LALR(1) parser which works in linear time and constant space.

Hynek -Pichi- Vychodil 2009-06-16 10:14:37

ansaurus

tags:

views:

answers:

Is there an easy way to chunk a text file into brace-balanced sections?

related questions