ansaurus

Question

Extracting a block of text where the closing expression depends on the opening one.

Answer 1

A:

#!/usr/bin/perl

my $all_lines = join "", <>;

# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\1 matches the 1st matched group)
if ( $all_lines =~ /(=+ Heading )\([2]\)(.*?)\1/s ) {
    print "$2";
}

codelogic 2009-02-02 19:28:25

Thank you, but this does not address the case where the block would be terminated by a heading shallower than the starting one. I'll edit the question to emphasise this.

2009-02-02 20:26:53

Answer 2

A:

This splits the file in sections:

my @all = split /(?=^= )/m, join "", <$filehandle>;
shift @all;

Leon Timmermans 2009-02-02 20:27:24

Answer 3

A:

I'd skip trying to use a complex regex. Instead write a simple parser and build up a tree.

Here's a rough and ready implementation. It's only optimized for lazy coding. You may want to use libraries from CPAN to build your parser and your tree nodes.

#!/usr/bin/perl

use strict;
use warnings;

my $document = Node->new();
my $current = $document;

while ( my $line = <DATA> ) {

    if ( $line =~ /^=+\s/ ) {

        my $current_depth = $current->depth;
        my $line_depth = Node->Heading_Depth( $line );

        if ( $line_depth > $current_depth ) {
            # child node.
            my $line_node = Node->new();
            $line_node->heading( $line );
            $line_node->parent( $current );
            $current->add_children( $line_node );
            $current = $line_node;
        }
        else {

            my $line_node = Node->new();
            while ( my $parent = $current->parent ) {

                if ( $line_depth == $current_depth ) {
                    # sibling node.
                    $line_node->heading( $line );
                    $line_node->parent( $parent );
                    $current = $line_node;
                    $parent->add_children( $current );

                    last;
                }

                # step up one level.
                $current = $parent;
            }
        }

    }
    else {
        $current->add_children( $line );
    }


}

use Data::Dumper;
print Dumper $document;

BEGIN {
    package Node;
    use Scalar::Util qw(weaken blessed );

    sub new {
        my $class = shift;

        my $self = {
            children => [],
            parent   => undef,
            heading  => undef,
        };

        bless $self, $class;
    }

    sub heading {
        my $self = shift;
        if ( @_ ) {
            $self->{heading} = shift;
        }
        return $self->{heading};
    }

    sub depth {
        my $self = shift;

        return $self->Heading_Depth( $self->heading );
    }

    sub parent {
        my $self = shift;
        if ( @_ ) {
            $self->{parent} = shift;
            weaken $self->{parent};
        }
        return $self->{parent};
    }

    sub children {
        my $self = shift;
        return @{ $self->{children} || [] };
    }

    sub add_children {
        my $self = shift;
        push @{$self->{children}}, @_;
    }

    sub stringify {
        my $self = shift;

        my $text = $self->heading;
        foreach my $child ( $self->children ) {
            no warnings 'uninitialized';
            $text .= blessed($child) ? $child->stringify : $child;
        }

        return $text;
    }

    sub Heading_Depth {
        my $class  = shift;
        my $heading = shift || '';

        $heading =~ /^(=*)/;
        my $depth = length $1;


        return $depth;
    }

}

__DATA__
= Heading (1)

Some text

= Heading (2)

Some more text

== Subheading (3)

Some details here

== Subheading (3)

Some details here

= Heading (4)

daotoad 2009-02-02 21:03:30

Answer 4

A:

daotoad and jrockway are absolutely right. If you're trying to parse a tree-like data structure, bending regex to your will only results in a brittle inscrutable and still-not-general-enough intricate blob of code.

If you insist, though, here's a revised snippet that works. Matching up to same-depth separator OR end of string is one complication. Matching strings at depths less then or equal the current depth is more challenging and needed a two-step.

#!/usr/bin/perl

my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\2 matches the 2nd parenthesized group)
if ( $all_lines =~ m/((=+) [^\n]*\(2\)(.*?))(\n\2 |\z)/s ) {
    # then trim it down to just the point before any heading at lesser depth
    my $some_lines = $1;
    my $depth = length($2);
    if ($some_lines =~ m/(.*?)(\n={1,$depth} |\z)/s) {
        print "$1\n";
    }
}

But my advice is to avoid this route and parse it with something readable and maintainable!

Liudvikas Bukys 2009-02-02 22:05:10

Answer 5

A:

Just for a giggle:

/^(?>(=+).*\(2\))(?>[\r\n]+(?=\1=|[^=]).*)*/m

The lookahead ensures that, if a line starts with an equals sign, there is at least one more equals sign than in the prefix of the original line. Notice that the second part of the lookahead matches any character other than an equals sign, including a linefeed or carriage return. That lets it match an empty line, but not the end of the string.

Alan Moore 2009-02-03 13:48:24

ansaurus

tags:

views:

answers:

Extracting a block of text where the closing expression depends on the opening one.

related questions