tags:

views:

169

answers:

5

I have a text string structured like this:

= Some Heading (1)

Some text

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

= Some other Heading (4)

I want to extract the content of second heading, including any subsection. I do not know beforehand what is the depth of the second heading, so I need to match from there to the next heading that is of the same depth, or shallower, or the end of the string.

In the example above, this would yield:

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

This is where I get stuck. How can I use the matched sub-expression opening the second heading as part of the sub-expression for closing the section.

A: 
#!/usr/bin/perl

my $all_lines = join "", <>;

# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\1 matches the 1st matched group)
if ( $all_lines =~ /(=+ Heading )\([2]\)(.*?)\1/s ) {
    print "$2";
}
codelogic
Thank you, but this does not address the case where the block would be terminated by a heading shallower than the starting one. I'll edit the question to emphasise this.
A: 

This splits the file in sections:

my @all = split /(?=^= )/m, join "", <$filehandle>;
shift @all;
Leon Timmermans
A: 

I'd skip trying to use a complex regex. Instead write a simple parser and build up a tree.

Here's a rough and ready implementation. It's only optimized for lazy coding. You may want to use libraries from CPAN to build your parser and your tree nodes.

#!/usr/bin/perl

use strict;
use warnings;

my $document = Node->new();
my $current = $document;

while ( my $line = <DATA> ) {

    if ( $line =~ /^=+\s/ ) {

        my $current_depth = $current->depth;
        my $line_depth = Node->Heading_Depth( $line );

        if ( $line_depth > $current_depth ) {
            # child node.
            my $line_node = Node->new();
            $line_node->heading( $line );
            $line_node->parent( $current );
            $current->add_children( $line_node );
            $current = $line_node;
        }
        else {

            my $line_node = Node->new();
            while ( my $parent = $current->parent ) {

                if ( $line_depth == $current_depth ) {
                    # sibling node.
                    $line_node->heading( $line );
                    $line_node->parent( $parent );
                    $current = $line_node;
                    $parent->add_children( $current );

                    last;
                }

                # step up one level.
                $current = $parent;
            }
        }

    }
    else {
        $current->add_children( $line );
    }


}

use Data::Dumper;
print Dumper $document;

BEGIN {
    package Node;
    use Scalar::Util qw(weaken blessed );

    sub new {
        my $class = shift;

        my $self = {
            children => [],
            parent   => undef,
            heading  => undef,
        };

        bless $self, $class;
    }

    sub heading {
        my $self = shift;
        if ( @_ ) {
            $self->{heading} = shift;
        }
        return $self->{heading};
    }

    sub depth {
        my $self = shift;

        return $self->Heading_Depth( $self->heading );
    }

    sub parent {
        my $self = shift;
        if ( @_ ) {
            $self->{parent} = shift;
            weaken $self->{parent};
        }
        return $self->{parent};
    }

    sub children {
        my $self = shift;
        return @{ $self->{children} || [] };
    }

    sub add_children {
        my $self = shift;
        push @{$self->{children}}, @_;
    }

    sub stringify {
        my $self = shift;

        my $text = $self->heading;
        foreach my $child ( $self->children ) {
            no warnings 'uninitialized';
            $text .= blessed($child) ? $child->stringify : $child;
        }

        return $text;
    }

    sub Heading_Depth {
        my $class  = shift;
        my $heading = shift || '';

        $heading =~ /^(=*)/;
        my $depth = length $1;


        return $depth;
    }

}

__DATA__
= Heading (1)

Some text

= Heading (2)

Some more text

== Subheading (3)

Some details here

== Subheading (3)

Some details here

= Heading (4)
daotoad
A: 

daotoad and jrockway are absolutely right. If you're trying to parse a tree-like data structure, bending regex to your will only results in a brittle inscrutable and still-not-general-enough intricate blob of code.

If you insist, though, here's a revised snippet that works. Matching up to same-depth separator OR end of string is one complication. Matching strings at depths less then or equal the current depth is more challenging and needed a two-step.

#!/usr/bin/perl

my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\2 matches the 2nd parenthesized group)
if ( $all_lines =~ m/((=+) [^\n]*\(2\)(.*?))(\n\2 |\z)/s ) {
    # then trim it down to just the point before any heading at lesser depth
    my $some_lines = $1;
    my $depth = length($2);
    if ($some_lines =~ m/(.*?)(\n={1,$depth} |\z)/s) {
        print "$1\n";
    }
}

But my advice is to avoid this route and parse it with something readable and maintainable!

Liudvikas Bukys
A: 

Just for a giggle:

/^(?>(=+).*\(2\))(?>[\r\n]+(?=\1=|[^=]).*)*/m

The lookahead ensures that, if a line starts with an equals sign, there is at least one more equals sign than in the prefix of the original line. Notice that the second part of the lookahead matches any character other than an equals sign, including a linefeed or carriage return. That lets it match an empty line, but not the end of the string.

Alan Moore