I'm trying to match the point between the 2nd and 3rd paragraphs so that I can insert some content there. Paragraphs are delimited either by <p> tags or by two newlines, and the two styles are mixed. Here's an example:

text text text text
text text text text

<p>
text text text text
text text text text
</p>
<--------------------------- want to insert text here
<p>
text text text text
text text text text
</p>

+3  A: 

Assuming there are no nested paragraphs...

my $to_insert = get_thing_to_insert();
$text =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;  # braces as delimiters, since </p> contains a /

should just about do it.

With extended formatting:

$text =~ s{
    (             # a group
        (?:       # containing ...
            <p>   # the start of a paragraph
            .*?   # to...
            </p>  # its closing tag
        |         # OR...
           \n\n   # two newlines alone. 
        ){2}      # twice
    )             # and take all of that...
}
{$1$to_insert}xms; # and append $to_insert to it

Note that I used \n\n as the delimiter; if you're using a Windows-style text file, this needs to be \r\n\r\n, or if line endings might be mixed, something like \r?\n\r?\n to make the \r optional.
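A minimal sketch of the mixed line-ending variant, in case it helps. The helper name insert_after_second and the sample strings are made up for illustration; only the pattern comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): inserts $to_insert
# after two consecutive delimiters, accepting either \n\n or \r\n\r\n.
sub insert_after_second {
    my ($text, $to_insert) = @_;
    $text =~ s{((?:<p>.*?</p>|\r?\n\r?\n){2})}{$1$to_insert}s;
    return $text;
}

# Made-up sample data: the same text with Unix and Windows line endings.
my $unix = "one\n\n<p>two</p><p>three</p>";
my $dos  = "one\r\n\r\n<p>two</p><p>three</p>";

print insert_after_second($unix, '[NEW]'), "\n";
print insert_after_second($dos,  '[NEW]'), "\n";
```

In both cases the marker lands right after `<p>two</p>`, because the optional \r lets the same pattern cover either style of blank line.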

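Also note that the <p> blocks can have double newlines in them: the engine takes the leftmost match, so once it starts at an opening <p>, the .*? (with /s) runs through to the matching </p>, blank lines included. The order of the two alternatives makes no difference here, because <p> and \n\n can never match at the same position. If you want double newlines inside the <p>'s to count as delimiters, you would have to stop .*? from crossing them.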
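To see how a blank line inside a <p> block is handled, here is a small sketch. The sample text and the [NEW] marker are assumptions for illustration only.

```perl
use strict;
use warnings;

# Made-up sample: the first <p> block contains a blank line of its own.
my $text      = "<p>first\n\nstill first</p>\n\nthird";
my $to_insert = '[NEW]';

# Copy the text, then apply the substitution to the copy.
(my $out = $text) =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;

print $out, "\n";
```

With this input the whole `<p>...</p>` block is consumed as the first delimiter and the blank line after it as the second, so the marker ends up just before "third" rather than inside the paragraph.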

Robert P
+1: Capturing a non-capturing group? That's gonna take some getting used to. Oh, you might need to add the `/s` modifier if you want `.` to match `\n`.
Zaid
Ah, true true. Fixing.
Robert P
Also brought in the part about the double newline.
Robert P
A: 

Text:

my $text = '
text text text text
text text text text

<p>
text text text text
text text text text
</p>
<p>
text text text text
text text text text
</p>
';

This should work with:

our $cnt = 0;
our $where = 2;

my $new_stuff='<- want to insert text here';
$text =~ s/
           (
            (?:\n|<\/p>)\n
           )
           (?{ ++$cnt })
           (??{ $cnt == $where ? '' : '(?!)' })   # (?!) always fails, so earlier delimiters are skipped
          /$1$new_stuff\n/xs;

Result:

text text text text
text text text text

<p>
text text text text
text text text text
</p>
<- want to insert text here
<p>
text text text text
text text text text
</p>

Regards

rbo

rubber boots
Instead of downvoting (don't be stupid, it costs you a point) please give advice what's wrong with the answer and how to do better. Thanks.
rubber boots
I didn't do the downvote, but I suspect it has to do with the fact that he needed a regex only solution, and this needed variables.
Robert P
@Robert, I see, I missed that. Thanks! rbo
rubber boots
A: 

Instead of using a regular expression, use an HTML tree walker to find the second paragraph and add whatever you like. I talked about this sort of thing in my "Process HTML with a Perl module" article for InformIT.

The advantage of something like HTML::TreeBuilder is that you deal with the logical structure of the HTML rather than contending with the position and order of random characters in a regular expression. If the structure stays the same, a tree walker should keep working. If you change almost anything, the regex is probably going to break.

An HTML::TreeBuilder example looks something like this:

#!perl
use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::Element;

my $html  = HTML::TreeBuilder->new;
my $root  = $html->parse_file( *DATA );

my $second = ( $root->find_by_tag_name('p') )[1];

my $new_para = HTML::Element->new( 'p' );
$new_para->push_content( 'Add this line' );

$second->postinsert( $new_para );

print $root->as_HTML( undef, "\t", {} );

__END__
<p>
This is the first paragraph
</p>

<p>
This is the second paragraph
</p>

<p>
This is the last paragraph
</p>

If you need to clean up your data first, you can throw in some steps to use HTML::Tidy with the enclose_text option.

brian d foy
I was caught by this mistake too, but there are two problems. The question specifically mentioned that it has to be a regex - "the language is perl, but i'm limited to regex only, no scripting can be done - this is in context of a template language." In addition, two newlines also count as a 'new paragraph' - HTML::TreeBuilder can't know about that requirement, since two newlines don't make a valid HTML paragraph.
Robert P
Actually, the template stuff is hidden in a comment.
brian d foy
Ok, that point conceded. However the "two newlines counts as a paragraph" requirement is still a problem.
Robert P
For that, I'd go through all the templates and make them valid HTML. Building solutions on broken situations is just more future pain.
brian d foy