Hi everyone! I need to parse an HTML file, and I've got something like this:

<TAG1>
    <TAG1>
        TEXT_TO_FIND
        KEY
        <TAG1>
        </TAG1>
        <TAG1>
        </TAG1>
    </TAG1>
</TAG1>

Note that there are multiple levels of nesting. How can I get the text TEXT_TO_FIND?

In plain English, I need the text between the last <TAG1> that has the text KEY after it and the text KEY itself, which appears only once in the document.

Note 1: I found this question, but its answer didn't seem to work for me; I kept getting an empty result. This is the expression I tried:

/<TAG1>(?!.*<TAG1>)(.*)KEY/ism

Note 2: If I remove KEY from the expression in the previous note, I get the text from the last <TAG1> to the end of the file.

Thanks everyone in advance!

A: 

"I need to parse an HTML file"

Then you need an HTML parser. Regular expressions alone aren't powerful enough to handle nested markup properly.

Once you've parsed the HTML and got the contents of each of your TAGs, you can use something like:

/(.*)KEY/is

to check whether the text contains KEY and if so, to grab the stuff that precedes it.
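As a rough sketch of that second step, assuming the parser has already handed back the text of each tag as a plain string: the `text_before_key` helper and the sample `@chunks` below are made up for illustration and are not part of any parser's API. Because `(.*)` is greedy and `/s` lets it cross newlines, the capture includes surrounding whitespace, so the sketch trims it.

```perl
use strict;
use warnings;

# Hypothetical helper: given text chunks already extracted by an HTML parser,
# return the trimmed text preceding KEY in the first chunk that contains it.
sub text_before_key {
    my @chunks = @_;
    for my $chunk (@chunks) {
        if ($chunk =~ /(.*)KEY/is) {
            my $text = $1;
            $text =~ s/^\s+|\s+$//g;   # strip leading and trailing whitespace
            return $text;
        }
    }
    return;   # nothing returned when no chunk contains KEY
}

# Stand-in data: these chunks are invented to mimic what a parser might return.
my @chunks = ("\n    ", "\n        TEXT_TO_FIND\n        KEY\n        ", "\n    ");
print text_before_key(@chunks), "\n";
```

Note that `/i` makes the match case-insensitive, so a lowercase "key" in the text would also match; drop the flag if that is not wanted.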

Anon.
A: 

If you just don't want to use an HTML parser, this regexp works as long as TEXT_TO_FIND does not contain "<" or ">":

/\s*([^<>]*?)\s*?KEY/ism
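One way to try this pattern against the sample from the question is the sketch below. The `find_before_key` helper and the `$html` variable are illustrative names only; the sketch assumes KEY occurs once and that the text before it contains no angle brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text right before KEY, or undef if
# the pattern does not match.
sub find_before_key {
    my ($html) = @_;
    return $html =~ /\s*([^<>]*?)\s*?KEY/ism ? $1 : undef;
}

# Sample input copied from the question.
my $html = <<'HTML';
<TAG1>
    <TAG1>
        TEXT_TO_FIND
        KEY
        <TAG1>
        </TAG1>
        <TAG1>
        </TAG1>
    </TAG1>
</TAG1>
HTML

my $found = find_before_key($html);
print defined $found ? "$found\n" : "no match\n";
```

The character class `[^<>]` is what keeps the capture from crossing a tag boundary, which is why the match can only begin after the closest preceding ">".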
Leventix
Thanks, this solved it! PS: Yes, I should probably use an HTML parser.
A: 

Use each tool in its appropriate context: find text chunks with an HTML parser, and then match against those with regular expressions.

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser;

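my $p = HTML::Parser->new(
  api_version => 3,
  # Call the handler for every run of text between tags.
  text_h => [
    sub {
      local($_) = @_;
      # Print whatever precedes the standalone word KEY, without the
      # surrounding whitespace.
      print $1, "\n" if /(\S.+?)\s*\bKEY\b/s;
    },
    "dtext"  # pass the text with HTML entities decoded
  ],
);

# for demo only: read from the __DATA__ section instead of a file
*ARGV = *DATA;

undef $/;  # slurp the whole input at once
$p->parse(<>);
$p->eof;   # flush any text still buffered by the parser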

__DATA__
<TAG1>
    <TAG1>
        TEXT_TO_FIND
        KEY
        <TAG1>
        </TAG1>
        <TAG1>
        </TAG1>
    </TAG1>
</TAG1>

Output:

$ ./find-text
TEXT_TO_FIND
Greg Bacon