ansaurus

Question

How can I remove every third HTML tag in Perl?

Answer 1

+2 A:

Well, you're right that you shouldn't be parsing HTML with regular expressions. And since that is the case, it probably won't "just work."

Ideally, you need to be using an HTML parsing and manipulation library. Don't think of HTML as a big string for you to manipulate with text functions: it's a serialized, formatted data structure. You should monkey with it only using a library built for that purpose. The various libraries have already fixed the hundreds of bugs that you are likely to face, making it a zillion times more likely that a simple HTML manipulation routine written against them will "just work." Master-level Perl programmers would generally not parse HTML this way, and it's not because they're obsessive and irrational about code quality and purity -- it's because they know that reinventing the wheel themselves is unlikely to yield something that rolls as smoothly as the existing machinery.

I recommend HTML::Tree because it functions the way I think of HTML (and XML). I think there are a couple of other libraries that may be more popular.
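As a rough illustration of the library approach, here is a minimal sketch using HTML::TreeBuilder from the HTML-Tree distribution (it has to be installed from CPAN). It assumes the tags in question are <div class="cell"> elements, that cells are not nested inside each other, and that "every third" means the third of every four cells, as the $int % 4 == 3 test in the question's code suggests. The helper name remove_third_cells is made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # from the HTML-Tree distribution on CPAN

# Hypothetical helper: delete the 3rd, 7th, 11th, ... <div class="cell">
# from an HTML string and return the re-serialized document.
sub remove_third_cells {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context, look_down returns every matching element
    # in document order.
    my @cells = $tree->look_down(_tag => 'div', class => 'cell');

    my $count = 0;
    for my $cell (@cells) {
        $count++;
        $cell->delete if $count % 4 == 3;   # same test as the question's code
    }

    my $out = $tree->as_HTML;
    $tree->delete;                          # release the parse tree
    return $out;
}
```

Note that as_HTML returns a complete document (it adds html, head and body elements), so the output will not be byte-for-byte the input minus the removed cells.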

The real truth is, if you can't even get your program to compile, you need to invest a little more time (a half day or so) figuring out the basics before you come looking for help. You have an error in your syntax for using the s///g regular expression substitution operator, and you need to find out how that is supposed to work before you go any further. It's not hard, and you can find out what you need from the Camel book, or the perlretut manpage, or several other sources. If you don't learn how to debug your program now, then likely any help you receive here is just going to take you to the next syntax error which you won't be able to get past.

skiphoppy 2009-03-16 01:30:28

It isn't the fact that HTML is a serialized data structure that makes it hard to work with; it is the fact that constructs like this are valid: <img src="ptag.png" alt="<p>">.
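To see why that construct matters, here is a small sketch of a naive regex-based tag counter. The helper naive_p_count and the sample string are invented for this example; the point is only that the pattern cannot tell a real tag from tag-like text inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: count opening <p> tags the naive way, with a regex.
sub naive_p_count {
    my ($html) = @_;
    # The "= () =" idiom forces list context and counts the matches.
    my $count = () = $html =~ /<p\b[^>]*>/g;
    return $count;
}

# One real <p> element, plus an <img> whose alt text merely looks like a tag.
my $sample = '<p>text</p><img src="ptag.png" alt="<p>">';
print naive_p_count($sample), "\n";   # prints 2, although there is only one <p>
```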

Chas. Owens 2009-03-17 18:47:19

Answer 2

+3 A:

When your code doesn't compile, read the error and warning messages you get. If they don't make sense, consult perldoc perldiag (or put "use diagnostics;" in your code to automatically do this for you).

ysth 2009-03-16 01:54:19

Answer 3

+1 A:

The subroutine has lost its way. Start by taking a look at the structure of that:

sub remove {                                   # First opening bracket
    my $input = $ARGV[0];
    my $output = $ARGV[1];
    open INPUT, $input or die "couldn't open file $input: $!\n";
    open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";

    my @file = <INPUT>;
    foreach (@file) {                          # Second opening bracket
        my $int = 0;
        if ($_ =~ '<div class="cell">') {      # Third opening bracket
        $int++;
        {                                      # Fourth opening bracket
        if ($int % 4 == 3) {                   # Fifth opening bracket
        $_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
            }                                  # First closing bracket
    }                                          # Second closing bracket
    print OUTPUT @file;
}                                              # Third closing bracket
                                               # No fourth closing bracket?
                                               # No fifth closing bracket?

I think you wanted this:

sub remove {
    my $input = $ARGV[0];
    my $output = $ARGV[1];
    open INPUT, $input or die "couldn't open file $input: $!\n";
    open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";

    my @file = <INPUT>;
    foreach (@file) {
        my $int = 0;
        if ($_ =~ '<div class="cell">') {
          $int++;
        }
        if ($int % 4 == 3) {
          $_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
        }
    }
    print OUTPUT @file;
}

That will compile, and takes us to the next issue: Why are you single-quoting the regex? (Also see Cebjyre's point about the placement of my $int = 0.)
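A small sketch of what the single quotes do, assuming each cell sits on a single line. With a quoted string on the right of =~, Perl treats the whole string (including the leading /s and the trailing /;/g) as a pattern to match, so nothing is ever substituted. The helper strip_cell is a made-up name that shows the real s///g operator for comparison.

```perl
use strict;
use warnings;

my $line = '<div class="cell">third</div>';

# Quoted string: this is a plain match against a pattern that starts with
# a literal "/s", so it does not match and $copy is left untouched.
my $copy    = $line;
my $matched = ($copy =~ '/s\<div class="cell">\+.*<\/div>/;/g') ? 1 : 0;
print "matched: $matched\n";                       # prints "matched: 0"
print "unchanged\n" if $copy eq $line;             # prints "unchanged"

# Hypothetical helper using the actual substitution operator.
# Braces are used as delimiters so the "/" in </div> needs no escaping.
sub strip_cell {
    my ($text) = @_;
    $text =~ s{<div class="cell">.*?</div>}{}g;    # non-greedy: one cell at a time
    return $text;
}

print strip_cell($line) eq '' ? "removed\n" : "still there\n";   # prints "removed"
```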

(To pick up on ysth's point, you can also always run a script with perl -Mdiagnostics script-name to get the longer diagnostic messages.)

Telemachus 2009-03-16 01:57:37

Answer 4

+4 A:

Ken Fox 2009-03-16 02:17:47

close -- but it cuts off the fourth div also

Overflown 2009-03-16 03:30:32

You were closest, so you win. But see the above.

Overflown 2009-03-16 06:23:50

Did you test that or are you guessing that it's wrong because I left the </div> out at the end?

Ken Fox 2009-03-16 12:30:57

Answer 5

+2 A:

Once you get the squiggly brackets matching each other, and start using the substitution regex properly, you also need to move the

my $int = 0;

out of the for loop - it is currently being reset on every line that is read, so it will only ever have the value of 0 or 1.
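Putting those three fixes together, a sketch of the loop might look like this. It still uses a regex, so it assumes each <div class="cell">...</div> sits on one line with at most one cell per line; the helper name remove_third_of_four is invented here, and the % 4 == 3 test is taken from the question's code.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the file's lines and returns them with the
# 3rd, 7th, 11th, ... cell stripped out.
sub remove_third_of_four {
    my @lines = @_;
    my $count = 0;                       # declared once, outside the loop

    for my $line (@lines) {              # $line aliases each element of @lines
        next unless $line =~ /<div class="cell">/;
        $count++;                        # keeps counting across lines
        if ($count % 4 == 3) {
            $line =~ s{<div class="cell">.*?</div>}{}g;
        }
    }
    return @lines;
}
```

With my $count = 0 inside the loop instead, $count would be 1 at most when the modulus test runs, so the substitution would never fire.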

Cebjyre 2009-03-16 02:24:33
