I am trying to rework a large number of pages across many sites. The pages may contain JavaScript, PHP, or ASP code in addition to HTML. The problem I'm encountering is that the module rewrites things I don't want rewritten. I've managed to preserve most of the symbols (e.g., ", >) inside HTML elements like script, but in the PHP sections they get changed into entities (e.g., &quot;, &gt;). Plus, the PHP tags are stripped out at the same time.

If I have a PHP file that looks like this:

<html>
  <head><title>My Page</title></head>
  <body>
    <p>Some cruft &nbsp; which I want to repeat</p>
    <form name="foo"> (form content to be replaced)
    </form>
    <script type="JavaScript">
       <!--
       Some javaScript to be left alone
       -->
    </script>
    <a href="somepage.php">Link to be removed</a>
    <?php
       if (strlen($txtKeyword) > 2)
         {
           echo " or <a href=\"database_search_keyword.htm\">Search again?</a></p>";
           if(isset($_REQUEST['nr']))
         {
           $numRows = $_REQUEST['nr'];
           ....
    ?>
  </body>
</html>

I want the final result to look like:

<html>
  <head><title>My Page</title></head>
  <body>
    <p>Some cruft &nbsp; which I want to repeat</p>
    <ul><li>List replacing form</li>
    </ul>
    <script type="JavaScript">
       <!--
       Some javaScript to be left alone
       -->
    </script>
    <?php
       if (strlen($txtKeyword) > 2)
         {
           echo " or <a href=\"database_search_keyword.htm\">Search again?</a></p>";
           if(isset($_REQUEST['nr']))
         {
           $numRows = $_REQUEST['nr'];
           ....
    ?>
  </body>
</html>

As I said, I'm able to get everything working except the PHP. It gets mangled, so the result looks like this:

<html>
  <head><title>My Page</title></head>
  <body>
    <p>Some cruft &nbsp; which I want to repeat</p>
    <ul><li>List replacing form</li>
    </ul>
    <script type="JavaScript">
       <!--
       Some javaScript to be left alone
       -->
    </script>
    <?php
      if (strlen($txtKeyword) &gt; 2)
        {
          echo &quot; or &quot;;
          if(isset($_REQUEST[&#39;nr&#39;]))
        {
          $numRows = $_REQUEST[&#39;nr&#39;];
          ....
    ?>
  </body>
</html>

I have been working with HTML::TreeBuilder 3.23. I've also tried the developer release 3.23_3, but it gives an error message because of the PHP code (e.g., <a> has an invalid attribute name '"&section_id' ' . $section_id . ' ).

Example code for what I've done so far (with the filesystem walking, etc. chopped out) is

#!/usr/bin/perl -w

use strict;

use HTML::TreeBuilder;

# Set up replacement forms
my $artistSearch = HTML::Element->new ('~literal', 'text', <<EOF);
<p>Please select from the list below.</p>
<ul>
  <li><a href="http://firstlink.com/">item 1</a></li>
  <li><a href="http://secondlink.com/">item 1</a></li>
</ul>
EOF

my $filename = "AFA.php";
my $file = HTML::TreeBuilder->new();
$file->store_comments(1);
$file->ignore_ignorable_whitespace(1);
$file->no_space_compacting(1);
my $tree = $file->parse_file($filename);


my $form = $tree->find_by_tag_name('form');
my $fname = $form->attr('name');
if ($fname eq 'mainform') {
  $form->delete;
} elsif ($fname eq 'artist_search') {
  $form->replace_with($artistSearch)->delete;
} else {
  # It's a form we're not changing
}

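my $printout = $file->as_HTML("", "  ", {});
# Three-argument open with a lexical handle; fail loudly if the file cannot be written
open(my $page, '>', $filename) or die "Cannot write $filename: $!";
print {$page} $printout;
close($page) or die "Cannot close $filename: $!";
$file->delete;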

I am open to any suggestions, examples, etc. I'm not necessarily tied to any particular module, but I'm not exactly an expert programmer.

Thank you!

+3  A: 

The problem here is obviously the <?php .. ?> tag. You can work around it with a pre-parser that hides the PHP from TreeBuilder. I'll use a simple regex for this:

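use strict;
use warnings;
use Data::Dumper;

# Slurp all of STDIN (or the files named on the command line) into $_
undef $/;
$_ = <>;

# Swap each <?php ... ?> block for a token, saving the code in order.
# /s lets . match newlines, so multi-line PHP blocks are caught too.
my @phps;
push @phps, $1 while s/<\?php\s+(.*?)\s*\?>/__PHP_CODE__/s;

# Dump the tokenised text and the captured PHP for inspection
die Dumper [$_, \@phps];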

You can try it:

echo -n "foo<?php phpfoo ?> bar <?php phpbar ?> baz" | perl filter.pl


$VAR1 = [
          'foo__PHP_CODE__ bar __PHP_CODE__ baz',
          [
            'phpfoo',
            'phpbar'
          ]
        ];

When you're done with the HTML, you can do the reverse: take the PHP code out of the @phps array and put it back into the output in its original order:

my $count = 0;
s/__PHP_CODE__/<?php $phps[$count++] ?>/g;

Make no mistake, this is a hack, but it gets the job done without much thought, and it is fairly simple to implement. I can think of a ton of better ways to do this, such as extending HTML::Element to include a pseudo <?php .. ?> element. What you don't want is to undo the mangling (like character encoding) done by HTML::Element in TT; that sounds like a far worse idea to me. You could even implement the step that goes from the __PHP_CODE__ token back to the real PHP code as a Template filter.
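One weakness of the positional restore above: if the TreeBuilder pass deletes or moves an element that contains a token, the counter hands the wrong PHP block to every later token. Here is a minimal sketch of a variant that numbers each token, so the restore no longer depends on order. The sub names protect_php and restore_php and the __PHP_CODE_n__ token format are made up for illustration, and this version stores the whole block, tags included, rather than just the body:

```perl
use strict;
use warnings;

# Replace each <?php ... ?> block with a numbered placeholder and
# return the modified text plus a reference to the captured blocks.
sub protect_php {
    my ($html) = @_;
    my @blocks;
    # push returns the new element count, so count - 1 is the index
    # of the block that was just stored.
    $html =~ s{(<\?php\s.*?\?>)}{'__PHP_CODE_' . (push(@blocks, $1) - 1) . '__'}gse;
    return ($html, \@blocks);
}

# Put each captured block back wherever its numbered placeholder ended up.
sub restore_php {
    my ($html, $blocks) = @_;
    $html =~ s/__PHP_CODE_(\d+)__/$blocks->[$1]/g;
    return $html;
}
```

If a placeholder disappears along with a deleted form, its PHP block is simply never restored, and the remaining placeholders still get the right code.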

It should be noted that this doesn't handle short tags (though it easily could!), and I'm not sure of the exact logic that triggers the PHP interpreter (escaping <?php or ?>, for instance). It should be obvious, but I'll say it anyway: this pays no respect to PHP code like this:

echo '?>';
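For what it's worth, here is a rough sketch of how the pattern might be widened to cover short tags, <?= ... ?>, and the ASP blocks mentioned in the question, while treating a quoted string as a unit so that a ?> inside it does not end the match. The names $quoted, $php_block, $asp_block and find_server_blocks are hypothetical. This is still a regex, not a PHP tokenizer: heredocs and apostrophes inside comments can fool it, and it may backtrack heavily on unterminated blocks.

```perl
use strict;
use warnings;

# A single- or double-quoted string with backslash escapes.
my $quoted = qr/'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"/s;

# <?php ... ?>, <?= ... ?> and short <? ... ?> blocks. The lookaheads
# require whitespace after "<?php" or "<?", so <?xml ... ?> is left alone.
my $php_block = qr/<\?(?:php(?=\s)|=|(?=\s))(?:$quoted|.)*?\?>/s;

# Classic ASP <% ... %> blocks.
my $asp_block = qr/<%.*?%>/s;

# Return every server-side block found in the text, in document order.
sub find_server_blocks {
    my ($text) = @_;
    my @found = $text =~ /($php_block|$asp_block)/g;
    return @found;
}
```

The same alternation could be dropped into the substitution in place of the plain <?php pattern.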
Evan Carroll
This looks entirely workable ... assuming I can wrap my head around getting it to work with the rest of the code. Thank you for the quick reply, and I'll update after I've chewed on it a while.
tmsilver
This seems to be working, but I'm having to do a lot of read/write to get it to work. I end up 1) reading the file, replacing PHP with the token; 2) writing the file with the token; 3) reading the file for TreeBuilder; 4) writing the file with the TB changes; 5) reading the file, replacing the token with the code; 6) writing the file with the complete changes. If I try to skip any of those, it omits or overwrites some of my changes. This could be a newbie thing... Thanks for your help!
tmsilver
You don't have to "write the file" more than once. You can do it all in memory. You can (1) *slurp* the file, (2) sub the php tokens (3) run `new_from_content` with TreeBuilder, (4) transform, (5) `->as_HTML` it, (6) run regex to replace php tokens (7) write the file once. This is still a lot of passes in memory - many not needed in theory, but that's still not file io.
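A rough sketch of those seven steps with one read and one write. The sub names (hide_php, show_php, transform_html, rework_file) are made up, the form handling is only a stand-in borrowed from the question, and HTML::TreeBuilder is loaded lazily inside transform_html so the two regex helpers can be tried without it. Unlike the snippet in the answer, hide_php keeps the whole block, tags included:

```perl
use strict;
use warnings;

# (2) Swap each PHP block for a token, keeping the blocks in order.
sub hide_php {
    my ($text) = @_;
    my @phps;
    push @phps, $1 while $text =~ s/(<\?php\s.*?\?>)/__PHP_CODE__/s;
    return ($text, \@phps);
}

# (6) Put the blocks back, one token at a time, in the same order.
sub show_php {
    my ($text, $phps) = @_;
    my $i = 0;
    $text =~ s/__PHP_CODE__/$phps->[$i++]/ge;
    return $text;
}

# (3)-(5) Parse from a string, transform, serialise back to a string.
sub transform_html {
    my ($html) = @_;
    require HTML::TreeBuilder;    # CPAN module, loaded on first call
    my $tree = HTML::TreeBuilder->new_from_content($html);
    for my $form ($tree->find_by_tag_name('form')) {
        my $name = $form->attr('name') // '';
        $form->delete if $name eq 'mainform';    # stand-in transform
    }
    my $out = $tree->as_HTML("", "  ", {});
    $tree->delete;
    return $out;
}

sub rework_file {
    my ($filename) = @_;

    # (1) Slurp the file in one read.
    open my $in, '<', $filename or die "Cannot read $filename: $!";
    my $text = do { local $/; <$in> };
    close $in;

    my ($masked, $phps) = hide_php($text);
    my $final = show_php(transform_html($masked), $phps);

    # (7) Write the result once.
    open my $out, '>', $filename or die "Cannot write $filename: $!";
    print {$out} $final;
    close $out or die "Cannot close $filename: $!";
    return;
}
```

Calling rework_file('AFA.php') would then touch the disk twice in total; everything in between is string handling in memory.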
Evan Carroll
Hmmm. I'm not at work, so I don't have direct access to the files. I think the piece I am missing is the `new_from_content` - I was trying to slurp in the file, do the substitution, then do `parse_content` , transform, set a new variable to `->as_HTML`, then replace the tokens. That's where things went wonky... I'll keep chewing - and thanks again for your help!
tmsilver
Thank you for all the help! This has worked wonderfully.
tmsilver