I have this input text:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">   <tbody><tr>     <td><table cellspacing="0" cellpadding="0" border="0" width="603">       <tbody><tr>         <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td>         <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td>       </tr>     </tbody></table></td>   </tr>   <tr>     <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">       <tbody><tr>         <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>       </tr>       <tr>         <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td>         <td width="580"><p>&nbsp;what y all heard?</p><p>i'm shark oysters.</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p></td>         <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td>       </tr>       <tr>         <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td>       </tr>     </tbody></table></td>   </tr> </tbody></table> <p>&nbsp;</p></body></html>

As you can see, there's no newline in this chunk of HTML text, and I need to look for all image links inside, copy them out to a directory, and change the line inside the text to something like ./images/file_name.

Currently, the Perl code that I'm using looks like this:

my ($old_src, $new_src, $folder_name);
foreach my $record (@readfile) {
    ## so the if else case for the url replacement block below will be correct
    $old_src = "";
    $new_src = "";
    if ($record =~ /\<img(.+)/) {
        if ($1 =~ /src=\"((\w|_|\\|-|\/|\.|:)+)\"/) {
            $old_src = $1;
            my @tmp = split(/\/Elearning/, $old_src);
            $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
            push(@images, $new_src);
            $folder_name = "images";
        } ## end if
    }
    elsif ($record =~ /background=\"(.+\.jpg)/) {
        $old_src = $1;
        my @tmp = split(/\/Elearning/, $old_src);
        $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
        push(@images, $new_src);
        $folder_name = "images";
    }
    elsif ($record =~ /\<iframe(.+)/) {
        if ($1 =~ /src=\"((\w|_|\\|\?|=|-|\/|\.|:)+)\"/) {
            $old_src = $1;
            my @tmp = split(/\/Elearning/, $old_src);
            $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
            ## remove the ?rand behind the html file name
            if ($new_src =~ /\?rand/) {
                my ($fname, $rand) = split(/\?/, $new_src);
                $new_src = $fname;
                my ($fname, $rand) = split(/\?/, $old_src);
                $old_src = $fname . "\\?" . $rand;
            }
            print "old_src::$old_src\n";   ##s7test
            print "new_src::$new_src\n\n"; ##s7test
            push(@iframes, $new_src);
            $folder_name = "iframes";
        } ## end if
    } ## end if

    my $new_record = $record;
    if ($old_src && $new_src) {
        $new_record =~ s/$old_src/$new_src/;
        print "new_record:$new_record\n"; ##s7test
        my @tmp = split(/\//, $new_src);
        $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/;
        ##  print "new_record2:$new_record\n\n"; ##s7test
    } ## end if
    print WRITEFILE $new_record;
} # foreach

This only works for HTML text that has newlines in it, because each record is matched just once. I thought of simply looping the regex statement, but then I would have to change the matched text to something else so it doesn't match again.

Is there an elegant Perl way to do this? Or maybe I'm just too dumb to see the obvious way of doing it. I also know that just adding the global option doesn't work.

thanks. ~steve

+9  A: 

There are excellent HTML parsers for Perl; learn to use them and stick with that. HTML is complex: it allows > in attributes, uses nesting heavily, etc. Using regexes to parse it, beyond very simple tasks (or machine-generated code), is prone to problems.

PhiLho
Hi there, I'm using mod_perl and we're running on Unix. I need management approval to add a module, so I was hoping to find a simple Perl way to get it done, or maybe a default module in mod_perl. Thanks.
melaos
Well, you can always look at the module source. As for management, you can tell them that someone has already done it correctly, and if you get to use the existing correct solution, they save time and money and you can move on to the next problem.
brian d foy
Makes sense. I would much rather use the tested and proven method than another one of my horrendous hacks... hope my pointy-haired boss obliges.
melaos
Nothing like re-inventing the wheel, and ending up with a rectangular 'wheel'.
Brad Gilbert
+2  A: 

If you must avoid any additional module, like an HTML parser, you could try:

while ($string =~ m/(?:<\s*(?:img|iframe)\b[^>]+src\s*=\s*"((?:\w|\\|\?|=|-|\/|\.|:)+)"|background\s*=\s*"([^">]+\.jpg)")/g) {
    # $1 is set for an img/iframe src, $2 for a background attribute
    $old_src = defined $1 ? $1 : $2;
    my @tmp = split(/\/Elearning/, $old_src);
    $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
    if ($new_src =~ /\?rand/) {
        # remove the ?rand part, then push into @iframes
        $new_src =~ s/\?.*//;
        push(@iframes, $new_src);
    }
    else {
        push(@images, $new_src);
    }
}

That way, you apply the regex to the whole source at once (newlines included) and get more compact code. It also tolerates extra whitespace between attributes and their values.
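
To also change the links inside the text in the same pass, a substitution with the /g and /e modifiers can replace every match where it stands, so nothing is matched twice. The following is only a minimal sketch using core modules: rewrite_tag and rewrite_links are hypothetical helper names, the ./images and ./iframes folders follow the question, and it assumes double-quoted attribute values and no > inside attributes.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: rewrite the src/background attribute inside one tag
# and record the original path in @$found.
sub rewrite_tag {
    my ($tag, $name, $found) = @_;
    my $folder = $name eq 'iframe' ? 'iframes' : 'images';
    # src only counts on img/iframe tags; background counts on any tag
    my $attrs = $name =~ /^(?:img|iframe)$/ ? 'src|background' : 'background';
    $tag =~ s{(?<![\w-])($attrs)(\s*=\s*)"([^"]+)"}{
        my ($attr, $eq, $url) = ($1, $2, $3);
        my ($path) = split /\?/, $url;    # drop a trailing ?rand... part
        push @$found, { folder => $folder, path => $path };
        qq{$attr$eq"./$folder/} . basename($path) . '"';
    }gie;
    return $tag;
}

# Hypothetical helper: visit every opening tag in the whole string, newlines
# or not, and return the rewritten HTML plus the list of paths to copy.
sub rewrite_links {
    my ($html) = @_;
    my @found;
    $html =~ s{(<\s*(\w+)\b[^>]*>)}{ rewrite_tag($1, lc $2, \@found) }ge;
    return ($html, \@found);
}
```

Calling my ($new_html, $found) = rewrite_links($string); leaves $string untouched, and each entry of @$found holds the original path and its target folder, which is what a later copy step needs.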

VonC
People really should leave comments for down-voting. +1 because you're answering for a particular all-too-real case.
Axeman
Just got back to my post. That was down-voted? Sure, an HTML parser is the way to go, but I like to also address the actual case of the user. Thank you, Axeman, for recognizing this "answer" for what it is.
VonC
Yeah, this answer matches my case properly, as I really can't simply introduce more modules unless necessary :)
melaos
+3  A: 

I think you want my HTML::SimpleLinkExtor module:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $file );

my @imgs = $extor->img;

I'm not sure what exactly you're trying to do, but it surely sounds like one of the HTML parsing modules should do the trick if mine doesn't.
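
Once the links are collected (for example in @imgs above), copying the files needs only core modules. This is a rough sketch, not a finished tool: copy_links is a hypothetical helper name, and it assumes every link is a site-absolute path that exists under a single document root on disk.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Path qw(make_path);

# Hypothetical helper: copy each linked file from under $doc_root into
# $dest_dir and return a hash ref mapping original link => new relative link.
sub copy_links {
    my ($links, $doc_root, $dest_dir) = @_;
    make_path($dest_dir) unless -d $dest_dir;
    my $folder = basename($dest_dir);
    my %new_link;
    for my $link (@$links) {
        my $name   = basename($link);
        my $source = $doc_root . $link;    # assumes $link starts with "/"
        if (copy($source, "$dest_dir/$name")) {
            $new_link{$link} = "./$folder/$name";
        }
        else {
            warn "Could not copy $source: $!\n";
        }
    }
    return \%new_link;
}
```

With the paths from the question, the call would presumably look like copy_links(\@imgs, "/media/www/vprimary", "$export_dir/images"), where $export_dir is a made-up name for the export directory. The returned map can then drive the replacement of each src in the HTML.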

brian d foy
Well, basically I'm trying to export the HTML as an external file, so I need to copy the images into an images folder and change each img src in the original HTML to point there.
melaos
That's the sort of information you should include in your question, not buried in a comment. :)
brian d foy