I have a folder with multiple files, and I'd like to remove all <script> tags and everything in between, e.g.:

This:

<script type="text/javascript">function(foo);</script>

As well as this:

<script type="text/javascript" src="scripts.js"></script>

I think in PHP it would be something like this:

<?php $string = preg_replace('#(\n?<script[^>]*?>.*?</script[^>]*?>)|(\n?<script[^>]*?/>)#is', '', $string); ?>

But I'm at a loss when it comes to UNIX.

A: 

You can use perl to replace strings in many files.

perl -pi -w -e 's/search/replace/g;' *.html

-e executes the following line of code
-i edits the files in place
-w enables warnings
-p loops over the input line by line and prints each line

You'll have to come up with the regex on your own, though. (The one you have should work.)

Milan Ramaiya
The asker also wishes to locate <script> and </script> and everything in between.
Hamish Grubijan
Uh, yeah, without correcting for escaped characters, <script>.*</script> does exactly that. It'll need fine-tuning to make sure the correct closing tag is selected (because of greediness).
Milan Ramaiya
A: 

Well, you can run PHP from the command line, or translate that line pretty easily into Perl (the "p" in "preg_replace"). You could use sed to do something similar, but its regexes aren't as flexible. Regexes may or may not be good enough, depending on where your input is coming from and what your goal is.

Draemon
+2  A: 

For example, with gawk:

$ cat file
blah
<script type="text/javascript">function(foo);</script>
<script type="text/javascript" src="scripts.js"></script>
blah
<script type="text/javascript"
    src="script1.js">
</script>
end

$ awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' file
blah




blah


end

So run it inside a for loop to go over your files (e.g. the HTML ones):

for file in *.html
do
  awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' "$file" >temp &&
  mv temp "$file"
done

You can also do it with Perl:

perl -i.bak -0777ne 's|<script.*?</script>||gms;print' *.html
ghostdog74
I went the perl route, and in the context of what I was trying to accomplish, it worked perfectly. Thanks!
+2  A: 

The only way you stand a chance of getting this right is to load the file (I'm assuming it's an HTML file) into an HTML/XML parser and remove the script nodes that way. Any other way will likely fall foul of a <script> tag containing "</script>" as part of its contents, for example:

<script>
    document.write('</script>');
</script>
Rob
Actually, browser DOM parsers will interpret that the same way the regular expression will--with the first </script> as the actual end of the script node. But using regular expressions on HTML is still a bad idea. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Annie
@Annie, fair point, I was half asleep when I wrote that, but yeah, the "don't use regular expressions on HTML" bit still stands :)
Rob
+1  A: 

I'd just use something like HTML::TreeBuilder and remove all SCRIPT nodes as I walk the tree:

#!/usr/local/perls/perl-5.10.1/bin/perl

use 5.010;

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new;
my $root = $html->parse_file( *DATA );

my @queue = ( $root->elementify );

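while( my $element = shift @queue )
    {
    foreach my $child ( $element->content_list )
        {
        next unless ref $child;    # text nodes are plain strings; skip them
        if ( $child->tag eq 'script' )
            {
            $child->delete;        # drop the script element and its contents
            }
        else
            {
            push @queue, $child;   # visit this element's children later
            }
        }
    }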

print $html->as_HTML;

__END__
<html>
<head>
    <title>This is a title</title>
    <script>
    code section 1
    </script>
</head>

<body>
<h1>This is a heading</h1>
    <script>
    code section 2
    </script>

<div>
    <script>
    code section 
    </script>
</div>

</body>
</html>
brian d foy