ansaurus

Question

How would I remove all <script> tags (and everything in between) from multiple files using UNIX?

Answer 1

A:

You can use perl to replace strings in many files.

perl -pi -w -e 's/search/replace/g;' *.html

-e executes the following line of code.
-i edits the files in place.
-w enables warnings.
-p loops over the input line by line and prints each line after the code has run.

You'll have to come up with the regex on your own, though. (The one you have should work.)

Milan Ramaiya 2010-02-27 01:32:00

The asker also wishes to locate <script> and </script> and everything in between.

Hamish Grubijan 2010-02-27 01:35:07

Uh, yeah, without correcting for escaped characters, <script>.*</script> does exactly that. It'll need fine-tuning to make sure the correct closing tag is selected (because of greediness).

Milan Ramaiya 2010-02-27 01:38:03

Answer 2

A:

Well, you can run PHP from the command line, or translate that line pretty easily into Perl (the "p" in "preg_replace"). You could use sed to do something similar, but the regexes aren't as flexible. Regexes may or may not be good enough, depending on where your input is coming from and what your goal is.
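To give an idea of what the sed route might look like, and where it gets awkward, here is a small sketch. The function names strip_naive and strip_two_step are invented for the example. sed has no non-greedy matching and works on lines, so the obvious range delete can swallow ordinary content that sits between two script blocks.

```shell
#!/bin/sh
# Hypothetical helper names; both read HTML on stdin and write to stdout.

# Naive: delete every line from one matching "<script" through the next
# line matching "</script>". The end pattern is only tested on later lines.
strip_naive() {
  sed '/<script/,/<\/script>/d'
}

# Two steps: first blank out blocks that open and close on one line,
# then range-delete the blocks that span several lines.
strip_two_step() {
  sed -e 's|<script[^>]*>.*</script>||g' -e '/<script/,/<\/script>/d'
}

sample='top
<script>a();</script>
keep me
<script>
b();
</script>
bottom'

printf '%s\n' "$sample" | strip_naive      # "keep me" is lost
printf '%s\n' "$sample" | strip_two_step   # "keep me" survives
```

Even the two-step version deletes whole lines for multi-line blocks, so any markup sharing a line with the opening or closing tag goes too. Whether that is acceptable depends on how the input is laid out.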

Draemon 2010-02-27 01:33:47

Answer 3

+2 A:

For example, with gawk:

$ cat file
blah
<script type="text/javascript">function(foo);</script>
<script type="text/javascript" src="scripts.js"></script>
blah
<script type="text/javascript"
    src="script1.js">
</script>
end

$ awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' file
blah




blah


end

So run it inside a for loop to go over your files (e.g. the HTML ones):

for file in *.html
do
  # quote "$file" so names with spaces work; only replace the file if awk succeeded
  awk 'BEGIN{RS="</script>"}/<script/{gsub("<script.*","")}{print}END{if(RS=="")print}' "$file" >temp && mv temp "$file"
done

You can also do it with Perl:

perl -i.bak -0777ne 's|<script.*?</script>||gms;print' *.html

ghostdog74 2010-02-27 01:37:46

I went the perl route, and in the context of what I was trying to accomplish, it worked perfectly. Thanks!

2010-03-01 19:53:53

Answer 4

+2 A:

The only way you stand a chance of getting this right is to load the file (I'm assuming it's an HTML file) into an HTML/XML parser and remove the script nodes that way. Any other way will likely fall foul of the <script> tag containing "</script>" as part of its contents, for example:

<script>
    document.write('</script>');
</script>

Rob 2010-02-27 01:40:30

Actually, browser DOM parsers will interpret that the same way the regular expression will--with the first </script> as the actual end of the script node. But using regular expressions on HTML is still a bad idea. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Annie 2010-02-27 01:48:24

@Annie, fair point, I was half asleep when I wrote that, but yeah, the "don't use regular expressions on HTML" bit still stands :)

Rob 2010-02-27 09:59:16

Answer 5

+1 A:

I'd just use something like HTML::TreeBuilder and remove all SCRIPT nodes as I walk the tree:

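#!/usr/bin/env perl
use strict;
use warnings;

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new;
my $root = $html->parse_file( *DATA );

# Breadth-first walk: visit each element and inspect its children.
my @queue = ( $root->elementify );

while( my $element = shift @queue )
    {
    foreach my $child ( $element->content_list )
        {
        next unless ref $child;    # plain text node, nothing to do
        if( $child->tag eq 'script' )
            {
            $child->delete;        # drop the script node and its contents
            }
        else
            {
            push @queue, $child;   # visit this element's children later
            }
        }
    }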

print $html->as_HTML;

__END__
<html>
<head>
    <title>This is a title</title>
    <script>
    code section 1
    </script>
</head>

<body>
<h1>This is a heading</h1>
    <script>
    code section 2
    </script>

<div>
    <script>
    code section 3
    </script>
</div>

</body>
</html>

brian d foy 2010-03-16 14:28:19
