Need a solution to kill nodes like <footer>foobar</footer> and <div class="nav"></div> from many HTML files.

I want to dump a site to disk without the menus, footers and whatnot. Ideally I would accomplish this using basic Unix tools like sed. Since it's not XML, I can't use xmlstarlet.

Could anyone please suggest recipes? Ideally I would have a script that I can run as kill-node.sh 'div class="toplinks"' *.html to prune the bits I don't want. Thank you.

+2  A: 

sed is based on regular expressions. Parsing HTML with regular expressions is a topic that comes up over and over again here on SO; see e.g. "regular expression to extract text from HTML" or, even better, "Can you provide some examples of why it is hard to parse XML and HTML with a regex?".

That said, if the HTML pages are all written in a similar way, you may still be able to construct a regexp that does the job. Be prepared, though: because elements can nest to arbitrary depth, it is impossible (yes, provably impossible in theory) to build a complete solution that works in all cases using regexps.
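To make the trade-off concrete, here is a small sketch with two made-up one-line inputs (the variable and function names are only for illustration). Each pattern handles one input and fails on the other, which is the kind of trouble to expect once the pages stop being uniform.

```shell
# Two hypothetical inputs: sibling nav blocks, and a nav block with a nested div.
siblings='<div class="nav">menu</div>KEEP<div class="nav">more</div>'
nested='<div class="nav">a<div>b</div>c</div>KEEP'

# Greedy body: the match runs to the LAST </div> on the line.
greedy() { printf '%s\n' "$1" | sed 's/<div class="nav">.*<\/div>//g'; }
# Tag-free body: only matches when nothing is nested inside the block.
strict() { printf '%s\n' "$1" | sed 's/<div class="nav">[^<]*<\/div>//g'; }

greedy "$nested"     # prints: KEEP
greedy "$siblings"   # prints an empty line: KEEP was swallowed as well
strict "$siblings"   # prints: KEEP
strict "$nested"     # prints the input unchanged: nothing matched
```

If your pages only ever contain one of these two shapes, the matching pattern may be good enough.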

Anders Abel
In my case, matching the start and end tags should be straightforward. Nonetheless, if you can suggest a better, saner command-line tool, I'm all ears!
hendry
@hendry The <center> cannot hold, it's too late! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Tim Post
A: 

Just to drive you regex haters nuts, try this on for size:

sed ':a;$!N;$!ba;s/B/-B/g;s/A/BB/g;s/<\/foo>/A/g;:b;s/<foo>[^A]*A//;tb;s/BB/A/g;s/-B/B/g' foo.html

With foo.html being:

<header>
keep me
<foo>gtg</foo>
</header>
<foo>
delete me</foo>
<foo>gtg</foo>
<foo>gtg</foo>
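
The same trick can be wrapped in a function so the tag name becomes a parameter. This is only a sketch under a few assumptions: kill_node is a made-up name, the tag is a plain name such as foo (not 'div class="toplinks"'), elements of that tag are never nested inside each other, and the file contains no \001 byte. Instead of escaping A and B, it marks each closing tag with that \001 byte.

```shell
#!/bin/sh
# kill_node TAG FILE: print FILE with every <TAG ...>...</TAG> element removed.
# Assumptions: TAG is a plain tag name, TAG elements are not nested in each
# other, and FILE contains no \001 byte (used here as a one-character marker).
kill_node() {
    tag=$1
    file=$2
    sentinel=$(printf '\001')
    # 1. Slurp the whole file into the pattern space.
    # 2. Replace each closing tag with the one-byte marker.
    # 3. Repeatedly delete from an opening tag to the nearest marker.
    # 4. Turn any unmatched markers back into closing tags.
    sed -e ':a' -e '$!N' -e '$!ba' \
        -e "s/<\/$tag>/$sentinel/g" \
        -e ':b' \
        -e "s/<$tag\([[:space:]][^>]*\)\{0,1\}>[^$sentinel]*$sentinel//" \
        -e 'tb' \
        -e "s/$sentinel/<\/$tag>/g" \
        "$file"
}
```

For the sample above, kill_node foo foo.html > pruned.html should leave the header with "keep me" and drop every foo element. The opening-tag pattern allows attributes but will not mistake <footer> for <foo>.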

Otherwise, could someone please write a command-line HTML5 parser? Thanks. x

hendry