ansaurus

Question

Answer 1

+2 A:

sed is based on regular expressions. Parsing HTML with regular expressions is a topic that comes up over and over again here on SO; see e.g. "regular expression to extract text from HTML" or, even better, "Can you provide some examples of why it is hard to parse XML and HTML with a regex?".

That said, if the HTML pages are all written in a similar way, you may still be able to construct a regexp that does the job. Just be prepared for the fact that it is impossible (yes, provably impossible in theory) to build a complete solution that works in all cases using regexps.
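To make the limitation concrete, here is a small sketch of two common ways a sed pattern goes wrong on tag pairs. The function names `greedy_strip` and `inner_strip` and the `<foo>` test strings are made up for illustration; they are not from the question.

```shell
# Hypothetical helpers, named here only for illustration.

# Greedy match: .* runs from the first <foo> to the LAST </foo> on the line,
# so any text between two separate elements is deleted as well.
greedy_strip() {
  sed 's/<foo>.*<\/foo>//'
}

# Tag-free body: [^<]* cannot cross a nested tag, so only the innermost
# element is removed and the outer pair is left behind.
inner_strip() {
  sed 's/<foo>[^<]*<\/foo>//g'
}

printf '%s\n' 'keep1 <foo>a</foo> keep2 <foo>b</foo>' | greedy_strip
printf '%s\n' '<foo>a<foo>b</foo>c</foo>' | inner_strip
```

The first command prints only `keep1 ` (the text `keep2` is lost), and the second prints `<foo>ac</foo>` (the outer element survives). Either pattern can be fine when the input is known to be flat and regular, which is the "written in a similar way" case.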

Anders Abel 2010-05-03 11:17:57

In my case, matching the start and end tags should be straightforward. Nonetheless, if you can suggest a better, saner command-line tool, I'm all ears!

hendry 2010-05-03 11:22:27

@hendry The <center> cannot hold, it's too late! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Tim Post 2010-05-03 11:41:18

Answer 2

A:

Just to drive you regex haters nuts, try this on for size:

sed ':a;$!N;$!ba;s/B/Bb/g;s/A/Ba/g;s/<\/foo>/A/g;:b;s/<foo>[^A]*A//;tb;s/Ba/A/g;s/Bb/B/g' foo.html

With foo.html being:

<header>
keep me
<foo>gtg</foo>
</header>
<foo>
delete me</foo>
<foo>gtg</foo>
<foo>gtg</foo>
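Since the one-liner is dense, here is a rough unrolled sketch of the same idea with one comment per step. The function name `strip_foo` is hypothetical. The escaping is written as B to `Bb` and A to `Ba`, so that decoding stays unambiguous even for text such as "BA". The step that restores unmatched closing tags is an addition for illustration, not part of the one-liner.

```shell
# strip_foo is a hypothetical name; it reads HTML on stdin.
# Usage: strip_foo < foo.html
strip_foo() {
  sed '
    # Slurp the whole input into the pattern space so matches can span lines.
    :a
    $!N
    $!ba
    # Escape B and A so that a bare "A" is free to act as a marker.
    s/B/Bb/g
    s/A/Ba/g
    # Mark every closing tag with the single character A.
    s/<\/foo>/A/g
    # Delete from each <foo> to the nearest marker until nothing changes.
    :b
    s/<foo>[^A]*A//
    tb
    # Added step: restore any closing tag that had no opening tag.
    s/A/<\/foo>/g
    # Undo the escaping.
    s/Ba/A/g
    s/Bb/B/g
  '
}
```

Replacing the two-character-or-longer closing tag with a single marker character is what lets `[^A]*` stand in for a non-greedy match, which sed's regex syntax does not offer. Nested `<foo>` elements are still not handled correctly.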

Otherwise, can someone write a command-line HTML5 parser, please? Thanks. x

hendry 2010-05-03 12:01:23


Killing HTML nodes from shell
