I wish to extract data between known HTML tags. For example:

Hello, <i>I<i> am <i>very</i> glad to meet you.

Should become:

'I

very'

I have found something that nearly does this. Unfortunately, it only extracts the last entry.

sed -n -e 's/.*<i>\(.*\)<\/i>.*/\1/p'

If I first append a newline character to every end tag </i>, this works fine. But is there a way to do it with just one sed command?
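For reference, the two-step workaround described above might look like the following sketch. The function name extract_italics is hypothetical, and the sketch assumes the tags are properly closed, do not nest, and do not span lines. It uses awk for the newline step because a newline escape in a sed replacement is not portable across sed implementations.

```shell
# Hypothetical helper: reads text on stdin and prints the body of each
# <i>...</i> pair on its own line.
extract_italics() {
  # Step 1: append a newline after every </i>, so that each resulting
  # line contains at most one complete <i>...</i> pair.
  awk '{ gsub(/<\/i>/, "&\n"); print }' |
    # Step 2: the original sed command now sees one pair per line;
    # lines without a pair are not printed because of -n and the p flag.
    sed -n 's/.*<i>\(.*\)<\/i>.*/\1/p'
}

printf '%s\n' 'Hello, <i>I</i> am <i>very</i> glad to meet you.' | extract_italics
```

This only moves the problem around: the greedy .* still keeps just the last pair on a line, which is why each pair has to be isolated on its own line first.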

A: 

No. Just no.

No.

Regular expressions cannot be used to parse HTML, because regular expressions are not expressive enough to describe HTML's nested structure. Use an HTML parser instead; a simple event-driven (SAX-type) parser should be sufficient.

Williham Totland
I recognize this is often frowned upon, but I know this to be a well-formed HTML document. Working with an HTML parser seems ridiculously complex for such a simple task.
Nic
A: 
$ awk -vFS="<.[^>]*>" '{for(i=2;i<=NF;i+=2)print $i}' file
I
very
ghostdog74
A: 

Give this a try:

sed -n 's|[^<]*<i>\([^<]*\)</i>[^<]*|\1\n|gp'

And your example is missing a "/":

Hello, <i>I</i> am <i>very</i> glad to meet you.
Dennis Williamson