I have an HTML file and I am interested in the data enclosed by <pre> </pre> tags. Is there a one-liner that can achieve this?

Sample file :

<html>
<title>
Hello There!
</title>
<body>
<pre>
John   Working
Kathy  Working
Mary   Working
Kim    N/A
</pre>
</body>
</html>

Output should be :

John 
Kathy 
Mary 
Kim 

Much appreciated guys, thank you!

+2  A: 

If you have XHTML, use xmlstarlet. If you don't, pass the file through HTML Tidy first to turn it into XHTML.

Ignacio Vazquez-Abrams
Is there a solution using 'sed'? Thank you..
ThinkCode
+1  A: 

My Perl-fu is weak, but this works for your example:

$ cat file.html | perl -e'while(<>){if(/<\/pre>/){$a=0;}if($a){print}if(/<pre>/){$a=1;}}' | cut -f1 -d' '
John
Kathy
Mary
Kim
Thomas
Awesome, it works! Just out of curiosity, can we achieve the same using 'sed'? sed -n '/pre/=' file.txt fetches the line numbers, and sed -n '76,216p' file.txt prints the lines between them. How can I combine the two?
ThinkCode
+3  A: 

Get your hands on the XML::Twig tools. One of them is called xml_grep. Your problem reduces to

cat foo.txt | xml_grep --nowrap pre 

Here pre is an XPath expression. Follow it with some simple text processing, and this will work even if your XML is formatted differently.
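
For the "simple text processing" step, here is a minimal sketch. It assumes the text coming out of xml_grep has one record per line and that any <pre> tags it still carries sit on lines of their own. The function name first_column is made up for illustration.

```shell
# Hypothetical helper: print the first whitespace-separated field of each
# non-empty line on stdin, skipping lines whose first field starts with "<"
# (for example a leftover <pre> or </pre> tag).
first_column() {
  awk 'NF && $1 !~ /^</ { print $1 }'
}

# Usage, following the command above:
#   cat foo.txt | xml_grep --nowrap pre | first_column
```

Because the filter only looks at whatever the parser hands it, the column extraction does not depend on how the surrounding markup is laid out.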

Word of advice - don't use sed and other stream based text processing tools to manipulate structured data like XML. Use a proper parser.

Noufal Ibrahim
Sounds like a great tool. I don't know how to install it and I don't want to ask my admin to install it for me; I'm looking for something quick and easy to finish my task. One up, though :)
ThinkCode
Thanks. Be warned, though: if your input changes slightly, raw text-based parsing of XML will break.
Noufal Ibrahim
+2  A: 

Since you specifically asked about a solution using sed: assuming that the interesting lines are always between lines containing <pre> and </pre> (appearing exactly like that), that the interesting content is never on the same line as the opening or closing tag, that the first such block is the only one you want to extract, and that you still want to do it this way even though you understand it is really the wrong way to solve this problem, then you could use sed, for example like this:

sed '1,/<pre>/d;/<\/pre>/,$d'

It deletes all lines from the first up to and including the one containing <pre>, and all lines from the one containing </pre> through the last.
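
If you also want only the first column, as in the question's expected output, a variation along these lines might do it in a single sed call. This is just a sketch under the same assumptions about where the tags sit. The function name pre_first_column is made up for illustration. Unlike the command above, it uses a range address, so it would print the contents of every such block, not only the first.

```shell
# Hypothetical helper: reads HTML on stdin and prints the first
# whitespace-separated field of every line between a line containing
# <pre> and the next line containing </pre>.
pre_first_column() {
  # -n: print nothing unless asked to.
  # Inside the range: drop the two tag lines, cut each remaining line at
  # its first whitespace character, then print what is left.
  sed -n '/<pre>/,/<\/pre>/{ /<pre>/d; /<\/pre>/d; s/[[:space:]].*$//; p; }'
}

# Usage, assuming the sample is saved as file.html:
#   pre_first_column < file.html
```

The same caveat applies: this is line-oriented text matching, so it stops working as soon as a tag shares a line with the data.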

(FWIW, I would rather use an XPath expression for selecting the interesting content. For example using xmlstarlet as suggested by Ignacio Vazquez-Abrams it could go like this: xmlstarlet sel -t -v /html/body/pre.)

Jukka Matilainen
I like your solution too, thanks!
ThinkCode