Ideally, what I'd like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -E 's%^<title>|</title>$%%g' > titleOfXHTMLPage.txt
Well, you can use the xpath utility; it comes with Perl's XML::XPath module.
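For example, a sketch assuming the Debian-packaged xpath script (option syntax differs between versions; older ones take the query as a bare argument):
# -q suppresses the "Found N nodes" chatter, -e gives the XPath expression
xpath -q -e '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt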
I am not aware of any pure shell XML parsing tool, so you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as
xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
(the -t option gives you the result as text instead of XML).
Command-line tools that can be called from shell scripts include xmllint and xsltproc, which I use with little XSLT transform scripts to do XML processing from the command line or in shell scripts.
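For example, recent xmllint builds have an --xpath option; a minimal sketch (the local-name() test is my workaround for the XHTML default namespace, which a bare /html/head/title path would otherwise miss):
# print the title text; local-name() sidesteps namespace declarations
xmllint --xpath '//*[local-name()="title"]/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt
The same query works as a tiny xsltproc stylesheet:
cat > title.xsl <<'EOF'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="//*[local-name()='title']"/>
  </xsl:template>
</xsl:stylesheet>
EOF
xsltproc title.xsl xhtmlfile.xhtml > titleOfXHTMLPage.txt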
Check out xml2 from http://www.ofb.net/~egnor/xml2/, which converts XML to a line-oriented format.
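Since xml2 flattens every element onto a path=value line, standard text tools can finish the job. A sketch, assuming the document keeps its elements in the unprefixed default namespace:
# xml2 emits lines such as /html/head/title=My Page Title
xml2 < xhtmlfile.xhtml | grep '^/html/head/title=' | cut -d= -f2- > titleOfXHTMLPage.txt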
You can do that very easily using only bash. You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read, but for HTML documents. When called, rdom will assign the element (including any attributes) to the variable E and the content to the variable C.
For example, to do what you wanted to do:
while rdom; do
    if [[ $E = title ]]; then
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
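One caveat worth noting (my addition, not part of the original answer): rdom puts the raw tag contents into E, so an element written as <title lang="en"> will not equal the plain string title. A glob match tolerates attributes:
while rdom; do
    # accept "title" with or without trailing attributes
    if [[ $E == title || $E == "title "* ]]; then
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml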
Here's a function which will convert XML name-value pairs and attributes into bash variables.
http://www.humbug.in/2010/parse-simple-xml-files-using-bash-extract-name-value-pairs-and-attributes/
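The function itself lives only at that link; as a rough sketch of the idea (my own simplified illustration, reusing the rdom-style parser above, with no entity decoding or nesting awareness):
# assign <key>value</key> pairs to shell variables named after the tags
parse_kv () {
    local IFS=\> E C
    while read -d \< E C; do
        # skip closing tags, comments, declarations, and empty content
        [[ $E == /* || $E == \!* || $E == \?* || -z $C ]] && continue
        # only tag names that are valid shell identifiers become variables
        [[ $E =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue
        printf -v "$E" '%s' "$C"
    done
}
parse_kv < config.xml    # then e.g. echo "$hostname", assuming a <hostname> element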
After some research into translating between Linux and Windows file-path formats in XML files, I found interesting tutorials and solutions on: