count_items=`curl -u username:password -L "websitelink" | sed -e 's/<\/title>/<\/title>\n/g' | sed -n -e 's/.*<title>\(.*\)<\/title>.*/\1/p' | wc -l`

The Bash command above extracts the titles from an XML file and counts them. How do I change the regex so that it extracts a title from a div tag instead?

Example: extract title out of: <div id="example""><a href="">title</a></div>

I know it's silly to do this in Bash, but I have no choice. Any help would be appreciated.

+3  A: 

I recommend using xmlstarlet instead of trying to parse XML with a regex.

Ignacio Vazquez-Abrams
I'm not parsing XML, I'm extracting, and I have to use Bash for this. Any help would be greatly appreciated!
Extracting requires parsing, and xmlstarlet is a command-line tool.
Ignacio Vazquez-Abrams
Yeah, but it isn't installed on a Linux machine by default, is it? I need a simple Bash script that doesn't need anything to be installed.
A: 

Just for the single-line example given:

echo '<div id="example""><a href="">title</a></div>' | sed -E -n 's/(.*<div.*<a href="">)([^<]*)(<.*<\/div>.*)/\2/p'
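If the input has several divs, possibly on one line, a rough sketch in the same spirit is to let grep -o isolate each div/link pair first and then strip the markup with sed. The function name extract_div_titles is made up for illustration, and this assumes your grep supports -o (GNU, BSD and BusyBox grep do, but it is not POSIX). It also assumes the `<a>` immediately follows the opening `<div>` with no whitespace between them, as in the example.

```shell
# Hypothetical helper: read HTML on stdin and print the link text of every
# <div ...><a ...>text</a> pair, one per line.
extract_div_titles() {
  # grep -o prints each matching fragment on its own line,
  # so sed only ever sees one div/link pair at a time.
  grep -o '<div[^>]*><a[^>]*>[^<]*</a>' |
    # Keep only the text between the last ">" and the closing </a>.
    sed -e 's/.*>\([^<]*\)<\/a>/\1/'
}
```

To get a count like the original command, it should be possible to pipe the page through it: curl -u username:password -L "websitelink" | extract_div_titles | wc -l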
creek
+2  A: 

Parsing XML without a parser is ugly; the SO crowd always strongly recommends against it, and people always insist on doing it anyway. Usually the brute-force, special-case solutions kludged together with the wrong tools fail beyond a certain level of complexity, and then those people are back where they started. You have been warned! ;)

You mention elsewhere that you need to be able to do this on a "plain Linux machine with nothing installed." While you may not find specialized XML parsing tools on every Linux box, these days it's hard to find one that doesn't have Perl installed. Or at least awk. When you hit the limits of what you can do with regular expressions in sed, I recommend going with either awk or perl for a clean, flexible and legible solution. Use of Perl with a "real" Perl XML library would be optimal but in a pinch you can still get a lot done with "out of the box" Perl.
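As a rough illustration of the awk route, here is a minimal sketch that prints the link text only for divs with a given id. The function name div_link_text and the id argument are assumptions for the example, not something from the question. It still treats the markup as flat text, so it shares the limits described above (no nesting, no tags split across lines).

```shell
# Hypothetical helper: read HTML on stdin, print the link text found in
# <div id="ID"><a ...>text</a>, where ID is the first argument.
div_link_text() {
  awk -v id="$1" '
    {
      line = $0
      # Walk along the line, one <div ...><a ...>text</a> fragment at a time.
      while (match(line, /<div[^>]*><a[^>]*>[^<]*<\/a>/)) {
        hit  = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)

        # Look for the id only inside the opening <div ...> tag.
        divtag = hit
        sub(/>.*/, "", divtag)
        if (index(divtag, "id=\"" id "\"") == 0)
          continue

        sub(/<\/a>$/, "", hit)   # drop the closing </a>
        sub(/^.*>/, "", hit)     # drop everything up to the last ">"
        print hit
      }
    }'
}
```

Usage would look like: echo '<div id="example""><a href="">title</a></div>' | div_link_text example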

Carl Smotricz
A: 

Using nothing but Bash:

$ string='<div id="example""><a href="">title</a></div>'
$ pattern='.*>([^<]+)<.*'
$ [[ $string =~ $pattern ]]
$ target=${BASH_REMATCH[1]}
$ echo "$target"
title

There are lots of ways for this to fail. Here's one:

$ string='<div id="example""><a href="">title</a>this text will be grabbed instead</div>'

You can keep trying to make the regex more robust:

pattern='.*>([^<]+)</a.*'

but it's an uphill battle. Use a proper parser.
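If a parser really is off the table, the tightened pattern can at least be wrapped in a loop so that it picks up every link rather than just one. This is only a sketch: link_texts is a made-up name, it needs Bash 3.0 or later for [[ =~ ]], and it will still break on nested tags, comments, or attributes containing ">".

```shell
# Hypothetical helper: print the text of every <a ...>text</a> found in $1,
# using nothing but Bash's regex operator.
link_texts() {
  local rest=$1
  # Group 1 is the link text, group 2 is whatever follows the closing </a>.
  local pattern='<a[^>]*>([^<]+)</a>(.*)'
  while [[ $rest =~ $pattern ]]; do
    printf '%s\n' "${BASH_REMATCH[1]}"
    rest=${BASH_REMATCH[2]}   # keep scanning after this match
  done
}
```

Calling link_texts "$string" with the failing string from above should print title, because the pattern anchors on the `<a>` element instead of on any ">".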

Dennis Williamson