Hello! I want to write a shell script to get an image from an RSS feed. Right now I have:

curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g'

I use this to grab the first occurrence of an image URL in the file. Now I want to put this URL in a variable so that I can use cURL again to download the image. Any help appreciated! (Also, you might give tips on how to better remove everything else from the line with the URL. This is the line:

 <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />

There's probably some better regex to remove everything except the URL than my solution.) Thanks in advance!

+2  A: 

Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.

If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:

HTML

curl -s http://BOGUS.com | perl -e '{use HTML::TokeParser; 
    $parser = HTML::TokeParser->new(\*STDIN); 
    $img = $parser->get_tag('img') ; 
    print "$img->[1]->{src}\n"; 
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

XML

curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
    $twig=XML::Twig->new(twig_handlers =>{img => sub { 
       print $_[1]->att("src")."\n"; exit 0;}}); 
    open(my $fh, "-");
    $twig->parse($fh);
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif
DVK
Added XML example - probably more useful for RSS
DVK
A: 

Use a DOM parser and extract all img elements using getElementsByTagName. Then add them to a list/array, loop through and separately fetch them.

I would suggest using Python, but any language would have a DOM library.
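
To make the steps above concrete, here is a minimal sketch using only Python's standard library (xml.dom.minidom). The function name image_urls and the sample markup are made up for illustration, and the sketch assumes the input is well-formed XML. If the feed carries its HTML as escaped text or CDATA, the img tags are not real XML elements and this will find none.

```python
from xml.dom import minidom


def image_urls(xml_text):
    """Return the src attribute of every <img> element, in document order."""
    doc = minidom.parseString(xml_text)
    return [
        img.getAttribute("src")
        for img in doc.getElementsByTagName("img")
        if img.hasAttribute("src")
    ]


if __name__ == "__main__":
    # Hypothetical sample input; a real script would read the feed instead.
    sample = (
        '<div><img src="http://www.foo.com/full/1.jpg" alt="" />'
        '<p><img src="http://www.foo.com/full/2.jpg" alt="" /></p></div>'
    )
    # Loop over the collected URLs; each one could then be fetched separately.
    for url in image_urls(sample):
        print(url)
```

Each printed URL could then be handed to curl or wget, one per line, for the actual download.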

meder
A: 
#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"

This totally does the job! Any idea on the regex?

tzippy
"Any idea on the regex" ? Yes. **DONT USE A REGEX**, use a Dom lib :)
meder
If it does the job, why are you asking the question?
Jesse Dhillon
A: 

Here's a quick Python solution:

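import sys
# BeautifulSoup 4 (pip install beautifulsoup4); the old BeautifulSoup 3 import only works on Python 2
from bs4 import BeautifulSoup

# Parse the HTML read from stdin and print the src of the first <img> tag
soup = BeautifulSoup(sys.stdin.read(), "html.parser")
print(soup.find_all('img')[0]['src'])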

Usage:

$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`

This works like a charm and will not leave you trying to find the magical regex that will parse random HTML (Hint: there is no such expression, especially not if you have a greedy matcher like sed.)
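
If installing a third-party library is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is an illustration and not part of the solution above: the names FirstImgSrc and first_img_src are invented here, and the sample line is the one from the question.

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Remember the src of the first <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(html_text):
    """Return the first image URL found in html_text, or None."""
    parser = FirstImgSrc()
    parser.feed(html_text)
    parser.close()
    return parser.src


if __name__ == "__main__":
    line = (' <img src="http://www.nichtlustig.de/comics/full/100802.jpg"'
            ' alt="" width="400" height="400" />')
    print(first_img_src(line))
```

No regex is involved: the parser deals with attribute order, spacing and quoting, so the width and height attributes never need to be stripped by hand.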

Jesse Dhillon
A: 

I used wget instead of curl, but it's just the same:

#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
  gsub(/.*<img src=\"/,"")
  gsub(/\".[^>]*>/,"")
  print
}'  |  xargs -I{} wget "{}"
ghostdog74