views: 47
answers: 2
I'm trying something new; I would normally do this in C# or VB, but for speed reasons I'd like to do this on my server.

  1. Open file terms.txt
  2. Take each item, one at a time, from terms.txt and open a URL (possibly with curl or something else) and go to http://system.com/set=terms
  3. View the HTML source and extract pic names (StringB). Look for image=StringB&location
  4. Save StringB to imgname.txt
  5. Close the file and cycle to the next item in terms.txt

I was looking at sed, but I believe awk might be the best way? Building a command like this to run under the shell is all new to me. I'm familiar with using Linux; I just need help with the commands.
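The five steps above boil down to a read-fetch-extract loop. Here is a self-contained sketch, assuming one term per line in terms.txt and that the page source contains the literal shape image=NAME&location; all names below are invented, and local file:// URLs stand in for the real server so it can be run anywhere:

```shell
#! /bin/sh
# Sketch of steps 1-5. A throwaway directory of file:// "pages" stands
# in for http://system.com; the term ("beach") and image name
# ("sunset01.jpg") are made up for the demo.
dir=$(mktemp -d)
printf '<img src="view.php?image=sunset01.jpg&location=top">\n' > "$dir/set=beach"
printf 'beach\n' > "$dir/terms.txt"

# Steps 2-4: fetch each term's page, keep only what sits between
# "image=" and the next "&", and collect the names.
names=$(while read -r term; do
    curl -s "file://$dir/set=$term" |
      sed -n 's/^.*image=\([^&]*\)&.*$/\1/p'
done < "$dir/terms.txt")

printf '%s\n' "$names"      # redirect to imgname.txt to save the names
rm -r "$dir"
```

Against the real server you would replace "file://$dir" with http://system.com and point the loop at your actual terms.txt.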

+3  A: 

Something not entirely unlike this should do ya, depending on the precise format of terms.txt (shell scripts cope best with one entry per line) and whether you actually need to parse the HTML (I'm hoping you don't):

#! /bin/sh

if [ $# -ne 2 ]; then
    echo "usage: $0 termfile baseurl" >&2
    exit 1
fi
termfile="$1"
baseurl="$2"

# For each term, fetch its page and print only the name captured
# between "image=" and the following "&"
while read -r term; do
    wget -q -O- "$baseurl/set=$term" |
      sed -ne 's/^.*image=\([^&]*\)&.*$/\1/p'
done < "$termfile"

You save this to a file named "extractimages", chmod +x it, and run it like so:

$ ./extractimages terms.txt http://system.com > imgname.txt
Zack
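Before pointing the script at the real server, the sed half can be sanity-checked on one canned line; the markup below is invented, only the image=...& shape matters:

```shell
# A stand-in for one line of the fetched page (hypothetical markup)
html='<a href="show.php?image=castle02.jpg&location=east">'

# The same sed stage the script uses, fed from a string instead of wget
got=$(printf '%s\n' "$html" | sed -n 's/^.*image=\([^&]*\)&.*$/\1/p')
printf '%s\n' "$got"    # castle02.jpg
```

If this prints the expected name but the full script prints nothing, the problem is on the fetch side rather than in the regular expression.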
I'm receiving no output. I tried changing $baseurl/set=$term to $baseurl/set=$termfile; still nothing. I thought that might have been a typo.
acctman
That was not a typo; $term gets each of the individual terms read from $termfile. What happens if you comment out the sed command and the vertical bar (*just* the vertical bar) on the previous line?
Zack
(beware, that might produce an enormous amount of output)
Zack
When I comment out the sed and the pipe, the output to imgname.txt is all the HTML coding.
acctman
OK, I can get it to work, sort of, if I remove -ne from the sed command. I did some testing. With HTML it's grabbing all the HTML around it and inserting it into the file as well. It works fine on a test file with no HTML inside it.
acctman
I'm not 100% sure, but it looks like the wget -q -O- "$baseurl/set=$term" command is outputting the HTML doc to the saved text file. How do I prevent that?
acctman
Gah! Really pathetic typo. There needs to be a 'p' at the end of the sed command. Fixed the script.
Zack
Further explanation: Without the sed command, getting "all the html coding" in imgname.txt is the expected behavior. "wget -q -O- <url>" retrieves a URL and writes it to standard output, and the script loops over all terms, so you get the HTML page for every term, concatenated together. (continued...)
Zack
The sed command was *supposed* to delete every line that didn't match the regular expression, and print the result of the regular expression match on lines that did. sed -n prints only what it is told to by 'p' commands, but I forgot to include any 'p' commands, so you got nothing.
Zack
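That behavior is easy to reproduce on one canned line (the name here is invented): with -n and no trailing p, sed emits nothing at all; with the p, it prints the captured group.

```shell
line='junk image=pic1.jpg&more junk'

# -n suppresses automatic printing, and there is no p flag: no output
without_p=$(printf '%s\n' "$line" | sed -n 's/^.*image=\([^&]*\)&.*$/\1/')

# Same command with the p flag: the substituted line is printed
with_p=$(printf '%s\n' "$line" | sed -n 's/^.*image=\([^&]*\)&.*$/\1/p')

printf 'without p: [%s]\n' "$without_p"   # without p: []
printf 'with p:    [%s]\n' "$with_p"      # with p:    [pic1.jpg]
```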
Thanks Zack, that worked; perfect output.
acctman
A: 
sed 's|^.*$|wget -q -O- http://system.com/set=&|' file | bash | sed -ne 's/^.*image=\([^&]*\)&.*$/\1/p'
ghostdog74
I tried this and am not getting any output. I ran the wget with a static link and it displays the HTML, so the site's being processed fine. I replaced **file** with terms.txt and added > imgname.txt to the end. No output at all is being produced. The terms.txt file has one term per line.
acctman
This script has the same error in the sed command - there needs to be a 'p' right before the close quote.
Zack
Yes, I left out the "p".
ghostdog74
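Since the question asked whether awk might be the best way: the same extraction can be done in awk by splitting the line on the literal image= and then on & (a sketch; it assumes image= appears at most once per line, and the sample markup is invented):

```shell
# A stand-in for one line of the fetched page
page='<img src="v.php?image=pic2.jpg&loc=y">'

# -F'image=' makes everything after "image=" field $2; split() then
# cuts that field at the first "&", leaving the bare image name.
name=$(printf '%s\n' "$page" | awk -F'image=' 'NF > 1 { split($2, a, "&"); print a[1] }')
printf '%s\n' "$name"   # pic2.jpg
```

Either tool works here; sed and awk are interchangeable for a single-capture extraction like this one.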