views: 47
answers: 2
I'm trying something new; I would normally do this in C# or VB, but for speed reasons I'd like to do this on my server.

  1. Open file terms.txt
  2. Take each item, one at a time, from terms.txt and open a URL (possibly with curl or something else) and go to http://system.com/set=terms
  3. View the HTML source and extract pic names (StringB). Look for image=StringB&location
  4. Save StringB to imgname.txt
  5. Close the file and cycle to the next item in terms.txt

I was looking at sed, but I believe awk might be the best way? Building a command like this to run under the shell is all new to me. I'm familiar with using Linux; I just need help with the commands.
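The five steps above boil down to a read-fetch-extract loop. Here is a self-contained sketch, assuming one term per line in terms.txt and that the page source contains the literal shape image=NAME&location; all names below are invented, and local file:// URLs stand in for the real server so it can be run anywhere:

```shell
#! /bin/sh
# Sketch of steps 1-5. A throwaway directory of file:// "pages" stands
# in for http://system.com; the term ("beach") and image name
# ("sunset01.jpg") are made up for the demo.
dir=$(mktemp -d)
printf '<img src="view.php?image=sunset01.jpg&location=top">\n' > "$dir/set=beach"
printf 'beach\n' > "$dir/terms.txt"

# Steps 2-4: fetch each term's page, keep only what sits between
# "image=" and the next "&", and collect the names.
names=$(while read -r term; do
    curl -s "file://$dir/set=$term" |
      sed -n 's/^.*image=\([^&]*\)&.*$/\1/p'
done < "$dir/terms.txt")

printf '%s\n' "$names"      # redirect to imgname.txt to save the names
rm -r "$dir"
```

Against the real server you would replace "file://$dir" with http://system.com and point the loop at your actual terms.txt.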

+3  A: 

Something not entirely unlike this should do ya, depending on the precise format of terms.txt (shell scripts cope best with one entry per line) and whether you actually need to parse the HTML (I'm hoping you don't):

#! /bin/sh

if [ $# -ne 2 ]; then
    echo "usage: $0 termfile baseurl" >&2
    exit 1
fi
termfile="$1"
baseurl="$2"

# For each term, fetch its page and print only the name captured
# between "image=" and the following "&"
while read -r term; do
    wget -q -O- "$baseurl/set=$term" |
      sed -ne 's/^.*image=\([^&]*\)&.*$/\1/p'
done < "$termfile"

You save this to a file named "extractimages", chmod +x it, and run it like so:

$ ./extractimages terms.txt http://system.com > imgname.txt
Zack
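Before pointing the script at the real server, the sed half can be sanity-checked on one canned line; the markup below is invented, only the image=...& shape matters:

```shell
# A stand-in for one line of the fetched page (hypothetical markup)
html='<a href="show.php?image=castle02.jpg&location=east">'

# The same sed stage the script uses, fed from a string instead of wget
got=$(printf '%s\n' "$html" | sed -n 's/^.*image=\([^&]*\)&.*$/\1/p')
printf '%s\n' "$got"    # castle02.jpg
```

If this prints the expected name but the full script prints nothing, the problem is on the fetch side rather than in the regular expression.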
I'm receiving no output. I tried changing $baseurl/set=$term to $baseurl/set=$termfile; still nothing. I thought that might have been a typo.
acctman
That was not a typo; $term gets each of the individual terms read from $termfile. What happens if you comment out the sed command and the vertical bar (*just* the vertical bar) on the previous line?
Zack
(beware, that might produce an enormous amount of output)
Zack
When I comment out the sed and the pipe, the output to imgname.txt is all the HTML coding.
acctman
OK, I can get it to work, sort of, if I remove -ne from the sed command. I did some testing. With HTML it's grabbing all the HTML around it and inserting it into the file as well. It works fine on a test file with no HTML inside it.
acctman
I'm not 100% sure, but it looks like the wget -q -O- "$baseurl/set=$term" command is outputting the HTML doc to the saved text file. How do I prevent that?
acctman
Gah! Really pathetic typo. There needs to be a 'p' at the end of the sed command. Fixed the script.
Zack
Further explanation: Without the sed command, getting "all the html coding" in imgname.txt is the expected behavior. "wget -q -O- <url>" retrieves a URL and writes it to standard output, and the script loops over all terms, so you get the HTML page for every term, concatenated together. (continued...)
Zack
The sed command was *supposed* to delete every line that didn't match the regular expression, and print the result of the regular expression match on lines that did. sed -n prints only what it is told to by 'p' commands, but I forgot to include any 'p' commands, so you got nothing.
Zack
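That behavior is easy to reproduce on one canned line (the name here is invented): with -n and no trailing p, sed emits nothing at all; with the p, it prints the captured group.

```shell
line='junk image=pic1.jpg&more junk'

# -n suppresses automatic printing, and there is no p flag: no output
without_p=$(printf '%s\n' "$line" | sed -n 's/^.*image=\([^&]*\)&.*$/\1/')

# Same command with the p flag: the substituted line is printed
with_p=$(printf '%s\n' "$line" | sed -n 's/^.*image=\([^&]*\)&.*$/\1/p')

printf 'without p: [%s]\n' "$without_p"   # without p: []
printf 'with p:    [%s]\n' "$with_p"      # with p:    [pic1.jpg]
```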
Thanks Zack, that worked; perfect output.
acctman
A: 
sed 's|^.*$|wget -q -O- http://system.com/set=&|' file | bash | sed -ne 's/^.*image=\([^&]*\)&.*$/\1/p'
ghostdog74
I tried this and am not getting any output. I ran the wget with a static link and it displays the HTML, so the site's being processed fine. I replaced **file** with terms.txt and added > imgname.txt to the end. No output at all is being produced. The terms.txt file has one term per line.
acctman
This script has the same error in the sed command - there needs to be a 'p' right before the close quote.
Zack
Yes, I left out the "p".
ghostdog74
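Since the question asked whether awk might be the best way: the same extraction can be done in awk by splitting the line on the literal image= and then on & (a sketch; it assumes image= appears at most once per line, and the sample markup is invented):

```shell
# A stand-in for one line of the fetched page
page='<img src="v.php?image=pic2.jpg&loc=y">'

# -F'image=' makes everything after "image=" field $2; split() then
# cuts that field at the first "&", leaving the bare image name.
name=$(printf '%s\n' "$page" | awk -F'image=' 'NF > 1 { split($2, a, "&"); print a[1] }')
printf '%s\n' "$name"   # pic2.jpg
```

Either tool works here; sed and awk are interchangeable for a single-capture extraction like this one.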