Are you able to write a Bash script that will download an HTTP resource from the web?

The inputs are:

$hostname 
$port     
$path     
$output

You can't:

  • use external commands other than telnet (no sed, no awk, no wget, ...)
  • use other shells

You can:

  • use /dev/tcp pseudo devices
  • use telnet

Your script MUST pass this test (you can change the input if you want; I don't receive anything if you visit me):

hostname=andreafrancia.it # use what you want instead of this 
port=80
path=/bash-http-contest.txt

./your-script "$hostname" "$port" "$path" output.actual
wget "http://$hostname:$port$path" -O output.expected

diff --binary output.actual output.expected # this should return 0, i.e. the files are equal.

You MUST explicitly state that your code snippet uses one of these licenses: GPL (v1, v2, v3), the Apache License, the public domain, or some other license that allows someone else to reuse your script.

You MUST put your code in the Stack Overflow page (no external links or attached files).

The first correct answer wins. The date that counts is the last edit date.

+1  A: 

Try something like this:

(exec 3<>"/dev/tcp/$hostname/$port"
 echo -e "GET $path HTTP/1.1\r\nConnection: close\r\n\r\n" >&3
 cat <&3) > "$output"

Updated for Mike Ottum's bug fix.
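As a hedged sketch of the request bytes this approach sends, the helper below builds the same kind of GET request with printf, which handles the \r\n escapes more predictably than echo -e. The function name build_request is hypothetical, and the Host header is an addition here because HTTP/1.1 requires it (for a non-default port the value would be host:port).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an HTTP/1.1 GET request for host $1 and path $2.
build_request() {
    local host="$1" path="$2"
    # printf expands \r\n in the format string; %s inserts arguments verbatim.
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$path" "$host"
}

# Usage sketch (needs network access, so it is left commented out):
#   exec 3<>"/dev/tcp/$hostname/$port"
#   build_request "$hostname" "$path" >&3
#   cat <&3 > "$output"
```

The output still contains the status line and response headers; this only covers the request side.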

DigitalRoss
Please state that we can reuse your script, or you cannot win.
Andrea Francia
You actually have to talk HTTP, and you don't. Keywords: split data, 3xx, 4xx, 5xx messages, etc. I hacked together an IRC bot in Bash, but IRC is not as complicated as HTTP can be.
TheBonsai
Look, it was free, it does fetch files, and it gets the OP 95% of what was required. I'm sorry it wasn't good enough for you, but the SO guidelines say to downvote misleading or incorrect information. We have a relatively sophisticated OP who certainly can understand the limitations of this answer, so it isn't misleading or incorrect. Sheesh.
DigitalRoss
@Andrea: posts to SO are covered by a license that most likely gives you exactly what you need. See the bottom of the page you are looking at right now.
DigitalRoss
You're right about the downvote. But the 95% is worth a discussion.
TheBonsai
HTTP headers must be separated by \r\n, not just \n, and the header block should end with a pair of \r\n's. Like this: GET / HTTP/1.1\r\nConnection: close\r\n\r\n
Mike Ottum
Sorry, I missed the part about the SO license. I didn't downvote you. Unfortunately your script doesn't remove the HTTP metadata.
Andrea Francia
Maybe something with parameter expansion will help to remove the headers.
Andrea Francia
I think the right way to remove the headers is to parse the output file in bash and compute the number of bytes to strip, then output a `dd(1)` command to do the binary-safe heavy lifting. However, the OP didn't want any external commands so I'm kind of stuck...
DigitalRoss
This tries to remove most of the headers: echo "${raw_output#HTTP*Content-Type: text/html; charset=iso-8859-1}"
Andrea Francia
Ok, I've got something that seems to remove the headers. It will only work for things that the shell can read, so it needs a text-friendly encoding. (Yes, limitations.) See http://pastie.org/722698
DigitalRoss
I hope that the shell is more binary-friendly than we know.
Andrea Francia
This is my attempt to put it all together: http://pastie.org/722716
Andrea Francia
Unfortunately this gives me HTTP/1.1 400 Bad Request; you should add a "Host: $hostname" header.
Andrea Francia
Oh right, in 1.1 the hostname is required. Just make it an HTTP 0.9 request, or add the hostname. That crossed my mind at first then I got distracted...
DigitalRoss
I found out how to remove the headers: `"${raw_output#*$'\r\n\r\n'}"`. Thank you for showing me the use of `$'\r'`.
Andrea Francia
Wow. I'll be kind of amazed if this really works. Let us know!
DigitalRoss
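To make the expansion discussed in these comments concrete, here is a minimal sketch run against a canned response instead of a live server. The function names strip_headers and status_code are hypothetical; the second one is only an illustration of how the 3xx/4xx/5xx concern raised above could be checked with the same kind of parameter expansion.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the body of the raw HTTP response passed as $1.
strip_headers() {
    # '#' removes the shortest prefix matching the pattern, i.e. everything
    # up to and including the first blank line (\r\n\r\n).
    printf '%s' "${1#*$'\r\n\r\n'}"
}

# Hypothetical helper: print the numeric status code of the response in $1.
status_code() {
    local line="${1%%$'\r\n'*}"   # status line, e.g. "HTTP/1.1 200 OK"
    local rest="${line#* }"       # drop the protocol version: "200 OK"
    printf '%s\n' "${rest%% *}"   # keep the first word: "200"
}

# A canned response, so the sketch runs without network access.
response=$'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello\n'
status_code "$response"
strip_headers "$response"
```

If the response contains no blank line, the pattern does not match and strip_headers prints its input unchanged.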
A: 

Thanks to DigitalRoss, Mike Ottum and the other contributors, I created the following, which does 99% of the work.

I used parameter expansion to remove the headers. The remaining problem is the last newline character of the page: the $() construct strips trailing newlines, and I don't think this can be solved.

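function download() {
    local hostname="$1"
    local port="$2"
    local path="$3"
    local raw_output

    # note: $() strips trailing newlines from the captured response
    raw_output="$(download_raw "$hostname" "$port" "$path")"

    # strip the headers: drop everything up to the first blank line
    printf '%s' "${raw_output#*$'\r\n\r\n'}"
}

function download_raw() {
    local hostname="$1"
    local port="$2"
    local path="$3"

    # open a read/write TCP connection on fd 3, send the request, read the reply
    (exec 3<>"/dev/tcp/$hostname/$port"
     echo -en "GET $path HTTP/1.1\r\nConnection: close\r\nHost: $hostname\r\n\r\n" >&3
     cat <&3)
}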

hostname=andreafrancia.it
port=80
path=/

download "$hostname" "$port" "$path" > output.txt
wget "http://$hostname:$port$path" -O output.expected
diff --binary output.txt output.expected

The result is:

[root@localhost ~]# diff --binary output.txt output.expected
74c74,75
< </html>
\ No newline at end of file
---
> </html>
>
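One possible workaround for the missing final newline in the diff above is a sentinel character: append a byte after the captured data so that $() has no trailing newline to strip, then remove that byte again. This is only a sketch under assumptions. The filter name body_exact is hypothetical, it relies on cat just as the script above does, and Bash variables still cannot hold NUL bytes, so truly binary resources would still differ.

```shell
#!/usr/bin/env bash
# Hypothetical filter: raw HTTP response on stdin, exact body on stdout.
body_exact() {
    local raw
    # The trailing "x" is the last character captured, so the newlines
    # before it survive command substitution.
    raw="$(cat; printf x)"
    raw="${raw%x}"                           # drop the sentinel again
    printf '%s' "${raw#*$'\r\n\r\n'}"        # strip status line and headers
}

# Canned response ending in a newline, so the sketch runs offline.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n</html>\n' | body_exact
```

With the functions from this answer it could be used as: download_raw "$hostname" "$port" "$path" | body_exact > output.txt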

Feel free to reuse and improve this solution.

Andrea Francia
Wow. I didn't totally count on this working out. :-) Nice job.
DigitalRoss