Hello everyone,

I am trying to use cURL from Bash to download a webpage's source code. It works for simple static HTML, but I am having trouble with pages that are more complex. For example, I am trying to view the source of the following page with this command:

curl "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones"

However, the output doesn't match the source that Firefox shows when I click "View source". I suspect this is because there are JavaScript elements on the page, but I cannot be sure.

For example, the following command returns no match:

curl "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" | grep "Access to 4G speeds"

This happens even though that phrase clearly appears in the Firefox source. I looked through the man pages, but I don't know enough about the problem to find a solution.

Ideally, an answer would explain why this doesn't work the way I expect and offer a solution using curl or another tool that runs on a Linux box.

EDIT:

Following the suggestion below, I also tried setting the user agent with the -A switch, with no success:

curl "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" -A "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3" | grep -i "Sorry"
+3  A: 

I don't see the "Access to 4G speed" thing in the first place when I go to that page.

The two most likely culprits for this difference are cookies and your user-agent.

You can specify cookies manually with either curl or wget. Dump your cookies from Firefox using whatever plugin you like, or just enter

javascript:prompt('',document.cookie);

in your location bar. Then read through the man pages for wget or curl to see how to include that cookie.


EDIT: It appears to be what I thought, a missing cookie.

curl --cookie "INSERT THE COOKIE YOU GOT HERE" "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" | grep "Access to 4G"

As stated above, you can grab your cookie with javascript:prompt('',document.cookie) and then copy the default text that comes up. Make sure you're on the Sprint page when you enter that in the location bar (otherwise you'll end up with the wrong website's cookie).


EDIT 2

Your browser cookie and your shell cookie were different because the two sessions had interacted with the site differently.

The reason I didn't see the Access to 4G speed thing you were talking about in the first place was that I hadn't entered my zip code.

If you want a cookie that stays valid, you can have curl do whatever is required to obtain it, which in this case means entering a zip code.

In curl, you can do this by making multiple requests and keeping the retrieved cookies in a cookie jar:

 [stackoverflow]  curl --help | grep cookie
 -b/--cookie <name=string/file> Cookie string or file to read cookies from (H)
 -c/--cookie-jar <file> Write cookies to this file after operation (H)
 -j/--junk-session-cookies Ignore session cookies read from file (H)

So simply specify a cookie jar, send the request that submits the zip code, then make your later requests with the cookies from that jar.
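
The two-request flow described above might look roughly like the sketch below. Note that ZIP_URL and the zipCode form field are placeholders I made up for illustration; I have not checked what Sprint's zip code form actually posts to, so inspect the form on the page to find the real URL and field name. The function name fetch_with_zip is also just an illustrative choice.

```shell
#!/usr/bin/env bash
# Sketch only. ZIP_URL and the "zipCode" field are HYPOTHETICAL placeholders,
# not Sprint's real endpoint; look at the site's zip code form for the real ones.
PAGE_URL="http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones"
ZIP_URL="http://example.invalid/zip-form-action"

# fetch_with_zip ZIP JARFILE
# Prints the page source to stdout after first submitting the zip code.
fetch_with_zip() {
    local zip="$1" jar="$2"
    # Request 1: POST the zip code; -c writes any cookies the server sets to the jar.
    curl -s -c "$jar" -d "zipCode=$zip" "$ZIP_URL" > /dev/null || return 1
    # Request 2: fetch the page; -b sends the saved cookies, -c keeps the jar updated.
    curl -s -b "$jar" -c "$jar" "$PAGE_URL"
}

# Usage (performs real network requests):
#   fetch_with_zip 90210 cookies.txt | grep "Access to 4G"
```

Keeping the jar in a file means later runs can reuse it with -b, and re-running the first request refreshes it if the server stops accepting the old cookies.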

Jamie Wong
Right under the Evo - More Views HTC EVO™ 4G * Access to 4G speeds that are up to 10x faster than 3G * Dual-mode 3G/4G device, access to dependable 3G
Ryan
Thanks a ton!! One last question, if you have a second: will this cookie persist long enough to run this script over a period of time, or will I have to automate downloading the cookie again?
Ryan
A: 

If you are getting different source code from the same URL, the server is most likely sniffing your user agent and serving code specific to it.

JavaScript can act on the DOM and do all sorts of things, but if you use "View source" the code will be exactly what your browser first received (before any DOM manipulation).
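
One way to check the user-agent theory is to fetch the same URL twice with different -A values and compare the bodies. This is only a rough sketch: the function name same_for_user_agents is my own, and pages that embed timestamps or session IDs will differ between any two requests, so a mismatch is a hint rather than proof.

```shell
#!/usr/bin/env bash
# same_for_user_agents URL UA_A UA_B
# Returns 0 if both User-Agent strings get an identical body, 1 if the
# bodies differ, and 2 if either curl request fails.
same_for_user_agents() {
    local url="$1" ua_a="$2" ua_b="$3"
    local body_a body_b
    # -s silences the progress meter; -A sets the User-Agent header.
    body_a="$(curl -s -A "$ua_a" "$url")" || return 2
    body_b="$(curl -s -A "$ua_b" "$url")" || return 2
    [ "$body_a" = "$body_b" ]
}

# Usage (performs real network requests):
#   same_for_user_agents "$URL" "curl/7.0" "Mozilla/5.0 (X11; Linux) Firefox/3.6.3" \
#       && echo "same body" || echo "different body (or request failed)"
```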

Frankie