I would like to download Google Trends csv data using wget, but I'm unfamiliar with using wget. An example URL is:

http://www.google.com/insights/search/overviewReport?cat=71&geo=US&q=apple&date&cmpt=q&content=1&export=1

Opening this with a web browser, I retrieve the expected file. To do this with wget, I tried the following command:

wget "http://www.google.com/insights/search/overviewReport?cat=71&geo=US&q=apple&date&cmpt=q&content=1&export=1" -O report.csv

which results in the following:

<html><head><title>Redirecting</title>
<meta http-equiv="refresh" content="0; url=&#39;http://www.google.com/insights/search#content=1&amp;cat=71&amp;geo=US&amp;q=apple&amp;date&amp;cmpt=q&#39;"></head>
<body bgcolor="#ffffff" text="#000000" link="#0000cc" vlink="#551a8b" alink="#ff0000"><script type="text/javascript" language="javascript">
    location.replace("http://www.google.com/insights/search#content\x3d1\x26cat\x3d71\x26geo\x3dUS\x26q\x3dapple\x26date\x26cmpt\x3dq")
  </script></body></html>

My first guess is that wget doesn't have access to cookies with proper authentication.

Anybody?

+2  A: 

You are getting a redirect message. Take the URL in the location.replace bit, and you get a valid index.html from Google if you use that URL in a second call to wget.
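A minimal sketch of that second step, assuming the redirect page was saved to a file as in the question. The function name extract_redirect is made up for illustration; it pulls the quoted argument out of location.replace(...) and decodes only the two escapes that appear in the output above (\x3d for = and \x26 for &).

```shell
# Hypothetical helper: print the redirect target found in a saved page.
# $1: path to the HTML file that wget wrote (e.g. report.csv in the question)
extract_redirect() {
    # Keep only the quoted argument of location.replace("..."),
    # then decode the \x3d (=) and \x26 (&) JavaScript escapes.
    sed -n 's/.*location\.replace("\([^"]*\)").*/\1/p' "$1" |
        sed -e 's/\\x3d/=/g' -e 's/\\x26/\&/g'
}
```

Usage would be something like wget "$(extract_redirect report.csv)" -O index.html. As the comments in this thread note, that fetches an HTML page, not the csv.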

Methinks you simply don't have the proper URL from where the csv data is downloaded. For a working example of how to 'hit' a CGI interface with a downloader, look at R's get.hist.quote() in the tseries package.

Edit: Here is what get.hist.quote() does:

R> IBM <- get.hist.quote("IBM")
trying URL 'http://chart.yahoo.com/table.csv?s=IBM&a=0&b=02&c=1991&d=9&e=08&f=2009&g=d&q=q&y=0&z=IBM&x=.csv'
Content type 'text/csv' length unknown
opened URL
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... .......... ..........
.......... .......... .......... ......
downloaded 236 Kb

R>

You could hit that same URL directly; the function's source shows how, and it is worth studying. If you need cookies, you may need to look at Duncan TL's code for hitting Google Docs etc.

Dirk Eddelbuettel
Dirk: get.hist.quote() uses download.file(), which has three options (wget, lynx, and internal). I tried each of them, but ran into errors. The url is valid in my web browser, so I still think I need to be able to authenticate one of these downloaders (wget, lynx, etc). Thoughts?
Christopher DuBois
a) Yes, it worked for me. I get a 124 kB file index.html, else I would not have suggested the answer I offered above. b) If tseries::get.hist.quote does not work for you, then something else is amiss, maybe a proxy or something else. You probably need to sort that out first. c) I only offered get.hist.quote() to show you that you may be able to do all this from within R. That said, wget has a gazillion options and a detailed manual. Maybe time for RTFineM?
Dirk Eddelbuettel
Hmm. I get a 124 kb index.html as well, but the original file I'm trying to download is a 23 kb csv file. (And rest assured, once I figure out the issue with wget, I will be bringing this back into R with download.file() or something.)
Christopher DuBois
Same here. I get the csv if I do it from the browser in a session where I have Google cookies, but not from the command line. You need to hit the wget docs / web to learn how to tell wget about the cookies used by your browser ... and then wget will masquerade as the browser. The csv is still ugly with the header / footer lines, but we can tell R how to cope with that. So back to fixing cookies...
Dirk Eddelbuettel
--load-cookies was the answer.
Christopher DuBois
I had actually tried that too, but my cookies.txt must have been too old (as I switched to Chromium instead of Firefox and I had no time to chase the cookies.txt for Chromium...). Good to know you have it sorted out.
Dirk Eddelbuettel
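For anyone landing here later, the --load-cookies fix from the comments can be sketched roughly as follows. This assumes you have exported a cookies.txt (the Netscape-style format wget reads) from a browser session that is logged in to Google; the function name fetch_report and the file names are placeholders, not anything from the thread.

```shell
# URL from the question; single quotes keep the shell from interpreting & and ?
URL='http://www.google.com/insights/search/overviewReport?cat=71&geo=US&q=apple&date&cmpt=q&content=1&export=1'

# Hypothetical wrapper around wget --load-cookies.
# $1: cookie file exported from the browser (assumed to be fresh)
# $2: output file for the csv
fetch_report() {
    # Fail early if the cookie file is missing or unreadable.
    if [ ! -r "$1" ]; then
        echo "cookie file not readable: $1" >&2
        return 1
    fi
    # Send the browser's cookies along with the request.
    wget --load-cookies "$1" -O "$2" "$URL"
}
```

Called as fetch_report cookies.txt report.csv. If the cookies have expired (as happened to Dirk above), expect the redirect page again instead of the csv.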