views:

298

answers:

2

Hi, I was wondering if it is possible to "automate" the task of typing in entries to search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOI's (digital object identifier); manually for this I would go to the journal articles search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.) and then find the article out of its list of returned results, and pick out the DOI and paste that into my reference list. I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols... is such a thing possible (for instance using something like Python's BeautifulSoup?). Are there any good references for doing anything remotely similar to this task? I'm just as much interested in learning about web scraping and tools for web scraping in general as much as getting this particular task done... Thanks for your time!

A: 
WebRequest req = WebRequest.Create("http://www.URLacceptingPOSTparams.com");

req.Proxy = null;
req.Method = "POST";
req.ContentType = "application/x-www-form-urlencoded";

//
// add POST data
string reqString = "searchtextbox=webclient&searchmode=simple&OtherParam=???";
byte[] reqData = Encoding.UTF8.GetBytes (reqString);
req.ContentLength = reqData.Length;
//
// send request
using (Stream reqStream = req.GetRequestStream())
  reqStream.Write (reqData, 0, reqData.Length);

string response;
//
// retrieve response
using (WebResponse res = req.GetResponse())
using (Stream resSteam = res.GetResponseStream())
using (StreamReader sr = new StreamReader (resSteam))
  response = sr.ReadToEnd();

// use a regular expression to break apart response
// OR you could load the HTML response page as a DOM

(Adapted from Joe Albahri's "C# in a nutshell")

Mitch Wheat
Thank you - good to know it is possible! ...I am guessing. (not too familiar with .NET, though I hear it is all the rage...)
Stephen
+2  A: 

Hey Stephen,

Beautiful Soup is great for parsing webpages- that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:

http://wwwsearch.sourceforge.net/mechanize/

Mechanize let's you control a browser:

# Follow a link
browser.follow_link(link_node)

# Submit a form
browser.select_form(name="search")
browser["authors"] = ["author #1", "author #2"]
browser["volume"] = "any"
search_response = br.submit()

With Mechanize and Beautiful Soup you have a great start. One extra tool I'd consider is Firebug, as used in this quick ruby scraping guide:

http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/

Firebug can speed your construction of xpaths for parsing documents, saving you some serious time.

Good luck!

mixonic
Great!! Thank you - very helpful!
Stephen
Stephen! Mark me an answer! I'm racing a co-worker to 100 points :-)
mixonic
I'm trying! I just got an OpenID but it tells me I have to have 15 reputation to vote up?? Sorry, first time on stackoverflow... is it this complicated?
Stephen
Heh, Thanks Stephen. You can always pick an answer, but you need 10 points to vote things up.
mixonic
Ah... sorry I could not do more, but your answer was super helpful!
Stephen
I'll help you out a the vote.
BenMaddox