Hi, I was wondering whether it's possible to automate the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journal articles for which I'd like to get DOIs (digital object identifiers). Doing this manually, I would go to the journal's article search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.), find the article in the list of returned results, then pick out the DOI and paste it into my reference list.
I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols. Is such a thing possible (for instance, using something like Python's BeautifulSoup)? Are there any good references for doing anything remotely similar to this task? I'm just as interested in learning about web scraping and web-scraping tools in general as in getting this particular task done. Thanks for your time!
A:
WebRequest req = WebRequest.Create("http://www.URLacceptingPOSTparams.com");
req.Proxy = null;
req.Method = "POST";
req.ContentType = "application/x-www-form-urlencoded";
//
// add POST data
string reqString = "searchtextbox=webclient&searchmode=simple&OtherParam=???";
byte[] reqData = Encoding.UTF8.GetBytes (reqString);
req.ContentLength = reqData.Length;
//
// send request
using (Stream reqStream = req.GetRequestStream())
reqStream.Write (reqData, 0, reqData.Length);
string response;
//
// retrieve response
using (WebResponse res = req.GetResponse())
using (Stream resStream = res.GetResponseStream())
using (StreamReader sr = new StreamReader (resStream))
response = sr.ReadToEnd();
// use a regular expression to break apart response
// OR you could load the HTML response page as a DOM
(Adapted from Joseph Albahari's "C# in a Nutshell")
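Since the question mentions Python, the same kind of POST can be sketched with the standard library's urllib alone; the URL and field names below are placeholders matching the C# example, not a real endpoint:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build the form-encoded POST body (field names are placeholders --
# substitute whatever the real search form expects).
params = {
    "searchtextbox": "webclient",
    "searchmode": "simple",
}
data = urlencode(params).encode("utf-8")

# Supplying a data payload makes urllib issue a POST instead of a GET.
req = Request(
    "http://www.example.com/search",  # placeholder URL
    data=data,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# To actually send it:
#   from urllib.request import urlopen
#   with urlopen(req) as resp:
#       html = resp.read().decode("utf-8")
print(req.get_method())  # POST
```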
Mitch Wheat
2009-07-23 07:24:44
Thank you - good to know it is possible! ...I am guessing. (not too familiar with .NET, though I hear it is all the rage...)
Stephen
2009-07-23 19:42:08
+2
A:
Hey Stephen,
Beautiful Soup is great for parsing webpages; that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:
http://wwwsearch.sourceforge.net/mechanize/
Mechanize lets you drive a browser from a script (setup lines added for context; `link_node` stands in for a link you've already located):
import mechanize

browser = mechanize.Browser()
browser.open("http://pubs.acs.org/search/advanced")
# Follow a link
browser.follow_link(link_node)
# Submit a form
browser.select_form(name="search")
browser["authors"] = ["author #1", "author #2"]
browser["volume"] = "any"
search_response = browser.submit()
With Mechanize and Beautiful Soup you have a great start. One extra tool I'd consider is Firebug, as used in this quick Ruby scraping guide:
http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
Firebug can speed up building the XPaths you use to parse documents, saving you some serious time.
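For the parsing half, here's a toy Beautiful Soup sketch of pulling a DOI out of a results page. The HTML, class names, and DOI below are invented; you'd inspect the real results page (e.g., with Firebug) to find its actual structure:

```python
from bs4 import BeautifulSoup

# Invented stand-in for a search-results page.
html = """
<div class="searchResult">
  <a class="titleLink" href="http://dx.doi.org/10.1021/xx000000x">Article title</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the first anchor whose href looks like a DOI link.
link = soup.find("a", href=lambda h: h and "dx.doi.org/" in h)
doi = link["href"].split("dx.doi.org/")[1]
print(doi)  # 10.1021/xx000000x
```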
Good luck!
mixonic
2009-07-23 12:26:47
I'm trying! I just got an OpenID but it tells me I have to have 15 reputation to vote up?? Sorry, first time on stackoverflow... is it this complicated?
Stephen
2009-07-24 05:42:52
Heh, Thanks Stephen. You can always pick an answer, but you need 10 points to vote things up.
mixonic
2009-07-24 11:00:06