Hello!

I'm trying the following command in my shell:

curl -b usptoCookies -L -d "patentNum=6836866&applicationNum=10007391&maintFeeAction=Get+Bibliographic+Data&maintFeeYear=04" https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000Nmdd1Q_YsDF90HKmb9EIIgq:11g0uehq7

Pretty straightforward. It attempts to post a few variables to a form. You can see the web page here: https://ramps.uspto.gov/eram/

Try entering 6836866 as the patent number and 10007391 as the application number, then hit the Get Bibliographic Data button.

The web page returns a "neatly" formatted table, but the curl call runs into "some" problem, and I am at a loss. I've used Firebug in the browser to confirm that the three variables above are all that is required to complete the form post.

It is not a problem with HTTPS, because I do get a response back. I need help.

Anyone?

Shaheeb Roshan

+2  A: 

There are a bunch of other hidden fields in that form, including a "signature", which seems to be a unique string generated each time you request the page. This is probably a feature to ensure that you aren't scraping all the information out of their database.

When I emptied out the hidden signature field, the site returned an error. If you want to write a program to fetch this information, you will probably have to do something a little more complicated: fetch the page with the "signature" on it first, so that you can post that value back to the site and get a proper response.
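A minimal sketch of that two-step approach with plain curl. The field name "signature" and the URLs come from the page itself, but the sed pattern is an assumption: it expects the value attribute to sit directly after name="signature" in the HTML, which may not match the actual markup.

# Step 1: load the form page, saving its cookies, and capture the current
# value of the "signature" hidden field (layout of the HTML is assumed).
SIG=$(curl -s -c usptoCookies "https://ramps.uspto.gov/eram/" |
  sed -n 's/.*name="signature" value="\([^"]*\)".*/\1/p')

# Step 2: post the visible fields plus the captured signature, reusing the cookies.
curl -b usptoCookies -L \
  -d "patentNum=6836866&applicationNum=10007391&maintFeeAction=Get+Bibliographic+Data&maintFeeYear=04&signature=$SIG" \
  "https://ramps.uspto.gov/eram/getMaintFeesInfo.do"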

Kibbee
Indeed. Also, as 'some' suggested, you should put the URL in quotes (due to the semicolon).
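Without the quotes, the shell treats the semicolon in the URL as a command separator, so curl never sees the jsessionid part. Quoted, the original command becomes:

curl -b usptoCookies -L \
  -d "patentNum=6836866&applicationNum=10007391&maintFeeAction=Get+Bibliographic+Data&maintFeeYear=04" \
  "https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000Nmdd1Q_YsDF90HKmb9EIIgq:11g0uehq7"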
mweerden
A: 

Hi Kibbee,

I thought that might be the case, so on one of my scrapes I had it output the content to a file that I could open in my browser. This allowed me to manipulate the form elements and re-submit, to see whether removing certain hidden fields would affect the post. When I opened the page and removed all of the hidden fields (including sessionId, signature and loadtime), I was still able to submit the form and get a valid response.
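Roughly like this, saving the response with -o (the output filename is arbitrary):

curl -b usptoCookies -L \
  -d "patentNum=6836866&applicationNum=10007391&maintFeeAction=Get+Bibliographic+Data&maintFeeYear=04" \
  -o response.html "https://ramps.uspto.gov/eram/getMaintFeesInfo.do"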

Thinking this might point to some cookie- or session-related problem, I fired up Selenium (through the Testing_Selenium package) and attempted the same scrape. The idea was that, since Selenium drives a real browser, any session/cookie issue should be eliminated.

When the Selenium run failed in the same way as the curl run, I was at my wit's end.

I was hoping someone could see something weird or unusual about this page that might explain the failure.

Thanks for your input. What do you think?

Shaheeb R.