I'm trying to get the "Added contacts" on this page:

The data is injected with an Ajax call, and I used Wireshark to capture the request that I think fetches it:

POST /dwr/call/plaincall/UserActionAPI.getRecentlyTitleChangedContacts.dwr HTTP/1.1

This request posts a number of variables plus a cookie, so I put together the following curl call:

curl -b "cookiejar.txt" -v -e "http://www.jigsaw.com/showContactUpdateTab.xhtml?companyId=212324" \
    -F "type=text/plain&callCount=1&page=/showContactUpdateTab.xhtml?companyId=212324&httpSessionId=9CDBDA38B4F0C2A84622B523E79C0C38&scriptSessionId=784885169D0457ECDCA26FEF7B6DD7CF305&c0-scriptName=UserActionAPI&c0-methodName=getRecentlyAddedContacts&c0-id=0&c0-param0=number:212324&c0-param1=boolean:false&c0-param2=boolean:false&batchId=0" \
    "http://www.jigsaw.com/dwr/call/plaincall/UserActionAPI.getRecentlyAddedContacts.dwr"

But it always returns what looks like more JavaScript, even though I can see the proper data being returned in Wireshark. I've been looking at this for a while but still can't figure out a way to get the data. Help?

+1  A: 

You're attempting to scrape an Ajax-powered HTML page using curl.

That's ambitious, since the original page first reaches a certain state (obtaining a session from the server, collecting cookies, etc.) before making the Ajax call.

You'll need to exactly mimic what the page is doing.

For example, the call both sends cookies containing the session id and sends that same session id as one of its POST parameters. So you need to read the incoming cookie value in order to construct the outgoing POST parameter correctly. I don't know how you'd do that using curl alone.
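As a sketch of that idea (in Python rather than Perl, since it's easy to test; the cookie name `JSESSIONID` and the parameter values are assumptions based on the capture in the question), you could pull the session id out of a cookie jar and splice it into a DWR-style plain-text POST body:

```python
import http.cookiejar
import urllib.request

def build_dwr_body(session_id, company_id):
    """Build a DWR-style plain-text POST body, one key=value pair per line.
    The parameter names mirror the Wireshark capture; scriptSessionId is a
    placeholder copied from the question."""
    params = [
        "callCount=1",
        f"page=/showContactUpdateTab.xhtml?companyId={company_id}",
        f"httpSessionId={session_id}",  # must match the session cookie
        "scriptSessionId=784885169D0457ECDCA26FEF7B6DD7CF305",
        "c0-scriptName=UserActionAPI",
        "c0-methodName=getRecentlyAddedContacts",
        "c0-id=0",
        f"c0-param0=number:{company_id}",
        "c0-param1=boolean:false",
        "c0-param2=boolean:false",
        "batchId=0",
    ]
    return "\n".join(params) + "\n"

def fetch_session_id(jar):
    """Find the server's session cookie in a jar that was populated by an
    earlier GET of the page. 'JSESSIONID' is an assumption (typical for a
    Java server); check the real cookie name in your capture."""
    for cookie in jar:
        if cookie.name == "JSESSIONID":
            return cookie.value
    return None

# Usage sketch: one opener that keeps cookies across the GET and the POST,
# so the cookie and the httpSessionId parameter stay in sync.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open("http://www.jigsaw.com/showContactUpdateTab.xhtml?companyId=212324")
# body = build_dwr_body(fetch_session_id(jar), 212324)
```

The key point is that both values come from the same live session, rather than being pasted in from an earlier browser capture.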

I suggest that you may need WWW::Mechanize (Perl) or some other, more capable scraping system when dealing with this web site.

Also note that the server returns the data you want as a JavaScript fragment, not as JSON. So you'll need to parse the reply once you've convinced the server to give it to you.

Added: In addition to Wireshark, you may want to try Firebug's Net tab and Fiddler when looking for differences between the original page's requests and your emulation of them.

A worthy project...

Added in response to comment about Perl Mechanize not supporting Javascript:

You do not need your scraping program to run JavaScript. You need it to emulate the HTML page's interaction with the server. If your program sends exactly the same bits to the server as the real HTML page does when running in a browser, then the server will respond with the data you want.

Since it isn't responding with the data, you are not sending the same bits.

You should start by exactly emulating the browser. For instance, send the same headers in your requests, including the User-Agent, Accept, and the other headers. The server could be inspecting any of them.
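Sketching that in Python (the header values below are hypothetical stand-ins; copy the real ones verbatim from your browser capture):

```python
import urllib.request

# Hypothetical headers copied from a browser capture. The point is to
# reproduce every header the real page sends, not just the User-Agent.
browser_headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/12.0 Safari/537.36"),
    "Accept": "text/javascript, text/html, application/xml, text/xml, */*",
    "Referer": "http://www.jigsaw.com/showContactUpdateTab.xhtml?companyId=212324",
    "Content-Type": "text/plain",
    "X-Requested-With": "XMLHttpRequest",  # many Ajax frameworks send this
}

req = urllib.request.Request(
    "http://www.jigsaw.com/dwr/call/plaincall/"
    "UserActionAPI.getRecentlyAddedContacts.dwr",
    data=b"...",  # placeholder: the DWR body from the capture goes here
    headers=browser_headers,
)
# urllib.request.urlopen(req)  # left commented out: requires network access
```

Comparing this request byte-for-byte against the browser's (in Wireshark or Fiddler) is the fastest way to spot what's still different.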

Larry K
The WWW::Mechanize Perl package doesn't support JavaScript. Yeah, I'll have to parse the data; that's not a problem. Looking in Firefox, the difference is that Firefox gets the data and curl doesn't. I've opened the page with Chrome, changed the User-Agent and session IDs, and copied the cookies into the curl call, with the same result. There has to be some Perl module that does this correctly? I mean, all I want to do is run the JavaScript and then refresh the DOM, right?
Sho Minamimoto
I added to the answer, see above. Since you're scraping the server, you don't have a DOM on your client: you're emulating a browser, but your software is not a browser. Opening a session in a browser and then trying to complete the session in curl will usually not work, because you'll be coming into the server on a different TCP connection and the server will generate a new session for you. Your client needs to emulate the entire conversation with the server.
Larry K