There's a search site whose search results are generated dynamically by JavaScript. The user enters a query, and the site displays the results on the page without refreshing.

I need to grab those search results programmatically (say from a Java program or a perl/python script).

So ideally, I can launch my program with 100 queries as user inputs, and then the program would hit that website with each query and spit out on my screen all the search results as returned by the website.

The obvious problem is that the site is built with JavaScript instead of simple HTML, so sending a URL request and parsing the resulting output is not going to work (the source code for this page is always just a bunch of references to various .js files).

Given the above conditions, what are my options?

A: 

Install Firebug, study the requests that are made by the site's JavaScript, and mimic them in your program. Chances are there is a single request that needs to be made, and the results will come back in some nice form like JSON.
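As a rough illustration of mimicking such a request, here is a minimal Python sketch. It assumes the site exposes a GET endpoint that takes the query in a `q` parameter and answers with JSON; the URL `http://example.com/search` and the parameter name are placeholders, not something taken from the actual site, so substitute whatever Firebug shows.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the URL observed in Firebug's Net panel.
SEARCH_URL = "http://example.com/search"


def build_search_request(query):
    """Build (but do not send) the request the page's JavaScript would make."""
    # Assumption: the query travels in a parameter named "q".
    params = urllib.parse.urlencode({"q": query})
    return urllib.request.Request(
        SEARCH_URL + "?" + params,
        headers={"Accept": "application/json"},
    )


def parse_results(body):
    """Decode the response body, assuming the server answers with JSON."""
    return json.loads(body)


def search(query):
    """Send the request and return the decoded results."""
    with urllib.request.urlopen(build_search_request(query), timeout=10) as resp:
        return parse_results(resp.read().decode("utf-8"))
```

Running 100 queries is then just a loop over `search(q)`, ideally with a pause between calls so the site is not hammered.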

artemb
+2  A: 

Unless the search provider gives you an API to work with (either via backchannel agreement or a publicly available one), then nothing you do will be likely to work for very long.

You may go to great pains to fool the site into believing you are an ordinary website user. Then, they will make some minor change to how their site works (because they have no idea someone is using it in the fashion you are) and all of a sudden your hack won't work. Sometime later, they may notice that you are using them in this fashion, and detect your usage and flat out block it.

Basically, unless they give you an API, you are essentially stealing, and should expect to receive all the courtesy that deserves... none.

Lest you think I am judging you, I'll let you know I speak from experience ;)

larson4
What about Dynamic Data Exchange? Can't you use that to grab any content from any window in Windows? So I can just grab the content from my browser, and parse it in my program?
Saobi
@Saobi, you do realize that DDE is very, very old and not used in modern web browsers?
unforgiven3
what problems will I run into with DDE ?
Saobi
A: 

JavaScript makes HTTP requests almost exactly like a browser does. Once you figure out what they are, you can try to re-create them in Perl/Python/etc. With Firefox+Firebug you can see the requests in the 'Net' panel.

Things you might have to take into account are user-agent string, cookies, the fact that sometimes the returned data is meant to be run/interpreted by Javascript etc. Maybe your language of choice has a nice httpbrowser class you can use?
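For the user-agent and cookie part, a minimal Python sketch using only the standard library might look like this. The user-agent string is just an example value, and `make_session` is a name made up for illustration.

```python
import http.cookiejar
import urllib.request

# Example value only; copy the string your own browser sends if the site is picky.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def make_session(user_agent=USER_AGENT):
    """Return an opener that keeps cookies between requests and sends a browser-like User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces the default "Python-urllib/x.y" header on every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

A typical use would be to call `opener.open(...)` on the search page first, so any session cookies land in the jar, and then send the actual search request through the same opener.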


Just took a look, searching for IBM, took the post data from Firebug, replaced newlines with '&' and put it after the request url:

[http://bcode.bloomberg.com/sym/dwr/call/plaincall/searchMgr.search.dwr?callCount=1&windowName=&c0-scriptName=searchMgr&c0-methodName=search&c0-id=0&c0-e1=string:ibm&c0-e2=string:&c0-e3=number:100&c0-e4=number:0&c0-e5=boolean:false&c0-param0=Object_SearchCriteria:{search:reference:c0-e1,%20filter:reference:c0-e2,%20limit:reference:c0-e3&,%20start:reference:c0-e4,%20allSources:reference:c0-e5}&batchId=4&page=%2Fsym%2F&httpSessionId=&scriptSessionId=FBC68693A4E1BC08D6E0DDFBDF6D0860]

but it returns

throw 'allowScriptTagRemoting is false.';
//#DWR-REPLY
if (window.dwr) dwr.engine.remote.handleBatchException({ name:'java.lang.SecurityException', message:'GET Disallowed' });
else if (window.parent.dwr) window.parent.dwr.engine.remote.handleBatchException({ name:'java.lang.SecurityException', message:'GET Disallowed' });

and no data. So it looks like you have to script a POST request. Looking at their restrictions and guidelines, maybe you should just get in touch and ask if there's a public API?
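A sketch of what scripting that POST could look like in Python follows. The field names and values are the ones from the URL above, put back one per line as Firebug showed them. Several things here are assumptions: that the server expects a `text/plain` body, that the `c0-param0` line has no stray `&` in the real post data, that the search term should be URL-quoted, and that an empty or stale `scriptSessionId` is accepted. Compare against what the Net panel actually shows before relying on it.

```python
import urllib.parse
import urllib.request

DWR_URL = "http://bcode.bloomberg.com/sym/dwr/call/plaincall/searchMgr.search.dwr"


def build_dwr_body(query, limit=100, start=0, script_session_id=""):
    """Rebuild the newline-separated post data seen in Firebug for one search."""
    lines = [
        "callCount=1",
        "windowName=",
        "c0-scriptName=searchMgr",
        "c0-methodName=search",
        "c0-id=0",
        # Assumption: the search term is URL-quoted inside the string value.
        "c0-e1=string:" + urllib.parse.quote(query),
        "c0-e2=string:",
        "c0-e3=number:%d" % limit,
        "c0-e4=number:%d" % start,
        "c0-e5=boolean:false",
        "c0-param0=Object_SearchCriteria:{search:reference:c0-e1, "
        "filter:reference:c0-e2, limit:reference:c0-e3, "
        "start:reference:c0-e4, allSources:reference:c0-e5}",
        "batchId=4",
        "page=/sym/",
        "httpSessionId=",
        # Assumption: may need a fresh value obtained from the site.
        "scriptSessionId=" + script_session_id,
    ]
    return "\n".join(lines) + "\n"


def build_dwr_request(query, **kwargs):
    """Build (but do not send) the POST request carrying the body above."""
    data = build_dwr_body(query, **kwargs).encode("utf-8")
    return urllib.request.Request(
        DWR_URL,
        data=data,
        headers={"Content-Type": "text/plain"},  # assumed content type
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(build_dwr_request("ibm"))`. The reply will be JavaScript meant for the page's own handler rather than plain JSON, so it still needs to be picked apart afterwards.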

MSpreij
Ok I used the Net panel monitoring of Firebug. And whenever I submit a query on that website, the request is a POST, but the URL has search.dwr appended to it, not the actual query I submitted.
Saobi
The search is handled by Javascript, so it can put together and use any url it wants, obviously. You'd have to look into the source to see how it does that, or just see what it posts where and try to mimic that in your code. Is the search site public?
MSpreij
But that POST request I saw in Firebug, how can I dig deeper to see the equivalent HTTP request (preferably with a search query appended)? Yes, this is a public site.
Saobi
The POST request *is* the HTTP request; the Net panel should also show what post data was sent along. You can try taking that apart and tacking it onto the URL it's posted to as GET parameters, but that won't necessarily work (it depends on whether the server supports GET queries). But maybe you need to use curl or something similar to do an actual POST from your script. Basically the script has to behave like a browser.
MSpreij
Saw your sample search above. So it looks like this is not easily doable? You managed to figure out the exact request to send. But that's not enough....
Saobi
That's right :-) As I said, you'd need to re-create the POST request in your scripting language. For PHP, I'd point you at http://www.php.net/curl and http://www.php.net/manual/en/function.curl-setopt.php , search for 'post'.
MSpreij