Hey Guys,

I am building a scraper that needs to scrape some web content. The problem is that the page I need to crawl loads a lot of JavaScript, and the scripts seem to set cookies and generate query string parameters for the subsequent requests.

I am able to set the cookies by sending requests to the JS files, but the query string parameters seem to be generated by some encoded JavaScript calls.

I have not been able to decipher them. I tried googling for tools that compile JS to C#, but in vain. If someone has solved a similar issue before, please shed some light on how I can compile a JavaScript file the way a browser does and generate the HTML directly from my C# code.

Any help would be deeply appreciated.

+2  A: 

Why not use a web proxy like Fiddler to find out which headers and cookies are set, and use this data directly in your C#?

That way you will not need to execute the JS just to figure out headers and cookies.
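
As a rough sketch of that idea, the values seen in the proxy capture could be replayed from C# roughly like this. It assumes HttpClient is available, and every name and value in it (the `token` parameter, the `session` cookie, `example.com`, the `CapturedRequestReplay` class) is a hypothetical placeholder rather than something taken from the actual site. It only works when the captured values stay valid between requests.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class CapturedRequestReplay
{
    // Builds a GET request carrying a query-string value copied from a proxy capture.
    // "token" is a made-up parameter name used for illustration only.
    public static HttpRequestMessage BuildRequest(Uri pageUri, string token)
    {
        var builder = new UriBuilder(pageUri)
        {
            // Escape the captured value so characters like '&' or spaces survive.
            Query = "token=" + Uri.EscapeDataString(token)
        };

        var request = new HttpRequestMessage(HttpMethod.Get, builder.Uri);
        request.Headers.Referrer = pageUri;
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0");
        return request;
    }

    // Creates a client that sends whatever cookies are stored in the container.
    public static HttpClient CreateClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler { CookieContainer = cookies, UseCookies = true };
        return new HttpClient(handler);
    }
}
```

To use it, add the captured cookie to a `CookieContainer` (for example `cookies.Add(new Cookie("session", "abc123", "/", "example.com"))`), create the client with `CreateClient(cookies)`, and send the message returned by `BuildRequest` with `client.SendAsync(...)`.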

Update:

You can also use a web automation suite such as WatiN to crawl the site - I believe it already supports JS, so you don't need to do much more.

Update2:

Since WatiN is no good for your requirements, perhaps compiling the script directly with a JavaScript-to-.NET compiler will be possible. See JScript.NET, though I doubt any DOM manipulation will work.

Oded
This will only work for static cookies. If the script uses fancy logic with session IDs to generate the cookies, you are left to duplicate that logic, which is hard. Scraping is often hard.
Mikael Svenson
@Mikael Svenson - true enough. Will add other options.
Oded
+1 @Oded: Good link with WatiN, and it's a good choice for "troublesome" sites. But I would not use this for bulk crawling, as it uses IE/FF for the actual crawling, and it might require you to add your sites to "trusted sites" etc.
Mikael Svenson
@Mikael Svenson - I understand your concerns, but apart from embedding a JS engine, what would you do?
Oded
@Oded: Embedding would be the only way afaik, that's why I gave you an upvote. Scraping gives headaches ;) I think your answer highlights the options at hand.
Mikael Svenson
@Oded, I am already aware of WatiN. The issue is that it uses IE, so I can't build a multithreaded solution: I would need to run an IE instance in each thread, which would be too resource intensive. I want a multithreaded solution to make the crawling faster, running around 200-500 threads on a dual-core server machine. I am still looking for possible ways to code this effectively.
Sumit Ghosh
@Sumit Ghosh - The only other alternative is to compile the javascript with JScript.NET - http://en.wikipedia.org/wiki/JScript_.NET
Oded
+1  A: 

It may be more complicated than you think. Take a look at these two topics:

http://stackoverflow.com/questions/1283370/any-javascript-engine-for-net-c

http://stackoverflow.com/questions/172753/embedding-javascript-engine-into-net-c

negative