views: 209
answers: 5

I want to mine large amounts of data from the web using the IE browser. However, spawning lots and lots of IE instances via WatiN crashes the system. Is there a better way of doing this? Note that I can't simply use WebRequests - I really need the browser because I have to interact with JS-driven behavior on the site.

+1  A: 

What about launching multiple instances of the WebBrowser control (it's IE anyway) in a .NET app and processing the data mining jobs asynchronously?
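
Roughly what I have in mind (just a sketch - the URLs and the DocumentCompleted handler are placeholders, and each control needs its own STA thread with a message pump):

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class BrowserWorker
{
    static void Main()
    {
        string[] urls = { "http://example.com/page1", "http://example.com/page2" };

        foreach (var url in urls)
        {
            var u = url; // local copy for the closure
            var t = new Thread(() => Scrape(u));
            t.SetApartmentState(ApartmentState.STA); // WebBrowser is COM and requires STA
            t.Start();
        }
    }

    static void Scrape(string url)
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // Placeholder: pull whatever the JS-driven page produced.
            Console.WriteLine(browser.DocumentText.Length);
            Application.ExitThread(); // stop this thread's message loop
        };
        browser.Navigate(url);
        Application.Run(); // message pump so navigation and script events can fire
    }
}
```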

If performance is a problem, splitting the job and pushing it to the cloud might also help.

Rinat Abdullin
Yeah, I thought about moving it into the cloud, but somehow I doubt that spawning `WebBrowser` controls is any different from spawning actual instances of IE. Will need to check. Thanks for the idea!
Dmitri Nesteruk
It's not much different, probably. But you'll have better control of the exceptions and crashes.
Rinat Abdullin
+1  A: 

Did you try the commercial version of iMacros yet? It is a bit like WatiN, but geared more towards web automation / web scraping. Basically they added special code to deal with all the different browser annoyances. Their samples include C#/VB.NET multi-threading code for use with IE and Firefox. We use it with Ruby ;)

We have no problem running many instances per server. While I cannot reveal the name of our company, I know that AlertFox uses the same approach for web monitoring.

Ruby8848
Thanks, but I'm really not inclined to use anything commercial here.
Dmitri Nesteruk
+1  A: 

The best way would be to actually create one process per web browser instance. The web browser is not managed code, it's COM, and there are cases where unmanaged exceptions cannot be handled in managed code, so the application will certainly crash.

Even better, create a process host that spawns multiple worker processes; you can use named pipes, sockets, or WCF to communicate between the processes if you need to.
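
A rough sketch of the host side (Worker.exe is a hypothetical scraper executable that takes a URL on its command line; reading its stdout stands in for the named pipes/WCF channel):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class ProcessHost
{
    static void Main()
    {
        var urls = new[] { "http://example.com/a", "http://example.com/b" };
        var workers = new List<Process>();

        foreach (var url in urls)
        {
            var psi = new ProcessStartInfo("Worker.exe", "\"" + url + "\"")
            {
                UseShellExecute = false,
                RedirectStandardOutput = true // simplest possible IPC: read the worker's stdout
            };
            var p = Process.Start(psi);
            p.OutputDataReceived += (s, e) => { if (e.Data != null) Console.WriteLine(e.Data); };
            p.BeginOutputReadLine();
            workers.Add(p);
        }

        // If a worker's IE instance crashes, only that process dies;
        // the host sees the exit and can restart the job.
        foreach (var p in workers) p.WaitForExit();
    }
}
```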

Best of all, create a small embedded SQL database and queue your jobs in it: the mining process fetches a new request and posts the result back to the database, and the database can be used to synchronize everything.
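
A sketch of such a queue, assuming SQL Server Compact (the table layout and connection string are only illustrative):

```csharp
using System;
using System.Data.SqlServerCe;

class JobQueue
{
    const string ConnStr = "Data Source=jobs.sdf";

    static void Main()
    {
        // One-time setup: create the database file and the job table.
        using (var engine = new SqlCeEngine(ConnStr))
            engine.CreateDatabase();

        using (var conn = new SqlCeConnection(ConnStr))
        {
            conn.Open();
            Exec(conn, "CREATE TABLE Jobs (Id INT IDENTITY PRIMARY KEY, " +
                       "Url NVARCHAR(2000), Status NVARCHAR(20), Result NTEXT)");
            Exec(conn, "INSERT INTO Jobs (Url, Status) VALUES ('http://example.com/a', 'Pending')");
        }
    }

    static void Exec(SqlCeConnection conn, string sql)
    {
        using (var cmd = new SqlCeCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }
}
```

Each worker would then claim a row by flipping its Status and write the scraped Result back when done.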

Akash Kava
Synchronization is a pain, and databases don't scale to multiple machines (especially embedded ones). I believe a simple queue would do the trick better for communication between multiple processes here (MSMQ for a single-machine deployment, and some cloud or AMQP implementation for the distributed case).
Rinat Abdullin
Yes, I was only suggesting that for a single machine; for multiple machines, some sort of queue such as MSMQ or some other distributed communication will suffice.
Akash Kava
+1  A: 

I am mining a lot of pages with WatiN - 30+ at this moment, actually. Of course it takes a lot of resources (about 2.5 GB of RAM), but it would be almost impossible to do the same with WebRequest; I can't imagine doing such a thing in a reasonable time. With WatiN it takes a couple of hours.

I don't know if it helps you, but I am using the WebBrowser control to do that, with every instance in a separate process. What I think is more important to you, though: I once tried to reduce the amount of memory used by doing all of it in a single process. It's possible to create separate AppDomains instead of processes and force them to use the same DLLs (especially Microsoft.mshtml.dll) instead of loading the same DLL separately for each new AppDomain. I can't remember exactly how to do that now, but it's not hard to google. What I do remember is that everything worked fine and RAM usage decreased significantly, so I think it's worth trying.
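
A rough sketch of the idea (ScrapeWorker is a hypothetical class in the same assembly; whether Microsoft.mshtml.dll is actually shared depends on how the interop assembly gets loaded, so treat this as a starting point rather than a recipe):

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Must derive from MarshalByRefObject so the host can call it across the domain boundary.
public class ScrapeWorker : MarshalByRefObject
{
    public void Run(string url)
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            Console.WriteLine("{0}: {1} chars", url, browser.DocumentText.Length);
            Application.ExitThread();
        };
        browser.Navigate(url);
        Application.Run(); // each worker still needs its own STA message pump
    }
}

public static class Host
{
    // MultiDomain asks the CLR to load domain-neutral assemblies once and share them.
    [LoaderOptimization(LoaderOptimization.MultiDomain)]
    static void Main()
    {
        string[] urls = { "http://example.com/a", "http://example.com/b" };
        foreach (var url in urls)
        {
            var u = url; // local copy for the closure
            var domain = AppDomain.CreateDomain("scraper-" + Guid.NewGuid());
            var worker = (ScrapeWorker)domain.CreateInstanceAndUnwrap(
                typeof(ScrapeWorker).Assembly.FullName,
                typeof(ScrapeWorker).FullName);

            var t = new Thread(() => worker.Run(u));
            t.SetApartmentState(ApartmentState.STA); // WebBrowser requires an STA thread
            t.Start();
        }
    }
}
```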

prostynick
I guess I have two questions here. First, say I'm limited to 1 GB of RAM for *everything* (the OS, WatiN, SQL Server, etc.) - what can I do about it? Second, how many concurrent processes were you running? The fundamental issue right now is that I can't spawn, say, 100 instances of IE.
Dmitri Nesteruk
As I said, there are 30+ concurrent processes that take 2.5 GB of RAM. I think 1 GB is enough to run them in separate AppDomains with shared DLLs, but I can't say how many more could be run.
prostynick
+1  A: 

I had a project where I scraped on the order of 45 million requests (with form submissions) over an extended period. On a sustained basis, I was scraping with about 20 simultaneous clients, and my pipe was the bottleneck.

I used Selenium Remote Control after experimenting with writing my own WebClient, with WatiN/Watir, and with Microsoft's UI Automation API.

Selenium RC lets you choose your browser; I used Firefox. Setting up the initial scraping scripts took about an hour of experimentation and tuning. Selenium was vastly faster than writing my own code and a lot more robust with little investment. Great tool.
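
A minimal example against the Selenium RC .NET client looks roughly like this (assuming a Selenium server is already running on localhost:4444; the URL is a placeholder):

```csharp
using System;
using Selenium;

class SeleniumScrape
{
    static void Main()
    {
        ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com");
        selenium.Start();
        try
        {
            selenium.Open("/");
            selenium.WaitForPageToLoad("30000");
            string html = selenium.GetHtmlSource(); // JS has already run in the real browser
            Console.WriteLine(html.Length);
        }
        finally
        {
            selenium.Stop();
        }
    }
}
```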

To scale the process, I tried a few different approaches, but ultimately what worked best was sticking each SRC instance in its own stripped-down VM and then spawning as many of those as the workstation had RAM to support. An equivalent number of SRC instances running natively on the host instead of in the VMs inevitably ground to a halt as I got above 10 instances. This required more overhead and setup time before a scraping run, but it would run strongly for days, uninterrupted.

Another consideration: tune your Firefox preferences down so no homepage loads, and turn off everything non-essential (spoofing checks, cookies if not required for your scrape, images, AdBlock and Flashblock, etc.).

Grant