I want to mine large amounts of data from the web using the IE browser. However, spawning lots and lots of instances of IE via WatiN crashes the system. Is there a better way of doing this? Note that I can't simply use WebRequests - I really need the browser because I have to interact with JS-driven behavior on the site.
What about launching multiple instances of the WebBrowser control (it's IE under the hood anyway) in a .NET app and processing the data-mining jobs asynchronously?
If performance is a problem, splitting the job up and pushing it to the cloud might also help.
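A minimal sketch of that approach, with placeholder URLs and an illustrative MineOne helper: the key constraint is that each WebBrowser must live on its own STA thread with a message pump, or its events never fire.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class Miner
{
    static void MineOne(string url)
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            if (e.Url != browser.Url) return;   // ignore subframe completions
            // The page (including its inline JS) has loaded; scrape the DOM here.
            Console.WriteLine(url + ": " + browser.DocumentText.Length + " chars");
            Application.ExitThread();           // stop this thread's message loop
        };
        browser.Navigate(url);
        Application.Run();                      // pump messages so events fire
    }

    static void Main()
    {
        foreach (var url in new[] { "http://example.com/a", "http://example.com/b" })
        {
            var u = url;                        // capture a fresh copy per iteration
            var t = new Thread(() => MineOne(u));
            t.SetApartmentState(ApartmentState.STA);  // WebBrowser requires STA
            t.Start();
        }
    }
}
```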
Did you try the commercial version of iMacros yet? It is a bit like WatiN, but geared more towards web automation and web scraping; basically, they added special code to deal with all the different browser annoyances. Their samples include C#/VB.NET multi-threading code for use with IE and Firefox. We use it with Ruby ;)
We have no problem running many instances per server. While I cannot reveal the name of our company, I know that AlertFox uses the same approach for web monitoring.
The first thing to do is create one process per instance of the web browser. This is because the web browser is not managed code, it's COM, and there are cases where unmanaged exceptions cannot be handled in managed code and will certainly crash the application.
Better still, create a process host that spawns the multiple worker processes; you can use named pipes, sockets, or WCF to communicate between the processes if you need to. A rough sketch of the host side follows.
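A hedged sketch of the host, assuming a worker executable (MinerWorker.exe, hypothetical) that takes a URL argument and writes its result to stdout; stdout redirection stands in for named pipes/WCF here to keep the example short.

```csharp
using System;
using System.Diagnostics;

class MinerHost
{
    static string RunWorker(string url)
    {
        var psi = new ProcessStartInfo("MinerWorker.exe", "\"" + url + "\"")
        {
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
        using (var p = Process.Start(psi))
        {
            string result = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            // If the unmanaged browser code crashed, only this worker dies;
            // the host sees the exit code and can requeue the job.
            if (p.ExitCode != 0)
                throw new Exception("Worker failed on: " + url);
            return result;
        }
    }
}
```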
Best of all, create a small embedded SQL database and queue your jobs in it: each mining process fetches a new request from it and posts the result back, and the database can be used to synchronize everything (see the sketch below).
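A minimal sketch of that job queue, assuming SQL Server Compact and an assumed table Jobs(Id INT IDENTITY PRIMARY KEY, Url NVARCHAR(1000), Status INT, Result NTEXT); any embedded database works the same way, and a production version would need sturdier locking than this.

```csharp
using System.Data.SqlServerCe;

class Job { public int Id; public string Url; }

class JobQueue
{
    const string ConStr = "Data Source=jobs.sdf";

    // Claim the next pending job (Status = 0) and mark it in progress (1)
    // in one transaction, so two miners are unlikely to grab the same row.
    public Job FetchNext()
    {
        using (var con = new SqlCeConnection(ConStr))
        {
            con.Open();
            using (var tx = con.BeginTransaction())
            {
                var select = new SqlCeCommand(
                    "SELECT TOP(1) Id, Url FROM Jobs WHERE Status = 0", con, tx);
                Job job = null;
                using (var r = select.ExecuteReader())
                    if (r.Read())
                        job = new Job { Id = r.GetInt32(0), Url = r.GetString(1) };
                if (job != null)
                {
                    var claim = new SqlCeCommand(
                        "UPDATE Jobs SET Status = 1 WHERE Id = @id", con, tx);
                    claim.Parameters.AddWithValue("@id", job.Id);
                    claim.ExecuteNonQuery();
                }
                tx.Commit();
                return job;
            }
        }
    }

    // Post the scraped result back and mark the job done (2).
    public void Complete(int id, string result)
    {
        using (var con = new SqlCeConnection(ConStr))
        {
            con.Open();
            var cmd = new SqlCeCommand(
                "UPDATE Jobs SET Status = 2, Result = @r WHERE Id = @id", con);
            cmd.Parameters.AddWithValue("@r", result);
            cmd.Parameters.AddWithValue("@id", id);
            cmd.ExecuteNonQuery();
        }
    }
}
```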
I am mining a lot of pages with WatiN - 30+ at this moment, actually. Of course it takes a lot of resources, about 2.5 GB of RAM, but it would be almost impossible to do the same with WebRequest; I can't imagine doing such a thing in a reasonable time that way. With WatiN it takes a couple of hours.
I don't know if it helps you, but I am using the WebBrowser control to do that, with every instance in a separate process. But what I think is more important to you: I once tried to reduce the amount of memory used by doing all of it in a single process. It's possible to create separate AppDomains instead of processes and force them to use the same DLL (especially Microsoft.mshtml.dll) instead of loading the same DLL separately for each new AppDomain. I can't remember exactly how to do that now, but it's not hard to google. What I remember is that everything worked fine and the RAM usage decreased significantly, so I think it's worth trying.
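Since the exact steps are fuzzy in the answer above, here is one hedged reconstruction, not the poster's actual code: the MultiDomainHost loader optimization asks the CLR to load strong-named GAC assemblies (the mshtml interop assembly can be one of them) domain-neutrally, i.e. shared across AppDomains rather than copied into each. The Worker class and URLs are illustrative.

```csharp
using System;

class Worker : MarshalByRefObject
{
    public void Mine(string url)
    {
        // Drive the WebBrowser control here, on an STA thread as usual.
    }
}

class Program
{
    [LoaderOptimization(LoaderOptimization.MultiDomainHost)]
    [STAThread]
    static void Main()
    {
        for (int i = 0; i < 10; i++)
        {
            AppDomain domain = AppDomain.CreateDomain("miner-" + i);
            // Instantiate the worker inside the child domain and get a proxy.
            var worker = (Worker)domain.CreateInstanceAndUnwrap(
                typeof(Worker).Assembly.FullName, typeof(Worker).FullName);
            worker.Mine("http://example.com/page" + i);   // placeholder URL
        }
    }
}
```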
I had a project where I scraped on the order of 45 million requests (with form submissions) over an extended period. On a sustained basis, I was scraping with about 20 simultaneous clients, and my pipe was the bottleneck.
I used Selenium Remote Control after experimenting with writing my own WebClient, WatiN/Watir, and Microsoft's UI Automation API.
Selenium RC lets you choose your browser; I used Firefox. Setting up the initial scraping scripts took about an hour of experimentation and tuning. Selenium was vastly faster than writing my own code and a lot more robust with little investment. Great tool.
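For reference, a minimal Selenium RC session from C#, assuming a selenium-server instance listening on localhost:4444 (started with java -jar selenium-server.jar); the page path and the "q"/"submit" locators are placeholders for whatever form you are driving.

```csharp
using Selenium;

class Scraper
{
    static void Main()
    {
        ISelenium selenium = new DefaultSelenium(
            "localhost", 4444, "*firefox", "http://example.com/");
        selenium.Start();
        selenium.Open("/search");
        selenium.Type("q", "term");          // fill the field located by "q"
        selenium.Click("submit");            // submit the form
        selenium.WaitForPageToLoad("30000"); // timeout in ms, passed as a string
        string html = selenium.GetHtmlSource();
        selenium.Stop();
    }
}
```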
To scale the process, I tried a few different approaches, but ultimately what worked best was sticking each SRC instance in its own stripped-down VM and then spawning as many of those as the workstation had RAM to support. An equivalent number of SRC instances running natively on the host, instead of in VMs, inevitably ground to a halt as I got up to 10+ instances. The VM approach required more overhead and setup time before a scraping run, but it would run strongly for days, uninterrupted.
Another consideration: tune your Firefox preferences down so no homepage loads, and turn off everything non-essential (spoofing checks, cookies if not required for your scrape, images, AdBlock and FlashBlock, etc.); a sample prefs file follows.
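For example, a user.js dropped into the scraping profile might look like this; these are standard Firefox pref names, but the exact set you disable depends on what your scrape needs.

```
// user.js for a stripped-down scraping profile
user_pref("browser.startup.homepage", "about:blank");
user_pref("browser.startup.page", 0);            // start blank, restore nothing
user_pref("permissions.default.image", 2);       // block image loading
user_pref("network.cookie.cookieBehavior", 2);   // block all cookies (if not needed)
```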