Hi! I'm fairly new to programming and am working on a web crawler for my dissertation. I was provided with a web crawler, but I found it to be too slow since it is single-threaded: it took 30 minutes to crawl 1000 web pages. I tried creating multiple threads, and with 20 threads running simultaneously the same 1000 pages took only 2 minutes. But now I'm encountering "Heap Out of Memory" errors. I'm sure what I did was wrong, which was to create the 20 threads in a for loop. What would be the right way to multi-thread the Java crawler without getting these errors? And is multi-threading even the right solution to my problem?

+3  A: 

My first suggestion is that you increase the heap size for the JVM:

http://www.informix-zone.com/node/46
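
For reference, the maximum heap size is set with the JVM's -Xmx flag when launching the program; the class name below is only a placeholder for whatever your crawler's main class actually is:

    java -Xmx1024m -Xms256m MyCrawler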

Alex Black
A: 

Regarding the speed of your program:

If your web crawler obeys the robots.txt file on servers (which it should, to avoid being banned by the site admins), then there may be little that can be done.

You should profile your program, but I expect most of the time is spent downloading HTML pages, and site admins will usually not be happy if you download so fast that you drain their bandwidth.

In summary, downloading a whole site without hurting that site will take a while.
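
As a minimal sketch of what "not hurting the site" can look like in code (the one-second delay is an arbitrary example, not something from the original crawler; a real crawler would honour a Crawl-delay directive from robots.txt when one is given):

    import java.util.concurrent.TimeUnit;

    // Sketch: enforce a minimum delay between successive requests to a host.
    class PoliteFetcher {
        private static final long DELAY_MS = 1000; // arbitrary example value
        private long lastFetchTime = 0;

        synchronized void waitForTurn() throws InterruptedException {
            long elapsed = System.currentTimeMillis() - lastFetchTime;
            if (elapsed < DELAY_MS) {
                TimeUnit.MILLISECONDS.sleep(DELAY_MS - elapsed);
            }
            lastFetchTime = System.currentTimeMillis();
        }
    }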

daveb
Hi daveb. It does obey the robots.txt file, and multi-threading did more or less solve the speed problem. Right now I just want to find the most efficient way to write a multi-threaded program that avoids these errors. There must be one, or else what would be the point of threads?
Tobias
+1  A: 

The simple answer (see above) is to increase the JVM memory size. This will help, but it is likely that the real problem is that your web crawling algorithm is creating an in-memory data structure that grows in proportion to the number of pages you visit. If that is the case, the solution may be to move that data structure's contents to disk, e.g. into a database.

The most appropriate solution to your problem depends on how your web crawler works, what it is collecting, and how many pages you need to crawl.
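
To illustrate the usual way of tackling both issues at once, here is a rough sketch, not your crawler's actual code (fetchAndParse and the queue capacity of 10,000 are placeholders), that uses a fixed thread pool instead of a hand-rolled for loop of threads, and a bounded queue so the list of pending URLs cannot grow without limit:

    import java.util.concurrent.*;

    // Sketch: a fixed pool of workers pulling URLs from a bounded queue, so
    // neither the thread count nor the in-memory frontier grows unbounded.
    class BoundedCrawler {
        private static final int THREADS = 20;  // same count the question uses
        private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>(10_000);
        private final ExecutorService workers = Executors.newFixedThreadPool(THREADS);

        void start() {
            for (int i = 0; i < THREADS; i++) {
                workers.submit(() -> {
                    try {
                        while (!Thread.currentThread().isInterrupted()) {
                            String url = frontier.take();  // blocks while the queue is empty
                            fetchAndParse(url);            // placeholder for the real work
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }

        void enqueue(String url) throws InterruptedException {
            frontier.put(url);                             // blocks when the queue is full
        }

        private void fetchAndParse(String url) {
            // download the page, extract links, and call enqueue(...) for new URLs
        }

        void shutdown() {
            workers.shutdownNow();
        }
    }

With blocking puts, producers slow down when the queue fills instead of exhausting the heap; if you also need to keep the set of visited URLs or the page contents, that is where moving the data to a database comes in.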

Stephen C