views: 150
answers: 4

Hey folks, every once in a while I need to automate data collection from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is plenty of software and online services for that).

Anyway, as a follow-up to my previous question, I've written a little web crawler that can visit websites.

  • Basic crawler class to easily and quickly interact with one website.

  • Override "doAction(String URL, String content)" to process the content further (e.g. store it or parse it) — see the sketch after this list.

  • The design allows for multi-threaded crawling: all class instances share the lists of processed and queued links.

  • Instead of keeping track of processed links and queued links within the object, a JDBC connection could be established to store links in a database.

  • Currently limited to one website at a time; however, this could be expanded by adding an externalLinks stack and adding to it as appropriate.

  • JCrawler is intended for quickly generating XML sitemaps or parsing websites for the information you're after. It's lightweight.
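To make the intended usage concrete, here's a rough sketch of a subclass overriding doAction. Only the doAction signature matches the description above; the constructor and everything else are placeholders, the actual API is in JCrawler.java below.

```java
// Illustrative only -- the real class definitions are in the pastebins below.
// The constructor shown here is a placeholder for whatever JCrawler exposes.
public class SitemapCrawler extends JCrawler {

    public SitemapCrawler(String startUrl) {
        super(startUrl); // placeholder constructor
    }

    @Override
    public void doAction(String URL, String content) {
        // Called once per fetched page; store, parse, or emit the content here.
        System.out.println("<url><loc>" + URL + "</loc></url>");
    }
}
```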

Is this a good/decent way to write the crawler, given the limitations above? Any input would help immensely :)

http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java

+1  A: 

I have written a custom web crawler at my company, following steps similar to the ones you mention, and I found them to work well. The only addition I'd suggest is a polling frequency, so that it re-crawls after a certain period of time.

So it should follow the Observer design pattern: if a new update is found on a given URL after that period of time, it updates its output or writes to a file.
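A rough sketch of that idea — all names here are illustrative, not taken from the crawler under review:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this only illustrates the Observer/polling idea.
interface PageObserver {
    void onPageChanged(String url, String newContent);
}

class PollingCrawler {
    private final List<PageObserver> observers = new ArrayList<>();
    private final long pollIntervalMillis;

    PollingCrawler(long pollIntervalMillis) {
        this.pollIntervalMillis = pollIntervalMillis;
    }

    void addObserver(PageObserver o) {
        observers.add(o);
    }

    // Re-crawl the URL at a fixed interval and notify observers
    // whenever the content differs from the previous fetch.
    void watch(String url) throws InterruptedException {
        String lastContent = null;
        while (true) {
            String content = fetch(url);
            if (lastContent != null && !content.equals(lastContent)) {
                for (PageObserver o : observers) {
                    o.onPageChanged(url, content);
                }
            }
            lastContent = content;
            Thread.sleep(pollIntervalMillis);
        }
    }

    private String fetch(String url) {
        // Placeholder: a real crawler would reuse its HTTP/HTMLUtils code here.
        return "";
    }
}
```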

Shashank T
Thank you for your answer. What is polling? The spider itself is not intended to run continuously, although I suppose I could make it do that with a few changes, in which case I absolutely agree with the Observer design pattern. In fact, I would probably implement the update logic in doAction.
Jan Kuboschek
Polling is simply a time period for crawling. Say it's 5 minutes: then every 5 minutes the crawler visits a particular URL again.
Shashank T
+4  A: 

Your crawler does not seem to respect robots.txt in any way, and it uses a fake User-Agent string to pass itself off as a web browser. This may lead to legal trouble in the future. Keep that in mind.
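For example, something along these lines — a simplified sketch, not a full robots.txt parser (it only honours Disallow lines and ignores per-agent sections and wildcards):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsAwareFetcher {

    // Identify the crawler honestly instead of faking a browser.
    private static final String USER_AGENT = "JCrawler/1.0 (+http://example.org/jcrawler)";

    // Fetch /robots.txt and collect all Disallow path prefixes.
    public static List<String> fetchDisallowed(String host) throws Exception {
        List<String> disallowed = new ArrayList<>();
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://" + host + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        }
        return disallowed;
    }

    // Skip any URL whose path starts with a disallowed prefix.
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```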

BalusC
"In the future" being the key phrase here. Disobeying a robots.txt file has never been upheld as illegal in court. There's precious little precedent, but the Wayback Machine was involved in an action in 2007 that might be of interest: http://www.theregister.co.uk/2007/07/26/wayback_firm_suit/
jasonmp85
Even so, most sites would flag such a crawler as suspicious activity and may block/ban it from accessing the site. I strongly recommend respecting robots.txt and using a sensible user-agent string such as `JCrawler/1.0 http://jcrawler.org`, and *have* a site where you expose all the details about the crawler and what users/site managers can expect from it.
BalusC
Further to @BalusC's comment, having a dedicated agent means you won't be masquerading as IE6 (of all things), and convincing poor unknowing admins that they need to keep supporting it.
kibibu
Hiding who you are on the Web is worse than using fake ID to get into places you shouldn't get into. If you get blocked because of the user agent, it means **you're not a wanted visitor**, suck it up and move elsewhere.
Esko
Duly noted, and I appreciate all the feedback so far. Has anybody looked at the style? Anything I can/should improve there?
Jan Kuboschek
A: 

I would recommend the open-source JSpider as a starting point for your crawler project. It covers all the major concerns of a web crawler, including robots.txt, and has a plug-in scheme you can use to apply your own tasks to each page it visits.

This is a brief and slightly dated review of JSpider. The pages around this one review several other Java spidering applications.

http://www.mksearch.mkdoc.org/research/spiders/j-spider/

codestyle
A: 

crawler4j is a simple Java crawler that can be configured in a few minutes and supports your requirements.

Yasser
I wasn't asking for alternatives; I was hoping to get feedback on my code.
Jan Kuboschek