Hi all,

I am looking to develop a management and administration solution around our webcrawling perl scripts. Right now our scripts are saved in SVN and are manually kicked off by SysAdmins/devs etc. Every time we need to retrieve data from new sources, we have to create a ticket with business instructions and goals. As you can imagine, not an optimal solution.

There are 3 consistent themes with this system:

  1. the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path
  2. we are only looking for very specific information, so we don't really have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions)
  3. crawls are url-based instead of site-based.

As I enhance this alpha version to a more production-level beta, I am looking to add automation and management of the retrieval of data. Additionally, our other systems are Java (which I'm more proficient in) and I'd like to compartmentalize the perl aspects so we don't have to lean heavily on outside help.

I've evaluated the usual suspects (Nutch, Droid, etc.) but the time spent on modifying those frameworks to suit our specific information retrieval can't be justified.

So I'd like your thoughts regarding the following architecture.

I want to create a solution which:

  • uses Java as the interface for managing and executing the perl scripts
  • uses Java for configuration and data access
  • sticks with perl for retrieval

An example use case would be

  1. a data analyst delivers a requirement for crawling
  2. a perl developer creates the required script and uses this webapp to submit it (the script gets saved to the filesystem)
  3. the script gets kicked off from the webapp with specific parameters ....

The webapp should be able to run multiple concurrent instances of the perl scripts to initiate multiple crawlers.
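That fan-out could look something like this on the Java side, assuming (as suggested elsewhere in this thread) that each crawler is simply a Perl child process; the class and method names here are just a sketch, not an existing library:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlerPool {
    private final ExecutorService pool;

    public CrawlerPool(int maxConcurrentCrawls) {
        // Bounded pool: at most this many crawls run at once.
        this.pool = Executors.newFixedThreadPool(maxConcurrentCrawls);
    }

    /** Launch one crawler as a child process; the Future yields its exit code. */
    public Future<Integer> submitCrawl(List<String> command) {
        return pool.submit(() -> {
            Process p = new ProcessBuilder(command)
                    .inheritIO()      // or redirect output to a per-crawl log file
                    .start();
            return p.waitFor();       // blocks this pool thread until the script exits
        });
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Each pool thread just babysits one child process, so "multiple threads of the perl script" really means multiple perl processes, e.g. `crawlers.submitCrawl(List.of("perl", "fetch.pl", "--url", url))` (script name and flag are hypothetical). A crashing crawl then can't take the webapp down with it.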

So my questions are:

  1. what do you think?
  2. how solid is the integration between Java and Perl, specifically calling Perl from Java?
  3. has anyone used such a system which is, in effect, part perl repository?

The goal really is to not have a whole bunch of unorganized perl scripts, and to put some management and organization on our information retrieval. Also, I know I can use perl to do the web part of what we want - but as I mentioned before - I'm trying to keep perl focused on retrieval. That said, if this seems ass-backwards, I'm not averse to making it an all-perl solution.

Open to any and all suggestions and opinions.

Thanks

+1  A: 

how solid is the integration between Java and Perl, specifically calling Perl from Java?

IMO, the best way to call Perl from Java is to have Java launch Perl programs in separate processes. You could try calling Perl directly from Java using JNI / JNA, but it is hard to get right. And if you get it wrong you'll be dealing with crashed JVMs.
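A rough sketch of that separate-process approach: a small helper that launches perl, captures its output, and surfaces a non-zero exit code as an error. The helper names are made up for illustration; `runCommand` is kept generic so the same code works for any child process:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

public class PerlLauncher {
    /** Run any command and return its stdout; throws if it exits non-zero. */
    public static String runCommand(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);   // fold stderr into stdout for simplicity
        Process p = pb.start();
        String output;
        // Read the output BEFORE waitFor(), or a chatty script can deadlock
        // on a full pipe buffer.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            output = r.lines().collect(Collectors.joining("\n"));
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("process exited with " + exit + ": " + output);
        }
        return output;
    }

    /** Convenience wrapper: prepend the perl interpreter. */
    public static String runPerl(String... perlArgs)
            throws IOException, InterruptedException {
        String[] command = new String[perlArgs.length + 1];
        command[0] = "perl";
        System.arraycopy(perlArgs, 0, command, 1, perlArgs.length);
        return runCommand(command);
    }
}
```

Since the JVM and the perl interpreter never share an address space, a misbehaving script can at worst fail its own process, which is exactly the isolation the answer above is recommending.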

Open to any and all suggestions and opinions.

IMO you'll get a more maintainable solution if you go pure Perl or pure Java. If that means you have to learn Perl, then so be it. (It is possible to write well-structured, maintainable apps in Perl. You just need to be disciplined about it.)

Stephen C
Thanks Stephen and Esko. Instinctively, I'm getting the same feeling. I don't feel too comfortable with the mix-and-match myself. I'll let you know what we eventually decide on.
Bigtwinz
+1  A: 

I've had my fair share of creating crawlers in Java using Lucene, and in fact I've answered a related question before about the actual creation process and structure of a web crawler here. That isn't directly applicable to your question, but I do think it's worth mentioning.

Anyway, I have to agree with Stephen C: you're better off with a pure Java or pure perl solution instead of a mix of both. My opinion is based on the fact that the two are completely different from each other, and hammering two (or more) different mindsets together isn't usually the optimal thing to do.

What you described also got me thinking about improving my own crawler (the one I reference in the answer linked in the first paragraph), mainly the part about the actual crawling pattern. While I do believe it will take significantly more time to develop a way to manually instruct a Java application to crawl some URL in a specific pattern than it would in perl, doing it in Java would eventually lead to a much more usable piece of software, with all sorts of interesting small features, which wouldn't be a pain to maintain.

On the other hand, the scripting side of Java is a bit meh: there is a scripting API, but since scripting is about loosely defining what you want to do, and Java can be annoyingly strict at times, it's not as flexible as one would hope.

To really give an opinion: I think you should minimize the part written in whichever language is harder for you to maintain. I don't know which one that is for you, but I'd assume perl. Basically, commit to one of the languages and use it to its full extent; don't use the other language as a shortcut.

Esko
A: 

You can try webcrawling with HtmlUnit or Selenium and do scheduling with Quartz, or put the whole project in an application server like Glassfish. If you would like to stick with Perl, you could probably use crontab. The Perl APIs that can be used for webcrawling may not have proper cookie handling; I hope that is not a problem for you. The only hack I know for that is calling wget.
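If Quartz or a full application server feels heavy for this, the JDK's own `ScheduledExecutorService` can cover simple recurring crawls; a minimal sketch (the class name and period are placeholders, not part of any framework):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class CrawlScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(1);

    /** Re-run crawlTask every 'period' units after an initial delay. */
    public ScheduledFuture<?> scheduleRepeating(Runnable crawlTask,
                                                long initialDelay,
                                                long period,
                                                TimeUnit unit) {
        // scheduleAtFixedRate skips overlap: if a crawl overruns its period,
        // the next run starts as soon as the previous one finishes.
        return scheduler.scheduleAtFixedRate(crawlTask, initialDelay, period, unit);
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}
```

The `Runnable` passed in would typically launch the perl script as a child process; cancelling the returned `ScheduledFuture` stops future runs, which is roughly what crontab gives you on the Perl side.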

Navi