Hi all,
I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now the scripts are saved in SVN and are kicked off manually by sysadmins, devs, etc. Every time we need to retrieve data from a new source, we have to create a ticket with business instructions and goals. As you can imagine, this is not an optimal solution.
There are 3 consistent themes with this system:
- the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path
- we are only looking for very specific information, so we don't really have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions)
- crawls are URL-based instead of site-based.
As I enhance this alpha version into a more production-level beta, I am looking to add automation and management of the data retrieval. Additionally, our other systems are Java (which I'm more proficient in), and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help.
I've evaluated the usual suspects (Nutch, Droid, etc.), but the time it would take to modify those frameworks to suit our specific information retrieval can't be justified.
So I'd like your thoughts regarding the following architecture.
I want to create a solution which
- uses Java as the interface for managing and executing the Perl scripts
- uses Java for configuration and data access
- sticks with Perl for retrieval
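To make the first bullet concrete, here is a rough sketch of how I imagine the Java side kicking off a script. It is only an illustration under assumptions: perl is on the PATH, parameters are passed as `--key value` pairs, and the class name, method names, and log file are placeholders I made up, not anything we have today.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: none of these names exist in our codebase yet.
class PerlScriptRunner {

    /** Builds the argument list: perl <script> --key value ... */
    static List<String> buildCommand(Path script, Map<String, String> params) {
        List<String> cmd = new ArrayList<>();
        cmd.add("perl");                 // assumption: perl is on the PATH
        cmd.add(script.toString());
        for (Map.Entry<String, String> e : params.entrySet()) {
            cmd.add("--" + e.getKey());  // assumption: scripts accept --key value
            cmd.add(e.getValue());
        }
        return cmd;
    }

    /** Runs the command, sends stdout and stderr to logFile, returns the exit code. */
    static int run(List<String> cmd, Path logFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);          // merge stderr into stdout
        pb.redirectOutput(logFile.toFile());   // keep crawler output out of the JVM
        Process p = pb.start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();               // kill a crawler that ran too long
            throw new IOException("Script timed out: " + cmd);
        }
        return p.exitValue();
    }
}
```

Passing the arguments as a list rather than one concatenated string means no shell is involved, so URLs with odd characters should not need escaping. The exit code and the log file would be the only things the webapp learns about a run, which may or may not be enough.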
An example use case would be
- a data analyst delivers us a requirement for crawling
- a Perl developer creates the required script and uses this webapp to submit it (the script gets saved to the filesystem)
- the script gets kicked off from the webapp with specific parameters ....
The webapp should be able to start several instances of the Perl script in parallel, each from its own thread, so that multiple crawlers can run at once.
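For the parallel part, what I have in mind is roughly a fixed-size thread pool with one task per URL. The sketch below is an assumption about how that could look, not a finished design: `CrawlDispatcher` and `crawlOne` are hypothetical names, and `crawlOne` stands in for whatever actually starts one perl process for a URL and returns its exit code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical dispatcher: runs one crawl job per URL on a bounded pool.
class CrawlDispatcher {

    /** Returns url -> exit code; -1 marks a job that threw an exception. */
    static Map<String, Integer> crawlAll(List<String> urls, int maxConcurrent,
                                         Function<String, Integer> crawlOne) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Callable<Integer>> jobs = new ArrayList<>();
            for (String url : urls) {
                jobs.add(() -> crawlOne.apply(url));
            }
            // invokeAll blocks until every job finishes and returns the
            // futures in the same order as the submitted jobs.
            List<Future<Integer>> futures = pool.invokeAll(jobs);
            Map<String, Integer> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), -1); // the job itself failed
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while crawling", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size caps how many perl processes run at the same time, which matters more than the thread count itself since each thread would mostly just sit waiting on its child process.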
So my questions are:
- What do you think of this approach?
- How solid is the integration between Java and Perl, specifically when calling Perl from Java?
- Has anyone used a system like this, one that is partly a Perl script repository?
The goal really is to stop having a whole bunch of unorganized Perl scripts and to put some management and organization around our information retrieval. Also, I know I could use Perl to do the web part of what we want, but as I mentioned before, I'm trying to keep Perl focused on retrieval. If that seems backwards, I'm not averse to making it an all-Perl solution.
Open to any and all suggestions and opinions.
Thanks