views:

68

answers:

1

I am a newbie when it comes to information extraction. For the past several days, I have read a lot of academic papers and ordered a book on NLP. I want to figure out how I can build a FlipDog.com like system (hopefully not from scratch). They extract job openings from more than 60,000 company web sites. How do I get started?

I am open to learning any programming language. Has anybody used Mallet/GATE/MinorThird or RoadRunner? Ideally, I want to be able to train a system with the data set particular to my domain and have it extract information based on that. Which platform would you recommend for this purpose?

Thanks!

+1  A: 

The faster way to extract job offerings is to use dapper.net (a web scraping service from websites). You can very easily to teach dapper to extract data using visual editor. It works very well when on your target websites you have tables.

To learn Information Extraction, I suggest to start from lingpipe. It is a java framework for Information Extraction, so you do not need to learn architectural specific features of the framework, such as Gate or Apache UIMA. On lingpipe website you will find a lot of tutorials which will help you to learn various Information Extraction approaches. After that I suggest to learn Gate and UIMA.

If you want to realize such a website, you also need to learn how to use web crawler frameworks (e.g., nutch), web search engines (yahoo, google, bing), and Information Retrieval engines (such as, apache lucene) to provide a search service on the top of extracted data.

Skarab
Thanks so much! I will start looking at dapper and lingpipe
smitten11