I am a newbie when it comes to information extraction. For the past several days, I have read a lot of academic papers and ordered a book on NLP. I want to figure out how I can build a FlipDog.com-like system (hopefully not from scratch). FlipDog extracts job openings from more than 60,000 company web sites. How do I get started?
I am open to learning any programming language. Has anybody used Mallet, GATE, MinorThird, or RoadRunner? Ideally, I want to train a system on a data set specific to my domain and have it extract information based on that training. Which platform would you recommend for this purpose?
Thanks!