Nine years ago, when I started parsing HTML and free text with Perl, I read the classic Data Munging with Perl. Does anyone know whether David is planning to update the book, or whether there are similar books or web pages that explain the newer parsing modules such as XML-Twig, Regexp-Grammars, etc.?
I assume that over the last nine years some modules have remained as good as they were, some have been kept up to date and gained interesting new methods, and some now have better replacements. For example, is Parse-RecDescent still the only option for free text parsing, or will the Perl 6-influenced Regexp-Grammars replace it in many scenarios?
I have gone four years without doing any active HTML, XML or free text data mining with Perl, so my toolkit in this area is probably a bit outdated. Any feedback on HTML and DOM manipulation, link extraction/verification, web testing with tools like Mechanize, XML manipulation and free text parsing, from people who are up to date with the current CPAN modules in this area, would therefore be more than welcome.
Some new additions to my toolkit:
Still in my toolkit:
- HTML-TableExtract # not updated since 2006
- WWW-Mechanize
- Parse-RecDescent
- HTML-TokeParser
- URI-Escape
- [more...]