We extract various information from e-mails - flights, car rentals, hotels and more. the method is to extract the body of the mail, usually in HTML form but sometime it's text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get information, which is provided in a tabular form (you can think of a flight table, hotel table, etc.). Notice, even though we parse HTML, this is not web scraping.
Currently we are using QL2's WebQL engine, but we are looking to replace it from business reasons. Can you recommend on another engine? It must run on Linux and be accessible from Java (a Java API would be the the best, but Web services are good solution as well). It also must support regular expressions for text extraction and not just to be based on the HTML structure.