We extract various kinds of information from e-mails: flights, car rentals, hotels and more. The method is to extract the body of the mail, usually in HTML form, though sometimes it is plain text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) to get the information, which is delivered in tabular form (think of a flight table, a hotel table, etc.). Note that even though we parse HTML, this is not web scraping.
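To make the "regular expressions in several steps" idea concrete, here is a minimal Java sketch. It is only an illustration: the mail layout, the field labels (`Flight:`, `From:`, `To:`, `Date:`) and the class name `FlightExtractor` are assumptions invented for the example, not our real patterns. The first step isolates the itinerary block, and the second step pulls the individual fields out of that block into one table row.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-step extractor; the mail layout below is an assumed example.
class FlightExtractor {
    // Step 1: isolate the itinerary block, from "Flight:" up to the date value.
    private static final Pattern BLOCK =
        Pattern.compile("Flight:.*?Date:\\s*\\S+", Pattern.DOTALL);

    // Step 2: field-level patterns, applied only inside the isolated block.
    private static final Pattern FLIGHT =
        Pattern.compile("Flight:\\s*([A-Z]{2})\\s?(\\d{1,4})");
    private static final Pattern ROUTE =
        Pattern.compile("From:\\s*([A-Z]{3})\\s+To:\\s*([A-Z]{3})");
    private static final Pattern DATE =
        Pattern.compile("Date:\\s*(\\d{4}-\\d{2}-\\d{2})");

    // Returns one "row" of the flight table, or an empty map if no block is found.
    static Map<String, String> extract(String body) {
        Map<String, String> row = new LinkedHashMap<>();
        Matcher block = BLOCK.matcher(body);
        if (!block.find()) {
            return row;
        }
        String segment = block.group();

        Matcher m = FLIGHT.matcher(segment);
        if (m.find()) {
            row.put("flight", m.group(1) + m.group(2));
        }
        m = ROUTE.matcher(segment);
        if (m.find()) {
            row.put("from", m.group(1));
            row.put("to", m.group(2));
        }
        m = DATE.matcher(segment);
        if (m.find()) {
            row.put("date", m.group(1));
        }
        return row;
    }

    public static void main(String[] args) {
        String body = "Your booking is confirmed.\n"
            + "Flight: LH 123\nFrom: TLV To: FRA\nDate: 2010-03-14\nThank you.";
        System.out.println(extract(body));
    }
}
```

Restricting the field patterns to the isolated block keeps them from matching look-alike text elsewhere in the mail, such as a date in the footer.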

Currently we are using QL2's WebQL engine, but we are looking to replace it for business reasons. Can you recommend another engine? It must run on Linux and be accessible from Java (a Java API would be best, but Web services are a good solution as well). It must also support regular expressions for text extraction, and not rely only on the HTML structure.

+3  A: 

I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:

In addition, R provides many tools for parsing HTML or XML. Have a look at this question for an example using the RCurl and XML packages.

Edit: You can integrate R with Java with JRI. It's a very widely used package, with many examples. You can also see these related questions.

Shane
How do I integrate it with my Java app?
David Rabinowitz
Updated to address your question about Java.
Shane
+2  A: 

Have a look at:

  • LingPipe - LingPipe is a suite of Java libraries for the linguistic analysis of human language.
  • Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
ssn
LingPipe looks very interesting. Lucene looks like a bigger (and more complicated) hammer than we need, but thanks.
David Rabinowitz
A: 

Just wanted to update: our final decision was to implement the parsing in Groovy, and to add the required functionality (HTML to text, PDF to text, whitespace cleanup, etc.) either by implementing it in Java or by relying on 3rd-party libraries.
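As a rough idea of what the Java-side helpers for "HTML to text" and "clean whitespace" could look like, here is a small sketch. It is not our actual code: the class name `MailText` and both method names are made up for illustration, and stripping tags with regular expressions is a deliberate simplification (only a handful of entities are decoded). A real HTML parser library would be more robust for messy mails.

```java
// Hypothetical helpers; a crude regex-based approach, assumed for illustration only.
class MailText {
    // Converts an HTML mail body to plain text suitable for regex extraction.
    static String htmlToText(String html) {
        // Drop script/style elements together with their content.
        String s = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Turn line-level tags into newlines so rows stay on separate lines.
        s = s.replaceAll("(?i)<br\\s*/?>|</(p|div|tr|li|h[1-6])>", "\n");
        // Replace every remaining tag with a space.
        s = s.replaceAll("<[^>]+>", " ");
        // Decode a few common entities; "&amp;" goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&");
        return cleanWhitespace(s);
    }

    // Collapses runs of spaces/tabs, trims around newlines, removes blank lines.
    static String cleanWhitespace(String text) {
        String s = text.replace('\u00A0', ' ');       // non-breaking space
        s = s.replaceAll("[ \\t\\x0B\\f\\r]+", " ");  // horizontal whitespace and CR
        s = s.replaceAll(" ?\\n ?", "\n");            // spaces next to line breaks
        s = s.replaceAll("\\n{2,}", "\n");            // blank lines
        return s.trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Flight:&nbsp;LH 123</p>"
            + "<table><tr><td>From</td><td>TLV</td></tr></table></body></html>";
        System.out.println(htmlToText(html));
    }
}
```

Normalizing the text this way before matching means the extraction patterns do not have to account for tags, entities, or stray whitespace.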

David Rabinowitz
A: 

I use a custom parser made with GNU Flex and C++ for similar purposes. I'd suggest you take a look at parser generators in Java (JavaCC .jj files; see the javacc-faq). Nutch does it this way (NutchAnalysis.jj).

piotr
Thanks for the link. As I'm parsing e-mails, I have no fixed grammar to generate a parser from. All the parsers are written by hand.
David Rabinowitz