I am trying to parse the text of a PDF page into sentences, but it is much more difficult than I had anticipated. There are a whole lot of special cases to consider, such as initials, decimals, and quotations, which contain periods but do not necessarily end the sentence.
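For example, a naive split on ". " breaks immediately (a quick sketch just to show the problem; the sample text is made up):

    // Naive approach for illustration: split on ". " and watch it break.
    #include <iostream>
    #include <string>

    int main() {
        std::string text = "Dr. Smith paid $3.50 for it. \"Really?\" she asked.";
        std::size_t start = 0, pos;
        while ((pos = text.find(". ", start)) != std::string::npos) {
            std::cout << '[' << text.substr(start, pos - start + 1) << "]\n";
            start = pos + 2;
        }
        std::cout << '[' << text.substr(start) << "]\n";
        // Wrongly emits "[Dr.]" as its own sentence -- exactly the problem.
    }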

Is anyone here familiar with an NLP library for C or C++ that could help me with this task, or can you offer any advice?

Thank you for any help.

+6  A: 

This is a problem called sentence boundary disambiguation. The Wikipedia page for it lists a few libraries, but I'm not sure if any of them are easily callable from C.

You can find many papers on the theory of sentence boundary disambiguation. The Unicode Standard also defines a simple sentence boundary detection algorithm in Unicode Standard Annex #29, Unicode Text Segmentation.
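For what it's worth, ICU implements the UAX #29 sentence rules and is callable from C++. A minimal sketch (note that the default rules are deliberately simple and will still split after abbreviations like "Dr."):

    // Sentence segmentation with ICU's BreakIterator (UAX #29 rules).
    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <unicode/locid.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString text("Dr. Smith went home. He slept well.");
        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createSentenceInstance(icu::Locale::getUS(), status));
        if (U_FAILURE(status)) return 1;
        it->setText(text);
        int32_t start = it->first();
        for (int32_t end = it->next(); end != icu::BreakIterator::DONE;
             start = end, end = it->next()) {
            icu::UnicodeString sentence;
            text.extractBetween(start, end, sentence);
            std::string utf8;
            sentence.toUTF8String(utf8);
            std::cout << '[' << utf8 << "]\n";  // prints each detected sentence
        }
        return 0;
    }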

Avi
+2  A: 

This is a natural language parsing problem rather than a computer language one, and as such there is never going to be an easy answer. However, there may be heuristics you can apply, and we could recommend some if we knew why you are splitting PDFs into sentences and what you want to do with the sentences once you have them.

anon
I am splitting PDFs into sentences for the purposes of 'reflow'. A new tagged PDF will be created from all of the sentences I strip, which will allow for easier manipulation later.
outsyncof
Shouldn't your question then be "How do I convert a PDF to support reflow?" or something similar?
anon
Reflow is not a solved problem, so I am trying to break it up into pieces, the first of which is getting properly formatted sentences.
outsyncof
+3  A: 

Sentence boundary disambiguation (SBD) is a central problem in the field of NLP. Unfortunately, the tools I've found and used in the past aren't in C (it's not the favourite language for string-based tasks unless speed is a major issue).

Pipeline

If at all possible I'd create a simple pipeline - if you're on a Unix system this shouldn't be a problem, and even on Windows you should be able to fill in the gaps with a scripting language. This means the SBD stage can be the best tool for the job, not merely the only SBD you could find for language Z. For example,

./pdfconvert | SBD | my_C_tool > ...

This is the standard way we do things in my work, and unless you have stricter requirements than you've stated, it should be fine.
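To make that concrete, the my_C_tool stage could be as small as a filter that reads one sentence per line (assuming that's the format your SBD stage emits; my_C_tool here is just the hypothetical tool from the pipeline above):

    // Hypothetical my_C_tool: reads one sentence per line from the SBD
    // stage and wraps each in tags for the later reflow step.
    #include <iostream>
    #include <string>

    int main() {
        std::string sentence;
        while (std::getline(std::cin, sentence)) {
            std::cout << "<s>" << sentence << "</s>\n";
        }
        return 0;
    }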

Tools

In regards to the tools you can use,

  • I'd suggest MXTERMINATOR, an SBD tool using maximum entropy modelling, as my supervisors used it in their own work recently. According to them it missed a few sentence splits, but those were easily fixed by a sed script. They were doing SBD on astronomical papers. The main site appears to be down at the moment, but there is an FTP mirror available here.
  • OpenNLP has a reimplementation of the above algorithm using maximum entropy modelling in Java (JavaDoc), and it is more up to date, with a seemingly stronger community behind it.
  • Sentrick and many others also exist. For more, there is an older list here that may be of use.

Models and Training

Now, some of these tools may give you good results out of the box, but some may not. OpenNLP includes a model for English sentence detection out of the box, which may work for you. However, if your domain is significantly different from the one a tool was trained on, it may not perform well: a tool trained on newspaper text may be very good at that task but terrible at letters.

As such, you may want to train the SBD tool by giving it examples. Each of the tools should document this process, but I will warn you, it may be a bit of work: it requires running the tool on document X, going through and manually fixing any incorrect splits, and giving the correctly split document back to the tool to train on. Depending on the sizes of the documents and the tool involved, you may need to do this for anywhere from one to a hundred documents before you get a reasonable result.

Good luck, and if you've any questions feel free to ask.

Smerity