The following unstructured text has three distinct themes -- Stallone, Philadelphia, and the American Revolution. Which algorithm or technique would you use to separate this content into distinct paragraphs?

Classifiers won't work in this situation. I also tried using Jaccard similarity to measure the distance between successive sentences, grouping successive sentences into one paragraph whenever the distance between them was below a given threshold. Is there a better method?
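
A simplified sketch of what I mean (the whitespace tokenization and the 0.8 threshold here are stand-ins, not my exact code):

    def jaccard_distance(s1, s2):
        # 1 - |intersection| / |union| over the lowercased word sets.
        a, b = set(s1.lower().split()), set(s2.lower().split())
        if not (a or b):
            return 0.0
        return 1.0 - len(a & b) / len(a | b)

    def group_sentences(sentences, threshold=0.8):
        # Assumes at least one sentence.  Append each sentence to the
        # current paragraph while the distance to the previous sentence
        # stays below the (arbitrary) threshold; otherwise start a new one.
        paragraphs = [[sentences[0]]]
        for prev, curr in zip(sentences, sentences[1:]):
            if jaccard_distance(prev, curr) < threshold:
                paragraphs[-1].append(curr)
            else:
                paragraphs.append([curr])
        return [" ".join(p) for p in paragraphs]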

This is my text sample:

Sylvester Gardenzio Stallone , nicknamed Sly Stallone, is an American actor, filmmaker and screenwriter. Stallone is known for his machismo and Hollywood action roles. Stallone's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum. Stallone's use of the front entrance to the Philadelphia Museum of Art in the Rocky series led the area to be nicknamed the Rocky Steps.A commercial, educational, and cultural center, Philadelphia was once the second-largest city in the British Empire (after London), and the social and geographical center of the original 13 American colonies. It was a centerpiece of early American history, host to many of the ideas and actions that gave birth to the American Revolution and independence.The American Revolution was the political upheaval during the last half of the 18th century in which thirteen colonies in North America joined together to break free from the British Empire, combining to become the United States of America. They first rejected the authority of the Parliament of Great Britain to govern them from overseas without representation, and then expelled all royal officials. By 1774 each colony had established a Provincial Congress, or an equivalent governmental institution, to form individual self-governing states.

A: 

I don't know much about this, so this answer is a stub for a better one. Nonetheless, two points:

  1. One name for this problem is topic identification, and http://research.microsoft.com/en-us/um/people/cyl/download/papers/thesis97.pdf is a frequently cited paper in this area.
  2. This is probably very hard. I wouldn't have separated Philadelphia from the American Revolution if you hadn't told me.
John the Statistician
+2  A: 

So I've worked in NLP for a long time, and this is a really tough problem you're trying to tackle. You'll never be able to implement a solution with 100% accuracy, so you should decide up front whether it's better to make false-negative decisions (failing to find a paragraph segmentation point) or false-positive ones (inserting spurious segmentation points). Once you've decided that, assemble a corpus of documents and annotate the true segmentation points you expect to find.

Next, you'll need a mechanism for finding EOS (end-of-sentence) points. Then, between every pair of sentences, you'll need to make a binary decision: should a paragraph boundary be inserted here?
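
Just to make the pipeline concrete, a crude EOS detector might look like this (a real system would use a trained sentence tokenizer; this regex is only a stand-in):

    import re

    def split_sentences(text):
        # Naive EOS detection: split after '.', '!' or '?' when followed
        # by whitespace and a capital letter.  A real system also has to
        # handle abbreviations, quotations, decimals, etc.
        return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())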

You could measure the cohesion of concepts within each paragraph under different segmentation points. For example, a document with five sentences (ABCDE) can be segmented in sixteen different ways (2^(n-1) for n sentences):

ABCDE   ABCD|E   ABC|DE   ABC|D|E   AB|CDE   AB|CD|E   AB|C|DE   AB|C|D|E
A|BCDE  A|BCD|E  A|BC|DE  A|BC|D|E  A|B|CDE  A|B|CD|E  A|B|C|DE  A|B|C|D|E

To measure cohesion, you could use a sentence-to-sentence similarity metric (based on some collection of features extracted for each sentence). For the sake of simplicity, if two adjacent sentences have a similarity metric of 0.95, then there's a 0.05 "cost" for combining them into the same paragraph. The total cost of a document segmentation plan is the aggregate of all the sentence-joining costs. To arrive at the final segmentation, you choose the plan with the least expensive aggregate cost.
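
Here's a brute-force sketch of that idea, with two caveats: the similarity function is left as a placeholder parameter, and I've added a small fixed penalty per boundary as an assumption of my own, since with joining costs alone the cheapest plan would always be to split after every sentence:

    from itertools import combinations

    def plan_cost(sentences, boundaries, similarity, penalty=0.3):
        # Each adjacent pair kept in the same paragraph costs
        # (1 - similarity); each boundary adds an assumed fixed penalty
        # so that splitting everywhere isn't free.
        cost = len(boundaries) * penalty
        for i in range(len(sentences) - 1):
            if i not in boundaries:
                cost += 1.0 - similarity(sentences[i], sentences[i + 1])
        return cost

    def best_plan_brute_force(sentences, similarity):
        # Evaluate all 2^(n-1) subsets of the n-1 gap positions.
        gaps = range(len(sentences) - 1)
        plans = (set(c) for r in range(len(sentences))
                 for c in combinations(gaps, r))
        return min(plans, key=lambda b: plan_cost(sentences, b, similarity))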

Of course, for a document with more than a few sentences, there are far too many possible segmentations to brute-force evaluate all of their costs. So you'll need some way to prune the search. Dynamic programming could be helpful here.
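
For instance, the additive cost model above decomposes over paragraphs, so the cheapest plan can be found exactly in O(n^2) with dynamic programming instead of enumeration (same assumed per-boundary penalty as in the brute-force sketch, similarity still a placeholder):

    def best_plan_dp(sentences, similarity, penalty=0.3):
        # best[i] = minimum cost of segmenting the first i sentences,
        # where a paragraph covering sentences[j:i] costs one penalty
        # plus the joining costs of its internal gaps.
        n = len(sentences)
        join = [1.0 - similarity(sentences[k], sentences[k + 1])
                for k in range(n - 1)]
        best = [0.0] + [float('inf')] * n
        back = [0] * (n + 1)
        for i in range(1, n + 1):
            para_cost = penalty                      # paragraph sentences[i-1:i]
            for j in range(i - 1, -1, -1):           # try paragraph sentences[j:i]
                if best[j] + para_cost < best[i]:
                    best[i], back[i] = best[j] + para_cost, j
                if j > 0:
                    para_cost += join[j - 1]         # grow the paragraph leftward
        # Walk the backpointers to recover the boundary positions.
        bounds, i = [], n
        while i > 0:
            bounds.append(back[i])
            i = back[i]
        return sorted(b for b in bounds if b > 0)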

As for the actual sentence feature extraction... well, that's where it gets really complicated.

You probably want to ignore highly syntactic words (connective words like prepositions, conjunctions, helping verbs, and clause markers) and base your similarity around more semantically relevant words (nouns and verbs, and to a lesser extent, adjectives and adverbs).

A naive implementation might just count up the number of instances of each word and compare the word counts in one sentence with the word counts in an adjacent sentence. If an important word (like "Philadelphia") appears in two adjacent sentences, then they might get a high similarity score.
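
Here's a sketch of that naive approach, with the function-word filtering from above folded in (the stopword list is a tiny placeholder, and cosine similarity is just one common way to turn raw counts into a score):

    import math, re
    from collections import Counter

    # Tiny placeholder list; a real system would use a full stopword
    # list (or part-of-speech tags) to drop highly syntactic words.
    STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is',
                 'was', 'it', 'as', 'by', 'for', 'on', 'that', 'from'}

    def content_counts(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return Counter(w for w in words if w not in STOPWORDS)

    def count_similarity(s1, s2):
        # Cosine similarity over content-word counts: 1.0 for identical
        # count vectors, 0.0 for sentences with no words in common.
        c1, c2 = content_counts(s1), content_counts(s2)
        dot = sum(c1[w] * c2[w] for w in c1)
        norm1 = math.sqrt(sum(v * v for v in c1.values()))
        norm2 = math.sqrt(sum(v * v for v in c2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

Something like this could also serve as the similarity parameter in the earlier segmentation sketches.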

But the problem is that two adjacent sentences might have very similar topics, even if those sentences have completely non-overlapping sets of words.

So you need to evaluate the "sense" of each word (its specific meaning, given the surrounding context) and generalize that meaning to encompass a broader domain.

For example, imagine a sentence with the word "greenish" in it. During my feature extraction process, I'd certainly include the exact lexical value ("greenish") but I'd also apply a morphological transform, normalizing the word to its root form ("green"). Then I'd look up that word in a taxonomy and discover that it's a color, which can be further generalized as a visual descriptor. So, based on that one word, I might add four different features to my collection of sentence features ("greenish", "green", "[color]", "[visual]"). If the next sentence in the document referred to the color "green" again, then the two sentences would be very similar. If the next sentence used the word "red", then they'd still have a degree of similarity, but to a lesser extent.
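
With NLTK's WordNet interface, that kind of feature expansion might look roughly like this; whether "greenish" actually normalizes to "green", and the exact hypernym labels you get back, depend on the lexical resources, so treat the expansion as illustrative:

    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def word_features(word, levels=2):
        # One word -> a feature set: the surface form, its lemma, and a
        # few hypernym (is-a) ancestors from WordNet, playing the role
        # of the "[color]" and "[visual]" features described above.
        features = {word, lemmatizer.lemmatize(word)}
        for synset in wn.synsets(word)[:2]:     # top few senses only; real
            parents = synset.hypernyms()        # WSD would pick the right one
            for _ in range(levels):
                if not parents:
                    break
                features.update(s.name() for s in parents)
                parents = [h for s in parents for h in s.hypernyms()]
        return features

For "green" as a noun, the hypernym chain should climb through something like chromatic_color toward visual_property, which gives exactly the generalization gradient described above: sentences mentioning "green" and "red" share the [color]-level features without sharing any lexical ones.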

So, those are a few basic ideas. You could elaborate on these ad infinitum and tweak the algorithm to perform well on your specific dataset. There are a million different ways to attack this problem, but I hope some of these suggestions are helpful in getting you started.

benjismith
A: 

For this sample, the best method is to find full stops that aren't followed by a space!

Tommy Herbert