The following unstructured text has three distinct themes -- Stallone, Philadelphia, and the American Revolution. Which algorithm or technique would you use to separate this content into distinct paragraphs?

Classifiers won't work in this situation. I also tried using Jaccard similarity to measure the distance between successive sentences, grouping successive sentences into one paragraph whenever the distance between them was below a given threshold. Is there a better method?
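
A simplified sketch of what I mean (the whitespace tokenization and the 0.8 threshold here are stand-ins, not my exact code):

    def jaccard_distance(s1, s2):
        # 1 - |intersection| / |union| over the lowercased word sets.
        a, b = set(s1.lower().split()), set(s2.lower().split())
        if not (a or b):
            return 0.0
        return 1.0 - len(a & b) / len(a | b)

    def group_sentences(sentences, threshold=0.8):
        # Assumes at least one sentence.  Append each sentence to the
        # current paragraph while the distance to the previous sentence
        # stays below the (arbitrary) threshold; otherwise start a new one.
        paragraphs = [[sentences[0]]]
        for prev, curr in zip(sentences, sentences[1:]):
            if jaccard_distance(prev, curr) < threshold:
                paragraphs[-1].append(curr)
            else:
                paragraphs.append([curr])
        return [" ".join(p) for p in paragraphs]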

This is my text sample:

Sylvester Gardenzio Stallone , nicknamed Sly Stallone, is an American actor, filmmaker and screenwriter. Stallone is known for his machismo and Hollywood action roles. Stallone's film Rocky was inducted into the National Film Registry as well as having its film props placed in the Smithsonian Museum. Stallone's use of the front entrance to the Philadelphia Museum of Art in the Rocky series led the area to be nicknamed the Rocky Steps.A commercial, educational, and cultural center, Philadelphia was once the second-largest city in the British Empire (after London), and the social and geographical center of the original 13 American colonies. It was a centerpiece of early American history, host to many of the ideas and actions that gave birth to the American Revolution and independence.The American Revolution was the political upheaval during the last half of the 18th century in which thirteen colonies in North America joined together to break free from the British Empire, combining to become the United States of America. They first rejected the authority of the Parliament of Great Britain to govern them from overseas without representation, and then expelled all royal officials. By 1774 each colony had established a Provincial Congress, or an equivalent governmental institution, to form individual self-governing states.

A: 

I don't know much about this, so this answer is a stub for a better one. Nonetheless, two points:

  1. One name for this problem is topic identification, and http://research.microsoft.com/en-us/um/people/cyl/download/papers/thesis97.pdf is a frequently cited paper in this area.
  2. This is probably very hard. I wouldn't have separated Philadelphia from the American Revolution if you hadn't told me.
John the Statistician
+2  A: 

So I've worked in NLP for a long time, and this is a really tough problem you're trying to tackle. You'll never be able to implement a solution with 100% accuracy, so you should decide up front whether it's better to make false-negative decisions (failing to find a paragraph segmentation point) or false-positive ones (inserting spurious segmentation points). Once you've decided that, assemble a corpus of documents and annotate the true segmentation points you expect to find.

Next, you'll need a mechanism for finding EOS (end-of-sentence) points. Then, between every pair of sentences, you'll need to make a binary decision: should a paragraph boundary be inserted here?
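
Just to make the pipeline concrete, a crude EOS detector might look like this (a real system would use a trained sentence tokenizer; this regex is only a stand-in):

    import re

    def split_sentences(text):
        # Naive EOS detection: split after '.', '!' or '?' when followed
        # by whitespace and a capital letter.  A real system also has to
        # handle abbreviations, quotations, decimals, etc.
        return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())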

You could measure the cohesion of concepts within each paragraph under different segmentation points. For example, a document with five sentences (ABCDE) can be segmented in sixteen different ways (2^(n-1) for n sentences):

ABCDE   ABCD|E   ABC|DE   ABC|D|E   AB|CDE   AB|CD|E   AB|C|DE   AB|C|D|E
A|BCDE  A|BCD|E  A|BC|DE  A|BC|D|E  A|B|CDE  A|B|CD|E  A|B|C|DE  A|B|C|D|E

To measure cohesion, you could use a sentence-to-sentence similarity metric (based on some collection of features extracted for each sentence). For the sake of simplicity, if two adjacent sentences have a similarity metric of 0.95, then there's a 0.05 "cost" for combining them into the same paragraph. The total cost of a document segmentation plan is the aggregate of all the sentence-joining costs. To arrive at the final segmentation, you choose the plan with the least expensive aggregate cost.
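
Here's a brute-force sketch of that idea, with two caveats: the similarity function is left as a placeholder parameter, and I've added a small fixed penalty per boundary as an assumption of my own, since with joining costs alone the cheapest plan would always be to split after every sentence:

    from itertools import combinations

    def plan_cost(sentences, boundaries, similarity, penalty=0.3):
        # Each adjacent pair kept in the same paragraph costs
        # (1 - similarity); each boundary adds an assumed fixed penalty
        # so that splitting everywhere isn't free.
        cost = len(boundaries) * penalty
        for i in range(len(sentences) - 1):
            if i not in boundaries:
                cost += 1.0 - similarity(sentences[i], sentences[i + 1])
        return cost

    def best_plan_brute_force(sentences, similarity):
        # Evaluate all 2^(n-1) subsets of the n-1 gap positions.
        gaps = range(len(sentences) - 1)
        plans = (set(c) for r in range(len(sentences))
                 for c in combinations(gaps, r))
        return min(plans, key=lambda b: plan_cost(sentences, b, similarity))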

Of course, for a document with more than a few sentences, there are far too many possible segmentations to brute-force evaluate all of their costs. So you'll need some way to prune the search. Dynamic programming could be helpful here.
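
For instance, the additive cost model above decomposes over paragraphs, so the cheapest plan can be found exactly in O(n^2) with dynamic programming instead of enumeration (same assumed per-boundary penalty as in the brute-force sketch, similarity still a placeholder):

    def best_plan_dp(sentences, similarity, penalty=0.3):
        # best[i] = minimum cost of segmenting the first i sentences,
        # where a paragraph covering sentences[j:i] costs one penalty
        # plus the joining costs of its internal gaps.
        n = len(sentences)
        join = [1.0 - similarity(sentences[k], sentences[k + 1])
                for k in range(n - 1)]
        best = [0.0] + [float('inf')] * n
        back = [0] * (n + 1)
        for i in range(1, n + 1):
            para_cost = penalty                      # paragraph sentences[i-1:i]
            for j in range(i - 1, -1, -1):           # try paragraph sentences[j:i]
                if best[j] + para_cost < best[i]:
                    best[i], back[i] = best[j] + para_cost, j
                if j > 0:
                    para_cost += join[j - 1]         # grow the paragraph leftward
        # Walk the backpointers to recover the boundary positions.
        bounds, i = [], n
        while i > 0:
            bounds.append(back[i])
            i = back[i]
        return sorted(b for b in bounds if b > 0)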

As for the actual sentence feature extraction... well, that's where it gets really complicated.

You probably want to ignore highly syntactic words (connective words like prepositions, conjunctions, helping verbs, and clause markers) and base your similarity around more semantically relevant words (nouns and verbs, and to a lesser extent, adjectives and adverbs).

A naive implementation might just count up the number of instances of each word and compare the word counts in one sentence with the word counts in an adjacent sentence. If an important word (like "Philadelphia") appears in two adjacent sentences, then they might get a high similarity score.
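
Here's a sketch of that naive approach, with the function-word filtering from above folded in (the stopword list is a tiny placeholder, and cosine similarity is just one common way to turn raw counts into a score):

    import math, re
    from collections import Counter

    # Tiny placeholder list; a real system would use a full stopword
    # list (or part-of-speech tags) to drop highly syntactic words.
    STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is',
                 'was', 'it', 'as', 'by', 'for', 'on', 'that', 'from'}

    def content_counts(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return Counter(w for w in words if w not in STOPWORDS)

    def count_similarity(s1, s2):
        # Cosine similarity over content-word counts: 1.0 for identical
        # count vectors, 0.0 for sentences with no words in common.
        c1, c2 = content_counts(s1), content_counts(s2)
        dot = sum(c1[w] * c2[w] for w in c1)
        norm1 = math.sqrt(sum(v * v for v in c1.values()))
        norm2 = math.sqrt(sum(v * v for v in c2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

Something like this could also serve as the similarity parameter in the earlier segmentation sketches.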

But the problem is that two adjacent sentences might have very similar topics, even if those sentences have completely non-overlapping sets of words.

So you need to evaluate the "sense" of each word (its specific meaning, given the surrounding context) and generalize that meaning to encompass a broader domain.

For example, imagine a sentence with the word "greenish" in it. During my feature extraction process, I'd certainly include the exact lexical value ("greenish") but I'd also apply a morphological transform, normalizing the word to its root form ("green"). Then I'd look up that word in a taxonomy and discover that it's a color, which can be further generalized as a visual descriptor. So, based on that one word, I might add four different features to my collection of sentence features ("greenish", "green", "[color]", "[visual]"). If the next sentence in the document referred to the color "green" again, then the two sentences would be very similar. If the next sentence used the word "red", then they'd still have a degree of similarity, but to a lesser extent.
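
With NLTK's WordNet interface, that kind of feature expansion might look roughly like this; whether "greenish" actually normalizes to "green", and the exact hypernym labels you get back, depend on the lexical resources, so treat the expansion as illustrative:

    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def word_features(word, levels=2):
        # One word -> a feature set: the surface form, its lemma, and a
        # few hypernym (is-a) ancestors from WordNet, playing the role
        # of the "[color]" and "[visual]" features described above.
        features = {word, lemmatizer.lemmatize(word)}
        for synset in wn.synsets(word)[:2]:     # top few senses only; real
            parents = synset.hypernyms()        # WSD would pick the right one
            for _ in range(levels):
                if not parents:
                    break
                features.update(s.name() for s in parents)
                parents = [h for s in parents for h in s.hypernyms()]
        return features

For "green" as a noun, the hypernym chain should climb through something like chromatic_color toward visual_property, which gives exactly the generalization gradient described above: sentences mentioning "green" and "red" share the [color]-level features without sharing any lexical ones.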

So, those are a few basic ideas. You could elaborate on these ad infinitum and tweak the algorithm to perform well on your specific dataset. There are a million different ways to attack this problem, but I hope some of these suggestions are helpful in getting you started.

benjismith
A: 

For this sample, the best method is to find full stops that aren't followed by a space!

Tommy Herbert