Hello, I have "AUTOMATIC TEXT SUMMARIZER (linguistic approach)" as my final-year project. I have collected a good number of research papers and gone through them, but I am still not clear on how to go about it. I also came across "AUTOMATIC TEXT SUMMARIZER (statistical based)" and found it much easier than my project. My project guide told me not to opt for the statistical approach and to go for the linguistic one.

Anyone who has worked on, or even heard of, this sort of project will know that summarizing a document essentially means SCORING each sentence (using some specific algorithm) and then selecting the sentences whose score exceeds a threshold. The most difficult part of the project is choosing an appropriate scoring algorithm and then implementing it.
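
The score-then-filter pipeline described above can be sketched in Java roughly as follows. This is only a minimal sketch: the class `ThresholdSummarizer`, its method names, the naive regex sentence splitter, and the word-count scorer are all illustrative assumptions, not taken from any of the papers. The scorer is passed in as a function so that a linguistic scoring method could later replace the placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class ThresholdSummarizer {

    // Naive splitter (assumption): breaks after '.', '!' or '?' followed by whitespace.
    // It will mis-split abbreviations such as "e.g. this".
    public static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Keeps, in document order, every sentence whose score exceeds the threshold.
    public static List<String> summarize(String text,
                                         ToDoubleFunction<String> scorer,
                                         double threshold) {
        List<String> summary = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            if (scorer.applyAsDouble(sentence) > threshold) {
                summary.add(sentence);
            }
        }
        return summary;
    }

    // Placeholder scorer (hypothetical): the number of whitespace-separated words.
    // A real linguistic scorer would be plugged in here instead.
    public static double wordCount(String sentence) {
        return sentence.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        String text = "Cats sleep. The quick brown fox jumps over the lazy dog. Hi!";
        // Only the middle sentence has more than 3 words.
        System.out.println(summarize(text, ThresholdSummarizer::wordCount, 3.0));
    }
}
```

Keeping the scorer separate from the selection step means the threshold logic does not change when the scoring algorithm does.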

I have moderate programming skills and would like to code in Java (because I will get lots of APIs there, resulting in less overhead). I would like to know what approach and algorithms I should use for my project, and how to implement them. Note that I have to complete it in 3 months. Please help me. It's urgent.

Thanks in anticipation.

A: 

If you really have read those research papers and books, you probably know what is known. Now it is up to you to implement that knowledge in a Java application. Or you could expand human knowledge through some innovation or invention. If you do expand human knowledge, you have become a true scientist.

tuinstoel
A: 

Please make your question more specific, in these two main areas:

  1. Project definition: What is the goal of your project? Is the input unit a single document? A list of documents? Do you intend your program to use machine learning? What is the output? How will you measure success?
  2. Your background knowledge: You intend to use linguistic rather than statistical methods. Do you have background in parsing natural language? In semantic representation?

I think some of these questions are tough. I am asking them because I spent too much time trying to answer similar questions in the course of my studies. Once you get these sorted out, I may be able to give you some pointers. Mani's "Automatic Summarization" looks like a good start, at least the introductory chapters.

Yuval F
A: 

Thanks, Yuval F, for taking an interest in my problem.

Project definition: The goal of my project is to develop a tool that takes a single document (not multiple documents) as input, in the form of a simple text file, and generates a summary. The summary's length would of course depend on the user's choice.
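
One way a user-chosen summary length could be honoured is to keep the N highest-scoring sentences and print them in their original order. The sketch below assumes the sentences have already been split and scored by some other step; the class `TopNSelector` and the method `selectTop` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TopNSelector {

    // Returns the n highest-scoring sentences, restored to document order.
    // scores[i] is assumed to be the score of sentences.get(i).
    public static List<String> selectTop(List<String> sentences, double[] scores, int n) {
        if (sentences.size() != scores.length) {
            throw new IllegalArgumentException("one score per sentence is required");
        }
        int limit = Math.max(0, Math.min(n, sentences.size()));
        return IntStream.range(0, sentences.size())
                .boxed()
                // Rank sentence indices by descending score.
                .sorted((a, b) -> Double.compare(scores[b], scores[a]))
                .limit(limit)
                // Restore the original document order of the chosen indices.
                .sorted()
                .map(sentences::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("a", "b", "c", "d");
        double[] scores = {0.1, 0.9, 0.5, 0.7};
        System.out.println(selectTop(sentences, scores, 2)); // prints [b, d]
    }
}
```

Compared with a fixed score threshold, selecting the top N gives direct control over the length of the output, whatever scale the scores happen to be on.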

Using machine learning would require a training corpus and a testing corpus. I have no problem with using it; the only limitation is the time limit of 3 months. I have the Reuters training and testing corpus with me. As for evaluating success: since this is a curriculum project, it need not be very accurate or perfect (that would demand time and intense research). In short, the goal is to build a basic summarizer (but using linguistics).

Background knowledge: I had Automata as one of my subjects earlier, so I have some knowledge of parsing (S --> NP VP, etc.). I tried to look for Mani's "Automatic Summarization" book and papers, but without success. It would be great if you could provide these. Also, please guide me in my further steps.

Thanks once again.

Hi shishirlearnz. I suggest we discuss this in detail outside of SO. If you want to, please email me at fyuval AT gmail Dot com.
Yuval F
A: 

The University of Sheffield did some work on automatic email summarising as part of the EU FASiL project a few years back.

Frank Shearar
A: 

Using Lexical Chains for Text Summarization (Microsoft Research)

An analysis of different algorithms: DasMartins.2007

Most important part in the doc:

• Nenkova (2005) analyzes that no system could beat the baseline with statistical significance
• Striking result!

Note that there are 2 different nuances to the linguistic approach:

  • Linguistic rating system (all clear here)
  • Linguistic generation (rewrites sentences to build the summary)
clyfe
A: 

One thing to note here: the research.microsoft.com domain seems to have been down for many days. I'm doing my research on exactly the same topic and found many links to this domain, but none of them are working...

Navin Israni