Is there an open source Java library/algorithm for finding if a particular piece of text is a question or not?
I am working on a question answering system that needs to analyze whether the text input by the user is a question.
I think the problem can probably be solved with open source NLP libraries, but it's obviously more complicated than simple part-of-speech tagging. So if someone can describe an algorithm for it that uses an existing open source NLP library, that would be good too.
Also let me know if you know of a library/toolkit that uses data mining to solve this problem. Although it will be difficult to get sufficient data for training purposes, I will be able to use Stack Exchange data for training.

Update:
I have given up on NLP libraries. I tried to use uClassify (http://www.uclassify.com) for text classification and trained my classifier with 100,000 Stack Overflow questions/answers. It is still not very useful: "I do what I want" is classified as a question, while "You do what you want" is classified as an answer.
So if anyone can point me to a good training dataset, that would be great as well.
Are there any other alternatives?

A: 

Try the following:

  • Ends with a '?'
  • Contains one of the following words: "who", "what", "where", "when", "how", "why", "or"

public boolean isQuestion(String sentence) {
      // Lowercase so "Who"/"What" match; \b word boundaries keep "or" from matching inside words such as "word".
      String s = sentence.trim().toLowerCase();
      return s.endsWith("?") || s.matches(".*\\b(who|what|where|when|how|why|or)\\b.*");
}

Lee
This will match "I do **what** I want", "I sleep **when** I want", "I don't know **where** I am"...
Marko
"Who is on the second base"
belisarius
@belisarius: LOL that's a good one; http://en.wikipedia.org/wiki/Who%27s_on_First%3F
polygenelubricants
@polygenelubricants Yeah I was trying to prove that by induction :D
belisarius
+1  A: 

Many quasi-questions/requests-for-info are posed in the grammatical form of a statement; e.g. "I would like to know who stole my bike".

I would abandon all hope of determining from its structure whether the user's input is a question, and simply assume it is a question unless it is unambiguously not a question. You could take an iterative, interactive approach so the system could refine its "understanding" of the user's input:

     User: I would like to know how many angels fit on the head of a pin. 
     System: Do you have a question about angels?
     User: Yes.          
     System: Do you want to know if angels are fit?
     User: No.
     System: Do you want to know if angels have heads?
     User: Possibly.         
     System: Do you want to know if angels have pins?
     User: No.         
     System: Do you want to know if angels are numerous?
     User: No.
     System: Do you want to know the dimensions of an angel?
     User: Yes.
     System: Do you mean angels in paintings?
     User: No.
     System: Do you mean angels in myth and religious writing?
     User: Yes.
     System: Angels are metaphysical beings.


     User: I hear that Pennsylvania was founded by William Penn. Is that true?
     System: Do you have a question about hearing?
     User: No.
     System: Do you have a question about Pennsylvania?
     User: Yes.         
     System: Pennsylvania was founded by William Penn.
     User: When?         
     System: 1682.
     User: What does the name mean?
     System: What name?
     User: Pennsylvania!
     System: Do you want to know the meaning of Pennsylvania?
     User: Yes.
     System: Pennsylvania means Penn's Woods.
Tim
Interesting approach. :)
nabeelmukhtar
That is a nice method of doing it. Can I assume that this is purely theoretical?
Lee
@Lee: Do you have a question about "doing it" ?
Tim
+2  A: 

In a syntactic parse of a question, the correct structure will be in the form of:

(SBARQ (WH+ (W+) ...)
       (SQ ...*
           (V+) ...*)
       (?))

So, using any one of the available syntactic parsers, a tree with an SBARQ node (optionally with an embedded SQ) is an indicator that the input is a question. The WH+ node (WHNP/WHADVP/WHADJP) contains the question stem (who/what/when/where/why/how) and the SQ holds the inverted phrase.

For example:

(SBARQ 
  (WHNP 
    (WP What)) 
  (SQ 
    (VBZ is) 
    (NP 
      (DT the) 
      (NN question)))
  (. ?))

Of course, having a lot of preceding clauses will cause errors in the parse (which can be worked around), as will really poorly written questions. For example, the title of this post, "How to find out if a sentence is a question?", will have an SBARQ, but not an SQ.
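
As a rough illustration of the check described above, here is a minimal sketch that scans a Penn Treebank style bracketed parse (the string output most syntactic parsers can print) for an SBARQ node and for an SQ nested inside it. It does not call any parser; the class name `ParseTreeQuestionCheck`, the `Result` type, and the `scan` method are hypothetical names made up for this example, and it assumes plain labels without function tags (e.g. `SBARQ`, not `SBARQ-1`).

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ParseTreeQuestionCheck {

    /** What the scan found in one bracketed parse (hypothetical helper type). */
    static final class Result {
        final boolean hasSbarq;          // an SBARQ node exists anywhere in the tree
        final boolean hasSqInsideSbarq;  // an SQ node exists below some SBARQ node

        Result(boolean hasSbarq, boolean hasSqInsideSbarq) {
            this.hasSbarq = hasSbarq;
            this.hasSqInsideSbarq = hasSqInsideSbarq;
        }
    }

    /** Walks the bracketed string once, tracking the labels of the open nodes. */
    static Result scan(String tree) {
        Deque<String> open = new ArrayDeque<>();
        boolean hasSbarq = false;
        boolean sqInside = false;
        int i = 0;
        while (i < tree.length()) {
            char c = tree.charAt(i);
            if (c == '(') {
                // The node label runs from just after '(' to the next space or bracket.
                int j = i + 1;
                while (j < tree.length()
                        && !Character.isWhitespace(tree.charAt(j))
                        && tree.charAt(j) != '('
                        && tree.charAt(j) != ')') {
                    j++;
                }
                String label = tree.substring(i + 1, j);
                if (label.equals("SBARQ")) {
                    hasSbarq = true;
                }
                if (label.equals("SQ") && open.contains("SBARQ")) {
                    sqInside = true;
                }
                open.push(label);
                i = j;
            } else {
                if (c == ')' && !open.isEmpty()) {
                    open.pop();  // the most recently opened node is now closed
                }
                i++;  // closing brackets, whitespace and token characters are skipped
            }
        }
        return new Result(hasSbarq, sqInside);
    }

    public static void main(String[] args) {
        String tree = "(SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (DT the) (NN question))) (. ?))";
        Result r = scan(tree);
        System.out.println(r.hasSbarq + " " + r.hasSqInsideSbarq);  // prints "true true"
    }
}
```

Reporting the two flags separately leaves the policy to the caller: treating SBARQ alone as sufficient would still accept inputs like the title of this post, while requiring the nested SQ would be stricter.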

msbmsb