This is a question:

"Where is the car?"

This is NOT a question:

"Check this out: http://domain.com/?q=test"

How do I write a function to analyze a string so that we know for sure it is a question and not part of a URL?

A: 

If the question mark is part of a URL, it will not have whitespace on either side of it, nor a line break or end-of-string right after it.

Steve De Caux
+1  A: 

You can, for example, check whether the question mark is immediately followed by a non-space, non-line-break character. But I guess a safer way would be to strip any possible URL from the string before searching it for a question mark.
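
A minimal sketch of the strip-first idea, assuming a "URL" is anything that starts with a scheme such as `http://` and runs up to the next whitespace. The pattern and the function name `has_question_mark_outside_urls` are illustrative assumptions, not a complete URL matcher.

```python
import re

# Assumption: a URL is "<scheme>://" followed by non-whitespace characters.
# Real-world URL detection is considerably messier than this.
URL_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://\S+")


def has_question_mark_outside_urls(text):
    """Return True if a '?' is still present once URL-like parts are removed."""
    without_urls = URL_PATTERN.sub(" ", text)
    return "?" in without_urls


if __name__ == "__main__":
    for sample in (
        "Where is the car?",
        "Check this out: http://domain.com/?q=test",
        "Is this http://domain.com/?q=test a good site?",
    ):
        print(repr(sample), has_question_mark_outside_urls(sample))
```

Note that a question mark glued to the end of a URL, as in "Is this a url http://google.com?", is swallowed together with the URL, so this sketch reports that string as not containing a question.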

Konamiman
+2  A: 

If the question mark is always there, you could check like this:

# question holds the string to test
if question.strip().endswith("?") and "://" not in question:
    pass  # it looks like a question; do something with it here

If you really want to parse real sentences, you may need nltk; I am not sure about that case.

P.S. This is just a sample for when the text format is fixed. Nobody can parse real English grammar with regex.

S.Mark
Would not work with "Is this http://domain.com/?q=test a good site?"
Burkhard
Well, to fully understand English, there are many, many things you need to do. Even if I could parse that sentence, there would be many other cases to handle. It's worse than parsing HTML with regex: it's parsing English with regex, which is not possible if you need to cover all the patterns.
S.Mark
Let me give you an example: `this url is valid - http://google.com?`
S.Mark
And the question is: `Is this a url http://google.com?`
S.Mark
+3  A: 

This regex finds question marks following a word character, and followed by either whitespace or the end of the string/line. Not perfect, but should catch most cases...

\w\?[$\s]

Edit (lack of caffeine strikes...):

That should have been:

\w\?(\s|$)

In the original, $ is interpreted as a literal character. (Thanks Gumbo)
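
A small Python sketch of both patterns, to show the difference. The function name `ends_word_with_question_mark` and the use of `re.MULTILINE` (so that `$` also matches at the end of each line) are assumptions made for illustration.

```python
import re

# Corrected pattern: a word character, a '?', then whitespace or end of line.
QUESTION_MARK = re.compile(r"\w\?(\s|$)", re.MULTILINE)

# Original pattern: inside [...] the $ is a literal dollar sign,
# so a '?' at the very end of the string is never matched.
ORIGINAL = re.compile(r"\w\?[$\s]")


def ends_word_with_question_mark(text):
    """Return True if some word is directly followed by '?' and then a gap or line end."""
    return QUESTION_MARK.search(text) is not None


if __name__ == "__main__":
    for sample in (
        "Where is the car?",
        "Check this out: http://domain.com/?q=test",
        "What ?",
    ):
        print(repr(sample), ends_word_with_question_mark(sample))
    print(ORIGINAL.search("Where is the car?"))  # no match: nothing follows the '?'
```

The check still cannot tell "Is this a url http://google.com?" apart from a URL that merely ends in a question mark; both match.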

mavnn
correction: This regex finds question marks following **one** word **character**
exhuma
Correct, typo on my part. All it's there for is to exclude 'hanging' question marks. Will update.
mavnn
`[$\s]` means either the `$` character or a whitespace character.
Gumbo
Yes, but whitespace other than a literal 'space' character is just as relevant. I'll update to make it clearer.
mavnn
@mavnn: I think you didn’t get me: `$` inside a character class is interpreted as a literal character. So `[$\s]` means a literal `$` character or a whitespace character.
Gumbo
*Slaps forehead*. More coffee needed...
mavnn
+2  A: 

Essentially, what the others say is correct: there should be no whitespace before the ?. If the question is entered by a user, however, things get more ambiguous.

In that case, a proper parser using a context-free grammar may yield better results, even for questions that do not end with a question mark. It may still not recognize all questions, though: covering all possible structure variations, inflections and whatnot is not straightforward.

But, if you are certain that the questions always end with a question mark, you could do something as simple as

question_text = "Where is the car?"  # example input
if question_text.strip().endswith("?"):
    print(repr(question_text), "is a question")

Or:

import re
# question_text holds the string to test
p = re.compile(r"\w+\?\s*")
if p.search(question_text):
    print(repr(question_text), "contains a question")

Not tested, but should work for most cases.

exhuma
Using `\s*` will also allow no whitespace at all.
Gumbo
Yes. That was intended as such.
exhuma
A: 

One approach, probably not very robust but possibly good enough to get some traction, is to look for "question words" in strings that end with question marks. In English, most question sentences or clauses (i.e. following a comma) start with "who", "what", "where", "when", "how", "why", "can", "may", "will", "won't", "does", "doesn't", etc. You could probably build up a pretty good heuristic this way that might work better than a regex (or could be incorporated into one or more regexes).
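
A rough sketch of that heuristic, assuming the word list is limited to the examples named above and that clauses are separated by commas only. The name `starts_with_question_word` is made up for illustration, and the list would need extending ("is", "are", "do", ...) before it is useful.

```python
# Illustrative, deliberately incomplete list: only the words named above.
QUESTION_WORDS = {
    "who", "what", "where", "when", "how", "why",
    "can", "may", "will", "won't", "does", "doesn't",
}


def starts_with_question_word(text):
    """Return True if text ends with '?' and some comma-separated clause
    begins with one of QUESTION_WORDS."""
    text = text.strip()
    if not text.endswith("?"):
        return False
    for clause in text.split(","):
        words = clause.lower().split()
        # rstrip handles one-word questions such as "Why?"
        if words and words[0].rstrip("?") in QUESTION_WORDS:
            return True
    return False


if __name__ == "__main__":
    for sample in (
        "Where is the car?",
        "Well, where is the car?",
        "This url is valid - http://google.com?",
    ):
        print(repr(sample), starts_with_question_word(sample))
```

With this word list, a statement that merely ends in a URL with a trailing question mark is rejected, but so is a genuine question that starts with a word missing from the list.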

Drew Hall