This is a question:
"Where is the car?"
This is NOT a question:
"Check this out: http://domain.com/?q=test"
How do I write a function to analyze a string so that we know for sure it is a question and not part of a URL?
This is a question:
"Where is the car?"
This is NOT a question:
"Check this out: http://domain.com/?q=test"
How do I write a function to analyze a string so that we know for sure it is a question and not part of a URL?
The question mark will not have white space either side or a line break/end-of-string after it, if it is in a url
You can for example check if the question mark is immediately followed by a non-space, non-line break character. But I guess that a more safe way would be to strip any possible URL from the string before searching the question mark on it.
If question mark is always there, you could check like
if question.strip().endswith("?") and "://" not in question:
# do something ?
If you really want to parse real sentence, you may need nltk, I am not sure for that case.
p.s this is just an sample if the text is fixed, nobody can parse real English grammar with regex.
This regex finds question marks following a word character, and followed by either whitespace or the end of the string/line. Not perfect, but should catch most cases...
\w\?[$\s]
Edit (lack of caffeine strikes...):
That should have been:
\w\?(\s|$)
In the original, $ is interpreted as a literal character. (Thanks Gumbo)
Essentially what others say is correct. There should be no whitespace before the ?
. If the question is entered by a user, things get more ambiguous however.
In that case a proper parser using a context free grammar may yield better results. Even with questions not having a question mark at the end. But it may not recognize all questions. Covering all possible structure variations, inflections and whatnot is not straight-forward.
But, if you are certain that the questions always end with a question mark, you could do something as simple as
if question_text.strip().endswith("?"):
print `question_text`, "is a question"
Or:
import re
p = re.compile( r"\w+\?\s*" )
if p.search( question_text ):
print `question_text`, "contains a question"
Not tested, but should work for most cases.
A probably not very robust approach that you might be able to get some traction with would be to look for "question words" in strings that end with question marks. In English, most question sentences or clauses (i.e. following a comma) start with "who", "what", "where", "when", "how", "why", "can", "may", "will", "won't, "does", "doesn't", etc. You could probably build up a pretty good heuristic this way that might work better than a regex (or could be incorporated into one or more regexes).