ansaurus

Question

I am trying to determine whether a string is a question. How can I analyze the "?" symbol in Python?

Answer 1

A:

If the question mark is part of a URL, it will not have whitespace on either side, and it will not be followed by a line break or the end of the string.

Steve De Caux 2009-11-24 09:46:50

Answer 2

+1 A:

You can, for example, check whether the question mark is immediately followed by a character that is neither a space nor a line break. But I guess a safer way would be to strip any URLs from the string before searching it for the question mark.
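A minimal sketch of the "strip URLs first" idea. The pattern and the function name are assumptions made for illustration: a URL is taken to be anything that starts with `http://`, `https://` or `www.` and runs up to the next whitespace, which is much cruder than real URL detection.

```python
import re

# Assumption: a URL-like token starts with http://, https:// or www.
# and continues until the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def has_question_mark_outside_urls(text):
    """Return True if a '?' is still present once URL-like tokens are removed."""
    stripped = URL_PATTERN.sub(" ", text)
    return "?" in stripped

for sample in (
    "Is this http://domain.com/?q=test a good site?",
    "See http://domain.com/?q=test for details.",
    "Is this a url http://google.com?",
):
    print(repr(sample), has_question_mark_outside_urls(sample))
```

Note that a question mark glued to the end of a URL, as in the last sample, is removed together with the URL, so that sentence is not detected as a question.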

Konamiman 2009-11-24 09:48:41

Answer 3

+2 A:

If the question mark is always there, you could check like this:

if question.strip().endswith("?") and "://" not in question:
    # do something ?

If you really want to parse real sentences, you may need nltk; I am not sure about that case.

P.S. This is just a sample for when the text format is fixed; nobody can parse real English grammar with regex.

S.Mark 2009-11-24 09:49:12

Would not work with "Is this http://domain.com/?q=test a good site?"

Burkhard 2009-11-24 09:53:15

Well, fully understanding English takes many, many things. Even if I could parse that sentence, there would be many other cases to handle. It's worse than parsing HTML with regex: it's parsing English with regex, and that's not possible if you need to cover all the patterns.

S.Mark 2009-11-24 09:56:15

let me give you an example, `this url is valid - http://google.com?`

S.Mark 2009-11-24 10:02:27

and the question is `Is this a url http://google.com?`

S.Mark 2009-11-24 10:07:25

Answer 4

+3 A:

This regex finds question marks following a word character, and followed by either whitespace or the end of the string/line. Not perfect, but should catch most cases...

\w\?[$\s]

Edit (lack of caffeine strikes...):

That should have been:

\w\?(\s|$)

In the original, $ is interpreted as a literal character. (Thanks Gumbo)
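A small sketch to show the difference between the two patterns in Python's `re` module. The sample strings are made up for illustration.

```python
import re

# Inside a character class, '$' is a literal dollar sign, not an anchor.
original = re.compile(r"\w\?[$\s]")
# Outside a character class, '$' anchors at the end of the string.
corrected = re.compile(r"\w\?(\s|$)")

samples = [
    "Is this a question?",            # '?' at the very end of the string
    "Really? Yes.",                   # '?' followed by a space
    "Cost?$5",                        # '?' followed by a literal '$'
    "See http://domain.com/?q=test",  # '?' inside a URL, preceded by '/'
]

for text in samples:
    print(
        repr(text),
        "original:", bool(original.search(text)),
        "corrected:", bool(corrected.search(text)),
    )
```

The original pattern misses a question mark at the end of the string and matches one followed by a dollar sign; the corrected pattern does the opposite.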

mavnn 2009-11-24 09:50:40

correction: This regex finds question marks following **one** word **character**

exhuma 2009-11-24 09:58:03

Correct, typo on my part. All it's there for is to exclude 'hanging' question marks. Will update.

mavnn 2009-11-24 10:16:31

`[$\s]` means either the `$` character or a whitespace character.

Gumbo 2009-11-24 10:20:18

Yes. But whitespace other than a literal 'space' character is just as relevant. I'll update to make it clearer.

mavnn 2009-11-24 10:22:57

@mavnn: I think you didn’t get me: `$` inside a character class is interpreted as a literal character. So `[$\s]` means a literal `$` character or a whitespace character.

Gumbo 2009-11-24 10:28:37

*Slaps forehead*. More coffee needed...

mavnn 2009-11-24 10:52:11

Answer 5

+2 A:

Essentially what others say is correct. There should be no whitespace before the ?. If the question is entered by a user, things get more ambiguous however.

In that case a proper parser using a context-free grammar may yield better results, even for questions that do not end with a question mark. But it may not recognize all questions: covering all possible structure variations, inflections and whatnot is not straightforward.

But, if you are certain that the questions always end with a question mark, you could do something as simple as

if question_text.strip().endswith("?"):
    print(repr(question_text), "is a question")

Or:

import re
p = re.compile(r"\w+\?\s*")
if p.search(question_text):
    print(repr(question_text), "contains a question")

Not tested, but should work for most cases.

exhuma 2009-11-24 09:54:29

Using `\s*` will also allow no whitespace at all.

Gumbo 2009-11-24 10:21:03

Yes. That was intended as such.

exhuma 2009-11-24 13:31:57

Answer 6

A:

One approach, probably not very robust but possibly useful, is to look for "question words" in strings that end with question marks. In English, most question sentences or clauses (i.e. following a comma) start with "who", "what", "where", "when", "how", "why", "can", "may", "will", "won't", "does", "doesn't", etc. You could probably build up a pretty good heuristic this way that might work better than a regex (or could be incorporated into one or more regexes).
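A rough sketch of this heuristic, using only the question words listed above. The function name and the choice to split clauses on `, . ; ! ?` are assumptions made for illustration; a real word list would need to be much longer.

```python
import re

# Only the words named in the answer; a real list would be longer.
QUESTION_WORDS = {
    "who", "what", "where", "when", "how", "why",
    "can", "may", "will", "won't", "does", "doesn't",
}

def looks_like_question(text):
    """Heuristic: ends with '?' and some sentence or clause starts with a question word."""
    text = text.strip()
    if not text.endswith("?"):
        return False
    # Assumption: clauses are separated by commas or sentence punctuation.
    for clause in re.split(r"[,.;!?]+", text):
        words = clause.strip().lower().split()
        if words and words[0] in QUESTION_WORDS:
            return True
    return False

for sample in (
    "What time is it?",
    "If it rains, will you come?",
    "This is odd?",
    "What a day.",
):
    print(repr(sample), looks_like_question(sample))
```

Questions that start with a word outside the list, such as "Is this a question?", are rejected, which is the kind of gap the answer warns about.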

Drew Hall 2009-11-24 11:06:53
