Hi, I have a page built in Django that has its own search engine. I need help constructing a regex that accepts only valid queries, i.e. queries consisting solely of Polish alphabet letters (both upper- and lowercase) and the symbols * and ?. Can anyone be of assistance?

EDIT: I tried something like this:

query_re = re.compile(r'^\w*[\*\?]*$', re.UNICODE)
if not query_re.match(self.cleaned_data['query']):
    raise forms.ValidationError(_('Illegal character'))

but it also allows some invalid characters from other alphabets and won't allow queries like *somest?ing*.

+1  A: 

Try something like

regex = r'(?iL)^[\s\*\?a-z]*$'

assuming your machine's locale is Polish. The first part (?iL) sets the locale and ignorecase flags. The ^ matches the start of the string, \s matches any whitespace, and a-z any lowercase letter (or uppercase, thanks to the ignorecase flag).

Alternatively, instead of using (?L) and a-z, you could just explicitly list the allowable letters (e.g. abcdefghijklmnopqrstuvwxyz).
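For illustration, here is a minimal sketch of the explicit-list variant. It assumes Python 3 with str (Unicode) patterns, so it does not depend on the machine locale at all; the names POLISH_LETTERS, QUERY_RE and is_valid_query are made up for this example, and whitespace is deliberately not allowed (add \s to the class if queries may contain spaces).

```python
import re

# Assumption: Python 3, where str patterns are matched as Unicode text.
# POLISH_LETTERS, QUERY_RE and is_valid_query are illustrative names only.
POLISH_LETTERS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"

# Inside a character class, * and ? are literal and need no escaping.
QUERY_RE = re.compile(r"[a-zA-Z" + POLISH_LETTERS + r"*?]*")


def is_valid_query(query):
    """Return True if query contains only Polish/Latin letters, * and ?."""
    # fullmatch anchors the pattern at both ends, like ^...$ with match.
    return QUERY_RE.fullmatch(query) is not None
```

In a Django form this check could replace the query_re.match(...) call from the question; anything containing digits, underscores, spaces, or letters from other alphabets would then be rejected.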

Yoni Samlan
Thanks for the swift reply. Unfortunately this doesn't work: Polish letters are still considered invalid. The machine locale is set correctly; getlocale returns ('pl_PL', 'UTF8'). I tried both the a-z and the aąbcćdeę... variants, but still no luck.
Lhiash
+1  A: 

If your locale is correctly set, you would use

query_re = re.compile(r'^[\w\*\?]*$', re.LOCALE|re.IGNORECASE)

\w matches all locale-specific alphanumerics: http://docs.python.org/library/re.html
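One caveat worth checking before relying on \w: it is broader than "letters". The sketch below assumes Python 3 with a str pattern (where matching is Unicode-aware by default rather than locale-based), and loose_re is just an illustrative name. It suggests that a \w-based class accepts wildcards anywhere in the query, but also digits, underscores, and letters from other alphabets.

```python
import re

# Assumption: Python 3, str pattern, default Unicode matching (no re.LOCALE).
# loose_re is an illustrative name, not taken from the thread.
loose_re = re.compile(r"[\w*?]*")

# Wildcards are accepted anywhere in the query.
print(bool(loose_re.fullmatch("*somest?ing*")))

# \w also covers digits and the underscore.
print(bool(loose_re.fullmatch("abc_123")))

# With Unicode matching, \w covers letters from other alphabets too.
print(bool(loose_re.fullmatch("привет")))
```

If only Polish letters should pass, an explicit character list is the stricter choice.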

kibitzer
Other than using \w instead of a-z and splitting the flags out to be part of compilation instead of inline, how is this different from my answer?
Yoni Samlan