I need to check that text contain only small letters a-z and ,
best way to do it in python?
I need to check that text contain only small letters a-z and ,
best way to do it in python?
Not sure what do you mean with "contain", but this should go in your direction:
reobj = re.compile(r"[a-z,]+")
match = reobj.search(subject)
if match:
result = match.group()
else
result = ""
#!/usr/bin/env python
import string
text = 'aasdfadf$oih,234'
for letter in text:
if letter not in string.ascii_lowercase and letter != ',':
print letter
import string
allowed = set(string.lowercase + ',')
if set(text) - allowed:
# you know it has forbidden characters
else:
# it doesn't have forbidden characters
Doing it with sets will be faster than doing it with for loops (especially if you want to check more than one text) and is all together cleaner than regexes for this situation.
an alternative that might be faster than two sets, is
allowed = string.lowercase + ','
if not all(letter in allowed for letter in text):
# you know it has forbidden characthers
here's some meaningless mtimeit
results. one
is the generator expression and two
is the set based solution.
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 3.98 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 4.39 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 3.51 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 7.7 usec per loop
You can see that the setbased one is significantly faster than the generator expression with a small expected alphabet and success conditions. the generator expression is faster with failures because it can bail. This is pretty much whats to be expected so it's interesting to see the numbers back it up.
another possibility that I forgot about is the hybrid approach.
not all(letter in allowed for letter in set(text))
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 5.06 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 6.71 usec per loop
it slows down the best case-ish but speeds up the worst case-ish. All in all, you'd have to test the different possibilities over a sample of your expected input. the broader the sample, the better.
characters a -z are represented by bytes 97 - 122 and ord(char) returns the byte value of the character. Reading the file in binary and making the match should suffice.
f = open("myfile", "rb")
retVal = False
lowerAlphabets = range(97, 123)
try:
byte = f.read(1)
while byte != "":
# Do stuff with byte.
byte = f.read(1)
if byte:
if ord(byte) not in lowerAlphabets:
retVal = True
break
finally:
f.close()
if retVal:
print "characters not from a - z"
else:
print "characters from a - z"
import re
def matches(s):
return re.match("^[a-z,]*$", s) is not None
Which gives you:
>>> matches("tea and cakes")
False
>>> matches("twiddledee,twiddledum")
True
You can optimise a bit with re.compile:
import re
matcher = re.compile("^[a-z,]*$")
def matches(s):
return matcher.match(s) is not None
import re
if not re.search('[^a-z\,]', yourString):
# True: contains only a-z and comma
# False: contains also something else