I need to check that a text contains only lowercase letters a-z and commas.

What is the best way to do this in Python?

+1  A: 

I'm not sure what you mean by "contain", but this should point you in the right direction:

import re

reobj = re.compile(r"[a-z,]+")
# search() finds the first run of allowed characters anywhere in subject
match = reobj.search(subject)
if match:
    result = match.group()
else:
    result = ""
splash
A: 
#!/usr/bin/env python

import string

text = 'aasdfadf$oih,234'

# print every character that is not a lowercase letter or a comma
for letter in text:
    if letter not in string.ascii_lowercase and letter != ',':
        print(letter)
infrared
+14  A: 
import string

# ascii_lowercase works on Python 2 and 3 (string.lowercase is Python 2 only)
allowed = set(string.ascii_lowercase + ',')
if set(text) - allowed:
    pass  # you know it has forbidden characters
else:
    pass  # it doesn't have forbidden characters

Doing it with sets will be faster than doing it with for loops (especially if you want to check more than one text), and it is altogether cleaner than regexes for this situation.

An alternative that might be faster than two sets is

# ascii_lowercase works on Python 2 and 3 (string.lowercase is Python 2 only)
allowed = string.ascii_lowercase + ','
if not all(letter in allowed for letter in text):
    pass  # you know it has forbidden characters

Here are some meaningless mtimeit results. `one` is the generator expression and `two` is the set-based solution.

$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 3.98 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 4.39 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 3.51 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 7.7 usec per loop

You can see that the set-based one is significantly faster than the generator expression with a small expected alphabet and success conditions. The generator expression is faster with failures because it can bail out at the first forbidden character. This is pretty much what's to be expected, so it's interesting to see the numbers back it up.
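The scratch3 module used for these timings is not shown in the thread. A minimal sketch of what it might have contained, assuming `one` and `two` each return True when the text is clean (the return convention and the constant names are my assumptions):

```python
import string
import timeit

# Assumed constants; the original module may have been laid out differently.
ALLOWED = string.ascii_lowercase + ","
ALLOWED_SET = set(ALLOWED)


def one(text):
    """Generator expression: stops at the first forbidden character."""
    return all(letter in ALLOWED for letter in text)


def two(text):
    """Set difference: always builds a set from the whole text."""
    return not (set(text) - ALLOWED_SET)


if __name__ == "__main__":
    samples = ("asdfas2423452345sdfadf34", "asdfasasdfadsfasdfasdfdaf")
    for sample in samples:
        for func in (one, two):
            # Time 100000 calls, the same loop count mtimeit reported.
            seconds = timeit.timeit(lambda: func(sample), number=100000)
            print(func.__name__, repr(sample), round(seconds, 4))
```

Absolute numbers will differ by machine and Python version, so only the relative ordering is worth comparing.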

Another possibility that I forgot about is the hybrid approach.

not all(letter in allowed for letter in set(text))

$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 5.06 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 6.71 usec per loop

Compared with the plain generator expression, it slows down the best case (an early failure) but speeds up the worst case (a full scan of a valid text). All in all, you'd have to test the different possibilities over a sample of your expected input. The broader the sample, the better.
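To make "test over a sample of your expected input" concrete, here is a rough sketch. The helper `time_over_sample` and the sample list are hypothetical and only illustrate the idea; `three` is my guess at how the hybrid line above would be wrapped in a function.

```python
import string
import timeit

ALLOWED = string.ascii_lowercase + ","


def three(text):
    """Hybrid: deduplicate the text first, then test each distinct character."""
    return all(letter in ALLOWED for letter in set(text))


def time_over_sample(func, samples, number=10000):
    """Return the total seconds needed to run func over all samples `number` times."""
    return timeit.timeit(lambda: [func(s) for s in samples], number=number)


if __name__ == "__main__":
    # Hypothetical sample; replace it with texts typical of your application.
    samples = ["asdfasasdfadsfasdfasdfdaf", "asdfas2423452345sdfadf34", "a,b,c"]
    print("three:", time_over_sample(three, samples))
```

Passing each candidate function through the same helper with the same sample gives directly comparable totals.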

aaronasterling
I personally like the second approach better.
Mark
@Mark but see the timings. It depends on expected inputs.
aaronasterling
nice, i really like the second approach
bronzebeard
+1, no regex :)
Michał Niklas
@Michał Niklas: This is exactly the kind of job regexes are made for. As you can see from Aaron's comment to Emile's answer, the regex version is several times faster.
Tim Pietzcker
@Tim It's true, but in this case I still think that `not all(letter in allowed for letter in letters)` is better than regexes. I mean, that's just plain English, pretty much. It's incredibly clear. If it turned out to be too slow for the application, _then_ I would drop it in a heartbeat and use a regex. We're talking microseconds here.
aaronasterling
Tim, regex was the first solution I was thinking of. Nice to see other, interesting alternatives. BTW I upvoted regex solutions too :)
Michał Niklas
@AaronMcSmooth, correction: allowed = string.lower + ',' --> allowed = string.lowercase + ','
Avadhesh
@Avadhesh, good looking out.
aaronasterling
A: 

The characters a-z are represented by the byte values 97-122, and ord(char) returns the byte value of a character. Reading the file in binary mode and checking each byte should suffice.

retVal = False
lowerAlphabets = range(97, 123)  # byte values of 'a'..'z'
# read the file one byte at a time in binary mode
with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # check the current byte before reading the next one
        if ord(byte) not in lowerAlphabets:
            retVal = True
            break
        byte = f.read(1)

if retVal:
    print("characters not from a - z")
else:
    print("characters from a - z")
pyfunc
or you could just set `lowerAlphabets = string.lowercase`
Mark
+10  A: 
import re
def matches(s):
    return re.match("^[a-z,]*$", s) is not None

Which gives you:

>>> matches("tea and cakes")
False
>>> matches("twiddledee,twiddledum")
True

You can optimise a bit with re.compile:

import re
matcher = re.compile("^[a-z,]*$")
def matches(s):
    return matcher.match(s) is not None
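
One detail worth knowing: `$` also matches just before a trailing newline, so the pattern above accepts "abc\n". If that matters, a stricter variant can be sketched with `fullmatch` (Python 3.4+). The names `STRICT` and `matches_strict` are mine, not part of the answer above:

```python
import re

# No anchors needed: fullmatch() only succeeds if the whole string matches.
STRICT = re.compile(r"[a-z,]*")


def matches_strict(s):
    """True only if every character of s is a lowercase a-z letter or a comma."""
    return STRICT.fullmatch(s) is not None
```

With this variant, `matches_strict("abc\n")` is False, while the empty string still counts as valid.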
Emile
+1. I don't like it as much as my second solution stylistically but the timing on the precompiled version of this is 3 to 7 times faster than my solutions.
aaronasterling
+3  A: 
import re

# search for any character that is NOT a-z or a comma
if not re.search(r'[^a-z,]', yourString):
    print("contains only a-z and comma")
else:
    print("contains something else as well")
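
The same negated character class can also report which characters are the problem. A small sketch, with `FORBIDDEN` and `forbidden_chars` as hypothetical names:

```python
import re

# Matches any single character that is not a-z and not a comma.
FORBIDDEN = re.compile(r"[^a-z,]")


def forbidden_chars(s):
    """Return the distinct forbidden characters found in s, sorted."""
    return sorted(set(FORBIDDEN.findall(s)))
```

An empty list means the string contains only a-z and commas.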
eumiro