ansaurus

Question

python validate text

Answer 1

+1 A:

Not sure what you mean by "contain", but this should point you in the right direction:

import re

reobj = re.compile(r"[a-z,]+")
match = reobj.search(subject)
if match:
    result = match.group()
else:
    result = ""
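Note that search() returns the first run of allowed characters found anywhere in the string, so the snippet above extracts text rather than validating it. A minimal sketch of the difference, assuming Python 3.4+ for re.fullmatch; the function names are illustrative and not part of the answer:

```python
import re

ALLOWED = re.compile(r"[a-z,]+")

def first_allowed_run(subject):
    """Return the first run of a-z/comma characters, or "" if there is none."""
    match = ALLOWED.search(subject)
    return match.group() if match else ""

def is_valid(subject):
    """Return True only if the whole string consists of a-z and commas."""
    return ALLOWED.fullmatch(subject) is not None

print(first_allowed_run("abc$def"))  # extracts "abc" even though "$" is present
print(is_valid("abc$def"))           # False: the string as a whole is rejected
print(is_valid("abc,def"))           # True
```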

splash 2010-09-22 07:43:28

Answer 2

A:

#!/usr/bin/env python

import string

text = 'aasdfadf$oih,234'

# Print every character that is neither a lowercase letter nor a comma.
for letter in text:
    if letter not in string.ascii_lowercase and letter != ',':
        print(letter)

infrared 2010-09-22 07:43:28

Answer 3

+14 A:

import string

allowed = set(string.lowercase + ',')  # string.ascii_lowercase on Python 3
if set(text) - allowed:
    pass  # text has forbidden characters
else:
    pass  # text has only allowed characters

Doing it with sets will be faster than doing it with for loops (especially if you want to check more than one text) and is altogether cleaner than regexes for this situation.
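A side benefit of the set difference is that it tells you which characters were forbidden, not just whether there were any. A small sketch, assuming string.ascii_lowercase so that it also runs on Python 3; forbidden_chars is a hypothetical helper name:

```python
import string

# string.ascii_lowercase is used instead of string.lowercase so the sketch
# also runs on Python 3 (an assumption; the answer targets Python 2).
ALLOWED = set(string.ascii_lowercase + ',')

def forbidden_chars(text):
    """Return the set of characters in text that are not a-z or a comma."""
    return set(text) - ALLOWED

for sample in ("twiddledee,twiddledum", "tea and cakes", "aasdfadf$oih,234"):
    bad = forbidden_chars(sample)
    print(sample, "->", "ok" if not bad else sorted(bad))
```

An empty result means the text is valid; anything else lists the offending characters.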

An alternative that might be faster than building two sets is:

allowed = string.lowercase + ','
if not all(letter in allowed for letter in text):
    pass  # text has forbidden characters

Here are some meaningless timeit results. one is the generator expression and two is the set-based solution.

$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 3.98 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 4.39 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 3.51 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 7.7 usec per loop

You can see that the set-based one is significantly faster than the generator expression when the expected alphabet is small and the text passes. The generator expression is faster on failures because it can bail out at the first forbidden character. This is pretty much what's to be expected, so it's interesting to see the numbers back it up.

Another possibility that I forgot about is the hybrid approach:

not all(letter in allowed for letter in set(text))

$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 5.06 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 6.71 usec per loop

It slows down the best case-ish but speeds up the worst case-ish. All in all, you'd have to test the different possibilities over a sample of your expected input. The broader the sample, the better.
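The scratch3 module used for the timings is not shown. Below is a plausible reconstruction for trying the comparison on your own inputs; the bodies of one, two and three are assumptions based on the descriptions above, and string.ascii_lowercase stands in for string.lowercase so the sketch runs on Python 3:

```python
import string
import timeit

# Assumed reconstruction: scratch3 itself is not shown in the answer.
allowed = string.ascii_lowercase + ','   # string.lowercase on Python 2
allowed_set = set(allowed)

def one(text):
    """Generator expression: stops at the first forbidden character."""
    return all(letter in allowed for letter in text)

def two(text):
    """Set difference: always processes the whole text."""
    return not (set(text) - allowed_set)

def three(text):
    """Hybrid: generator expression over the distinct characters only."""
    return all(letter in allowed for letter in set(text))

if __name__ == "__main__":
    samples = ["asdfas2423452345sdfadf34", "asdfasasdfadsfasdfasdfdaf"]
    for func in (one, two, three):
        for sample in samples:
            seconds = timeit.timeit(lambda: func(sample), number=10000)
            print(func.__name__, sample, "%.4f s per 10000 calls" % seconds)
```

All three return True for valid text. Absolute numbers will differ by machine and Python version, so only the relative ordering on your own sample inputs is worth reading.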

aaronasterling 2010-09-22 07:43:29

I personally like the second approach better.

Mark 2010-09-22 07:52:13

@Mark but see the timings. It depends on expected inputs.

aaronasterling 2010-09-22 08:05:51

nice, i really like the second approach

bronzebeard 2010-09-22 08:29:30

+1, no regex :)

Michał Niklas 2010-09-22 08:51:16

@Michał Niklas: This is exactly the kind of job regexes are made for. As you can see from Aaron's comment to Emile's answer, the regex version is several times faster.

Tim Pietzcker 2010-09-22 09:02:24

@Tim It's true, but in this case I still think that `not all(letter in allowed for letter in letters)` is better than regexes. I mean, that's just plain English, pretty much. It's incredibly clear. If it turned out to be too slow for the application, _then_ I would drop it in a heartbeat and use a regex. We're talking microseconds here.

aaronasterling 2010-09-22 09:22:04

Tim, regex was the first solution I was thinking of. Nice to see other, interesting alternatives. BTW I upvoted regex solutions too :)

Michał Niklas 2010-09-22 09:28:07

@AaronMcSmooth, correction: allowed = string.lower + ',' --> allowed = string.lowercase + ','

Avadhesh 2010-09-22 09:40:20

@Avadhesh, good looking out.

aaronasterling 2010-09-22 09:47:07

Answer 4

A:

Characters a-z are represented by bytes 97-122, and ord(char) returns the byte value of a character. Reading the file in binary mode and checking each byte should suffice.

f = open("myfile", "rb")
retVal = False
lowerAlphabets = range(97, 123)  # byte values of 'a'..'z'
try:
    byte = f.read(1)
    while byte:
        # Check the current byte before reading the next one,
        # so that the first byte of the file is not skipped.
        if ord(byte) not in lowerAlphabets:
            retVal = True
            break
        byte = f.read(1)
finally:
    f.close()

if retVal:
    print("characters not from a - z")
else:
    print("characters from a - z")

pyfunc 2010-09-22 07:47:58

or you could just set `lowerAlphabets = string.lowercase`

Mark 2010-09-22 16:45:05

Answer 5

+10 A:

import re
def matches(s):
    return re.match("^[a-z,]*$", s) is not None

Which gives you:

>>> matches("tea and cakes")
False
>>> matches("twiddledee,twiddledum")
True

You can optimise a bit with re.compile:

import re
matcher = re.compile("^[a-z,]*$")
def matches(s):
    return matcher.match(s) is not None
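One caveat: in Python, $ also matches just before a trailing newline, so the pattern above accepts "twiddledee\n". If that matters for your input, \Z anchors at the very end of the string instead. A short sketch of the difference; matches_loose and matches_strict are illustrative names:

```python
import re

loose = re.compile(r"^[a-z,]*$")    # pattern from the answer
strict = re.compile(r"[a-z,]*\Z")   # \Z matches only at the very end

def matches_loose(s):
    return loose.match(s) is not None

def matches_strict(s):
    return strict.match(s) is not None

print(matches_loose("twiddledee\n"))   # True: the trailing newline is accepted
print(matches_strict("twiddledee\n"))  # False: the trailing newline is rejected
```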

Emile 2010-09-22 07:53:24

+1. I don't like it as much as my second solution stylistically but the timing on the precompiled version of this is 3 to 7 times faster than my solutions.

aaronasterling 2010-09-22 08:32:01

Answer 6

+3 A:

import re

if not re.search(r'[^a-z,]', yourString):
    pass  # yourString contains only a-z and commas
else:
    pass  # yourString also contains something else
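Because the negated character class finds the offending character itself, the match object can also be used to report what was wrong and where. A minimal sketch; first_forbidden is a hypothetical helper, not part of the answer:

```python
import re

FORBIDDEN = re.compile(r"[^a-z,]")

def first_forbidden(s):
    """Return (index, character) of the first forbidden character, or None."""
    match = FORBIDDEN.search(s)
    if match is None:
        return None
    return match.start(), match.group()

print(first_forbidden("tea and cakes"))          # (3, ' ')
print(first_forbidden("twiddledee,twiddledum"))  # None
```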

eumiro 2010-09-22 07:55:49
