Hello,

I am fighting with Python to understand how to check whether a string is ASCII or not.

I am aware of ord(), however when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understand this is caused by the way I built Python (as explained in ord()'s documentation).

So my question is simple: is there another way to check for this?

A: 

You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.

Steve Moyer
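A note on the regex route: Python's standard `re` module does not implement POSIX bracket classes such as [[:ascii:]], but an explicit character range gives the same effect. A minimal sketch, assuming Python 3 (where `str` is text) and a helper name `is_ascii_re` that is made up for illustration:

```python
import re

# Matches only if every character is in the 7-bit ASCII range U+0000..U+007F.
_ASCII_ONLY = re.compile(r'[\x00-\x7f]*')

def is_ascii_re(text):
    """Return True if text (a str) contains only ASCII characters."""
    # fullmatch requires the whole string to match, not just a prefix.
    return _ASCII_ONLY.fullmatch(text) is not None
```

An empty string counts as ASCII here, since there is nothing in it that falls outside the range.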
+24  A: 

I think you are not asking the right question--

A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "It was not an ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"
Vincent Marchetti
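The try/except idea above can be wrapped in a small function. This is only a sketch and it assumes Python 3, where byte strings (`bytes`) are decoded and text strings (`str`) are encoded; the name `is_ascii` and the decision to accept both types are my own choices, not something from the answer:

```python
def is_ascii(value):
    """Return True if value (bytes or str) holds only ASCII data."""
    try:
        if isinstance(value, bytes):
            value.decode('ascii')   # fails on any byte >= 0x80
        else:
            value.encode('ascii')   # fails on any code point >= U+0080
    except UnicodeError:            # parent of UnicodeDecodeError/UnicodeEncodeError
        return False
    return True
```

As the answer says, a True result only means the data is consistent with ASCII; it does not tell you how the data was originally produced.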
+3  A: 
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
Alexander Kojevnikov
Pointlessly inefficient. Much better to try s.decode('ascii') and catch UnicodeDecodeError, as suggested by Vincent Marchetti.
ddaa
It's not inefficient. all() will short-circuit and return False as soon as it encounters an invalid byte.
John Millikin
Inefficient or not, the more pythonic method is the try/except.
Jeremy Cantrell
It is inefficient compared to the try/except. Here the loop is in the interpreter. With the try/except form, the loop is in the C codec implementation called by str.decode('ascii'). And I agree, the try/except form is more pythonic too.
ddaa
-1 Not only is the loop over Python code instead of C code, but also there's a Python function call `ord(c)` -- UGLY -- at the very least use `c <= "\x7F"` instead.
John Machin
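For anyone who wants to measure the difference discussed in these comments rather than argue about it, here is a rough sketch. It assumes Python 3, where iterating over `bytes` yields integers, so no `ord()` call is needed; the function names and the sample size are arbitrary, and the timings will vary by machine:

```python
import timeit

def is_ascii_loop(data):
    # The loop runs in the interpreter; all() stops at the first byte >= 128.
    return all(byte < 128 for byte in data)

def is_ascii_codec(data):
    # The loop runs inside the C implementation of the ASCII codec.
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True

if __name__ == '__main__':
    sample = b'x' * 10000  # worst case for both: no early exit
    for fn in (is_ascii_loop, is_ascii_codec):
        seconds = timeit.timeit(lambda: fn(sample), number=100)
        print(fn.__name__, seconds)
```

Both functions should give the same answers; only the place where the loop runs differs.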
+2  A: 

How about doing this?

def isAscii(s):
    # ASCII is every code value below 128, not just the letters.
    for c in s:
        if ord(c) > 127:
            return False
    return True
miya
+2  A: 

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points. But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single code point.

Try this, instead:

>>> ord(u'é')
233

That tells you which code point "é" represents.

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'
Glyph
'é' does *not* necessarily represent a single code point. It could be *two* code points (U+0065 + U+0301).
J.F. Sebastian
Each abstract character is *always* represented by a single code point. However, code points may be encoded to multiple bytes, depending on the encoding scheme. i.e., 'é' is two bytes in UTF-8 and UTF-16, and four bytes in UTF-32, but it is in each case still a single code point — U+00E9.
Ben Blank
@Ben Blank: U+0065 and U+0301 *are* code points and they *do* represent 'é' which can *also* be represented by U+00E9. Google "combining acute accent".
J.F. Sebastian
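The two representations argued about in these comments can be checked directly with the standard `unicodedata` module. A small sketch, assuming Python 3 string literals; the helper names `code_points` and `same_text` are made up for illustration:

```python
import unicodedata

composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT, two code points

def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return ['U+%04X' % ord(ch) for ch in text]

def same_text(a, b):
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# Encoded sizes of the single code point U+00E9 (the -le codecs add no BOM).
sizes = {enc: len(composed.encode(enc))
         for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
```

The two strings compare unequal as they are, but equal after normalization, so both comments have a point: U+00E9 is one code point, and the same visible character can also be spelled with two. Neither form is ASCII.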
+1  A: 

A string (str-type) in Python is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.

JacquesB
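A sketch of the decode-then-check approach described above, assuming Python 3 `bytes` input and an encoding the caller already knows; the function name `non_ascii_chars` is hypothetical:

```python
def non_ascii_chars(data, encoding):
    """Decode bytes with a known encoding and return the non-ASCII characters."""
    text = data.decode(encoding)
    # Anything above U+007F is outside the ASCII range.
    return [ch for ch in text if ord(ch) > 127]

# The same two bytes mean different things under different encodings:
as_utf8 = non_ascii_chars(b'caf\xc3\xa9', 'utf-8')      # one character, U+00E9
as_latin1 = non_ascii_chars(b'caf\xc3\xa9', 'latin-1')  # two characters, U+00C3 and U+00A9
```

This is why the encoding has to come from somewhere outside the byte string itself.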
A: 

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>> 

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
-1 AARRGGHH this is treating all characters with ord(c) in range(128, 256) as ASCII!!!
John Machin
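To illustrate the objection above: the type of a byte string says nothing about its content, so the bytes themselves have to be inspected. A minimal sketch, with the function name `is_ascii_bytes` chosen here for illustration; it wraps the input in `bytearray` so that iteration yields integers:

```python
def is_ascii_bytes(data):
    """Check the content of a byte string, not its type."""
    return all(byte < 128 for byte in bytearray(data))

plain = b'test string'
latin1 = b'caf\xe9'   # 0xE9 is in range(128, 256), so this is not ASCII

# Both values have the same type, but only one of them is ASCII.
same_type = type(plain) is type(latin1)
```

On Python 3.7 and later, `bytes.isascii()` and `str.isascii()` do this check directly.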