I wrote a function in Python that tells me whether two words are similar.

Now I want to pass Japanese text to the same function, but it fails with a "non-ASCII character" error. I tried using UTF-8 encoding, but I still get the same error:

Non-ASCII character '\xe3' in file

Is there any way to do this? I can't generate a msg file for it, since the two keywords will not be constant.

Here is the code:

def filterKeyword(keyword, adText, filterType):
    if (filterType == 'contains'):
        try :
            adtext = str.lower(adText)
            keyword = str.lower(keyword)
            if (adtext.find(keyword)!=-1):
                return '0'
        except:
            return '1'
    if (filterType == 'exact'):
        var = cmp(str.lower(adText), str.lower(keyword))
        if(var == 0 ):
            return '0'

    return '1'

I have used the following:

filterKeyword(unicode('ポケモン').encode("utf-8"), unicode('黄色のポケモン').encode("utf-8"), 'contains')

filterKeyword('ポケモン'.encode("utf-8"), '黄色のポケモン'.encode("utf-8"), 'contains')

Both of them are giving the error.

A: 

Don't use UTF-8. Use unicodes.

Ignacio Vazquez-Abrams
UTF-8 is a representation of unicode, so that doesn't make sense. Also, your link is just the opening slide and doesn't have any useful info.
Michael Aaron Safyan
@Michael: It only doesn't make sense if you haven't read the presentation.
Ignacio Vazquez-Abrams
but it is not working for me
ha22109
A: 

Put:

# -*- coding: utf-8 -*-

in one of the first two lines of your script. This way the interpreter knows which encoding is used for the source code and the string literals in it.

Also use Unicode strings wherever possible. With luck, the function will work with Unicode arguments (e.g. u"something…" instead of "something...") even if it was not written with Unicode in mind.
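As a rough illustration of the difference between a Unicode string and its UTF-8 bytes, here is a minimal sketch. The variable names text and data are made up for this example, and it assumes the file is saved as UTF-8:

```python
# -*- coding: utf-8 -*-
# Sketch: the same word as Unicode text and as UTF-8 encoded bytes.
# "text" and "data" are illustrative names, not part of the question's code.
text = u'ポケモン'              # Unicode string: 4 characters
data = text.encode('utf-8')     # byte string: each of these characters takes 3 bytes

print(len(text))   # 4
print(len(data))   # 12
```

The first byte of data is 0xe3, which is the byte the error message complains about.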

Jacek Konieczny
It is giving me the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
ha22109
+1  A: 

Please do not do this:

adtext = str.lower(adText)
keyword = str.lower(keyword)

Please do this:

adtext= adText.lower()
keyword = keyword.lower()

Please do not do this:

cmp(str.lower(adText), str.lower(keyword))

Please do this:

return adText.lower() == keyword.lower()

Please do not do this:

try:
    # something
except:
    # handler

Please name a specific exception. A general superclass like Exception is fine. A bare except also catches things such as SystemExit and KeyboardInterrupt, which you cannot meaningfully handle.

try:
    # something
except Exception:
    # handler

Also, it's really unlikely that catching an exception would return True.

Please do not do this:

return '1' 
return '0'

It's unlikely you want to return a character. It's more likely you want to return True or False.

return True
return False

Your code will work, if you do things properly.

>>> u'ポケモン'.lower() == u'黄色のポケモン'.lower()
False
>>> u'ポケモン'.lower() in  u'黄色のポケモン'.lower()
True
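Putting the points above together, one possible rewrite might look like the sketch below. The name filter_keyword and the choice to return True for a match (the original returns '0' for a match) are assumptions made for illustration, and both arguments are assumed to be Unicode strings already:

```python
# -*- coding: utf-8 -*-
# Sketch only: filter_keyword is a hypothetical rewrite, not the asker's function.
# Assumes keyword and ad_text are Unicode strings (u'...'), not encoded bytes.

def filter_keyword(keyword, ad_text, filter_type):
    """Return True when ad_text matches keyword under filter_type."""
    keyword = keyword.lower()
    ad_text = ad_text.lower()
    if filter_type == 'contains':
        return keyword in ad_text
    if filter_type == 'exact':
        return keyword == ad_text
    # Unknown filter types count as "no match", like the original fall-through.
    return False


print(filter_keyword(u'ポケモン', u'黄色のポケモン', 'contains'))  # True
print(filter_keyword(u'ポケモン', u'黄色のポケモン', 'exact'))     # False
```

No try/except is needed here, because nothing in the function decodes or encodes anything.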
S.Lott
You missed the bare `except`.
Ignacio Vazquez-Abrams
The code only half works. It raises an exception when it reaches if (adtext.find(keyword)!=-1):
ha22109
@ha22109. I posted the `in` operator for a reason. What do you think that reason is?
S.Lott
+2  A: 

This worked for me:

# -*- coding: utf-8 -*-

def filterKeyword(keyword, adText, filterType):
    # same as yours

filterKeyword(u'ポケモン', u'黄色のポケモン', 'contains')
Daniel Stutzbach
thanks, it worked
ha22109
A: 

I would just like to point out that:

unicode('ポケモン') (a non-unicode string constant passed to the unicode() constructor)

IS NOT THE SAME AS

u'ポケモン' (a unicode string constant)
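The reason, as far as I can tell, is that in Python 2 unicode() called on a byte string without an encoding argument decodes it with the default codec, which is normally ASCII. A small sketch of that decoding step follows; the names raw, ascii_ok and text are made up for this example:

```python
# -*- coding: utf-8 -*-
# Sketch: the UTF-8 bytes of the first character of the keyword.
# 0xe3 is the byte named in the error message.
raw = b'\xe3\x83\x9d'

try:
    raw.decode('ascii')        # roughly what Python 2's unicode(raw) attempts
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

text = raw.decode('utf-8')     # naming the real encoding succeeds

print(ascii_ok)                # False
print(text == u'\u30dd')       # True
```

A u'...' literal avoids the problem because it never goes through this decoding step at run time.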

Joe Koberg