Hello,
I'm working on an Information Retrieval task. As part of pre-processing, I want to do the following (a rough sketch of the pipeline follows the list):
- Stopword removal
- Tokenization
- Stemming (Porter Stemmer)
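For context, here is roughly the pipeline I have in mind, sketched with NLTK's stopword list and PorterStemmer (this assumes the nltk package and its stopwords corpus are installed; the tokenization step is only a naive placeholder, which is exactly the part I'm unsure about):

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Placeholder tokenization: naive whitespace split (the weak step).
    tokens = text.lower().split()
    # Stopword removal against NLTK's English list.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Porter stemming.
    return [stemmer.stem(t) for t in tokens]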
Initially, I skipped tokenization, and as a result I got terms like these:
broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.
So now I realize the importance of tokenization. Is there a standard tokenization algorithm for English? Based on string.whitespace and commonly used punctuation marks, I wrote:
def Tokenize(text):
    words = text.split(['.', ',', '?', '!', ':', ';', '-', '_', '(', ')', '[', ']', '\'', '`', '"', '/', ' ', '\t', '\n', '\x0b', '\x0c', '\r'])
    return [word.strip() for word in words if word.strip() != '']
I'm getting this error:

TypeError: coercing to Unicode: need string or buffer, list found

How can this tokenization routine be improved?
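From the error, I gather that str.split accepts only a single separator string, not a list of separators. Here is a re-based sketch of what I'm considering instead; the \s class covers everything in string.whitespace, so the whitespace characters don't need to be listed individually:

import re

# One character class holding the punctuation above; \s matches all of
# string.whitespace (' \t\n\x0b\x0c\r').
SEPARATORS = re.compile(r"[.,?!:;\-_()\[\]'`\"/\s]+")

def tokenize(text):
    # Split on any run of separators, then drop the empty strings
    # that re.split leaves at the edges.
    return [token for token in SEPARATORS.split(text) if token]

print(tokenize("broker/dealers, (brokers)."))  # -> ['broker', 'dealers', 'brokers']

Is something like this reasonable, or would a standard off-the-shelf tokenizer (e.g., NLTK's word_tokenize) be the better route?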