Hello,
I'm working on an Information Retrieval task. As part of pre-processing, I want to do the following (a rough sketch of the pipeline follows the list):
- Stopword removal
- Tokenization
- Stemming (Porter Stemmer)
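For context, here is roughly the pipeline I have in mind, sketched with NLTK's stopword list and PorterStemmer (this assumes the nltk package and its stopwords corpus are installed; the tokenization step is only a naive placeholder, which is exactly the part I'm unsure about):

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Placeholder tokenization: naive whitespace split (the weak step).
    tokens = text.lower().split()
    # Stopword removal against NLTK's English list.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Porter stemming.
    return [stemmer.stem(t) for t in tokens]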
Initially, I skipped tokenization, and as a result I got terms like these:
broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.
So now I realize the importance of tokenization. Is there a standard tokenization algorithm for English? Based on string.whitespace and commonly used punctuation marks, I wrote:
def Tokenize(text):
    words = text.split(['.', ',', '?', '!', ':', ';', '-', '_', '(', ')', '[', ']', '\'', '`', '"', '/', ' ', '\t', '\n', '\x0b', '\x0c', '\r'])
    return [word.strip() for word in words if word.strip() != '']
I'm getting this error:

TypeError: coercing to Unicode: need string or buffer, list found

How can this tokenization routine be improved?
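From the error, I gather that str.split accepts only a single separator string, not a list of separators. Here is a re-based sketch of what I'm considering instead; the \s class covers everything in string.whitespace, so the whitespace characters don't need to be listed individually:

import re

# One character class holding the punctuation above; \s matches all of
# string.whitespace (' \t\n\x0b\x0c\r').
SEPARATORS = re.compile(r"[.,?!:;\-_()\[\]'`\"/\s]+")

def tokenize(text):
    # Split on any run of separators, then drop the empty strings
    # that re.split leaves at the edges.
    return [token for token in SEPARATORS.split(text) if token]

print(tokenize("broker/dealers, (brokers)."))  # -> ['broker', 'dealers', 'brokers']

Is something like this reasonable, or would a standard off-the-shelf tokenizer (e.g., NLTK's word_tokenize) be the better route?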