I am trying to loop through a set of documents and build, for each document, a list of its words. This is how I am doing it; stoplist is just a list of words that I want to ignore by default.

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

I get back a list of documents, each of which is a list of words. Some of the words still contain trailing punctuation or other anomalies. I thought I could do this, but it doesn't seem to work:

texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
         for document in documents]

Or

texts = [[word.rstrip('.,:!?') for word in document.lower().split() if word not in stoplist]
         for document in documents]
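
A likely reason neither attempt behaves as hoped: rstrip() with no argument removes only whitespace, which split() has already consumed, and in both versions the stoplist test runs on the unstripped word, so a token like "the," is not recognised as a stop word. A minimal sketch that strips first and filters afterwards, assuming Python 3; the sample documents, the stoplist contents and the PUNCT constant are made up for illustration:

```python
# Hypothetical sample data standing in for the real documents and stoplist.
documents = ["The agency, the assignment!", "Is it self-paced?"]
stoplist = {"the", "is", "it"}

# Assumed set of trailing punctuation to remove; extend as needed.
PUNCT = ".,:;!?"

# rstrip() without arguments only strips whitespace, so this is a no-op.
unchanged = "agency,".rstrip()  # still "agency,"

# Strip the punctuation first, then test the cleaned word against the stoplist.
texts = [
    [word
     for word in (raw.rstrip(PUNCT) for raw in document.lower().split())
     if word and word not in stoplist]
    for document in documents
]
print(texts)  # [['agency', 'assignment'], ['self-paced']]
```

The inner generator expression cleans each token once, so the same cleaned value is used for both the stoplist test and the result.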

My other question is this. I may see words like these, where I want to keep the word but drop the trailing numbers and special characters:

agency[15]
assignment[72],
you’ll
america’s

To clean up most of the other noise, I was thinking I should keep removing characters from the end of a string until the last character is in a-zA-Z, or toss the string if it has more special characters than alphabetic ones. In my last two examples, though, the string already ends in an alphabetic character, so in those cases I should just ignore the word because of the number of special characters (more than alphabetic characters). I only want to look at the end of each string because I would like to keep hyphenated words intact if possible.

Basically, I want to remove all trailing punctuation from each word, and possibly add a subroutine that handles the cases I just described. I am not sure how to do that, or whether it's the best way.
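
The heuristic described above could be sketched roughly as follows. This is only one literal reading of the rule, assuming Python 3; clean_word and TRAILING_NOISE are hypothetical names, and the "more special characters than letters" threshold is an assumption to tune. Under this reading a word like you’ll has five letters and one special character, so it is kept rather than discarded.

```python
import re

# A trailing run of anything that is not an ASCII letter.
TRAILING_NOISE = re.compile(r"[^a-zA-Z]+$")

def clean_word(word):
    """Return the word without trailing noise, or None if it should be tossed."""
    stripped = TRAILING_NOISE.sub("", word)
    letters = len(re.findall(r"[a-zA-Z]", stripped))
    specials = len(stripped) - letters
    # Toss words with no letters at all, or with more specials than letters.
    if letters == 0 or specials > letters:
        return None
    return stripped

for sample in ["agency[15]", "assignment[72],", "self-paced.", "[[[x", "[15]"]:
    print(sample, "->", clean_word(sample))
```

Because only the end of the string is trimmed, internal hyphens as in self-paced survive untouched.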

+1  A: 

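Maybe try re.finditer or re.findall instead, with a pattern like [a-z]+. Both versions below produce the same result:

import re
word_re = re.compile(r'[a-z]+')

# Version 1: finditer yields match objects, so call group(0) to get the text.
texts = [[match.group(0) for match in word_re.finditer(document.lower()) if match.group(0) not in stoplist]
          for document in documents]

# Version 2: findall returns the matched strings directly.
texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
          for document in documents]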

You can then easily tweak your regular expression to get the words you want. An alternate version uses re.split, where the extra "if word" check drops the empty strings that split produces at the edges:

import re
word_re = re.compile(r'[^a-z]+')
texts = [[word for word in word_re.split(document.lower()) if word and word not in stoplist]
          for document in documents]
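
As one example of tweaking the pattern, a plain [a-z]+ splits you’ll into "you" and "ll". A possible variant that keeps such contractions together is sketched below, assuming Python 3 (where strings are Unicode) and made-up sample data; the group is non-capturing so that findall still returns whole matches:

```python
import re

# Letters, optionally followed by apostrophe-plus-letters groups.
# Both the straight (') and curly (’) apostrophes are accepted.
word_re = re.compile(r"[a-z]+(?:['’][a-z]+)*")

# Hypothetical sample data standing in for the real documents and stoplist.
documents = ["You’ll see America’s agency[15] report."]
stoplist = {"see"}

texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
         for document in documents]
print(texts)  # [['you’ll', 'america’s', 'agency', 'report']]
```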
Radomir Dopieralski
I got an error on the first one "AttributeError: 'str' object has no attribute 'group'" and "UnboundLocalError: local variable 'word' referenced before assignment" on your second example.
Hallik
I'm sorry, I corrected the examples, they should run fine now.
Radomir Dopieralski
+1  A: 
>>> a = ['agency[15]','assignment72,','you’11','america’s']
>>> import re
>>> b = re.compile(r'\w+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment72
you
america
>>> b = re.compile('[a-z]+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment
you
america
>>>

Update

>>> a = "I-have-hyphens-yo!"
>>> re.findall('[a-z]+',a)
['have', 'hyphens', 'yo']
>>> re.findall('[a-z-]+',a)
['-have-hyphens-yo']
>>> re.findall('[a-zA-Z-]+',a)
['I-have-hyphens-yo']
>>> re.findall(r'\w+',a)
['I', 'have', 'hyphens', 'yo']
>>>
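
Note that [a-z-]+ also accepts leading or trailing hyphens, as the '-have-hyphens-yo' result shows. A possible refinement, offered as a sketch rather than a definitive pattern, only allows a hyphen between two runs of letters (the name hyphenated and the second sample string are made up for illustration):

```python
import re

# One run of letters, then any number of hyphen-plus-letters groups.
# A hyphen is only matched when letters follow it.
hyphenated = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z]+)*")

print(hyphenated.findall("I-have-hyphens-yo!"))
# ['I-have-hyphens-yo']
print(hyphenated.findall("-self-paced- counter-intelligence, --"))
# ['self-paced', 'counter-intelligence']
```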
Robus
What about words that have hyphens in them? I would like to keep those words intact if possible. Examples might be self-paced, counter-intelligence, etc.
Hallik
Updated with capital letters/hyphens
Robus
This works perfect, thank you!
Hallik
You should be using `re.match`.
SilentGhost