I am trying to loop through a set of documents and build a list of the words in each one. I am doing it like this, where stoplist
is just a list of words that I want to ignore by default.
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
This returns a list of documents, each of which is a list of words. Some of the words still contain trailing punctuation or other anomalies. I thought I could do this, but it doesn't seem to be working right:
texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
for document in documents]
Or
texts = [[word.rstrip('.,:!?:') for word in document.lower().split() if word not in stoplist]
for document in documents]
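To make the problem concrete, here is a small self-contained check of both attempts. The sample sentence and the stoplist are made up for illustration. As far as I can tell, `rstrip()` with no argument only removes whitespace, which `split()` has already dropped, and in both versions the stoplist test runs on the word before it has been stripped.

```python
# Made-up sample data, only to reproduce the behaviour.
stoplist = {"the", "of", "for"}
documents = ["What is this for? The agency[15], of course."]

# Attempt 1: rstrip() with no argument strips whitespace only.
plain = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Attempt 2: strips the listed punctuation, but only after the stoplist test.
punct = [[word.rstrip('.,:!?') for word in document.lower().split() if word not in stoplist]
         for document in documents]

print(plain)  # [['what', 'is', 'this', 'for?', 'agency[15],', 'course.']]
print(punct)  # [['what', 'is', 'this', 'for', 'agency[15]', 'course']]
```

In the second result "for" gets through even though it is in the stoplist, because the filter saw "for?" rather than "for", and "agency[15]" keeps its bracketed number.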
My other question is this: I may see words like the following, where I want to keep the word but drop the trailing numbers / special characters.
agency[15]
assignment[72],
you’ll
america’s
So to clean up most of the other noise, I was thinking I should keep removing characters from the end of a string until the last character is a-zA-Z, or, if there are more special characters than alpha characters in a string, toss it. You can see, though, that in my last two examples the end of the string is an alpha character. In those cases I should just ignore the word because of the number of special characters (more than alpha characters). I was thinking I should only search the end of strings because I would like to keep hyphenated words intact if possible.
Basically, I want to remove all trailing punctuation from each word, and possibly add a subroutine that handles the cases I just described. I am not sure how to do that or if it's the best way.
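For reference, here is a rough sketch of the rule I have in mind. The names `clean_word` and `tokenize` are placeholders I made up, and the "more special characters than letters" cutoff is just my guess at a reasonable threshold, so please treat this as an illustration of the idea rather than something I know to be correct.

```python
import re

# Matches a run of anything that is not a-z/A-Z at the very end of the string.
TRAILING_NON_ALPHA = re.compile(r"[^a-zA-Z]+$")


def clean_word(word):
    """Strip trailing non-letters; return None if the word should be tossed."""
    stripped = TRAILING_NON_ALPHA.sub("", word)
    letters = len(re.findall(r"[a-zA-Z]", stripped))
    specials = len(stripped) - letters
    # Assumed rule: toss words with no letters or more specials than letters.
    if letters == 0 or specials > letters:
        return None
    return stripped


def tokenize(documents, stoplist):
    """Lowercase and split each document, clean each word, then apply the stoplist."""
    texts = []
    for document in documents:
        cleaned = (clean_word(word) for word in document.lower().split())
        texts.append([word for word in cleaned if word and word not in stoplist])
    return texts


print(clean_word("agency[15]"))       # agency
print(clean_word("assignment[72],"))  # assignment
print(clean_word("well-known."))      # well-known
print(clean_word("[15]"))             # None
```

Because only the end of the string is trimmed, hyphenated words stay intact, and the stoplist is checked after cleaning so that a stop word followed by punctuation is still filtered out.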