views:

105

answers:

5

Sorry if the question is bit confusing. This is similar to this question

I think this the above question is close to what I want, but in Clojure.

There is another question

I need something like this but instead of '[br]' in that question, there is a list of strings that need to be searched and removed.

Hope I made myself clear.

I think that this is due to the fact that strings in python are immutable.

I have a list of noise words that need to be removed from a list of strings.

If I use the list comprehension, I end up searching the same string again and again. So, only "of" gets removed and not "the". So my modified list looks like this

places = ['New York', 'the New York City', 'at Moscow' and many more]

noise_words_list = ['of', 'the', 'in', 'for', 'at']

for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

I would like to know as to what mistake I'm doing.

+1  A: 

Here is my stab at it. This uses regular expressions.

import re
pattern = re.compile("(of|the|in|for|at)\W", re.I)
phrases = ['of New York', 'of the New York']
map(lambda phrase: pattern.sub("", phrase),  phrases) # ['New York', 'New York']

Sans lambda:

[pattern.sub("", phrase) for phrase in phrases]

Update

Fix for the bug pointed out by gnibbler (thanks!):

pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']

@prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify run both versions of the regular expressions against the phrase "Spain has rain".

Manoj Govindan
Thanks. It works this way. I was able to understand the concept of lambda more clearly now as I got a chance to implement this.
prabhu
This doesn't work properly for the phrase "Spain has rain". It's easy to fix though
gnibbler
@Gnibbler: thanks for pointing it out. Am changing my answer accordingly.
Manoj Govindan
works only for strings with "two words" or more
killown
A: 

We need to see the full code to be sure -- in the sample you have given us, you have not defined place. The error is probably due to incorrectly iterating over place.

katrielalex
+1  A: 
>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']
gnibbler
Wow! That is a real cool way of doing, though I strained my brain. :-)
prabhu
+1  A: 

Since you would like to know what you are doing wrong, this line:

stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

takes place, and then begins to loop over words. First it checks for "of". Your place (e.g. "of the New York") is checked to see if it starts with "of". It is transformed (call to replace and strip) and added to the result list. The crucial thing here is that result is never examined again. For every word you iterate over in the comprehension, a new result is added to the result list. So the next word is "the" and your place ("of the New York") doesn't start with "the", so no new result is added.

I assume the result you got eventually is the concatenation of your place variables. A simpler to read and understand procedural version would be (untested):

results = []
for place in places:
    for word in words:
        if place.startswith(word):
            place = place.replace(word, "").strip()
    results.append(place)

Keep in mind that replace() will remove the word anywhere in the string, even if it occurs as a simple substring. You can avoid this by using regexes with a pattern something like ^the\b.

wds
Thanks. That was very helpful.
prabhu
+3  A: 

Without regexp you could do like this:

places = ['of New York', 'of the New York']

noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places
         ]
print stuff
Tony Veijalainen
Excellent! Thank you!
prabhu