views: 105
answers: 8

I've run into a problem. I'm trying to write a program that highlights occurrences of a word or phrase inside another string, but only when the occurrence is an exact, whole-word match. The part I'm having trouble with is identifying whether the subphrase I'm matching is merely contained within a larger word or subphrase.

A quick example which shows this problem:

>>> indicators = ["therefore", "for", "since"]
>>> phrase = "... therefore, I conclude I am awesome."
>>> indicators_in_phrase = [indicator for indicator in indicators 
                            if indicator in phrase.lower()]
>>> print indicators_in_phrase
['therefore', 'for']

I do not want 'for' included in that list. I know why it is being included, but I can't think of any expression that could filter out substrings like that.

I've noticed other similar questions on the site, but each involves a regex solution, which is something I'm not comfortable with yet, especially not in Python. Is there a reasonably easy way to solve this problem without using a regex? If not, the corresponding regex and how it might be used in the above example would be very much appreciated.

+5  A: 

There are ways to do it without a regex, but most of those ways are so convoluted that you'll wish you had spent the time learning the simple regex sequence that you need for it.
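To give a feel for what a regex-free version involves, here is a minimal sketch that checks the characters on either side of every occurrence by hand. The helper name `contains_whole_word` is hypothetical, the code uses Python 3 `print()` syntax, and it treats any letter or digit as part of a word, which is only an approximation of what `\b` does in a regex (for example, it does not count an underscore as a word character).

```python
def contains_whole_word(text, word):
    """Return True if `word` occurs in `text` with no letter or digit
    directly before or after it (hypothetical helper for illustration)."""
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) or not text[end].isalnum()
        if before_ok and after_ok:
            return True
        # This occurrence was embedded in a larger word; keep scanning.
        start = text.find(word, start + 1)
    return False


indicators = ["therefore", "for", "since"]
phrase = "... therefore, I conclude I am awesome."
indicators_in_phrase = [indicator for indicator in indicators
                        if contains_whole_word(phrase.lower(), indicator)]
print(indicators_in_phrase)  # ['therefore']
```

It works for this example, but the boundary bookkeeping is exactly the part a regex handles for you.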

Ignacio Vazquez-Abrams
That's fair, and what I figured. I was just making sure that there weren't any not convoluted solutions.
Mana
A: 

This is a little lengthy, but it gives the idea. Of course, a regex would make it simpler.

>>> indicators = ["therefore", "for", "since"]
>>> phrase = "... therefore, I conclude I am awesome."
>>> phrase_list = phrase.split()
>>> phrase_list
['...', 'therefore,', 'I', 'conclude', 'I', 'am', 'awesome.']
>>> phrase_list = [ k.rstrip(',') for k in phrase_list]
>>> indicators_in_phrase = [indicator for indicator in indicators if indicator in phrase_list]
>>> indicators_in_phrase 
['therefore']
pyfunc
+1  A: 

I think what you are trying to do is something more like this:

# str.split() with no arguments splits on any run of whitespace
words_in_phrase = phrase.split()

Now you'll have the words in a list like this:

['...', 'therefore,', 'I', 'conclude', 'I', 'am', 'awesome.']

Then compare the lists like so:

indicators_in_phrase = []
for word in words_in_phrase:
  if word in indicators:
    indicators_in_phrase.append(word)

There are probably several ways to make this less verbose, but I prefer clarity. Also, you might have to think about removing punctuation, as in "awesome." and "therefore,".

For that, use rstrip as in the other answer.
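Putting the split and the punctuation cleanup together might look like the sketch below. It is written for Python 3, and the choice to strip everything in `string.punctuation` from both ends of each word and to lowercase the words is an assumption, not something the question specifies.

```python
import string

indicators = ["therefore", "for", "since"]
phrase = "... therefore, I conclude I am awesome."

# Lowercase each whitespace-separated token and trim punctuation from both ends.
words_in_phrase = [word.strip(string.punctuation).lower()
                   for word in phrase.split()]

indicators_in_phrase = [indicator for indicator in indicators
                        if indicator in words_in_phrase]
print(indicators_in_phrase)  # ['therefore']
```

Note that this only handles single-word indicators; a multi-word indicator would never equal a single token.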

jgritty
+1  A: 

Is the problem with "for" that it's inside "therefore" or that it's not a word? For example, if one of your indicators was "awe", would you want it to be included in indicators_in_phrase?

How would you want the following situation to be handled?

indicators = ["abc", "cde"]
phrase = "One abcde two"

Francis Potter
If it was "awe", I would not want it to be included in indicators_in_phrase. In the example you gave, indicators_in_phrase would be the empty list.
Mana
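For what it's worth, a word-boundary regex appears to give the behaviour described in this exchange. The snippet below is a small Python 3 check under that assumption; the use of `re.escape` and a non-capturing group is my own choice for the illustration.

```python
import re

indicators = ["abc", "cde"]
phrase = "One abcde two"

# \b only matches between a word character and a non-word character,
# so neither "abc" nor "cde" can match inside "abcde".
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(i) for i in indicators))
matches = pattern.findall(phrase.lower())
print(matches)  # []

# Same idea for "awe" inside "awesome".
awe_matches = re.findall(r'\bawe\b', "i am awesome.")
print(awe_matches)  # []
```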
A: 

You can strip the punctuation from your phrase, then split it so that each word is a separate item. Then you can do your string comparison.

>>> import string
>>> indicators = ["therefore", "for", "since"]
>>> phrase = "... therefore, I conclude I am awesome."
>>> ''.join([ i for i in phrase.lower() if i not in string.punctuation]).strip().split()
['therefore', 'i', 'conclude', 'i', 'am', 'awesome']
>>> p = ''.join([ i for i in phrase.lower() if i not in string.punctuation]).strip().split()
>>> indicators_in_phrase = [indicator for indicator in indicators if indicator in p ]
>>> indicators_in_phrase
['therefore']
ghostdog74
+1  A: 

It is one line with regex...

import re

indicators = ["therefore", "for", "since"]
phrase = "... therefore, I conclude I am awesome."

indicators_in_phrase = set(re.findall(r'\b(%s)\b' % '|'.join(indicators), phrase.lower()))
Paulo Scardine
This is awesome, but can you please explain how the regex here works? I'm struggling to understand what's going on.
Mana
The regex is `\b(therefore|for|since)\b`, which looks for any one of the three words, surrounded by *word boundaries* (`\b`). So you can be sure that each match is a separate word and not part of a longer one.
poke
Ahh, wow. That's great. Definitely looking into learning Regex then.
Mana
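To make the explanation in the comments above concrete, this sketch builds the pattern step by step and prints it. It assumes Python 3, and the `re.escape` call is an addition that is not in the original one-liner; it only matters if an indicator contains regex metacharacters such as `.` or `+`.

```python
import re

indicators = ["therefore", "for", "since"]
phrase = "... therefore, I conclude I am awesome."

# Escape each indicator so special characters are matched literally,
# then join the alternatives with '|' inside one group.
alternatives = '|'.join(re.escape(indicator) for indicator in indicators)
pattern = r'\b(%s)\b' % alternatives
print(pattern)  # \b(therefore|for|since)\b

found = re.findall(pattern, phrase.lower())
print(found)  # ['therefore']

# "for" is still found when it stands alone as a word.
found_both = re.findall(pattern, "for now, therefore")
print(found_both)  # ['for', 'therefore']
```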
+1  A: 
  1. Create a set of the indicators
  2. Create a set of the words in the phrase
  3. Find the intersection

Code:

indicators = ["therefore", "for", "since"]
phrase = "... therefore, I conclude I am awesome."
print list(set(indicators).intersection(set( [ each.strip('.,') for each in phrase.split(' ')])))

Cheers:)

ShyamLovesToCode
You can replace `each.strip('.').strip(',')` with `each.strip('.,')`; see also http://docs.python.org/library/stdtypes.html#str.strip
rubik
Thanks for the information, I will make the change :)
ShyamLovesToCode
+1  A: 

Regexes are the simplest way! Hint:

re.compile(r'\btherefore\b')

Then you can change the word in the middle!

EDIT: I wrote this for you:

import re

indicators = ["therefore", "for", "since"]

phrase = "... therefore, I conclude I am awesome. "

def find(phrase, indicators):
    def _match(i):
        return re.compile(r'\b%s\b' % (i)).search(phrase)
    return [ind for ind in indicators if _match(ind)]

>>> find(phrase, indicators)
['therefore']
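Since the stated goal was to highlight the matches, the same word-boundary pattern can also be passed to `re.sub`. This is only a sketch under assumptions: the function name `highlight`, the `*` marker, and the case-insensitive matching are all my own choices, and it uses Python 3.

```python
import re


def highlight(phrase, indicators, marker='*'):
    """Wrap each whole-word indicator found in `phrase` with `marker`
    (hypothetical helper; the marker style is an assumption)."""
    if not indicators:
        # An empty alternation would match at every word boundary.
        return phrase
    pattern = re.compile(
        r'\b(%s)\b' % '|'.join(re.escape(ind) for ind in indicators),
        re.IGNORECASE)
    return pattern.sub(lambda m: marker + m.group(0) + marker, phrase)


indicators = ["therefore", "for", "since"]
phrase = "... therefore, I conclude I am awesome. "
highlighted = highlight(phrase, indicators)
print(highlighted)  # ... *therefore*, I conclude I am awesome. 
```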
rubik