What is the best way to search for matching words inside a string?

Right now I do something like the following:

if re.search('([h][e][l][l][o])',file_name_tmp, re.IGNORECASE):

This works, but it's slow: I have around 100 different regex statements, each searching for a full word, so I'd like to combine several of them using a | separator or something similar.

+3  A: 

Can you try:

if 'hello' in longtext:

or

if 'HELLO' in longtext.upper():

to match hello/Hello/HELLO.
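For many words, the same case-insensitive substring test can be run in a loop. This is only a minimal sketch: the function name find_substrings and the sample data are hypothetical, and it matches substrings, not whole words.

```python
def find_substrings(text, words):
    """Return the words that occur in text as case-insensitive substrings."""
    lowered = text.lower()  # lowercase the text once, not once per word
    return [w for w in words if w.lower() in lowered]

# hypothetical sample data
words = ["hello", "red", "blue"]
print(find_substrings("Say HeLLo to the Blue sky", words))
```

Lowercasing the text once up front avoids repeating that work for each of the 100 or so words.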

eumiro
or hELLo or HElLO or .... ;)
KevinDTimm
... hElLo or hellO or...
Santiago Lezica
+2  A: 

If you are trying to check for 'hello' or another complete word in a string, you could also do

if 'hello' in stringToMatch:
    ... # Match found, do something

To find several strings at once, you could also use findall:

>>> import re
>>> toMatch = 'e3e3e3eeehellloqweweemeeeeefe'
>>> regex = re.compile("hello|me", re.IGNORECASE)
>>> print(regex.findall(toMatch))
['me']
>>> toMatch = 'e3e3e3eeehelloqweweemeeeeefe'
>>> print(regex.findall(toMatch))
['hello', 'me']
>>> toMatch = 'e3e3e3eeeHelLoqweweemeeeeefe'
>>> print(regex.findall(toMatch))
['HelLo', 'me']
pyfunc
That works; however, I still need the regex functionality of returning a group of matches, as the words in the string are sometimes uppercase and sometimes lowercase.
Joe
@Joe: In that case you could use a regex with the | operator. See my edited reply.
pyfunc
+3  A: 
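>>> import re
>>> words = ('hello', 'good-bye', 'red', 'blue')
>>> # re.escape makes characters such as '-' literal, so no manual backslashes are needed
>>> pattern = re.compile('(' + '|'.join(re.escape(w) for w in words) + ')', re.IGNORECASE)
>>> sentence = 'SAY HeLLo TO reD, good-bye to Blue.'
>>> print(pattern.findall(sentence))
['HeLLo', 'reD', 'good-bye', 'Blue']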
Steven Rumbalski
+1 Good answer. However, I think it's also important to point out word-boundary conditions/options available.
pst
+2  A: 

You say you want to search for WORDS. What is your definition of a "word"? If you are looking for "meet", do you really want to match the "meet" in "meeting"? If not, you might like to try something like this:

>>> import re
>>> query = ("meet", "lot")
>>> text = "I'll meet a lot of friends including Charlotte at the town meeting"
>>> regex = r"\b(" + "|".join(query) + r")\b"
>>> re.findall(regex, text, re.IGNORECASE)
['meet', 'lot']
>>>

The \b at each end forces it to match only at word boundaries, using re's definition of "word" -- "isn't" isn't a word, it's two words separated by an apostrophe. If you don't like that, look at the nltk package.
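Both points can be sketched together. The helper name build_word_regex is hypothetical (it is not part of re); it escapes each word before joining, so punctuation such as a hyphen is taken literally, and the last line shows how \b splits "isn't".

```python
import re

def build_word_regex(words):
    """Compile one case-insensitive pattern matching any of words as a whole word."""
    alternation = "|".join(re.escape(w) for w in words)
    # (?:...) groups without capturing, so findall returns the full match
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

regex = build_word_regex(["meet", "lot"])
text = "I'll meet a lot of friends including Charlotte at the town meeting"
print(regex.findall(text))               # ['meet', 'lot']
print(re.findall(r"\b\w+\b", "isn't"))   # ['isn', 't']
```

Compiling the pattern once and reusing it for every string avoids rebuilding the alternation on each search.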

John Machin