Hello,

I have a list of possible substrings, e.g. ['cat', 'fish', 'dog']. In practice the list contains hundreds of entries.

I'm processing a string, and I want to find the index of the first appearance of any of these substrings.

To clarify, for '012cat' the result is 3, and for '0123dog789cat' the result is 4.

I also need to know which substring was found (e.g. its index in the substring list or the text itself), or at least the length of the substring matched.

There are obvious brute-force ways to achieve this, but I wondered whether there's an elegant Python/regex solution.

Thanks, Rax

+3  A: 
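import re

subs = ['cat', 'fish', 'dog']
sentences = ['0123dog789cat']

# Escape each substring so that regex metacharacters are matched literally.
pattern = re.compile("|".join(re.escape(s) for s in subs))

def search():
    # Return (substring, index) for the first sentence that contains a match.
    for sentence in sentences:
        result = pattern.search(sentence)
        if result is not None:
            return (result.group(), result.start())

print(search())
# ('dog', 4)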
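Since the question also asks which substring was matched, here is a minimal sketch of one way to recover its position in the original list. The names `index_of` and `find_first` are illustrative and not part of the answer above, and `re.escape` is used on the assumption that entries may contain regex metacharacters.

```python
import re

subs = ['cat', 'fish', 'dog']

# Hypothetical lookup table: substring -> its position in the list.
index_of = {s: i for i, s in enumerate(subs)}
pattern = re.compile("|".join(re.escape(s) for s in subs))

def find_first(text):
    """Return (start, substring, list_index) for the earliest match, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    return (m.start(), m.group(), index_of[m.group()])
```

The length of the match is then simply `len(substring)`, so the caller gets the index, the text, and the list position from one search.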
Unknown
I think he only has 1 "sentence"
Paolo Bergantino
Thanks, but this is not what I'm looking for. First, it does not find the first occurrence (in the second sentence it will return the occurrence of "cat", i.e. 10, instead of "dog"'s, i.e. 4). There are obvious solutions, but they are very brute force (iterate until the last substring while constantly maintaining the first occurrence). I'm under the impression that Python must have some library function for this...
Roee Adler
I don't like when my answers get "sniped" either... but I didn't mean to steal your thunder. +1 because your solution is technically correct. Two comments: it does not discuss the scalability concerns that Rax had, and I don't like the "return" statement, as it would prematurely exit if you had more sentences in sentences. Other than that, it's short and to the point, and warrants some reputation.
Tom
@Tom, "I don't like the "return" statement, as it would prematurely exit if you had more sentences in sentences." But I thought Rax wanted to find the first match?
Unknown
@Unknown: the reason for my comment was that ***if*** you were to add more sentences to the sentences list, your code would short-circuit because it would only check the first sentence. I.e., you shouldn't have used lists for subs and sentences if you weren't going to write code that generalized for larger lists.
Tom
Sorry, not just check the first sentence, but only check up to the first sentence that had a match (in this case, the first sentence).
Tom
@Unknown: (responding to your comment on my post): For the third time: all I meant was that you made a LIST (sentences) and your code produces the first match IN THE FIRST POSSIBLE SENTENCE. I am merely saying that it would have been nice if your answer aggregated the first match IN EACH sentence since you wrote it to operate on lists. OR you could have just not used lists. Instead, you picked a way that is mildly confusing if someone were to generalize it to performing a search on multiple sentences, i.e. "Search for cat|fish|dog in each of 10 sentences in the list sentences". Make sense?
Tom
@Unknown: (responding to yet another comment on my post): No, I'm not wrong. You don't understand what I am saying because you think I am just trying to criticize you or something. I can't help but laugh at this back and forth. When I have more time, I will try to post as another answer so that I can show you code - then what I am saying will be crystal clear to you (I hope). Don't take what I was saying as an attack - it was a general suggestion about making code you write on posts more clear.
Tom
+17  A: 
Tom
This surely works, but I have a question: isn't there a limitation on the size of the regex definition? If I have 1000 substrings, will it still work? Is there any significant performance degradation relative to the number of words (i.e. one that's more than linear in the size of the list)? Regarding your other clarifications, my list of substrings is being updated only once a day or so, so I think it's no issue to generate the regex definition and call "compile" at this frequency. Many thanks
Roee Adler
@rax: did you see my new solution? I basically fixed everything about it and submitted it 20 seconds after this one.
Unknown
@rax: Hopefully the example code I added helps convince you the re module will be fine :-).
Tom
I would suggest the "|".join(words) too, but there's the issue that searching for ["cacat", "cat"] in "this cacat is a cat" returns [("cacat", 5), ("cat", 16)] instead of [("cacat", 5), ("cat", 7)], although this could be the wanted behaviour.
ΤΖΩΤΖΙΟΥ
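A small sketch of the behaviour described in the comment above, assuming `re.finditer` is used to collect every match. The lookahead variant is not from the thread; it is one common way to report overlapping occurrences as well, shown here only for comparison.

```python
import re

text = "this cacat is a cat"
words = ["cacat", "cat"]
alternation = "|".join(re.escape(w) for w in words)

# Plain alternation: matches never overlap, so the "cat" inside "cacat" is skipped.
plain = [(m.group(), m.start()) for m in re.finditer(alternation, text)]

# Zero-width lookahead: every start position is tried, so overlaps are reported too.
overlapping = [(m.group(1), m.start())
               for m in re.finditer("(?=(%s))" % alternation, text)]
```

For the original question only the first match matters, so the plain alternation with `search()` is unaffected by this difference.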
@Tom: many thanks. I will try it as soon as I can! - Rax
Roee Adler
Line 7 in your first snippet should be 'match_obj.start()'
Nick Presta
Many so-called regular expression syntaxes are not actually "regular". That is, they are actually more powerful than true regular expressions and therefore cannot be represented as a DFA. An example of this which shows up in Python, Perl and even grep is back-references. Take the Python regex r"(a+)b\1". This matches some number of a's, a b, and then the same number of a's as before. This is non-regular. RE engines that support backreferences actually use an NFA. Some RE engines are smart enough to switch to using DFAs for regexes that are actually regular, but I don't think Python does this.
Laurence Gonsalves
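To make the back-reference example in the comment above concrete, here is a minimal sketch using the same pattern. The variable names are illustrative only.

```python
import re

# (a+) captures a run of a's; \1 must then repeat exactly that captured run.
pattern = re.compile(r"(a+)b\1")

balanced = pattern.fullmatch("aaabaaa")    # three a's on each side of the b
unbalanced = pattern.fullmatch("aaabaa")   # three a's before, only two after
```

Here `balanced` is a match object whose `group(1)` is the captured run, while `unbalanced` is `None` because no split of the a's satisfies the back-reference.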
I hate it when I make unneeded edits and then someone else with the same answer gets bumped up and gets 10 votes while I get 0.
Unknown
@Laurence: Good insight. I am curious to post somewhere about the implementation of REs in Python. I don't see why an NFA is used though. NFAs and DFAs are equivalent. You can use the subset (powerset) construction to convert an NFA to a DFA. Did you mean you need to use a PDA so that a stack can keep track of how many a's you have seen? I'm not even sure about this because I am not completely sure about the syntax... but I am sure NFAs and DFAs are equivalent.
Tom
@Unknown: see my comment on your post.
Tom
@TZOTZIOY: I am going to add an update mentioning something I tried with this... let me know if you agree.
Tom
@Tom, quote from Rax the author: "I'm processing a string, and what I'm looking for is to find the index of first appearance of any of these substrings." If his point is to break on the "FIRST APPEARANCE" then how am I wrong?
Unknown
@Unknown: I replied on your post.
Tom
@Tom, no you are wrong. It searches through every single sentence until it finds the first match. It does not only search one sentence if no match was found there.
Unknown
@Unknown: no... I replied on your post... this is getting ridiculous.
Tom
@Tom, no you just have a very narrow viewpoint. Imagine the list as a list of sentences in a paragraph. It still makes sense to be able to find the first match in a list of strings and doesn't make my answer any less valid.
Unknown
@Unknown: wasn't saying your answer was invalid... in fact, I AM THE ONLY PERSON TO +1 IT... please don't put words in my mouth.
Tom
+2  A: 

This is a vague, theoretical answer with no code provided, but I hope it can point you in the right direction.

First, you will need a more efficient lookup for your substring list. I would recommend some sort of tree structure. Start with a root, then add an 'a' node if any substrings start with 'a', add a 'b' node if any substrings start with 'b', and so on. For each of these nodes, keep adding subnodes.

For example, if you have a substring with the word "ant", you should have a root node, a child node 'a', a grandchild node 'n', and a great grandchild node 't'.

Nodes should be easy enough to make.

class Node(object):
    def __init__(self, name):
        self.name = name
        # Per-instance list: a class-level list would be shared by every node.
        self.children = []

where name is a character.

Iterate through your strings letter by letter. Keep track of which letter you're on. At each letter, try to use the next few letters to traverse the tree. If you're successful, your letter number will be the position of the substring, and your traversal order will indicate the substring that was found.
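The traversal described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: `TrieNode`, `build_trie`, and `find_first` are hypothetical names, children are kept in a dict rather than a list for faster lookup, and the function returns the shortest substring found at the earliest matching position.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.word = None     # the full substring, if one ends at this node

def build_trie(substrings):
    """Insert every substring into a trie, one character per level."""
    root = TrieNode()
    for s in substrings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root

def find_first(text, root):
    """Return (index, substring) for the earliest match in text, or None."""
    for start in range(len(text)):
        node = root
        pos = start
        # Walk down the trie for as long as the text keeps matching.
        while pos < len(text) and text[pos] in node.children:
            node = node.children[text[pos]]
            pos += 1
            if node.word is not None:
                return (start, node.word)
    return None
```

For example, `find_first('0123dog789cat', build_trie(['cat', 'fish', 'dog']))` walks the text from each position in turn and stops at the first position where a complete substring is reached.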

Clarifying edit: DFAs should be much faster than this method, and so I should endorse Tom's answer. I'm only keeping this answer up in case your substring list changes often, in which case using a tree might be faster.

Wesley
Thanks, I completely understand the theory and practice of string indexing and searching, and can implement it myself, but I would expect Python to have a vehicle for this exact thing. I understand there's none?
Roee Adler
I don't know of such functionality built into Python, so I can't say whether it does or doesn't exist. As such, I'm afraid this answer doesn't help you in the least. The closest answer I see here is Tom's.
Wesley
A: 

First of all, I would suggest sorting the initial list in ascending order of length, because scanning for a shorter substring is faster than scanning for a longer substring.

Anonymous
Are you sure this makes a difference? If I were implementing the regex myself (as a DFA), the length would not matter. Every substring would be searched for at the same time. I am now curious as to how python implements regexes...
Tom
A: 

How about this one.

>>> substrings = ['cat', 'fish', 'dog']
>>> _string = '0123dog789cat'
>>> found = list(map(lambda x: (_string.index(x), x), filter(lambda x: x in _string, substrings)))
>>> found
[(10, 'cat'), (4, 'dog')]
>>> if found:
...     min(found, key=lambda x: x[0])
...
(4, 'dog')

Obviously, you could return something other than a tuple.

This works by:

  • Filtering the list of substrings down to those that are in the string
  • Building a list of tuples containing the index of the substring, and the substring
  • If a substring has been found, find the minimum value based on the index
DisplacedAussie
This seems to be a terribly inefficient answer. It will surely scan the string multiple times. Even a brute force approach where you manually use the string index() method for each string you are searching for (keeping track of the minimum on the fly) is better than this. map() can be a powerful function, but this is not an example of such a case.
Tom
+1  A: 

I just want to point out the time difference between DisplacedAussie's answer and Tom's answer. Both were fast when used once, so you shouldn't have any noticeable wait for either, but when you time them:

import random
import re
import string

# Build 2000 random 10-character words made of letters and digits.
words = []
letters_and_digits = "%s%s" % (string.ascii_letters, string.digits)
for i in range(2000):
    chars = []
    for j in range(10):
        chars.append(random.choice(letters_and_digits))
    words.append(("%s"*10) % tuple(chars))
search_for = re.compile("|".join(words))
first, middle, last = words[0], words[len(words) // 2], words[-1]
search_string = "%s, %s, %s" % (last, middle, first)

def _search(search_for, search_string):
    match_obj = search_for.search(search_string)
    # Note, if no match, match_obj is None
    if match_obj is not None:
        return (match_obj.start(), match_obj.group())

def _map(search_for, search_string):
    # Recover the word list from the compiled pattern.
    substrings = search_for.pattern.split("|")
    found = list(map(lambda x: (search_string.index(x), x), filter(lambda x: x in search_string, substrings)))
    if found:
        return min(found, key=lambda x: x[0])


if __name__ == '__main__':
    from timeit import Timer

    # timeit() runs the statement 1,000,000 times by default.
    t = Timer("_search(search_for, search_string)", "from __main__ import _search, search_for, search_string")
    print(_search(search_for, search_string))
    print(t.timeit())

    t = Timer("_map(search_for, search_string)", "from __main__ import _map, search_for, search_string")
    print(_map(search_for, search_string))
    print(t.timeit())

Outputs:

(0, '841EzpjttV')
14.3660159111
(0, '841EzpjttV')
# I couldn't wait this long

I would go with Tom's answer, for both readability, and speed.

Nick Presta
Thanks Nick! In fairness to DisplacedAussie, you could help him out (a little bit) by removing the call to split("|") and just giving him a list to start with. To be more comprehensive, you ought to add the brute-force approach: for word in search_for: index = search_string.index(word); if index < smallest_index: record the new smallest index and the word that matched. (Sorry, can't write code in comments.) Then wait for and post all the times. This is a good thing to consider. I wish there were a special meta-post for things like this, since neither comments nor answer posts are good places.
Tom
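The brute-force approach outlined in the comment above might look like the sketch below. The function name `brute_force_first` is illustrative, and `str.find` is used instead of `str.index` as an assumption on my part, since `index` raises `ValueError` when a word is absent.

```python
def brute_force_first(search_string, words):
    """Return (index, word) for the earliest occurrence of any word, or None."""
    best = None
    for word in words:
        index = search_string.find(word)   # -1 when the word is absent
        if index != -1 and (best is None or index < best[0]):
            best = (index, word)           # record the new smallest index
    return best
```

This scans the string once per word, so it would need to be timed alongside the other two functions before drawing any conclusion about relative speed.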
+1 for actually doing benchmarks in a question about efficiency!
dbr