I have a series of strings like:

'i would like a blood orange'

I also have a list of strings like:

["blood orange", "loan shark"]

Operating on the string, I want the following list:

["i", "would", "like", "a", "blood orange"]

What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.

+1  A: 

Ah, this is crude and ugly, but it looks like it works. You may want to clean it up and optimize it, but some of the ideas here might be useful.

list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]

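for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            # Split around the phrase, then put the phrase back between the pieces.
            tmp = str_lst.split(item)
            lst = []
            for i, itm in enumerate(tmp):
                if itm != '':
                    lst.append(itm)
                # One occurrence of the phrase sits between each pair of pieces,
                # so nothing is appended after the final piece.
                if i < len(tmp) - 1:
                    lst.append(item)
            print(lst)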

output:

['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
pyfunc
+1  A: 
blackkettle
Glenn Maynard
@Glenn I think I covered these issues in my initial description. Based on the OP description I would also think that "xxxblood orangexxx" and "blood orange" should be treated as completely different strings.
blackkettle
It's always nice when people accept the first answer they see, without bothering to read anything.
Glenn Maynard
That was my point: they should almost certainly be treated as different strings. This answer doesn't do that, and returns `"xxxblood orangexxx"` as a phrase. (The OP didn't specify--I doubt he thought it through that far, and will probably be bitten by bugs later on as a result.) Anyhow, the OP obviously isn't paying attention, so I'm moving on.
Glenn Maynard
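
To make the difference discussed above concrete, here is a minimal sketch contrasting a plain substring test with a word-by-word comparison. The helper name `contains_phrase_as_words` is made up for illustration and is not taken from any answer in this thread; it assumes words are separated by single spaces.

```python
def contains_phrase_as_words(text, phrase):
    """Return True only if `phrase` appears in `text` as a run of whole words."""
    words = text.split(" ")
    target = phrase.split(" ")
    n = len(target)
    # Slide a window of n words over the text and compare it to the phrase.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


phrase = "blood orange"

# A substring test also fires inside longer words.
print(phrase in "xxxblood orangexxx")                            # True
# A word-level test treats them as different strings.
print(contains_phrase_as_words("xxxblood orangexxx", phrase))    # False
print(contains_phrase_as_words("i would like a blood orange", phrase))  # True
```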
@Glenn Maynard, @alphomega: Yeah. I did provide a working answer very early on, even if my variable names were ugly and it is not efficient with lists.
pyfunc
Sorry, and thanks for the answers
alphomega
+4  A: 

This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.

(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)

def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    max_phrase_length = max(len(p) for p in phrases)

    # Find a phrase within words starting at the specified index.  Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx+phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue

        phrase = find_phrase(idx)
        if phrase is not None:
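            # The phrase covers this word plus the next len(phrase) - 1 words,
            # so only those following words need to be skipped.
            skip = len(phrase) - 1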
            yield " ".join(phrase)
            continue

        yield words[idx]

print(list(split_with_phrases('i would like a blood orange',
    ["blood orange", "loan shark"])))
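
On the remark about a cleaner way to handle skip: one possibility, sketched here rather than offered as the definitive version, is a `while` loop that advances the index by the length of whatever was matched, so no skip counter is needed. The function name `split_with_phrases_while` is hypothetical, and it returns a list instead of yielding.

```python
def split_with_phrases_while(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    # Fall back to 0 so an empty phrase list does not make max() raise.
    max_phrase_length = max(len(p) for p in phrases) if phrases else 0

    result = []
    idx = 0
    while idx < len(words):
        # Try longer phrases first so "hello world two" beats "hello world".
        for length in range(max_phrase_length, 0, -1):
            candidate = tuple(words[idx:idx + length])
            if candidate in phrases:
                result.append(" ".join(candidate))
                idx += len(candidate)  # jump past the whole phrase
                break
        else:
            # No phrase starts here; emit the single word.
            result.append(words[idx])
            idx += 1
    return result


print(split_with_phrases_while('i would like a blood orange please',
                               ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange', 'please']
```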
Glenn Maynard
You are correct. Your solution is definitely better than either of the ones that I hastily provided. I'm not sure whether it is more appropriate to correct mine or just point this out.
blackkettle
@blackkettle, I think you could correct yours, but if it would end up looking exactly like this one, there would be no point.
Geoffrey Van Wyk
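
Since the question mentions using re throughout, here is a rough regex-based sketch of the same idea. It is not taken from any of the answers; the name `split_with_phrases_re` is hypothetical. It assumes tokens are separated by whitespace, and it uses the lookarounds `(?<!\S)` and `(?!\S)` so that a phrase only matches when it is surrounded by whitespace or the ends of the string.

```python
import re


def split_with_phrases_re(sentence, phrase_list):
    # Longest phrases first, so a longer phrase wins over a shorter prefix of it.
    ordered = sorted((p for p in phrase_list if p), key=len, reverse=True)
    # Each phrase must start and end at a whitespace boundary.
    alternatives = [r"(?<!\S)" + re.escape(p) + r"(?!\S)" for p in ordered]
    # Anything that is not a known phrase falls through to a single word.
    alternatives.append(r"\S+")
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(sentence)


print(split_with_phrases_re('i would like a blood orange',
                            ["blood orange", "loan shark"]))
# ['i', 'would', 'like', 'a', 'blood orange']
print(split_with_phrases_re('xxxblood orangexxx', ["blood orange"]))
# ['xxxblood', 'orangexxx']
```

Unlike the `split(" ")` approach, this treats any run of whitespace as a separator, so repeated spaces do not produce empty strings.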