ansaurus

Question

String splitting problem with multiword expressions

Answer 1

+1 A:

Ah, this is crude and ugly, but it looks like it works. You may want to clean it up and optimize it, but some of the ideas here might be useful.

list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]

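for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            pieces = str_lst.split(item)
            lst = []
            for i, piece in enumerate(pieces):
                # Keep the text around the phrase, dropping empty leftovers.
                if piece != '':
                    lst.append(piece)
                # Put the phrase back between pieces, but not after the last one.
                if i < len(pieces) - 1:
                    lst.append(item)
            print(lst)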

output:

['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']

pyfunc 2010-10-20 03:01:28
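
The split-and-reinsert idea above can also be written with `re.split`, which keeps the delimiter when the pattern is wrapped in a capturing group. This is only a sketch: `split_keep_phrase` is a made-up name, and like the loop above it matches plain substrings, so it does not check word boundaries.

```python
import re

def split_keep_phrase(text, phrase):
    # A capturing group makes re.split return the matched phrase as well.
    parts = re.split("(" + re.escape(phrase) + ")", text)
    # Drop the empty strings produced when the phrase starts or ends the text.
    return [p for p in parts if p]

print(split_keep_phrase('i would like a blood orange', 'blood orange'))
print(split_keep_phrase('blood orange juice, please', 'blood orange'))
```

The second call covers the case where the phrase comes first in the string, which is easy to get wrong when re-inserting the phrase by hand.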

Answer 2

+1 A:

blackkettle 2010-10-20 03:02:28

Glenn Maynard 2010-10-20 03:29:26

@Glenn I think I covered these issues in my initial description. Based on the OP's description, I would also think that "xxxblood orangexxx" and "blood orange" should be treated as completely different strings.

blackkettle 2010-10-20 03:39:36

It's always nice when people accept the first answer they see, without bothering to read anything.

Glenn Maynard 2010-10-20 03:40:29

That was my point: they should almost certainly be treated as different strings. This answer doesn't do that, and returns `"xxxblood orangexxx"` as a phrase. (The OP didn't specify--I doubt he thought it through that far, and will probably be bitten by bugs later on as a result.) Anyhow, the OP obviously isn't paying attention, so I'm moving on.

Glenn Maynard 2010-10-20 03:48:56
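
To illustrate the point about `"xxxblood orangexxx"`: a plain substring test matches inside the longer token, while a regex guarded by lookarounds does not. This is only a sketch; `split_on_whole_phrase` is a made-up name, and using `\w` characters to decide where a word ends is an assumption, since the question never defines it.

```python
import re

def split_on_whole_phrase(text, phrase):
    # (?<!\w) and (?!\w) reject matches that are glued to other word characters.
    pattern = r"(?<!\w)(" + re.escape(phrase) + r")(?!\w)"
    # The capturing group keeps the phrase in the result; empty pieces are dropped.
    return [p for p in re.split(pattern, text) if p]

# The substring test cannot tell the two cases apart.
print("blood orange" in "xxxblood orangexxx")

# The guarded pattern splits only on the standalone phrase.
print(split_on_whole_phrase("a blood orange", "blood orange"))
print(split_on_whole_phrase("xxxblood orangexxx", "blood orange"))
```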

@Glenn Maynard, @alphomega: Yeah. I did provide a working answer very early on, even if my variable names were ugly and it is not efficient with lists.

pyfunc 2010-10-20 03:52:38

Sorry, and thanks for the answers

alphomega 2010-10-20 04:19:25

Answer 3

+4 A:

This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.

(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)

def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    # Store each phrase as a tuple of words so slices of `words` can be looked up directly.
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    max_phrase_length = max(len(p) for p in phrases)

    # Find a phrase within words starting at the specified index.  Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx+phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue

        phrase = find_phrase(idx)
        if phrase is not None:
            # The current word is the phrase's first word, so skip only the remaining ones.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue

        yield words[idx]

print(list(split_with_phrases('i would like a blood orange',
    ["blood orange", "loan shark"])))

Glenn Maynard 2010-10-20 03:18:28
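
On the aside about `skip` in the answer above: one possible way to avoid the counter is to drive the loop with an explicit index and advance it by the length of whatever was just yielded. This is a sketch under that assumption, not the answer's own code, and the name `split_with_phrases_no_skip` is hypothetical.

```python
def split_with_phrases_no_skip(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    # Fall back to 0 so an empty phrase list simply yields the plain words.
    max_len = max([len(p) for p in phrases] or [0])

    idx = 0
    while idx < len(words):
        # Try the longest candidate first so longer phrases win over their prefixes.
        for length in range(max_len, 0, -1):
            candidate = tuple(words[idx:idx + length])
            if candidate in phrases:
                yield " ".join(candidate)
                # Jump past every word that was consumed by the phrase.
                idx += len(candidate)
                break
        else:
            # No phrase starts at this position; emit the single word.
            yield words[idx]
            idx += 1

print(list(split_with_phrases_no_skip('i would like a blood orange now',
                                      ["blood orange", "loan shark"])))
```

Because the index only moves forward by the number of words actually emitted, a word that follows a phrase (`now` in the example) cannot be skipped by accident.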

You are correct. Your solution is definitely better than either of the ones that I hastily provided. I'm not sure whether it is more appropriate to correct mine or just point this out.

blackkettle 2010-10-20 04:00:46

@blackkettle, you could correct yours, but if it ended up looking exactly like this one, I don't think there would be much point.

Geoffrey Van Wyk 2010-10-20 13:48:55
