views:

534

answers:

4

I would like to be able to search a string for various words, when I find one, i want to split the string at that point into 3 parts (left, match, right), the matched text would be excluded, and the process would continue with the new string left+right.

Now, once i have all my matches done, i need to reverse the process by reinserting the matched words (or a replacement for them) at the point they were removed. I have never really found what i wanted in any of my searches, so I thought I would ask for input here on SO.

Please let me know if this question needs further description.

BTW - at the moment, i have a very poor algorithm that replaces matched text with a unique string token, and then replaces the tokens with the replacement text for the appropriate match after all the matches have been done.

This is the goal:

one two three four five six

match "three" replace with foo (remember we found three, and where we found it)

one two four five six
       |
     three

match "two four" and prevent it from being matched by anything (edited for clarity)

one five six
   |
 two four 
       |
     three

at this point, you cannot match for example "one two"

all the matches have been found, now put their replacements back in (in reverse order)

one two four five six
       |
     three


one two foo four five six

What's the point? Preventing one match's replacement text from being matched by another pattern. (all the patterns are run at the same time and in the same order for every string that is processed)

I'm not sure the language matters, but I'm using Lua in this case.

I'll try rephrasing, I have a list of patterns i want to find in a given string, if I find one, I want to remove that part of the string so it isnt matched by anything else, but I want to keep track of where i found it so I can insert the replacement text there once I am done trying to match my list of patterns

Here's a related question:

http://stackoverflow.com/questions/648862/shell-script-search-and-replace-text-in-multiple-files-using-a-list-of-strings

+1  A: 

Given the structure of the problem, I'd probably try an algorithm based on a binary tree.

Jon Seigel
no point, he's trying to solve a different problem
omouse
My answer was posted based on the original edition of the question... I'd still like to solve the problem, but what I've written so far may not be the best way to do it (as no one seems to fully understand the problem yet).
Jon Seigel
A: 

pseudocode:

for( String snippet in snippets )
{
    int location = indexOf(snippet,inputData);
    if( location != -1)
    {
        // store replacement text for a found snippet on a stack along with the
        // location where it was found
        lengthChange = getReplacementFor(snippet).length - snippet.length;
        for each replacement in foundStack
        {
            // IF the location part of the pair is greater than the location just found
            //Increment the location part of the pair by the lengthChange to account
            // for the fact that when you replace a string with a new one the location
            // of all subsequent strings will be shifted 
        }

        //remove snippet
        inputData.replace(snippet, "");
    }
}

for( pair in foundStack )
{
    inputData.insert( pair.text, pair.location);
}

This is basically just doing exactly as you said in your problem description. Step through the algorithm, putting everything on a stack with the location it was found at. You use a stack so when you reinsert in the second half, it happens in reverse order so that the stored "location" applies to the current state of the inputString.

Edited with a potential fix for commenter's criticism. Does the commented for block within the first one account for your criticisms, or is it still buggy in certain scenarios?

Brian Schroth
Except as a result of subsequent replacements location can be outside of the string. Or it could be in the middle of a replacement string.
Franci Penov
good point. Didn't think it through.
Brian Schroth
I edited with a potential solution that might address your criticism. Do you think it would work?
Brian Schroth
Even if this does work, on further consideration I think it would be better to do this recursively.
Brian Schroth
I tend to agree - recursion seems like a good way to solve this assuming the number of matches stays small.
sylvanaar
Well I tried to do it recursively and couldn't find a good way around the same problem of later replacements screwing up the insert positions of earlier replacements, so I'm inclined to go back to this, my original approach. My initial back of the envelope testing found it worked, at least.
Brian Schroth
+1  A: 

Your algorithm description is unclear. There's no exact rule where the extracted tokens should be re-inserted.

Here's an example:

  1. Find 'three' in 'one two three four five six'
  2. Choose one of these two to get 'foo bar' as result:

    a. replace 'one two' with 'foo' and 'four five six' with 'bar'

    b. replace 'one two four five six' with 'foo bar'

  3. Insert 'three' back in the step 2 resulting string 'foo bar'

At step 3 does 'three' goes before 'bar' or after it?

Once you've come up with clear rules for reinserting, you can easily implement the algorithm as a recursive method or as an iterative method with a replacements stack.

Franci Penov
I fixed the example while you were posting, it was a little unclear you are right.
sylvanaar
A: 

What you want to do is have a 2nd string which stores the output. You process the input and search for patterns in it. If no matching pattern is found, no replacement occurs so you just append the characters you read in directly to the output. If a pattern is found, append the replacement string to the output. Because you're always moving forward in the string, there is no chance of a pattern matching a previous replacement.

If you're searching character by character (brute-force search) you'll have to figure out how you want to prioritize the patterns; by length or by order they were added to the patterns list.

Otherwise, you'll be searching word by word or sentence by sentence which generalizes into searching using a buffer. For that you'll have to determine the separators (for words it'll be spaces, for sentences it'll be periods exclamation points and other things like that, for a comma-separated values file it'll be commas).

omouse
he needs to search the full string for each snippet, so "always moving forward in the string" won't work, if I understand the problem correctly.
Brian Schroth
You don't need to search through the full string for each snippet. He wants to prevent the replacement of already found strings, so to do that, you only search forward through the string since the previous slice of the string has been searched.
omouse