views:

119

answers:

2

Hi all,

I'm working on a portion of code that is essentially trying to reduce a list of strings down to a single string recursively.

I have an internal database built up of matching string arrays of varying length (say array lengths of 2-4).

An example input string array would be:

{"The", "dog", "ran", "away"}

And for further example, my database could be made up of string arrays in this manner:

(length 2) {{"The", "dog"},{"dog", "ran"}, {"ran", "away"}}
(length 3) {{"The", "dog", "ran"}.... and so on

So, what I am attempting to do is recursively reduce my input string array down to a single token. So ideally it would parse something like this:

1) {"The", "dog", "ran", "away"}

Say that (seq1) = {"The", "dog"} and (seq2) = {"ran", "away"}

2) {  (seq1), "ran", "away"}
3) {  (seq1), (seq2)}

In my sequence database I know that, for instance, seq3 = {(seq1), (seq2)}

4) {  (seq3)  }

So, when it is down to a single token, I'm happy and the function would end.

Here is an outline of my current program logic:

public void Tokenize(Arraylist<T> string_array, int current_size)
{
  // retrieve all known sequences of length [current_size] (from global list array)
  loc_sequences_by_length = sequences_by_length[current_size-min_size]; // sequences of length 2 are stored in position 0 and so on

  // escape cases
  if (string_array.Count == 1)
  {
    // finished successfully
    return;
  }
  else if (string_array.Count < current_size)
  {
    // checking sequences of greater length than input string, bail
    return;
  }
  else
  {
    // split input string into chunks of size [current_size] and compare to local database 
    // of known sequences
    // (splitting code works fine)

    foreach (comparison)
    {
      if (match_found)
      {
        // update input string and recall function to find other matches
        string_array[found_array_position] = new_sequence;
        string_array.Removerange[found_array_position+1, new_sequence.Length-1];
        Tokenize(string_array, current_size)
      }
    }   
  }

  // ran through unsuccessfully, increment length and try again for new sequence group
  current_size++;
  if (current_size > MAX_SIZE)
    return;
  else
    Tokenize(string_array, current_size);
}

I thought it was straightforward enough, but have been getting some strange results. Generally it appears to work, but upon further review of my output data I'm seeing some issues. Mainly, it appears to work up to a certain point...and at that point my 'curr_size' counter resets to the minimum value.

So it is called with a size of 2, then 3, then 4, then resets to 2. My assumption was that it would run up to my predetermined max size, and then bail completely.

I tried to simplify my code as much as possible, so there are probably some simple syntax errors in transcribing. If there is any other detail that may help an eagle-eyed SO user, please let me know and I'll edit.

Thanks in advance

+1  A: 

One bug is: string_array[found_array_position] = new_sequence;

I don't know where this is defined, and as far as I can tell if it was defined, it is never changed.

In your if statement, when if match_found ever set to true?

Also, it appears you have an extra close brace here, but you may want the last block of code to be outside of the function:

     }
    }   
  }

It would help if you cleaned up the code, to make it easier to read. Once we get past the syntactic errors it will be easier to see what is going on, I think.

James Black
A: 

Not sure what all the issues are, but the first thing I'd do is have your "catch-all" exit block right at the beginning of your method.

public void Tokenize(Arraylist<T> string_array, int current_size)
{
  if (current_size > MAX_SIZE)
    return;

  // Guts go here

  Tokenize(string_array, ++current_size);
}

A couple things:

  1. Your tokens are not clearly separated from your input string values. This makes it more difficult to handle, and to see what's going on.
  2. It looks like you're writing pseudo-code:
    • loc_sequences_by_length is not used
    • found_array_position is not defined
    • Arraylist should be ArrayList.
    • etc.

Overall I agree with James' statement:

It would help if you cleaned up the code, to make it easier to read.

-Doug

Doug
winner because you helped me to rethink my approach...thanks for the input!
espais
No problem - I'm glad it helped somehow! :)
Doug