Hi all,
I'm working on a portion of code that is essentially trying to reduce a list of strings down to a single string recursively.
I have an internal database built up of matching string arrays of varying length (say array lengths of 2-4).
An example input string array would be:
{"The", "dog", "ran", "away"}
And, for further example, my database might contain string arrays like this:
(length 2) {{"The", "dog"},{"dog", "ran"}, {"ran", "away"}}
(length 3) {{"The", "dog", "ran"}, ...} and so on
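To make that concrete, here is roughly how I keep the database in memory (simplified, and the names here are just for this post): sequences are grouped by length, with length 2 stored at position 0, which is what the sequences_by_length lookup in the code further down relies on.

using System.Collections.Generic;

// Simplified sketch of my sequence database, grouped by length.
// Position 0 holds the length-2 sequences, position 1 the length-3 ones, etc.
List<List<string[]>> sequences_by_length = new List<List<string[]>>
{
    new List<string[]>                      // length 2 -> stored at position 0
    {
        new[] { "The", "dog" },
        new[] { "dog", "ran" },
        new[] { "ran", "away" }
    },
    new List<string[]>                      // length 3 -> stored at position 1
    {
        new[] { "The", "dog", "ran" }
        // ... and so on
    }
};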
What I am attempting to do is recursively reduce my input string array down to a single token. Ideally it would parse something like this:
1) {"The", "dog", "ran", "away"}
Say that (seq1) = {"The", "dog"} and (seq2) = {"ran", "away"}
2) { (seq1), "ran", "away"}
3) { (seq1), (seq2)}
In my sequence database I know that, for instance, seq3 = {(seq1), (seq2)}
4) { (seq3) }
Once it is down to a single token, I'm happy and the function would end.
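In list terms, a single reduction step (going from 1) to 2) above) just overwrites the first element of the matched chunk with the sequence token and removes the rest of the chunk. Here's a rough sketch of that one step in isolation, with made-up names and the sequence token faked as a string:

using System.Collections.Generic;

// One reduction step: {"The", "dog", "ran", "away"} -> { (seq1), "ran", "away" }
// The list holds object so an element can be either a raw string or a sequence token.
List<object> tokens = new List<object> { "The", "dog", "ran", "away" };

object seq1 = "(seq1)";          // stand-in for the real sequence object for {"The", "dog"}
int found_array_position = 0;    // the chunk {"The", "dog"} matched starting at index 0
int current_size = 2;            // length of the matched chunk on this pass

tokens[found_array_position] = seq1;                              // replace the first element with the token
tokens.RemoveRange(found_array_position + 1, current_size - 1);   // drop the remaining chunk elements

// tokens is now { (seq1), "ran", "away" }, i.e. step 2) of the walkthrough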
Here is an outline of my current program logic:
public void Tokenize(List<T> string_array, int current_size)
{
    // retrieve all known sequences of length [current_size] (from global list array);
    // sequences of length 2 are stored in position 0 and so on
    loc_sequences_by_length = sequences_by_length[current_size - min_size];

    // escape cases
    if (string_array.Count == 1)
    {
        // finished successfully
        return;
    }
    else if (string_array.Count < current_size)
    {
        // checking sequences of greater length than input string, bail
        return;
    }
    else
    {
        // split input string into chunks of size [current_size] and compare to
        // local database of known sequences
        // (splitting code works fine)
        foreach (comparison)
        {
            if (match_found)
            {
                // update input string and recall function to find other matches
                string_array[found_array_position] = new_sequence;
                string_array.RemoveRange(found_array_position + 1, new_sequence.Length - 1);
                Tokenize(string_array, current_size);
            }
        }
    }

    // ran through unsuccessfully, increment length and try again for new sequence group
    current_size++;
    if (current_size > MAX_SIZE)
        return;
    else
        Tokenize(string_array, current_size);
}
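For reference, I kick the whole thing off at the minimum sequence length, roughly like this (simplified; the element type is shown as object here just so the list can hold either strings or sequence tokens):

// Simplified kick-off: start at min_size (2 in my case) and let Tokenize()
// work its way up toward MAX_SIZE.
var input = new List<object> { "The", "dog", "ran", "away" };
Tokenize(input, 2);    // 2 == min_size

// Expected end state: input.Count == 1, with input[0] holding the (seq3) token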
I thought it was straightforward enough, but I have been getting some strange results. It generally appears to work, but on closer review of my output data I'm seeing an issue: it works up to a certain point, and at that point my 'current_size' counter resets to the minimum value.
So it is called with a size of 2, then 3, then 4, then resets to 2. My assumption was that it would run up to my predetermined max size, and then bail completely.
I tried to simplify my code as much as possible, so there are probably some simple syntax errors in transcribing. If there is any other detail that may help an eagle-eyed SO user, please let me know and I'll edit.
Thanks in advance