Let us say I have the following string:
"my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog"
1234567890123456789012345678901234567890123456789012345678901 <-- char pos
Now, I have written a regular expression that removes certain elements from the string above: in this example, all whitespace, all periods, and all commas.
I am left with the following transformed string:
"mydogjumpsandheisaverygooddog"
Now, I want to construct k-grams of this string. If I were to take 5-grams of the above string, they would look like:
mydog ydogj dogju ogjum gjump jumps umpsa ...
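For reference, the two steps so far can be sketched roughly as follows. This is a minimal illustration rather than my exact code: the pattern `[\s.,]+` and the names `KgramDemo`, `strip`, and `kgrams` are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class KgramDemo {
    // Remove all whitespace, periods, and commas (assumed pattern).
    static String strip(String source) {
        return source.replaceAll("[\\s.,]+", "");
    }

    // Slide a window of length k over the text, one character at a time.
    static List<String> kgrams(String text, int k) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++) {
            result.add(text.substring(i, i + k));
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog";
        String stripped = strip(source);
        System.out.println(stripped);
        System.out.println(kgrams(stripped, 5));
    }
}
```

Note that after `strip` runs, nothing in the result says where each character came from, which is exactly the information I am missing.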
The problem I have is that for each k-gram, I want to keep track of its character position in the original source text I listed first.
So "mydog" would have a start position of 0 and an end position of 11 (zero-based, end exclusive; that is, characters 1 through 11 on the ruler above). However, I have no mapping between the source text and the modified text, so I have no idea where a particular k-gram starts and ends in relation to the original, unmodified text. My program needs to keep track of this.
I am creating a list of k-grams like this:
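public class Kgram
{
    public int start;   // position in the source text where the k-gram begins
    public int end;     // position in the source text where the k-gram ends
    public String text; // the k-gram itself, taken from the modified text
}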
where start and end are positions in the source text (top), and text is the k-gram's text after the modifications.
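To make the goal concrete, here is a rough sketch of the output I am after. It assumes the regex is replaced by a character-by-character filter that remembers each kept character's source index. The names `KgramPositions`, `Span`, and `origin` are made up for illustration, and I do not know whether this is the best approach.

```java
import java.util.ArrayList;
import java.util.List;

public class KgramPositions {
    // Hypothetical holder mirroring the Kgram class, with text as a String.
    static class Span {
        final int start;   // source index of the k-gram's first character
        final int end;     // source index just past its last character
        final String text; // k-gram text after filtering

        Span(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    static List<Span> kgrams(String source, int k) {
        StringBuilder kept = new StringBuilder();
        // origin.get(i) is the source index of kept.charAt(i).
        List<Integer> origin = new ArrayList<>();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (!Character.isWhitespace(c) && c != '.' && c != ',') {
                kept.append(c);
                origin.add(i);
            }
        }

        List<Span> result = new ArrayList<>();
        for (int i = 0; i + k <= kept.length(); i++) {
            int start = origin.get(i);
            int end = origin.get(i + k - 1) + 1; // exclusive end in the source
            result.add(new Span(start, end, kept.substring(i, i + k)));
        }
        return result;
    }
}
```

With the sample string and k = 5, the first entry would be "mydog" with start 0 and end 11, which is the mapping I described for that k-gram.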
Can anyone point me in the right direction for the best way to solve this problem?