views:

110

answers:

4

Let us say I have the following string:

 "my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog"  
  1234567890123456789012345678901234567890123456789012345678901 <-- char pos

Now, I have written a regular expression to remove certain elements from the string above, in this example, all whitespace, all periods, and all commas.

I am left with the following transformed string:

 "mydogjumpsandheisaverygooddog"

Now, I want to construct k-grams of this string. Let us say I were to take 5-grams of the above string, it would look like:

  mydog ydogj dogju ogjum gjump jumps umpsa ...

The problem I have is that for each k-gram, I want to keep track of its original character position in the first source text I listed.

So, "mydog", would have a start position of "0" and an end position of "11". However, I have no mapping between the source text and the modified text. So, I have no idea where a particular k-gram starts and ends in relation to the original, unmodified text. This is important to my program to keep track of.

I am creating a list of k-grams like this:

public class Kgram
{
    public int start;  
    public int end;  
    public int text;  
}

where start and end are positions in the source text (top) and the text is that of the k-gram text after the modifications.

Can anyone point me in the right direction for the best way to solve this problem?

+5  A: 

Don't use a regular expression 'replace' API to do your replacing. Only use regexps to find the places you want to modify, do the mod yourself, and maintain an offset mapping. One form I've used is an array of ints as big as the original string, storing 'n chars deleted' here values, but there are a host of other possibilities.

The basic data structure here is an array of pairs. Each pair contains an offset and a correction. Depending on time/space tradeoffs, you may prefer to spread the information out over a data structure as large as the original string.

bmargulies
An int array as large as the original source text, or as large as the modified source text? If I had an integer array as large as large as the source text, why would the number of n-characters deleted here ever be greater than 1? Since there is an int position for each character? Or did I misinterpret? Can you elaborate some?
Simucal
I'll add to the answer.
bmargulies
+5  A: 

Here's how I would solve this problem in Haskell:

kgramify k string =
  let charsWithPos = zip string [1..]  -- attach original position to each char
      goodCWP      = filter (not o isWhitePeriodOrComma o fst) charsWithPos -- drop nasty chars
      groups       = takeEveryK k goodCWP -- clump remaining chars in groups of size k
      posnOfGroup g = (snd (head g), map fst g) -- position of first char with group
  in  map posnOfGroup groups

In informal English:

  1. Tag each character with its position
  2. Filter out uninteresting (character, position) pairs
  3. Take the remaining list of pairs and group them into a list of lists of length k
  4. For each inner list, take the position of the first character, and pair it with the list of the all the characters (with the positions dropped)

In any functional language like Clean, Haskell, ML, or Scheme, this kind of thing is very easy. In a language with explicit memory allocation (new) or even worse, malloc and free, such a solution would be very tedious.

Norman Ramsey
Heck, if you know J or APL, programming in a language that requires use of the space bar is tedious and long-winded ;-)
Steve Jessop
@Steve: I'd love to see the APL version---sure it would make mincement of this problem. I don't think I've written a line of APL in more than 25 years...
Norman Ramsey
Unfortunately I've never found the time or the three-foot-wide keyboard necessary to learn APL properly.
Steve Jessop
This seems to be missing the end points of the groups.
Svante
+2  A: 

A C solution, to show that as Norman Ramsey says, it's pretty tedious. It takes the filter as a callback with context, just for kicks, but in your case you could pass 0 as the filterdata and not_wspc as the filter:

int not_wspc(void *, char c) {
    if isspace((unsigned char)c) return 0;
    if ((c == '.') || (c == ',')) return 0;
    return 1;
}

typedef struct {
    char c;
    int pos;
} charwithpos;

KGram *foo(const char *input, int (*filter)(void *,char), void *filterdata) {
    size_t len = strlen(input);
    charwithpos *filtered = malloc(len * sizeof(*filtered));
    assert(filtered);

    // combine Norman's zip and filter steps
    charwithpos *current = filtered
    for (size_t i = 0; i < len; ++i) {
        if (filter(filterdata, input[i])) {
            current->c = input[i];
            current->pos = i;
            ++current;
        }
    }
    size_t shortlen = (current - filtered);

    // wouldn't normally recommend returning malloced data, but
    // illustrates the point.
    KGram *result = malloc((shortlen / 5 + 1) * sizeof(*result));
    assert(result);

    // take each 5 step
    KGram *currentgram = result;
    current = filtered;
    for (size_t i = 0; i < shortlen; ++i) {
        currentgram->text[i%5] = current->c;
        if ((i % 5) == 0) {
            currentgram->start = current->pos;
        } else if ((i % 5) == 4) {
            currentgram->end = current->pos;
            ++currentgram;
        }
        ++current;
    }
    if (shortlen % 5) != 0 {
        currentgram->end = filtered[shortlen-1].pos;
        currentgram->text[shortlen%5] = 0;
    }

    free(filtered);
    return(result);
}

Or something like that, I can't be actually compiling and testing it. Obviously this has the significant weakness that filtered sees the chars one at a time, which means it cannot apply backtracking algorithms. You could get around it by passing the whole string into the filter, so that if necessary it can do a lot of work on the first call, and store the results to return on all the rest of the calls. But if you need to apply regular-expression-like logic to arbitrary types, then C is probably not the right language to use.

Here's the beginnings of a C++ solution, without even using <functional>. Not sure what Norman saying about languages with new: just because the language has it doesn't mean you have to use it ;-)

template <typename OutputIterator>
struct KGramOutput {
    OutputIterator dest;
    KGram kgram;
    KGramOutput(OutputIterator dest) : dest(dest) {}
    void add(char, size_t);
    void flush(void);
};

template <typename InputIterator, typename OutputIterator, typename Filter>
void foo(InputIterator first, InputIterator last, OutputIterator dest, Filter filter) {
    size_t i = 0;
    KGramOutput<OutputIterator> kgram(dest);
    while (first != last) {
        if (filter(*first)) kgram.add(*first, i);
        ++first;
        ++i;
    }
    kgram.flush();
}

The add and flush functions are a bit tedious, they have to bundle up 5 pairs into a KGram struct, and then do *dest++ = kgram. The user could pass for example a pushback_iterator over a vector<KGram> as the output iterator. Btw the '5' and the 'char' could be template parameters too.

Steve Jessop
+1  A: 

This can be done in a single pass without needing to construct intermediate character-position pairs:

(defclass k-gram ()
  ((start :reader start :initarg :start)
   (end :accessor end)
   (text :accessor text)))

(defmethod initialize-instance :after ((k-gram k-gram) &rest initargs &key k)
  (declare (ignorable initargs))
  (setf (slot-value k-gram 'text) (make-array k :element-type 'character)))

(defun k-gramify (string k ignore-string)
  "Builds the list of complete k-grams with positions from the original
   text, but with all characters in ignore-string ignored."
  (loop
     for character across string
     for position upfrom 0
     with k-grams = ()
     do (unless (find character ignore-string)
          (push (make-instance 'k-gram :k k :start position) k-grams)
          (loop
             for k-gram in k-grams
             for i upfrom 0 below k
             do (setf (aref (text k-gram) i) character
                      (end k-gram) (1+ position))))
     finally (return (nreverse (nthcdr (- k 1) k-grams)))))
Svante