ansaurus

Question

How can I keep track of character positions after I remove elements from a string?

Answer 1

+5 A:

Don't use a regular expression 'replace' API to do your replacing. Only use regexps to find the places you want to modify, do the mod yourself, and maintain an offset mapping. One form I've used is an array of ints as big as the original string, storing 'n chars deleted' here values, but there are a host of other possibilities.

The basic data structure here is an array of pairs. Each pair contains an offset and a correction. Depending on time/space tradeoffs, you may prefer to spread the information out over a data structure as large as the original string.

bmargulies 2010-02-20 22:04:49

An int array as large as the original source text, or as large as the modified source text? If I had an integer array as large as large as the source text, why would the number of n-characters deleted here ever be greater than 1? Since there is an int position for each character? Or did I misinterpret? Can you elaborate some?

Simucal 2010-02-20 22:45:01

I'll add to the answer.

bmargulies 2010-02-20 23:19:56

Answer 2

+5 A:

Here's how I would solve this problem in Haskell:

kgramify k string =
  let charsWithPos = zip string [1..]  -- attach original position to each char
      goodCWP      = filter (not o isWhitePeriodOrComma o fst) charsWithPos -- drop nasty chars
      groups       = takeEveryK k goodCWP -- clump remaining chars in groups of size k
      posnOfGroup g = (snd (head g), map fst g) -- position of first char with group
  in  map posnOfGroup groups

In informal English:

Tag each character with its position
Filter out uninteresting (character, position) pairs
Take the remaining list of pairs and group them into a list of lists of length k
For each inner list, take the position of the first character, and pair it with the list of the all the characters (with the positions dropped)

In any functional language like Clean, Haskell, ML, or Scheme, this kind of thing is very easy. In a language with explicit memory allocation (new) or even worse, malloc and free, such a solution would be very tedious.

Norman Ramsey 2010-02-20 22:38:46

Heck, if you know J or APL, programming in a language that requires use of the space bar is tedious and long-winded ;-)

Steve Jessop 2010-02-20 22:40:43

@Steve: I'd love to see the APL version---sure it would make mincement of this problem. I don't think I've written a line of APL in more than 25 years...

Norman Ramsey 2010-02-20 23:15:42

Unfortunately I've never found the time or the three-foot-wide keyboard necessary to learn APL properly.

Steve Jessop 2010-02-21 00:17:28

This seems to be missing the end points of the groups.

Svante 2010-02-21 00:43:57

Answer 3

+2 A:

A C solution, to show that as Norman Ramsey says, it's pretty tedious. It takes the filter as a callback with context, just for kicks, but in your case you could pass 0 as the filterdata and not_wspc as the filter:

int not_wspc(void *, char c) {
    if isspace((unsigned char)c) return 0;
    if ((c == '.') || (c == ',')) return 0;
    return 1;
}

typedef struct {
    char c;
    int pos;
} charwithpos;

KGram *foo(const char *input, int (*filter)(void *,char), void *filterdata) {
    size_t len = strlen(input);
    charwithpos *filtered = malloc(len * sizeof(*filtered));
    assert(filtered);

    // combine Norman's zip and filter steps
    charwithpos *current = filtered
    for (size_t i = 0; i < len; ++i) {
        if (filter(filterdata, input[i])) {
            current->c = input[i];
            current->pos = i;
            ++current;
        }
    }
    size_t shortlen = (current - filtered);

    // wouldn't normally recommend returning malloced data, but
    // illustrates the point.
    KGram *result = malloc((shortlen / 5 + 1) * sizeof(*result));
    assert(result);

    // take each 5 step
    KGram *currentgram = result;
    current = filtered;
    for (size_t i = 0; i < shortlen; ++i) {
        currentgram->text[i%5] = current->c;
        if ((i % 5) == 0) {
            currentgram->start = current->pos;
        } else if ((i % 5) == 4) {
            currentgram->end = current->pos;
            ++currentgram;
        }
        ++current;
    }
    if (shortlen % 5) != 0 {
        currentgram->end = filtered[shortlen-1].pos;
        currentgram->text[shortlen%5] = 0;
    }

    free(filtered);
    return(result);
}

Or something like that, I can't be actually compiling and testing it. Obviously this has the significant weakness that filtered sees the chars one at a time, which means it cannot apply backtracking algorithms. You could get around it by passing the whole string into the filter, so that if necessary it can do a lot of work on the first call, and store the results to return on all the rest of the calls. But if you need to apply regular-expression-like logic to arbitrary types, then C is probably not the right language to use.

Here's the beginnings of a C++ solution, without even using <functional>. Not sure what Norman saying about languages with new: just because the language has it doesn't mean you have to use it ;-)

template <typename OutputIterator>
struct KGramOutput {
    OutputIterator dest;
    KGram kgram;
    KGramOutput(OutputIterator dest) : dest(dest) {}
    void add(char, size_t);
    void flush(void);
};

template <typename InputIterator, typename OutputIterator, typename Filter>
void foo(InputIterator first, InputIterator last, OutputIterator dest, Filter filter) {
    size_t i = 0;
    KGramOutput<OutputIterator> kgram(dest);
    while (first != last) {
        if (filter(*first)) kgram.add(*first, i);
        ++first;
        ++i;
    }
    kgram.flush();
}

The add and flush functions are a bit tedious, they have to bundle up 5 pairs into a KGram struct, and then do *dest++ = kgram. The user could pass for example a pushback_iterator over a vector<KGram> as the output iterator. Btw the '5' and the 'char' could be template parameters too.

Steve Jessop 2010-02-20 23:08:21

Answer 4

+1 A:

This can be done in a single pass without needing to construct intermediate character-position pairs:

(defclass k-gram ()
  ((start :reader start :initarg :start)
   (end :accessor end)
   (text :accessor text)))

(defmethod initialize-instance :after ((k-gram k-gram) &rest initargs &key k)
  (declare (ignorable initargs))
  (setf (slot-value k-gram 'text) (make-array k :element-type 'character)))

(defun k-gramify (string k ignore-string)
  "Builds the list of complete k-grams with positions from the original
   text, but with all characters in ignore-string ignored."
  (loop
     for character across string
     for position upfrom 0
     with k-grams = ()
     do (unless (find character ignore-string)
          (push (make-instance 'k-gram :k k :start position) k-grams)
          (loop
             for k-gram in k-grams
             for i upfrom 0 below k
             do (setf (aref (text k-gram) i) character
                      (end k-gram) (1+ position))))
     finally (return (nreverse (nthcdr (- k 1) k-grams)))))

Svante 2010-02-21 00:48:10

ansaurus

tags:

views:

answers:

How can I keep track of character positions after I remove elements from a string?

related questions