views:

263

answers:

4

I want to implement a basic search/replace translation table in C; that is, it will read in a list of pairs of words from a config file, and go through a text received at runtime, replacing each source word it finds with the corresponding target word. For example, if my user-input text was

"Hello world, how are you today?"

and my config file was

world user
how why

running the function would return

"Hello user, why are you today?"

I could do this with a modest amount of tedium (currently looking at the glib string utility functions, because they're there), but I'm thinking this has got to be a fully solved problem in some library or another. Any pointers?

(No, this is not homework, although I'll admit the problem sounds reasonably homeworky :) I'm writing a libpurple plugin, hence the pure C requirement.)

A: 

You could check out GNU gettext. (Also see its Wikipedia article.)

Evan Shaw
gettext requires that input strings be known at compile time; i want to translate strings received at runtime. also i need to locate those strings in the input text. it's a different problem altogether.
Martin DeMello
Okay, that wasn't clear to me from your question. I don't know of anything off the top of my head then.
Evan Shaw
thanks for pointing that out; i edited the problem description to make it clearer
Martin DeMello
A: 

How about the GNU C library's regex engine?

Jamie Hale
i looked at that, but it didn't seem to offer much more for this particular problem than strstr did. it doesn't do replacements, and i'm doing exact-string matches anyway.libpcre would do it, but that seems a huge sledgehammer for a rather small nut. i'm just surprised that this exact problem hasn't been solved by some string library; i spent a while googling before posting here and came up blank.
Martin DeMello
+3  A: 

Hi Martin,

I too was astonished at how hard it was to find very simple string manipulation methods. What I wanted was Procedural Language equivalent of the object oriented string.replace() method. From what i can tell this is the essence of your problem as well... with such a method you could add the additional code to read in a file line by line and tokenize it on spaces.

What makes implementing such a method tricky is it's really an application decision to specify what the best way would be for allocating the buffer to put the transformed version of the string into. You have a few options: 1) Make the user pass in a buffer to the application and leave it up to the user to ensure that the buffer is always big enough for the transformed version. 2) Perform some dynamic memory allocation inside the method and force the caller to call free() on the returned pointer.

I opted for #1 because the overhead of dynamic memory allocation is too great for embedded applications. Also, it's requires the user to call free() later which is something that is pretty easy to forget.

The resulting function gets pretty ugly looking. I did a very quick implementation and I have included it below. This method should be tested further before being used in production. I ended up taking the project in a different direction before using it.

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <assert.h>

/*
 * searches an input string for occurrence of a particular string and replaces it with another.  The resulting string is
 * stored in a buffer which is passed in to the function. 
 * 
 * @param pDest is a buffer which the updated version of the string will be placed into.  THIS MUST BE PREALLOCATED.  It's 
          the callers responsibility to make sure that pDest is of sufficient size that the buffer will not be overflowed.
 * @param pDestLen is the number of chars in pDest
 * @param pSrc is a constant string which is the original string
 * @param pSearch is the string to search for in pSrc.
 * @param pReplacement is the string that pSearch will be replaced with.
 * @return if successful it returns the number of times pSearch was replaced in the string.  Otherwise it returns a negative number
 *         to indicate an error.  It returns -1 if one of the strings passed in == NULL, -2 if the destination buffer is of insufficient size.  
 *         Note: the value stored in pDest is undefined if an error occurs.  
 */
int string_findAndReplace( char* pDest, int pDestLen, const char* pSrc, const char* pSearch, const char* pReplacement) {
    int destIndex=0;
    char* next;
    const char* prev = pSrc;
    int copyLen=0;
    int foundCnt = 0;

    if( pDest == NULL || pDestLen == 0 || pSrc == NULL || pSrc == NULL || pReplacement == NULL ) {
     return -1;
    }

    // TODO: BEFORE EACH MEMCPY, IT SHOULD BE VERIFIED THAT IT WILL NOT COPY OUT OF THE BOUNDS OF THE BUFFER SPACE
    //       THIS IS A VERY BASIC CHECK 
    if( pDestLen < strlen(pSrc) ) {
     return -2;
    }


    memset(pDest, 0x00, pDestLen);

    //printf("Entered findAndReplace\r\n");

    do { 
     next = strstr( prev, pSearch );

     if( next != NULL ) {  
      //printf("  next -> %s\r\n", next);

      copyLen = (next-prev);

      // copy chars before the search string
      memcpy( &pDest[destIndex], prev, copyLen ); 
      destIndex += copyLen;

      // insert the replacement    
      memcpy( &pDest[destIndex], pReplacement, strlen(pReplacement) );
      destIndex += strlen(pReplacement);    

      prev = next;
      prev += strlen(pSearch);
      foundCnt++;   
     }
    }while( next != NULL );

    //copy what's left from prev to the end to the end of dest.
    copyLen = strlen(prev);
    memcpy( &pDest[destIndex], prev, copyLen+1); // +1 makes it null terminate.

    //printf("prev='%s'\r\ndest='%s'\r\n", prev, pDest);
    return foundCnt;
}


// --------- VERY BASIC TEST HARNESS FOR THE METHOD ABOVE --------------- // 

#define NUM_TESTS 8

// Very rudimentary test harness for the string_findAndReplace method.
int main(int argsc, char** argsv) {

int i=0;
char newString[1000];

char input[][1000] = { 
"Emergency condition has been resolved. The all clear has been issued.",
"Emergency condition has been resolved and the all clear has been issued.",
"lions, tigers, and bears",
"and something, and another thing and",
"too many commas,, and, also androids",
" and and and,, and and ",
"Avoid doors, windows and large open rooms.",
"Avoid doors and windows."

};

char output[][1000] = { 
"Emergency condition has been resolved. The all clear has been issued.",
"Emergency condition has been resolved, and the all clear has been issued.",
"lions, tigers,, and bears",
"and something,, and another thing and",
"too many commas,, and, also androids",
", and, and, and,,, and, and, ",
"Avoid doors, windows, and large open rooms.",
"Avoid doors, and windows."
};

    char searchFor[] = " and ";
    char replaceWith[] = ", and ";

    printf("String replacer\r\n");

    for( i=0; i< NUM_TESTS; i++ ) {

     string_findAndReplace( newString, sizeof( newString ), input[i], searchFor, replaceWith );

     if( strcmp( newString, output[i] ) == 0 ) {
      printf("SUCCESS\r\n\r\n");
     }
     else {
      printf("FAILED: \r\n IN :'%s'\r\n OUT:'%s'\r\n EXP:'%s'\r\n\r\n", input[i],newString,output[i]);
     }

    }

    printf("\r\nDONE.\r\n");
    return 0;
}
blak3r
thanks blak3r, that looks like a useful starting point. as for the paucity of string methods, when going through websites to learn up glib i discovered that someone actually proposed adding a replace() function to the string module, complete with patch, and got shot down on the grounds that it wasn't generally useful enough!
Martin DeMello
+1  A: 

If you didn't have the config-file requirement, you could get (f)lex to generate C code for you. But this would mean a recompilation every time the word-pair list changed.

Maybe it's overkill, but you could store each word in a node of a linked list. That makes it pretty easy to construct new sentences by shuffling around and substituting words.

Jason Catena
The linked-list idea is a really good one if the substitutions are guaranteed to be words.
Chris Lutz
doesn't work because i get a string from libpurple and i'm expected to return it a string. i do start off by splitting it into words, though.
Martin DeMello