tags:

views:

230

answers:

7

So pray tell, how would I go about getting the largest contiguous string of letters out of a string of garbage in C? Here's an example:

char *s = "(2034HEY!!11   th[]thisiswhatwewant44";

Would return...

thisiswhatwewant

I had this on a quiz the other day...and it drove me nuts (still is) trying to figure it out!

UPDATE:

My fault guys, I forgot to include the fact that the only function you are allowed to use is the strlen function. Thus making it harder...

+3  A: 

Uae strtok() to split your string into tokens, using all non-letter characters as delimiters, and find the longest token.

To find the longest token you will need to organise some storage for tokens - I'd use linked list.

As simple as this.

EDIT

Ok, if strlen() is the only function allowed, you can first find the length of your source string, then loop through it and replace all non-letter characters with NULL - basically that's what strtok() does.

Then you need to go through your modified source string second time, advancing one token at a time, and find the longest one, using strlen().

qrdl
+1  A: 

This sounds similar to the standard UNIX 'strings' utility.

Keep track of the longest run of printable characters terminated by a NULL. Walk through the bytes until you hit a printable character. Start counting. If you hit a non-printable character stop counting and throw away the starting point. If you hit a NULL, check to see if the length of the current run is greater then the previous record holder. If so record it, and start looking for the next string.

Charles E. Grant
A: 

First, define "string" and define "garbage". What do you consider a valid, non-garbage string? Write down a concrete definition you can program - this is how programming specs get written. Is it a sequence of alphanumeric characters? Should it start with a letter and not a digit?

Once you get that figured out, it's very simple to program. Start with a naive method of looping over the "garbage" looking for what you need. Once you have that, look up useful C library functions (like strtok) to make the code leaner.

Eli Bendersky
+1  A: 

What defines the "good" substrings compared to the many others -- being lowercase alphas only? (i.e., no spaces, digits, punctuation, uppercase, &c)?

Whatever the predicate P that checks for a character being "good", a single pass over s applying P to each character lets you easily identify the start and end of each "run of good characters", and remember and pick the longest. In pseudocode:

longest_run_length = 0
longest_run_start = longest_run_end = null
status = bad
for i in (all indices over s):
  if P(s[i]):  # current char is good
    if status == bad:  # previous one was bad
      current_run_start = current_run_end = i
      status = good
    else: # previous one was also good
      current_run_end = i
  else:  # current char is bad
    if status == good:  # previous one was good -> end of run
      current_run_length = current_run_end - current_run_start + 1
      if current_run_length > longest_run_length:
        longest_run_start = current_run_start
        longest_run_end = current_run_end
        longest_run_length = current_run_length
      status = bad

# if a good run ends with end-of-string:
if status == good:  # previous one was good -> end of run
  current_run_length = current_run_end - current_run_start + 1
  if current_run_length > longest_run_length:
    longest_run_start = current_run_start
    longest_run_end = current_run_end
    longest_run_length = current_run_length
Alex Martelli
I wonder why your pseudocode looks so much like Python...
Chris Lutz
@Chris, because Python _is_ (just about) "executable pseudocode" -- so it lets you (just by transcribing a few generalities such as `all indices over s` into Python, `range(len(s))` in this case, and using little tricks such as my dataholder class when you need assign-and-test in your pseudocode) verify your pseudocode's logic by running some test cases (before you transcribe it into whatever other language you may require -- C in this case -- if you DO require some other specific language of course). So why use any different pseudocode...?
Alex Martelli
A: 

While I waited for you to post this as a question I coded something up.

This code iterates through a string passed to a "longest" function, and when it finds the first of a sequence of letters it sets a pointer to it and starts counting the length of it. If it is the longest sequence of letters yet seen, it sets another pointer (the 'maxStringStart' pointer) to the beginning of that sequence until it finds a longer one.

At the end, it allocates enough room for the new string and returns a pointer to it.

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

int isLetter(char c){

    return ( (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') );

}

char *longest(char *s) {

    char *newString = 0;
    int maxLength = 0;
    char *maxStringStart = 0;
    int curLength = 0;
    char *curStringStart = 0;

    do {

     //reset the current string length and skip this
     //iteration if it's not a letter
     if( ! isLetter(*s)) {
      curLength = 0;
      continue;
     }

     //increase the current sequence length. If the length before
     //incrementing is zero, then it's the first letter of the sequence:
     //set the pointer to the beginning of the sequence of letters
     if(curLength++ == 0) curStringStart = s;

     //if this is the longest sequence so far, set the
     //maxStringStart pointer to the beginning of it
     //and start increasing the max length.
     if(curLength > maxLength) {
      maxStringStart = curStringStart;
      maxLength++;
     }

    } while(*s++);

    //return null pointer if there were no letters in the string,
    //or if we can't allocate any memory.
    if(maxLength == 0) return NULL;
    if( ! (newString = malloc(maxLength + 1)) ) return NULL;

    //copy the longest string into our newly allocated block of
    //memory (see my update for the strlen() only requirement)
    //and null-terminate the string by putting 0 at the end of it.
    memcpy(newString, maxStringStart, maxLength);
    newString[maxLength + 1] = 0;

    return newString;

}

int main(int argc, char *argv[]) {

    int i;

    for(i = 1; i < argc; i++) {
     printf("longest all-letter string in argument %d:\n", i);
     printf("   argument: \"%s\"\n", argv[i]);
     printf("    longest: \"%s\"\n\n", longest(argv[i]));
    }

    return 0;

}

This is my solution in simple C, without any data structures.

I can run it in my terminal like this:

~/c/t $ ./longest "hello there, My name is Carson Myers." "abc123defg4567hijklmnop890"
longest all-letter string in argument 1:
   argument: "hello there, My name is Carson Myers."
    longest: "Carson"

longest all-letter string in argument 2:
   argument: "abc123defg4567hijklmnop890"
    longest: "hijklmnop"

~/c/t $

the criteria for what constitutes a letter could be changed in the isLetter() function easily. For example:

return ( 
    (c >= 'a' && c <= 'z') ||
    (c >= 'A' && c <= 'Z') ||
    (c == '.') || 
    (c == ' ') || 
    (c == ',') );

would count periods, commas and spaces as 'letters' also.


as per your update:

replace memcpy(newString, maxStringStart, maxLength); with:

int i;
for(i = 0; i < maxLength; i++)
    newString[i] = maxStringStart[i];

however, this problem would be much more easily solved with the use of the C standard library:

char *longest(char *s) {

    int longest = 0;
    int curLength = 0;
    char *curString = 0;
    char *longestString = 0;
    char *tokens = " ,.!?'\"()@$%\r\n;:+-*/\\";

    curString = strtok(s, tokens);
    do {

     curLength = strlen(curString);
     if( curLength > longest ) {
      longest = curLength;
      longestString = curString;
     }

    } while( curString = strtok(NULL, tokens) );

    char *newString = 0;

    if( longest == 0 ) return NULL;
    if( ! (newString = malloc(longest + 1)) ) return NULL;

    strcpy(newString, longestString);

    return newString;

}
Carson Myers
What's wrong with `#include <ctype.h>` and `isalpha()`?
Jonathan Leffler
nothing I suppose, I just didn't think of it -- just that now he says he can only use `strlen`. In his last question he used `malloc` so I guess that one's still okay too.
Carson Myers
I think that's a silly requirement - especially because your replacement function isn't guaranteed to work by the C standard (unfortunately, alphabetical characters aren't required to be consecutive, only digit characters). I think you should make a second version that uses all the standard functions, if only to show later readers how much easier this is with the standard library.
Chris Lutz
done, but it's getting to be a pretty hefty answer.
Carson Myers
A: 

Another variant.

#include <stdio.h>
#include <string.h>

int main(void)
{
        char s[] = "(2034HEY!!11   th[]thisiswhatwewant44";
        int len = strlen(s);
        int i = 0;
        int biggest = 0;
        char* p = s;

        while (p[0])
        {
                if (!((p[0] >= 'A' && p[0] <= 'Z') || (p[0] >= 'a' && p[0] <= 'z')))
                {
                        p[0] = '\0';
                }

                p++;
        }

        for (; i < len; i++)
        {
                if (s[i] && strlen(&s[i]) > biggest)
                {
                        biggest = strlen(&s[i]);
                        p = &s[i];
                }
        }

        printf("%s\n", p);
        return 0;
}
FeatureCreep
Hey - no printing the result - you're only allowed to use strlen()! :D Which just goes to emphasize how pointless the question is.
Jonathan Leffler
Haha, you're right. This is a weird question :).
FeatureCreep
why not `*p` instead of `p[0]`?
Carson Myers
That would work too.
FeatureCreep
+1  A: 

Why use strlen() at all? Here's my version which uses no function whatsoever.

#ifdef UNIT_TEST
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#endif

/*
// largest_letter_sequence()
// Returns a pointer to the beginning of the largest letter
//   sequence (including trailing characters which are not letters)
//   or NULL if no letters are found in s
// Passing NULL in `s` causes undefined behaviour
// If the string has two or more sequences with the same number of letters
//   the return value is a pointer to the first sequence.
// The parameter `len`, if not NULL, will have the size of the letter sequence
//
// This function assumes an ASCII-like character set
//   ('z' > 'a'; 'z' - 'a' == 25; ('a' <= each of {abc...xyz} <= 'z'))
//   and the same for uppercase letters
// Of course, ASCII works for the assumptions :)
*/
const char *largest_letter_sequence(const char *s, size_t *len) {
  const char *p = NULL;
  const char *pp = NULL;
  size_t curlen = 0;
  size_t maxlen = 0;

  while (*s) {
    if ((('a' <= *s) && (*s <= 'z')) || (('A' <= *s) && (*s <= 'Z'))) {
      if (p == NULL) p = s;
      curlen++;
      if (curlen > maxlen) {
        maxlen = curlen;
        pp = p;
      }
    } else {
      curlen = 0;
      p = NULL;
    }
    s++;
  }
  if (len != NULL) *len = maxlen;
  return pp;
}

#ifdef UNIT_TEST
void fxtest(const char *s) {
  char *test;
  const char *p;
  size_t len;

  p = largest_letter_sequence(s, &len);
  if (len && (len < 999)) {
    test = malloc(len + 1);
    if (!test) {
      fprintf(stderr, "No memory.\n");
      return;
    }
    strncpy(test, p, len);
    test[len] = 0;
    printf("%s ==> %s\n", s, test);
    free(test);
  } else {
    if (len == 0) {
      printf("no letters found in \"%s\"\n", s);
    } else {
      fprintf(stderr, "ERROR: string too large\n");
    }
  }
}

int main(void) {
  fxtest("(2034HEY!!11   th[]thisiswhatwewant44");
  fxtest("123456789");
  fxtest("");
  fxtest("aaa%ggg");
  return 0;
}
#endif
pmg