ansaurus

Question

Parsing text in C

Answer 1

A:

You could try using strtok() to tokenize each line, and then check whether each token is a number or a word (a fairly trivial check once you have the token string - just look at the first character of the token).
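As a rough sketch of that idea (the function name `classify_tokens` and the delimiter set are assumptions made for illustration, not part of the question), tokenizing a line and classifying each token by its first character might look like this:

```c
#include <ctype.h>
#include <string.h>

/* Walk the tokens of `line` (which strtok() modifies in place) and count how
 * many start with a digit and how many do not. */
void classify_tokens(char *line, int *numbers, int *words)
{
    *numbers = 0;
    *words = 0;
    for (char *tok = strtok(line, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n")) {
        /* First-character check only: "12abc" would still count as a number. */
        if (isdigit((unsigned char)tok[0]))
            (*numbers)++;
        else
            (*words)++;
    }
}
```

Note that strtok() writes NUL bytes into the buffer it scans, so the line has to be a writable array rather than a string literal.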

Amber 2009-09-05 21:06:54

Just looking at the first character of the token isn't a very robust check. I wouldn't trust data from a file that much.

Whisty 2009-09-05 21:29:19

Depends on the source of the file. If these are internal files generated by the application (or pre-existing files for which the format is strict and already known), then it's quite possible that a robust check isn't needed.

Amber 2009-09-05 21:33:50

Generally, strtok() is not a particularly good way to go about things. Doubly not in a threaded program. Also, if the required storage is 'string possibly containing spaces' plus number, strtok is likely to break things up into too many parts.

Jonathan Leffler 2009-09-06 00:31:04

I had actually slightly misread the question - I initially thought they wanted to grab each of the words individually. Since that's not the case, strtok isn't really appropriate.

Amber 2009-09-06 03:33:54

Answer 2

+4 A:

Edit: You can use pNum-buf to get the length of the alphabetical part of the string, and use strncpy() to copy that into another buffer. Be sure to add a '\0' to the end of the destination buffer. I would insert this code before the pNum++.

int len = pNum-buf;
strncpy(newBuf, buf, len);
newBuf[len] = '\0';

You could read the entire line into a buffer and then use:

char *pNum;
if ((pNum = strrchr(buf, ' ')) != NULL) {
  pNum++;
}

to get a pointer to the number field.
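Putting the two pieces together, a minimal sketch could look like the following. The helper name `split_at_last_space` and its return convention are assumptions for illustration; it also uses strtol() to check that the last field really is a number, which the snippets above do not do.

```c
#include <stdlib.h>
#include <string.h>

/* Split `buf` at its last space: the text before it is copied into `text`
 * and the field after it is converted with strtol().
 * Returns 0 on success, -1 if there is no space, the text does not fit,
 * or the last field is not a complete decimal number. */
int split_at_last_space(const char *buf, char *text, size_t text_size, long *number)
{
    const char *pNum = strrchr(buf, ' ');
    if (pNum == NULL)
        return -1;

    size_t len = (size_t)(pNum - buf);   /* length of the part before the space */
    if (len >= text_size)
        return -1;

    char *end;
    long value = strtol(pNum + 1, &end, 10);
    if (end == pNum + 1 || (*end != '\0' && *end != '\n'))
        return -1;

    memcpy(text, buf, len);
    text[len] = '\0';
    *number = value;
    return 0;
}
```

The input buffer is left untouched here, so the same line can still be printed in an error message if parsing fails.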

Rob Jones 2009-09-05 21:07:04

That's what I was writing, thanks to Stack Overflow's orange ajaxy alert :-)

p4bl0 2009-09-05 21:08:54

Heh, I'm usually on the other side of the alert too.

Rob Jones 2009-09-05 21:15:09

That works, but what about the alphabetical part? How do I copy it up to the last space?

2009-09-05 21:26:11

Thank you very much.

2009-09-05 21:42:33

Answer 3

A:

Assuming that the number is immediately followed by '\n', you can read each line into a char buffer, use sscanf() with a "%d" conversion on the line to get the number, and then calculate how many characters that number occupies at the end of the text string.
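One possible reading of this suggestion, sketched under extra assumptions that are not in the answer: the text part contains no digits, exactly one space separates it from the number, and the number is non-negative and written without a sign or leading zeros. The function name `text_length_before_number` is made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Skip everything that is not a digit, read the number with %d, then work
 * out how many characters it occupies at the end of the line.
 * Returns the length of the text part, or -1 on failure. */
int text_length_before_number(const char *line, int *number)
{
    if (sscanf(line, "%*[^0-9]%d", number) != 1)
        return -1;

    int digits = snprintf(NULL, 0, "%d", *number);   /* characters the number takes */
    size_t len = strlen(line);
    if (len > 0 && line[len - 1] == '\n')
        len--;                                       /* ignore the trailing newline */

    /* What remains is text + one separating space + the digits. */
    if (len < (size_t)digits + 1)
        return -1;
    return (int)(len - (size_t)digits - 1);
}
```

If the text part may itself contain digits, scanning backwards from the end of the line is the safer route.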

Liran Orevi 2009-09-05 21:27:37

Answer 4

A:

fscanf(file, "%s %d", word, &value);

This gets the values directly into a string and an integer, and copes with variations in whitespace and numerical formats, etc.
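For the single-word case, a hedged sketch of the same conversion with a field width and a return-value check is shown below. It uses sscanf() on a line already in memory, which accepts the same conversion specifiers as fscanf(); `parse_word_value` and `WORD_SIZE` are names invented for the example.

```c
#include <stdio.h>

#define WORD_SIZE 64   /* illustrative buffer size */

/* Parse "word number" from a line. The width 63 keeps %s inside the
 * 64-byte buffer, leaving room for the terminating NUL.
 * Returns 1 when both fields were converted, 0 otherwise. */
int parse_word_value(const char *line, char word[WORD_SIZE], int *value)
{
    return sscanf(line, "%63s %d", word, value) == 2;
}
```

Given "apple pie 42" this returns 0, because %s stops at the first space and %d then fails on "pie".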

Edit

Oops, I forgot that you had spaces between the words. In that case, I'd do the following. (Note that it truncates the original text in 'line'.)

// Scan to find the last space in the line
char *p = line;
char *lastSpace = NULL;
while(*p != '\0')
{
    if (*p == ' ')
        lastSpace = p;
    p++;
}


if (lastSpace == NULL)
    return("parse error");

// Replace the last space in the line with a NUL
*lastSpace = '\0';

// Advance past the NUL to the first character of the number field
lastSpace++;

char *word = line;
int number = atoi(lastSpace);

You can solve this using stdlib functions, but the above is likely to be more efficient as you're only searching for the characters you are interested in.

Jason Williams 2009-09-05 21:28:15

The %s will only match up to the next whitespace character.

Rob Jones 2009-09-05 21:39:00

Duh, I read the example, then read the format description below it and forgot that the format could have multiple spaces. (blush!)

Jason Williams 2009-09-05 21:42:40

Answer 5

A:

Depending on how complex your strings become, you may want to use the PCRE library. At least that way you can compile a Perl-style regular expression to split your lines. It may be overkill though.
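To give a feel for the regex approach without guessing at PCRE's API, here is a sketch that swaps in the POSIX <regex.h> functions (regcomp/regexec) instead. The pattern, the function name `split_with_regex`, and the assumption of a non-negative decimal number after a single space are all choices made for this example.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Split "some words 42" into text and number using a POSIX extended regex.
 * Returns 0 on success, -1 on no match or if the text does not fit. */
int split_with_regex(const char *line, char *text, size_t text_size, long *number)
{
    regex_t re;
    regmatch_t m[3];   /* m[0] whole match, m[1] text, m[2] digits */
    int rc = -1;

    /* Text, one space, digits, optional newline, end of string. */
    if (regcomp(&re, "^(.*) ([0-9]+)\n?$", REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 3, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < text_size) {
            memcpy(text, line + m[1].rm_so, len);
            text[len] = '\0';
            *number = strtol(line + m[2].rm_so, NULL, 10);
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Compiling the pattern on every call is wasteful; in a real program you would probably compile it once and reuse it for every line.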

KFro 2009-09-05 21:34:35

Answer 6

A:

Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine if each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example:

#include <stdlib.h>
#include <stdio.h>
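#include <string.h>

/* Illustrative limits; the original leaves these undefined, so pick values that fit your data. */
#define MAX_LINE_LENGTH 256
#define MAX_STR_SIZE 64
#define MAX_NUM_STRINGS 32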

/**
 * Read the next line from the file, splitting the tokens into 
 * multiple strings and a single integer. Assumes input lines
 * never exceed MAX_LINE_LENGTH and each individual string never
 * exceeds MAX_STR_SIZE.  Otherwise things get a little more
 * interesting.  Also assumes that the integer is the last 
 * thing on each line.  
 */
int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
{
  char buffer[MAX_LINE_LENGTH];
  int rval = 1;
  if (fgets(buffer, sizeof buffer, in))
  {
    char *token = strtok(buffer, " ");
    *numStrings = 0;
    while (token) 
    {
      char *chk;
      *value = (int) strtol(token, &chk, 10);
      if (*chk != 0 && *chk != '\n')
      {
        strcpy(strs[(*numStrings)++], token);
      }
      token = strtok(NULL, " ");
    }
  }
  else
  {
    /** 
     * fgets() hit either EOF or error; either way return 0
     */
    rval = 0;
  }
  return rval;
}
/**
 * sample main
 */
int main(void)
{
  FILE *input;
  char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
  int numStrings;
  int value;

  input = fopen("datafile.txt", "r");
  if (input)
  {
    while (getNextLine(input, strings, &numStrings, &value))
    {
      /**
       * Do something with strings and value here
       */
    }
    fclose(input);
  }
  return 0;
}

John Bode 2009-09-06 00:41:37

Answer 7

A:

Given the description, I think I'd use a variant of this (now tested) C99 code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

struct word_number
{
    char word[128];
    long number;
};

int read_word_number(FILE *fp, struct word_number *wnp)
{
    char buffer[140];
    if (fgets(buffer, sizeof(buffer), fp) == 0)
        return EOF;
    size_t len = strlen(buffer);
    if (buffer[len-1] != '\n')  // Error if line too long to fit
        return EOF;
    buffer[--len] = '\0';
    char *num = &buffer[len-1];
    while (num > buffer && !isspace(*num))
        num--;
    if (num == buffer)         // No space in input data
        return EOF;
    char *end;
    wnp->number = strtol(num+1, &end, 0);
    if (*end != '\0')  // Invalid number as last word on line
        return EOF;
    *num = '\0';
    if (num - buffer >= sizeof(wnp->word))  // Non-number part too long
        return EOF;
    memcpy(wnp->word, buffer, num - buffer + 1);  // +1 copies the terminating '\0'
    return(0);
}

int main(void)
{
    struct word_number wn;
    while (read_word_number(stdin, &wn) != EOF)
        printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
    return(0);
}

You could improve the error reporting by returning different values for different problems. You could make it work with dynamically allocated memory for the word portion of the lines. You could make it work with longer lines than I allow. You could scan backwards over digits instead of non-spaces - but this allows the user to write "abc 0x123" and the hex value is handled correctly. You might prefer to ensure there are no digits in the word part; this code does not care.
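As a sketch of the first suggestion (distinct return values per problem), the parsing step could be pulled out into a function like the one below. The enum `wn_status`, its members, and the name `parse_word_number` are hypothetical; the function works on a line whose newline has already been removed and, unlike the code above, also rejects an empty number field.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes, one per failure the original reports as EOF. */
enum wn_status { WN_OK, WN_NO_SPACE, WN_BAD_NUMBER, WN_WORD_TOO_LONG };

/* Same right-to-left scan as above. `line` is modified in place. */
enum wn_status parse_word_number(char *line, char *word, size_t word_size, long *number)
{
    size_t len = strlen(line);
    if (len == 0)
        return WN_NO_SPACE;

    char *num = &line[len - 1];
    while (num > line && !isspace((unsigned char)*num))
        num--;
    if (num == line)
        return WN_NO_SPACE;

    char *end;
    long value = strtol(num + 1, &end, 0);   /* base 0 accepts 0x123 too */
    if (end == num + 1 || *end != '\0')
        return WN_BAD_NUMBER;

    if ((size_t)(num - line) >= word_size)
        return WN_WORD_TOO_LONG;

    *num = '\0';
    memcpy(word, line, (size_t)(num - line) + 1);   /* +1 copies the terminator */
    *number = value;
    return WN_OK;
}
```

The caller can then switch on the result and print a specific message for each case instead of treating every failure as end of input.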

Jonathan Leffler 2009-09-06 00:44:46
