I need to copy the contents of a text file into a dynamically allocated character array. My problem is getting the size of the file's contents; Google reveals that I need to use fseek and ftell, but for that the file apparently needs to be opened in binary mode, and that gives only garbage.

EDIT: I tried opening in text mode, but I get weird numbers. Here's the code (I've omitted simple error checking for clarity):

long f_size;
char* code;
size_t code_s, result;
FILE* fp = fopen(argv[0], "r");
fseek(fp, 0, SEEK_END);
f_size = ftell(fp); /* This returns 29696, but file is 85 bytes */
fseek(fp, 0, SEEK_SET);
code_s = sizeof(char) * f_size;
code = malloc(code_s);
result = fread(code, 1, f_size, fp); /* This returns 1045, it should be the same as f_size */
A: 

You can open the file, move the cursor to the end of the file, store the offset, and then go back to the start of the file. The stored offset is the size of the file.

Aif
+1  A: 

You can use fseek for text files as well.

  • fseek to end of file
  • ftell the offset
  • fseek back to the beginning

and you have the size of the file.
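
The three steps above can be sketched as a small helper. This is only a minimal sketch: the name file_size_bytes is made up for illustration, and it opens the file in binary mode ("rb") because the C standard only promises that ftell returns a byte count for binary streams. For text streams the value is only guaranteed to be usable as an argument to fseek. The standard also does not strictly require binary streams to support SEEK_END, although common platforms do.

```c
#include <stdio.h>

/* Hypothetical helper: returns the size of the file in bytes, or -1 on error. */
long file_size_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* binary mode: offsets are byte counts */
    long size = -1;

    if (fp == NULL)
        return -1;

    if (fseek(fp, 0, SEEK_END) == 0)    /* 1. seek to the end of the file */
        size = ftell(fp);               /* 2. the offset there is the size (-1L on error) */

    /* 3. A caller that keeps the stream open would now seek back to the
     *    beginning with rewind(fp) or fseek(fp, 0, SEEK_SET). */
    fclose(fp);
    return size;
}
```

A caller that wants to read the contents afterwards would keep the stream open instead of closing it, and allocate size + 1 bytes if it needs a terminating 0.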

Darth
+3  A: 

You cannot determine the size of a file in characters without reading the data, unless you're using a fixed-width encoding.

For example, a file in UTF-8 which is 8 bytes long could be anything from 2 to 8 characters in length.

That's not a limitation of the file APIs, it's a natural limitation of there not being a direct mapping from "size of binary data" to "number of characters."

If you have a fixed-width encoding then you can just divide the size of the file in bytes by the number of bytes per character. ASCII is the most obvious example of this, but if your file is encoded in UTF-16 and you happen to be on a system which treats UTF-16 code units as the "native" internal character type (which includes Java, .NET and Windows) then you can predict the number of "characters" to allocate as if UTF-16 were fixed width. (UTF-16 is variable width because Unicode characters above U+FFFF are encoded as two code units, a surrogate pair, but a lot of the time developers ignore this.)
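
To illustrate why the data has to be read, here is a minimal sketch of counting characters (code points) in UTF-8. The helper name utf8_count_code_points is hypothetical, and the sketch assumes the buffer already holds valid UTF-8; it does no validation.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a buffer assumed to hold
 * valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx; every
 * other byte starts a new code point. */
size_t utf8_count_code_points(const unsigned char *data, size_t n_bytes)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n_bytes; i++) {
        if ((data[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}
```

Eight bytes of plain ASCII give a count of 8, while eight bytes holding two 4-byte sequences give a count of 2, which is the range mentioned above.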

Jon Skeet
I hadn't realized that... so I should read the whole file, incrementing a counter? Wouldn't that be pretty slow?
Javier Badia
Or use fstat(2). See http://www.gnu.org/s/libc/manual/html_node/Reading-Attributes.html
scvalex
@reyjavikvi: Do you want fast, or do you want accurate? There's just logically no way of doing it *without* reading the file's data if you're using a variable width encoding - unless something else has done it first (such as the operating system) and cached the data.
Jon Skeet
(I've been assuming that you *are* interested in the number of characters instead of the number of bytes, by the way... and that you've got a variable width encoding. If you really just want to know the file size in bytes, that's a different and far simpler matter.)
Jon Skeet
@jbcreix: My point is that many platforms - including Java and .NET - use UTF-16 code units as "characters". For example, if you want to read in a file which contains 120 UTF-16 code units, you allocate a character array of size 120, and if the file is encoded in UTF-16 you can predict that size based on the file size in bytes. You can argue all you want about whether or not that's a good idea (I wasn't giving it as "advice", btw) but it's the way that major systems are implemented. I'll edit the answer to make this clearer though...
Jon Skeet
A: 

Kind of hard with no sample code, but fstat (or stat) will tell you how big the file is. You allocate the memory required, and slurp the file in.
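
A rough sketch of that idea, assuming a POSIX system. The helper name slurp_file is made up for illustration, and it uses stat on the path rather than fstat on an open descriptor to keep the example short:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: reads a whole file into a NUL-terminated buffer.
 * Returns NULL on failure; the caller frees the result. */
char *slurp_file(const char *path, size_t *out_len)
{
    struct stat st;
    char *buf;
    size_t n;
    FILE *fp;

    if (stat(path, &st) != 0 || st.st_size < 0)
        return NULL;

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    buf = malloc((size_t) st.st_size + 1);  /* +1 for the terminating 0 */
    if (buf != NULL) {
        n = fread(buf, 1, (size_t) st.st_size, fp);
        buf[n] = '\0';          /* terminate after the bytes actually read */
        if (out_len != NULL)
            *out_len = n;
    }

    fclose(fp);
    return buf;
}
```

Terminating at the count returned by fread, rather than at st_size, keeps the string valid even if the file shrinks between the stat call and the read.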

xcramps
+1  A: 

If you're developing for Linux (or other Unix-like operating systems), you can retrieve the file-size with stat before opening the file:

#include <stdio.h>
#include <sys/stat.h>

int main() {
   struct stat file_stat;

   if(stat("main.c", &file_stat) != 0) {
      perror("could not stat");
      return (1);
   }
   /* st_size is an off_t; cast to long long so large files are not truncated */
   printf("%lld\n", (long long) file_stat.st_size);

   return (0);
}

HTH, flokra

EDIT: Now that I've seen the code, I have to agree with the other posters:

The array that takes the arguments from the program-call is constructed this way:

[0] name of the program itself
[1] first argument given
[2] second argument given
[n] n-th argument given

You should also check argc before using any element of the argv array other than argv[0]:

if (argc < 2) {
   printf ("Usage: %s arg1\n", argv[0]);
   return (1);
}

HTH, flokra

flokra
+1  A: 

Hi, I'm pretty sure argv[0] won't be a text file.

phoku
+1  A: 

argv[0] is the path to the executable, so argv[1] will be the first user-supplied argument. Try changing that and adding some simple error checking, such as checking whether fp == NULL, and we might be able to help you further.

Håkon
+8  A: 

The root of the problem is here:

FILE* fp = fopen(argv[0], "r");

argv[0] is your executable program, NOT the parameter. It certainly won't be a text file. Try argv[1], and see what happens then.

Roddy
Wow, thanks. I feel stupid now.
Javier Badia
@reyjaviki - good :-) It'll be my turn next...
Roddy
A: 

Give this a try (haven't compiled this, but I've done this a bazillion times, so I'm pretty sure it's at least close):

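#include <stdio.h>
#include <stdlib.h>

/* Returns a NUL-terminated copy of the file's contents, or NULL on failure.
   The caller must free() the result. */
char* readFile(const char* filename)
{
    FILE* file = fopen(filename,"r");
    if(file == NULL)
    {
        return NULL;
    }

    fseek(file, 0, SEEK_END);
    long int size = ftell(file);
    rewind(file);

    if(size < 0)
    {
        fclose(file);
        return NULL;
    }

    /* calloc zero-fills, so the extra byte terminates the string */
    char* content = calloc(size + 1, 1);
    if(content != NULL)
    {
        fread(content,1,size,file);
    }

    fclose(file);
    return content;
}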
Imagist
A: 

Another approach is to read the file a piece at a time and extend your dynamic buffer as needed:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGESIZE 128

int main(int argc, char **argv)
{
  char *buf = NULL, *tmp = NULL;
  size_t bufSiz = 0;
  char inputBuf[PAGESIZE];
  FILE *in;

  if (argc < 2)
  {
    printf("Usage: %s filename\n", argv[0]);
    return 0;
  }

  in = fopen(argv[1], "r");
  if (in)
  {
    /**
     * Read up to a page at a time (fgets also stops at each newline)
     * until reaching the end of the file
     */
    while (fgets(inputBuf, sizeof inputBuf, in) != NULL)
    {
      /**
       * Extend the dynamic buffer by the length of the string
       * in the input buffer
       */
      size_t len = strlen(inputBuf);
      tmp = realloc(buf, bufSiz + len + 1);
      if (tmp)
      {
        /**
         * Append to the dynamic buffer; bufSiz counts the characters
         * stored so far, not including the terminating 0
         */
        buf = tmp;
        memcpy(buf + bufSiz, inputBuf, len + 1);
        bufSiz += len;
      }
      }
      else
      {
        printf("Unable to extend dynamic buffer: releasing allocated memory\n");
        free(buf);
        buf = NULL;
        break;
      }
    }

    if (feof(in))
      printf("Reached the end of input file %s\n", argv[1]);
    else if (ferror(in))
      printf("Error while reading input file %s\n", argv[1]);

    if (buf)
    {
      printf("File contents:\n%s\n", buf);
      printf("Read %lu characters from %s\n", 
       (unsigned long) strlen(buf), argv[1]);
    }

    free(buf);
    fclose(in);   
  }
  else
  {
    printf("Unable to open input file %s\n", argv[1]);
  }

  return 0;
}

There are drawbacks with this approach; for one thing, if there isn't enough memory to hold the file's contents, you won't know it immediately. Also, realloc() is relatively expensive to call, so you don't want to make your page sizes too small.

However, this avoids having to use fstat() or fseek()/ftell() to figure out how big the file is beforehand.
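
One common way to keep the number of realloc() calls down is to double the buffer's capacity each time it fills up instead of growing it by one chunk. The following is only a sketch of that variation, not the code above: the helper name read_all and the starting capacity of 128 are arbitrary choices for illustration, and it uses fread rather than fgets.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads everything from an open stream into a
 * NUL-terminated buffer, doubling the capacity whenever it fills up.
 * Returns NULL on allocation or read failure; the caller frees the result. */
char *read_all(FILE *in, size_t *out_len)
{
    size_t cap = 128, len = 0, n;
    char *buf = malloc(cap);
    char *tmp;

    if (buf == NULL)
        return NULL;

    /* Always leave one byte free for the terminating 0 */
    while ((n = fread(buf + len, 1, cap - len - 1, in)) > 0) {
        len += n;
        if (len == cap - 1) {           /* full: double the capacity */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (ferror(in)) {
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

With doubling, the number of realloc() calls grows with the logarithm of the file size rather than linearly, at the cost of up to roughly half the final buffer being unused.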

John Bode