views: 537
answers: 7

+3  Q: Faster I/O in C

Hi, I have a program which has to take 1,000,000 lines of input like the ones below from the console.

0 1 23 4 5
1 3 5 2 56
12 2 3 33 5
...
...

I have used scanf, but it is very, very slow. Is there any way to get the input from the console faster? I could use read(), but I am not sure about the number of bytes in each line, so I cannot ask read() to read 'n' bytes. Thanks, very obliged

+3  A: 

Use multiple read() calls with a fixed-size buffer until you hit end of file.
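
A minimal sketch of that loop, assuming POSIX read() on stdin (the 64 KB buffer size is an arbitrary choice, and the byte count stands in for real parsing):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[65536];   /* fixed-size buffer */
    ssize_t n;
    long total = 0;

    /* keep reading until read() reports end of file (0) or an error (-1) */
    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        total += n;    /* parse buf[0..n-1] here instead */

    printf("read %ld bytes\n", total);
    return 0;
}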

dirkgently
A: 

Out of curiosity, what generates that many lines that fast in a console?

Indeera
% a.out < big.file, or something like that, I'm guessing.
Roboprog
In case this is some program's output to stdout, redirecting it to a file and loading from the file would be much faster. Console rendering is a slow process.
Indeera
+1  A: 

Read a line at a time (if the buffer is not big enough for a line, expand it and continue with the larger buffer).
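
For illustration, a growing-buffer line reader along those lines (the initial size and the doubling strategy are my choices, not the answer's):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one whole line of any length; the caller free()s the result.
   Returns NULL on EOF (with nothing read) or allocation failure. */
char *read_line(FILE *in)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;

    while (fgets(buf + len, cap - len, in) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                  /* got a complete line */

        cap *= 2;                        /* line longer than buffer: grow */
        char *tmp = realloc(buf, cap);
        if (tmp == NULL) { free(buf); return NULL; }
        buf = tmp;
    }

    if (len > 0) return buf;             /* final line without '\n' */
    free(buf);
    return NULL;
}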

Then use dedicated functions (e.g. atoi) rather than general-purpose ones for the conversion.

But, most of all, set up a repeatable test harness with profiling to ensure changes really do speed things up.

Richard
+5  A: 

Use fgets(...) to pull in a line at a time. Note that you should check for the '\n' at the end of the line, and if there is not one, you are either at EOF, or you need to read another buffer's worth, and concatenate the two together. Lather, rinse, repeat. Don't get caught with a buffer overflow.

THEN, you can parse each logical line in memory yourself. I like to use strspn(...) and strcspn(...) for this sort of thing, but your mileage may vary.

Parsing: Define a delimiters string. Use strspn() to count "non data" chars that match the delimiters, and skip over them. Use strcspn() to count the "data" chars that DO NOT match the delimiters. If this count is 0, you are done (no more data in the line). Otherwise, copy out those N chars to hand to a parsing function such as atoi(...) or sscanf(...). Then, reset your pointer base to the end of this chunk and repeat the skip-delims, copy-data, convert-to-numeric process.
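
A sketch of that skip-delims / copy-data / convert loop (the function name, the delimiter set, and the 32-byte token limit are all my inventions):

#include <stdlib.h>
#include <string.h>

/* Parse up to max integers out of line; returns how many were found. */
int parse_ints(const char *line, int *out, int max)
{
    const char *delims = " \t\n";
    const char *p = line;
    int n = 0;

    while (n < max) {
        p += strspn(p, delims);            /* skip "non data" chars */
        size_t len = strcspn(p, delims);   /* count "data" chars */
        if (len == 0)
            break;                         /* no more data in the line */

        char tmp[32];                      /* copy out those N chars... */
        if (len >= sizeof tmp)
            return -1;                     /* token too long for this sketch */
        memcpy(tmp, p, len);
        tmp[len] = '\0';

        out[n++] = atoi(tmp);              /* ...and hand them to atoi() */
        p += len;                          /* reset base past this chunk */
    }
    return n;
}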

Roboprog
+1  A: 

If your example is representative, i.e. you really do have a fixed format of five decimal numbers per line, I'd probably use a combination of fgets() to read the lines, then a loop calling strtol() to convert from string to integer.

That should be faster than scanf(), while still clearer and more high-level than doing the string to integer conversion on your own.

Something like this:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
  int number[5];
} LineOfNumbers;

int getNumbers(FILE *in, LineOfNumbers *line)
{
  char buf[128];  /* Should be large enough. */
  if(fgets(buf, sizeof buf, in) != NULL)
  {
    int i;
    char *ptr, *eptr;

    ptr = buf;
    for(i = 0; i < sizeof line->number / sizeof *line->number; i++)
    {
      line->number[i] = (int) strtol(ptr, &eptr, 10);
      if(eptr == ptr)
        return 0;
      ptr = eptr;
    }
    return 1;
  }
  return 0;
}

Note: this is untested (even uncompiled!) browser-written code. But perhaps useful as a concrete example.
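
A hypothetical driver loop for it, equally untested, might be:

LineOfNumbers line;
while (getNumbers(stdin, &line))
    printf("%d %d %d %d %d\n", line.number[0], line.number[1],
           line.number[2], line.number[3], line.number[4]);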

unwind
I love "char buf[128]; /* Should be large enough. */" Excellent practice.
Chris Lutz
@Chris: Um ... Not sure if you're being ironic. This is example code, based on a very vague spec, and the fgets() uses sizeof too, so there should be no risk of an overrun here. Or maybe I'm just overly paranoid.
unwind
I am being ironic. I tend to avoid fixed-length line reads, even if it can work just fine, just because I usually like having the whole line at once in one variable (a taste I probably inherited from Perl). But this code is fine. I just find it amusing to see such comments, even when they're true.
Chris Lutz
+1  A: 

Use binary I/O if you can. Text conversion can slow down the reading by several times. If you're using text I/O because it's easy to debug, consider binary format again, and use the od program (assuming you're on Unix) to make it human-readable when needed.

Oh, another thing: there's AT&T's SFIO library, which stands for safer/faster file IO. You might also have some luck with that, but I doubt that you'll get the same kind of speedup as you will with binary format.
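
For instance, if you control the producer, rows of five ints can round-trip with no text conversion at all (a sketch: "data.bin" and the record layout are assumptions, and error handling is omitted):

#include <stdio.h>

int main(void)
{
    int row[5] = {0, 1, 23, 4, 5};

    /* write one raw binary record */
    FILE *f = fopen("data.bin", "wb");
    fwrite(row, sizeof row, 1, f);
    fclose(f);

    /* read records back until fread() returns a short count */
    f = fopen("data.bin", "rb");
    while (fread(row, sizeof row, 1, f) == 1)
        printf("%d %d %d %d %d\n", row[0], row[1], row[2], row[3], row[4]);
    fclose(f);
    return 0;
}

(od -i data.bin will then dump the file back as decimal integers.)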

zvrba
The Unix program which uses binary files is not the true Unix program, or something like that :-) Favor human-readable data when practical -- or when the boss/teacher says this is what we get!
Roboprog
++ If you have a choice, binary is fastest. If there are non-integers in the list, binary is also the most accurate.
Mike Dunlavey
A: 

fread() will still return if you try to read more bytes than there are; it simply returns a short count.

I have found one of the fastest ways to read a file is like this:

/* seek end of file */
fseek(file, 0, SEEK_END);

/* get size of file */
size = ftell(file);

/* seek back to start of file */
fseek(file, 0, SEEK_SET);

/* make a buffer for the file */
buffer = malloc(1048576);

/* fread() in 1 MB at a time until you reach size bytes, etc. */

On modern computers, put your RAM to use: load the whole thing into RAM, and then you can easily work your way through the memory.
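
Filled out into a whole-file slurp, that idea might look like this (an untested sketch; "input.txt" is a placeholder name, and only minimal error checks are shown):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *file = fopen("input.txt", "rb");
    if (file == NULL) return 1;

    fseek(file, 0, SEEK_END);      /* seek end of file */
    long size = ftell(file);       /* get size of file */
    fseek(file, 0, SEEK_SET);      /* seek back to the start */

    char *buffer = malloc(size);   /* one buffer for the whole file */
    if (buffer != NULL && fread(buffer, 1, size, file) == (size_t)size) {
        /* work your way through buffer[0..size-1] here */
    }

    free(buffer);
    fclose(file);
    return 0;
}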

At the very least you should be using fread() with block sizes as big as you can, and at least as big as the filesystem cache blocks or HDD sector size (4096 bytes minimum; I would use 1048576 as a minimum personally). You will find that with much bigger read requests, fread() is able to pull in a big sequential stream in one operation. The suggestion by some people here to use 128 bytes is ridiculous: you will end up with the drive having to seek all the time, as the tiny delay between calls will cause the head to already be past the next sector, which almost certainly holds sequential data that you want.

You could malloc() the buffer to be large enough to hold all of `size` in one fell swoop, but that's a lot of memory for one variable, and I understand avoiding this.
Chris Lutz
Seek-to-EOF does not work on a pipe, and stdin is often just a redirected pipeline rather than an actual file. I assume "the console" means stdin. Sounds like a bit of Windows-speak carry-over, as far as I can tell.
Roboprog
Speakin' of Windoze: this solution flirts with being a memory mapped file.
Roboprog