views:

1623

answers:

8

I'm writing a program where performance is quite important, but not critical. Currently I read text from a FILE* line by line, using fgets to obtain each line. After profiling with some performance tools, I've found that my application spends 20% to 30% of its running time inside fgets.

Are there faster ways to get a line of text? My application is single-threaded with no intentions to use multiple threads. Input could be from stdin or from a file. Thanks in advance.

+2  A: 

If the data is coming from disk, you could be IO bound.

If that is the case, get a faster disk (but first check that you're getting the most out of your existing one; some Linux distributions don't tune disk access out of the box, see hdparm), stage the data into memory ahead of time (say, by copying it to a RAM disk), or be prepared to wait.


If you are not IO bound, you could be wasting a lot of time copying. You could benefit from so-called zero-copy methods, such as memory-mapping the file and accessing it only through pointers.

That is a bit beyond my expertise, so you should do some reading or wait for more knowledgeable help.

BTW-- You might be getting into more work than the problem is worth; maybe a faster machine would solve all your problems...

NB-- It is not clear that you can memory map the standard input either...
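A rough sketch of the zero-copy idea on a POSIX system (the function name and the line-counting task are just illustrative; this only works for regular files, not stdin, as noted above):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file and scan its bytes in place: no per-line copy. */
long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping survives the close */
    if (data == MAP_FAILED)
        return -1;

    long lines = 0;
    const char *p = data, *end = data + st.st_size;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++;
    }
    if (end[-1] != '\n')
        lines++;                  /* count an unterminated final line */

    munmap((void *)data, (size_t)st.st_size);
    return lines;
}
```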

dmckee
Sometimes it comes from the disk, sometimes it is fed through stdin, but in both cases the time spent in fgets is roughly the same. Even creating a RAM disk for the file doesn't speed things up much.
dreamlax
After edit: the problem is that this application will be run on end users' computers; that's why performance is quite important.
dreamlax
+1  A: 

You might try minimizing the time you spend reading from disk by reading large amounts of data into RAM and then working on that. Reading from disk is slow, so ideally read the entire file once, then work on it in memory.

Sorta like the way CPU cache minimizes the time the CPU actually goes back to RAM, you could use RAM to minimize the number of times you actually go to disk.

GMan
Stdio already is buffered, isn't it?
Paul Tomblin
I think so but I'm sure it's less than a megabyte, so reading more than that should still help.
GMan
+1  A: 

Depending on your environment, using setvbuf() to increase the size of the internal buffer used by file streams may or may not improve performance.

The syntax is:

setvbuf(InputFile, NULL, _IOFBF, BUFFER_SIZE);

Where InputFile is a FILE* to a file just opened using fopen() and BUFFER_SIZE is the size of the buffer (which is allocated by this call for you).

You can try various buffer sizes to see if any have positive influence. Note that this is entirely optional, and your runtime may do absolutely nothing with this call.
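For example, something along these lines (the file name, the 1 MiB buffer size, and the line-counting loop are all illustrative choices to be tuned by measurement):

```c
#include <stdio.h>

#define BUFFER_SIZE (1 << 20)  /* 1 MiB; try various sizes */

int count_lines(const char *path)
{
    FILE *in = fopen(path, "r");
    if (!in)
        return -1;

    /* Must be called after fopen() but before any other operation on
       the stream; NULL asks the library to allocate the buffer. */
    if (setvbuf(in, NULL, _IOFBF, BUFFER_SIZE) != 0)
        fprintf(stderr, "setvbuf failed; falling back to default buffering\n");

    char line[4096];
    int lines = 0;
    while (fgets(line, sizeof line, in))
        lines++;

    fclose(in);
    return lines;
}
```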

Hexagon
+2  A: 
  1. Use fgets_unlocked(), but read carefully what it does first

  2. Get the data with fgetc() or fgetc_unlocked() instead of fgets(). With fgets(), your data is copied into memory twice: first by the C runtime library from the file into an internal buffer (stream I/O is buffered), then from that internal buffer into an array in your program
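A sketch of a character-at-a-time line reader (the function name and buffer handling are my own illustration; on POSIX systems, getc_unlocked() can be swapped in for fgetc() in single-threaded code to skip per-call locking):

```c
#include <stdio.h>

/* Build one line at a time with fgetc(); returns the number of
   characters stored, 0 at end of file. */
size_t read_line(FILE *in, char *buf, size_t cap)
{
    size_t n = 0;
    int c;
    while (n + 1 < cap && (c = fgetc(in)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            break;              /* keep the newline, stop the line */
    }
    buf[n] = '\0';
    return n;
}
```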

dmityugov
Thanks for the suggestion, but I forgot to mention I am using Mac OS X. fgets_unlocked is not available since it is a GNU extension. I will look into using fgetc_unlocked.
dreamlax
Well, OS X is running GCC, you should get the GNU extensions, right?
Martin Cote
@Martin: It is not an extension of the GNU compiler, but the GNU C runtime library.
dreamlax
+1  A: 

Read the whole file in one go into a buffer.

Process the lines from that buffer.

That's the fastest possible solution.
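Something like this, perhaps (the helper names and the memchr()-based line scan are illustrative; error handling is kept minimal):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Slurp the whole file into one heap buffer; caller frees it. */
char *slurp(const char *path, size_t *len)
{
    FILE *in = fopen(path, "rb");
    if (!in)
        return NULL;
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);
    char *buf = malloc((size_t)size + 1);
    if (buf) {
        *len = fread(buf, 1, (size_t)size, in);
        buf[*len] = '\0';
    }
    fclose(in);
    return buf;
}

/* Walk the buffer, finding line boundaries without copying lines. */
int count_lines_in(const char *buf, size_t len)
{
    int lines = 0;
    const char *p = buf, *end = buf + len;
    const char *nl;
    while (p < end && (nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p = nl + 1;
    }
    if (p < end)
        lines++;                /* final line without trailing newline */
    return lines;
}
```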

Blank Xavier
+3  A: 

You don't say which platform you are on, but if it is UNIX-like, then you may want to try the read() system call, which does not perform the extra layer of buffering that fgets() et al do. This may speed things up slightly, on the other hand it may well slow things down - the only way to find out is to suck it and see.
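A sketch of what doing your own buffering over read() can look like (the 64 KiB chunk size and function name are illustrative guesses; the newline counting stands in for real per-line processing):

```c
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (64 * 1024)       /* tune by measurement */

/* Pull large chunks straight from the file descriptor and scan them,
   bypassing stdio's internal buffer entirely. */
long count_newlines_fd(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    static char chunk[CHUNK];
    long newlines = 0;
    ssize_t got;
    while ((got = read(fd, chunk, sizeof chunk)) > 0)
        for (ssize_t i = 0; i < got; i++)
            if (chunk[i] == '\n')
                newlines++;

    close(fd);
    return got < 0 ? -1 : newlines;
}
```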

anon
This turned out to be the fastest method of all. I eventually went down this route. It was simpler than I had thought to do "my own buffering" and it turned out to be much, much faster (almost 4 times) than using `fgets()`.
dreamlax
A: 

If the OS supports it, you can try asynchronous file reading, that is, the file is read into memory whilst the CPU is busy doing something else. So, the code goes something like:

  start asynchronous read
loop:
  wait for asynchronous read to complete
  if end of file goto exit
  start asynchronous read
  do stuff with data read from file
  goto loop
exit:

If you have more than one CPU then one CPU reads the file and parses the data into lines, the other CPU takes each line and processes it.
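On POSIX systems that pattern can be sketched with aio_read(); the double-buffering scheme, chunk size, and function name below are my own illustration, and the "processing" step just counts newlines:

```c
#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096              /* illustrative; tune by measurement */

/* One chunk is processed while the kernel reads the next one. */
long count_newlines_aio(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char bufs[2][CHUNK];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_nbytes = CHUNK;
    cb.aio_offset = 0;

    int cur = 0;
    long total = 0;
    cb.aio_buf = bufs[cur];
    if (aio_read(&cb) != 0) {   /* start asynchronous read */
        close(fd);
        return -1;
    }

    for (;;) {
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);       /* wait for read to complete */
        ssize_t got = aio_return(&cb);
        if (got <= 0)                     /* end of file (or error) */
            break;

        char *data = bufs[cur];
        cb.aio_offset += got;
        cur ^= 1;
        cb.aio_buf = bufs[cur];
        if (aio_read(&cb) != 0) {         /* start the next read... */
            close(fd);
            return -1;
        }

        for (ssize_t i = 0; i < got; i++) /* ...while processing this one */
            if (data[i] == '\n')
                total++;
    }
    close(fd);
    return total;
}
```

Note that older glibc versions keep the AIO routines in librt, so linking may require -lrt there.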

Skizz
A: 

Look into fread(). It reads much faster for me, especially with the fread buffer set to 65536 bytes. Cons: you have to do a lot of work, essentially writing your own getline function to convert from binary reads back into text lines. Check out: file I/O