As I loop through lines in file A, I am parsing the line and putting each string (char*) into a char**.

At the end of a line, I then run a procedure that consists of opening file B, using fgets, fseek and fgetc to grab characters from that file. I then close file B.

I repeat reopening and reclosing file B for each line.
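A minimal sketch of the pattern described above (file names and the tokenizing step are hypothetical; the function returns how many times B was opened, just to make the cost visible):

```c
#include <stdio.h>

/* Hypothetical sketch: for every line of A, file B is reopened,
   scanned, and closed again. Returns the number of times B was
   opened, i.e. once per line of A. */
long process(const char *a_path, const char *b_path)
{
    FILE *fa = fopen(a_path, "r");
    char a_line[1024], b_line[1024];
    long b_opens = 0;

    if (!fa)
        return -1;

    while (fgets(a_line, sizeof a_line, fa)) {
        /* ... tokenize a_line into a char** here ... */
        FILE *fb = fopen(b_path, "r");   /* reopened once per line of A */
        if (!fb)
            break;
        b_opens++;
        while (fgets(b_line, sizeof b_line, fb)) {
            /* ... fgets/fseek/fgetc lookups in B ... */
        }
        fclose(fb);                      /* and closed again */
    }
    fclose(fa);
    return b_opens;
}
```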

What I would like to know is:

  1. Is there a significant performance hit from using malloc and free, such that I should use something static like myArray[NUM_STRINGS][MAX_STRING_WIDTH] instead of a dynamic char** myArray?

  2. Is there significant performance overhead from opening and closing file B (conceptually, many thousands of times)? If my file A is sorted, is there a way for me to use fseek to move "backwards" in file B, to reset where I was previously located in file B?

EDIT Turns out that a two-fold approach greatly reduced the runtime:

  1. My file B is actually one of twenty-four files. Instead of opening the same file B1 a thousand times, then B2 a thousand times, and so on, I open file B1 once and close it, then B2 once and close it, etc. This reduces many thousands of fopen and fclose operations to roughly 24.

  2. I used rewind() to reset the file pointer.

This yielded a roughly 60-fold speed improvement, which is more than sufficient. Thanks for pointing me to rewind().
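The restructured loop from the edit can be sketched roughly like this (file names and the per-query comparison are hypothetical; the function returns the total number of lines scanned, just to show that each query restarts from the top of B without reopening it):

```c
#include <stdio.h>

/* Hypothetical sketch of the restructured loop: a B-file is
   opened once, and rewind() resets the stream before each query
   instead of an fclose/fopen pair. */
long process_queries(const char *b_path, int nqueries)
{
    FILE *fb = fopen(b_path, "r");
    char line[1024];
    long lines_scanned = 0;

    if (!fb)
        return -1;

    for (int i = 0; i < nqueries; i++) {
        rewind(fb);                      /* back to the start: no reopen */
        while (fgets(line, sizeof line, fb)) {
            lines_scanned++;             /* ... real code compares against query i ... */
        }
    }
    fclose(fb);
    return lines_scanned;
}
```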

+4  A: 

If your dynamic array grows over time, there is a copy cost on some reallocs. If you use the "always double" heuristic, this is amortized to O(n), so it is not horrible. If you know the size ahead of time, a stack-allocated array will still be faster.
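The "always double" heuristic can be sketched as a minimal growable array of strings (the `strvec` type and its names are made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal growable string array using the "always double"
   heuristic: capacity doubles on each realloc, so total copying
   cost stays amortized O(n). */
typedef struct {
    char **items;
    size_t len;
    size_t cap;
} strvec;

int strvec_push(strvec *v, const char *s)
{
    if (v->len == v->cap) {
        size_t newcap = v->cap ? v->cap * 2 : 8;   /* double, start at 8 */
        char **p = realloc(v->items, newcap * sizeof *p);
        if (!p)
            return -1;
        v->items = p;
        v->cap = newcap;
    }
    char *copy = malloc(strlen(s) + 1);            /* own a copy of s */
    if (!copy)
        return -1;
    strcpy(copy, s);
    v->items[v->len++] = copy;
    return 0;
}
```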

For the second question read about rewind. It has got to be faster than opening and closing all the time, and lets you do less resource management.

dmckee
Historically, file open/close operations have proven to be tremendously expensive on most OSes.
Brian Knoblauch
@Brian A link to something that provides evidence for this?
anon
I'm not actually realloc'ing, but doing a malloc followed by a free(), once I'm done with the line. I then malloc-free again on subsequent lines.
Alex Reynolds
@alex Posting some example code in your question might be a good idea.
anon
+1  A: 

Opening and closing has a variable overhead, depending on whether other programs are competing for that resource.

Measure the file size first, then use that to calculate the array size in advance and do one big heap allocation.

You won't get a multi-dimensional array right off, but a bit of pointer arithmetic and you are there.
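The "one big allocation plus pointer arithmetic" idea can be sketched like this (function names are made up; the fixed-width row layout mirrors the `myArray[NUM_STRINGS][MAX_STRING_WIDTH]` from the question):

```c
#include <stdlib.h>

/* One heap allocation sized as rows * width... */
char *alloc_table(size_t rows, size_t width)
{
    return malloc(rows * width);
}

/* ...then pointer arithmetic stands in for 2-D indexing:
   row i lives at base + i * width. */
char *row(char *base, size_t width, size_t i)
{
    return base + i * width;
}
```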

Can you not cache positional information in the other file and then, rather than opening and closing it, use previous seek indexes as an offset? Depends on the exact logic really.

Aiden Bell
This won't make each line a separate element. That may or may not actually be necessary, but it's what the OP wants.
Matthew Flaschen
A big binary block can be split on '\n' with some pointer incrementing a lot quicker than multiple realloc() calls, I suspect. But it depends on how many lines.
Aiden Bell
A: 

There is always a performance hit with using dynamic memory. Using a static buffer will provide a speed boost.

There is also going to be a performance hit with reopening a file. You can use fseek(fp, pos, SEEK_SET) to set the file pointer to any position in the file, or fseek(fp, offset, SEEK_CUR) to do a relative move.

A "significant" performance hit is relative; you will have to determine what that means for yourself.

Dolphin
A: 
  1. I think it's better to allocate the actual space you need; the overhead will probably not be significant. This avoids both wasting space and stack overflows.

  2. Yes. Though the IO is cached, you're making unnecessary syscalls (open and close). Use fseek with probably SEEK_CUR or SEEK_SET.

Matthew Flaschen
+3  A: 

What I would like to know is:

  • does your code work correctly?
  • is it running fast enough for your purpose?

If the answer to both of these is "yes", don't change anything.

anon
A: 

In both cases, there is some performance hit, but the significance will depend on the size of the files and the context your program runs in.

  1. If you actually know the max number of strings and max width, this will be a lot faster (but you may waste a lot of memory if you use less than the "max"). The happy medium is to do what a lot of dynamic array implementations in C++ do: whenever you have to realloc myArray, alloc twice as much space as you need, and only realloc again once you've run out of space. This requires only O(log n) reallocations, so the total copying cost is amortized O(n).

  2. This may be a big performance hit. I strongly recommend using fseek, though the details will depend on your algorithm.

JSBangs
A: 

I often find the performance overhead of malloc to be outweighed by the direct, fine-grained memory management that malloc and those low-level C handlers give you. Unless these regions of memory will remain static and untouched for long enough that the amortized cost of allocating them exceeds the cost of touching them, it may be more beneficial to stick with the static array. In the end, it's up to you.

mduvall
+1  A: 
  1. If your files are large, disk I/O will be far more expensive than memory management. Worrying about malloc/free performance before profiling indicates that it is a bottleneck is premature optimization.

  2. It is possible that the overhead from frequent open/close is significant in your program, but again the actual I/O is likely to be more expensive, unless the files are small, in which case the loss of buffers between close and open can potentially cause extra disk I/O. And yes, you can use ftell() to get the current position in the file and then fseek() with SEEK_SET to return to it.
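The ftell()/fseek() save-and-restore idea can be sketched like this (the file path and function name are hypothetical; the function reads the second line twice, once directly and once after seeking back, and returns 0 when both reads match):

```c
#include <stdio.h>
#include <string.h>

/* Remember a position with ftell(), read past it, then jump
   back with fseek(..., SEEK_SET) instead of reopening the file. */
int reread_second_line(const char *path, char *out, size_t outsz)
{
    FILE *f = fopen(path, "r");
    char buf[256];

    if (!f)
        return -1;
    fgets(buf, sizeof buf, f);      /* skip the first line */
    long pos = ftell(f);            /* remember where line 2 starts */
    fgets(buf, sizeof buf, f);      /* read line 2 once */
    fseek(f, pos, SEEK_SET);        /* jump back to the saved position */
    fgets(out, (int)outsz, f);      /* read line 2 again */
    fclose(f);
    return strcmp(buf, out) == 0 ? 0 : -1;
}
```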

TrayMan