
I have a very large binary file and I need to create separate files based on the id within the input file. There are 146 output files and I am using cstdlib and fopen and fwrite. FOPEN_MAX is 20, so I can't keep all 146 output files open at the same time. I also want to minimize the number of times I open and close an output file.

How can I write to the output files effectively?

I also must use the cstdlib library due to legacy code.

The executable must also be cross-platform, compatible with both UNIX and Windows.

+1  A: 

If you cannot somehow increase FOPEN_MAX, you can create a simple queue of requests and then close and re-open files as needed.

You can also keep track of the last write-time for each file, and try to keep the most recently written files open.

thomask
A: 

The solution seems obvious - open N files, where N is somewhat less than FOPEN_MAX. Then read through the input file and extract the contents destined for the first N output files. Then close those output files, rewind the input, and repeat.
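A minimal sketch of that multi-pass approach, assuming fixed-size records with a 4-byte id at the front and ids numbered 0-145 (both assumptions; adjust the record layout and file naming to your actual format):

```
#include <cstdio>
#include <cstring>

const int RECORD_SIZE  = 512;   // assumed record layout: 4-byte id + payload
const int IDS_TOTAL    = 146;
const int IDS_PER_PASS = 16;    // comfortably below FOPEN_MAX (20)

int main()
{
    FILE* in = fopen("input.bin", "rb");
    if (!in) return 1;

    char record[RECORD_SIZE];

    for (int first = 0; first < IDS_TOTAL; first += IDS_PER_PASS) {
        int last = first + IDS_PER_PASS;
        if (last > IDS_TOTAL) last = IDS_TOTAL;

        // Open only the output files handled in this pass.
        // (Error handling omitted for brevity.)
        FILE* out[IDS_PER_PASS] = { 0 };
        for (int id = first; id < last; ++id) {
            char name[64];
            sprintf(name, "out_%03d.bin", id);
            out[id - first] = fopen(name, "wb");
        }

        // One sequential read of the input per pass.
        rewind(in);
        while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
            int id;
            memcpy(&id, record, sizeof id);
            if (id >= first && id < last)
                fwrite(record, 1, RECORD_SIZE, out[id - first]);
        }

        for (int i = 0; i < last - first; ++i)
            if (out[i]) fclose(out[i]);
    }

    fclose(in);
    return 0;
}
```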

Mark Bessey
A: 

First of all, I hope you are running as much in parallel as possible. There is no reason why you can't write to multiple files at the same time. I'd recommend doing what thomask said and queue requests. You can then use some thread synchronization to wait until the entire queue is flushed before allowing the next round of writes to go through.
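One way to realize the queued-writes idea with thread synchronization; this is a hedged sketch rather than a full design. It assumes C++11 `std::thread`, and it drains the queue with a single writer thread, which also keeps only one output file open at a time:

```
#include <cstdio>
#include <queue>
#include <string>
#include <vector>
#include <mutex>
#include <thread>
#include <condition_variable>

// A queued write request: which output file, and the bytes to append to it.
struct WriteRequest {
    std::string path;
    std::vector<char> data;
};

std::queue<WriteRequest> g_queue;
std::mutex g_mutex;
std::condition_variable g_cv;
bool g_done = false;

// The single writer thread drains the queue, so only one output file is ever
// open at a time (well under FOPEN_MAX) while the reader keeps parsing.
void writer_thread()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(g_mutex);
        g_cv.wait(lock, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty() && g_done) break;

        WriteRequest req = g_queue.front();
        g_queue.pop();
        lock.unlock();                       // do the slow I/O outside the lock

        if (FILE* f = fopen(req.path.c_str(), "ab")) {
            fwrite(req.data.data(), 1, req.data.size(), f);
            fclose(f);
        }
    }
}

// Called by the thread(s) parsing the input file.
void enqueue_write(const std::string& path, const std::vector<char>& data)
{
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_queue.push(WriteRequest{path, data});
    }
    g_cv.notify_one();
}

int main()
{
    std::thread writer(writer_thread);

    // ... read the input file here and call enqueue_write(path, record_bytes) ...

    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_done = true;
    }
    g_cv.notify_one();
    writer.join();
    return 0;
}
```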

Polaris878
+3  A: 

It may also be worth scanning the input file, making a list of each output id and sorting it, so that you write all the file1 entries first, then all the file2 entries, etc.
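A sketch of that idea combined with the offset index mentioned in the comments below: one scan to record where each id's records live, then one open/close per output file while the records are copied out in id order. Fixed-size records with a leading 4-byte id are assumed, and for a multi-GB input you would also want 64-bit offsets (fseeko/_fseeki64):

```
#include <cstdio>
#include <cstring>
#include <map>
#include <vector>

const int RECORD_SIZE = 512;   // assumed: fixed-size records, id in first 4 bytes

int main()
{
    FILE* in = fopen("input.bin", "rb");
    if (!in) return 1;

    // Pass 1: remember where every record for each id lives.
    std::map<int, std::vector<long> > index;
    char record[RECORD_SIZE];
    long offset = 0;   // 64-bit offsets needed for files over 2 GB
    while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
        int id;
        memcpy(&id, record, sizeof id);
        index[id].push_back(offset);
        offset += RECORD_SIZE;
    }

    // Pass 2: one open/close per output file, records written grouped by id.
    for (std::map<int, std::vector<long> >::iterator it = index.begin();
         it != index.end(); ++it) {
        char name[64];
        sprintf(name, "out_%03d.bin", it->first);
        FILE* out = fopen(name, "wb");
        for (size_t i = 0; i < it->second.size(); ++i) {
            fseek(in, it->second[i], SEEK_SET);
            if (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE)
                fwrite(record, 1, RECORD_SIZE, out);
        }
        fclose(out);
    }

    fclose(in);
    return 0;
}
```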

Martin Beckett
Or use some sort of binary search tree to hold the info, then iterate through the tree at the end for output (`std::map` or `std::unordered_map` would be good).
Brendan Long
You could decrease the passes if you took care of more than one file in each pass. You could also try to take advantage of the earliest and latest instance of a particular ID so that you can avoid scanning the entire input file on every pass.
nategoose
Yes, or you create a start/end index for each id into the large file - the important thing is that 146 passes through a large sequential file are better than many tens of thousands of file open/write/close cycles when each record goes to a different output file. The point isn't always to find the perfect fastest solution - just an acceptably fast one.
Martin Beckett
A: 

You haven't mentioned if it's critical to write to these outputs in "real-time", or how much data is being written. Subject to your constraints, one option might be to buffer all the outputs and write them at the end of your software run.

A variant of this is to set up internal buffers of a fixed size; once you hit the internal buffer limit, open the file, append, close, then clear the buffer for more output. The buffers reduce the number of open/close cycles and give you bursts of writes, which the file system is usually set up to handle nicely. This would be for cases where you need somewhat real-time writes, and/or the data is bigger than available memory, and the number of required file handles exceeds the system maximum.
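A minimal sketch of such a staging buffer, assuming a simple append-on-flush policy (file names and the buffer size are illustrative):

```
#include <cstdio>
#include <string>
#include <vector>

// One fixed-size staging buffer per output file; flushed with a single
// open/append/close when it fills up, and once more at the end of the run.
class BufferedOutput {
public:
    BufferedOutput(const std::string& path, size_t capacity)
        : path_(path), capacity_(capacity) { buffer_.reserve(capacity); }

    void write(const char* data, size_t len) {
        if (buffer_.size() + len > capacity_) flush();
        if (len >= capacity_) {              // oversized chunk: write straight through
            if (FILE* f = fopen(path_.c_str(), "ab")) {
                fwrite(data, 1, len, f);
                fclose(f);
            }
            return;
        }
        buffer_.insert(buffer_.end(), data, data + len);
    }

    void flush() {
        if (buffer_.empty()) return;
        if (FILE* f = fopen(path_.c_str(), "ab")) {
            fwrite(&buffer_[0], 1, buffer_.size(), f);
            fclose(f);
        }
        buffer_.clear();
    }

private:
    std::string path_;
    size_t capacity_;
    std::vector<char> buffer_;
};
```

A caller would keep one `BufferedOutput` per id (e.g. `BufferedOutput out("out_001.bin", 1 << 20);`) and call `flush()` on each of them once at the end of the run.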

Digikata
Real-time is not required, but the executable will need to process hundreds of GBs of data in a reasonable amount of time, so speed is my number one concern. I originally tried buffering all outputs and writing them at the end of the run, but it was neither efficient nor quick enough.
Elpezmuerto
+5  A: 

A couple possible approaches you might take:

  • keep a cache of opened output file handles that's smaller than FOPEN_MAX - if a write needs to occur on a file that's already open, just do the write. Otherwise, close one of the handles in the cache and open the needed output file. If the data for a particular set of files tends to be clumped together in the input file, this should work nicely with an LRU policy for the file handle cache (a sketch follows this list).

  • Handle the output buffering yourself instead of letting the library do it for you: keep your own set of 146 (or however many you might need) output buffers and buffer the output to those, and perform an open/flush/close when a particular output buffer gets filled. You could even combine this with the above approach to really minimize the open/close operations.

Just be sure you test well for the edge conditions that can happen on filling or nearly filling an output buffer.
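As a rough illustration of the first bullet, a minimal LRU handle cache might look something like this (append mode, deliberately simple bookkeeping, most error handling omitted; a real version would probably truncate each file on its very first open):

```
#include <cstdio>
#include <list>
#include <map>
#include <string>

// Keeps at most max_open FILE*'s alive; evicts the least recently used
// handle when a new file has to be opened.
class FileHandleCache {
public:
    explicit FileHandleCache(size_t max_open) : max_open_(max_open) {}

    ~FileHandleCache() {
        for (std::map<std::string, FILE*>::iterator it = open_.begin();
             it != open_.end(); ++it)
            fclose(it->second);
    }

    void write(const std::string& path, const char* data, size_t len) {
        if (FILE* f = handle(path))
            fwrite(data, 1, len, f);
    }

private:
    FILE* handle(const std::string& path) {
        std::map<std::string, FILE*>::iterator it = open_.find(path);
        if (it != open_.end()) {
            lru_.remove(path);            // move to most-recently-used position
            lru_.push_front(path);
            return it->second;
        }
        if (open_.size() >= max_open_) {  // evict the least recently used handle
            const std::string& victim = lru_.back();
            fclose(open_[victim]);
            open_.erase(victim);
            lru_.pop_back();
        }
        FILE* f = fopen(path.c_str(), "ab");  // "ab" so a reopened file appends;
        if (!f) return 0;                     // a real version truncates on first open
        open_[path] = f;
        lru_.push_front(path);
        return f;
    }

    size_t max_open_;
    std::map<std::string, FILE*> open_;
    std::list<std::string> lru_;
};
```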

Michael Burr
You could sort of combine the two. If you have data for a file that is not currently open then you could buffer that up until it reaches some high water mark for that file or for all files, at which point you open that file (possibly closing another file first).
nategoose
A: 

You can do it in 2 steps.

1) Write the first 19 ids to one file, the next 19 ids to the next file and so on. So you need 8 output files (and the input file) opened in parallel for this step.

2) For each file created this way, open 19 new files (only 13 for the last one) and distribute its ids into them.

Independent of how large the input file is and how many id-datasets it contains, you always need to open and close 163 files. But you need to write the data twice, so it may only be worth it if the id-datasets are really small and randomly distributed.

I think in most cases it is more efficient to open and close the files more often.
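For completeness, a sketch of that two-step split, assuming fixed-size records with a leading 4-byte id and ids numbered 0-145. Note that step 2 holds 19 outputs plus the bucket file open, which is right at FOPEN_MAX; the standard streams may count against that limit on some implementations:

```
#include <cstdio>
#include <cstring>

const int RECORD_SIZE = 512;      // assumed record layout: 4-byte id + payload
const int IDS_TOTAL   = 146;
const int GROUP_SIZE  = 19;       // 8 groups of up to 19 ids
const int GROUPS      = (IDS_TOTAL + GROUP_SIZE - 1) / GROUP_SIZE;

int main()
{
    char record[RECORD_SIZE];

    // Step 1: split the input into one intermediate file per group of 19 ids.
    {
        FILE* in = fopen("input.bin", "rb");
        FILE* bucket[GROUPS];
        for (int g = 0; g < GROUPS; ++g) {
            char name[64];
            sprintf(name, "bucket_%d.bin", g);
            bucket[g] = fopen(name, "wb");
        }
        while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
            int id;
            memcpy(&id, record, sizeof id);
            fwrite(record, 1, RECORD_SIZE, bucket[id / GROUP_SIZE]);
        }
        for (int g = 0; g < GROUPS; ++g) fclose(bucket[g]);
        fclose(in);
    }

    // Step 2: split each intermediate file into its final per-id files.
    for (int g = 0; g < GROUPS; ++g) {
        char name[64];
        sprintf(name, "bucket_%d.bin", g);
        FILE* in = fopen(name, "rb");

        FILE* out[GROUP_SIZE] = { 0 };
        while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
            int id;
            memcpy(&id, record, sizeof id);
            int slot = id % GROUP_SIZE;       // unique within this group
            if (!out[slot]) {                 // open lazily: only ids actually present
                sprintf(name, "out_%03d.bin", id);
                out[slot] = fopen(name, "wb");
            }
            fwrite(record, 1, RECORD_SIZE, out[slot]);
        }
        for (int i = 0; i < GROUP_SIZE; ++i)
            if (out[i]) fclose(out[i]);
        fclose(in);
    }
    return 0;
}
```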

rudi-moore
A: 

The safest method is to open a file and flush after writing, then close it when no further writes are expected soon. Many things outside your program's control can corrupt the content of your file. Keep this in mind as you read on.

I suggest keeping a std::map or std::vector of FILE pointers. The map allows you to access file pointers by an ID. If the ID range is small, you could create a vector, reserve elements, and use the ID as an index. This will allow you to keep a lot of files open at the same time. Beware the concept of data corruption.

The limit of simultaneously open files is set by the operating system. For example, if your OS has a maximum of 10, you will have to make arrangements when the 11th file is requested.

Another trick is to reserve buffers in dynamic memory for each file. When all the data is processed, open a file (or more than one), write the buffer (using one fwrite), close, and move on. This may be faster since you are writing to memory during the data processing rather than to a file. An interesting side note is that your OS may also page the buffers to the hard drive. The size and quantity of the buffers is an optimization issue that is platform-dependent (you'll have to adjust and test to get a good combination). Your program will slow down if the OS pages the memory to the disk.
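A sketch of the buffer-per-id variant, where each output file sees exactly one open, one fwrite, and one close. As noted above, this is only attractive when a run's worth of data fits comfortably in memory (fixed-size records with a leading 4-byte id are assumed):

```
#include <cstdio>
#include <cstring>
#include <map>
#include <vector>

const int RECORD_SIZE = 512;            // assumed layout: 4-byte id + payload

// One growable buffer per id; everything is written out at the end,
// so each output file is opened, written, and closed exactly once.
int main()
{
    std::map<int, std::vector<char> > buffers;

    FILE* in = fopen("input.bin", "rb");
    char record[RECORD_SIZE];
    while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
        int id;
        memcpy(&id, record, sizeof id);
        std::vector<char>& buf = buffers[id];
        buf.insert(buf.end(), record, record + RECORD_SIZE);
    }
    fclose(in);

    for (std::map<int, std::vector<char> >::iterator it = buffers.begin();
         it != buffers.end(); ++it) {
        char name[64];
        sprintf(name, "out_%03d.bin", it->first);
        FILE* out = fopen(name, "wb");
        fwrite(&it->second[0], 1, it->second.size(), out);
        fclose(out);
    }
    return 0;
}
```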

Thomas Matthews
A: 

Well, if I were writing it with your listed constraints in the OP, I would create 146 buffers and plop the data into them, then at the end sequentially walk through the buffers, opening and closing a single file handle for each.

You mentioned in a comment that speed was a major concern and that the naive approach is too slow.

There are a few things that you can start considering. One is reorganizing the binary file into sequential strips, which would allow parallel operations. Another is a least-recently-used approach to your file-handle collection. Another approach might be to fork out to 8 different processes, each outputting to 19-20 files.

Some of these approaches will be more or less practical to write depending on the binary organization (highly fragmented vs. highly sequential).

A major constraint is the size of your binary data. Is it bigger than cache? Bigger than memory? Streamed off a tape deck? Continually coming off a sensor stream and only existing as a 'file' in memory? Each of those presents a different optimization strategy...

Another question is usage patterns. Are you doing occasional spike writes to the files, or are you having massive chunks written only a few times? That determines the effectiveness of the different caching/paging strategies of filehandles.

Paul Nathan
A: 

Assuming you are on a *nix system, the limit is per process, not system-wide. That implies you could launch multiple processes, each responsible for a subset of the ids you are filtering for. Each could keep within the FOPEN_MAX for its process.

You could have one parent process reading the input file then sending the data to various 'write' processes through pipe special files.
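A rough POSIX-only sketch of that arrangement (it would need a different mechanism, such as CreateProcess plus pipes via the Win32 API, to satisfy the Windows requirement; fixed-size records with a leading 4-byte id are assumed):

```
#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <sys/wait.h>

const int RECORD_SIZE = 512;    // assumed record layout: 4-byte id + payload
const int IDS_TOTAL   = 146;
const int WORKERS     = 8;      // each worker owns roughly 19 ids

int main()
{
    int pipes[WORKERS][2];
    char record[RECORD_SIZE];

    for (int w = 0; w < WORKERS; ++w) {
        pipe(pipes[w]);
        if (fork() == 0) {                        // child: write the files it owns
            for (int j = 0; j <= w; ++j) close(pipes[j][1]);  // drop write ends
            FILE* from_parent = fdopen(pipes[w][0], "rb");
            while (fread(record, 1, RECORD_SIZE, from_parent) == RECORD_SIZE) {
                int id;
                memcpy(&id, record, sizeof id);
                char name[64];
                sprintf(name, "out_%03d.bin", id);
                FILE* out = fopen(name, "ab");    // <=19 ids each: handles could be cached
                fwrite(record, 1, RECORD_SIZE, out);
                fclose(out);
            }
            _exit(0);
        }
        close(pipes[w][0]);                       // parent keeps only the write ends
    }

    // Parent: one sequential pass, routing each record to the worker owning its id.
    FILE* in = fopen("input.bin", "rb");
    while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
        int id;
        memcpy(&id, record, sizeof id);
        write(pipes[id * WORKERS / IDS_TOTAL][1], record, RECORD_SIZE);
    }
    fclose(in);

    for (int w = 0; w < WORKERS; ++w) close(pipes[w][1]);  // workers see EOF and exit
    for (int w = 0; w < WORKERS; ++w) wait(NULL);          // reap the workers
    return 0;
}
```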

Amardeep
A: 

"Fewest File Opens" Strategy:

To achieve a minimum number of file opens and closes, you will have to read through the input multiple times. Each time, you pick a subset of the ids that need sorting, and you extract only those records into the output files.

Pseudocode for each thread (a single-threaded sketch follows the list):

  1. Run through the file, collect all the unique ids.
  2. fseek() back to the beginning of the input.
  3. For every group of 19 IDs:
    1. Open a file for each ID.
    2. Run through the input file, appending matching records to the corresponding output file.
    3. Close this group of 19 output files.
    4. fseek() to the beginning of the input.
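A single-threaded sketch of the above, again assuming fixed-size records with a leading 4-byte id. Nineteen outputs plus the input is exactly FOPEN_MAX; shrink the group size if your implementation counts the standard streams against that limit:

```
#include <cstdio>
#include <cstring>
#include <map>
#include <set>

const int RECORD_SIZE = 512;    // assumed: fixed-size records, id in first 4 bytes
const int GROUP_SIZE  = 19;     // 19 outputs + the input = 20 = FOPEN_MAX

int main()
{
    FILE* in = fopen("input.bin", "rb");
    if (!in) return 1;
    char record[RECORD_SIZE];

    // Step 1: collect all the unique ids.
    std::set<int> ids;
    while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
        int id;
        memcpy(&id, record, sizeof id);
        ids.insert(id);
    }

    // Steps 2-3: one full pass over the input per group of 19 ids.
    std::set<int>::iterator next = ids.begin();
    while (next != ids.end()) {
        std::map<int, FILE*> out;             // error handling omitted for brevity
        for (int n = 0; n < GROUP_SIZE && next != ids.end(); ++n, ++next) {
            char name[64];
            sprintf(name, "out_%03d.bin", *next);
            out[*next] = fopen(name, "wb");
        }

        rewind(in);                           // fseek() back to the beginning
        while (fread(record, 1, RECORD_SIZE, in) == RECORD_SIZE) {
            int id;
            memcpy(&id, record, sizeof id);
            std::map<int, FILE*>::iterator it = out.find(id);
            if (it != out.end())
                fwrite(record, 1, RECORD_SIZE, it->second);
        }

        for (std::map<int, FILE*>::iterator it = out.begin(); it != out.end(); ++it)
            fclose(it->second);
    }

    fclose(in);
    return 0;
}
```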

This method doesn't work quite as nicely with multiple threads, because eventually the threads will be reading totally different parts of the file. When that happens, it's difficult for the file cache to be efficient. You could use barriers to keep the threads more-or-less in lock-step.

"Fewest File Operations" Strategy

You could use multiple threads and a large buffer pool to make only one run-through of the input. This comes at the expense of more file opens and closes (probably). Each thread would, until the whole file was sorted:

  1. Choose the next unread page of the input.
  2. Sort that input into 2-page buffers, one buffer for each output file. Whenever one buffer page is full:
    1. Mark the page as unavailable.
    2. If this page has the lowest page-counter value, append it to the file using fwrite(). If not, wait until it is the lowest (hopefully, this doesn't happen much).
    3. Mark the page as available, and give it the next page number.

You could also change the granularity at which output is flushed to disk. Maybe you have enough RAM to collect 200 pages at a time, per output file?

Things to be careful about:

  • Is your data page-aligned? If not, you'll have to be clever about reading "the next page".
  • Make sure you don't have two threads fwrite()'ing to the same output file at the same time. If that happens, you might corrupt one of the pages.
Andres Jaan Tack
This explanation got awkward, and I've already edited it a few times. Let me know what I can clarify.
Andres Jaan Tack