What should I keep in mind when migrating from processing many small data files to a few large data files in ruby?
Background: I'm a bioinformatician who is processing next generation sequencing data, which produces about one million sequences per run. I previously saved each one of the million sequences to its own file, and did a few processing steps to each sequence, producing a couple of files for each sequence. Unfortunately, having a couple of million files is making file input and output a major bottleneck (and also makes backup slow). (Having millions of files is also discouraged in answers to this question)
I considered using sqlite to store each file, but I want to avoid this option if possible, to avoid adding dependencies.
I suspect that I should write one and only one module for handling the large files, and let all of the processing scripts (which run as independent processes) use this module whenever it wants to do input or output. Providing the processing classes with a filestream created with StringIO may be useful for this, as that way they don't need to know about how the large files work.
In order to avoid having to read an entire large file when getting input (I want processing of each sequence to be an independent process, so that an analysis of one sequence can't corrupt the analysis of another sequence), I'll have to keep track of where I'm up to in the large input file. Although more sophisticated inter-process communication techniques exist, I might merely use a temporary file to store the character position for IO#seek.
I'll also have to keep in mind that I won't really be able to run multiple processes at once if they're writing to the same file, and that the large file handler will need to flush its output regularly.