I have a huge data file (~4GB) that I am passing through R (to do some string clean-up) on its way into a MySQL database. Each row/line is independent of the others. Is there any speed advantage to be had by using parallel operations to finish this process? That is, could one thread start at the first line and read every second line, while another starts at the second line and reads every second line? If so, would it actually speed up the process, or would the two threads fighting over the 10K RPM Western Digital hard drive (not SSD) negate any possible advantage?
The bottleneck will likely be the HDD. It doesn't matter how many processes are trying to access it; it can only read/write one thing at a time.
This assumes the "string clean up" uses minimal CPU. awk or sed are generally better for this than R.
Why not just use some of the standard Unix tools to split the file into chunks and call several R command-line expressions in parallel, each working on its own chunk? No need to be fancy if simple will do.
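For concreteness, a minimal sketch of that approach in R, assuming the file has already been cut up with the standard split utility and that the clean-up is a simple regex pass; the script name clean_chunk.R, the chunk naming, and the gsub pattern are placeholders, not details from the question:

    # clean_chunk.R -- hypothetical worker, invoked once per chunk, e.g.:
    #   split -l 1000000 bigfile.txt chunk_            # standard Unix split
    #   for f in chunk_*; do Rscript clean_chunk.R "$f" & done; wait
    args   <- commandArgs(trailingOnly = TRUE)
    infile <- args[1]

    lines <- readLines(infile)

    # Placeholder clean-up: trim whitespace and collapse runs of spaces;
    # substitute whatever string fixes the real job needs.
    cleaned <- gsub("\\s+", " ", trimws(lines))

    writeLines(cleaned, paste0(infile, ".clean"))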
You probably want to read from the disk in one linear forward pass, as the OS and the disk optimize heavily for that case. But you could parcel out blocks of lines to worker threads/processes from where you're reading the disk. (If you can do process parallelism rather than thread parallelism, you probably should - way less hassle all 'round.)
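A rough sketch of that pattern using R's parallel package, with one process reading the file forward in blocks and forked workers cleaning each block; the file names, block size, and clean_block() function are assumptions for illustration:

    library(parallel)

    # Illustrative clean-up; substitute the real string fixes here.
    clean_block <- function(block) gsub("\\s+", " ", trimws(block))

    con <- file("bigfile.txt", open = "r")          # single linear reader
    out <- file("bigfile.clean.txt", open = "w")
    block_size <- 100000L
    repeat {
      block <- readLines(con, n = block_size)
      if (length(block) == 0) break
      # Divide the block across cores; mclapply forks worker processes
      # (process parallelism on Unix-alikes; it falls back to serial on Windows).
      pieces <- split(block, cut(seq_along(block), detectCores(), labels = FALSE))
      writeLines(unlist(mclapply(pieces, clean_block)), out)
    }
    close(con); close(out)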
Can you describe the string cleanup that's required? R is not the first thing I would reach for when it comes to string bashing.
The answer is maybe. At some point, disk access will become the limiting factor. Whether this happens with 2 cores running or 8 depends on the characteristics of your hardware setup. It'd be pretty easy to just try it out while watching your system with top. If your %wa is consistently above zero, it means the CPUs are waiting for the disk to catch up and you're likely slowing the whole process down.
Ruby is another easy scripting language for file manipulation and clean-up. Still, it comes down to the ratio of processing time to reading time. If the point is to do things like selecting columns or rearranging fields, you are far better off with Ruby, awk, or sed; even for simple computations those would be better. But if, for each line, you are fitting a regression model or running a simulation, say, then you would be better off doing the tasks in parallel. The question cannot have a definite answer because we don't know the exact parameters, but it sounds like for most simple clean-up jobs it would be better to use a language well suited to the task, like Ruby, and run it in a single thread.