I have a huge data file (~4 GB) that I am passing through R (to do some string clean-up) on its way into a MySQL database. Each row/line is independent of the others. Is there any speed advantage to be had by using parallel operations to finish this process? That is, could one thread start at the first line and scan every second line, while another starts at the second line and reads every second line? If so, would it actually speed up the process, or would the two threads fighting over the 10K RPM Western Digital hard drive (not an SSD) negate any possible advantage?

+1  A: 

The bottleneck will likely be the HDD. It doesn't matter how many processes are trying to access it; it can only read/write one thing at a time.

This assumes the "string clean up" uses minimal CPU. awk or sed are generally better for this than R.
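For example, a rough sketch of the kind of one-liner I mean (the exact expression depends on what your clean-up actually is; the file name and patterns here are only placeholders):

```
# illustrative only: strip carriage returns and trailing whitespace from each line
sed -e 's/\r$//' -e 's/[[:space:]]*$//' input.csv > cleaned.csv
```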

Joshua Ulrich
I'm not familiar with awk or sed. Is there a simple one-line command that would strip all the null terminators off a null-terminated string embedded within a CSV?
drknexus
Looks like `tr -d "\000" < your.csv > new.csv` will do it.
Joshua Ulrich
+1  A: 

Why not just use some of the standard Unix tools to split the file into chunks and call several R command-line expressions in parallel working on a chunk each? No need to be fancy if simple can do.
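A rough sketch of what I mean, assuming a hypothetical `clean_chunk.R` script that does your clean-up on whatever file it is given as an argument (the chunk size is also just a placeholder):

```
# cut the big file into one-million-line pieces named chunk_aa, chunk_ab, ...
split -l 1000000 bigfile.csv chunk_

# launch one R process per piece in the background, then wait for all of them
for f in chunk_*; do
  Rscript clean_chunk.R "$f" &
done
wait
```

Afterwards you can simply `cat` the cleaned pieces back together, or load each one into MySQL separately.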

Dirk Eddelbuettel
That might work, but the essential question remains the same: would that approach be better than a single-threaded one?
drknexus
A: 

You probably want to read from the disk in one linear forward pass, as the OS and the disk optimize heavily for that case. But you could parcel out blocks of lines to worker threads/processes from where you're reading the disk. (If you can do process parallelism rather than thread parallelism, you probably should - way less hassle all 'round.)
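If GNU parallel happens to be installed, its `--pipe` mode is one way to get exactly that pattern, a single linear read with blocks of lines handed off to worker processes (the clean-up command and block size below are only placeholders):

```
# one reader, several workers; -k keeps the output in the original line order
parallel -k --pipe --block 10M 'tr -d "\0"' < big.csv > clean.csv
```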

Can you describe the string clean-up that's required? R is not the first thing I would reach for when it comes to string bashing.

Zack
I am almost certainly using the wrong tool for this job, but I'm using the tools I know how to use rather than learning new ones. The string clean-up on my input CSV is minimal, mostly stripping a terminating null from a character array that SQL would store faithfully but that R/RMySQL would freak out about when I try to read the data back into R. The string processing is so fast that I rather doubt reading the file in one chunk and then parceling out blocks to workers would help enough to offset the overhead.
drknexus
If the *only* thing you need to do to the text is strip NUL bytes, then this'll do it, probably faster than anything else you could come up with: `tr -d '\0' < file_with_nulls > file_without_nulls`
Zack
+1  A: 

The answer is maybe. At some point, disk access will become the limiting factor. Whether that happens with 2 cores running or 8 depends on the characteristics of your hardware setup. It'd be pretty easy to just try it out while watching your system with top. If your %wa is consistently above zero, it means the CPUs are waiting for the disk to catch up, and piling on more parallel processes is likely to slow the whole thing down rather than speed it up.
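For instance (assuming a Linux box; `iostat` comes from the sysstat package):

```
top            # watch the "wa" value in the Cpu(s) summary line
iostat -x 2    # per-device I/O statistics, refreshed every two seconds
```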

chrisamiller
A: 

Ruby is another easy scripting language for file manipulation and clean-up, but it is still a question of the ratio of processing time to reading time. If the point is to do things like selecting columns or rearranging fields, you are far better off going with Ruby, awk, or sed; even for simple computations those would be better. But if, for each line, you are, say, fitting a regression model or running a simulation, you would be better off doing the tasks in parallel. The question can't have a definite answer because we don't know the exact parameters, but it sounds like for most simple clean-up jobs you would be better off using a language well suited to the task, like Ruby, and running it in a single thread.

Andrew Redd