I have a python program that does something like this:

  1. Read a row from a csv file.
  2. Do some transformations on it.
  3. Break it up into the actual rows as they would be written to the database.
  4. Write those rows to individual csv files.
  5. Go back to step 1 unless the file has been totally read.
  6. Run SQL*Loader and load those files into the database.

Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.

There are a few ideas that I have to solve this:

  1. Read the entire input file (or at least read it in very large chunks) and write the output files to disk as a whole or in very large chunks. The idea is that the hard disk would spend less time seeking back and forth between files. Would this accomplish anything that buffering wouldn't?
  2. Parallelize steps 1, 2&3, and 4 into separate processes, so that steps 1, 2, and 3 don't have to wait on step 4 to complete.
  3. Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with idea 2 somehow.

Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

+3  A: 

Poor man's map-reduce:

Use split to break the file up into as many pieces as you have CPUs.

Use batch to run your muncher in parallel.

Use cat to concatenate the results.
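
For reference, here's a rough sketch of the same split / process-in-parallel / concatenate pattern expressed in Python with `multiprocessing` rather than shell tools. The chunk file names and the transform are hypothetical placeholders:

```
# Sketch of the split/munch/cat pattern using multiprocessing.
# Assumes the input has already been split into chunk files
# (e.g. with `split`); transform() stands in for the real work.
from multiprocessing import Pool

def transform(line):
    return line.upper()  # placeholder for the real per-row munching

def process_chunk(path):
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(transform(line))
    return out_path

if __name__ == "__main__":
    chunks = ["chunk_aa", "chunk_ab", "chunk_ac", "chunk_ad"]  # from split
    with Pool(4) as pool:                   # one worker per CPU core
        outputs = pool.map(process_chunk, chunks)
    with open("combined.csv", "w") as dst:  # the `cat` step
        for path in outputs:
            with open(path) as src:
                dst.write(src.read())
```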

Jonathan Feinberg
or use `xargs -P`
elder_george
If he's I/O bound, splitting the operation onto multiple CPUs isn't going to help.
T.E.D.
@T.E.D.: That's not true. One step of his program is I/O bound; using multiple instances might allow one instance to do something while another one is waiting for I/O.
Georg
Perhaps, but it doesn't change how much I/O has to be performed. If that's what's determining his runtime, you won't really help very much by speeding up the other stuff.
T.E.D.
If it's I/O bound, it usually means that a single process is waiting for the OS to service it. If you have multiple concurrent processes, reading multiple files, you will achieve tremendous performance improvements. That's the essential theory behind all time-sharing OS's.
S.Lott
Not really. "I/O bound" means the amount of time required to complete the I/Os dominates the program's runtime. As an example, let's say there is 99 times more time spent in I/O than in processing (probably conservative, really). Now let's suppose he's naively doing *no* multitasking between I/O and CPU work. Now let's say you find a neat way to get rid of *all* that CPU processing time. You've still only sped his program up by 1%.
T.E.D.
Seymour Cray once said that a parallel supercomputer is a means of turning a compute-bound problem into an IO-bound problem.
Crashworks
@T.E.D. I/O bound means that the I/O limits the throughput. Not that I/O processing dominates the runtime. Indeed, if you look at the CPU stats for I/O bound programs you'll see that the system --as a whole-- is idle, and the user-space time is quite low. Your program is bound (limited, constrained) by waiting for OS I/O. Your program does not actually **DO** the I/O. It merely waits for the system. The system waits for devices.
S.Lott
@S. Lott. Right. So if you just speed up the CPU's stuff at this point, all you are doing is getting it to the point where it has to wait on the I/O quicker. Hurry up and wait.
T.E.D.
+3  A: 

If you are I/O bound, the best way I have found to optimize is to read or write the entire file into/out of memory at once, then operate out of RAM from there on.

With extensive testing I found that my runtime ended up bound not by the amount of data I read from/wrote to disk, but by the number of I/O operations I used to do it. That is what you need to optimize.

I don't know Python, but if there is a way to tell it to write the whole file out of RAM in one go, rather than issuing a separate I/O for each byte, that's what you need to do.

Of course the drawback to this is that files can be considerably larger than available RAM. There are lots of ways to deal with that, but that is another question for another time.
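
As a minimal sketch of that idea in Python (the transformation function here is a hypothetical stand-in), the point is one read and one write per file rather than one write per row:

```
# One big read, process entirely in RAM, one big write.
def transform_row(line):
    return [line]  # placeholder for the real row transformation

with open("input.csv") as f:
    lines = f.read().splitlines()        # single read into RAM

out_rows = []
for line in lines:
    out_rows.extend(transform_row(line))

with open("output.csv", "w") as f:
    f.write("\n".join(out_rows) + "\n")  # single write back to disk
```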

T.E.D.
A: 

The first thing is to be certain of what you should optimize. You seem to not know precisely where your time is going. Before spending more time wondering, use a performance profiler to see exactly where the time is going.

http://docs.python.org/library/profile.html

When you know exactly where the time is going, you'll be in a better position to know where to spend your time optimizing.
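
For example, assuming the program's entry point is a hypothetical main() function, profiling sorted by cumulative time is a one-liner:

```
import cProfile

# main() is a hypothetical entry point; sorting by cumulative time
# shows which of the six steps dominates.
cProfile.run("main()", sort="cumulative")
```

Alternatively, running `python -m cProfile yourscript.py` profiles the whole script without any code changes.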

goger
That's not actually true. I have profiled. And as I noted, it's step 4 (writing to disk) that's taking up the most time.
Jason Baker
Your description did not make it clear to me that you had done anything other than eyeball it. If your code is overwhelmingly in step 4, then fine, but since you gave no details I was concerned you hadn't looked close enough at the other steps to be certain you will not be limited to minor speedups by Amdahl's law.
goger
+1  A: 

Use buffered writes for step 4.

Write a simple function that appends the output to a buffer, checks the buffer's length, and only writes when you have accumulated enough data, which should be some multiple of 4 KB. I would say start with 32 KB buffers and time it.

You would have one buffer per file, so that most "writes" won't actually hit the disk.
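
A minimal sketch of that buffered-append idea, accumulating in a list rather than a string (repeated string concatenation is slow in Python); the 32 KB threshold is the starting point suggested above:

```
class BufferedWriter:
    """Accumulate output in memory; hit the disk only per ~32 KB."""

    def __init__(self, path, limit=32 * 1024):
        self.f = open(path, "w")
        self.limit = limit
        self.parts = []
        self.size = 0

    def write(self, s):
        self.parts.append(s)
        self.size += len(s)
        if self.size >= self.limit:
            self.flush()

    def flush(self):
        self.f.write("".join(self.parts))
        self.parts = []
        self.size = 0

    def close(self):
        self.flush()
        self.f.close()
```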

Michael Dillon
+2  A: 

Python already does I/O buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long. That holds unless you force the OS to write immediately, for example by closing the file after each write or opening the file in O_SYNC mode.

If the OS isn't doing the right thing, you can try raising the buffer size (third parameter to open()). For some guidance on appropriate values: given a 100 MB/s, 10 ms latency I/O system, a 1 MB I/O size will result in approximately 50% latency overhead, while a 10 MB I/O size will result in about 9% overhead. If it's still I/O bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
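
For example (a sketch; the output file name is hypothetical, and the 1 MB figure follows the guidance above):

```
# Request a 1 MB buffer on the output file via open()'s buffering
# argument, so writes hit the disk in large chunks.
out = open("rows_1.csv", "w", 1024 * 1024)
```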

It's also useful to check whether step 4 is spending a lot of time executing or waiting on I/O. If it's executing, you'll need to spend more time checking which part is the culprit and optimize that, or split out the work to different processes.

Ants Aasma
How do I tell the difference between waiting on IO and executing? I don't see anything directly related to it in the Python profiler output.
Jason Baker
You'll need OS-level tools for that. For instance, on Linux, if I run the Python script with the time command I can see how much wall clock time it took and how much CPU time was spent in user mode and in kernel mode. The difference between the wall clock time and the sum of the other two is time spent waiting on something; on a lightly loaded system, that's most likely IO.
Ants Aasma
+2  A: 

Can you use a ramdisk for step 4? Low millions sounds doable if the rows are less than a couple of kB or so.
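
For instance, on Linux, /dev/shm is typically a tmpfs mount, so simply pointing step 4's output there keeps the intermediate files in RAM (a sketch; the path and file name are assumptions about the setup):

```
# Write the intermediate files to a tmpfs-backed path so they live in
# RAM; SQL*Loader can then read them from the same location.
out = open("/dev/shm/rows_1.csv", "w")
```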

gnibbler
That sounds like a good idea, to me.
T.E.D.
+1  A: 

Isn't it possible to collect a few thousand rows in RAM, then go directly to the database server and execute the inserts?

This would remove the save to and load from the disk that step 4 entails.

If the database server is transactional, this is also a safe way to do it - just have the database begin a transaction before your first row and commit after the last.
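
A sketch of that batching idea, assuming an Oracle database (implied by SQL*Loader) reached through cx_Oracle; the connection string, target table, batch size, and transform_row() are all hypothetical:

```
import cx_Oracle

def transform_row(line):
    a, b = line.rstrip("\n").split(",")[:2]  # placeholder transformation
    return [(a, b)]

conn = cx_Oracle.connect("user/password@dsn")
cur = conn.cursor()

batch = []
with open("input.csv") as f:
    for line in f:
        batch.extend(transform_row(line))
        if len(batch) >= 5000:  # a few thousand rows at a time
            cur.executemany("INSERT INTO target (a, b) VALUES (:1, :2)", batch)
            batch = []
if batch:
    cur.executemany("INSERT INTO target (a, b) VALUES (:1, :2)", batch)

conn.commit()  # one transaction: begin before the first row, commit after the last
```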

Tom Leys