Hello!

Recently I wrote an algorithm to quantize an RGB image. Every pixel is represented by an (R,G,B) vector, and the quantization codebook is a set of 3-dimensional vectors. Every pixel of the image needs to be mapped to (say, "replaced by") the codebook vector closest to it in terms of Euclidean distance (more exactly, squared Euclidean). I did it as follows:

from numpy import zeros, sqrt, sum, argmin  # numpy's sum, which takes an axis argument

# Minimal base class so the example is self-contained.
class DistanceMetric(object):
    def __call__(self, x, y):
        raise NotImplementedError

class EuclideanMetric(DistanceMetric):
    def __call__(self, x, y):
        d = x - y
        # Sum the squared channel differences along the last axis.
        return sqrt(sum(d * d, -1))

class Quantizer(object):
    def __init__(self, codebook, distanceMetric=EuclideanMetric()):
        self._codebook = codebook
        self._distMetric = distanceMetric

    def quantize(self, imageArray):
        quantizedRaster = zeros(imageArray.shape)

        X = quantizedRaster.shape[0]
        Y = quantizedRaster.shape[1]
        for i in xrange(0, X):
            print i  # progress indicator, one line per image row
            for j in xrange(0, Y):
                # Distance from this pixel to every codebook vector at once.
                dist = self._distMetric(imageArray[i, j], self._codebook)
                code = argmin(dist)
                quantizedRaster[i, j] = self._codebook[code]

        return quantizedRaster

...and it is awfully slow: almost 800 seconds on my Pentium Core Duo 2.2 GHz with 4 GB of memory, for an image of 2600*2700 pixels :(

Is there a way to optimize this somewhat? Maybe a different algorithm, or some Python-specific optimizations?

UPD: I tried using the squared Euclidean distance and the running time is still enormous.

A: 

You could use the vector quantization function vq from scipy.cluster.vq.
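Something along these lines should work (an untested sketch; the helper name is just for illustration, and it assumes the codebook is an (N, 3) float array):

from numpy import asarray
from scipy.cluster.vq import vq

def quantize_with_vq(imageArray, codebook):
    # vq expects a 2-D array of observations, so flatten the raster first.
    pixels = asarray(imageArray, dtype=float).reshape(-1, 3)
    codes, _ = vq(pixels, codebook)  # index of the nearest codebook vector per pixel
    return codebook[codes].reshape(imageArray.shape)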

unutbu
I would certainly be using it already if there were a way to use a distance metric other than Euclidean in that function. I'm writing a degree thesis and need to compare several metrics to choose one that is both accurate and computationally cheap.
m1st
+3  A: 

One simple optimization is to drop the sqrt call. Since sqrt(x) is monotonic in x, and you only need the minimum distance rather than the actual distance, comparing squared distances gives the same argmin. That should help a bit, since sqrt is expensive.

This trick is used a lot when working with distances. For instance, if you have a distance threshold, you can compare against threshold^2 and drop the sqrt from the distance calculation. Really, the sqrt is only necessary when the absolute distance is needed; for relative comparisons, drop it.
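Concretely, that is just the question's metric with the sqrt removed (a sketch, assuming numpy's sum is in scope as in the question):

class SquaredEuclideanMetric(DistanceMetric):
    def __call__(self, x, y):
        d = x - y
        # No sqrt: argmin over squared distances picks the same codebook
        # vector, because sqrt is monotonic.
        return sum(d * d, -1)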

Update: an algorithmic change is probably needed, then. Right now you are comparing every pixel against every codebook vector; the speedup will have to come from reducing the number of distance calculations.

You might do better using a kd-tree for this, which will reduce the search for each pixel from O(codebook) to O(log(codebook)). I've never done this in python, but some googling gave an implementation that might work here.
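For illustration, here is a sketch of that idea using SciPy's cKDTree rather than the linked implementation (hypothetical helper name; note that a kd-tree only helps for Euclidean-style metrics):

from scipy.spatial import cKDTree

def quantize_with_kdtree(imageArray, codebook):
    tree = cKDTree(codebook)       # build the tree over the codebook once
    pixels = imageArray.reshape(-1, 3)
    _, codes = tree.query(pixels)  # nearest codebook index for every pixel
    return codebook[codes].reshape(imageArray.shape)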

academicRobot
Accepted! Profiling showed that computing the distances takes almost all of the running time; that's the bottleneck.
m1st
A: 

If X is very large, you're printing i quite a lot, which can really hurt performance. For a less specific answer, read on.

To find out where the bottleneck in your process is, I suggest a timing decorator, something along the lines of

from functools import wraps
import time

def time_this(func):
    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        finish = time.time()
        elapsed = (finish - start) * 1000  # convert seconds to milliseconds
        print '{0}: {1} ms'.format(func.__name__, elapsed)
        return result
    return wrapper

I found this somewhere once upon a time and have always used it to figure out where my code is slow. You can break your algorithm down into a series of separate functions, decorate each one with this decorator, and see how long each call takes. Then it's a matter of moving statements between functions to see what improves the decorated functions' running times. Mainly you're looking for two things: 1) statements that take a long time to execute, and 2) statements that don't take long individually but are executed so many times that even a very small improvement has a large effect on overall performance.
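For example (hypothetical usage), you could wrap the whole quantization step:

@time_this
def run_quantizer(quantizer, imageArray):
    return quantizer.quantize(imageArray)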

Good luck!

Colorado
Printing does not really hurt the performance, I checked it out. And for profiling I use the cProfile and pstats modules.
m1st
I'll have to check those profiling modules out. Thanks for the pointer.
Colorado