views:

112

answers:

4

The question is, basically: what would be more preferable, both performance-wise and design-wise - to have a list of objects of a Python class or to have several lists of numerical properties?

I am writing some sort of a scientific simulation which involves a rather large system of interacting particles. For simplicity, let's say we have a set of balls bouncing inside a box so each ball has a number of numerical properties, like x-y-z-coordinates, diameter, mass, velocity vector and so on. How to store the system better? Two major options I can think of are:

to make a class "Ball" with those properties and some methods, then store a list of objects of the class, e. g. [b1, b2, b3, ...bn, ...], where for each bn we can access bn.x, bn.y, bn.mass and so on;

to make an array of numbers for each property, then for each i-th "ball" we can access it's 'x' coordinate as xs[i], 'y' coordinate as ys[i], 'mass' as masses[i] and so on;

To me it seems that the first option represents a better design. The second option looks somewhat uglier, but might be better in terms of performance, and it could be easier to use it with numpy and scipy, which I try to use as much as I can.

I am still not sure if Python will be fast enough, so it may be necessary to rewrite it in C++ or something, after initial prototyping in Python. Would the choice of data representation be different for C/C++? What about a hybrid approach, e.g. Python with C++ extension?

Update: I never expected any performance gain from parallel arrays per se, but in a mixed environment like Python + Numpy (or whatever SlowScriptingLanguage + FastNativeLibrary) using them may (or may not?) let you move more work out of you slow scripting code and into the fast native library.

+3  A: 

Having an object for each ball in this example is certainly better design. Parallel arrays are really a workaround for languages that do not support proper objects. I wouldn't use them in a language with OO capabilities unless it's a tiny case that fits within a function (and maybe not even then) or if I've run out of every other optimization option and the profiler shows that property access is the culprit. This applies twice as much to Python as to C++, as the former places a large emphasis on readability and elegance.

Max Shawabkeh
+1: Parallel arrays are perfectly awful. They -- generally -- will not have any better performance.
S.Lott
+1. Even in languages that don't support objects but have aggregate types (C and Pascal come to mind), parallel arrays are a bad idea. I used parallel arrays in BASIC -- old BASIC with line numbers -- and that's about it.
Fred Larson
> They -- generally -- will not have any better performance.I never expected any performance gain from parallel arrays per se, but in a mixed environment like Python + Numpy (or whatever SlowScriptingLanguage + FastNativeLibrary) using them may (or may not) let you move more work out of you slow scripting code and into the fast native library. That, in some sense, what my question mostly is about.
Headcrab
@Headcrab: Please update the question to reflect what it is mostly about. A list of tuples is far, far better than parallel lists. And I think it's seamlessly processed by numpy. Please update the question to indicate your focus.
S.Lott
A: 

Will you be having any forces between the balls (hard sphere/collision, gravity, electromagnetic)? I'm guessing so. Will you be having a large enough number of balls to want to use Barnes-Hut simulation ideas? If so, then you should definitely use the Ball class idea so that you can store them easily in octrees or something else along those lines. Also, using the Barnes-Hut simulation will cut down the complexity of the simulation to O(N log N) from O(N^2).

Really though, if you don't have forces between the balls or aren't using many balls, you don't need the possible speed gains from using parallel arrays and should go with the Ball class idea for that as well.

Justin Peel
I am using KDTree from scipy.spatial now (my current prototype version is in 2D), and, actually, that was one of the cases where I used a parallel array (a list of (x, y)-tuples) to build the tree. Thinking of it now, I probably should see if it's possible to avoid that by overloading [] operator for the Ball class, though...
Headcrab
A: 

I agree that parallel arrays are almost always a bad idea, but don't forget that you can use views into a numpy array when you're setting things, up, though... (Yes, I know this is effectively using parallel arrays, but I think it's the best option in this case...)

This is great if you know the number of "balls" you're going to create beforehand, as you can allocate an array for the coordinates, and store a view into that array for each ball object.

You have to be a bit careful to do operations in-place on the coords array, but it makes updating coordinates for numerous "balls" much, much, much faster.

For example...

import numpy as np

class Ball(object):
    def __init__(self, coords):
        self.coords = coords

    def _set_coord(self, i, value):
        self.coords[i] = value
    x = property(lambda self: self.coords[0],
            lambda self, value: self._set_coord(0, value))
    y = property(lambda self: self.coords[1],
            lambda self, value: self._set_coord(1, value))

    def move(self, dx, dy):
        self.x += dx
        self.y += dy

def main():
    n_balls = 10
    n_dims = 2
    coords = np.zeros((n_balls, n_dims))
    balls = [Ball(coords[i,:]) for i in range(n_balls)]

    # Just to illustrate that that the coords are updating
    ball = balls[0]

    # Random walk by updating coords array
    print 'Moving all the balls randomly by updating coords'
    for step in xrange(5):
        # Add a random value to all coordinates
        coords += 0.5 - np.random.random((n_balls, n_dims))

        # Display the coords for a particular ball and the 
        # corresponding row of the coords array
        print '    Value of ball.x, ball.y:', ball.x, ball.y
        print '    Value of coords[0,:]:', coords[0,:]

    # Move an individual ball object
    print 'Moving a ball individually through Ball.move()'
    ball.move(0.5, 0.5)
    print '    Value of ball.x, ball.y:', ball.x, ball.y
    print '    Value of coords[0,:]:', coords[0,:]

main()

Just to illustrate, this outputs something like:

Moving all the balls randomly by updating coords
    Value of ball.x, ball.y: -0.125713650677 0.301692195466
    Value of coords[0,:]: [-0.12571365  0.3016922 ]
    Value of ball.x, ball.y: -0.304516863495 -0.0447543559805
    Value of coords[0,:]: [-0.30451686 -0.04475436]
    Value of ball.x, ball.y: -0.171589457954 0.334844443821
    Value of coords[0,:]: [-0.17158946  0.33484444]
    Value of ball.x, ball.y: -0.0452864552743 -0.0297552313656
    Value of coords[0,:]: [-0.04528646 -0.02975523]
    Value of ball.x, ball.y: -0.163829876915 0.0153203173857
    Value of coords[0,:]: [-0.16382988  0.01532032]
Moving a ball individually through Ball.move()
    Value of ball.x, ball.y: 0.336170123085 0.515320317386
    Value of coords[0,:]: [ 0.33617012  0.51532032]

The advantage here is that updating a single numpy array is going to be much, much faster than iterating through all of your ball objects, but you retain a more object-oriented approach.

Just my thoughts on it, anyway..

EDIT: To give some idea of the speed difference, with 1,000,000 balls:

In [104]: %timeit coords[:,0] += 1.0
100 loops, best of 3: 11.8 ms per loop

In [105]: %timeit [item.x + 1.0 for item in balls]
1 loops, best of 3: 1.69 s per loop

So, updating the coordinates directly using numpy is roughly 2 orders of magnitude faster when using a large number of balls. (the difference is smaller when using 10 balls, as per the example, roughly a factor of 2x, rather than 150x)

Joe Kington
So this is using a single 2d array to store everything in columns, with each "ball" having a link to it's "storage column"?
Headcrab
@Headcrab - Yep. It's basically just storing a pointer. The advantage is when you want to use vectorized numpy expressions to update the coordinates, rather than iterating through each ball object.
Joe Kington
+1  A: 

I think it depends on what you're going to be doing with them, and how often you're going to be working with (all attributes of one particle) vs (one attribute of all particles). The former is better suited to the object approach; the latter is better suited to the array approach.

I was facing a similar problem (although in a different domain) a couple of years ago. The project got deprioritized before I actually implemented this phase, but I was leaning towards a hybrid approach, where in addition to the Ball class I would have an Ensemble class. The Ensemble would not be a list or other simple container of Balls, but would have its own attributes (which would be arrays) and its own methods. Whether the Ensemble is created from the Balls, or the Balls from the Ensemble, depends on how you're going to construct them.

One of my coworkers was arguing for a solution where the fundamental object was an Ensemble which might contain only one Ball, so that no calling code would ever have to know whether you were operating on just one Ball (do you ever do that for your application?) or on many.

Vicki Laidler