tags:

views: 154

answers: 4

I have a large table with 15k entries in MySQL from which I need to select a few items, many times. For example, I might want all entries with a number field between 1 and 10.

In SQL this would be easy:

SELECT text FROM table WHERE number >= 1 AND number < 10;

If I extract the entire table to a Python list:

PyList = [[text1, number1], [text2, number2], ...]

I could then extract those same text values by running through the entire list:

for item in PyList:
    if item[1] >= 1 and item[1] < 10:
        result.append(item[0])

Now, the performance question: I have to do this for a sliding window. I want all entries between 1 and 10, then 2 and 11, then 3 and 12, and so on up to 14990 and 15000. Which approach is faster for a list this big?

An improvement in Python I'm thinking about is to pre-sort the Python list by number. When the window moves I could remove the lowest values from result and append all elements satisfying the new condition to get the new result. I would also keep track of my index into PyList so I would know where to start from in the next iteration. This would spare me from running through the entire list again.
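
A rough sketch of that idea (assuming PyList is pre-sorted by number and using the same half-open bounds as above; the window width is just illustrative):

start_i = end_i = 0
n = len(PyList)
for low in range(1, 14991):
    high = low + 9                     # windows [1, 10), [2, 11), ...
    # both indices only ever move forward, so the whole sweep
    # touches each element a constant number of times
    while start_i < n and PyList[start_i][1] < low:
        start_i += 1
    while end_i < n and PyList[end_i][1] < high:
        end_i += 1
    result = [item[0] for item in PyList[start_i:end_i]]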

I don't know how to speed up MySQL for successive SELECTs that are very similar, and I don't know how it works internally well enough to understand the performance differences between the two approaches.

How would you implement this?

A: 

Read all the data into Python (from the numbers you mention, it should handily fit in memory) into a variable pylist as you suggest, then prep an auxiliary data structure as follows:

import collections
d = collections.defaultdict(list)
for text, number in pylist:
    d[number].append(text)

Now, to get all texts for numbers between low (included) and high (excluded):

def slidingwindow(d, low, high):
    result = []
    for x in xrange(low, high):
        result.extend(d.get(x, ()))
    return result
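
For example, with made-up data (the names and values here are just placeholders):

pylist = [['a', 1], ['b', 3], ['c', 3], ['d', 12]]
# build d as above, then:
print slidingwindow(d, 1, 10)    # ['a', 'b', 'c']
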
Alex Martelli
A: 

It is difficult to answer without actual performance numbers, but my gut feeling is that it would be better to go for SQL with bind variables (I am not a MySQL expert, but in this case the placeholder syntax should be something like %varname).

The reason is that you would return data only when needed (thus the user interface would become responsive much sooner) and you would rely on a system highly optimized for that kind of operation. On the other hand, retrieving one larger chunk of data is usually faster than retrieving many smaller ones, so the "full Python" approach could have its edge.

However, unless you have serious performance issues, I would still stick with SQL, because it leads to much simpler code that is easier to read and understand.
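
A sketch of what that could look like with the MySQLdb driver (the connection details and the table name mytable are assumptions; MySQLdb uses %s as its placeholder style):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='mydb')
cur = conn.cursor()
for low in range(1, 14991):
    # the statement text stays identical; only the bound values change
    cur.execute("SELECT text FROM mytable WHERE number >= %s AND number < %s",
                (low, low + 9))
    texts = [row[0] for row in cur.fetchall()]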

Roberto Liffredo
+1  A: 

Simply define an index on number in your database; then the database can generate each result set almost instantly. Plus, it can do some calculations on these sets too, if that is your next step.
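
Creating the index is a single statement (the table name here is a placeholder; the question's literal name table is a reserved word in MySQL and would need quoting):

CREATE INDEX number_idx ON mytable (number);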

Databases are actually great at such queries; I'd let the database do its job before trying something else.

THC4k
Thanks, I didn't know about indexes in MySQL, so I learned that bit from your post.
greye
+1  A: 

It's certainly going to be much faster to pull the data into memory than run ~15,000 queries.

My advice is to make sure the SQL query sorts the data by number. If the data is sorted, you can use the very fast lookup functions in the bisect standard library module to find indices.
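
A sketch of that approach (assuming rows holds (number, text) pairs already sorted by number, e.g. fetched with ORDER BY number; the names are illustrative):

import bisect

numbers = [number for number, text in rows]    # parallel list of the sorted keys

def window(rows, numbers, low, high):
    # texts with low <= number < high, located in O(log n) time
    i = bisect.bisect_left(numbers, low)
    j = bisect.bisect_left(numbers, high)
    return [text for number, text in rows[i:j]]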

Triptych