So I wrote a Python program to handle a little data processing task.
Here's a very brief specification in a made-up language of the computation I want:
parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
flatten | format "%s %lf %s" aa bb cc
That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.
I want to know what real, executable language I should have written it in to get it to be similarly short (as the Python below) but much faster.
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
top_5 = {}
for line in sys.stdin:
aa, bb, cc = line.split()
# We want the top 5 for each distinct value of aa. There are
# hundreds of thousands of values of aa.
bb = float(bb)
if aa not in top_5: top_5[aa] = []
current = top_5[aa]
current.append((bb, cc))
# Every once in a while, we drop the values that are not in
# the top 5, to keep our memory footprint down, because some
# values of aa have thousands of (bb, cc) pairs.
if len(current) > 10:
current.sort()
current[:-5] = []
for aa in top_5:
current = top_5[aa]
current.sort()
for bb, cc in current[-5:]:
print aa, bb, cc
Here’s some sample input data:
3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g
Here’s the output I get from it:
3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q
There are seven values for 3
, and so we drop the c
and d
values
because their bb
value puts them out of the top 5. Because 4
has
only one value, its “top 5” consists of just that one value.
This runs faster than doing the same queries in MySQL (at least, the way we’ve found to do the queries) but I’m pretty sure it's spending most of its time in the Python bytecode interpreter. I think that in another language, I could probably get it to process hundreds of thousands of rows per second instead of per minute. So I’d like to write it in a language that has a faster implementation.
But I’m not sure what language to choose.
I haven’t been able to figure out how to express this as a single query in SQL, and
actually I’m really unimpressed with MySQL’s ability even to merely
select * from foo into outfile 'bar';
the input data.
C is an obvious choice, but things like line.split()
, sorting a list
of 2-tuples, and making a hash table require writing some code that’s
not in the standard library, so I would end up with 100 lines of code
or more instead of 14.
C++ seems like it might be a better choice (it has strings, maps, pairs, and vectors in the standard library) but it seems like the code would be a lot messier with STL.
OCaml would be fine, but does it have an equivalent of line.split()
,
and will I be sad about the performance of its map?
Common Lisp might work?
Is there some equivalent of Matlab for database computation like this that lets me push the loops down into fast code? Has anybody tried Pig?
(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)
(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)
Edit:
There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk
script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):
BEGIN {
for (i = 0; i < 27000000; i++) {
v = rand();
k = int(rand() * 100);
print k " " v " " i;
}
exit;
}