A simple problem, really: you have one billion (1e9) unsigned 32-bit integers stored as decimal ASCII strings in a TSV (tab-separated values) file. Converting them with int()
is horribly slow compared to other tools working on the same dataset. Why? And, more importantly, how can it be made faster?
Therefore the question: what is the fastest possible way to convert a string to an integer in Python?
What I'm really thinking about is some semi-hidden Python functionality that could be (ab)used for this purpose, not unlike Guido's use of array.array
in his "Optimization Anecdote".
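For reference, the trick alluded to there converts a list of small integers to a string by abusing array.array as a raw byte buffer. A minimal sketch, adapted from the essay's f7 (the function name here is mine):

import array

def ints_to_string(ints):
    # Each integer in 0..255 becomes one byte/character; the whole
    # conversion happens in C, with no per-item Python bytecode.
    return array.array('B', ints).tostring()

print ints_to_string([72, 105])  # prints 'Hi'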
Sample data (with tabs expanded to spaces)
38262904 "pfv" 2002-11-15T00:37:20+00:00
12311231 "tnealzref" 2008-01-21T20:46:51+00:00
26783384 "hayb" 2004-02-14T20:43:45+00:00
812874 "qevzasdfvnp" 2005-01-11T00:29:46+00:00
22312733 "bdumtddyasb" 2009-01-17T20:41:04+00:00
The time it takes to read the data is irrelevant here; processing it is the bottleneck.
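To make that concrete, the processing boils down to something like the following sketch (the filename data.tsv is hypothetical); the int() call on the first field is where virtually all the time goes:

with open('data.tsv', 'rb') as f:
    for line in f:
        # split off the first tab-separated field and convert it;
        # int() is the hot spot this question is about
        uid = int(line.split('\t', 1)[0])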
Microbenchmarks
All of the following are interpreted languages. The host machine is running 64-bit Linux.
Python 2.6.2 with IPython 0.9.1, ~2137k conversions per second (100%):
In [1]: strings = map(str, range(int(1e7)))
In [2]: %timeit map(int, strings);
10 loops, best of 3: 4.68 s per loop
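For reproduction outside IPython, an equivalent measurement with the plain timeit module might look like this; the conversions-per-second figures given here are simply 1e7 divided by the loop time:

import timeit

timer = timeit.Timer('map(int, strings)',
                     'strings = map(str, range(int(1e7)))')
seconds = min(timer.repeat(repeat=3, number=1))
print '%.2f s per loop, ~%dk conversions per second' % (
    seconds, 1e7 / seconds / 1000)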
REBOL 3.0 Version 2.100.76.4.2, ~2310kcps (108%):
>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [map str strings [to integer! str]]
== 0:00:04.328675
REBOL 2.7.6.4.2 (15-Mar-2008), ~5227kcps (261%):
As John noted in the comments, this version does not build a list of the converted integers, so the speed ratio given is relative to Python's 4.99 s runtime of for str in strings: int(str).
>> delta-time: func [c /local t] [t: now/time/precise do c now/time/precise - t]
>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [foreach str strings [to integer! str]]
== 0:00:01.913193
KDB+ 2.6t 2009.04.15, ~20161kcps (944%); note that \t reports elapsed milliseconds, so the 496 below means 0.496 s for 1e7 conversions:
q)strings:string til "i"$1e7
q)\t "I"$strings
496
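For completeness, a tiny helper (hypothetical, mine) that derives the rates above from the measured loop times; the REBOL 2.7 percentage is the exception, being computed against the 4.99 s for-loop baseline rather than against map:

def kcps(n, seconds):
    # thousands of conversions per second
    return n / seconds / 1000.0

print '%.0f' % kcps(1e7, 4.68)      # Python 2.6:  ~2137 kcps (the 100% baseline)
print '%.0f' % kcps(1e7, 4.328675)  # REBOL 3.0:   ~2310 kcps
print '%.0f' % kcps(1e7, 1.913193)  # REBOL 2.7:   ~5227 kcps
print '%.0f' % kcps(1e7, 0.496)     # KDB+:       ~20161 kcps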