As we know, PostgreSQL's OFFSET requires that it scan through all the rows up to the point you requested, which makes it kind of useless for pagination through huge result sets, getting slower and slower as the OFFSET goes up.

PG 8.4 now supports window functions. Instead of:

SELECT * FROM table ORDER BY somecol LIMIT 10 OFFSET 500

You can say:

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY somecol ASC) AS rownum FROM table) AS foo
WHERE rownum > 500 AND rownum <= 510

Does the latter approach help us at all? Or do we have to keep using identifying columns and temp tables for large pagination?
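
(By "identifying columns" I mean keyset-style paging: remember the last key of the previous page and seek past it with an indexed WHERE clause instead of counting rows. A rough sketch, with table and column names only for illustration:)

import psycopg2

conn = psycopg2.connect("dbname=test user=scott password=tiger host=localhost")
cur = conn.cursor()

# value of the ordering column from the bottom row of the previous page
last_seen_id = 500

# seek directly past the last row seen; with an index on id this stays
# fast no matter how deep into the result set you are
cur.execute(
    "SELECT * FROM t1 WHERE id > %s ORDER BY id LIMIT %s",
    (last_seen_id, 10),
)
rows = cur.fetchall()
cur.close()

That stays fast at any depth because the index is used to seek straight to the starting row, but it only supports next/previous style paging.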

A: 

Use a CURSOR for a large result set; it will be much faster. For small result sets the LIMIT OFFSET construction works fine, but it has its limits.

ROW_NUMBER is a nice thing, but not for pagination. You end up with bad performance because of sequential scans.
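
A rough sketch of what paging with a cursor looks like from psycopg2 (table and column names only for illustration):

import psycopg2

conn = psycopg2.connect("dbname=test user=scott password=tiger host=localhost")
cur = conn.cursor()

# DECLARE runs inside the transaction psycopg2 opens implicitly; the rows
# stay on the server and are fetched on demand instead of all at once
cur.execute("DECLARE pagecur SCROLL CURSOR FOR SELECT * FROM t1 ORDER BY id")
cur.execute("MOVE FORWARD 500 IN pagecur")    # skip to the page you want
cur.execute("FETCH FORWARD 10 FROM pagecur")  # pull just that page
rows = cur.fetchall()
cur.execute("CLOSE pagecur")
cur.close()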

Frank Heikens
Don't cursors work poorly when you're doing pagination? Don't they disappear when the connection ends?
Charles
mm yeah this is for a webapp, so cursors aren't an option.
zzzeek
Yes, a CURSOR does disappear after your connection ends, and that's fine. It's still much, much faster than any other solution. A query using LIMIT OFFSET or ROW_NUMBER also disappears. I have no idea why a CURSOR can't be used in a webapp; it works great over here. Start testing and see how much you will gain (or lose).
Frank Heikens
+2  A: 

I've constructed a test comparing OFFSET, cursors, and ROW_NUMBER(). My impression of ROW_NUMBER(), that its speed would be consistent regardless of where you are in the result set, is correct. However, that speed is dramatically slower than either OFFSET or a CURSOR, which, as I also expected, are pretty much the same in speed, both degrading the further toward the end of the result set you go.

Results:

offset(100,100): 0.016359
scroll(100,100): 0.018393
rownum(100,100): 15.535614

offset(100,480000): 1.761800
scroll(100,480000): 1.781913
rownum(100,480000): 15.158601

offset(100,999900): 3.670898
scroll(100,999900): 3.664517
rownum(100,999900): 14.581068

The test script uses SQLAlchemy to set up a table with 1,000,000 rows of test data. It then uses a psycopg2 cursor to execute each SELECT statement and fetch results with the three different methods.

from sqlalchemy import *

metadata = MetaData()
engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)

t1 = Table('t1', metadata,
    Column('id', Integer, primary_key=True),
    Column('d1', String(50)),
    Column('d2', String(50)),
    Column('d3', String(50)),
    Column('d4', String(50)),
    Column('d5', String(50))
)

if not engine.has_table('t1'):
    conn = engine.connect()
    t1.create(conn)

    # 1000000 rows
    for i in range(100):
        conn.execute(t1.insert(), [
            dict(
                ('d%d' % col, "data data data %d %d" % (col, (i * 10000) + j))
                for col in range(1, 6)
            ) for j in xrange(1, 10001)
        ])

import time

def timeit(fn, count, *args):
    now = time.time()
    for i in xrange(count):
        fn(*args)
    total = time.time() - now
    print "%s(%s): %f" % (fn.__name__, ",".join(repr(x) for x in args), total)

# this is a raw psycopg2 connection.
conn = engine.raw_connection()

def offset(limit, offset):
    # plain LIMIT/OFFSET paging
    cursor = conn.cursor()
    cursor.execute("select * from t1 order by id limit %d offset %d" % (limit, offset))
    cursor.fetchall()
    cursor.close()

def rownum(limit, offset):
    # ROW_NUMBER() window-function paging
    cursor = conn.cursor()
    cursor.execute("select * from (select *, "
                    "row_number() over (order by id asc) as rownum from t1) as foo "
                    "where rownum>=%d and rownum<%d" % (offset, limit + offset))
    cursor.fetchall()
    cursor.close()

def scroll(limit, offset):
    # named cursor -> a psycopg2 server-side cursor
    cursor = conn.cursor('foo')
    cursor.execute("select * from t1 order by id")
    cursor.scroll(offset)
    cursor.fetchmany(limit)
    cursor.close()

print 

timeit(offset, 10, 100, 100)
timeit(scroll, 10, 100, 100)
timeit(rownum, 10, 100, 100)

print 

timeit(offset, 10, 100, 480000)
timeit(scroll, 10, 100, 480000)
timeit(rownum, 10, 100, 480000)

print 

timeit(offset, 10, 100, 999900)
timeit(scroll, 10, 100, 999900)
timeit(rownum, 10, 100, 999900)
zzzeek
Did you enable server_side_cursors? If not, the above code didn't create a database cursor, the thing you were looking for. http://www.sqlalchemy.org/docs/05/reference/dialects/postgres.html?highlight=cursor
Frank Heikens
psycopg2 server-side cursors are invoked when you call cursor() with a name, as in http://initd.org/psycopg/docs/usage.html#server-side-cursors . If you remove the name argument from the call to cursor(), the "scroll" function above takes about ten seconds per call; this is because psycopg2 fully loads the result set if you don't turn on server-side cursors, and in the above case it pulls 1M rows over the wire.
zzzeek
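
To make the distinction concrete, roughly (same connection parameters as the test script, on a raw psycopg2 connection):

import psycopg2

conn = psycopg2.connect("dbname=test user=scott password=tiger host=localhost")

# client-side (unnamed) cursor: the entire result set is transferred and
# buffered in the client before fetchmany() returns anything
cur = conn.cursor()
cur.execute("select * from t1 order by id")
page = cur.fetchmany(100)
cur.close()

# server-side (named) cursor: psycopg2 issues DECLARE behind the scenes,
# so only the rows actually fetched cross the wire
cur = conn.cursor('foo')
cur.execute("select * from t1 order by id")
cur.scroll(480000)
page = cur.fetchmany(100)
cur.close()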