The following query returns data right away:

SELECT time, value FROM data ORDER BY time LIMIT 100;

Without the LIMIT clause, it takes a long time before the server starts returning rows:

SELECT time, value FROM data ORDER BY time;

I observe this both in the query tool (psql) and when querying through an API.

Questions/issues:

  • The amount of work the server has to do before starting to return rows should be the same for both SELECT statements. Correct?
  • If so, why is there a delay in the second case?
  • Is there some fundamental RDBMS issue that I do not understand?
  • Is there a way I can make PostgreSQL start returning result rows to the client without a pause, in the second case as well?
  • EDIT (see below): it looks like setFetchSize is the key to solving this. In my case I execute the query from Python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.

The column time is the primary key, BTW.

EDIT:

I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last item in the bullet list above):

By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.

and

Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).

// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();

// Turn use of the cursor on.
st.setFetchSize(50);
A: 

In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.

I would expect a capable DB to notice this and optimize for it. It seems that PostgreSQL does not. *shrug*

You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.

I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.

Carl Smotricz
I believe you might have misread my question. With the LIMIT, the database returns those rows right away. Without the limit, there is a pause before the first rows are returned to the client.
codeape
Databases can produce different query plans if the optimizer knows you are interested in results as fast as possible. Oracle and DB2 both have options for that. Maybe the LIMIT clause acts as the PostgreSQL hint that the query wants results immediately?
Ken Fox
Yes, I did misunderstand your question at first, that's why I revamped the whole thing. Please look at my updated answer now!
Carl Smotricz
Dropping the ORDER BY makes no difference. In fact, I now believe that the problem is in the client driver. It looks as if the driver by default collects all the results for the query at once (see my edit).
codeape
Ah well, it looks like they recognize the problem and helpfully offer a fix as well. If you're using `PreparedStatement` to submit your queries you can implement their suggestion directly (using 100 if you like); otherwise you should be able to use a `Statement` to submit your query and call setFetchSize() on that.
Carl Smotricz
Oh... Reminder: If autoCommit is turned off, you'll have to explicitly `conn.commit()` if you do any updates.
Carl Smotricz
One small problem: I work in Python. The solution is for a Java driver. Haven't been able to figure out how to do it in Python yet.
codeape
Me, I don't know Python. I'd recommend writing up a new SO question for the subset of info you need, perhaps something like "How do I set the fetch size for PGSQL in Python?"
Carl Smotricz
Yes, good idea.
codeape
+3  A: 

The psycopg2 DBAPI driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs; if you're using the ORM, see the Query.yield_per() method.
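
A minimal sketch of the named-cursor approach in plain psycopg2 (the connection string and table are placeholders; itersize controls how many rows travel per network round trip):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

# Passing a name makes this a server-side (named) cursor: rows are
# fetched from the server incrementally instead of buffered client-side.
cur = conn.cursor("large_result")
cur.itersize = 100  # rows per network round trip while iterating

cur.execute("SELECT time, value FROM data ORDER BY time")
for row in cur:
    print(row)  # the first rows should arrive without the long pause

conn.commit()  # the named cursor lives inside a transaction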

SQLAlchemy currently doesn't have an option to set that for a single query, but there is a ticket with a patch implementing it.
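
At the engine level it looks roughly like this (a sketch assuming the psycopg2 dialect; the URL is a placeholder):

import sqlalchemy

# server_side_cursors=True makes the psycopg2 dialect use named
# (server-side) cursors for statements executed on this engine.
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:password@localhost/mydb",  # placeholder URL
    server_side_cursors=True,
)

conn = engine.connect()
for row in conn.execute("SELECT time, value FROM data ORDER BY time"):
    print(row)  # rows stream in instead of being buffered up front
conn.close()

# With the ORM, Query.yield_per() batches rows as they are fetched, e.g.:
#     session.query(Data).yield_per(100)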

Ants Aasma
I tried using a server-side cursor: c = conn.cursor("mycursor"); c.execute("..."); c.fetchmany(100). But I still get the long delay before anything is returned. What am I doing wrong?
codeape
Assuming conn is a psycopg2 connection, I have no idea; it works correctly for me. You can try executing EXPLAIN ANALYZE for the same query and look at the first time number in the explain output - that is the time PostgreSQL took to find the first row.
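
Roughly, from Python (a sketch; the DSN is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
cur = conn.cursor()  # a plain client-side cursor is fine for EXPLAIN
cur.execute("EXPLAIN ANALYZE SELECT time, value FROM data ORDER BY time")
for (line,) in cur.fetchall():
    print(line)
# Each plan node prints "actual time=first..last" (in milliseconds);
# "first" is how long PostgreSQL took to produce that node's first row.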
Ants Aasma