I have the following simple table that contains traffic measurement data:

CREATE TABLE "TrafficData"
(
  "RoadID" character varying NOT NULL,
  "DateID" numeric NOT NULL,
  "ExactDateTime" timestamp NOT NULL,
  "CarsSpeed" numeric NOT NULL,
  "CarsCount" numeric NOT NULL
);
CREATE INDEX "RoadDate_Idx" ON "TrafficData" USING btree ("RoadID", "DateID");

The column RoadID uniquely identifies the road whose data is being recorded, while DateID identifies the day the data was recorded, encoded as a YYYYMMDD number (so 365 distinct values for a year of data) - basically a rounded-off representation of ExactDateTime.

I have about 100,000,000 rows; there are 1,000 distinct values in the column "RoadID" and 365 distinct values in the column "DateID".

I then run the following query:

SELECT * FROM "TrafficData"
WHERE "RoadID"='Station_1'
AND "DateID">20100610 AND "DateID"<20100618;

This takes up to three mind-boggling seconds to finish, and I cannot for the life of me figure out WHY.

EXPLAIN ANALYZE gives me the following output:

Bitmap Heap Scan on "TrafficData"  (cost=104.84..9743.06 rows=2496 width=47) (actual time=35.112..2162.404 rows=2016 loops=1)
  Recheck Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
  ->  Bitmap Index Scan on "RoadDate_Idx"  (cost=0.00..104.22 rows=2496 width=0) (actual time=1.637..1.637 rows=2016 loops=1)
        Index Cond: ((("RoadID")::text = 'Station_1'::text) AND ("DateID" > 20100610::numeric) AND ("DateID" < 20100618::numeric))
Total runtime: 2163.985 ms

My specs:

  • Windows 7
  • Postgres 9.0
  • 4GB RAM

I'd greatly appreciate any helpful pointers!

A: 
  • 4GB RAM -> 6GB or more. You have 100M records, which isn't large, but on a desktop machine memory could matter. If this is not a desktop, I'm not sure why you would have so little memory.
  • AND "DateID">20100610 AND "DateID"<20100618; -> "DateID" BETWEEN 20100611 AND 20100617;
  • Create an index on DateID.
  • Get rid of all the double quotes around field names.
  • Make RoadID a text field instead of a varchar (a combined sketch of these changes follows this list).
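
Putting those suggestions together, a minimal sketch might look like this (the renamed lowercase identifiers are illustrative assumptions, not the original schema):

-- Hypothetical reworked schema combining the suggestions above;
-- unquoted lowercase names avoid the double-quote issue.
CREATE TABLE trafficdata
(
  roadid text NOT NULL,
  dateid numeric NOT NULL,
  exactdatetime timestamp NOT NULL,
  carsspeed numeric NOT NULL,
  carscount numeric NOT NULL
);

CREATE INDEX roaddate_idx ON trafficdata (roadid, dateid);
CREATE INDEX date_idx ON trafficdata (dateid);

SELECT * FROM trafficdata
WHERE roadid = 'Station_1'
AND dateid BETWEEN 20100611 AND 20100617;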
vol7ron
RAM upgrade isn't an option, unfortunately. More RAM shouldn't really make any difference anyway, since the index files are already "small" enough to fit in memory. I've also tried DateID BETWEEN; sadly, it made no difference.
TroutKing
Well, you also want to create an index on DateID; see the additions.
vol7ron
Right, I've added an index on DateID. Sadly no difference.
TroutKing
+2  A: 

The slow part is obviously fetching the data from the table, since the index access seems to be very fast. You might either tune your memory usage parameters (see http://wiki.postgresql.org/wiki/Performance_Optimization and http://www.varlena.com/GeneralBits/Tidbits/perf.html), or optimize the layout of the data in the table by issuing a CLUSTER command (see http://www.postgresql.org/docs/8.3/static/sql-cluster.html).

CLUSTER "TrafficData" USING "RoadDate_Idx";

should do it.
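
For the memory parameters mentioned above, a rough starting point on a 4GB machine might be something like this in postgresql.conf (the values are illustrative assumptions, not tuned recommendations):

shared_buffers = 512MB        # cache for table and index pages
effective_cache_size = 2GB    # planner's estimate of OS + PostgreSQL caching
work_mem = 16MB               # memory per sort/hash operation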

Daniel
I tried this out yesterday, but had to give up when the CLUSTER command was still running after eleven hours.
TroutKing
Holla... Maybe a simple sorted SELECT into another table might be faster? If CLUSTER is that slow, it really seems to be an I/O problem, and the physical order of the data does not seem to match the index at all. If you can, try this on server-class hardware...
Daniel
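
A minimal sketch of that "sorted SELECT into another table" idea; the new table and index names are assumptions, not from the original post:

-- Copy the rows in index order into a new table, so rows for the
-- same road and date range end up in adjacent disk blocks.
CREATE TABLE "TrafficData_sorted" AS
SELECT * FROM "TrafficData"
ORDER BY "RoadID", "DateID";

-- Rebuild the index on the sorted copy, then swap the tables in.
CREATE INDEX "RoadDate_Idx_sorted" ON "TrafficData_sorted" ("RoadID", "DateID");
ALTER TABLE "TrafficData" RENAME TO "TrafficData_old";
ALTER TABLE "TrafficData_sorted" RENAME TO "TrafficData";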
A: 

Adding to Daniel's answer, the CLUSTER operation is a one-off process that rearranges the data on disk. The intent is to get your 2,000 result rows from fewer disk blocks.

As this is dummy data, being used to work out how you can query it quickly, I'd recommend reloading it in a pattern closer to how it will be loaded as it is generated. I imagine the data is generated one day at a time, which will effectively result in a strong correlation between DateID and location on disk. If that is the case, I'd either cluster by DateID, or split your test data into 365 separate loads and reload it that way.
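Clustering by DateID would be a one-time operation along these lines (the index name is an assumption; note that CLUSTER still has to rewrite the whole table once):

-- Hypothetical: physically order the table by day, mimicking how
-- production data would arrive one day at a time.
CREATE INDEX "Date_Idx" ON "TrafficData" ("DateID");
CLUSTER "TrafficData" USING "Date_Idx";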

Without that, with randomly generated data, you most likely have to perform over 2,000 seeks of your disk head.

I'd also check that nothing else you're running on Windows 7 is adding time to those reads, such as an antivirus checking that the blocks read don't contain virus signatures, or an automatically scheduled disk defragmentation running concurrently (which would leave the disk head hardly ever anywhere near where it was the last time a database block was read).

Stephen Denne