ansaurus

Question

PostgreSQL: How to optimize my database for storing and querying a huge graph

Answer 1

+3 A:

I guess this is caused by the “density” of same-key records on disk. I think records with the same id are stored densely (i.e., in a small number of blocks), while records with the same link are stored sparsely (i.e., spread over a huge number of blocks). This can happen if you inserted the records in id order.

Assume that: 1. there are 10,000 records; 2. they are stored in the order (id, link) = (1, 1), (1, 2), ..., (1, 100), (2, 1), ...; and 3. 50 records fit in one block.

Under these assumptions, blocks #1 to #3 contain the records (1, 1)~(1, 50), (1, 51)~(1, 100), and (2, 1)~(2, 50), respectively.

When you run SELECT * FROM edges WHERE id=1, only 2 blocks (#1, #2) need to be loaded and scanned. On the other hand, SELECT * FROM edges WHERE link=1 requires 100 blocks (#1, #3, #5, ..., #199), one for each of the 100 ids, even though the number of matching rows is the same.
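The block arithmetic can be checked with a small toy model. This is only a sketch of the assumptions stated above (100 ids with 100 links each, rows laid out in insertion order, 50 rows per block); the helper name blocks_touched is made up for illustration and nothing here talks to PostgreSQL.

```python
# Toy model of the layout described above: 100 ids x 100 links = 10,000 rows,
# stored in (id, link) insertion order, 50 rows per block.
ROWS_PER_BLOCK = 50


def blocks_touched(rows, column, value, rows_per_block=ROWS_PER_BLOCK):
    """Return the sorted 1-based numbers of blocks holding a matching row."""
    return sorted({
        position // rows_per_block + 1
        for position, row in enumerate(rows)
        if row[column] == value
    })


# Rows in insertion order: (1, 1), (1, 2), ..., (1, 100), (2, 1), ...
edges = [(node_id, link) for node_id in range(1, 101) for link in range(1, 101)]

id_blocks = blocks_touched(edges, 0, 1)    # analogue of WHERE id = 1
link_blocks = blocks_touched(edges, 1, 1)  # analogue of WHERE link = 1

print(len(id_blocks), id_blocks)           # 2 [1, 2]
print(len(link_blocks), link_blocks[:3])   # 100 [1, 3, 5]
```

Both lookups return 100 rows, but the link lookup touches one block per id because each id's rows start on a fresh pair of blocks.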

habe 2009-12-01 05:46:15

Answer 2

+1 A:

Your issue seems to be disk I/O related. Postgres has to read the tuples matched by the index in order to see whether or not each row is visible (this cannot be done from the index alone, as it doesn't contain the necessary information).

VACUUM ANALYZE will help if you have lots of deleted and/or updated rows (plain ANALYZE only refreshes the planner statistics and does not reclaim dead rows). Run it first and see if you get any improvement.

CLUSTER might also help. Based on your examples, I'd use link_idx as the cluster key: "CLUSTER edges USING link_idx". It might degrade the performance of your id queries, though (they might be quick only because the rows are already sorted by id on disk). Remember to run ANALYZE after CLUSTER.

Next steps include fine-tuning memory parameters, adding more memory, or adding a faster disk subsystem.

2009-12-01 10:25:19

Answer 3

+1 A:

I think habe is right.

You can check this by running "cluster link_idx on edges; analyze edges;" after filling the table. Now the second query should be fast, and the first should be slow.

To make both queries fast you'll have to denormalize by using a second table, as you have proposed. Just remember to cluster and analyze this second table after loading your data, so that all edges linking to a node are physically grouped.

If you will not run such queries all the time and do not want to store and back up this second table, you can create it temporarily before querying:

create temporary table edges_backwards
  as select link, id from edges order by link, id;
create index edges_backwards_link_idx on edges_backwards(link);

You do not have to cluster this temporary table, as it will be physically ordered right on creation. This is not worth it for a single query, but it can help for several queries in a row.
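To see why a copy ordered by link helps, and why the original table is still needed for id lookups, here is a rough toy model. It assumes the same hypothetical layout as habe's example (100 ids with 100 links each, 50 rows per block); the function name count_blocks is illustrative, and sorting a Python list merely stands in for the ORDER BY in the CREATE TABLE statement.

```python
# Toy model: compare how many blocks each lookup touches in the original
# table (stored in id order) and in a copy stored in link order.
ROWS_PER_BLOCK = 50


def count_blocks(rows, column, value):
    """Count the distinct blocks that hold at least one matching row."""
    return len({
        position // ROWS_PER_BLOCK
        for position, row in enumerate(rows)
        if row[column] == value
    })


# Original table: rows stored as (id, link), in id order.
edges = [(node_id, link) for node_id in range(1, 101) for link in range(1, 101)]

# Stand-in for: select link, id from edges order by link, id
edges_backwards = sorted((link, node_id) for node_id, link in edges)

forward_by_id = count_blocks(edges, 0, 1)               # id = 1 on edges
forward_by_link = count_blocks(edges, 1, 1)             # link = 1 on edges
backward_by_link = count_blocks(edges_backwards, 0, 1)  # link = 1 on the copy
backward_by_id = count_blocks(edges_backwards, 1, 1)    # id = 1 on the copy

print(forward_by_id, forward_by_link)    # 2 100
print(backward_by_link, backward_by_id)  # 2 100
```

Each physical ordering favors exactly one of the two lookups, which is why keeping both tables makes both queries fast in this model.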

Tometzky 2009-12-01 13:20:30

`CLUSTER` took too long on my table, so I solved the problem by creating an additional table, analogous to your suggestion: `CREATE TABLE edges2 AS SELECT id,link FROM edges ORDER BY link; CREATE INDEX link_idx on edges2(link);` A query like `SELECT id FROM edges2 WHERE link=4620;` now takes only a few hundred ms. Thank you!

asmaier 2009-12-22 10:02:16

Answer 4

A:

If you need good performance and can do without foreign key constraints (or use triggers to implement them manually), try the intarray and intagg extension modules. Instead of the edges table, have an outedges integer[] column on the nodes table. This will add about 140 MB to the table, so the whole thing will probably still fit into memory. For reverse lookups, either create a GIN index on the outedges column (for an additional 280 MB), or just add an inedges column.

PostgreSQL has pretty high per-row overhead, so the naive edges table will take about 1 GB for the table alone, plus another 1.5 GB for the indices. Given your dataset size, you have a good chance of keeping most of it in cache if you use integer arrays to store the relations. This will make any lookups blazingly fast. I see lookup times of around 0.08 ms to get the edges in either direction for a given node. Even if it doesn't all fit in memory, you'll still have a larger fraction in memory and much better cache locality.
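As an in-memory analogy of the array layout (not PostgreSQL code), the sketch below builds one list of outgoing links and one list of incoming links per node from a flat edge list. The function name build_adjacency and the sample edges are made up for illustration; the two dictionaries loosely correspond to the outedges and inedges columns.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Group a flat (id, link) edge list into per-node arrays, both directions."""
    outedges = defaultdict(list)  # node -> nodes it links to
    inedges = defaultdict(list)   # node -> nodes linking to it
    for node_id, link in edges:
        outedges[node_id].append(link)
        inedges[link].append(node_id)
    return dict(outedges), dict(inedges)


# Hypothetical sample data: four edges between three nodes.
sample_edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
outedges, inedges = build_adjacency(sample_edges)

print(outedges[1])  # [2, 3]
print(inedges[3])   # [1, 2]
```

All edges of a node sit together in one array, so a lookup in either direction reads a single entry instead of many scattered rows.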

Ants Aasma 2009-12-07 19:35:49

Answer 5

A:

Hi there, have you tried doing this in Neo4j (www.neo4j.org)? This is almost trivial in a graph database and should give you performance in the millisecond range for your use case.

2009-12-15 18:50:30
