views: 355
answers: 3
Hi again,

I am currently working with a large Wikipedia-dump-derived PostgreSQL database; it contains about 40 GB of data. The database runs on an HP ProLiant ML370 G5 server under SUSE Linux Enterprise Server 10; I query it from my laptop over a private network managed by a simple D-Link router. I assigned static DHCP (private) IPs to both the laptop and the server.

Anyway, from my laptop, using pgAdmin III, I send off some SQL commands/queries, such as CREATE INDEX, DROP INDEX, DELETE, SELECT, etc. Sometimes I send a command (like CREATE INDEX) and it returns, telling me that the query executed successfully. However, the postmaster process assigned to that command seems to remain sleeping on the server. I do not really mind this in itself, since I assume PostgreSQL maintains a pool of postmasters ready to process queries. Yet when such a process eats up 6 GB of its 9.4 GB of assigned RAM, I worry (and it does so at the moment). Maybe this is a cache of data kept in [shared] memory in case another query happens to need that same data, but I do not know.
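One way to see whether those lingering server processes are merely idle connections is the pg_stat_activity view; a minimal sketch (note that the column names below are those of recent PostgreSQL releases; older versions used procpid and current_query instead):

SELECT pid, state, query
FROM pg_stat_activity;   -- one row per connected backend process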

Another thing is bothering me.

I have two tables. One is the page table; I have an index on its *page_id* column. The other is the pagelinks table, whose *pl_from* column references either nothing or a value in the *page.page_id* column; unlike *page_id*, *pl_from* has no index (yet). To give you an idea of the scale of the tables and the necessity of finding a viable solution: the page table has 13.4 million rows (after I deleted those I do not need), while the pagelinks table has 293 million.
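For context, a minimal sketch of the relevant schema; the integer types and the primary-key form of the index are my assumptions, and the real dump tables carry more columns:

CREATE TABLE page (
    page_id integer PRIMARY KEY   -- indexed (here via the primary key)
    -- ...other dump columns omitted
);

CREATE TABLE pagelinks (
    pl_from integer               -- loosely references page.page_id; no index yet
    -- ...other dump columns omitted
);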

I need to execute the following command to clean the pagelinks table of some of its useless rows:

DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);

So basically, I wish to rid the pagelinks table of all links coming from a page that is not in the page table. Even after disabling nested loops and/or sequential scans (via the standard planner toggles shown after the plan below), the query optimizer always gives me the following "solution":

Nested Loop  (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
  Join Filter: ("outer".pl_from <> "inner".page_id)"
  ->  Seq Scan on pagelinks  (cost=0.00..5889791.00 rows=293392800 width=17)
  ->  Materialize  (cost=494640.60..708341.51 rows=13474691 width=11)
        ->  Seq Scan on page  (cost=0.00..402211.91 rows=13474691 width=11)
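For reference, the toggles I mean are the standard planner settings below; they merely penalize a plan type rather than forbid it outright, which is presumably why the nested loop keeps coming back:

SET enable_nestloop = off;   -- discourage nested-loop joins
SET enable_seqscan = off;    -- discourage sequential scans
EXPLAIN DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);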

It seems that such a task would take weeks, if not longer, to complete; obviously, this is unacceptable. It seems to me that I would much rather have it use the *page_id* index to do its thing... but it is a stubborn optimizer and I might be wrong.

Any idea?

+1  A: 

To your second question: you could try creating a new table with just the records you need, using a CREATE TABLE AS statement. If the new table is sufficiently small, it might be faster, but it might not help either.
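A rough sketch of that approach, with a purely illustrative table name:

CREATE TABLE pagelinks_keep AS
SELECT pl.*
FROM pagelinks pl
WHERE pl.pl_from IN (SELECT page_id FROM page);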

alex
Actually, this is what I am trying, for it seems like my best shot. I will post the results if it works. Thx!
Nicholas Leonard
+1  A: 

Hi again,

Indeed, I decided to CREATE a temporary table to speed up query execution:

CREATE TABLE temp_to_delete AS(
    (SELECT DISTINCT pl_from FROM pagelinks) 
     EXCEPT 
    (SELECT page_id FROM page));
DELETE FROM pagelinks USING temp_to_delete 
    WHERE pagelinks.pl_from IN (temp_to_delete.pl_from);
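In hindsight, it probably would not have hurt to refresh the planner's statistics on the scratch table before the DELETE; this is an afterthought, not something I actually ran:

ANALYZE temp_to_delete;   -- give the planner row-count statistics for the new table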

Surprisingly, this query completed in about 4 hours, while the initial query had remained active for about 14 hours before I decided to kill it. More specifically, the DELETE returned:

Query returned successfully: 31340904 rows affected, 4415166 ms execution time.

As for the first part of my question, it seems that the postmaster process indeed keeps some information in a cache; when another query needs data that is not in the cache and memory (RAM) runs short, the cache is emptied. And the postmasters are indeed but a pool of processes.

It has also occurred to me that gnome-system-monitor is misleading: it gives incomplete information and is of little informational value. It is mostly due to this application that I have been so confused lately; for example, it does not account for the memory usage of other users (like the postgres user!) and even tells me that I have 12 GB of RAM free, which is simply untrue. Hence, I tried out a couple of system monitors, since I like to know how PostgreSQL is using its resources, and xosview seems to be a valid tool.

Hope this helps!

Nicholas Leonard
A: 

Your postmaster process will stay there as long as the connection to the client is open. Does pgAdmin close the connection? I don't know.

The memory used could be shared_buffers (check your config settings), or not.
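A quick, standard way to check from psql or pgAdmin:

SHOW shared_buffers;   -- size of the shared page cache
SHOW work_mem;         -- per-operation memory for sorts and hashes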

Now, the query. For big maintenance operations like this, feel free to set work_mem to something large, like a few GB. It looks like you have lots of RAM, so use it.

SET work_mem TO '4GB';
EXPLAIN DELETE FROM pagelinks WHERE pl_from NOT IN (SELECT page_id FROM page);

It should seq scan page, hash it, then seq scan pagelinks, probing the hash to check for page_ids. It should be quite fast (much faster than 4 hours!), but you need a large work_mem for the hash.

But since you are deleting a significant portion of your table, it might be faster to do it like this:

CREATE TABLE pagelinks2 AS SELECT a.* FROM pagelinks a JOIN page b ON a.pl_from = b.page_id;

(this keeps only the rows you want, using a plain JOIN instead of an IN/NOT IN test)

You can also add an ORDER BY on this query, and your new table will be nicely ordered on disk for optimal access later.
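For instance, something like the following; the ORDER BY column is only a guess at your later access pattern:

CREATE TABLE pagelinks2 AS
SELECT a.*
FROM pagelinks a
JOIN page b ON a.pl_from = b.page_id
ORDER BY a.pl_from;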

peufeu