Hi again,
I just finished loading as much of the English Wikipedia's link-structure data as I could. Basically, I downloaded a bunch of SQL dumps from Wikipedia's latest-dump repository. Since I am using PostgreSQL instead of MySQL, I loaded all of these dumps into my db with pipelined shell commands.
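For the record, each dump went in through a pipeline roughly like this (the file, host, and database names are just placeholders, and the real cleanup of the MySQL-isms was messier than this one-liner suggests):

# rough shape only -- the actual filtering of the MySQL syntax was messier:
zcat enwiki-latest-pagelinks.sql.gz | grep '^INSERT INTO' | psql -h dbserver -d wikidb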
Anyway, one of these tables, pagelinks, has 295 million rows; it contains all the intra-wiki hyperlinks. From my laptop, using pgAdmin III, I sent the following command to my database server (another machine):
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY pl_namespace;
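In case it helps, I imagine the plan for it can be checked with a plain EXPLAIN; this is just a sketch, I have not captured the actual output:

-- show the plan without running the query; on a table this size I would
-- expect a sequential scan feeding a sort or hash aggregate
EXPLAIN SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY pl_namespace;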
It's been at it for an hour or so now. The thing is that the postmaster seems to be eating up more and more of my very limited HD space; I think it has used about 20 GB so far. I had previously played around with the postgresql.conf file to give it more performance headroom (i.e. let it use more resources), since the server has 12 GB of RAM. I basically quadrupled most of the memory-related settings in that file, thinking it would use more RAM to do its thing.
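For reference, these are the kinds of settings I touched; the live values can be read back like this (my exact numbers are from memory, so take them with a grain of salt):

SHOW shared_buffers;  -- shared page cache; probably the 1.6 GB I mention below
SHOW work_mem;        -- per-sort/per-hash memory; I gather a GROUP BY that
                      -- outgrows this spills to temporary files on disk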
However, the db does not seem to use much RAM. Using the Linux system monitor, I can see that the postmaster is using 1.6 GB of shared memory (RAM). Anyway, I was wondering if you guys could help me understand what it is doing, because it seems that I really do not understand how PostgreSQL uses HD resources.
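If it is useful, I can also watch the backend from a second session with something like this (I am selecting everything, since the exact columns seem to vary between versions):

SELECT * FROM pg_stat_activity;  -- running backends and their current queries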
Concerning the metastructure of the Wikipedia databases, they provide a good schema that may be of use, or even of interest, to you.
Feel free to ask me for more details, thx.