tags:

views:

184

answers:

2

I have two tables:

CREATE TABLE pagelets ( 
    `netloc` INT(32) unsigned NOT NULL,
    `page_key` BIGINT(64) unsigned NOT NULL,
    `pagelet_serial` BIGINT(64) unsigned NOT NULL,      --that this is unique
    `pagelet_shingle` BIGINT(64) unsigned NOT NULL,
    `delete_me` BOOLEAN NOT NULL,
    KEY ( `pagelet_shingle` )
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;

CREATE TABLE links ( 
    `netloc` INT(32) unsigned NOT NULL,
    `page_key` BIGINT(64) unsigned NOT NULL,
    `pagelet_serial` BIGINT(64) unsigned NOT NULL,  
    `url` BIGINT(64) unsigned NOT NULL,
    `delete_me` BOOLEAN NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;

And the way I use these tables are the following:

1) SELECT DISTINCT A.* FROM pagelets A JOIN ( 
    SELECT pagelet_shingle FROM pagelets WHERE netloc=Y
    GROUP BY pagelet_shingle HAVING COUNT(DISTINCT page_key) > 1) 
    AS B USING(pagelet_shingle) ORDER BY A.pagelet_shingle;

2) SELECT COUNT(*) FROM pagelets WHERE pagelet_serial=Y

3) SELECT url FROM links  
    WHERE page_key=X AND pagelet_serial NOT IN  
    (Y, Y, Y, Y, Y ) ORDER BY pagelet_serial

4) SELECT COUNT(*) FROM links WHERE netloc=X AND url=Y

As I am not a DB expert, I stumble upon the following dilemma:

1) For both of my tables: the columns are almost identical. The general relationships is "for each pagelet there are a couple of links". Hence I decided to take them links out. However, I find that in my subsequent queries, I slowly added the columns in for convenience, which explain why they look so similar.

2) Is this bad practice? If so, how should I have done better? How would the queries have changed? Will this affect performance vastly?

3) For query number 1, I did an explain.

+----+-------------+----------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows | Extra                                               |
+----+-------------+----------+------+---------------+------+---------+------+------+-----------------------------------------------------+
|  1 | PRIMARY     | NULL     | NULL | NULL          | NULL | NULL    | NULL | NULL | Impossible WHERE noticed after reading const tables | 
|  2 | DERIVED     | pagelets | ALL  | NULL          | NULL | NULL    | NULL |  210 | Using filesort                                      | 
+----+-------------+----------+------+---------------+------+---------+------+------+-----------------------------------------------------+

How is the where impossible? The where is absolutely essential.

4) Currently how I operate is by having a particular web page inserted into the DB with delete_me = 1. After every operation on each web page, I choose to delete those with delete_me = 1 and page_key = X. I know adding and deleting is troublesome, does anyone have a better method (I have a entries that cannot be modified - I want them to act as the backbone of my calculations, hence for every new page a n number of past pages (fixed) will be used for these calculations and then after each process the page is removed)

Thank you all

A: 

The WHERE is most likely impossible because X and Y aren't defined anywhere, aren't quoted, and even if they were quoted, they aren't numbers (netloc is an int(32)). However, maybe you were masking the real values being passed there?

Also, you should take a moment to create primary keys (and indexes) for your tables for performance reasons.

I'm guessing that the pagelet table should use pagelet_serial as its primary key. I'm not sure the links table has any columns (yet) that could be used as a PK.

R. Bemrose
dude X and Y are variables. They are numbers. Are you serious?
Really? I can't tell that. Nor can I tell if you're passing decimals where it's expecting ints, because you didn't provide that information either.
R. Bemrose
A: 

Regarding Query 1:

I think the explain is trying to tell you that

HAVING COUNT(DISTINCT page_key) > 1

is referring to a column that is not in the local select, that is you should add

COUNT(DISTINCT page_key) into the corresponding select

Steve De Caux