I have two tables:
CREATE TABLE pagelets (
`netloc` INT(32) unsigned NOT NULL,
`page_key` BIGINT(64) unsigned NOT NULL,
`pagelet_serial` BIGINT(64) unsigned NOT NULL, -- note that this is unique
`pagelet_shingle` BIGINT(64) unsigned NOT NULL,
`delete_me` BOOLEAN NOT NULL,
KEY ( `pagelet_shingle` )
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
CREATE TABLE links (
`netloc` INT(32) unsigned NOT NULL,
`page_key` BIGINT(64) unsigned NOT NULL,
`pagelet_serial` BIGINT(64) unsigned NOT NULL,
`url` BIGINT(64) unsigned NOT NULL,
`delete_me` BOOLEAN NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
And the way I use these tables is the following:
1) SELECT DISTINCT A.* FROM pagelets A JOIN (
SELECT pagelet_shingle FROM pagelets WHERE netloc=Y
GROUP BY pagelet_shingle HAVING COUNT(DISTINCT page_key) > 1)
AS B USING(pagelet_shingle) ORDER BY A.pagelet_shingle;
2) SELECT COUNT(*) FROM pagelets WHERE pagelet_serial=Y
3) SELECT url FROM links
WHERE page_key=X AND pagelet_serial NOT IN
(Y, Y, Y, Y, Y ) ORDER BY pagelet_serial
4) SELECT COUNT(*) FROM links WHERE netloc=X AND url=Y
As I am not a DB expert, I have stumbled upon the following dilemmas:
1) For both of my tables the columns are almost identical. The general relationship is "for each pagelet there are a couple of links", hence I decided to split the links out into their own table. However, in my subsequent queries I slowly added the columns back in for convenience, which explains why the two tables look so similar.
2) Is this bad practice? If so, how could I have done it better, and how would the queries have changed? Will this affect performance significantly? (I have sketched the kind of layout I have in mind below, after point 4.)
3) For query number 1, I ran an EXPLAIN:
+----+-------------+----------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | pagelets | ALL | NULL | NULL | NULL | NULL | 210 | Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+------+-----------------------------------------------------+
How can the WHERE be impossible? The WHERE clause is absolutely essential to the query.
4) Currently I operate by inserting each new web page into the DB with delete_me = 1. After every operation on that web page, I delete the rows with delete_me = 1 AND page_key = X (sketched just below). I know all this adding and deleting is troublesome; does anyone have a better method? (I have entries that cannot be modified - I want them to act as the backbone of my calculations. For every new page, a fixed number n of past pages is used for these calculations, and then after each process the new page is removed.)
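To make that concrete, the per-page cycle looks roughly like this (all the numbers are made up, they only show the shape of the statements):
INSERT INTO pagelets (netloc, page_key, pagelet_serial, pagelet_shingle, delete_me)
VALUES (1001, 42, 9001, 555, 1); -- made-up values
INSERT INTO links (netloc, page_key, pagelet_serial, url, delete_me)
VALUES (1001, 42, 9001, 777, 1); -- made-up values
-- queries 1) to 4) run here against the old pages plus the new one
DELETE FROM pagelets WHERE delete_me = 1 AND page_key = 42;
DELETE FROM links WHERE delete_me = 1 AND page_key = 42;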
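And for question 2, this is roughly the kind of split I imagine when I ask how I could have done it better - only a sketch, assuming pagelet_serial really is unique per pagelet and every link belongs to exactly one pagelet:
-- links would keep only its own data; netloc, page_key and delete_me
-- would live in pagelets alone
CREATE TABLE links (
`pagelet_serial` BIGINT(64) unsigned NOT NULL,
`url` BIGINT(64) unsigned NOT NULL,
KEY ( `pagelet_serial` )
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
-- query 4) would then need a join instead of reading links alone
SELECT COUNT(*) FROM links A
JOIN pagelets B USING ( pagelet_serial )
WHERE B.netloc=X AND A.url=Y;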
Thank you all