views:

108

answers:

3

I need some performance improvement guidance, my query takes several seconds to run and this is causing problems on the server. This query runs on the most common page on my site. I think a radical rethink may be required.

~ EDIT ~ This query produces a list of records whose keywords match those of the program (record) being queried. My site is a software download directory. And this list is used on the program listing page to show other similar programs. PadID is the primary key of the program records in my database.

~ EDIT ~

Heres my query

 select match_keywords.PadID, count(match_keywords.Word) as matching_words 
 from keywords current_program_keywords 
 inner join keywords match_keywords on
       match_keywords.Word=current_program_keywords.Word 
 where match_keywords.Word IS NOT NULL 
 and current_program_keywords.PadID=44243 
 group by match_keywords.PadID 
 order by matching_words DESC 
 LIMIT 0,11;

Heres the query explained. alt text

Heres some sample data, however I doubt you'd be able to see the effects of any performance tweaks without more data, which I can provide if you'd like.

 CREATE TABLE IF NOT EXISTS `keywords` (
   `Word` varchar(20) NOT NULL,
   `PadID` bigint(20) NOT NULL,
   `LetterIdx` varchar(1) NOT NULL,
   KEY `Word` (`Word`),
   KEY `LetterIdx` (`LetterIdx`),
   KEY `PadID_2` (`PadID`,`Word`)
 ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 INSERT INTO `keywords` (`Word`, `PadID`, `LetterIdx`) VALUES
 ('tv', 44243, 'T'),
 ('satellite tv', 44243, 'S'),
 ('satellite tv to pc', 44243, 'S'),
 ('satellite', 44243, 'S'),
 ('your', 44243, 'X'),
 ('computer', 44243, 'C'),
 ('pc', 44243, 'P'),
 ('soccer on your pc', 44243, 'S'),
 ('sports on your pc', 44243, 'S'),
 ('television', 44243, 'T');

I've tried adding an index, but this doesn't make much difference.

 ALTER TABLE `keywords` ADD INDEX ( `PadID` ) 
+1  A: 

Try this approach, not sure if it will help but at least is different:

select PadID, count(Word) as matching_words
from keywords k
where Word in (
  select Word 
  from keywords
  where PadID=44243 )
group by PadID 
order by matching_words DESC 
LIMIT 0,11

Anyway the job you want to get done is heavy, and full of string comparison, maybe exporting keywords and storing only numeric ids in the keyword table can reduce the times.

frisco
Wow thats tonnes slower took 11 seconds, thanks for trying though
Jules
I'm not really sure what else I could store as a numeric id ?
Jules
Lol, at least I tried, what I was talking about is making a Keywords table with only keyword and an Keyword_Id then the current Keyword table renamed to hasKeywords or something like that that has the keyword id and the program id, that will reduce database size (as keyword strings are not replicated) and also help with the string comparison. Anyway if you want send a bigger set of data to frisco82 @ gmail.com and I will give another try.
frisco
+1  A: 

You might find this helpful if I understood you correctly. The solution takes advantage of innodb's clustered primary key indexes (http://pastie.org/1195127)

EDIT: here's some links that may prove of interest:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html

drop table if exists programmes;
create table programmes
(
prog_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;

insert into programmes (name) values 
('prog1'),('prog2'),('prog3'),('prog4'),('prog5'),('prog6');


drop table if exists keywords;
create table keywords
(
keyword_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;

insert into keywords (name) values 
('tv'),('satellite tv'),('satellite tv to pc'),('pc'),('computer');


drop table if exists programme_keywords;
create table programme_keywords
(
keyword_id mediumint unsigned not null,
prog_id mediumint unsigned not null,
primary key (keyword_id, prog_id), -- note clustered composite primary key
key (prog_id)
)
engine=innodb;

insert into programme_keywords values 

-- keyword 1
(1,1),(1,5),

-- keyword 2
(2,2),(2,4),

-- keyword 3
(3,1),(3,2),(3,5),(3,6),

-- keyword 4
(4,2),

-- keyword 5
(5,2),(5,3),(5,4);

/*
efficiently list all other programmes whose keywords match that of the 
programme currently being queried (for instance prog_id = 1) 
*/


drop procedure if exists list_matching_programmes;

delimiter #

create procedure list_matching_programmes
(
in p_prog_id mediumint unsigned
)
proc_main:begin

select
 p.*
from
 programmes p
inner join
(
 select distinct -- other programmes with same keywords as current
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from programme_keywords where prog_id = p_prog_id
 ) current_programme -- the current program keywords
 on pk.keyword_id = current_programme.keyword_id
 inner join programmes p on pk.prog_id = p.prog_id 

) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

end proc_main #


delimiter ;

call list_matching_programmes(1);
call list_matching_programmes(6); 


explain
select
 p.*
from
 programmes p
inner join
(
 select distinct
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from programme_keywords where prog_id = 1
 ) current_programme
 on pk.keyword_id = current_programme.keyword_id
 inner join programmes p on pk.prog_id = p.prog_id 

) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

EDIT: added char_idx functionality as requested

alter table keywords add column char_idx char(1) null after name;

update keywords set char_idx = upper(substring(name,1,1));

select * from keywords;

explain
select
 p.*
from
 programmes p
inner join
(
 select distinct
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from keywords where char_idx = 'P' -- just change the driver query
 ) keywords_starting_with
 on pk.keyword_id = keywords_starting_with.keyword_id
) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;
f00
Wow this is a bit more drastic than I wanted.
Jules
Nothing drastic about it - the important part is the numeric based clustered primary key index on programme_keywords (keyword_id, prog_id). You could always create a secondary composite index on your own table if desired but MyIsam is not as performant especially under load as innodb - read up on it :)
f00
I think my database is InnoDb, its at the bottom of the list of table in phpmyadmin. Do I have to change my programmes table to InnoDB ? Also when would the procedure be used, once a day or eveytime I need to create the list of similar programs. Thanks
Jules
Currently padid in my tables is `PadID` bigint(20) NOT NULL auto_increment, will medium int make a difference ?
Jules
Also, this will leave me with some other fixes to make, which rely on the keywords table... select padid from pads WHERE keywords like '%$search%' AND RemovemeDate = '2001-01-01 00:00:00' ORDER BY VersionAddDate DESC; --- ----SELECT *, count( Word ) AS WordCount FROM `keywords` WHERE `LetterIdx` = '" . $LetterCat . "' GROUP BY Word ORDER BY Word; --- What will I need to do here ?
Jules
Actually the keywords in the pads table is a duplicate, I could do with moving away from that field, to your new layout.
Jules
you can use whatever unsigned integer datatype you require - the mediumint unsigned is probably too small for a keywords primary key but you know your data better than I do so choose wisely as the smaller the datatype the more key data you can stuff into the buffers and subsequently less disk reads are required. The table definition in your example uses the MyIsam engine and as such doesnt support clustered primary key indexes.
f00
OK :) what about my LetterIdx and searches by keyword linking to the pads aka programme table ? Think this will sort it, Thanks
Jules
Your letter_idx field can be moved to the keywords table as a char(1) I think. You'd then select keywords with letter_id = 'S' and join back to the programme_keywords table on keyword_id
f00
Can you give me an example of this, so I get the most efficient join, just the select query ?
Jules
sure - see edits in answer above.
f00
+1  A: 

Ok after reviewing you database I think there is not a lot of room to improve in the query, in fact on my test server with index on Word it only takes about 0.15s to complete, without the index it is almost 4x times slower.

Anyway I think that implementing the change in database sctructure f00 and I have told you it will improve the response time.

Also drop the index PadID_2 as it is now it is futile and it will only slow your writes. What you should do but it requise to clean the database is to avoid duplicate keyword-prodId pair first removing al duplicate ones currently in DB (around 90k in my test with 3/4 of your DB) that will reduce query time and give meaningfull results. If you ask for a progId that has the keyword ABC that is duplicated for progdID2 then progID2 will be on top o other progIDs with the same ABC keyword but not duplicated, on my tests I have seen a progID that get several more matches that the same progID I am querying. After dropping duplicates from the DB you will need to change your application to avoid this problem again in the future and just for being safe you could add a primary key (or index with unique activated) to Word + ProgID.

frisco