views:

133

answers:

3

Hi Guys,

I am currently developing a an application to allow users to search through a database of documents using various paramaters and returning a set of paged results. I am building it in PHP/MySQL, which is not my usual development platform, but its been grand so far.

The problem I am having is that in order to return a full set of results I have to use LEFT JOIN on every table, which completely destroys my performance. The person who developed the database has said that the query I am using will return the correct results, so thats what I have to use. The query is below, I am by no means an SQL Guru and could use some help on this.

I have been thinking that it might be better to split the query into sub-queries? Below is my current query:

    SELECT d.title, d.deposition_id, d.folio_start, d.folio_end, pl.place_id, p.surname, p.forename, p.person_type_id, pt.person_type_desc, p.age, d.manuscript_number, dt.day, dt.month, dt.year, plc.county_id, c.county_desc
 FROM deposition d 
 LEFT JOIN person AS p ON p.deposition_id = d.deposition_id 
 LEFT JOIN person_type AS pt ON p.person_type_id = pt.person_type_id 
 LEFT JOIN place_link AS pl ON pl.deposition_id = d.deposition_id 
 LEFT JOIN date AS dt ON dt.deposition_id = d.deposition_id 
 LEFT JOIN place AS plc ON pl.place_id = plc.place_id 
 LEFT JOIN county AS c ON plc.county_id = c.county_id
 WHERE 1 AND d.manuscript_number = '840' 
 GROUP BY d.deposition_id ORDER BY d.folio_start ASC
 LIMIT 0, 20

Any help or guidance would be greatly appreciated!

Deposition Table:

CREATE TABLE IF NOT EXISTS `deposition` (
  `deposition_id` varchar(11) NOT NULL default '',
  `manuscript_number` int(10) NOT NULL default '0',
  `folio_start` varchar(4) NOT NULL default '0',
  `folio_end` varchar(4) default '0',
  `page` int(4) default NULL,
  `deposition_type_id` int(10) NOT NULL default '0',
  `comments` varchar(255) default '',
  `title` varchar(255) default NULL,
  PRIMARY KEY  (`deposition_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Date Table

CREATE TABLE IF NOT EXISTS `date` (
  `deposition_id` varchar(11) NOT NULL default '',
  `day` int(2) default NULL,
  `month` int(2) default NULL,
  `year` int(4) default NULL,
  PRIMARY KEY  (`deposition_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Person_Type

CREATE TABLE IF NOT EXISTS `person_type` (
  `person_type_id` int(10) NOT NULL auto_increment,
  `person_type_desc` varchar(255) NOT NULL default '',
  PRIMARY KEY  (`person_type_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=59 ;
+2  A: 

Seems that you want to select one person, place etc. per deposition.

The query you wrote will return you this, but it's not guaranteed which one will it return, and the query is inefficient.

Try this:

SELECT  d.title, d.deposition_id, d.folio_start, d.folio_end, pl.place_id, p.surname, p.forename, p.person_type_id, pt.person_type_desc, p.age, d.manuscript_number, dt.day, dt.month, dt.year, plc.county_id, c.county_desc
FROM    deposition d
LEFT JOIN
        person p
ON      p.id = 
        (
        SELECT  id
        FROM    person pi
        WHERE   pi.deposition_id = d.deposition_id
        ORDER BY
                pi.deposition_id, pi.id
        LIMIT 1
        )
LEFT JOIN
        place_link AS pl
ON      pl.id = 
        (
        SELECT  id
        FROM    place_link AS pli
        WHERE   pli.deposition_id = d.deposition_id
        ORDER BY
                pli.deposition_id, pi.id
        LIMIT 1
        )
LEFT JOIN
        date AS dt
ON      dt.id = 
        (
        SELECT  id
        FROM    date AS dti
        WHERE   dti.deposition_id = d.deposition_id
        ORDER BY
                dti.deposition_id, pi.id
        LIMIT 1
        )
LEFT JOIN
        place AS plc
ON      plc.place_id = pl.place_id 
LEFT JOIN
        county AS c
ON      c.county_id = plc.county_id
WHERE   d.manuscript_number = '840' 
ORDER BY
        d.manuscript_number, d.folio_start
LIMIT   20

Create an index on deposition (manuscript_number, folio_start) for this to work fast

Also create a composite index on (deposition_id, id) on person, place_link and date.

Quassnoi
You will have to excuse my ignorance but I am assuming that this will still allow me to return results that do not have a person assocuiated with a deposition, etc?
TGuimond
@TGuimond: sure. BTW, I corrected the query, there was a mistake in it.
Quassnoi
@Quassnio: I posted a few of the other tables schema's in question, I tried your suggestion but had to update the query a bit and I crashed my dev server! Definitely a mistake on my part, but I think the adding of indexes has sorted it out. Thanks for the help!
TGuimond
+1  A: 

You only need a LEFT JOIN if the joined table might not have a matching value. Is it possible in your database schema for a person to not have a matching person_type? Or deposition to not have a matching row in date? A place not have a matching county?

For any of those relationships that must exist for the result to make sense you can change the LEFT JOIN to an INNER JOIN.

These columns should have indexes (unique if possible):

person.deposition_id
date.deposition_id
place_link.deposition_id
place_link.place_id

The date table looks like a bad design; I can't think of a reason to have a table of dates instead of just putting a column of type date (or datetime) in the deposition table. And date is a terrible name for a table because it's a SQL reserved word.

gregjor
I had started out with Inner Joins on the tables you mentioned above but it turns out that each of these might have no matching value for the deposition, which is why I have ended up with all LEFT JOIN. The database was provided to me and there's nothing I can do about the Schema, unfortunately!
TGuimond
There's no reason for LEFT JOIN to be significantly slower than an INNER JOIN, so the problem is more likely lack of indexes on the joined columns. Can you add indexes if they aren't there? I understand working with a legacy database -- I specialize in fixing/enhancing broken applications and I see all kinds of database schemas that make my head spin.
gregjor
@gregjor: in `MySQL`, a table that is nullable in a `LEFT JOIN` is always driven in a nested loop, so yes, there is a reason. However, the op mentioned he needs a `LEFT JOIN`.
Quassnoi
@quassnoi: in MySQL all joins are driven in nested loops, taking into account how the join is accomplished (ref, range, scan). In this query the WHERE clause only refers to columns in the deposition table; none of the joined tables are qualified by any conditions except for the JOIN. The big question is are the join columns indexed (ref join) or not (ALL join).
gregjor
@gregjor: I mean "driven" as opposed to "leading". The left table will be scanned first (in the outer loop) and a search will be performed over the right table (in the inner loop). With a plain join, the order can be swapped (as chosen by the optimizer).
Quassnoi
@quassnoi: Yes I understand that, but in this case I don't think any of the joins can be flipped. It looks like each left join will match either no row or one row, but I'm not sure about the join on person. My point is that if the join columns aren't indexed every left join requires a table scan when it could be indexed and turned into a much faster ref lookup.
gregjor
+1  A: 

The poor performance is almost certainly from lack of indexes. Your deposition table doesn't have any indexes, and that probably means the other tables you're referencing don't have any either. You can start by adding an index to your deposition table. From the MySQL shell, or phpMyAdmin, issue the following query.

ALTER TABLE deposition ADD INDEX(deposition_id, manuscript_number);

You know you're on the right track if the query executes faster after adding the index. From there you might want to put indexes on the other tables on the referenced columns. For instance for this part of your query "LEFT JOIN person AS p ON p.deposition_id = d.deposition_id", you could try adding an index to the person table using.

ALTER TABLE person ADD INDEX(deposition_id);

mellowsoon
What's purpose of an index with a `PRIMARY KEY` as a first field?
Quassnoi
@mellowsoon: This appears to have helped considerably. I have edited the question to some of the other tables in question, I'm fairly sure the improvement is not due to caching as I have tried a number of different queries and they are running infinitely faster. Would you suggest I add indexes to the rest of the tables also?
TGuimond
Those tables look ok, because the primary keys are the columns being referenced in your query. In general indexes make all the difference. I've queried 20 million rows on a table without proper indexes, and it takes 10 minutes. With proper indexes.. under 3 seconds. :)
mellowsoon