views:

77

answers:

3

I am re-writing the SQL which lets a user search for any other user on our site and also shows their roles.

An an example, roles can be "Writer", "Editor", "Publisher".

Each role links a User to a Publication.

Users can take multiple roles within multiple publications.

Example table setup:

"users" : user_id, firstname, lastname
"publications" : publication_id, name  
"link_writers" : user_id, publication_id  
"link_editors" : user_id, publication_id  

Current psuedo SQL:

SELECT * FROM (
  (SELECT user_id FROM users WHERE firstname LIKE '%Jenkz%') 
  UNION 
  (SELECT user_id FROM users WHERE lastname LIKE '%Jenkz%')
) AS dt
JOIN (ROLES STATEMENT) AS roles ON roles.user_id = dt.user_id

At the moment my roles statement is:

SELECT  dt2.user_id, dt2.publication_id, dt.role FROM (
  (SELECT 'writer' AS role, link_writers.user_id, link_writers.publication_id
  FROM link_writers)
  UNION
  (SELECT 'editor' AS role, link_editors.user_id, link_editors.publication_id
  FROM link_editors)
) AS dt2

The reason for wrapping the roles statement in UNION clauses is that some roles are more complex and require a table join to find the publication_id and user_id.

As an example "publishers" might be linked accross two tables

"link_publishers": user_id, publisher_group_id
"link_publisher_groups": publisher_group_id, publication_id

So in that instance, the query forming part of my UNION would be:

SELECT 'publisher' AS role, link_publishers.user_id, link_publisher_groups.publication_id
FROM link_publishers
JOIN link_publisher_groups ON lpg.group_id = lp.group_id

I'm pretty confident that my table setup is good (I was warned off the one-table-for-all system when researching the layout). My problem is that there are now 100,000 rows in the users table and upto 70,000 rows in each of the link tables.

Initial lookup in the users table is fast, but the joining really slows things down.

How can I only join on the relevant roles?

-------------------------- EDIT ---------------------------------- explain

Explain above (open in a new window to see full resolution).

The bottom bit in red, is the "WHERE firstname LIKE '%Jenkz%'" the third row searches WHERE CONCAT(firstname, ' ', lastname) LIKE '%Jenkz%'. Hence the large row count, but I think this is unavoidable, unless there is a way to put an index accross concatenated fields?

The green bit at the top just shows the total rows scanned from the ROLES STATEMENT.

You can then see each individual UNION clause (#6 - #12) which all show a large number of rows. Some of the indexes are normal, some are unique.

It seems that MySQL isn't optimizing to use the dt.user_id as a comparison for the internal of the UNION statements. Is there any way to force this behaviour?

Please note that my real setup is not publications and writers but "webmasters", "players", "teams" etc.

A: 

My initial idea was to create a temporary table to hold (and index) the user_id that match the name, and use that to join against each link table. Unfortunately, in MySQL a temporary table can only be joined against ONCE in a query.

The nasty workaround is to create a permanent table, adding the connection_id to the primary key so separate sessions do not get confused.

create table tt ( connection_id int not null,
                  user_id int not null, 
                  firstname varchar(10) not null, 
                  lastname varchar(10) not null,
                  primary key( connection_id, user_id ) );

The following sequence would be repeated each time you need an answer:

delete from tt where connection_id = connection_id();

insert into tt 
  SELECT connection_id(), user_id, firstname, lastname FROM users 
  WHERE firstname LIKE '%Jenkz%' 
  UNION 
  SELECT connection_id(), user_id, firstname, lastname FROM users 
  WHERE lastname LIKE '%Jenkz%';

Next, your existing UNION is extended so that only the relevant user_id is pulled out:

SELECT 'writer' AS role, link_writers.user_id, link_writers.publication_id
FROM link_writers
JOIN tt ON tt.connection_id = connection_id() and tt.user_id = link_writers.user_id

UNION

SELECT 'editor' AS role, link_editors.user_id, link_editors.publication_id
FROM link_editors
JOIN tt ON tt.connection_id = connection_id() and tt.user_id = link_editors.user_id

UNION

SELECT 'publisher' AS role, link_publishers.user_id, link_publisher_groups.publication_id
FROM link_publishers
JOIN link_publisher_groups 
   ON link_publisher_groups.publisher_group_id = link_publishers.publisher_group_id
JOIN tt ON tt.connection_id = connection_id() and tt.user_id = link_publishers.user_id

Perhaps this will be an improvement because not every row of all the link tables is being pulled into the union.

The EXPLAIN is a bit peculiar in that only 4 bytes of the index on tt is being used - where I would have expected all 8 bytes. Perhaps this is because I have so little data in tt.

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: tt
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: link_writers
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: test.tt.user_id
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 2
  select_type: UNION
        table: tt
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index
*************************** 4. row ***************************
           id: 2
  select_type: UNION
        table: link_editors
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: test.tt.user_id
         rows: 1
        Extra: Using index
*************************** 5. row ***************************
           id: 3
  select_type: UNION
        table: tt
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index
*************************** 6. row ***************************
           id: 3
  select_type: UNION
        table: link_publishers
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: test.tt.user_id
         rows: 1
        Extra: Using index
*************************** 7. row ***************************
           id: 3
  select_type: UNION
        table: link_publisher_groups
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: test.link_publishers.publisher_group_id
         rows: 2
        Extra: Using index
*************************** 8. row ***************************
           id: NULL
  select_type: UNION RESULT
        table: <union1,2,3>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra:
8 rows in set (0.00 sec)
Martin
This looks like a good "if all else fails" solution Martin, thanks for your work. I've now posted the EXPLAIN from my actual query if this offers any more insight?
Jenkz
A: 

Having checked out OMG Ponies' answer for SO - Use Of Correlated Subquery, I came up with this:

SELECT * FROM (
  (SELECT user_id FROM users WHERE firstname LIKE '%Jenkz%') 
  UNION 
  (SELECT user_id FROM users WHERE lastname LIKE '%Jenkz%')
) AS dt
JOIN ( SELECT 'writer' AS role, link_writers.user_id, link_writers.publication_id
       FROM link_writers
       UNION
       SELECT 'editor' AS role, link_editors.user_id, link_editors.publication_id
       FROM link_editors
       UNION
       SELECT 'publisher' AS role, lp.user_id, lpg.publication_id
       FROM link_publishers lp
       JOIN link_publisher_groups lpg ON lpg.publisher_group_id = lp.publisher_group_id
     ) roles on roles.user_id = dt.user_id

The explain looks reasonable on my tiny data set. What does it look like on the real thing ?

Martin
Hi Martin, this seems to make sense, I'm assuming "correlate one level deep" means it ca ony be wrapped in one derived table? At the moment there has been no increase in performance (the same number of rows, so I am experimenting with forcing indexes as the whole internal table is still scanned. I'll update you with any progress and mark your answer if I don't get any further.... thanks!
Jenkz
Your original ROLES query had two levels, so I was hoping that the removal of the (apparently cosmetic) outer level might have allowed MySQL to push the ON clause into the components of the UNION. I couldn't tell if it worked on my small data set, and I still don't understand the manual clause that implies correlated subqueries might work with JOINS.
Martin
The query refuses to use the column. I've now found this bug on the MySQL site: http://bugs.mysql.com/bug.php?id=9021 Apparently it works in 5.2.....I can't get the simple example from the manual to work either: "SELECT * FROM t1 WHERE column1 = ANY (SELECT column1 FROM t2 WHERE t2.column2 = t1.column2);" loads, but very slowly ignoring the column.
Jenkz
A: 

Yet another approach is to slightly denormalise your design to better support your query.

To do this, create a new table "role":

create table role (
     user_id int not null,
     role enum ('writer', 'editor', 'publisher' ) not null,
     primary key (user_id, role )
);

This would be updated whenever a new row is added to one of your link tables that includes a user_id:

insert ignore into role values( $user_id, $role );

After a while, it is likely that the role entry will already exist, hence the "ignore" modifier.

The table could be boot-strapped from your existing tables:

insert ignore into role select distinct user_id, 'writer' from link_writers;
insert ignore into role select distinct user_id, 'editor' from link_editors;
insert ignore into role select distinct user_id, 'publisher' from link_publishers;

Your search query then becomes a set of simple JOINS which MySQL should have no problem optimising:

SELECT 
   r.user_id, 
   r.role,
   case r.role 
        when 'writer' then w.publication_id
        when 'editor' then e.publication_id
        when 'publisher' then pg.publication_id
        end as publication_id
FROM (
  (SELECT user_id FROM users WHERE firstname LIKE '%Jenkz%') 
  UNION 
  (SELECT user_id FROM users WHERE lastname LIKE '%Jenkz%')
) AS dt
JOIN role r on r.user_id = dt.user_id
LEFT JOIN link_writers w on r.user_id = w.user_id and r.role = 'writer'
LEFT JOIN link_editors e on r.user_id = e.user_id and r.role = 'editor'
LEFT JOIN link_publishers p on r.user_id = p.user_id and r.role = 'publisher'
LEFT JOIN link_publisher_groups pg on p.publisher_group_id = pg.publisher_group_id;

This will give a very "wide" answer.

Martin