views:

205

answers:

2

I am wondering how Google does it. I have a lot of slow queries when it comes to page count and total number of results. Google returns a count value like 250,000,000 in a fraction of a second.

I am dealing with grid views. I have built a custom pager for a gridview that requires an SQL query to return a page count based on the filters set by the user. There are at least five filters, which include a keyword, a category and subcategory, a date range filter, and a sort expression for ordering. The query contains about 10 massive table LEFT JOINs.

This query is executed every time a search is performed, and a query execution lasts 30 seconds on average, be it a count or a select. I believe what's making it slow is my query string of inclusive and exclusive date range filters. I have replaced the (<=, >=) comparisons with BETWEEN ... AND, but I still experience the same problem.

See the query here: http://friendpaste.com/4G2uZexRfhd3sSVROqjZEc

I have problems with a long date range parameter.

Check my table that contains the dates: http://friendpaste.com/1HrC0L62hFR4DghE6ypIRp

UPDATE [9/17/2010] I minimized my date query and removed the time part. I tried reducing the joins for my count query (I am actually having a problem with my filter count, which takes too long to return a result of 60k rows).

    SELECT COUNT(DISTINCT esched.course_id)
      FROM courses c
           LEFT JOIN events_schedule esched
              ON c.course_id = esched.course_id
           LEFT JOIN course_categories cc
              ON cc.course_id = c.course_id
           LEFT JOIN categories cat
              ON cat.category_id = cc.category_id
     WHERE 1 = 1
           AND c.course_type = 1
           AND active = 1
           AND c.country_id = 52
           AND c.course_title LIKE '%cook%'
           AND cat.main_category_id = 40
           AND cat.category_id = 360
           AND (
                ('2010-09-01' <= esched.date_start OR '2010-09-01' <= esched.date_end)
                AND
                ('2010-09-25' >= esched.date_start OR '2010-09-25' >= esched.date_end)
               )

I just noticed that my query is quite fast when I have a filter on my main or sub category fields. However, when I only have a date filter and the range is a month or a week, it needs to count a lot of rows and takes 30 seconds on average.

These are the static fields:

AND c.course_type = 1
AND active = 1
AND c.country_id = 52

UPDATE [9/17/2010] If I create a hash of these three fields and store it in one field, will it make a difference in speed?

These are my dynamic fields:

AND c.course_title LIKE '%cook%'
AND cat.main_category_id = 40
AND cat.category_id = 360
-- ?DateStart and ?DateEnd

UPDATE [9/17/2010] Now my problem is the leading % in the LIKE query.

I will post an updated EXPLAIN.

+3  A: 

Search engines like Google use very complex behind-the-scenes algorithms to index searches. Essentially, they have already determined which words occur on each page, as well as the relative importance of those words and the relative importance of the pages (relative to other pages). These indexes are very quick because they are based on bitwise indexing.

Consider the following google searches:

custom: 542 million Google hits
pager: 10.8 million
custom pager: 1.26 million

Essentially what they have done is created a record for the word custom and in that record they have placed a 1 for every page that contains it and a 0 for every page that doesn't contain it. Then they zip it up because there are a lot more 0s than 1s. They do the same for pager.

When the search custom pager comes in, they unzip both records and perform a bitwise AND on them. This results in an array of bits whose length is the total number of pages they have indexed, and in which the number of 1s represents the hit count for the search. The position of each bit corresponds to a particular result, which is known in advance, so they only have to look up the full details of the first 10 to display on the first page.

This is oversimplified, but that is the general principle.
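
To make the principle concrete, here is a toy sketch in MySQL (not Google's actual implementation; the bit patterns are invented). Each integer stands in for a compressed bitmap in which bit n means "the word occurs on page n", and the hit count for a two-word search is just the population count of the bitwise AND:

    -- 11 = 1011: pages 1, 2 and 4 contain "custom" (hypothetical bitmap)
    -- 13 = 1101: pages 1, 3 and 4 contain "pager"  (hypothetical bitmap)
    -- 11 & 13 = 1001, i.e. pages 1 and 4 contain both words
    SELECT BIT_COUNT(11 & 13) AS hit_count;  -- returns 2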

Oh yes, they also have huge banks of servers performing the indexing and huge banks of servers responding to search requests. HUGE banks of servers!

This makes them a lot quicker than anything that could be done in a relational database.

Now, to your question: Could you paste some sample SQL for us to look at?

One thing you could try is changing the order in which the tables and joins appear in your SQL statement. I know it seems that it shouldn't make a difference, but it certainly can. If you put the most restrictive joins earlier in the statement, you could well end up with fewer overall join operations performed within the database.

A real world example. Say you wanted to find all of the entries in the phonebook under the name 'Johnson', with the number beginning with '7'. One way would be to look for all the numbers beginning with 7 and then join that with the numbers belonging to people called 'Johnson'. In fact it would be far quicker to perform the filtering the other way around even if you had indexing on both names and numbers. This is because the name 'Johnson' is more restrictive than the number 7.

So order does count, and database software is not always good at determining in advance which joins to perform first. I'm not sure about MySQL, as my experience is mostly with SQL Server, which uses index statistics to calculate the order in which to perform joins. These statistics get out of date after a number of inserts, updates and deletes, so they have to be recomputed periodically. If MySQL has something similar, you could try this.
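
For what it's worth, MySQL does keep index statistics and can be told to refresh them. A minimal sketch, using the table names from the posted count query:

    -- Recompute key distribution statistics so the optimizer has
    -- up-to-date information when choosing a join order.
    ANALYZE TABLE courses, events_schedule, course_categories, categories;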

UPDATE I have looked at the query that you posted. Ten left joins is not unusual and should perform fine as long as you have the right indexes in place. Yours is not a complicated query.

What you need to do is break this query down to its fundamentals. Comment out the lookup joins such as those to currency, course_stats, countries, states and cities along with the corresponding fields in the select statement. Does it still run as slowly? Probably not. But it is probably still not ideal.

So comment out all of the rest until you just have the courses table, the GROUP BY course_id and the ORDER BY course_id. Then experiment with adding the left joins back in to see which one has the greatest impact. Then, focusing on the ones with the greatest impact on performance, change the order of the joins. This is the trial-and-error approach. It would be a lot better for you to take a look at the indexes on the columns that you are joining on.

For example, the line cm.method_id = c.method_id would require a primary key on course_methodologies.method_id and a foreign key index on courses.method_id, and so on. Also, all of the fields in the WHERE, GROUP BY and ORDER BY clauses need indexes.
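
As a rough sketch of the kind of indexes meant here (the index names are made up, only the columns visible in the posted count query are covered, and it is assumed that active lives on courses; adjust to your actual schema):

    -- Join columns
    CREATE INDEX idx_esched_course  ON events_schedule (course_id);
    CREATE INDEX idx_cc_course      ON course_categories (course_id);
    CREATE INDEX idx_cc_category    ON course_categories (category_id);

    -- Filter columns from the WHERE clause
    CREATE INDEX idx_courses_filter ON courses (course_type, active, country_id);
    CREATE INDEX idx_cat_main_sub   ON categories (main_category_id, category_id);
    CREATE INDEX idx_esched_dates   ON events_schedule (date_start, date_end);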

Good luck

UPDATE 2 You seriously need to look at the date filtering on this query. What are you trying to do?

   AND ((('2010-09-01 00:00:00' <= esched.date_start
          AND esched.date_start <= '2010-09-25 00:00:00')
         OR ('2010-09-01 00:00:00' <= esched.date_end
             AND esched.date_end <= '2010-09-25 00:00:00'))
        OR ((esched.date_start <= '2010-09-01 00:00:00'
             AND '2010-09-01 00:00:00' <= esched.date_end)
            OR (esched.date_start <= '2010-09-25 00:00:00'
                AND '2010-09-25 00:00:00' <= esched.date_end)))

Can be re-written as:

AND (

    -- date_start is within the range - fine
    (esched.date_start BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')

    -- date_end is within the range - fine
    OR (esched.date_end BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')

    OR (esched.date_start <= '2010-09-01 00:00:00' AND esched.date_end >= '2010-09-01 00:00:00')

    OR (esched.date_start <= '2010-09-25 00:00:00' AND esched.date_end >= '2010-09-25 00:00:00')
  )
Daniel Dyson
I am definitely using indexes, will get back here later.
geocine
I am trying to filter for events that fall within a date range, e.g. the event is Sept 3 to Sept 5. Queries for Sept 3 to 4, Sept 4 to 6 and Sept 1 to 3 should all return the event.
geocine
How about a Sept 1 to 8 query? It should return the event, as should September 4 and September 2.
geocine
Yes, very good point. I have rearranged the 3rd and 4th OR statements to make them clearer. How are you getting on with the indexing?
Daniel Dyson
How many courses are there? The check c.course_title LIKE '%cook%' will take a long time because of the leading %. With a leading wildcard, an index on course_title cannot be used, so a table scan is needed to find any record containing 'cook'. Does this field have a full text index? Incidentally, a search such as c.course_title LIKE 'cook%' would be quick, because the index holds the records in alphabetical order, so it would just look up courses starting with 'cook'. As soon as you add a leading %, the index becomes useless. I know this is not much help, but it is something to be aware of.
Daniel Dyson
What if I want to retain the same logic? Will a full text index work?
geocine
@Daniel that might be right on the money. I don't know about MySQL, but in MS SQL that could easily be the main cause of the issue. Specifically, the DB engine could decide to use an execution plan that doesn't use the date indexes because it already has to do a table scan; it depends on the metrics, but I've seen it happen. Some ways I've worked around it are forcing the use of specific indexes or rewriting the query so that part of it explicitly hits only indexes.
eglasius
I have updated my post.
geocine
+2  A: 

On your update: you mention that you suspect the problem is in the date filters.

All those date checks can be summed up in a single check:

esched.date_ends >= '2010-09-01 00:00:00' and esched.date_start <= '2010-09-25 00:00:00'

If it behaves the same with the above, check whether the following returns quickly / is picking up your indexes:

SELECT COUNT(DISTINCT esched.course_id) FROM events_schedule esched WHERE esched.date_ends >= '2010-09-01 00:00:00' and esched.date_start <= '2010-09-25 00:00:00'

P.S. I think that when using the join, you can do SELECT COUNT(c.course_id) to count the main course records in the query directly, i.e. you might not need the DISTINCT that way.


Re your update, where most of the time now goes to the wildcard search after the change:

Use a MySQL full text search. Make sure to check the full-text restrictions; one important one is that it is only supported on MyISAM tables. I must say that I haven't really used MySQL full text search, and I'm not sure how it impacts the use of other indexes in the query.
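
A minimal sketch of what that could look like (assuming course_title is the column to search and that the courses table is, or can be converted to, MyISAM; note that MATCH ... AGAINST matches whole words rather than arbitrary substrings like '%cook%'):

    -- Add a full-text index on the title column
    ALTER TABLE courses ADD FULLTEXT INDEX ft_course_title (course_title);

    -- Then replace the LIKE '%cook%' filter with a full-text match
    SELECT COUNT(*)
      FROM courses c
     WHERE MATCH (c.course_title) AGAINST ('cook');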

If you can't use a full text search, IMHO you are out of luck with your current approach, since a regular index can't be used to check whether a word is contained in any part of the text.

If that's the case, you might want to switch that specific part of the approach and introduce a tag/keyword based approach. Unlike categories, you can assign multiple tags to each item, so it's flexible yet doesn't have the free-text issue.
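
A rough sketch of that idea (the course_tags table and its columns are hypothetical, just to show the shape of the approach):

    -- One row per (course, tag); an exact match on tag can use a normal index
    CREATE TABLE course_tags (
        course_id INT NOT NULL,
        tag       VARCHAR(50) NOT NULL,
        PRIMARY KEY (course_id, tag),
        KEY idx_tag (tag)
    );

    -- Counting courses tagged 'cook' then becomes an indexed equality lookup
    SELECT COUNT(DISTINCT ct.course_id)
      FROM course_tags ct
     WHERE ct.tag = 'cook';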

eglasius
I updated my post.
geocine
@geocine if I followed that right, it took the problem off the date filters and onto the LIKE '%...%', so it's an improvement, right?
eglasius