views:

103

answers:

3

I have two tables, Table A with 700,000 entries and Table B with 600,000 entries. The structure is as follows:

Table A:

+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment | 
| number    | bigint(20) unsigned | YES  |     | NULL    |                | 
+-----------+---------------------+------+-----+---------+----------------+

Table B:

+-------------+---------------------+------+-----+---------+----------------+
| Field       | Type                | Null | Key | Default | Extra          |
+-------------+---------------------+------+-----+---------+----------------+
| id          | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment | 
| number_s    | bigint(20) unsigned | YES  | MUL | NULL    |                | 
| number_e    | bigint(20) unsigned | YES  | MUL | NULL    |                | 
| source      | varchar(50)         | YES  |     | NULL    |                |
+-------------+---------------------+------+-----+---------+----------------+

I am trying to find if any of the values in Table A are present in Table B using the following code:

$sql = "SELECT number from TableA";
$result = mysql_query($sql) or die(mysql_error());

while($row = mysql_fetch_assoc($result)) {
        $number = $row['number'];
        $sql = "SELECT source, count(source) FROM TableB WHERE number_s < $number AND number_e > $number GROUP BY source";
        $re = mysql_query($sql) or die(mysql_error);
        while($ro = mysql_fetch_array($re)) {
                echo $number."\t".$ro[0]."\t".$ro[1]."\n";
        }
}

I was hoping that the query would go fast but then for some reason, it isn't terrible fast. My explain on the select (with a particular value of "number") gives me the following:

mysql> explain SELECT source, count(source) FROM TableB WHERE number_s < 1812194440 AND number_e > 1812194440 GROUP BY source;
+----+-------------+------------+------+-------------------------+------+---------+------+--------+----------------------------------------------+
| id | select_type | table      | type | possible_keys           | key  | key_len | ref  | rows   | Extra                                        |
+----+-------------+------------+------+-------------------------+------+---------+------+--------+----------------------------------------------+
|  1 | SIMPLE      | TableB     | ALL  | number_s,number_e       | NULL | NULL    | NULL | 696325 | Using where; Using temporary; Using filesort | 
+----+-------------+------------+------+-------------------------+------+---------+------+--------+----------------------------------------------+
1 row in set (0.00 sec)

Is there any optimization that I can squeeze out of this?

I tried writing a stored procedure for the same task but it doesn't even seem to work in the first place... It doesn't give any syntax errors... I tried running it for a day and it was still running which felt odd.

CREATE PROCEDURE Filter() 
Begin 
  DECLARE number BIGINT UNSIGNED; 
  DECLARE x INT; 
  DECLARE done INT DEFAULT 0; 
  DECLARE cur1 CURSOR FOR SELECT number FROM TableA; 
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1; 
  CREATE TEMPORARY TABLE IF NOT EXISTS Flags(number bigint unsigned, count int(11)); 
  OPEN cur1; 
  hist_loop: LOOP 
    FETCH cur1 INTO number; 
    SELECT count(*) from TableB WHERE number_s < number AND number_e > number INTO x; 
    IF done = 1 THEN 
      LEAVE hist_loop; 
    END IF; 
    IF x IS NOT NULL AND x>0 THEN 
      INSERT INTO Flags(number, count) VALUES(number, x); 
    END IF; 
  END LOOP hist_loop; 
  CLOSE cur1;
END
+2  A: 

It looks to me like you have separate indexes on the number_e and number_s columns, probably created with separate ADD INDEX(number_e) and ADD INDEX(number_s) columns.

You will likely get much better performance if you add an index that encompasses both those columns, as they are both being used in your query, and MySQL is clearly not choosing to use either of the single-column indexes, judging that a whole table scan would be faster (not uncommon if your query spans a large range of values).

ALTER TABLE tblB ADD INDEX(number_s,number_e);

You won't need the individual number_s index after that, as MySQL can use the one you just created for queries against number_s only, so you might as well drop that one.

zombat
+1 for the combined index. Did not observe much difference but I will go ahead and try the R-Trees that was suggested as well. Thanks!
Legend
+4  A: 

You are trying to find intervals which contain a point. This is not so fast with a B-tree index (the default index type in most databases), however an R-tree index will work well for this sort of query. MySQL doesn't allow you to change the type of an index directly, but you can force MySQL to use an R-Tree by using the GEOMETRY column type.

Quassnoi covers this in his article on nested sets in MySQL. While it's not quite the same, it's very similar. A quote from the article:

There is also a certain class of tasks that require searching for all ranges containing a known value:

* Searching for an IP address in the IP range ban list
* Searching for a given date within a date range

and several others. These tasks can be improved by using R-Tree capabilities of MySQL

Mark Byers
Thanks a lot for the list... R-Tree never occurred to me...
Legend
Though this answer is probably THE answer to the question I still get funny feelings when someone uses code like `for all records in table1 do select ... from table2 where condition based on something from table1`. A join would IMHO be a more natural direction of thought.
extraneon
@extraneon: Indeed, and if you read the article you can see that's exactly what Quassnoi suggests, e.g. `JOIN t_hierarchy hrp ON MBRWithin(Point(0, hp.lft), hrp.sets)`. Though here you don't want all intervals - just knowing that there is one is enough.
Mark Byers
@Mark Byers: Just wanted to say a big "Thank You". I obtained my results in less than 5 seconds :)
Legend
@Legend: Great :) How long was it taking before?
Mark Byers
@Mark Byers: I did not let it run till completion. I ran it for about 2 hours and it was still continuing. So your reply was a life saver :) Actually, my problem maps exactly into the IP blacklisting you specified. I was trying to find the IP ranges that contained a given IP but the only difference was that I stored the IP addresses as BIGINTs.
Legend
@Legend: OK! I really appreciate you giving feedback about how it went, and also for actually doing the bulk of the work yourself rather than just saying 'give me the code!'. :)
Mark Byers
@Mark Byers: Thanks! You will be surprised to know how much that R-Trees will help me in the future. I learnt an entirely new concept today :) Thanks to others as well for sticking with me...
Legend
+1  A: 

First, I assume the desired output is to group all 'source' where the input lies between number_e and number_s, and the count thereof.

I'm scratchy on the syntax, but you might consider using a 'BETWEEN' clause up there instead of an explicit comparison using less-than/greater-than operators

Edit: What Zombat says applies too; indexes will help too.

Everyone
+1 for the suggestion. Thanks!
Legend