tags:

views:

223

answers:

7

We have a MySQL table with about 3.5 million IP entries.

The structure:

CREATE TABLE IF NOT EXISTS `geoip_blocks` (
  `uid` int(11) NOT NULL auto_increment,
  `pid` int(11) NOT NULL,
  `startipnum` int(12) unsigned NOT NULL,
  `endipnum` int(12) unsigned NOT NULL,
  `locid` int(11) NOT NULL,
  PRIMARY KEY  (`uid`),
  KEY `startipnum` (`startipnum`),
  KEY `endipnum` (`endipnum`)
) TYPE=MyISAM  AUTO_INCREMENT=3538967 ;

The problem: A query takes more than 3 seconds.

SELECT uid FROM `geoip_blocks` WHERE 1406658569 BETWEEN geoip_blocks.startipnum AND geoip_blocks.endipnum LIMIT 1

- about 3 seconds

SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1

- no gain, about 3 seconds

How can this be improved?

+1  A: 

Your startip and endip should be a combined index. Mysql can't utilize multiple indexes on the same table in one query.

I'm not sure about the syntax, but something like

KEY (startipnum, endipnum)

jishi
Unfortunately this does not have an impact. Queries still last a few seconds.
Alex
Run an EXPLAIN on the query and give us the result. Make sure your indexes are up to date with an ANALYZE or OPTIMIZE.
jishi
+1  A: 

It looks like you're trying to find the range that an IP address belongs to. The problem is that MySQL can't make the best use of an index for the BETWEEN operation. Indexes work better with an = operation.

One way you can add an = operation to your query is by adding the network part of the address to the table. For your example:

numeric 1406658569
ip address 83.215.232.9
class A with 8 bit network part
network part = 83

With an index on (networkpart, startipnum, endipnum, uid) a query like this will become very fast:

SELECT  uid 
FROM    `geoip_blocks` 
WHERE   networkpart = 83
        AND 1406658569 BETWEEN startipnum AND endipnum

In case one geoip block spans multiple network classes, you'd have to split it in one row per network class.

Andomar
I've never had problem with between-operations with a combined index. With your example, it would only use the index for networkpart, meaning it will have to make a table scan on the temporary table either way. That might be a problem since an A-class network is 16 million possible addresses. Even if you only got C-style blocks, that leaves you with atleast 65k rows to scan.
jishi
The index I proposed has four columns. My idea was that MySQL can filter on networkpart, then startipnum, and then index scan the rest. An index on (a,b) cannot be used to do the bold part in WHERE a < 1 and **1 < b**, because b is independent of a.
Andomar
Ah, sorry, missed your index spec.
jishi
+1  A: 
lstntjss
How would that improve things?
jishi
A: 

Maybe you would like to have a look at partitioning the table. This feature has been available since MySQL 5.1 - hence you do not specify which version you are using, this might not work for you if you are stuck with an older version.

As the possible value range for IP addresses is known - at least for IPv4 - you could break down the table into multiple partitions of equal size (or maybe even not equal if your data is not evenly distributed). With that MySQL could very easily skip large parts of the table, speeding up the scan if it is still required.

See MySQL manual on partitioning for the available options and syntax.

Daniel Schneller
+1  A: 

The solution to this is to grab a BTREE/ISAM library and use that (like BerkelyDB). Using ISAM this is a trivial task.

Using ISAM, you would set your start key to the number, do a "Find Next", (to find the block GREATER or equal to your number), and if it wasn't equal, you'd "findPrevious" and check that block. 3-4 disk hits, shazam, done in a blink.

Well, it's A solution.

The problem that is happening here is that SQL, without a "sufficiently smart optimizer", does horrible on this kind of query.

For example, your query:

SELECT uid FROM `geoip_blocks` WHERE startipnum < 1406658569 and endipnum > 1406658569 limit 1

It's going to "look at" ALL of the rows that are "less than" 1406658569. ALL of them, then it's going to scan them looking for ALL of the rows that match the 2nd criteria.

With a 3.5m row table, assuming "average" (i.e. it hits the middle), welcome to a 1.75m row table scan. Even worse, and index table scan. Ideally MySQl will "give up" and "just" table scan, as it's faster.

Clearly, this is not what you want.

@Andomar's solution is basically forcing you to "block" to data space, via the "network" criteria. Effectively breaking your table in to 255 pieces. So, instead of scanning 1.75m rows, you get to scan 6800 rows, a marked improvement at a cost of you breaking your blocks up on the network boundary.

There is nothing wrong with range queries in SQL.

SELECT * FROM table WHERE id between X and Y

is a, typically, fast query, as the optimizer can readily delimit the range of rows using the index.

But, that's not your query, because you are not ranging you main ID in this case (startipnum).

If you "know" that your block sizes are within a certain range (i.e. none of your blocks, EVER, have more than, say, 1000 ips), then you can block your query by adding "WHERE startipnum between {ipnum - 1000} and {ipnum + 1000}". That's not really different than the network blocking that was proposed, but here you don't have to maintain it as much. Of course, you can learn this with:

SELECT max(endipnum - startipnum) FROM table

to get an idea what your largest range is.

Another option, which I've seen, have never used, but is actually, well, perfect for this, is to look at MySql's Spatial Extensions, since that's what this really is.

This is designed more for GIS applications, but you ARE searching for something in ranges, and that's a lot of what GIS apps do. So, that may well be a solution for you as well.

Will Hartung
A: 

Thanks for all your comments, I really appreciate it.

For now we ended up using a caching mechanism and we have reduced that expensive queries.

Alex
A: 

Use the second query you mentioned and try to add an index hint. See:

http://dev.mysql.com/doc/refman/5.0/en/index-hints.html

Salman A