views:

128

answers:

2

I have a problem with a subquery involving IPV4 addresses stored in MySQL (MySQL 5.0).

The IP addresses are stored in two tables, both in network number format - e.g. the format output by MySQL's INET_ATON(). The first table ('events') contains lots of rows with IP addresses associated with them, the second table ('network_providers') contains a list of provider information for given netblocks.

events table (~4,000,000 rows):

event_id (int)
event_name (varchar)
ip_address (unsigned int)

network_providers table (~60,000 rows):

ip_start (unsigned int)
ip_end  (unsigned int)
provider_name (varchar)

Simplified for the purposes of the problem I'm having, the goal is to create an export along the lines of:

event_id,event_name,ip_address,provider_name

If do a query along the lines of either of the following, I get the result I expect:

SELECT provider_name FROM network_providers WHERE INET_ATON('192.168.0.1') >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1

SELECT provider_name FROM network_providers WHERE 3232235521 >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1

That is to say, it returns the correct provider_name for whatever IP I look up (of course I'm not really using 192.168.0.1 in my queries).

However, when performing this same query as a subquery, in the following manner, it doesn't yield the result I would expect:

SELECT 
events.event_id,
events.event_name,
    (SELECT provider_name FROM network_providers 
    WHERE events.ip_address >= network_providers.ip_start 
    ORDER BY network_providers.ip_start DESC LIMIT 1) as provider
FROM events

Instead the a different (incorrect) value for provider is returned. Over 90% (but curiously not all) values returned in the provider column contain the wrong provider information for that IP.

Using events.ip_address in a subquery just to echo out the value confirms it contains the value I'd expect and that the subquery can parse it. Replacing events.ip_address with an actual network number also works, just using it dynamically in the subquery in this manner that doesn't work for me.

I suspect the problem is there is something fundamental and important about subqueries in MySQL that I don't get. I've worked with IP addresses like this in MySQL quite a bit before, but haven't previously done lookups for them using a subquery.

The question:

I'd really appreciate an example of how I could get the output I want, and if someone here knows, some enlightenment as to why what I'm doing doesn't work so I can avoid making this mistake again.

Notes:

The actual real-world usage I'm trying to do is considerably more complicated (involving joining two or three tables). This is a simplified version, to avoid overly complicating the question.

Additionally, I know I'm not using a between on ip_start & ip_end - that's intentional (the DB's can be out of date, and such cases the owner in the DB is almost always in the next specified range and 'best guess' is fine in this context) however I'm grateful for any suggestions for improvement that relate to the question.

Efficiency is always nice, but in this case absolutely not essential - any help appreciated.

+1  A: 

You should take a look at this post:

http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/

It has some nice ideas for working with IPs in queries very similar to yours.

Another thing you should try is using a stored function instead of a sub-query. That would simplify your query as follows:

SELECT 
event.id,
event.event_name,
GET_PROVIDER_NAME(event.ip_address) as provider
FROM events
Ike Walker
+1 That URL doesn't quite address the problem I have but it was interesting (and is actually very relevant to what I'm doing, although I didn't go into that in my question as wanted to keep it simple). Your point about a stored function however, hit the nail on the head.
Iain Collins
A: 

There seems to be no way to achieve what I wanted with a JOIN or Subquery.

To expand on Ike Walker's suggestion of using a stored function, I ended up creating a stored function in MySQL with the following:

DELIMITER //
DROP FUNCTION IF EXISTS get_network_provider //
CREATE FUNCTION get_network_provider(ip_address_number INT) RETURNS VARCHAR(255)
BEGIN
DECLARE network_provider VARCHAR(255);
    SELECT provider_name INTO network_provider FROM network_providers
    WHERE ip_address_number >= network_providers.ip_start
    AND network_providers.provider_name != ""
    ORDER BY provider_name.ip_start DESC LIMIT 1;
RETURN network_provider;
END //

Explanation:

The check to ignore blank names, and using >= & ORDER BY for ip_start rather than BETWEEN ip_start and ip_end is a specific fudge for the two combined network provider databases I'm using, both of which need to be queried in this way.

This approach works well when the query calling the function only needs to return a few hundred results (though it may take a handful of seconds). On queries that return a few thousand results, it may take 2 or 3 minutes. For queries with tens of thousands of results (or more) it's too slow to be practical use.

This was not unexpected from using a stored function like this (i.e. every result returned triggering a separate query) but I did hit a drop in performance sooner than I had expected.

Recommendation:

The upshot of this was that I needed to accept that the data structure is just not suitable for my needs. This had been already pointed out to me by a friend, it just wasn't something I really wanted to hear at the time (because I really wanted to use that specific network_provider DB due to other keys in the table that were useful to me, e.g. for things like geolocation).

If you end up trying to use any of the IP provider DB's (or indeed any other database) that follow a similar dubious data format, then I can only suggest they are just not going to be suitable and it's not worth trying to cobble something together that will work with them as they are.

At the very least you need to reformat the data so that they can be reliably used with a simple BETWEEN statement (no sorting, and no other comparisons) so you can use it with subqueries (or JOINS) - although it's likely an indicator that any data that messed up is probably not all that reliable anyway.

Iain Collins