views:

272

answers:

4

trying to solve a puzzle which i found here: http://zcasper.blogspot.com/2005/10/google-phone-interview.html

the goal is to re-present a IP-Address range to country code look-up table in memory and use this data-structure to process a zilloin rows of ipaddress to identify the country code..

so i started with a shoot from the hip thought of using HashTable a hash-table works great; if we have a country-code to range look-up, as we have less country names that map to ip address ranges?

but not sure; how do i go with ipaddress to country code. any thoughts? or can i use a tree data-structure?

A: 

The input file provides a range of IP Addresses (not 1:1 mapping) so you need some sort of ordered map structure.

// Assuming IPv4, and the inputs are valid (start before end) 
// and no overlapping ranges. 
public class CountyCodeToIPMap {
    private final TreeMap<Long, CountryCodeEntry> ipMap = 
            new TreeMap<Long, CountryCodeEntry>();

    public void addIpRange(long startIp, long endIp, String countryCode) {
        ipMap.put(startIp, new CountryCodeEntry(endIp, countryCode);
    }

    public String getCountryCode(long ip) {
        Map.Entry<Long, CountryCodeEntry> entry = ipMap.floorEntry(ip);
        if (entry != null && ip <= entry.getValue().endIpAddress) {
            return entry.getValue().countryCode;
        } else {
            return null;
        }
    }
}

public class CountryCodeEntry {
    public final long endIpAddress;
    public final String countryCode;
    public CountryCodeEntry (long endIpAddress, String countryCode) {
        this.endIpAddress = endIpAdddress;
        this.countryCode = countryCode;
    }
}
Michael Barker
Tried for a 200K records; it was fast :-), by the way is there any programmatic api in Java Collections to know about the Tree data-structure properties like "Depth" or "Height" ?
Satish
None that I know of. TreeMap in JDK is a red-black tree, so it will be roughly balanced, another option in the JDK is the ConcurrentSkipList which will be better balanced if the import data is sorted ahead of time. Other than you would need to look outside the Java Collections library at some of the more specialised structures.
Michael Barker
A: 

you have no chance of storing All ip adresses. what you can do is store the intervals start-end where ip adress ranges are.

there is a specialized data structure, called Interval Tree which allows you to query this.

Andreas Petersson
Isn't storing a range exactly what the OP said he would do?
Fredrik
A: 

this is if you are considering an sql solution:

if you can add some constraints to your data set, you might get away using a very simple sql. where you can even use simple indexes. - that is the case when you use the GeoCityLite dataset

if your ip blocks are non-overlapping, you can can just insert them in a database as unsigned 32bit numbers in a "blocks" table and query them like that with hibernate:

     (GeoipBlocks) getSession()
            .createQuery("select  gb" +
                    " from GeoipBlocks gb" +
                    " where gb.startIpNum <= :ipnumeric " +
                    " order by gb.startIpNum desc").
                    setMaxResults(1)
            .setParameter("ipnumeric", ipInLongValue)
            .uniqueResult()

i wrote it down in hql syntax, because not all databases use the same syntax for offset + limit

that issues a query for the best match, assuming all blocks are non-overlapping. - you don't even need the end ip for that, this is automatically determined by the successor.

avoid to query it this way!:

    select * from blocks where ipstart <= ip and ipend >= ip

my database was not able to fully utilize their indexes, and did a lot of table scanning.

Andreas Petersson
A: 

Due to the way the internet routing works, your algorithm needs to handle Longest Prefix Matching and you want store CIDR blocks, instead of addresses.

I developed an algorithm to handle this but can't post it here. The closest thing in Open Source is the routing table handling code in Linux.

You can also check out Patricia Trie or Radix Tree algorithms. Those can be used to solve this problem.

ZZ Coder