views:

211

answers:

4

Question

What is the fastest way to find if an IP address exists in a file that contains IP addresses sorted as:

219.93.88.62
219.94.181.87
219.94.193.96
220.1.72.201
220.110.162.50
220.126.52.187
220.126.52.247

Constraints

  • No database (e.g., MySQL, PostgreSQL, Oracle, etc.)
  • Infrequent pre-processing is allowed (see possibilities section)
  • Would be nice not to have to load the file each query (131Kb)
  • Uses under 5 megabytes of disk space
  • No extra PHP modules

File Details

  • One IP address per line
  • 9500+ lines

Possible Solutions

  • Create a directory hierarchy (radix tree?) then use is_dir() (sadly, this uses 87 megabytes)
+3  A: 

Since your file stores the IP addresses in sorted order already you can quickly find a specific IP address in O(log(n)) time by using a binary search.

If you want to speed this up even further you can cache for example every 100th row in memory and use an in-memory binary search first, then you know which part of the file you need to read in to finish the search.

Having said that 131kB is really not that much, so the simplest and fastest solution is to buy more memory and cache the entire file in memory in a hash table.

Mark Byers
I have no control over how much memory and disk space will be available. I like the idea of an index, though.
Dave Jarvis
+3  A: 

EDIT I didn't notice the php tag, I don't know if the following type of thing is possible in that language. But I'll leave it for the idea anyway.

An IPv4 address is representable as a 32-bit number, so I'd just make an array of int32, translate your addresses into the 'ints` with the following Python-ish psuedocode:

x = 0
i = 24
s = '111.222.333.444'
for part in s.split('.'):
    x += part.toint() << i
    i -= 8
IPlist.append(x)

Then you can get the input address, convert it to an int the same way, and do binary search on the array.

For ~10 k lines, the array will take ~40 kBytes.

mtrw
+1 for genius. But a bit more complicated than I'd like. ;-)
Dave Jarvis
+1  A: 

Might not be fast, but I'd try this out: If the IP address file doesn't change much, read the file into an array and cache it (maybe Memcache) and search from there on every request.

Shiki
I should add one more constraint: no extra modules.
Dave Jarvis
+3  A: 

Scanning the file line by line to find an IP seems like a pain if you have 9,000 non-matches to check before you get to 232.0.17.1

Is your file constrained to a single file? e.g. lets say this list is banned IPs and you just want to see if one is "in" the list.

What if you made a DIR to contain multiple files:

BannedIPs
  +- 0.ips
  +- 1.ips
  +- 37.ips
  +- 123.ips
  +- 253.ips
  +- 254.ips

Each file only contains IP addresses that start with that number.

If you were lucky enough to have even distribution... you'd have 256 files, but each would only have ~37 entries.

Thus when you want to test: 232.0.17.1 you look in the 232.ips file and scan for it.

scunliffe
This is probably the easiest to implement and does not depend on caching in memory. Thanks again!
Dave Jarvis