views:

109

answers:

3

I need to store a large (128-bit) PK. Each int will have some corresponding columns... no schema is defined now... and I want the schema flexible in the future. (I only need conservative flexibility eg adding new columns from time to time)

At this point I'm not too concerned with the ability to do joins and such. I mostly want to pick a random PK and search up or down to the next 10 records. Since there can be a lot of white space in the search the cost of the upward and downward search may vary.

What is the best technology to handle this request? I'm interested in something that will save me money (per transaction), and storage space. I'm also interested in performance.

What do you recommend?

Update

OK, so what is this for? I want to create a history of data for IPv6 addresses. Of course this will be a very sparse table... but I do need to track certain things regarding seen IP's.

+3  A: 

To clarify, I think you need a key of 128 bits (not 2^128 bits).

I'm taking this as a question about Db Key type selection, I'm not sure what consequences the Azure angle has. AFAIK it is build on top of MS-SQL.

128 bits or 16 bytes is the same size as a Guid (UniqueIdentifier) but I don't think you want to use that. Although there is support for it to be used as a key.

A direct choice would be something like binary(16) but I don't know how well suited that is as a PK.

You can code it as a char(32) hex string, that is not to excessive.

For practicality estimates, the key factor is how sparse your data is, or better: how many addresses do you expect to have to store?

Henk Holterman
Wikipedia says IPv6 is 2^128... http://en.wikipedia.org/wiki/IPv6, am I misreading it?
MakerOfThings7
It will grow over time, but since we live in a IPv4 world now, it will take a long time to get to 2^32 in terms of PK usage. I'm just planning for the future.
MakerOfThings7
IPV6 is a 128 bits address, allowing 2^128 possible values. You're in trouble if you need more than a small fraction of 2^128.
Henk Holterman
@Henk Holterman I made the key requirements change as you sugguested. 128bits it is! sorry for the delay in me understanding that! When choosing how to store the PK, my goal is to seek up/down and find the next smaller/larger value.
MakerOfThings7
+1  A: 

First of all, your premise in 2^128 integer keys is wrong, since you mentioned you want to store IP V6 addresses. An IP V6 Address is 128 bits long. To store it as an integer you need 128/32 or 4 32-bit integers per address. So the correct estimate is 2^128 possible addresses * 4 integers for a total of 2^128 * 4 keys of 32 bit integers....

Anyway I want that in bytes so we'll just go 2^128 possible addresses * 4 integers * 4 bytes per integer = 5.44 * 10^39 bytes. After that just follow Andreas' calculation and you'll end up with more....

That being said the idea of IP V6 is that we have more addresses than we'll ever need to use. So I highly doubt anywhere near 2^128 will be assigned for many years. At most if we go to IP V6 right now, we'll have the IP V4 address space assigned and nothing else, and though the number of ip addresses increases every year, not by that much.

Anyway it seems like you don't know what you are storing since the schema is not defined so Azure table may be what you want. Mostly it is key/value. For each IP address you could store totally different properties. And it is really easy to add another property/remove another property using the update/insert/merge operations. But if you want some uniformity applied to your data than use SQL. It's true that you will have to modify the schema as changes happens, but this will enforce that every row (and hence IP address) has the same data. Otherwise it is easy to leave out "required" columns/properties or to misspell them if you have multiple applications. But it really depends what you want to do. It's more do you value data integrity or do you value the flexibility of properties? Even though a schema does need to be changed, there are commands to add/remove columns from a schema. It's more do you want every IP address to store the same properties or can each have different properties. I believe the Azure Table way probably takes less storage per address than the SQL way if you are not using most of the properties for a given IP address. So it all depends on what you are looking for.

Cervo
I updated the question 20 min ago to reflect my mistake in understanding IPv6, but thanks for correcting it too. I'm *most* interested in finding N nearest neighbors. If that requires a fixed schema, so be it. A dynamic XML based schema is fine too. I want to find all recorded peers, and skip all the sparse data.
MakerOfThings7
I didn't see the bit update when I loaded the page, but that's correct. It depends what you are storing for each neighbor. For the N Nearest neighbors you could use a schema with all the geographical information. If for each neighbor you want to store different things, go with Azure table. I definitely think that you will probably never need to store anywhere near 2^128 keys. I would be surprised if you even hit a full 2^40 keys in our lifetime.
Cervo
@cervo When I say nearest neighbors, I don't mean geographically. If the PK is a 128bit "number" I want to simply locate the next/previous allocated number in the chain.
MakerOfThings7
In that case, I'm not sure what kind of index that Azure Table has. But you may want SQL because you could just clustered index the IP addresses and then a query to say get me the maximum address less than this one or the minimum address greater than this one is very efficient. I know Azure/Big Table index the key so you can say get (key) and put (key) but I'm not sure what optimizations they have for putting the keys in order. More research is needed.
Cervo
Also they may not allocate each address in sequence either. I'm not sure the full ip 6 plan. But they will probably assign blocks of address from different parts of the sequence to people and then people will allocate from those blocks. So one address might be IP 27 and the next allocated address after that might be 50,000,000,001. I'm not really sure what the nearest neighbor on the address gets you.
Cervo
Anyway http://www.iana.org/reports/2002/ipv6-allocation-policy-26jun02 seems to have some IP6 allocation plans, not sure how up to date they are.
Cervo
+1  A: 

Windows Azure Tables would be my recommendation, but there's only one sort order defined, so it will be hard to search both forward and backward. You may end up having to store each key twice, once in normal order, and once reversed (0xFFF...F - key) to support both scan directions efficiently.

smarx
Hey Steve, good to see you use this site! I'm halfway hoping for the announced secondary indexes to be released if this is the case. Perhaps I could store a second value of MaxValue - System.int128? I think I need to store it as a string, is that still true?
MakerOfThings7
Yes, both partition and row keys in azure tables are strings, which is important if you need the sorted.
knightpfhor
Yup, both of you are right. That means it's important to zero-pad the number. (5 should be stored as "000...5" to make sure lexicographical behaves like a numerical sort.)
smarx
@Smarx Thanks. Is there a formula or calculator I can use to determine how much this will cost me (including XML overhead)? Ideally I'd like to see the estimated storage cost and the est. network costs broken down as separate line items.
MakerOfThings7