views:

301

answers:

3

I am attempting to devise a system for packing integer values greater than 65535 into a ushort. Let me explain.

We have a system which generates Int32 values using an IDENTITY column from SQL Server and are limited by an in-production client API that overflows our Int32 IDs to ushorts. Fortunately the client only has about 20 or so instances of things with these IDs - let's call them packages - at any given time and it only needs to have them unique amongst local siblings. The generally accepted solution is to translate our Int32 IDs to ushorts (and no I don't mean casting, I mean translating) before transmitting them to the client, however there are barbs with this approach:

  1. Some IDs less than 65535 may still be in play on a given client at any time due to non-expiration.
  2. We cannot have any ID collisions - that is if package ID 1 goes to the client, an algorithm that tracks how many times 65535 is removed from an Int32 to make a ushort when applied to 65536 would also result in 1 thus causing a collision.
  3. We must be able to reconstruct the ushort into the Int32 upon return.

What we have available to solve this problem is a single signed byte field that is echoed to us and gives us 127 values to play with (really 117 because we're using 0-9 for something else). I'll refer to this as the "byte field" from here on out.

We have discussed three different translation routines:

  1. Multiplicative: store in the byte field how many times we remove 65535 from our Int32 to make a ushort. This has collision problems as detailed above.
  2. Serialized Session State: for each client, generate a session ID based on facts about that client. Then store a 1:1 translation table starting from 1 up to the number of packages delivered so when the client accesses our server again the inventory of packages can be translated back to their known database IDs. This has overhead problems since we'd be backing serialized session state to a database and we want to support hundreds to thousands of transactions a second.
  3. Varied algorithmic approach where the byte field is an ID of a transformative algorithm that takes an Int32 and transforms it into a ushort. Obviously many of these are going to be simple Multiplicative (to increase our ceiling of IDs we can transform) but some will have to be multiplicative with a smaller boundry (like 32768) with a number added to/subtracted from to get as close to a number that can be guaranteed unique amongst siblings. This approach is processor intensive but should allow us to avoid collisions while remaining scalable (though with this approach we have a limited ceiling that won't be reached before the ushort problem goes away on its own due to upgrades).

So my question is: is there a better way than my approaches above, and if not, what should I be looking for in terms of algorithms (for approach #3) to generate a number between 1-65535 when a given number is greater than 0 and must not be a one way hash?

Clarification: its not that the ushort ceiling is the greatest problem, its that the client API uses a ushort so I cannot combine the byte field on the client to get bigger values (the client API is non-upgradeable but will eventually phase out of existence).

A: 

How much "more" than 65535 do you need? You could always just add a few bits from your "byte field" as the high-order bits of the ID. Just 2 bits would get you to 262,143, 3 bits would get you 524,287.

ctacke
+1  A: 

I can think of a few other options:

Are there globally fewer than 65536 entries in the database? If so, then you could maintain a mapping table that's not associated with session state, but is a persisted part of the application.

Are the majority of entries at indexes less than, say 50,000? If that's the case you could map such values directly, and use a map associated with the session for the remaining ones.

If persisting such session data is an issue and the number of clients is reasonably small, you could enable client/session affinity and maintain the map local to the server.

If it's not a web application, you could maintain the map on the client itself.

I don't see any algorithmic way that would avoid collisions - I suspect you could always come up with an examples that would collide.

mancaus
+1  A: 

Regarding approach 2:

Your second approach is pretty much how NAT works. Every TCP/UDP client on the local network has up to 65535 ports in use (except port 0) and a private IP. The router knows only a single public IP. Since two clients may both have source port 300, it cannot simply just replace the private IP with a public one, that would cause collisions to appear. Thus it replaces the IP and "translates" the port (NAT: Network Address Translation). On return, it translates the port back and replaces the public with a private IP again, before forwarding the package back. You'd be doing nothing else than that. However, routers keep that information in memory - and they are not too slow when doing NAT (companies with hundreds of computers are NATed to the Internet sometimes and the slow down is hardly noticeably in most cases). You say you want up to thousand transactions a second - but how many clients will there be? As this mainly will define the size of memory needed to backup the mappings. If there are not too many clients, you could keep the mapping with a sorted table in memory, in that case, speed will be the smallest problem (table getting to bigger and server running out of memory is the bigger one).

What is a bit unclear to me is that you once say

Fortunately the client only has about 20 or so instances of things with these IDs - let's call them packages - at any given time and it only needs to have them unique amongst local siblings.

but then you say

Some IDs less than 65535 may still be in play on a given client at any time due to non-expiration.

I guess, what you probably meant by the second statement is, that if a client requests ID 65536, it might still have IDs below 65535 and these can be as low as (let's say) 20. It's not that the client processes IDs in a straight order, right? So you cannot say, just because it now requested 65536, it may have some smaller values, but certainly not in the range 1-1000, correct? It might actually keep a reference to 20, 90, 2005 and 41238 and still go over 65535, that's what you meant?

I personally like your second approach more than the third one, as it is easier to avoid a collision in any case and translating the number back is a plain, simple operation. Although I doubt that your third approach can work in the long run. Okay, you might have a byte to store how often you subtracted 2^16 of the number. However, you can only subtract 117 * 2^16 as largest numbers. What will you do if numbers go above that? Using a different algorithm, that does not subtract, but does what? Divide? Shift bits? In that case you lose granularity, that means this algorithm can't hit any possible number any longer (it will make large jumps). If it was so easy to just apply a magic translation function upon 32 bit to make 16 bit from it (+ one extra byte) and then just transform it back, guess every compression method in this world would use it, as it could, no matter what the 32 bit number was, always compress it down to 24 bit (16 bit + one byte). That would be magic. It is not possible to pack 32 bit into 24 bit and also pack all the logic how to transform it back into it as well. You will need some external storage, which brings us back to your 2nd approach. This is the only approach that will work and it will work for every number in 32 bit number range.

Mecki
Accepted to get this out of the unanswered questions list and this is a good explanation of the solution for the problem.
cfeduke
You seem to be confusing PAT or NAPT and NAT in your post. NAT does not translate the port, it translates the address whilst PAT and NAPT translate both the address and port.
Fraser
@Fraser: Don't nit-pick please. The main purpose of NAT is to have mulitple internal private IP addressend being mapped to a single public one and that is not possible unless you also do PAT (or no two hosts ever talk to the same host on the internet, while using the same service - will hard to enforce, both accessing Google, bang); PAT is basically a non-existing term in network business. See also Wikipedia page to NAT; they only quickly mention PAT, say you also can just say NAT for that and use solely NAT to describe every kind of NAT.
Mecki