views:

1159

answers:

5

Here are the requirements:

Must be alphanumeric, 8-10 characters so that it is user friendly. These will be stored as unique keys in database. I am using Guids as primary keys so an option to use GUids to generate these unique Ids would be preferable.

I am thinking on the lines of a base-n converter that takes a Guid and converts to an 8 character unique string.

Short, light-weight algorithm preferred as it would be called quite often.

+2  A: 

You may want to try a CRC32 hashing algorithm. The CRC32 generates an 8 character string.

http://en.wikipedia.org/wiki/Cyclic_redundancy_check

http://textop.us/Hashing/CRC

Nick Berardi
+5  A: 

You might consider base 36. in that it can do letters and numbers. Consider removing I (eye) and O (Oh) from your set so they don't get mixed up with 1 (one) and 0 (zero). Some people might complain about 2 and Z as well.

EvilTeach
This is probably the best bet, but unfortunately a 128-bit GUID ends up still being over 20 characters in base 36. Perhaps GUID isn't the best starting point.
Matt Hamilton
The guid is readily available from my model objects so it will be very convenient I believe.
Mank
well, you can go A-Z and a-z and 0-9 thats base 62. Perhaps that will work out better.
EvilTeach
However, people don't realize that items with different case are different, so you will run into problems using base 62
Kibbee
Long time, but I ended up using the base34, removing some illegible characters.
Mank
I'm glad it worked out for you.
EvilTeach
+3  A: 

If you're looking for "user friendly" you might want to try using entire words rather than simply making it short/alphanumberic, thus, something like:

words = [s.strip().lower() for s in open('/usr/share/dict/canadian-english') if "'" not in s]
mod = len(words)

def main(script, guid):
    guid = hash(guid)

    print "+".join(words[(guid ** e) % mod] for e in (53, 61, 71))

if __name__ == "__main__":
    import sys
    main(*sys.argv)

Which produces output like:

oranjestad+compressing+wellspring
padlock+discommoded+blazons
pt+olenek+renews

Which is amusing. Otherwise, simply taking the first 8-10 characters of the guid or sha1/md5 hash of the guid is probably your best bet.

Aaron Maenpaa
Is an md5/sha1 hash of a guid guaranteed to be unique?
Mank
No, but the first the first 10 characters give you a space of 2 ** 40 possibilities (approximately 1 trillion) so depending on how many identifiers you're looking for the number of collisions *should* be pretty low. Add a uniqueness constraint, a fall back and log collisions.
Aaron Maenpaa
+3  A: 

The simplest thing that could possibly work is a counter that is incremented every time a value is required. Eight (left-zero-padded) digits gives you 100 million possible values 00000000 thru 99999999 (although you might interject spaces or hyphens for human readability, as in 000-000-00).

If you will need more than 100 million values, you could either increase the length or use letters in alternate positions. Using A0A0A0A0 thru Z9Z9Z9Z9 gives you over four-and-a-half billion possible values (4,569,760,000) available. It is a trivial bit of code to take a long integer and produce such an encoding (mod 10 for the rightmost digit, div by 10 then mod 26 for the rightmost letter, etc.) If you have the memory to burn, the fastest way is to convert the counter to a mod 260 array, and use each mod 260 value as an index into an array of two-character strings ("A0", "A1", "A2", and so on thru "A9", "B0", "B1", etc. thru "Z9").

The problem with base 36 (mentioned in another reply) is that you not only have to worry about reader confusion of similar characters (one vs. I, zero vs. O, two vs. Z, five vs. S) but also about combinations of adjacent letters that might be perceived by readers as spelling distasteful or obscene words or abbreviations.

joel.neely
I like your suggestion. My only concern is having to manage a global counter in the application. Thats one of the reasons I would like using a Guid as sequence is not a concern.
Mank
+6  A: 
8 characters - perfectly random - 36^8 = 2,821,109,907,456 combinations
10 characters - perfectly random - 36^10 = 3,656,158,440,062,976 combinations
GUID's - statistically unique* - 2^128 = 340,000,000,000,000,000,000,000,000,000,000,000,000 combinations

* Is a GUID unique 100% of the time? [stackoverflow]

The problem with your GUID -> character conversion; while your GUID is statistically unique, by taking any subset you decrease randomness and increase the chance of collisions. You certainly don't want to create non-unqiue SKU's.


Solution 1:

Create SKU using data relevant to the object and business rules.

i.e. There likely to be a small combination of attributes that makes an object unique (a natural key). Combine the elements of the natural key, encode and compress them to create a SKU. Often all you need is a date-time field (ie CreationDate) and a few other properties to achieve this. You're likely to have a lot of holes in sku creation, but sku's are more relevant to your users.

hypothetically:

Wholesaler, product name, product version, sku
Amazon,     IPod Nano,    2.2,             AMIPDNN22
BestBuy,    Vaio,         3.2,             BEVAIO32

Solution 2:

A method that reserves a range of numbers, and then proceeds to release them sequentially, and never returns the same number twice. You can still end up with holes in the range. Likely though you don't need to generate enough sku's to matter, but ensure your requirements allow for this.

An implementation is to have a key table in a database that has a counter. The counter is incremented in a transaction. An important point is that rather than incrementing by 1, the method in software grabs a block. pseudo-c#-code is as follows.

-- what the key table may look like
CREATE TABLE Keys(Name VARCHAR(10) primary key, NextID INT)
INSERT INTO Keys Values('sku',1)

// some elements of the class
public static SkuKeyGenerator 
{
    private static syncObject = new object();
    private static int nextID = 0;
    private static int maxID = 0;
    private const int amountToReserve = 100;

    public static int NextKey()
    {
        lock( syncObject )
        {
            if( nextID == maxID )
            {
                ReserveIds();
            }
            return nextID++;
        }
    }
    private static void ReserveIds()
    {
        // pseudocode - in reality I'd do this with a stored procedure inside a transaction,
        // We reserve some predefined number of keys from Keys where Name = 'sku'
        // need to run the select and update in the same transaction because this isn't the only
        // method that can use this table.
        using( Transaction trans = new Transaction() ) // pseudocode.
        {
             int currentTableValue = db.Execute(trans, "SELECT NextID FROM Keys WHERE Name = 'sku'");
             int newMaxID = currentTableValue + amountToReserve;
             db.Execute(trans, "UPDATE Keys SET NextID = @1 WHERE Name = 'sku'", newMaxID);

             trans.Commit();

             nextID = currentTableValue;
             maxID = newMaxID;
        }
    } 

The idea here is that you reserve enough keys so that your code doesn't go the the database often, as getting the key range is an expensive operation. You need to have a good idea of the number of keys you need to reserve to balance key loss (application restart) versus exhausting keys too quickly and going back to the database. This simple implementation has no way to reuse lost keys.

Because this implementation relies a database and transactions you can have applications running concurrently and all generate unique keys without needing to go to the database often.

Note the above is loosely based on key table, page 222 from Patterns of Enterprise Application Architecture (Fowler). The method is usually used to generate primary keys without the need of a database identity column, but you can see how it can be adapted for your purpose.

Robert Paulson