views:

323

answers:

7

I need to efficently insert a 5 character RANDOM string into a database while also ensuring that it is UNIQUE. Generating the random string is not the problem, but currently what I am doing is generating the string and then checking the DB if it exists already... if it does, I start over.

Is there a more efficient way to do this process?

Please note, I do NOT want to use GUID or anything else that is more than 5 Characters.... I MUST stick to 5 Characters.

PS: I don't think it makes a difference, but my strings are all case sensitive.

Here is the "Random String" portion

    Public Function GetRandomNumbers(ByVal numChars As Integer) As String
    Dim chars As String() = { _
     "A", "B", "C", "D", "E", "F", _
     "G", "H", "I", "J", "K", "L", _
     "M", "N", "O", "P", "Q", "R", _
     "S", "T", "U", "V", "W", "X", _
     "Y", "Z", "0", "1", "2", "3", _
     "4", "5", "6", "7", "8", "9", _
     "a", "b", "c", "d", "e", "f", _
     "g", "h", "i", "j", "k", "l", _
     "m", "n", "o", "p", "q", "r", _
     "s", "t", "u", "v", "w", "x", _
     "y", "z"}
    Dim rnd As New Random()
    Dim random As String = String.Empty
    Dim i As Integer = 0
    While i < numChars
        random += chars(rnd.[Next](0, 62))
        System.Math.Max(System.Threading.Interlocked.Increment(i), i - 1)
    End While
    Return random
End Function
+1  A: 

You could generate a GUID and only use the first 5 characters?

Mark Redman
That's just another way of generating a random string, you still have to check for duplicates.
Guffa
Was my first thought too, although he'll have to generate 5 additional bits for a case sensitive string.
schnaader
+1  A: 

Is randomness more important, or is uniqueness more important? -- note that I said "more" important; I get the fact that you need both.

If randomness is more important, then you're going to need some way to track historical values. The database itself (with an appropriate index) is going to be the best way to do this.

If uniqueness is more important, then simply use a counter and zero-pad it to five digits. This will, of course, limit you to 100,000 rows, so you could alternatively use a counter and a transformation into character space (eg, 1 = "A", 2 = "B", 27 = "AA", and so on).

kdgregory
+8  A: 

Create a table with a big pool of 5-character strings that are added in sequence (so they are unique), and have a GUID as their primary key. Add a column to indicate whether they are used or not.

When you need a new number, you select top 1 from the pool, order by the guid (so it becomes random), and set the result as "spent".

Console
This creates an additional table, but will be unique, random and use the most possible values without having to continually search the current values. The OP's original solution will take longer and longer as the number of rows increase.
Edward Leno
So I'm assuming that there will be a significant amount of work up front in generating the initial random strings.
rockinthesixstring
Instead of adding a column to indicate whether they are used, why not just delete them as they are used? Makes the query faster and easier to write.
JohnFx
I will be continually using them... they never expire.
rockinthesixstring
For a base 54 random string the table size is 459 million and all your doing is punting the same problem from one table to another unless a predictable sequence is used which is then not "random".
Einstein
FYI. My final solution is to create a unique db field (specified here: http://stackoverflow.com/questions/1514253/sql-server-2008-unique-column-that-is-case-sensitive) and then write a simple console app that inserts a generated random string via a "try catch" block. If it fails the try... it just loops to a new insert. I'm storing the values in a separate table and then will attach to them via a JOIN referencing the ID (int) field.
rockinthesixstring
@Einstein: You don't need to pre-generate all the possible strings, just enough so that it seems like the numbers are random. You can have the procedure that "picks" numbers generate new ones as needed. (Obviously, a sequence of 5 characters is only random in a very limited sense of the word. As you point out, if you have 459 million of them you know which ones they are).
Console
+1  A: 

There is a method to pick unused unique words by random, but it's probably not going to be any better than what you are doing now.

The principle is that you determine which permutations of the words that are unused, generate a random number based on how many unused permuations there are, and pick that one.

If you for example would use a word with three characters, and only the character 0 and 1, there are eight possible permutations. If you already used the combinations "010" and "100", you would get something that looks like this:

PI = permutation index
UI = unused permutation index

No. PI UI
----------
000 0  0
001 1  1
010 2  -
011 3  2
100 4  -
101 5  3
110 6  4
111 7  5

To pick an unused permutation, you simply generate a random number from 0 to 5, and pick the corresponding permutation.

Keeping a list of all possible permuations is of course not practical, so you would need a function that can determine the permutation index from the string, and one function that can determine the string from the permutation index.

Also, to determine which permutations are unused, you have to check which are used, so you still have to query the table at some point.

Guffa
A: 

If you are inserting the string to an existing, populated, table then you will always need to check if the string doesn't exist there (it doesn't have to be an explicit SELECT). You can either to it manually, or have an UNIQUE constraint on the column and let the database to do it. So if the database returns an error because the string is already there, generate another one.

Note that if you have an empty table and want to populate it with multiple random strings, it's a different problem.

Lukáš Lalinský
A: 

I think you should stick to your origional idea. Putting a unique constraint on the index and letting the database check/report dupes for you would be fairly effecient method of dupe checking but this assumption depends on some information not provided like the number of rows and likelyhood of encountering dupes with randomly selected data.

Fully pre-populating a unique sequence pool with your parameters requires a 459 million row table.

You could use a bloom filter to load managable statistics into a database or main memory and avoid dupes but depending on the row count and filter configuration this may lead to saturation of the filter when the row count is an appreciable percentage of the 459 million limit.. Since the filter can report false positives you should work to ensure that you don't get into a situation where your system is stuck trying permutations that pass the filter approaching forever.

Einstein
A: 

As you know how long your word shall be, why not a tree-based approach? (Let's call it randomised tree walk)

Say your word has n characters. Generate a list of all symbols s in S and relate a counter for each symbol and possible position in the string, essentially a matrix M of dimensions s times n. Now roll your dice and choose the first letter and lookup M(s,1). If M(s,1) is larger or equal the number of possible words starting with s, roll again. Otherwise increment M(s,1).

Repeat this for every letter 1 till n.

Should be pretty fast until you have used up to many words.

Don Johe