I want to uniquely shorten string/file ids to use in URLs, like the ones on bit.ly etc. I can use ids from a DB, but I want the URLs to look random.

What would be the best solution?

The site will be a mobile site, so I want the URLs to be as short as possible.

+3  A: 

You can't "uniquely shorten" arbitrary strings. Pigeonhole principle and all.

What you want to do (and, AFAIK what url-shortening services do) is keep a database of everything submitted, and the short string used. Then you can look it up in the database.

You can generate the short strings by simply incrementing a number and Base64-encoding it each time.
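
For instance, a minimal sketch in C# of that scheme (the URL-safe alphabet and the Encode name are my own, not part of the answer):

// Map an auto-incremented database id to a short token, using a
// URL-safe Base64-style alphabet (no '+', '/' or '=' to escape).
const string Alphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

static string Encode (long id)
{
    var sb = new System.Text.StringBuilder ();
    do
    {
        sb.Insert (0, Alphabet[(int)(id % 64)]);
        id /= 64;
    } while (id > 0);
    return sb.ToString (); // e.g. id 125 -> "B9"
}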

Anon.
I thought of that, but wouldn't it be a bit expensive from the DB-querying point of view?
nLL
This is what databases are designed for.
Anon.
+1 increment and Base64 enc. That seems to be exactly what the OP is asking for.
gWiz
While the Base64 encoding is random-"like" (indeed what the OP asked for), it would be easy for a savvy programmer to recognize the trailing "==" on the string and Base64-decode it to get the number. He wants to avoid numeric ids, perhaps because his id is an identity field that indicates the number of records, and he doesn't want that to be known.
Samuel Meacham
How would one use this? Create a record in the DB as id/filename/encoded string, e.g. (1/myfile.txt/MQ==) or (222/myfile1.txt/MjIy)? But what if I want the generated string to be case-insensitive and alphanumeric, the way bit.ly works? What's the solution: create a random string/number generator? 'myfile.txt' in Base64 is 'bXlmaWxlLnR4dA==', so what would you store it as? If I want a really short URL, I would also need to…
kiev
A: 

Store a random alphanumeric string and use that for your short URL. Make it the length that you think is best for your site and its users, something like www.yoursite.com/d8f3
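
For example, a quick sketch of that, including the uniqueness check discussed in the comments below (the db.Exists and db.Save helpers are hypothetical):

using System.Security.Cryptography;

// Generate a random token from a case-insensitive alphanumeric alphabet.
static string NewToken (int length)
{
    const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
    var bytes = new byte[length];
    using (var rng = RandomNumberGenerator.Create ())
        rng.GetBytes (bytes);
    var chars = new char[length];
    for (int i = 0; i < length; i++)
        chars[i] = alphabet[bytes[i] % alphabet.Length]; // slight modulo bias; fine here
    return new string (chars);
}

// Re-draw until the token is unused, then store it:
// string token;
// do { token = NewToken (4); } while (db.Exists (token));
// db.Save (token, longUrl);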

RHicke
Oi, I don't like handing out -1's, but Guid truncation got one, so random bytes gets one too. Maybe if you add the "write a collision handler" clause I could recant, but advocating random values in place of unique values is just plain wrong :(
johnny g
I said STORE them. During his storage process he can check that each one is unique before placing it into his database as the unique key. I guess it's my fault for assuming that he isn't stupid enough not to write a collision handler.
RHicke
A: 

You could use a hash (for example CRC32) to produce quite short URLs. You will never be able to get 'unique' URLs this way: since you are reducing the data, there have to be collisions.
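
For instance, a sketch assuming the System.IO.Hashing NuGet package (any 32-bit hash would do); as noted above, distinct URLs can collide, so a check before storing is still required:

using System;
using System.Text;
using System.IO.Hashing; // NuGet package System.IO.Hashing

static string ShortToken (string url)
{
    // CRC32 reduces any input to 4 bytes, hence 8 hex characters.
    byte[] hash = Crc32.Hash (Encoding.UTF8.GetBytes (url));
    return Convert.ToHexString (hash).ToLowerInvariant (); // .NET 5+
}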

Adam Pope
+1  A: 

There are two approaches to implementing a mapping service like the one you describe.

  1. Clients submit globally unique ids, or
  2. Server generates globally unique ids

Clients submit globally unique ids

As far as I know, option 1 should only be attempted with Guids, unless you devise a similar means to cram sufficiently distinct information into a short byte stream. Either way, if you have a stream of bytes that represents a globally unique identifier, you may do something like this

// source is either a Guid, or some other globally unique byte stream
byte[] bytes = Guid.NewGuid ().ToByteArray ();

// strip the trailing "==" padding; note '+' and '/' are not URL-safe
string base64String = Convert.ToBase64String (bytes).TrimEnd ('=');

to obtain a user-readable string of alphanumerics that appears random, but avoids the collisions inherent in other random schemes. A Guid contains 16 bytes, or 128 bits, which translates to 22 characters of Base64 (24 before the two '=' padding characters are trimmed).

The advantage to this approach is that clients may generate their own tiny Uris without a central authority. The downside is hefty length if you roll with Guid, or the error-prone business of implementing your own globally unique byte stream.

If you do go this route, consider Googling "globally unique byte streams" or the like. Oh, and STAY AWAY FROM RANDOM BYTES, otherwise you will have to build collision resolution ON TOP OF your tiny Uri generator.

Server generates globally unique ids

Again, the primary advantage to the above is that clients may generate their Uris a priori. That is particularly handy if you are about to submit a long-running request you wish to check up on. This may not be particularly relevant to your situation, and may provide only limited value.

So, that aside, a server-centric approach, in which a single authority generates and doles out ids, may be more appealing. If this is the route you choose, then the only question is how long you would like your Uri to be.

Presuming a desired length of 5 characters, and let's say you go with a Base64 encoding, each character carries 6 bits, so 5 characters give 5 × 6 = 30 bits, or 2^30 [1 073 741 824] distinct values. That's a fairly large domain. *

Then it becomes a question of returning a value for a given submission. There are probably a great many ways to do this, but I would go with something like this,

  • Enumerate all possible values within a "free list" in your database
  • Remove value from free list when consumed
  • Add value to free list when released

Enhancements or optimizations may include

  • Do not enumerate every value on the range [0, 2^30); instead enumerate a manageable subset, say 100 000 values at a time, and when all values are consumed, simply generate another 100 000 values in sequence and continue
  • Add an expiry date to values, and recycle expired values end of day
  • Distribute your service, when parallelizing your service simply dole out small mutually exclusive subsets of your free list to distributed services
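
A minimal in-memory sketch of the free-list idea above, folding in the batching enhancement (the class and member names are my own; a real service would back this with a database table and a transaction per take):

using System.Collections.Concurrent;
using System.Text;

class FreeList
{
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    readonly ConcurrentQueue<string> free = new ConcurrentQueue<string> ();
    long next; // next raw value to pre-generate from

    // Enumerate a manageable batch instead of the whole [0, 2^30) range.
    public void Refill (int count)
    {
        for (int i = 0; i < count; i++)
            free.Enqueue (Encode (next++));
    }

    // Remove a value from the free list when it is consumed.
    public string Take ()
    {
        string id;
        while (!free.TryDequeue (out id))
            Refill (100000);
        return id;
    }

    // Return a value to the free list when it is released (e.g. expired).
    public void Release (string id)
    {
        free.Enqueue (id);
    }

    static string Encode (long n)
    {
        var sb = new StringBuilder ();
        do { sb.Insert (0, Alphabet[(int)(n % 64)]); n /= 64; } while (n > 0);
        return sb.ToString ();
    }
}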

Conclusion

Bottom line is, you want to guarantee uniqueness - so collisions are a big no-no.


* 1 073 741 824 = 64^5 = 2^30 counts ids of exactly 5 characters. If you let ids range from length 0 to length 5, the raw domain is 64^0 + 64^1 + ... + 64^5 = 1 090 785 345; restricting to a minimum and maximum length of 5 takes that back down to 64^5 = 1 073 741 824, which is still quite large :)

johnny g
A: 

Hey nLL, as several other people have told you: if you start compressing the URL into something small, it will be impossible to keep it unique. That said, you need to make your own coding for every URL submitted to you. One easy way to do it is to create a database from the submitted URLs, generate a Guid field for each, and then take a substring from it, ensuring that everything you register is different from the previous entries.

For instance: www.google.com with the guid F9168C5E-CEB2-4faa-B6BF-329BF39FA1E4 -> http://www.mysite.com/?q=CEB2

The more characters you use, the more links you can keep track of. For this sample you will have 65 536 different links (with only 4 hex characters).
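
A sketch of that idea (db.Exists and db.Save are hypothetical stand-ins for your data access); as the comment below points out, a Guid substring is NOT guaranteed unique, so the loop must re-check against the database:

string token;
do
{
    token = Guid.NewGuid ().ToString ("N").Substring (0, 4); // e.g. "ceb2"
} while (db.Exists (token)); // re-check: a Guid substring is not unique
db.Save (token, "http://www.google.com");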

Hope this helps.

Sorry, -1 for truncating Guids. Just poor, poor practice. Only an entire Guid is guaranteed to be unique; taking any portion or subset of a Guid is not. Every-little-bit-counts.
johnny g
I know, Johnny, and you can't find in any of my posts a reference where I am saying a subset of the link is unique...