views:

844

answers:

7

Right now I'm working on a project which requires an integer to be converted to a base 62 string many times a second. The faster this conversion is completed, the better.

The problem is that I'm having a hard time getting my own base conversion methods to be fast and reliable. If I use strings, it's generally reliable and works well, but it's slow. If I use char arrays, it's generally much faster, but it's also very messy, and unreliable. (It produces heap corruption, comparison of strings that should match return a negative, etc.)

So what's the fastest and most reliable way of converting from a very large integer to a base 62 key? In the future, I plan on utilizing SIMD model code in my application, so is this operation parallelizable at all?

EDIT: This operation is performed several million times a second; as soon as the operation finishes, it begins again as part of a loop, so the faster it runs, the better. The integer being converted is of arbitrary size, and can easily be as large as a 128 bit integer (or larger).

EDIT: This is the function I am currently using.

char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = (int)(strlen(charset));

//maxChars is an integer specifying the maximum length of the key
char* currentKey = new char[maxChars];

void integerToKey(unsigned long long location)
{
    unsigned long long num = location;
    int i = 0;

    for(; num > 0; i++)
    {
            currentKey[i] = charset[num % (charsetLength)];
            num /= charsetLength + 1;
    }

    currentKey[i + 1] = '\0';
}

I ripped this out of a class that is part of my application, and some of the code is modified so that it makes sense sans its owning class.

+1  A: 

I feel bad because I cant remember where I originally found this, but I have been using this in my code and have found it to be pretty fast. You could modify this to be more efficient in certain places I am sure.

Oh and I feel worse because this is written in Java, but a quick c&p and refactor could get it working in c++

public class BaseConverterUtil {

     private static final String baseDigits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

     public static String toBase62( int decimalNumber ) {
         return fromDecimalToOtherBase( 62, decimalNumber );
     }

     public static String toBase36( int decimalNumber ) {
         return fromDecimalToOtherBase( 36, decimalNumber );
     }

     public static String toBase16( int decimalNumber ) {
         return fromDecimalToOtherBase( 16, decimalNumber );
     }

     public static String toBase8( int decimalNumber ) {
         return fromDecimalToOtherBase( 8, decimalNumber );
     }

     public static String toBase2( int decimalNumber ) {
         return fromDecimalToOtherBase( 2, decimalNumber );
     }

     public static int fromBase62( String base62Number ) {
         return fromOtherBaseToDecimal( 62, base62Number );
     }

     public static int fromBase36( String base36Number ) {
         return fromOtherBaseToDecimal( 36, base36Number );
     }

     public static int fromBase16( String base16Number ) {
         return fromOtherBaseToDecimal( 16, base16Number );
     }

     public static int fromBase8( String base8Number ) {
         return fromOtherBaseToDecimal( 8, base8Number );
     }

     public static int fromBase2( String base2Number ) {
         return fromOtherBaseToDecimal( 2, base2Number );
     }

     private static String fromDecimalToOtherBase ( int base, int decimalNumber ) {
         String tempVal = decimalNumber == 0 ? "0" : "";
         int mod = 0;

         while( decimalNumber != 0 ) {
             mod = decimalNumber % base;
             tempVal = baseDigits.substring( mod, mod + 1 ) + tempVal;
             decimalNumber = decimalNumber / base;
         }

         return tempVal;
     }

     private static int fromOtherBaseToDecimal( int base, String number ) {
         int iterator = number.length();
         int returnValue = 0;
         int multiplier = 1;

         while( iterator > 0 ) {
             returnValue = returnValue + ( baseDigits.indexOf( number.substring( iterator - 1, iterator ) ) * multiplier );
             multiplier = multiplier * base;
             --iterator;
         }
         return returnValue;
     }

 }
instanceofTom
+3  A: 

Off the top of me head I'd expect an implementation to look a lot like this.

const char lookUpTable[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 
  'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
  'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
  'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' };

std::string ConvertToBase62( int integer )
{
   char res[MAX_BASE62_LENGTH];
   char* pWritePos = res;
   int leftOver = integer;
   while( leftOver )
   {
      int value62     = leftOver % 62;     
      *pWritePos      = lookUpTable[value62];
      pWritePos++;

      leftOver        /= value62;
   }
   *pWritePos = 0;    

   return std::string( res );
}

At the moment this isn't very SIMD optimisable. There is no SIMD modulo.

If we do Modulo ourselves we could in turn rewrite the loop as follows.

   while( leftOver )
   {
      const int newLeftOver = leftOver / 62;
      int digit62     = leftOver - (62 * newLeftOver);     
      *pWritePos      = lookUpTable[digit62];
      pWritePos++;

      leftOver        = newLeftOver;
   }

Now we have something that would be easy to SIMD if it weren't for that lookup ...

Though you can still get a good speed improvement by doing the modulo for several values simultaneously. It'd probably even be worth unrolling the loop a second time so you can process the next 4 or so modulos while the previous set are calculating (Due to instruction latency). You should be able to hide latencies pretty effectively this way. #

I'll come back if i can think of a way to eliminate the table lookup ...

Edit: That said as the maximum number of base62 digits you can get from a 32-bit integer is 6 you should just be able to fully unwind the loop and process all 6 digits simultaneously. I'm not entirely sure SIMD would give you much of a win here. It'd be an interesting experiment but I do really doubt you'd get all that much of a speed up over the loop above. Would be interesting to try it if someone hadn't poured tea over my dev machine's keyboard :(

Edit 2: while i think about it. A constant / 62 can be craftily optimised by the compiler using scary magic numbers ... so I don't even reckon the loop above would do a divide.

Goz
+3  A: 

Probably what you want is some version of itoa. Here is a link that shows various versions of itoa with performance tests: http://www.jb.man.ac.uk/~slowe/cpp/itoa.html

In general, I know of two ways to do this. One way it to perform successive divisions to strip off one digit at a time. Another way is to precompute conversions in "blocks". So you could precompute a block of int to text conversion of size 62^3 then do the digits 3 at a time. Provided you do the memory layout and lookup efficiently this can be slightly faster at runtime but incurs a startup penalty.

Corwin Joy
It would be interesting to compare your method to oen doing each digit seperately. The only issue I can imagine is that a 62^3 (@238K table) would cache very poorly. Though that probably wouldn't matter much on a modern PC due to the big L2 caches ...
Goz
ooops forgot you'd need to return 3 characters. That prings you up to nearer 3/4 of a meg for a lookup.
Goz
This precompute method sounds like it would offer a great speed increase, as this operation is performed several million times every second. Any speed increase during the operation of the program would compensate for the startup penalty by a large margin. I'll take a look at doing this.
GenTiradentes
I actually got this method working, and it increased the overall speed of the application by 80%. Thanks a lot!
GenTiradentes
+2  A: 

there are reversing issues in the above - the low orders come first in the generated string - I don't know if that is actually an issue because it depends on the subsequent usage of the generated string.

Generally this sort of radix conversion can be accelerated by doing it in radix*radix chunks In your case a char[2][62*62] is needed. This array can be constructed at initialization time(it is const).

This must be benchmarked though. The divide cost used to be HUGE so saving half the divides was a sure win. It depends on the ability to cache this 7000+ byte table and the cost of the divide.

pgast
+1  A: 

If you are getting heap corruption, you have issues beyond the code you're showing here.

You can make the string class faster by reserving the space for the string before you begin, with string::reserve.

Your string is coming out in reverse order, the lower order base-62 digit is the first character in the string. This might explain your comparison issues.

Mark Ransom
+1  A: 

Your implementation is pretty much as quick as it's going to get. I would suggest a couple of changes though:

void integerToKey(unsigned long long location)
{
    unsigned long long num = location;
    int i = 0;
    for(; num > 0; i++)
    {
            currentKey[i] = charset[num % (charsetLength)];
            num /= charsetLength; // use charsetLength
    }
    currentKey[i] = '\0'; // put the null after the last written char
}

The first change (divide by charsetLength) may have been causing your string comparison problems. With your original code (dividing by charsetLength + 1), there may be different values of integer that incorrectly get converted to the same string. For base 62, then both 0 and 62 would be encoded as "0".

It's hard to say whether either of the above changes would be causing your reported heap corruption problems, without a bit more context (such as the value of maxChars).

Also, you should be aware that the above code will write the digits of the string representation in reverse order (try it with base 10 and convert a number such as 12345 to see what I mean). This may not matter for your application, though.

Greg Hewgill
It doesn't matter. The intended function is to convert a number to a text string in a way that allows for sequential and consistent generation of every key in a keyspace.
GenTiradentes
Also, I divided by (charsetLength + 1) because when I simply divide by charsetLength, the first character in the charset is left out of the keyspace.So with the charset being defined like so: char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";Divinding by charsetLength means that zero is excluded when converting to base 62. When I divide by (charsetLength + 1) the problem is resolved.
GenTiradentes
You know, if you're doing sequential-only generation, there's no need to convert the whole integer every time. Start with "0", and write a function to increment the (base62) digits by hand performing the carry yourself. That will be faster than recomputing the base62 representation every time, and will handle unlimited precision.
Greg Hewgill
It's actually not purely sequential. I need to be able to divide portions of the keyspace among threads, as well as external networked clients, which will be handled in the future. Thanks though!
GenTiradentes
A: 

Here's a solution I use in php for Base 10 to N (62 in this example)
My whole post is over here: http://ken-soft.com/?p=544

public class BNID {
        // Alphabet of Base N (This is a Base 62 Implementation)
        var $bN = array(
            '0','1','2','3','4','5','6','7','8','9',
            'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
            'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'
        );

        var $baseN;

        function __construct() {
            $this->baseN = count($this->bN);
        }

        // convert base 10 to base N
        function base10ToN($b10num=0) {
            $bNnum = "";
            do {
                $bNnum = $this->bN[$b10num % $this->baseN] . $bNnum;
                $b10num /= $this->baseN;
            } while($b10num >= 1);     
            return $bNnum;
        }

        // convert base N to base 10
        function baseNTo10($bNnum = "") {
           $b10num = 0;
            $len = strlen($bNnum);
            for($i = 0; $i < $len; $i++) {
                $val = array_keys($this->bN, substr($bNnum, $i, 1));
                $b10num += $val[0] * pow($this->baseN, $len - $i - 1);
            }
            return $b10num;
        }

}
KennyCason