Hi,

I wrote a simple program that uses SSE intrinsics to compute the inner product of two large vectors (100,000 or more elements). The program compares the execution time of the inner product computed the conventional way against the intrinsics version. Everything works out fine until I insert (just for the fun of it) an inner loop before the statement that computes the inner product. Before I go further, here is the code:

    //this is a sample Intrinsics program to compute inner product of two vectors and compare Intrinsics with traditional method of doing things.

        #include <iostream>
        #include <iomanip>
        #include <xmmintrin.h>
        #include <stdio.h>
        #include <time.h>
        #include <stdlib.h>
        using namespace std;

        typedef float v4sf __attribute__ ((vector_size(16)));

        double innerProduct(float* arr1, int len1, float* arr2, int len2) {  //assume len1 = len2.

          float result = 0.0;
          for(int i = 0; i < len1; i++) {
            for(int j = 0; j < len1; j++) {
              result += (arr1[i] * arr2[i]);
            }
          }

         //float y = 1.23e+09;
         //cout << "y = " << y << endl;
         return result;
        }

        double sse_v4sf_innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume that len1 = len2.

          if(len1 != len2) {
            cout << "Lengths not equal." << endl;
            exit(1);
          }

          /* steps:
           * 1. load four floats (one v4sf) from each array.
           * 2. multiply the two.
           * 3. add the product to the previous sums.
           */

          v4sf arr1Data, arr2Data, prevSums, multVal, xyz;
          //__builtin_ia32_xorps(prevSums, prevSums);   //making it equal zero.
         //can explicitly load 0 into prevSums using loadps or storeps (Check).

          float temp[4] = {0.0, 0.0, 0.0, 0.0};
          prevSums = __builtin_ia32_loadups(temp);
          float result = 0.0;

          for(int i = 0; i < (len1 - 3); i += 4) {
            for(int j = 0; j < len1; j++) {
              arr1Data = __builtin_ia32_loadups(&arr1[i]);
              arr2Data = __builtin_ia32_loadups(&arr2[i]);  //load four floats from each array.
              multVal = __builtin_ia32_mulps(arr1Data, arr2Data);   //multiply.
              xyz = __builtin_ia32_addps(multVal, prevSums);
              prevSums = xyz;
            }
          }
          //prevSums will hold the sums of 4 32-bit floating point values taken at a time. Individual entries in prevSums also need to be added.
          __builtin_ia32_storeups(temp, prevSums);  //store prevSums into temp.

           cout << "Values of temp:" << endl;
           for(int i = 0; i < 4; i++)
             cout << temp[i] << endl;

          result += temp[0] + temp[1] + temp[2] + temp[3];

        return result;
        }

        int main() {
          clock_t begin, end;
          int length = 100000;
          float *arr1, *arr2;
          double result_Conventional, result_Intrinsic;

 //         printStats("Allocating memory.");
          arr1 = new float[length];
          arr2 = new float[length];
 //         printStats("End allocation.");

          srand(time(NULL));  //init random seed.
 //         printStats("Initializing array1 and array2");
          begin = clock();
          for(int i = 0; i < length; i++) {
         //   for(int j = 0; j < length; j++) {
          //    arr1[i] = rand() % 10 + 1;
                arr1[i] = 2.5;
           //    arr2[i] = rand() % 10 - 1;
                arr2[i] = 2.5;
         //   }
          }
          end = clock();
          cout << "Time to initialize array1 and array2 = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
  //        printStats("Finished initialization.");

    //      printStats("Begin inner product conventionally.");
          begin = clock();
          result_Conventional = innerProduct(arr1, length, arr2, length);
          end = clock();
          cout << "Time to compute inner product conventionally = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
    //      printStats("End inner product conventionally.");

      //    printStats("Begin inner product using Intrinsics.");
          begin = clock();
          result_Intrinsic = sse_v4sf_innerProduct(arr1, length, arr2, length);
          end = clock();
          cout << "Time to compute inner product with intrinsics = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
          //printStats("End inner product using Intrinsics.");

          cout << "Results: " << endl;
          cout << " result_Conventional = " << result_Conventional << endl;
          cout << " result_Intrinsics = " << result_Intrinsic << endl;
        return 0;
        }

I use the following g++ invocation to build this:

 g++ -W -Wall -O2 -pedantic -march=i386 -msse intrinsics_SSE_innerProduct.C -o innerProduct  

The nested loops in each function execute the multiply-add a total of N^2 times. Given that arr1 and arr2 (the two floating-point vectors) are filled with the value 2.5 and the array length N is 100,000, the result in both cases should be 2.5 * 2.5 * N^2 = 6.25e+10. The results I get are:

Results:
result_Conventional = 6.25e+10
result_Intrinsics = 5.36871e+08

This is not all. The value returned from the function that uses intrinsics seems to "saturate" at the value above. I tried other values for the array elements and different sizes too, but any element value above 1.0 and any size above 1000 produces the same saturated value shown above.

Initially, I thought it might be because all operations within SSE are done in single precision, but a float should certainly be able to store a number on the order of 1e+08.

I am trying to see where I could be going wrong but cannot seem to figure it out. I am using g++ version: g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2).

Any help on this is most welcome.

Thanks,
Sriram.

+2  A: 

The problem that you are having is that while a float can store 6.25e+10, it only has about seven significant decimal digits of precision (a 24-bit significand).

This means that when you build a large number by adding lots of small numbers together a bit at a time, you reach a point where each small addend is less than half a unit in the last place (ulp) of the running total, so adding it has no effect.

As to why you are not getting this behaviour in the non-intrinsic version: it is likely that the result variable is being held in a register that uses a higher precision than the actual storage of a float, so it is not being truncated to float precision on every iteration of the loop. You would have to look at the generated assembler code to be sure.

Charles Bailey
But if I remove the inner loop (`for(int j = 0; j < len1; j++)`), I get the correct answer for different array lengths (for both functions). The problem I mentioned arises only when the inner loop is present. Also, if loss of precision when adding very small numbers were the problem, it should have arisen without the inner loop too. Array values of 2.5 (and 2.5^2) are not so small as to lose precision. Please correct me if I am wrong.
Sriram
Re-read his answer. He says that this occurs when the running total is already large. So if you have reached 2.5e+08, then adding 2.5 may make no difference. You should try replacing the float with a double and seeing if there is a difference.
DeadMG
I just replaced "result" with double. There is no difference. The output I get are the same as those I have mentioned above.
Sriram
Which `result` did you change? If the non-intrinsic version then I'd expect no change - it's correct already. In the intrinsic version it makes no difference because you are using a `v4sf` as storage for your critical intermediate.
Charles Bailey
I changed the return type of result in both functions.
Sriram
How can I lookup the generated assembler code? What do I look for within?
Sriram
@Sriram: Pass the `-S` option to gcc. http://stackoverflow.com/questions/137038/
Charles Bailey
I have the assembler code generated by g++. I am unable to make any sense of it. It is also 11000+ lines long. How can I post it here? Any suggestions?
Sriram
@Sriram: It shouldn't be that long; I tried it, and the assembler for `innerProduct` with your compiler options is less than 60 lines. If you inspect it (there's no need to post it) you should see that the code uses the 387 register stack to store the accumulated result, which gives you 80 bits of precision for the accumulator in this case.
Charles Bailey
Hi, I changed the code a bit. Each time the inner loop ends execution, I add prevSums to another v4sf variable. prevSums is initialized to zero every time the inner loop begins. That seems to solve the problem that Charles pointed out. Is there another way?
Sriram
@Sriram: If you know that you're going to be accumulating a large number from very small pieces and you must use a narrow floating-point type, then doing it in "batches" as you suggest is a fairly standard approach.
Charles Bailey
And one last question. I ran the above program on a 100,000-element array, computing the inner product in "batches", and I get a speed-up of about 2x. Is that all I can hope for? I had thought that using intrinsics would give me a 4x boost. Any suggestions on how to get that extra speed, or why I do not seem to be getting it?
Sriram