views:

606

answers:

3

What is the most efficient for speed algorithm to solve the following problem?

Given 6 arrays, D1,D2,D3,D4,D5 and D6 each containing 6 numbers like:

D1[0] = number              D2[0] = number      ......       D6[0] = number
D1[1] = another number      D2[1] = another number           ....
.....                       ....                ......       ....
D1[5] = yet another number  ....                ......       ....

Given a second array ST1, containing 1 number:

ST1[0] = 6

Given a third array ans, containing 6 numbers:

ans[0] = 3, ans[1] = 4, ans[2] = 5, ......ans[5] = 8

Using as index for the arrays D1,D2,D3,D4,D5 and D6, the number that goes from 0, to the number stored in ST1[0] minus one, in this example 6, so from 0 to 6-1, compare the ans array against each D array. The result should be 0 if one or more ans numbers are not found in any D at the same index and should be 1 if all ans numbers are found in some D at the same index. That is, return 0 if some ans[i] doesn't equal any DN[i] and return 1 if every ans[i] equals some DN[i].

My algorithm so far is:
I tried to keep everything unlooped as much as possible.

EML  := ST1[0]   //number contained in ST1[0]   
EML1 := 0        //start index for the arrays D 

While EML1 < EML
   if D1[ELM1] = ans[0] 
     goto two
   if D2[ELM1] = ans[0] 
     goto two
   if D3[ELM1] = ans[0] 
     goto two
   if D4[ELM1] = ans[0] 
     goto two
   if D5[ELM1] = ans[0] 
     goto two
   if D6[ELM1] = ans[0] 
     goto two

   ELM1 = ELM1 + 1

return 0     //If the ans[0] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers


two:

EML1 := 0      start index for arrays Ds 
While EML1 < EML
   if D1[ELM1] = ans[1] 
     goto three
   if D2[ELM1] = ans[1] 
     goto three
   if D3[ELM1] = ans[1] 
     goto three
   if D4[ELM1] = ans[1] 
     goto three
   if D5[ELM1] = ans[1] 
     goto three
   if D6[ELM1] = ans[1] 
     goto three
   ELM1 = ELM1 + 1

return 0    //If the ans[1] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

three:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[2] 
     goto four
   if D2[ELM1] = ans[2] 
     goto four
   if D3[ELM1] = ans[2] 
     goto four
   if D4[ELM1] = ans[2] 
     goto four
   if D5[ELM1] = ans[2] 
     goto four
   if D6[ELM1] = ans[2] 
     goto four
   ELM1 = ELM1 + 1

return 0   //If the ans[2] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

four:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[3] 
     goto five
   if D2[ELM1] = ans[3] 
     goto five
   if D3[ELM1] = ans[3] 
     goto five
   if D4[ELM1] = ans[3] 
     goto five
   if D5[ELM1] = ans[3] 
     goto five
   if D6[ELM1] = ans[3] 
     goto five
   ELM1 = ELM1 + 1

return 0 //If the ans[3] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers


five:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[4] 
     goto six
   if D2[ELM1] = ans[4] 
     goto six
   if D3[ELM1] = ans[4] 
     goto six
   if D4[ELM1] = ans[4] 
     goto six
   if D5[ELM1] = ans[4] 
     goto six
   if D6[ELM1] = ans[4] 
     goto six
   ELM1 = ELM1 + 1

return 0  //If the ans[4] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

six:

EML1 := 0      start index for arrays Ds 

While EML1 < EML
   if D1[ELM1] = ans[5] 
     return 1            ////If the ans[1] number is not found in either D1[0-6].....  
   if D2[ELM1] = ans[5]      return 1 which will then include ans[0-6] numbers
     return 1
   if D3[ELM1] = ans[5] 
     return 1
   if D4[ELM1] = ans[5] 
     return 1
   if D5[ELM1] = ans[5] 
     return 1
   if D6[ELM1] = ans[5] 
     return 1
   ELM1 = ELM1 + 1

return 0 

As language of choice, it would be pure c

+1  A: 

I was a little bit confused by your question, but I think I have it enough to help you get started.

#define ROW 6
#define COL 6

int D[ROW][COL]; // This is all of your D arrays in one 2 dimensional array.

Next you should probably use nested for loops. Each loop will correspond to a dimension of D. Remember that the order of your indexes matters. The easiest way to keep it straight in C is to remember that D[i] is a valid expression even when D has more than one dimension (and would evaluate to a pointer to a row : a sub array).

If you can't change the independent D arrays to one multidimensional array you could easily make a pointer array whose members point to the heads of each of those arrays and achieve the same effect.

Then you can use the break statement to break out of the inner loop after you have determined that the current D[i] doesn't match the ans.

nategoose
well I do not want to use 2 dimension array, I need to have 6 different array, and possible as unlooped as possible
Mark
What I am interested is to speed up, that code, which I put in algorithm form, and it has to be with 1 dimension arrays
Mark
If you compiled it with a compiler that was capable of loop unrolling with that optimization turned on then it would likely result in something similar to what you were trying to achieve with gotos without upsetting your instructor.
nategoose
@mark: also gotos have another effect than upsetting instructors, they empty processor pipe-lines (that's true of any jump, so in this regard loops are no better). But whenever you can express your program without using any kind of goto/branch/jump even if you execute more instructions than strictly necessary you speedup things. I believe there is a track to follow there (will give it a try).
kriss
Well I tried not to use goto and use bool variables and break, but it did not work, and besides that means more memory. Is goto real a problem in programming or is it a matter of form?
Mark
@mark: Just because you declare a variable doesn't mean that the compiler has to use up memory for it. Modern compilers are very smart. Goto is a problem because using labels and goto often results in unreadable code. If the reader of your code has to constantly look for labels of gotos to understand the code then they are horrible. I'm not totally against them, but I'm not who you have to impress. I'm fairly certain that if you code this in C with loops but compile with high optimization you will end up with good code. If this is in a function then you may want to use the restrict keyword.
nategoose
@mark: Another thing you could try that would fit what you have and technically not have "goto" in it would be to put all of this in a `do{}while(0)` and let break be your goto. Or `if (D1[ELM1] != ans[0] return 0;}` Also, you should not that with code that does a lot of jumping already trying to avoid loops doesn't save you what you as much as with non-jumpy code.
nategoose
Mark
I tried lot of combination, some did not work and some were slower than the actual implementation
Mark
You should have said CUDA before. That changes things and explains your otherwise confusing rules. Try `if ((D1[ELM1] - ans[0]) return 0; }`If it weren't for the return I'd also suggest _trying_ `x=((D1[ELM1]-ans[0]) ELM1 += x/x;` I suggested the self divide to someone else on here doing GPU work as a possible jumpless way of getting 1 as long as 0/0 results in 0 or a trap, and they said that it worked for them.
nategoose
+5  A: 

I did a direct trivial C implementation of the algorithm provided by the original poster. It is here

As other proposed the first thing to do is to roll up the code. Unrolling is not really good for speed as it lead to code cache misses. I began by rolling inner loops and got this. Then I rolled the outer loop and removed the now useless gotos and got code below.

EDIT: I changed several time the C code because even as simple as it is there seems to be problems when JIT compiling or executing it with CUDA (and CUDA seems not to be very verbose about errors). That is why the piece of code below use globals... and that is just trivial implementation. We are not yet going for speed. It says much about premature optimization. Why bother to make it fast if we can't even make it work ? I guess there is still issues as CUDA seems to impose many restrictions on the code you can make work if I believe the Wikipedia article. Also maybe we should use float instead of int ?

#include <stdio.h>

int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 9};
int ST1[1] = {6};
int ans[6] = {1, 4, 5, 6, 7, 9};
int * D[6] = { D1, D2, D3, D4, D5, D6 };

/* beware D is passed through globals */
int algo(int * ans, int ELM){
    int a, e, p;

    for (a = 0 ; a < 6 ; a++){ 
        for (e = 0 ; e < ELM ; e++){
            for (p = 0 ; p < 6 ; p++){
                if (D[p][e] == ans[a]){
                    goto cont;
                }
            }
        }
        return 0; //bad row of numbers found
    cont:;
    }
    return 1;
}

int main(){
    int res;
    res = algo(ans, ST1[0]);
    printf("algo returned %d\n", res);
}

Now that is interesting, because we can understand what code is doing. By the way doing this packing job I corrected several oddities in the original question. I believe it was typos as it wasn't logical at all in the global context. - goto always jump to two (it should have progressed) - the last test checked ans[0] instead of ans[5]

please Mark, correct me if I'm wrong in the above asumptions on what the original code should do and your original algorithm is typo free.

What the code is doing it for each value in ans check it is present in a two dimentional array. If any number miss it returns 0. If all numbers are found it returns 1.

What I would do to get a real fast code is not to implement it in C but in another language like python (or C++) where set is a basic data structure provided by standard libraries. Then I would build a set with all the values of the arrays (that is O(n)) and check if numbers searched are present in set or not (that is O(1)). The final implementation should be faster than the existing code at least from an algorithmic point of view.

Python example is below as it is really trivial (print true/false instead of 1/0 but you get the idea):

ans_set = set(ans)
print len(set(D1+D2+D3+D4+D5+D6).intersection(ans_set)) == len(ans_set)

Here is a possible C++ implementation using sets:

#include <iostream>
#include <set>

int algo(int * D1, int * D2, int * D3, int * D4, int * D5, int * D6, int * ans, int ELM){
    int e, p;
    int * D[6] = { D1, D2, D3, D4, D5, D6 };
    std::set<int> ans_set(ans, ans+6);

    int lg = ans_set.size();

    for (e = 0 ; e < ELM ; e++){
        for (p = 0 ; p < 6 ; p++){
            if (0 == (lg -= ans_set.erase(D[p][e]))){
                // we found all elements of ans_set
                return 1;
            }
        }
    }
    return 0; // some items in ans are missing
}

int main(){
    int D1[6] = {3, 4, 5, 6, 7, 8};
    int D2[6] = {3, 4, 5, 6, 7, 8};
    int D3[6] = {3, 4, 5, 6, 7, 8};
    int D4[6] = {3, 4, 5, 6, 7, 8};
    int D5[6] = {3, 4, 5, 6, 7, 8};
    int D6[6] = {3, 4, 5, 6, 7, 1};

    int ST1[1] = {6};

    int ans[] = {1, 4, 5, 6, 7, 8};

    int res = algo(D1, D2, D3, D4, D5, D6, ans, ST1[0]);
    std::cout << "algo returned " << res << "\n";
}

We make some performance hypothesis : the content of ans should be sorted or we should construct it otherwise, we suppose that content of D1..D6 will change between calls to algo. Hence we do not bother constructing a set for it (as set construction is O(n) anyway we wouldn't gain anything if D1..D6 are changing). But if we call several times algo with the same D1..D6 and that is ans that change we should do the opposite and transform D1..D6 into one larger set that we keep available.

If I stick to C I could do it as follow:

  • copy all numbers in D1.. D6 in one unique array (using memcpy for each row)
  • sort content of this array
  • use a dichotomic search to check if number if available

As data size are really small here , we could also try going for micro optimizations. It could pay better here. Don't know for sure.

EDIT2: there is hard restrictions on the subset of C supported by CUDA. The most restrictive one is that we shouldn't use pointers to main memory. That will have to be taken into account. It explains why the current code does not work. The simplest change is probably to call it for every array D1..D6 in turn. To keep it short and avoid function call cost we may use a macro or an inline function. I will post a proposal.

kriss
I corrected it yes was a typo it should have been [5]
Mark
I am not very familiar with C++, so I do not understand "where set is a basic data structure provided by standard libraries. Then I would build a set with all the values of the arrays (that is O(n)) and check if numbers searched are present in set or not (that is O(1))."
Mark
@Mark: I will post the python solution now as it is a piece of cake. It will give you the idea, the C++ one is not much more complicated. C would need more work if we do not have a set library immediately available (a library that implement the concept of set).
kriss
@Mark: OK, I got in comments you want to do it in CUDA. It clarify the context. C++ set could be interesting to try as you have a C++ compiler available. Say us if it is faster or slower than current C code, there is other possibilities not yet tried (like micro optimizing to avoid jumps).
kriss
yes I'm trying right now, I'll let you know
Mark
@kriss: Unfortunately it did not work, maybe because the int * D[6] = { D1, D2, D3, D4, D5, D6 }; and the D[p][e] which is two dimension, while the above not
Mark
@Mark: no the above syntax is ok. Do you speak of the C++ or the C code. I didn't really tried the C code (hence it may contains typos), but I compiled and run the C++ one ans tested it with several values of D1..D6. I'm quite sure it's completely OK, but it may be compilers differences. What is the actual compiler message ?
kriss
@Mark: I corrected two typos in C program (line ST1 become `int ST1[1] = {6};` and line cont: become `cont:;`)
kriss
Well, because I've implemented it in CUDA, the error is of any use:Runtime API error: Unknown error
Mark
@Mark: if the C version works, it is most likely the C++ STL implementation that is bogus or incomplete. The C and C++ code I posted are really very simple.
kriss
@Mark: doesn't CUDA return the line number and the source file where the error occurred ? That should be enough to diagnose.
kriss
yes, unknown error, I know why, in the code is missing a return 1if D1[ELM1] = ans[0] return 1there must be two return 0, and one return 1 (this is when a reach 6)
Mark
if D1[ELM1] = ans[5] instead of having a goto, it should have a return 1, for all the other ans, goto
Mark
I tried to put, after one:if (a == 5) return 1but it gives the error I mentioned earlier
Mark
@Mark: no need for the return. goto cont will continue with the next iteration of the loop `a` was 5 and become 6, the test is false and exit loop then execute return 1. And this is the only possible path to reach return 1, all other are catches by the return 0. No problem there (good try anyway). It you insert a special case for ans[5] you will just make your code fatter (hence slower) inside the loop. There must be something else.
kriss
if you check the original code, you will see that under each while loop (when it finish) there is a return 0, the return one is only when ans[5] is compared again D1,D2,D3,D4,D5,D6 and a match is found
Mark
It has to be: return 0; //bad row of numbers, if while ends cont:; return 1 if a[5] catch the goto cont. } return 0;}
Mark
sorry is my fault, I should have specified, the code return one only if after checking all the a numbers, the last check found a match.Namely, if after checked a[0],a[1],a[2],a[3],a[4] nothing is found and with a[5] a match is found then return 1
Mark
@mark: that is exactly what the code do.
kriss
@mark: 'Runtime Error' is the symptom of a C language support problem, not of a design error in the algorithm. In C code above the only feature used that is not plain old C from prehistorical times is the array initializer. I will change them in the sample C code by moving them to globals (and using globals is awfull but even faster than passing parameters).
kriss
@Mark: it is probably not all all a C or C++ issue not an algorithm issue at that point, but a CUDA specific one. The question should be retaged CUDA (I will do that if you don't mind) to attract CUDA experts. For them making it work (and really fast) should be a piece of cake.
kriss
yes, it does not like it, so I will find a way around
Mark
A: 

With only 36 values to compare, the most efficient approach would be to not use CUDA at all.

Just use a CPU loop.

If you change your inputs, I'll change my answer.

Danny Varod
No that was just an example, but there are many more to compare
Mark
Do you want a single bool answer or an array of answer per element?
Danny Varod
I gave up on that, now I'm working on another project http://stackoverflow.com/questions/3017591/how-can-i-inprove-this-function-under-cuda
Mark