views: 240
answers: 4

First of all, there are no multiplication or division operations here for which I could substitute shifting/adding, overflow multiplication, precalculation, etc. I'm just comparing one N-bit binary number to another, but according to the algorithm the quantity of such comparisons seems to be huge. Here it is:

  1. There is a given sequence of 0's and 1's that is divided into blocks. Let the length of the sequence be S; the length of a block is N, which is some power of two (4, 8, 16, 32, etc.). The quantity of blocks is n = S/N, no rocket science here.
  2. According to the chosen N, I build the set of all possible N-bit binary numbers, which is a collection of 2^N objects (0 through 2^N-1).
  3. After this I need to compare each binary number with each block from the source sequence and count how many times there was a match for each binary number, for example:
    S : 000000001111111100000000111111110000000011111111... (0000000011111111 is repeated 6 times, 16 bits x 6 = 96 bits overall)
    N : 8
    blocks : {00000000, 11111111, 00000000, 11111111, ...}
    calculations:


// _n = S/N;
// _N2 = Math.Pow(2, N) - 1
// S=96, N=8, n=12, 2^N-1=255 for this specific case
// sourceEpsilons = blocks from input, string[_n]
var X = new int[_n]; // result array of frequencies
for (var i = 0; i < X.Length; i++) X[i] = 0; // setting up
for (ulong l = 0; l <= _N2; l++) // loop from 0 to max N-bit binary number
{
    var currentl = l.ToBinaryNumberString(_N / 8); // converting the counter to a string: "current binary number as string"
    var sum = 0; // quantity of currentl numbers in the blocks array
    for (long i = 0; i < sourceEpsilons.LongLength; i++)
    {
        if (currentl == sourceEpsilons[i]) sum++; // string comparison; comparing numbers (longs) takes the same time
    }
    // sum is different each time, != blocks quantity
    for (var j = 0; j < X.Length; j++)
        if (sum - 1 == j) X[j]++; // further processing, i.e. X[sum - 1]++
}
// result : 00000000 was matched 6 times, 11111111 6 times, so X[5]=2. Don't ask me why I need this >_<

Even with a small S I end up with (2^N)(S/N) iterations, and with N = 64 the first factor alone is 2^64, which doesn't even fit in a ulong, so that ain't pretty. I'm sure the loops need optimizing, and maybe the whole approach needs changing radically (the C# implementation for N = 32 takes 2 hours on a dual-core PC with Parallel.For). Any ideas how to make the above scheme less time- and resource-consuming? It seems like I'd have to precompute the binary numbers and get rid of the first loop by reading each candidate from a file and evaluating it against the blocks on the fly, but the file size would be (2^N)*N bytes (((2^N-1)+1)*N), which is unacceptable too.

A: 

I'm just comparing one n-bit binary number to another

Isn't that what memcmp is for?

You're looping through every possible integer value, and it's taking 2 hours, and you're surprised at this? There's not much you can do to streamline things if you need to iterate that much.

Billy ONeal
+5  A: 

It seems like what you want is a count of how many times each specific block occurred in your sequence; if that's the case, comparing every block to all possible blocks and then tallying is a horrible way to go about it. You're much better off making a dictionary that maps blocks to counts; something like this:

var dict = new Dictionary<int, int>();
for (int j=0; j<blocks_count; j++)
{
    int count;
    if (dict.TryGetValue(block[j], out count)) // block seen before, so increment
    {
        dict[block[j]] = count + 1;
    }
    else // first time seeing this block, so set count to 1
    {
        dict[block[j]] = 1; 
    }
}

After this, the count q for any particular block will be in dict[the_block], and if that key doesn't exist, then the count is 0.

tzaman
It seems this approach is acceptable. I was also counting blocks absent from the source, but you are right: it's much easier to enumerate the available blocks and then subtract their quantity from 2^N, which gives the same result. Thanks.
Alcz
A: 

Are you trying to get the number of unique messages in S? For instance, in your given example, for N = 2 you get 2 messages (00 and 11), for N = 4 you get 2 messages (0000 and 1111), and for N = 8 you get 1 message (00001111). If that's the case, then the dictionary approach suggested by tzaman is one way to go. Another would be to sort the list first, then run through it and look for each message. A third, naive, approach would be to use a sentinel message, all 0's for instance, and run through looking for messages that are not the sentinel. When you find one, destroy all its copies by setting them to the sentinel. For instance:

#include <stdlib.h>
#include <string.h>

int CountMessages(char *S, int SLen, int N) {
    int rslt = 0;
    int i, j;
    char *sentinel;

    sentinel = calloc(N + 1, sizeof(char));

    for (i = 0; i < N; i++)
        sentinel[i] = '0';

    //first, is there a sentinel message?
    for (i = 0; ((i < SLen) && (rslt == 0)); i += N) {
        if (strncmp(S + i, sentinel, N) == 0)
            rslt++;
    }

    //now destroy the list and count only the unique messages
    for (i = 0; i < SLen; i += N) {
        if (strncmp(S + i, sentinel, N) != 0) { //first instance of a given message
            rslt++;
            for (j = i + N; j < SLen; j += N) { //look for all remaining instances of this message and destroy them
                if (strncmp(S + i, S + j, N) == 0)
                    strncpy(S + j, sentinel, N); //destroy message
            }
        }
    }

    free(sentinel);
    return rslt;
}

The first means using either a pre-written dictionary or writing your own. The second and third destroy the list, meaning you have to use a copy for each 'N' you want to test, but are pretty easy. As for parallelization, the dictionary is the easiest, since you can break the string into as many sections as you have threads, do a dictionary for each, then combine the dictionaries themselves to get the final counts. For the second, I imagine the sort itself can be made parallel easily, then there's a final pass to get the count. The third would require you to do the sentinel-ization on each substring, then redo it on the final recombined string.

Note the big idea here though: rather than looping through all the possible answers, you only loop over all the data!

mtrw
Well, sounds reasonable. The stuff above is a direct interpretation of a stochastic-process formula involving Kronecker's delta, which can easily be rewritten from f(i,j) to f(j,i). The only question is whether that's possible in this specific case... I don't want to break other parts of the system. Thanks anyway.
Alcz
A: 

Instead of a dictionary, you can also use a flat file of 2^N entries, each the size of, say, an integer.

This is your counting pad. Instead of looping through all possible numbers and comparing each to the currently viewed block, you iterate through S forward-only, like so:

procedure INITIALIZEFLATFILE is
    allocate 2^N * sizeof(integer) bytes to FLATFILE
end procedure

procedure COUNT is
    while STREAM is not at END
        from FLATFILE at address STREAM.CURRENTVALUE read integer into COUNT
        with FLATFILE at address STREAM.CURRENTVALUE write integer COUNT+1
        increment STREAM
    end while
end procedure

A dictionary is conservative on space in the beginning but requires a lookup to the proper index later on. If you expect to see all possible values eventually, you can keep a fixed-size "scorecard" from the get-go.

maxwellb
If N is relatively small, for example 8 as in your example, a scorecard like this using integers as counters takes 2^8 * 4 = 1024 bytes of memory per scorecard. This grows quickly as you track larger bit widths.
maxwellb