tags:

views:

280

answers:

6

i was assigned to make some changes to a C program written by someone else...i want to understand it first to work on it properly...i came upon a function that generates the histogram of ASCII values from a given long string of data. it is something like this.

//load the symbols the old data
  for(int k = 0;k < 256;++k)
  {
    sym[k].Symbol = k;
    sym[k].Count  = 0;
  }

  //Creating the probability distribution for each of the source symbols.
  for(int k = size;k;--k)
  {
    sym[*in ++].Count ++;
  }

here 'in' is the char array (string) containing the characters to be counted. sym is a struct variable. i can't quite understand how this works. can anyone tell me how how the second loop generated the count of the symbols 1 to 255 (ASCII) in the string?

+4  A: 
for(int k = 0; k < size; k++)
  {
    sym[in[k]].Count++;
  }

This is basically what that second loop is doing.

They just dereference and then move to the next ascii value in one step, and increment the counter for that ascii value.

James Black
A: 

If 'in' is the input string then *in++ is taking each character in the string and looking up the entry in the ascii list sym[] corrsponding to that character value.

So if the string starts with 'A' then (*in) is 65 and it is referencing sym[65]

edit: sym[k].symbol is a bit redundant you can jsut have an array of 256 integers to represent the ascii chart, since sym[n] must be for symbol numbered 'n'

Martin Beckett
A: 

In a word, poorly. The basic idea is pretty simple, but the code is needlessly complex. In particular, his Symbol member is completely useless.

What you'd normally want to do is something like this:

int counts[UCHAR_MAX] = {0};

size_t len = strlen(input_string);
for (int i=0; i<len; i++)
    ++counts[unsigned char(input_string[i])];

So, the basic idea here is pretty simple: walk through the string, and for each item in the string, increment the count for that character.

He's doing pretty much the same thing, but keeping the Count as a member of a structure, along with the Symbol. Since the Symbol is always equal to the subscript of that item, storing it is pointless and wasteful.

Other than that, he's counting down in his loop -- probably a micro-optimization, because (at least on some machines) the zero flag will be set based on the value of the counter when it's decremented, so counting down to zero avoids a comparison in the loop. Given the amount he's wasting with his structure and unnecessarily storing the Symbol values, this makes no sense at all.

If you honestly cared about the code being close to optimum, you could write something more like this:

int counts[UCHAR_MAX] = {0}:

while (*in)
    ++counts[(unsigned char)*in++];

For anybody wondering about the cast, it's unnecessary if you're sure your input will always be true ASCII, that never has the high-bit set. Since you can rarely guarantee much about the input, however, it's generally safer to cast to unsigned char. Otherwise, a character with its top-bit set will typically be interpreted as a negative number, and index outside the array bound. Of course, it's possible for char to be unsigned by default, but it's pretty rare. On a typical (two's complement) machine, the cast doesn't require any extra operations; it just governs how the existing bit pattern will be interpreted.

Jerry Coffin
Funny how you blame the original code for being needlessly complex but you keep the (needless) separate computation of the length before the main loop. I think you have the wrong syntax for casts, too.
Pascal Cuoq
And the code is not walking backwards through the string. It is decrementing `k` but incrementing pointer `in` from the beginning to the end of the string.
Pascal Cuoq
Yes, you can avoid computing the length outside the loop, but (at least from what I've seen) most people find it simpler to understand this way. The cast syntax is wrong if (and only if) it's strictly C, not C++. None of this, however, changes the fact that the original code is pointlessly complex.
Jerry Coffin
Good answer. I did a +1 on your answer so there must be someone who voted you down. Oh well.
Alok
@Alok:Thanks for the kind words. From the looks of things, there were actually two up and two down. Such is life...
Jerry Coffin
A: 

in++ increments in, a pointer to the character being read.

*in++, parsed as *(in++), is the char currently read. It's also a number, and the algorithm takes advantage of this to use it as an index in an array. The appropriate count (the count of the character just read) sym[*in ++].Count is incremented.

Pascal Cuoq
A: 

The second loop uses the value of the character pointed to by the pointer in to index the count array.

A really good way to investigate this code is to put a few printf statements around it. Print the value of *in, print the count after it is incremented. You will soon get the picture this way.

Another option is to run code that you don't understand through the debugger.

Tony van der Peet
A: 

something++ means "add 1 to something, and return its value before the addition".

in is a pointer to the first character of the input.

So, *in++ means "move the input pointer one item onwards, and return the item it was pointing to".

So you can see that

sym[*in ++].Count ++;

means "move the input pointer one item onwards, and increment the Count field of element in the array sym corresponding to the character that was at the current input pointer position the item it was pointing to";

and the enclosing loop does this size times, thus processing the input.

moonshadow