views:

218

answers:

3

I'm using the Statistics::Descriptive library in Perl to calculate frequency distributions and coming up against a floating point rounding error problem.

I pass in two values, 0.205 and 0.205, (taken from other numbers and sprintf'd to those) to the stats module and ask it to calculate the frequency distribution but it's getting stuck in an infinite loop.

Stepping through with a debugger I can see that it's doing:

my $interval = $self->{sample_range}/$partitions;

my $iter = $self->{min};

while (($iter += $interval) <  $self->{max}) {

  $bins{$iter} = 0;

  push @k, $iter;  ##Keep the "keys" unstringified

}

$self->sample_range (The range is max-min)is returning 2.77555756156289e-17 rather than 0 as I'd expect. This means that the loop ((min+=range) < max)) enters a (for all intents and purposes) infinite loop.

DB<8> print $self->{max};
0.205
DB<9> print $self->{min};
0.205
DB<10> print $self->{max}-$self->{min};
2.77555756156289e-17

So this looks like a rounding problem. I can't think how to fix this on my side though, and I'm not sure editing the library is a good idea. I'm looking for suggestions of a workaround or alternative.

Cheers, Neil

+3  A: 

Not exactly a rounding problem; you can see the more precise values with something like

printf("%.18g %.18g", $self->{max}, $self->{min});

Looks to me like there's a flaw in the module where it assumes the sample range can be divided up into $partitions pieces; because floating point doesn't have infinite precision, this isn't always possible. In your case, the min and max values are exactly adjacent representable values, so there can't be more than one partition. I don't know what exactly the module is using the partitions for, so I'm not sure what the impact of this may be. Another possible problem in the module is that it is using numbers as hash keys, which implicitly stringifies them which slightly rounds the value.

You may have some success in laundering your data through stringization before feeding it to the module:

$data = 0+"$data";

This will at least ensure that two numbers that (with the default printing precision) appear equal are actually equal.

ysth
Yup, thanks. Max is actually 0.20500000000000002 and min is 0.20499999999999999 so that explains why it's going wrong.I'll try some workarounds.
NeilInglis
A: 

That shouldn't cause an infinite loop. What would cause that loop to be infinite would be if $self->{sample_range}/$partitions is 0.

Chas. Owens
Yeah, I didn't think so either but DB<12> p $iter; 0.205 DB<13> p $interval; 3.46944695195361e-18 DB<14> p $iter+=$interval 0.205 DB<15> p $self->{max} 0.205 DB<16> p ($iter += $interval) < $self->{max} 1 so (( 0.205 + 3.46944695195361e-18) < 0.205 ) evaluates to true.Of course, it's been a long day so I could be off the ball...
NeilInglis
Hrm formatting fail. Sorry.
NeilInglis
Nope; take for instance the numbers 1 and 1+2**-52. They differ by 2**-52. Assuming you want 4 partitions, that gives an interval of 2**-54 (which is clearly non-zero), but if you try to add that to 1, you leave the 1 unchanged (on most platforms), since the nearest representable value to 1+2**-54 is 1. The loop is assuming that if you increment a number by a non-zero value, that will increase the number, and that's not true in this case, leading to an endless loop.
ysth
@ysth, good point
Chas. Owens
+4  A: 

Hi!

I am the Statistics::Descriptive maintainer. Due to its numeric nature, many rounding problems have been reported. I believe this particular one was fixed in a later version to the one you were using that I released recently, by using multiplication for the divisions instead of +=.

Please use the most up-to-date version from the CPAN, and it should be better.

Shlomi Fish
Hi, Shlomi! Glad you noticed this question; you've saved me from having to email you a link to it. I see the new version is still using numbers as hash keys like $bins{$self->max()} = 0; to avoid having this round the values, you may want to use pack "F" (requires 5.8.0+) and unpack whenever you use the key.
ysth
Excellent, thanks! I should've checked for a new version, my fault. Very impressed by this response to my first Stack Overflow question. Thanks again to all who responded.
NeilInglis