views:

165

answers:

4

Hi, I'm searching for improving my algorithm for an allocate massive data riddle. If anyone can think of some good idea that makes it run faster, I'll appreciate it.

I have 8 files contains 2 parameters:

1) start point

2) end point

These type of data repeat itself with different numbers along the file. means I could have a file with 200 start points + 200 end points, or a file with 50000 start points + 50000 end points.

OK, these data eventuality reflect the X axis data points. each point start/end will be sorted into X_sort_array.

The problem starts with the Y axis data points.

In order to create the Y axis data points, I made up with some algorithm that run all over the X_sort_array points and check for each point if it is between all of start-end points individuality . if it is add one for Y_array[x_point] and so on...

I'll paste the massive part of my code here, if there is a syntax/way to perform it more efficiency, I will glad to know.

when sort_all_legth bigger then 50000 - the slow motion began:

    my $sort_all_length = @{$values{"$tasks_name[$i]\_sort\_allvals"}};
    my $core_count = 0;
    my $m=0;

    for my $k(0 .. $sort_all_length-1){



    if (($values{"$tasks_name[$i]\_sort\_allvals"}->[$k] eq $values{"$tasks_name[$i]\_sort\_allvals"}->[$k-1])
& ($k != 0) ) {
}

    else {
    my $pre_point_flag;
    # CHEACK PREVIUES POINT:
    for my $inst_num(0 .. $sort_all_length/2-1){
     $pre_point_flag=1;
     if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <=    $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1) 
   & ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1 <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) )  {
    $pre_point_flag=0;
    last;
 }
}
if ($pre_point_flag eq 1){
 push(@{$values{"$tasks_name[$i]\_Y"}},0);
 push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1);
}

# CHEACK DEFINE POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
 if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]) 
   & ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k] <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) )  {
    $core_count++;
 }

}
push(@{$values{"$tasks_name[$i]\_Y"}},$core_count);
push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]);

# CHEACK LATER POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
 $pre_point_flag=1;
 if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1) 
   & ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1 <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) )  {
    $pre_point_flag=0;
    last;
 }
}
if ($pre_point_flag eq 1){
 push(@{$values{"$tasks_name[$i]\_Y"}},0);
 push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1);
}

if ($y_max_val < $core_count){
 $y_max_val = $core_count;
}

$core_count = 0;
$m++;
}



}

small data example that has good performance :

(bad performance comes up with I.P size of 50000 !)

database: task: byte_position

database: binary file size: 3216

database: numbers of start/end values = 804

database: counting active cores in sorted I.P array (size: 1608)...

database: I.P array value:

375127997,375135023,375177121,375177978,375225484,375226331, 375273745,375274563,375320063,375325255,375372479,375377085, 375422115,375422896,375473198,375474058,375517412,375518169, 375561967,375562760,375606301,375607092...

TEST example input/output:

inputs are start and end points on X axis:

16,255

16,255

16,255

100,355

200,455

database: task: TEST

database: binary file size: 20

database: numbers of start/end values = 5

database: sorted I.P array (X axis):

16,16,16,100,200,255,255,255,355,455

each of these points called interesting point (on X axis).

in order to calculate Y values, I draw a vertical line with each I.P.

then I count how many input lines (start/end point) pass though the first vertical line.

the result is the first Y point for the first X I.P

database: counting active cores in sorted I.P array (size: 10)...

output vector:

database: sorted cores array (Y axis):

0,3,4,5,5,2,1,0

In my algorithm I added another point before and after each I.P that start/end a block.

that's way you can see zero at the beginning/end !

3 - means there are 3 slicing lines at X point 16

4 - means there are 4 slicing lines at X point 100

5 - (1)means there are 5 slicing lines at X point 200

5 - (2)means there are 5 slicing lines at X point 255

2 - means there are 2 slicing lines at X point 355

1 - means there are 1 slicing lines at X point 455

Hope these extended example makes you more understand the algorithm. :)

I will rebuild the code for readability.

Thanks, Yodar.

+3  A: 

A very simple way to improve your performance there entails simply learning better coding practices.

One tenet you are ignoring is this: Do not repeat yourself.

You repeat certain bits of code in there a lot, forcing Perl to evaluate the same expression again and again. An example for this is this string:

"$tasks_name[$i]\_sort\_allvals"

It's used about 10 times in there. That means every single time you use it perl refers to the array, in case it changed, and puts together that string. It may not seem much, but eventually it adds up.

Another example is this:

$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]

It's used 10 times as well and while $k actually changes each loop, for each run through the loop the value of that whole expression is the same. It would be faster to store it in a single scalar at the beginning of the loop and then use that scalar throughout the rest of it, as you avoid forcing perl to resolve a reference 10 times per loop.

Mithaldu
An algorithm redesign sounds more important, but, yech, I'd do this just for readability!
fennec
Agreed. But sometimes it just helps to give salient reasons beyond that. :)
Mithaldu
Thanks for your comment. I'll try to change it and keep it for next time :)
YoDar
+3  A: 

I'm having a little trouble following it, but it sounds like you're searching a (conveniently presorted) array of datafile-X-points from the start for my $var (0..whatever) { ... last if $done; } repeatedly for each axis-X-point.

Total Shlemiel the Painter algorithm. This is probably the source of the performance problem. You should try avoiding parts of the search you don't need to do again.

fennec
Nice algorithm ! I'm not quite sure I've understand the connections to my algorithm. thanks for your comment :)
YoDar
A: 

I tried to understand your program but my brain choked a bit. Can you provides us an example input, some explanation of what you want to do, and good example output? From your example I can't make head or tails.

You say this is some valid input:

database: task: TEST

database: binary file size: 20

database: numbers of start/end values = 5

database: sorted I.P array (X axis):

16,16,16,100,200,255,255,255,355,455

database: counting active cores in sorted I.P array (size: 10)...

and this is valid output:

database: sorted cores array (Y axis):

0,3,4,5,5,2,1,0

but I don't understand it. Care to explain what you want to achieve?

Leonardo Herrera
I've added explanations for the example...
YoDar
A: 

Based on what you say here:

In order to create the Y axis data points, I made up with some algorithm that run all over the X_sort_array points and check for each point if it is between all of start-end points individuality . if it is add one for Y_array[x_point] and so on...

and assuming this is your problem, then a modified binary search algorithm will work in O(log n), or approximately 6 steps for n = 1_000_000:

sub binary_range_search {
    my ( $range, $ranges ) = @_;
    my ( $low, $high ) = ( 0, @{$ranges} - 1 );
    while ( $low <= $high ) {

        my $try = int( ( $low + $high ) / 2 );

        $low  = $try + 1, next if $ranges->[$try][1] < $range->[0];
        $high = $try - 1, next if $ranges->[$try][0] > $range->[1];

        return $ranges->[$try];
    }
    return;
}

In this version, you're looking to find whether the range @$range overlaps any of the ranges in @$ranges. You can modify it to look for overlaps of a single point over all ranges. Is that what you're looking for?

Pedro Silva
I think this might be what I've been looking for. Still I need to make some upgrades to use it right. I know X_sort_array must be sorted in order to use this algorithm, but is the @$range need to be sorted also ?
YoDar
No at all. @$range should be something like (1, 50) or (100, 100), in which case it will find the proper range where 100 is contained. The version I posted is just something I already had for the purpose of finding overlapping ranges.
Pedro Silva
Is there a way to count how many @$ranges overlaps occur on a single point ?
YoDar
Yeah, sure. Also using binary search, see my question: http://stackoverflow.com/questions/2046390/how-to-extend-a-binary-search-iterator-to-consume-multiple-targets and the accepted answer: http://stackoverflow.com/questions/2046390/how-to-extend-a-binary-search-iterator-to-consume-multiple-targets/2052468#2052468If you implement the `binary_range_search` iterator version, computing the number of overlaps would be a matter of exhausting the iterator through `$count++ while $brs_iterator->()`.
Pedro Silva