Hi, I'm searching for improving my algorithm for an allocate massive data riddle. If anyone can think of some good idea that makes it run faster, I'll appreciate it.
I have 8 files contains 2 parameters:
1) start point
2) end point
These type of data repeat itself with different numbers along the file. means I could have a file with 200 start points + 200 end points, or a file with 50000 start points + 50000 end points.
OK, these data eventuality reflect the X axis data points. each point start/end will be sorted into X_sort_array.
The problem starts with the Y axis data points.
In order to create the Y axis data points, I made up with some algorithm that run all over the X_sort_array points and check for each point if it is between all of start-end points individuality . if it is add one for Y_array[x_point] and so on...
I'll paste the massive part of my code here, if there is a syntax/way to perform it more efficiency, I will glad to know.
when sort_all_legth bigger then 50000 - the slow motion began:
my $sort_all_length = @{$values{"$tasks_name[$i]\_sort\_allvals"}};
my $core_count = 0;
my $m=0;
for my $k(0 .. $sort_all_length-1){
if (($values{"$tasks_name[$i]\_sort\_allvals"}->[$k] eq $values{"$tasks_name[$i]\_sort\_allvals"}->[$k-1])
& ($k != 0) ) {
}
else {
my $pre_point_flag;
# CHEACK PREVIUES POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
$pre_point_flag=1;
if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1)
& ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1 <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) ) {
$pre_point_flag=0;
last;
}
}
if ($pre_point_flag eq 1){
push(@{$values{"$tasks_name[$i]\_Y"}},0);
push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1);
}
# CHEACK DEFINE POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k])
& ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k] <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) ) {
$core_count++;
}
}
push(@{$values{"$tasks_name[$i]\_Y"}},$core_count);
push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]);
# CHEACK LATER POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
$pre_point_flag=1;
if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1)
& ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1 <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) ) {
$pre_point_flag=0;
last;
}
}
if ($pre_point_flag eq 1){
push(@{$values{"$tasks_name[$i]\_Y"}},0);
push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1);
}
if ($y_max_val < $core_count){
$y_max_val = $core_count;
}
$core_count = 0;
$m++;
}
}
small data example that has good performance :
(bad performance comes up with I.P size of 50000 !)
database: task: byte_position
database: binary file size: 3216
database: numbers of start/end values = 804
database: counting active cores in sorted I.P array (size: 1608)...
database: I.P array value:
375127997,375135023,375177121,375177978,375225484,375226331, 375273745,375274563,375320063,375325255,375372479,375377085, 375422115,375422896,375473198,375474058,375517412,375518169, 375561967,375562760,375606301,375607092...
TEST example input/output:
inputs are start and end points on X axis:
16,255
16,255
16,255
100,355
200,455
database: task: TEST
database: binary file size: 20
database: numbers of start/end values = 5
database: sorted I.P array (X axis):
16,16,16,100,200,255,255,255,355,455
each of these points called interesting point (on X axis).
in order to calculate Y values, I draw a vertical line with each I.P.
then I count how many input lines (start/end point) pass though the first vertical line.
the result is the first Y point for the first X I.P
database: counting active cores in sorted I.P array (size: 10)...
output vector:
database: sorted cores array (Y axis):
0,3,4,5,5,2,1,0
In my algorithm I added another point before and after each I.P that start/end a block.
that's way you can see zero at the beginning/end !
3 - means there are 3 slicing lines at X point 16
4 - means there are 4 slicing lines at X point 100
5 - (1)means there are 5 slicing lines at X point 200
5 - (2)means there are 5 slicing lines at X point 255
2 - means there are 2 slicing lines at X point 355
1 - means there are 1 slicing lines at X point 455
Hope these extended example makes you more understand the algorithm. :)
I will rebuild the code for readability.
Thanks, Yodar.