I am fairly new to MATLAB, but for my job I need to import an ENORMOUS data set and organize it in a certain way. I have written code that does this, but very inefficiently (it is only my third major piece of code, and it takes several hours to run). MATLAB is telling me that I can preallocate my variables (about fifty times, in fact), but I am having trouble seeing how to do that because I am not sure which matrix the data will be added to on each iteration of the for loop. The code itself probably explains this better than I do.
(This is just a small piece of it, but will hopefully display my problem)

for x= 1:length(firstinSeq)
            for y= 1:length(littledataPassed-1)
                if firstinSeq(x,1)== littledataPassed(y,1) && firstinSeq(x,2)== littledataPassed(y,2) 
                        switch firstinSeq(x,3)
                            case 0
                                for z= 0:1000
                                    w= y+z;  
                                    if firstinSeq(x,4)== littledataPassed(w,4) 
                                        if littledataPassed(w,6)== 1 && firstinSeq(x,2)== littledataPassed(w,2) && littledataPassed(w,5)== 0 
                                            msgLength0= [msgLength0; firstinSeq(x,:) littledataPassed(w,:)];
                                            break
                                        else continue
                                        end
                                    else msgLength0= [msgLength0; firstinSeq(x,:) [0 0 0 0 0 0]];  
                                        break
                                    end
                                end
                            case 1
                                for z= 0:1000
                                    w= y+z; 
                                    if firstinSeq(x,4)== littledataPassed(w,4) %if sequence not the same, terminate
                                        if littledataPassed(w,6)== 1 && firstinSeq(x,2)== littledataPassed(w,2) && littledataPassed(w,5)== 0
                                            msgLength1= [msgLength1; firstinSeq(x,:) littledataPassed(w,:)];
                                            break
                                        else continue
                                        end
                                    else msgLength1= [msgLength1; firstinSeq(x,:) [0 0 1 0 0 0]]; 
                                        break        
                                    end
                                end
                            case 2
                                for z= 0:1000
                                    w= y+z;
                                    if firstinSeq(x,4)== littledataPassed(w,4)
                                        if littledataPassed(w,6)== 1 && firstinSeq(x,2)== littledataPassed(w,2) && littledataPassed(w,5)== 0
                                            msgLength2= [msgLength2; firstinSeq(x,:) littledataPassed(w,:)];
                                            break
                                        else continue
                                        end
                                    else msgLength2= [msgLength2; firstinSeq(x,:) [0 0 2 0 0 0]];
                                        break
                                    end
                                end
                                for z= 0:1000
                                    w= y+z;
                                    if firstinSeq(x,4)== littledataPassed(w,4)
                                        if littledataPassed(w,6)== 1 && firstinSeq(x,2)== littledataPassed(w,2) && littledataPassed(w,5)== 1
                                            msgLength2= [msgLength2; firstinSeq(x,:) littledataPassed(w,:)];
                                            break
                                        else continue
                                        end
                                    else msgLength2= [msgLength2; firstinSeq(x,:) [0 0 2 0 1 0]];  
                                        break
                                    end
                                end

Any thoughts on how I could preallocate these variables (msgLength0, msgLength1, msgLength2, etc.)? Data is not added to them on every iteration of the loop, and I am uncertain of the final size for each run. There are a total of eight cases in my switch right now, making this program very slow.

A: 

If I read your code correctly, one of the variables msgLengthN is extended on each trip through the innermost loop? If so, you might want to preallocate a single array called msgLengthAll and populate it as you go, making sure that each entry holds a value that distinguishes between cases 0, 1, 2, etc.

If you don't know up front how much space to allocate for msgLengthAll then you could either:

  • Scan the input file once to determine how big this and the other arrays need to be. There's no disgrace in reading large files more than once to process them, and it might save you a lot of time. OR
  • Indulge in some fancy allocation scheme whereby you initially make a guess about how much space msgLengthAll will need and then, when it gets full, allocate more memory. There are various ways of deciding how much more to allocate at each expansion point: a fixed size, or possibly as much as is already allocated (i.e. double the allocation at each expansion). This is, of course, potentially quite complicated.
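
The second option can be less complicated than it sounds. Below is a minimal sketch of the doubling scheme, not your actual loop: the row width nCols, the initial capacity, and the dummy rows are all assumptions made up for illustration. The first column stores the case number so that the per-case arrays can be pulled out afterwards.

```matlab
% Sketch: grow a preallocated buffer by doubling, then trim once at the end.
% nCols, the initial capacity, and the dummy rows are illustrative assumptions.
nCols    = 13;                      % assumed: one case column plus the data columns
capacity = 1024;                    % initial guess at the number of rows
msgLengthAll = zeros(capacity, nCols);
nUsed = 0;                          % number of rows actually filled so far

for k = 1:5000                      % stand-in for the real matching loop
    caseId = mod(k, 3);             % stand-in for firstinSeq(x,3)
    newRow = [caseId, k * ones(1, nCols - 1)];   % stand-in for the real row

    if nUsed == capacity            % buffer is full: double its size
        msgLengthAll = [msgLengthAll; zeros(capacity, nCols)];
        capacity = 2 * capacity;
    end
    nUsed = nUsed + 1;
    msgLengthAll(nUsed, :) = newRow;
end

msgLengthAll = msgLengthAll(1:nUsed, :);                    % drop the unused tail
msgLength0 = msgLengthAll(msgLengthAll(:,1) == 0, 2:end);   % rows recorded for case 0
```

With doubling, the array is reallocated only a handful of times (three times in this toy run) instead of once per appended row, which is where the growing-array version loses most of its time.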

Are you reading the file line by line and updating in-memory variables as you go? Or are you reading the whole file, then sorting things out in memory? How big is ENORMOUS? How much RAM do you have?

High Performance Mark
Thanks for your response. It is an entire data set for a research project. 210 text files of about 30,000 lines and 210 of around 80,000 lines are being read into MATLAB (they come in pairs). The way I tried to set up my script is to read in a pair of files, match up the corresponding data lines, and then run this ridiculous loop on the lines I want processed. ~200,000 lines of data are processed by this loop (the firstinSeq variable). So to answer your question: importing, then sorting. I would post my whole script for you, but it is embarrassingly inefficient and takes 600 lines.
Maxwell
@Maxwell: So if you take a matching pair of files, do the 30,000 lines in one match up 1:~3 with the 80,000 lines in the other? Are you then trying to build an array in MATLAB that stores the aggregate data from both? For this sort of data wrangling I'd usually use utilities such as sed and awk as much as possible before trying to read the data into MATLAB, for performance if for no other reason.
High Performance Mark
Okay, so I have been playing around with this code since first posting and have modified it a great deal. I have FINALLY figured out how to preallocate this. The code is still slow because I don't really know how to simplify/vectorize this loop, but it is much improved.
Maxwell
@Mark: Yes. The data is about messages; the smaller file is just a subset of the messages in the larger file but holds different information, so they needed to be matched up. Getting the data how I want it is fairly efficient (only 5 minutes to match together, sort, and separate the important lines). The part of the code that is still slow is the loop (partly posted above). I have preallocated, but is there any other way you see to make this faster without too much trouble? (I am terrible at vectorizing, which is frustrating because I know I could probably do it here somehow.)
Maxwell
A: 

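You can vectorize the processing in each switch case by finding, within the block of up to 1001 rows starting at y, the index where the sequence changes and the index of the first record that meets your criteria, and then appending the result to msgLength0 in one step. The following is a vectorized version of the case 0 code:

% Rows y..y+1000 of littledataPassed, clipped at the end of the array
block = littledataPassed(y:min(y+1000, size(littledataPassed,1)), :);
% First row whose sequence (column 4) differs; only rows before it count
indexStop = find(block(:,4) ~= firstinSeq(x,4), 1, 'first');
if isempty(indexStop)
   indexStop = size(block,1) + 1;
end
candidates = block(1:indexStop-1, :);
indexProcess = find(candidates(:,6) == 1 & ...
   candidates(:,2) == firstinSeq(x,2) & ...
   candidates(:,5) == 0, 1, 'first');
if ~isempty(indexProcess)
   msgLength0 = [msgLength0; firstinSeq(x,:) candidates(indexProcess,:)];
elseif indexStop <= size(block,1)
   msgLength0 = [msgLength0; firstinSeq(x,:) 0 0 0 0 0 0];
end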

Vectorizing the outer loops would also do a lot to reduce the execution time. I don't know enough about your data to suggest a specific approach, but using the reshape and/or repmat functions to create arrays that you can operate on as a whole may be the way to go.
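
As one possible starting point for the outer loops, here is a sketch that swaps in a different technique from reshape/repmat: ismember with the 'rows' option, which can replace the nested x/y search for rows whose first two columns match. The toy matrices below are invented for illustration and only mimic the column layout in the question. Note that ismember reports a single matching row per row of firstinSeq, whereas the original loop visits every matching y, so this only fits if each key occurs once or only one match is wanted.

```matlab
% Sketch: find, for every row of firstinSeq, a row of littledataPassed whose
% first two columns match, without the nested x/y loops. Toy data only.
firstinSeq       = [10 1 0 7; 20 2 1 8; 30 3 2 9];
littledataPassed = [20 2 0 8 0 1; 10 1 0 7 0 1; 99 9 0 5 0 0];

% found(i) is true when row i of firstinSeq has a partner;
% y(i) is the index of that partner row in littledataPassed (0 if none).
[found, y] = ismember(firstinSeq(:,1:2), littledataPassed(:,1:2), 'rows');

xMatched = find(found);    % rows of firstinSeq that have a partner
yMatched = y(found);       % corresponding rows of littledataPassed
```

The pairs (xMatched(i), yMatched(i)) are then the only (x, y) combinations on which the switch needs to run.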

b3
