views:

140

answers:

4

Hi I want to find a programmatic solution using C++.

I have a 900 files each of 27MB size. (just to inform about the enormity ).

Each file has 55K rows and Varying columns. But the header indicates the columns

I want to sort the rows in an order w.r.t to a Column Value.

I wrote the sorting algorithm for this (definitely my newbie attempts, you may say). This algorithm is working for few numbers, but fails for larger numbers.

Here is the code for the same: basic functions I defined to use inside the main code:

int getNumberOfColumns(const string& aline)
{
 int ncols=0;
 istringstream ss(aline);
 string s1;
 while(ss>>s1) ncols++;
 return ncols;
}

vector<string> getWordsFromSentence(const string& aline)
{
 vector<string>words;
 istringstream ss(aline);
 string tstr;
 while(ss>>tstr) words.push_back(tstr);
 return words;
}

bool findColumnName(vector<string> vs, const string& colName)
{
 vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
 if ( it != vs.end()) 
 return true;
 else return false;
}

int getIndexForColumnName(vector<string> vs, const string& colName)
{
 if ( !findColumnName(vs,colName) ) return -1;
 else {
  vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
 return it - vs.begin();
 }
}

////////// I like the Recurssive functions - I tried to create a recursive function
///here. This worked for small values , say 20 rows. But for 55K - core dumps
void sort2D(vector<string>vn, vector<string> &srt, int columnIndex)
{
  vector<double> pVals;
 for ( int i = 0; i < vn.size(); i++) {
  vector<string>meancols = getWordsFromSentence(vn[i]);
  pVals.push_back(stringToDouble(meancols[columnIndex]));
 }

        srt.push_back(vn[max_element(pVals.begin(), pVals.end())-pVals.begin()]);
        if (vn.size() > 1 ) {
        vn.erase(vn.begin()+(max_element(pVals.begin(), pVals.end())-pVals.begin()) );
        vector<string> vn2 = vn;
 //cout<<srt[srt.size() -1 ]<<endl;
        sort2D(vn2 , srt, columnIndex);
        }
}

Now the main code:

 for ( int i = 0; i < TissueNames.size() -1; i++)
 {
  for ( int j = i+1; j < TissueNames.size(); j++)
  {
   //string fname = path+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
   //string fname2 = sortpath2+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+"Sorted.txt";
   string fname = path+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
   string fname2 = sortpath2+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+"4Columns.txt";
   vector<string>AllLinesInFile;
   BioInputStream fin(fname);
   string aline;
   getline(fin,aline);
   replace (aline.begin(), aline.end(), '"',' ');
   string headerline = aline;
   vector<string> header = getWordsFromSentence(aline);

   int pindex = getIndexForColumnName(header,"p-raw");
   int xcindex = getIndexForColumnName(header,"xC");
   int xeindex = getIndexForColumnName(header,"xE");
   int prbindex = getIndexForColumnName(header,"X");

   string newheaderline = "X\txC\txE\tp-raw";
   BioOutputStream fsrt(fname2);
   fsrt<<newheaderline<<endl;

   int newpindex=3;
   while ( getline(fin, aline) ){

   replace (aline.begin(), aline.end(), '"',' ');
   istringstream ss2(aline);
   string tstr;
   ss2>>tstr;
   tstr = ss2.str().substr(tstr.length()+1);
   vector<string> words = getWordsFromSentence(tstr);
   string values = words[prbindex]+"\t"+words[xcindex]+"\t"+words[xeindex]+"\t"+words[pindex];
    AllLinesInFile.push_back(values);
   }

   vector<string>SortedLines; 
   sort2D(AllLinesInFile, SortedLines,newpindex);

   for ( int si = 0; si < SortedLines.size(); si++)
    fsrt<<SortedLines[si]<<endl;
   cout<<"["<<i<<","<<j<<"] = "<<SortedLines.size()<<endl;
  }
 }

can some one suggest me a better way of doing this? why it is failing for larger values. ?

The primary function of interest for this query is Sort2D function.

thanks for the time and patience.

prasad.

+1  A: 

You could try a method that doesn't involve recursion. if your program crashes using the Sort2D function with large values, then your probably overflowing the stack (danger of using recursion with a large number of function calls). Try another sorting method, maybe using a loop.

TheFuzz
Instead of using recursion use a queue. This will avoid a STACK OVERFLOW (:
Poni
As I mentioned below, I doubt that a stack overflow is the problem. The function's stack uses 4 vectors (typically only uses stack space of a pointer), a reference, and an int. I would recommend using a debugger and seeing where it crashes.
rlbond
+2  A: 

I'm not sure why your code is crashing, but recursion in that case is only going to make the code less readable. I doubt it's a stack overflow, however, because you're not using much stack space in each call.

C++ already has std::sort, why not use that instead? You could do it like this:

// functor to compare 2 strings
class CompareStringByValue : public std::binary_function<string, string, bool>
{
public:
    CompareStringByValue(int columnIndex) : idx_(columnIndex) {}
    bool operator()(const string& s1, const string& s2) const
    {
        double val1 = stringToDouble(getWordsFromSentence(s1)[idx_]);
        double val2 = stringToDouble(getWordsFromSentence(s2)[idx_]);
        return val1 < val2;
    }
private:
    int idx_;
};

To then sort your lines you would call

std::sort(vn.begin(), vn.end(), CompareByStringValue(columnIndex));

Now, there is one problem. This will be slow because stringToDouble and getWordsFromSentence are called multiple times on the same string. You would probably want to generate a separate vector which has precalculated the values of each string, and then have CompareByStringValue just use that vector as a lookup table.

Another way you can do this is insert the strings into a std::multimap<double, std::string>. Just insert the entries as (value, str) and then read them out line-by-line. This is simpler but slower (though has the same big-O complexity).

EDIT: Cleaned up some incorrect code and derived from binary_function.

rlbond
Thanks alot for the answer. This is what I was looking for. I never was good at predicates and functors. But this looks like the way to go. At least now I know how to write Functors. Thanks so much. Before I close, just wanted to mention few points which would make the code work (for those like me) : in the operator() method, the arguments are const string' after the class definition. Regarding the speed, in research groups this is pretty fast - acceptable. but will sit tight to learn and write better code from your suggestions.
Prasad
Can I know why you didn't think of a predicate and opted for a functor ? Before looking at your post and after posting my question, i was working to create a Predicate of this sort : int compareByPValues(const string vector<string>meancols2 = getWordsFromSentence(s2); return ( stringToDouble(meancols1[columnIndex]) < stringToDouble(meancols2[columnIndex]) );}This is more for my understanding. is it that predicate cannot have arguments but functors give that flexibility ? thanks alot.
Prasad
You've got it -- predicates can't have state, and your predicate won't work since it takes 3 arguments. However functors also can be inlined, and are typically faster.
rlbond
A: 

The problem is less your code than the tool you chose for the job. This is purely a text processing problem, so choose a tool good at that. In this case on Unix the best tool for the job is Bash and the GNU coreutils. On Windows you can use PowerShell, Python or Ruby. Python and Ruby will work on any Unix-flavoured machine too, but roughly all Unix machines have Bash and the coreutils installed.

Let $FILES hold the list of files to process, delimited by whitespace. Here's the code for Bash:

for FILE in $FILES; do
  echo "Processing file $FILE ..."
  tail --lines=+1 $FILE |sort >$FILE.tmp
  mv $FILE.tmp $FILE
done
wilhelmtell
Definitely this is another solution though it didnt strike me till now. So if I use sort -n -K (field specification) I should be able to sort the whole file. In the past I used to do for other works. some how I forgot the easier ways. Thank you very much. Again some coding is needed to identify the right column since the column number varies in each file. Over all I got the point. That will also be faster. I agree. Thank you very much.
Prasad
In that case you need a more elaborate solution. At this point I'd probably switch to Ruby (or Python if that's what _you_ are more comfortable with), because I think the solution would be smaller, simpler and faster to code than in Bash. The point remains though: there are tools with which you can bang faster a smaller, simpler, _correct_ solution. The trick is to get familiar with as many tools as you can (as opposed to be expert in as many tools as you can). There's almost always a tool in which the solution to a given problem is trivial, that doesn't require being an expert in that tool.
wilhelmtell
Yes. I completely agree with you. Thank you. I think I will be more comfortable with Python than Ruby. I am not sure but need to explore further.
Prasad
Sorry Wil!! I dont know how your answer got -2 voting. I tried voting now. your answer is quite reasonable though I said in my question - a solution with C++.
Prasad
A: 

sort2D crashes because you keep allocating an array of strings to sort and then you pass it by value, in effect using O(2*N^2) memory. If you really want to keep your recursive function, simply pass vn by reference and don't bother with vn2. And if you don't want to modify the original vn, move the body of sort2D into another function (say, sort2Drecursive) and call that from sort2D.

You might want to take another look at sort2D in general, since you are doing O(N^2) work for something that should take O(N+N*log(N)).

MSN
thanks I got the point. Yes!! I should pass it by reference and remove vn2. I will check if it works. Thanks for the inputs.
Prasad