I am reading a 200 MB file into my application, and for some strange reason the memory usage of my application is more than 600 MB. I have tried vector and deque, as well as std::string and char *, to no avail. I need the memory usage of my application to be roughly the same as the size of the file I am reading; any suggestions would be extremely helpful. Is there a bug that causes so much memory consumption? Could you pinpoint the problem, or should I rewrite the whole thing?

Windows Vista SP1 x64, Microsoft Visual Studio 2008 SP1, 32-bit release build, Intel CPU

The whole application so far:

#include <string>
#include <vector>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <cstring>   // for memcpy
#include <time.h>



static unsigned int getFileSize (const char *filename)
{
    std::ifstream fs;
    fs.open (filename, std::ios::binary);
    fs.seekg(0, std::ios::beg);
    const std::ios::pos_type start_pos = fs.tellg();
    fs.seekg(0, std::ios::end);
    const std::ios::pos_type end_pos = fs.tellg();
    const unsigned int ret_filesize (static_cast<unsigned int>(end_pos - start_pos));
    fs.close();
    return ret_filesize;
}
void str2Vec (std::string &str, std::vector<std::string> &vec)
{
    // Find the index of the last newline in the chunk; bytes after it are a
    // partial line and are excluded below.
    int newlineLastIndex(0);
    for (int loopVar01 = static_cast<int>(str.size()) - 1; loopVar01 >= 0; loopVar01--)
    {
        if (str[loopVar01] == '\n')
        {
            newlineLastIndex = loopVar01;
            break;
        }
    }
    int remainder(static_cast<int>(str.size()) - newlineLastIndex);

    // Collect the index of every newline up to the last complete line.
    std::vector<int> indexVec;
    indexVec.push_back(0);
    for (unsigned int lpVar02 = 0; lpVar02 < (str.size() - remainder); lpVar02++)
    {
        if (str[lpVar02] == '\n')
        {
            indexVec.push_back(lpVar02);
        }
    }

    // Copy each newline-delimited segment into its own string.
    int memSize(0);
    for (int lpVar03 = 0; lpVar03 < static_cast<int>(indexVec.size()) - 1; lpVar03++)
    {
        memSize = indexVec[lpVar03 + 1] - indexVec[lpVar03];
        std::string tempStr(memSize, '0');
        memcpy(&tempStr[0], &str[indexVec[lpVar03]], memSize);
        vec.push_back(tempStr);
    }
}
void readFile(const std::string &fileName, std::vector<std::string> &vec)
{
    static unsigned int fileSize = getFileSize(fileName.c_str());
    static std::ifstream fileStream;
    fileStream.open(fileName.c_str(), std::ios::binary);
    fileStream.clear();
    fileStream.seekg(0, std::ios::beg);

    // Read the file in 1000 equal chunks, splitting each into lines.
    // (Note: a line straddling a chunk boundary is cut in two.)
    const int chunks(1000);
    int singleChunk(fileSize / chunks);
    int remainder = fileSize - (singleChunk * chunks);
    std::string fileStr(singleChunk, '0');
    for (int lpVar01 = 0; lpVar01 < chunks; lpVar01++)
    {
        fileStream.read(&fileStr[0], singleChunk);
        str2Vec(fileStr, vec);
    }
    std::string remainderStr(remainder, '0');
    fileStream.read(&remainderStr[0], remainder);
    str2Vec(remainderStr, vec);
}
int main (int argc, char *argv[])
{
    std::vector<std::string> vec;
    std::string inFile(argv[1]);
    readFile(inFile, vec);
}
A: 

Try using a list instead of a vector. Vectors are (almost always) linear in memory.

Granted, the fact that you have strings inside, which are (almost always) copy-on-write and reference-counted, should make that less of a problem, but it might help.

Matt Cruikshank
A list may just use more memory. It has to store two extra pointers for each list element. It'll also be a lot slower to iterate through, due to cache misses.
jalf
Fully agreed; a list will surely use more memory than a vector if .reserve() is used with the vector.
Drakosha
I'm actually not certain, but I believe that for most C++ implementations the std::string class is *not* implemented using a ref counted, copy-on-write method.
Michael Burr
Right - his problem is that he doesn't know how many newlines there are, so he can't call reserve(). Since he can't reserve, he's potentially fragmenting memory by calling push_back on the order of a million times (assuming 200 chars per line).
Matt Cruikshank
A: 

I don't know if this is relevant because I don't really know what your file looks like.

But you should be aware that std::string is likely to have a considerable space overhead when storing a very short string. And if you're individually new-ing up a char* buffer for each very short string, you're also going to see all the per-allocation block overhead.
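As a quick illustration, here is a minimal sketch that prints the fixed per-object cost (the exact numbers are implementation-specific):

#include <iostream>
#include <string>

int main()
{
    // Fixed per-object size, paid before a single character is stored.
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << "\n";

    // Even a tiny string may reserve more capacity than it uses.
    std::string s("ab");
    std::cout << "capacity of a 2-char string: " << s.capacity() << "\n";
}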

How many strings are you putting into that vector, and what's their average length?

Will Dean
+1  A: 

Inside readFile you have at least two copies of your file: the ifstream, and the data copied into your std::vector. As long as you have the file open and you're copying it like this, it's going to be hard to get the total memory footprint down below double the file size.

Harper Shelby
But the ifstream shouldn't hold the entire file contents in memory. It's just a buffer.
Roddy
@Roddy: Sure, it's a buffer - but what size? What limits that? Since the iostream is really an abstraction over the underlying OS functions, you'd have to look into the lower-level implementations to see what *they* do when asked to open a file. I'd bet most of them load it all in memory.
Harper Shelby
@Harper !!!??? Huh? You really think that most OS file-opening routines load a whole file into memory? Really? You really think that? Why?
Will Dean
@Will Dean: just top-of-the-head notions based on stuff I haven't done in a while - I've been away from code like this for so long that I'm operating on memories of opening large files with iostreams and having it suck up memory.
Harper Shelby
+3  A: 

Your memory is being fragmented.

Try something like this:

  #include <windows.h>

  HANDLE heaps[1025];
  DWORD nheaps = GetProcessHeaps((sizeof(heaps) / sizeof(HANDLE)) - 1, heaps);

  for (DWORD i = 0; i < nheaps; ++i)
  {
    // 2 == HEAP_LFH: switch each heap to the low-fragmentation heap.
    ULONG HeapFragValue = 2;
    HeapSetInformation(heaps[i],
                       HeapCompatibilityInformation,
                       &HeapFragValue,
                       sizeof(HeapFragValue));
  }
Edouard A.
+3  A: 

If I'm reading this right, the biggest issue is that this algorithm automatically doubles the required memory.

In readFile(), you read the whole file into a set of 'singleChunk'-sized strings (chunks), and then in the last loop in str2Vec() you allocate a tempStr for every newline-separated segment of the chunk. So you're doubling the memory right there.

You've also got speed issues - str2Vec() does two passes over the chunk to find all the newlines. There's no reason you can't do that in one; see the sketch below.
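For illustration, a minimal single-pass sketch (a hypothetical rewrite, not code from the question):

#include <string>
#include <vector>

// Scan the chunk once, emitting one string per newline-terminated segment,
// instead of first collecting all newline indexes and then copying.
void str2VecOnePass(const std::string &str, std::vector<std::string> &vec)
{
    std::string::size_type begin = 0;
    std::string::size_type pos;
    while ((pos = str.find('\n', begin)) != std::string::npos)
    {
        vec.push_back(str.substr(begin, pos - begin)); // one copy per line
        begin = pos + 1;
    }
    // Bytes after the last newline (a partial line) are ignored here,
    // matching the original's behavior.
}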

AShelly
There's a lot of that code that could be replaced with more idiomatic C++, using the STL more appropriately.
Harper Shelby
+1  A: 
  1. Do not use std::list; it will require more memory than a vector.
  2. A vector does what's called "doubling": when it runs out of space, it allocates twice the memory it currently holds. To avoid that, you can use the std::vector::reserve() method, and if I'm not mistaken you can check the allocated size with the std::vector::capacity() method (note that capacity() >= size()).

Since the number of lines is not known during execution, I see no simple algorithm that avoids the "doubling" issue. Following a comment by slavy13.myopenid.com, the solution is to move the information to another, pre-reserved vector after finishing reading (the relevant question is http://stackoverflow.com/questions/253157/how-to-downsize-stdvector); a sketch follows below.
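A minimal sketch of that downsizing idea, using the classic C++03 copy-and-swap idiom (shrinkToFit is a hypothetical helper name):

#include <string>
#include <vector>

void shrinkToFit(std::vector<std::string> &lines)
{
    // The temporary copy is allocated with capacity == size; swapping hands
    // its tight buffer to 'lines', and the oversized buffer is freed when
    // the temporary is destroyed.
    std::vector<std::string>(lines).swap(lines);
}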

Drakosha
We don't know the number of lines beforehand, so reserve() would not help much; even with reserve() there is still the possibility that the vector reallocates.
You can size it down by swapping it with a temp copy after you are finished reading: http://stackoverflow.com/questions/253157/how-to-downsize-stdvector
The doubling allocation that the vector does would not double the amount of memory used by the strings - just the size of the array stored in the vector; the memory used by the strings themselves is on the heap.
1800 INFORMATION
A: 

Growing a vector by push_back() will cause memory fragmentation and inefficient memory usage. I'd try using a list instead, and only create a vector (if you need one) once you know exactly how many elements it will require.

Roddy
+1  A: 

First, how are you determining memory usage? Task Manager is not a suitable tool for that, as what it shows isn't actually memory usage.

Second, apart from your (for some reason?) static variables, the only data that does not get freed when you're done reading the file is the vector. So test its capacity, and test the capacity of each string it contains. Find out how much memory they each use. You've got the tools to determine where the memory is being spent; see the sketch below.
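For example, a rough accounting sketch (it ignores per-allocation heap overhead; reportUsage is a hypothetical helper):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

void reportUsage(const std::vector<std::string> &vec)
{
    // The vector's own buffer, plus each string's character buffer.
    std::size_t total = vec.capacity() * sizeof(std::string);
    for (std::size_t i = 0; i < vec.size(); ++i)
        total += vec[i].capacity();
    std::cout << "approx. bytes held: " << total << "\n";
}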

jalf
+2  A: 

The STL containers exist to abstract away memory operations. If you have a hard memory limit, then you can't really abstract those away.

I would recommend using mmap() to read the file (or, on Windows, MapViewOfFile()); a sketch follows.
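Since the question mentions Windows, here is a minimal read-only MapViewOfFile() sketch (error handling trimmed to bare returns):

#include <windows.h>

int main(int argc, char *argv[])
{
    HANDLE file = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(file, &size); // bytes reachable through the mapping

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL) { CloseHandle(file); return 1; }

    // Map the whole file; the OS pages it in on demand instead of the
    // program copying it into its own buffers.
    const char *data = static_cast<const char *>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    // ... scan 'data' for newlines here, without copying the file ...

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}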

Max Lybbert
+2  A: 

Another thing you could do is to load the entire file into one block of memory. Then make a vector of pointers to the first character of each line, and at the same time replace each newline with a '\0' so each line is null-terminated. (Presuming, of course, that your strings aren't supposed to have '\0' in them.)

It's not necessarily as convenient as having a vector of strings, but having a vector of const char* is potentially "just as good."
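A minimal sketch of this approach (assuming the file fits in one allocation):

#include <cstddef>
#include <fstream>
#include <vector>

int main(int argc, char *argv[])
{
    // Read the whole file into a single buffer, with room for a final '\0'.
    std::ifstream file(argv[1], std::ios::binary);
    file.seekg(0, std::ios::end);
    std::vector<char> buffer(static_cast<std::size_t>(file.tellg()) + 1);
    file.seekg(0, std::ios::beg);
    file.read(&buffer[0], static_cast<std::streamsize>(buffer.size() - 1));
    buffer[buffer.size() - 1] = '\0';

    // Replace newlines with '\0' and record where each line starts.
    std::vector<const char *> lines;
    lines.push_back(&buffer[0]);
    for (std::size_t i = 0; i + 1 < buffer.size(); ++i)
    {
        if (buffer[i] == '\n')
        {
            buffer[i] = '\0';
            lines.push_back(&buffer[i + 1]);
        }
    }
    // If the file ends in a newline, the last entry is an empty line.
}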

Matt Cruikshank
+1  A: 

I think your attempt to write your own buffering strategy is misguided.

The streams already have a very good buffering strategy implemented. If you think you need a larger buffer, you can install one into the stream without any extra code to control the buffer.

Here is what I came up with (NB: tested with a text version of the "King James Bible" that I found online):

#include <string>
#include <vector>
#include <list>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <iostream>

class Line: public std::string
{
};

std::istream& operator>>(std::istream& in,Line& line)
{
    // Relatively efficient way to copy a line into a string.
    return std::getline(in,line);
}
std::ostream& operator<<(std::ostream& out,Line const& line)
{
    return out << static_cast<std::string const&>(line) << "\n";
}

void readLinesFromStream(std::istream& stream,std::vector<Line>& lines)
{
    /*
     * Read into a list as this is flexible in memory usage and will not
     * allocate huge chunks of un-required space.
     *
     * Even with huge files the space for list will be insignificant
     * compared to the size of the data.
     *
     * This then allows us to reserve the correct size of the vector
     * Thus avoiding huge memory chunks being prematurely allocated that
     * are not required. It also prevents the internal structure from
     * being copied every time the container is re-sized.
     */
    std::list<Line>     data;
    std::copy(  std::istream_iterator<Line>(stream),
                std::istream_iterator<Line>(),
                std::inserter(data,data.end())
             );

    /*
     * Reserve the correct size in the vector.
     * then copy out of the list into the vector
     */
    lines.reserve(data.size());
    std::copy(  data.begin(),
                data.end(),
                std::back_inserter(lines)
             );
}

void readLinesFromFile(std::string const& name,std::vector<Line>& lines)
{
    /*
     * Set up the file stream and override the default buffer used by the stream.
     * Make it big because we think the istream buffer is insufficient!!!!
     */
    std::ifstream       file;
    std::vector<char>   buffer(10000);
    file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());

    file.open(name.c_str());
    readLinesFromStream(file,lines);
}


int main(int argc,char* argv[])
{
    std::vector<Line>   lines;
    readLinesFromFile(argv[1],lines);

    // Un-comment if your file is larger than 1100 lines.

    // I tested with a copy of the King James bible. 
    // std::cout << "Lines: " << lines.size() << "\n";
    // std::copy(lines.begin() + 1000,lines.begin() + 1100,std::ostream_iterator<Line>(std::cout));
}
Martin York
A: 

Maybe you should elaborate on why you need to read the whole file into memory; I suspect there is probably a way to do what you want without holding it all at once. If you really need this functionality, look into memory-mapped files, which are probably going to be more efficient than writing the equivalent yourself. Your internal data structure can then use offsets into the file. By the way, be sure to check whether you need to handle character encoding.

BeWarned
A: 

You should know that because you declared fileStream as static, it never goes out of scope, meaning the file isn't closed until the very last moment of execution. That certainly ties up some memory. You could explicitly close it just before that last str2Vec call to try to help the situation.

Also, you open and close the same file multiple times; just open it once and pass it around by reference (resetting the stream state if needed). Though I imagine you could pull off what you need with a single pass over the file.

Heck, I doubt you really need to know the file size like you do here; you could just read in chunk-sized amounts until you get a short read (at which point you are done), as in the sketch below.
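A minimal sketch of that read-until-short-read loop (readAll is a hypothetical helper):

#include <cstddef>
#include <fstream>
#include <string>

void readAll(std::ifstream &in, std::string &out)
{
    char buf[64 * 1024];
    // read() sets failbit on the final short read, but gcount() still
    // reports how many bytes that last read delivered.
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0)
        out.append(buf, static_cast<std::size_t>(in.gcount()));
}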

Why don't you explain the goal of the code, I feel there is a much simpler solution possible.

Evan Teran
A: 

I find that the best way to handle lines is to memory-map the file read-only. Do not bother writing '\0' over each '\n'; instead use pairs of const char*s, like std::pair<const char*, const char*>, or a const char* plus a count. If you need to edit the lines, a good way to do it is to make an object that can store either a pointer pair or a std::string with the modified line.

As for saving room in memory with STL vectors or deques, a good technique is to let the container double until you are done adding to it, then shrink it to its real size, which should release the unused memory back to the heap allocator. (The memory may still be allocated to the program, though I wouldn't worry about that.) Also, instead of taking the default size, start out by getting the file size in bytes, dividing by your best guess at the average characters per line, and reserving that much space at the start; a sketch follows.
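A minimal sketch of that pointer-pair indexing (indexLines and the average-line-length guess are hypothetical):

#include <cstddef>
#include <utility>
#include <vector>

typedef std::pair<const char *, const char *> LineRef;

void indexLines(const char *data, std::size_t size, std::vector<LineRef> &lines)
{
    const std::size_t guessedAvgLineLen = 60; // tune for your data
    lines.reserve(size / guessedAvgLineLen);  // pre-size from the estimate

    const char *begin = data;
    const char *end = data + size;
    for (const char *p = data; p != end; ++p)
    {
        if (*p == '\n')
        {
            lines.push_back(LineRef(begin, p)); // [begin, p) excludes '\n'
            begin = p + 1;
        }
    }
    if (begin != end)
        lines.push_back(LineRef(begin, end)); // trailing line, no newline
}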

Zan Lynx