views:

540

answers:

10

I'm using C++ and it's STL. I have a large (100MB+) text file. This file just has a lot of "words" (strings separated by whitespace) like:

sdfi sidf ifids sidf assd fdfd fdfd ddd ddd

I need to put each of these "words" in a vector:

vector<string> allWordsInFile;

So for each word that I read from the file, i do:

allWordsInFile.push_back(word);

The file has a lot of duplicate words and I'm looking for ways to save memory. Each word needs to be represented in the correct position in the vector. It would be great if I could just have a list of all the words outside the vector and then just put a reference in the vector, but it's not possible to put references in a vector as far as I know. Then I thought about just storing pointers to the words, but the length of each word is so short that I don't think it will make much of a difference? (each pointer is 4 bytes on my system and most strings would probably be about the same size).

Can someone suggest another way to approach this?

+1  A: 

Since your strings are usually around 4 bytes anyway, simply creating another level of indirection won't help, since the size of a pointer is 4 bytes (on x86, or worse 8 bytes on x64). And the size of an index based on an int would be 4 bytes as well.

Loading by parts:

You could consider loading in your file by parts to save memory. Only load in what you need based on the word position they would like to find.

You could scan the file once to build an index. This index would store the starting position of every 10th word (10 chosen arbitrarily).

Then if you want to access word 11, you would calculate 11 divided by 10 to get the position in the index for the group's starting position, and seek to the found starting position. Then calculate 11 modulo 10 to find out how many words to read from that index to get the desired word.

This method does not try to eliminate storing duplicate strings, but it limits the RAM you need to use to only the size of your index. You can adjust the "every 10 words" above to "every X words" to reduce memory consumption. So your size used in RAM would be only (num_words_in_file/X)*sizeof(int) which is MUCH smaller than the size of storing the entire file in RAM, even if you stored each unique string only once.

Accessing each word with no extra space:

If you pad each word with a certain character so that each word is the same size, then ignore the padding character when you read in. You could access the exact word without an extra pass through phase to build the index and with virtually no extra space.

Brian R. Bondy
A: 

I think using some thing like this will save the memory:

struct WordInfo
{
    std::string m_word;
    std::vector<unsigned int> m_positions;
};

typedef std::vector<WordInfo> WordVector;

First find whether the word exists in WordVector

If no,
    create WordInfo object and push back into WordVector
else
    get the iterator for the existing WordInfo
    Update the m_positions with the position of the current string
Naveen
Or you could just use std::pair.
aib
+3  A: 

If you don't have a lot of words, you can store the words in an outside array and store the corresponding indices in your words vector. Depending on how many unique words there are you can have only 1 (for max 256 words) or 2 (max 65536 words) bytes per word.

If you desire speed, you can use an std::map to look up the index of a string in log(n) time (instead of iterating over the outside array)

e.g. for max. 65536 unique words

vector<short> words
map<string,short> index
vector<string> uniqueWords
cindex = 0
for all words
    read the word
    if index[word] does not exist
     index[word] = cindex++
     uniqueWords.push_back(word)
    words.push_back(index[word]);

To retrieve the original words, just look it up in uniqueWords.

v3
The goal is to use less memory. And since each index stores a short, or int if you need more than 65536 entries, I think this solution would use more RAM. Because the size of your index table is just as big as the size of the entire list of words. Plus you have to store the actual unique words on top of that. If the poster had long strings in his file, then this would be great. But because he has an average of 4 byte strings, I think this would consume more memory.
Brian R. Bondy
+1  A: 

One way could be to store one vector just containing only the unique strings. Then, the "words" list is just a vector of integers being the index into the array of unique strings. This would save memory at the cost of being slower to read in the file since you would have to do some kind of linear scan in the uniques array for each new word. You could then use a map as an index into the array of unique strings - if the new word is not found in the set, then you know to add the word at the end of the list of uniques. Hey come to think of it, you don't even need the vector since the map serves that purpose:

typedef map<string, int> UniqueIndex;
UniqueIndex uniqueIndex;

typedef vector<int> WordsInFile;
WordsInFile wordsInFile;

for (each word in the file)
{
  UniqueIndex::const_iterator it=uniqueIndex.find(word);
  int index; // where in the "uniqueIndex" we can find this word
  if (it==uniqueIndex.end())
  {
    // not found yet
    index=uniqueIndex.size();
    uniqueIndex[word]=index;
  }
  else
    index=it.second;
  wordsInFile.push_back(index);
}
1800 INFORMATION
How would it save memory if you need to store each index which is 4 bytes. And each of his strings is around 4 bytes anyway? So you need to keep 4 bytes * the number of words + the size of the unique strings.
Brian R. Bondy
well if the number of unique strings was small you could use 16 bit integers...I thought your solution was better than mine though - I voted you up
1800 INFORMATION
+7  A: 

boost::flyweight looks useful here.

In fact the tutorial example shows boost::flyweight<std::string> being used to compress duplicates of names in a database.

timday
I've added a separate more detailed answer which compares the efficency of containers of strings, flyweights and the extreme option of unioning four characters with a char*.
timday
A: 

In case it's possible for you not to use a vector - another possibility, similar to some solutions above but with only one structure instead of two, would be to have a map of words to an integer list, each integer representing the position, and a count variable that increments each time you read a word:

   int count = 0;
   Map<String, List<Integer>> words = new HashMap<String, List<Integer>>();

Then it goes something like (Java-type pseudocode):

   for (word : yourFile) {
      List<Integer> positions = words.get(word);
      if (values == null) {
         positions = new ArrayList<Integer>();
      }
      positions.add(++count);
      words.put(word, positions);
   }
JRL
std::map<std::string, std::vector<int> > words; int position = 0; for (each word in the file) { words[word].push_back(position); ++position; }
+2  A: 

Amm , what you actually want to do is called compression.

Huffman coding will probably can do a good job here. You do one scan to build the frequency table of the words , and then apply the Huffman algorithm to attach each word a symbol. then you compose row of bits which represent the words with the appropiate order. this row of bits can be considered as your "Low Memory Vector".

The nature of huffman coding let you access the symbol at any location you want (no symbol is a prefix of another symbol) , the problem here is that access time will be O(n) There are some optimization that can reduce the access time but only by a constant factor, nothing can prevent it from being O(n) and still preserve small memory usage. If you want to hear about an optimization that can be done , leave me a comment.

the drawbacks :

  1. after you have built the encoded words , the access in O(n) , you have to linearly scan the symbols till you get to the appropriate position.
  2. Implementation of this idea is not trivial at all and will consume lot of your time.

Edit: One thing I didn't think about when writing the post , you have to hold the lookup table. So this may only work if you have lot of repetitive words.

Huffman Coding: http://en.wikipedia.org/wiki/Huffman_coding

"So this may only work if you have lot of repetitive words." If he does not, he is lost. There is no way how to compress random data.
Suma
I was only suggesting an idea , I didn't say he have to implement it, He will decide whether it appropiate his data.
I did not intend to criticise your idea. I think you are right. If the words are random he is lost not because your idea is bad, but because he is lost in principle and there is nothing anyone can do. More likely then not, his data are not random, though.
Suma
A: 

First of all, we should know what the strings are like:

if "most strings are 4 letters" and the file is 100MB, then

a) there must be so many duplicates that maybe you are better off storing the strings that are not in the array (specially if you can ignore the case), but that would not give you their position in the vector.

b) Maybe there is a way to squeeze from ascii 8 bit (assuming it is indeed ASCII) (8X4=32) to maybe 20 bits (5x4), using 5 bits per letter, and with some fancy bit working that would reduce the size of the vector. Please run a sample of the data and see how many different letters are there indeed in the files, maybe some groups of letters are so abundant that it makes sense to assign them a special value (of the 32 options in the 8 bit sequence). Actually, if i am correct, if most words are convertable to a 20 bit representation, then all that is needed is an array with 3MB to store all the words and their word count - and handle the >4 chars separately (assuming 3 bytes is enough for the word count, which it should, maybe 2 bytes is enough: could make it dynamic for a total of 2MB used)

c) another nice option is, i think someone else said it above, to just concateate a string with the letters and run a compressor on it, the bad thing is the cpu load and the temporary memory it probably needs to compres//decompress. Apart from that should work very well

d) if you really want to minimize the ram used, maybe you want to use your disk: sort the words (you can use the disk if dont have enough ram), and create a temporary file with the words one after another, that would do for sequential access. Then you can create a one-off tree-like representation of the words with the leaves containing a relative address to the words in the file for random access, and "serialize" it to disk. In the end, as most words are 4 character longs, with 5 hops you would get the position of any letter in the file without using any ram, so to speak. You can also cache in ram the first 2 or 3 layers of the tree, which will be light ram-wise, to reduce the hopping to 2 or 3 hops for 4 character words. Then you can use some little ram to cache most used words, and do all kind of niceties to speed up the access.

it is pretty late, hope i am making sense...

pd. thnx for the comments guys

LSalvo
I started with describing a trie too, but then removed it because he needs to preserve the word number, and to do that you need to store an equivalent amount of memory (since each string is ~ 4 bytes)
Brian R. Bondy
Besides of tries, Directed Acyclic Word Graph would be probably even better suited, but again, it does not preserve the order.
Suma
+1  A: 

You need to specify what operations you need to be fast on the vector, otherwise it is impossible to design a proper solution. Will you need mostly random access, or mostly sequential? What performance is acceptable for random access?

To illustrate my point: one way to store your data is to simply compress them using LZMA or other good compression library. Then when you need to access some element, you decompress, discarding decompress data as soon as decompression no longer needs them. Such storage will be very space efficient, sequential access will be reasonable fast, but random access time will be very bad.

Suma
A: 

After pointing you at boost::flyweight in another answer, I wanted to take a closer look at the relative efficiency of containers of strings, flyweights and the "nuclear option" of unioning four chars with a pointer (class "sillystring" in the code below).

Notes on the code:

  • Works on 32-bit Debian/squeeze with g++ 4.3.3 and boost 1.38.0
  • Uses std::deque instead of std::vector because vector's size-doubling behaviour (c.f deques' chunks) gives an (arguably) misleading impression of inefficiency if the test case just happened to have doubled recently.
  • The "sillystring" uses the LSB of the pointer to distinguish the pointer use-case from the local chars case. This should work unless your malloc allocates on odd-byte boundaries (in which case the code will throw) (certainly haven't seen this on my system; YMMV). Before anyone feels the need to point it out, yes, I do consider this horrible dangerous hacky code and not an option to be chosen lightly.

Results vary depending on the distribution of word lengths, but for a distribution "(2D6+1)/2" (so peaked at 4, but with lengths from 1 to 6), the efficiencies (defined as the ratio between actual memory consumption and the actual number of characters needing to be stored) are:

  • 12.4% deque<string>
  • 20.9% deque<flyweight<string> >
  • 43.7% deque<sillystring>

If all your words were 4 characters (change to const int length=4; in the word generator), which is the ideal case for sillystring, then you get:

  • 14.2% deque<string>
  • 77.8% deque<flyweight<string> >
  • 97.0% deque<sillystring>

So flyweight is certainly a quick improvement, but you can do better by capitalising on the ability of your words to fit into a pointer-sized space and avoid additional heap overhead.

Here's the code:

// Compile with "g++ -O3 -o fly fly.cpp -lpthread"
// run "./fly 0 && ./fly 1 && ./fly 2" 

#include <boost/flyweight.hpp>
#include <boost/format.hpp>
#include <boost/random.hpp>
#include <cstring>
#include <deque>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#include <sys/types.h>
#include <unistd.h>

#define THROW(X,MSG) throw X(boost::str(boost::format("%1%: %2%") % __PRETTY_FUNCTION__ % MSG))

struct random_word_generator
{
  random_word_generator(uint seed)
    :_rng(seed),
     _length_dist(1,6),
     _letter_dist('a','z'),
     _random_length(_rng,_length_dist),
     _random_letter(_rng,_letter_dist)
  {}
  std::string operator()()
  {
    std::string r;
    const int length=(_random_length()+_random_length()+1)/2;
    for (int i=0;i<length;i++) r+=static_cast<char>(_random_letter());
    return r;
  }
private:
  boost::mt19937 _rng;
  boost::uniform_int<> _length_dist;
  boost::uniform_int<> _letter_dist;
  boost::variate_generator<boost::mt19937&,boost::uniform_int<> > 
    _random_length;
  boost::variate_generator<boost::mt19937&,boost::uniform_int<> > 
    _random_letter;
};

struct collector
{
  collector(){}
  virtual ~collector(){}

  virtual void insert(const std::string&)
    =0;
  virtual void dump(const std::string&) const
    =0;
};

struct string_collector
  : public std::deque<std::string>, 
    public collector
{
  void insert(const std::string& s)
  {
    push_back(s);
  }
  void dump(const std::string& f) const
  {
    std::ofstream out(f.c_str(),std::ios::out);
    for (std::deque<std::string>::const_iterator it=begin();it!=end();it++)
      out << *it << std::endl;
  }
};

struct flyweight_collector 
  : public std::deque<boost::flyweight<std::string> >, 
    public collector
{
  void insert(const std::string& s)
  {
    push_back(boost::flyweight<std::string>(s));
  }
  void dump(const std::string& f) const
  {
    std::ofstream out(f.c_str(),std::ios::out);
    for (std::deque<boost::flyweight<std::string> >::const_iterator it=begin();
         it!=end();
         it++
         )
      out << *it << std::endl;
  }
};

struct sillystring
{
  sillystring()
  {
    _rep.bits=0;
  }
  sillystring(const std::string& s)
  {
    _rep.bits=0;
    assign(s);
  }
  sillystring(const sillystring& s)
  {
    _rep.bits=0;
    assign(s.str());
  }
  ~sillystring()
  {
    if (is_ptr()) delete [] ptr();
  }
  sillystring& operator=(const sillystring& s)
  {
    assign(s.str());
  }
  void assign(const std::string& s)
  {
    if (is_ptr()) delete [] ptr();
    if (s.size()>4)
      {
        char*const p=new char[s.size()+1];
        if (reinterpret_cast<unsigned int>(p)&0x00000001)
          THROW(std::logic_error,"unexpected odd-byte address returned from new");
        _rep.ptr.value=(reinterpret_cast<unsigned int>(p)>>1);
        _rep.ptr.is_ptr=1;
        strcpy(ptr(),s.c_str());
      }
    else
      {
        _rep.txt.is_ptr=0;
        _rep.txt.c0=(s.size()>0 ? validate(s[0]) : 0);
        _rep.txt.c1=(s.size()>1 ? validate(s[1]) : 0);
        _rep.txt.c2=(s.size()>2 ? validate(s[2]) : 0);
        _rep.txt.c3=(s.size()>3 ? validate(s[3]) : 0);
      }
  }
  std::string str() const
  {
    if (is_ptr())
      {
        return std::string(ptr());
      }
    else
      {
        std::string r;
        if (_rep.txt.c0) r+=_rep.txt.c0;
        if (_rep.txt.c1) r+=_rep.txt.c1;
        if (_rep.txt.c2) r+=_rep.txt.c2;
        if (_rep.txt.c3) r+=_rep.txt.c3;
        return r;
      }
  }
private:
  bool is_ptr() const
  {
    return _rep.ptr.is_ptr;
  }
  char* ptr()
  {
    if (!is_ptr())
      THROW(std::logic_error,"unexpected attempt to use pointer");
    return reinterpret_cast<char*>(_rep.ptr.value<<1);
  }
  const char* ptr() const
  {
    if (!is_ptr()) 
      THROW(std::logic_error,"unexpected attempt to use pointer");
    return reinterpret_cast<const char*>(_rep.ptr.value<<1);
  }
  static char validate(char c)
  {
    if (c&0x80)
      THROW(std::range_error,"can only deal with 7-bit characters");
    return c;
  }
  union
  {
    struct
    {
      unsigned int is_ptr:1;
      unsigned int value:31;
    } ptr;
    struct
    {
      unsigned int is_ptr:1;
      unsigned int c0:7;
      unsigned int :1;
      unsigned int c1:7;
      unsigned int :1;
      unsigned int c2:7;      
      unsigned int :1;
      unsigned int c3:7;      
    } txt;
    unsigned int bits;
  } _rep;
};

struct sillystring_collector 
  : public std::deque<sillystring>, 
    public collector
{
  void insert(const std::string& s)
  {
    push_back(sillystring(s));
  }
  void dump(const std::string& f) const
  {
    std::ofstream out(f.c_str(),std::ios::out);
    for (std::deque<sillystring>::const_iterator it=begin();
         it!=end();
         it++
         )
      out << it->str() << std::endl;
  }
};

// getrusage is useless for this; Debian doesn't fill out memory related fields
// /proc/<PID>/statm is obscure/undocumented
size_t memsize()
{
  const pid_t pid=getpid();
  std::ostringstream cmd;
  cmd << "awk '($1==\"VmData:\"){print $2,$3;}' /proc/" << pid << "/status";
  FILE*const f=popen(cmd.str().c_str(),"r");
  if (!f)
    THROW(std::runtime_error,"popen failed");
  int amount;
  char units[4];
  if (fscanf(f,"%d %3s",&amount,&units[0])!=2)
    THROW(std::runtime_error,"fscanf failed");
  if (pclose(f)!=0)
    THROW(std::runtime_error,"pclose failed");
  if (units[0]!='k' || units[1]!='B')
    THROW(std::runtime_error,"unexpected input");
  return static_cast<size_t>(amount)*static_cast<size_t>(1<<10);
}

int main(int argc,char** argv)
{
  if (sizeof(void*)!=4)
    THROW(std::logic_error,"64-bit not supported");
  if (sizeof(sillystring)!=4) 
    THROW(std::logic_error,"Compiler didn't produce expected result");

  if (argc!=2)
    THROW(std::runtime_error,"Expected single command-line argument");

  random_word_generator w(23);

  std::auto_ptr<collector> c;
  switch (argv[1][0])
    {
    case '0':
      std::cout << "Testing container of strings\n";
      c=std::auto_ptr<collector>(new string_collector);
      break;
    case '1':
      std::cout << "Testing container of flyweights\n";
      c=std::auto_ptr<collector>(new flyweight_collector);
      break;
    case '2':
      std::cout << "Testing container of sillystrings\n";
      c=std::auto_ptr<collector>(new sillystring_collector);
      break;
    default:
      THROW(std::runtime_error,"Unexpected command-line argument");
    }

  const size_t mem0=memsize();

  size_t textsize=0;
  size_t filelength=0;
  while (filelength<(100<<20))
    {
      const std::string s=w();
      textsize+=s.size();
      filelength+=(s.size()+1);
      c->insert(s);
    }

  const size_t mem1=memsize();
  const ptrdiff_t memused=mem1-mem0;

  std::cout 
    << "Memory increased by " << memused/static_cast<float>(1<<20)
    << " megabytes for " << textsize/static_cast<float>(1<<20)
    << " megabytes text; efficiency " << (100.0*textsize)/memused << "%"
    << std::endl;

  // Enable to verify all containers stored the same thing:
  //c->dump(std::string(argv[1])+".txt");

  return 0;
}
timday