tags:

views:

329

answers:

5

Ok. So I have this function, init():

void init()
{
fstream file;
int index = 0;

char temp_list[60000][15];

listlen = 0;
current_index = 0;

file.open("en_US.dic");
while(!file.eof())
{   
    file >> temp_list[index];
    index++;
}

listlen = index;
file.close();
file.open("en_US.dic");

word_list = new char*[listlen];

int count = 0;
for(int i = 0; i < listlen; i++)
{
    word_list[i] = new char[21];
    file >> word_list[i];
}

file.close();
}

This code compiles and runs correctly with no errors. However, when I change the line

word_list[i] = new char[21]

to

word_list[i] = new char[x] //x < 21

I get the following error:

dict: malloc.c:3074: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.

I'm somewhat new to programming (<2 years), and I've never seen anything like this. Anyone have any ideas? Thanks in advance!

+1  A: 

If the file has words of length 20 or more, file >> word_list[i] will write past the end of the allocated buffer, which can result in the error you saw. This is called a buffer overflow.

This is also an issue when writing to temp_list, but in that case the buffer overflow is less damaging since it will probably just overwrite the memory used for the next word.

One way to fix this is to use an array of std::string instead of char * - allocation will be handled automatically that way.

interjay
+4  A: 

I'm guessing that one of your words is longer then the value specified in x.

When this happens, you are going to overflow your malloc buffer.

If you allocate N bytes, you need to make sure you write no more then N bytes.

Using operator>> and char buffers is a recipe for disaster. operator>> will keep reading/writing until it reaches the word separator. Since operator>> doesn't know how big the char * buffer is, it will overflow the buffer when the word is longer then the buffer. If you want to use operator>> to extract words, use std::string.

What is happening

A very common way to implement malloc is to have bookkeeping data in between the buffers returned from malloc. When you overwrite this data, the assumptions that malloc has made about the structure of the data no longer exist.

So, malloc has something like this:

+------------------+-------------+------------------+-------------+-----------
| malloc internals | user buffer | malloc internals | user buffer | etc...
+------------------+-------------+------------------+-------------+-----------

So, if you allocated 8 bytes to user buffer, but then write 12 bytes, you've just trashed the first 4 bytes of the next malloc internals record.

R Samuel Klatchko
Ahhh. Thanks. I've begun redesigning from scratch based on Roger Pate's advice, but this makes much more sense.
Dane Larsen
+5  A: 

There are three major issues with that code, two of them here:

while (!file.eof())
{   
    file >> temp_list[index];
    index++;
}

You cannot test file.eof() to see if the next operation would fail, only if the previous hit eof, and that's generally only useful if it failed, so change it to:

while (file >> temp_list[index]) {
    index++;
}

As extractions (>>) return the stream and the stream is testable directly, this code is now testing the stream on every iteration and only incrementing index if the extraction was successful.

Now when extracting into a char array, input streams stop at whitespace, but they don't know the maximum length they can store unless you tell them. This same error later in the code is probably why you see what you do, because I suspect you're reading far more data than you expect, and thus trampling all over your memory. Fixed:

while (file >> std::setw(15) >> temp_list[index]) {
    index++;
}

However, the last major problem is you allocate resources and leak them, so use vector and string instead:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

void init() {
  typedef std::vector<std::string> C; // for later convenience
  C words;
  {
    ifstream file ("en_US.dic");
    if (!file) {
      std::cerr << "could not open file\n";
      // handle error: throw an exception, call abort(), etc.
    }
    for (std::string word; file >> word;) {
      words.push_back(word);
    }
    // if you want to read lines instead:
    //for (std::string line; std::getline(file, line);) {
    //  words.push_back(line);
    //}
  }
  // now use words[0] through words[words.size() - 1]
  std::cout << "Read " << words.size() << " words:\n";
  for (int i = 0; i < words.size(); ++i) {
    std::cout << "  " << words[i] << '\n';
  }
  std::cout << "Output again:\n";
  for (C::const_iterator i = words.begin(); i != words.end(); ++i)
  {
    std::cout << "  " << *i << '\n';
  }
}
Roger Pate
+1 For noting the EOF issue, and suggesting `std::string` and `std::vector`.
Thomas Matthews
Wow! Thanks a ton! I knew very little about vectors before this. My experience with the stl is limited, so this should throw me in a bit further. Thanks again!
Dane Larsen
A: 

You may want to change your design here. Dictionaries are huge.
Do you need to haul all the words (data) into memory?

Since dictionaries are huge, they are designed so they don't need to be entirely in memory at the same time. Professional dictionaries have index tables which are smaller than the whole data file. The principle idea is that index tables are small and can be hauled into memory and kept in memory rather than hauling in all the data at once.

I optimized a program by keeping the initial index table into memory. The result of the first index table is a file offset to another table (or the name of another file). This secondary table would be hauled into memory, if necessary, and so on until the exact item was found.

See the following topics (search the web):

  • B+ Tree
  • Index tables
  • Block I/O
  • File Offsets
Thomas Matthews
Basically, I'm making a program for scrabble, so I need to search the entire dictionary pretty much every time. Thanks for the references though.
Dane Larsen
A: 

This will really mess up:

for(int i = 0; i < listlen; i++)
{
    word_list[i] = new char[21];
    file >> word_list[i];
}

If any of the words are bigger than 20 characters (+1 for the '\0'). Then basically you would be scribbling on memory that wsa used by the memory manager. This will cause all sorts of problems with subsequent allocations and de-allocations.

It worked in the previous loop because the buffer was contiguous:

char temp_list[60000][15];

Though a word from one line may have overlapped onto the next line, this would not have been a problem unless you were actually reading a large word into temp_list[59999] (which would overlap onto another variable).

Martin York
I think you're right. I'm completely redesigning the program though, so I shouldn't have any problems.Thanks!
Dane Larsen