tags:

views:

1043

answers:

4

Ok so before I even ask my question I want to make one thing clear. I am currently a student at NIU for Computer Science and this does relate to one of my assignments for a class there. So if anyone has a problem read no further and just go on about your business.

Now for anyone who is willing to help heres the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (ex : "can't" would end up as "can" and "that--to" would end up as "that" obviously with out the quotes, quotes were used just to specify what the example was).

The problem I've run into is that I can clean the string fine and then insert it into the map that we are using but for some reason with the code I have written it is allowing an empty string to be inserted into the map. Now I've tried everything that I can come up with to stop this from happening and the only thing I've come up with is to use the erase method within the map structure itself.

So what I am looking for is two things, any suggestions about how I could a) fix this with out simply just erasing it and b) any improvements that I could make on the code I already have written.

Here are the functions I have written to read in from the file and then the one that cleans it.

Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.

Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did I'll listen. At school we really don't get feed back on the correct, proper, or most efficient way to do things.

int get_words(map<string, int>& mapz)
{
 int cnt = 0;               //set out counter to zero

 map<string, int>::const_iterator mapzIter;

 ifstream input;            //declare instream
 input.open( "prog2.d" ); //open instream
 assert( input );           //assure it is open

 string s;                  //temp strings to read into
 string not_s;

 input >> s;

 while(!input.eof())        //read in until EOF
  {
   not_s = "";
   clean_entry(s, not_s);

   if((int)not_s.length() == 0)
    {
     input >> s;
     clean_entry(s, not_s);
    }    

   mapz[not_s]++;              //increment occurence
   input >>s;
  }
 input.close();     //close instream 

 for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
  cnt = cnt + mapzIter->second;

 return cnt;        //return number of words in instream
}


void clean_entry(const string& non_clean, string& clean)
{
 int i, j, begin, end;

 for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);

 begin = i;

 if(begin ==(int)non_clean.length())
   return;

 for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);

 end = j;

 clean = non_clean.substr(begin, (end-begin));

 for(i = 0; i < (int)clean.size(); i++)
  clean[i] = tolower(clean[i]);

}
+7  A: 

The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:

not_s = "";
clean_entry(s, not_s);

if((int)not_s.length() == 0)
 {
  input >> s;
  clean_entry(s, not_s);
 }    

mapz[not_s]++;              //increment occurence
input >>s;

to

not_s = "";
clean_entry(s, not_s);

if((int)not_s.length() > 0)
{
    mapz[not_s]++;              //increment occurence
}    

input >>s;

EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.

Chris Marasti-Georg
Quite so. Another option would be a do-while loop: inputting, cleaning, and looping while length == 0
Tom Ritter
AviewAnew, assuming I understand you correctly, that would not work if a word starting with a punctuation mark was found in the text body.
Chris Marasti-Georg
Actually, it would break on a work consisting entirely of non-alpha characters.
Chris Marasti-Georg
while(!input.eof()) { not_s = ""; do { input>>s; clean_entry(s, not_s); }while(not_s.length == 0); mapz[not_s]++ }It gets ugly in the case that the file ends with a punctuation mark, in which case you need to add !eof to the do, and an if to the map++
Tom Ritter
The file does in fact end in "???"
Brandon Haugen
+1  A: 

A blank string is a valid instance of the string class, so there's nothing special about adding it into the map. What you could do is first check if it's empty, and only increment in that case:

if (!not_s.empty())
    mapz[not_s]++;

Style-wise, there's a few things I'd change, one would be to return clean from clean_entry instead of modifying it:

string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
    string clean;
    ... // as before 
    if(begin ==(int)non_clean.length())
        return clean;
    ... // as before
    return clean;
 }

This makes it clearer what the function is doing (taking a string, and returning something based on that string).

Eclipse
I like your idea and if I had total control over writing the program I would most likely do it how you say, but I am instead constrained to the outline they give us where they have defined the headers for the function we are to implement. Thanks though.
Brandon Haugen
+2  A: 

Further improvements would be to

  • declare variables only when you use them, and in the innermost scope
  • use c++-style casts instead of the c-style (int) casts
  • use empty() instead of length() == 0 comparisons
  • use the prefix increment operator for the iterators (i.e. ++mapzIter)
MP24
What is the benefit of moving the ++ to the beginning?
Chris Marasti-Georg
mapzIter++ returns a copy of the iterator first, then increments the iterator. ++mapzIter simply increments the iterator.
Eclipse
That is another question answered here: http://stackoverflow.com/questions/24901/is-there-a-performance-difference-between-i-and-i-in-c
Richard Corden
+1  A: 

The function 'getWords' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into it's individual parts, you would have found the bug yourself.

From the basic structure, I think you could split the code into (at least):

  • getNextWord: Return the next (non blank) word from the stream (returns false if none left)
  • clean_entry: What you have now
  • getNextCleanWord: Calls getNextWord, and if 'true' calls CleanWord. Returns 'false' if no words left.

The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:

bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);

The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.

The main component of 'getWords' then can be simplified down to:

std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
  ++map[nextCleanWord];
}

An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.

Richard Corden