tags:

views:

123

answers:

3

I have a part of a code that does the following: It reads in sentences from a file in a particular format, puts them in a vector. To probe whether the strings in the vector are stored correctly, I put debugging cout statements. I found that the last string member member of the vector is "". Why is this so? The file I am reading from ends with the last floating point value (that is stored in weight in each iteration). There is no whitespace or \n after that. I am pasting that part of the code in the form of a separate program below.

#include <iostream>
#include <stdio.h>
#include <string>
#include <vector>

using namespace std;


int dist=0;

void stringtolower(char *s)

{

 int i=0;

 char c;

 while(s[i]!='\0')

 {

  c=s[i];

  c=tolower(c);

  s[i]=c;

  i++;

 }

}



void cleanup(char *s)

{
 int i=0;
 dist=0;
 while(*(s+i)=='\r' || *(s+i)=='\n' || *(s+i)=='\t')
 {
  dist++;
  i++;
 }

 while(*(s+i)!='\0'){

    /*if(*(s+i)=='"' || *(s+i)=='`' || *(s+i)=='\'' || *(s+i)=='.')

      *(s+i)=' ';*/

  if(*(s+i)==':' || *(s+i)=='\t' || *(s+i)=='\n' || *(s+i)=='\r' || *(s+i)=='"' || *(s+i)=='`' ){

   *(s+i)='\0';

   break;

  }

  i++;

 }

 return; 

}





int isinlist(vector<string> sents, char *s){

 for(int i=0;i<sents.size();i++){

  if(!sents[i].compare(s)){

   return 1;

  }

 }

 return 0;

}

int main()
{
 char *s=NULL;
 FILE *fp;
 fp=fopen("1.txt","r");
 size_t len=0;
 ssize_t read;
 vector<string> sents;
 float weight;
 while(!feof(fp))
 {
  read=getdelim(&s,&len,':',fp);

  cleanup(s);
  s=s+dist;

  fscanf(fp,"%f",&weight);


  if(isinlist(sents,s)){

   continue;

  }
  stringtolower(s);
  string str(s);

  //sentences.push(str); // Push sentence into FIFO queue for later processing
  sents.push_back(str);
 }
 for(int i=0;i<sents.size();i++)
 {
  cout<<sents[i]<<endl;
 }
}

Thanks a lot for your help.

+1  A: 

You are testing EOF and this does not guarantee that there is any data left for reading. Don't do it.

http://stackoverflow.com/questions/216068/getting-integers-from-a-string-created-by-getline-in-c/1958370#1958370

See my response there for further information. You should also use std::getline and std::ifstream instead of C file I/O.

Tronic
+2  A: 

Because you are not handling end of file (eof) correctly.

You can only tell that you've reached the eof when you've tried to read beyond the end of the file. Consider the case of a 0 length file. When that happens, this will be the case.

FILE *fp = fopen(..., "r");
assert(!feof(fp));  // guaranteed, even if the file is 0 length

That is, even though there is no more data, feof will not return true until it actually tries to read the next byte.

What you need to do is detect end of file during reading. For example:

FILE *fp = fopen(..., "r");
char buffer[SIZE];
while (fgets(buffer, sizeof(buffer), fp) != NULL)
{
    // got some data, do something with it.
}

// fgets returned NULL, now let's check if it was because
// we got to the eof or had an error
if (feof(fp))
    // got to the end
else
    // got an error 

If getdelim is written properly, it should return an indicator when it has reached end-of-file. There are two different ways it may be written:

  1. It only returns indicator provided it hasn't already read any data when it reaches EOF
  2. It always returns indicator when it reaches EOF.

If the former, you want to structure your code like:

while (getdelim(&s,&len,':',fp) != GET_DELIM_EOF_VALUE)

If the latter, you will need something like:

while ((getdelim(&s,&len,':',fp) != GET_DELIMI_EOF_VALUE) ||
       (len != 0))
R Samuel Klatchko
thanks for that... I am using the getdelim function specified in the C library, not a function of my own. DO you know what it returns when it reaches the end-of-file?
assassin
hey... i got to know what the problem is... I made the changes as you had recommended, and while debugging in gdb realized that after reading all the lines in the file, for the next iteration of the loop it reads in a '\n' character that is stored in s and cleanup(s) returns "". But I have checked the file manually and I remember not putting any newline after all the lines. Can you please tell me hat the problem here is?
assassin
btw.. i forgot telling you, the getdelim function I am using is the same as in the GNU C Library. You could check out http://www.triggertek.com/m/g/getdelim.3 or http://www.gnu.org/s/libc/manual/html_node/Line-Input.html for more info on it. getdelim returns -1 for a read error or on reaching end-of-file.
assassin
It's normal (although not required) for files to have a newline as the last character (text editors such as emacs/vim will add it there for you). Rather then assume that the newline is a problem, you should write your code to allow it to be there.
R Samuel Klatchko
+2  A: 

Some general tips:

Avoid global variables. The dist value is calculated purely within cleanup - it should be local to that function, and then returned from it, so the main function can use it.

Consider advancing pointers instead of using array-style indexing, to reduce the number of variables you need:

void stringtolower(char *s)
{
    char c;

    while (*s != '\0')
    {
        c = *s;
        c = tolower(c);
        *s = c;

        s++;
    }
}

And declare variables as close as possible to where they are used, and initialise them as you declare them:

void stringtolower(char *s)
{
    while (*s != '\0')
    {
        char c = *s;
        c = tolower(c);
        *s = c;

        s++;
    }
}

And avoid making temporary copies where they add no extra clarity:

void stringtolower(char *s)
{
    while (*s != '\0')
    {
        *s = tolower(*s);
        s++;
    }
}

And consider using for to express the usual iteration pattern:

void stringtolower(char *s)
{
    for (; *s != '\0'; s++)
        *s = tolower(*s);
}

Here's a similar job done on cleanup:

int cleanup(char *s)
{
    char *p = s;
    for (; *p == '\r' || *p == '\n' || *p =='\t'; p++);

    int dist = p - s;

    for (; *p != '\0'; p++) 
    {
        if (*p == ':' || 
            *p == '\t' || 
            *p == '\n' || 
            *p == '\r' || 
            *p == '"' || 
            *p == '`' ) 
        {
            *p = '\0';
            break;
        }
    }

    return dist;
}

Pick a single way of laying out braces and stick to it.

Consider using std::find from <algorithm> instead of your isinlist.

On the other hand, for keeping a list like that so you can search for values previously handled, use std::set instead of std::list. It has a built-in find function that will work much faster than a linear search:

std::set<std::string> sent;

...

if (sent.find(x) != sent.end())
    continue;

sent.insert(x);

Prefer std::string to represent intermediate string values. You can use character pointers for convenient manipulation, but you might as well write code the safe way until you have proof that it is a significant cause of your program running slowly.

Use std::ifstream to read input from a file. It will close the file after use automatically, which you forget to do with fclose.

If you do all these things, your program will be a lot shorter and more readable, and so easier for you to find out when you've got something wrong.

Daniel Earwicker
thanks a lot for all those tips.... will implement them in my code.
assassin