I have to read a text file, but one particular file is giving me trouble. Not only is it huge (an entire ebook), it also contains several accented letters. I read the words in one character at a time, stopping on punctuation or spaces. I do this by testing each character against the ASCII codes for letters and for punctuation such as the apostrophe. Is there a way I can read in the accented letters as well, but keep them separate from the other letters? Do I need to add any extra libraries?

Here is my code to get the word:

string GetNextWord(){
    string w="";                            // used to store each word temporarily
    char c;                                 // used for each individual character
    int i=0;                                // counts the characters added to the word
    input.get(c);                           // gets first character
    c=tolower(c);                           // forces c to lowercase

    while((c>=97 && c<=122) || c==39){      // loops while the character is a lowercase letter or '
        w=w+c;                              // adds character to word string
        input.get(c);                       // gets next character
        c=tolower(c);                       // forces c to lowercase
        ++i;                                // increments counter
    }
    if(i>0)                                 // if there is a word
        return w;                           // return the word
    else                                    // otherwise no word was read
        return "NOT A WORD!";               // returns a flag to main
}

It works on every file so far except this one.
You can see the input here: http://www.gutenberg.org/cache/epub/244/pg244.txt

A: 

Accented characters fall outside the 7-bit ASCII range, i.e. they have codes above 127. You're not clear on what "works on every file so far" means, but looking at the code above, if you're running into accented characters, my guess is that you're entering an infinite loop. To handle the extended characters correctly, you will need to know which code page (encoding) the file uses. I'm also unsure whether std::tolower handles extended characters correctly, at least not without being told what the locale/code page is.
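As a rough illustration of the points above, here is a minimal sketch, not a definitive fix. It assumes the file is Latin-1 or UTF-8, so every byte above 127 can be treated as part of an accented letter. IsHighByte, SplitWords and the hasAccent flag are hypothetical names introduced only for this example. Two things differ from the code in the question: the result of get() is checked, so the loop ends at end of file, and each byte is converted to unsigned char before it is passed to std::isalpha or std::tolower. High bytes are kept as they are, and the word is flagged so it can be handled separately.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true for any byte outside 7-bit ASCII.
// Assumption: in this file such bytes only occur inside accented letters.
bool IsHighByte(unsigned char b) { return b >= 128; }

// Reads the next word from `in` into `word`.
// `hasAccent` is set when at least one non-ASCII byte was kept.
// Returns false once the stream is exhausted and no word was read.
bool GetNextWord(std::istream& in, std::string& word, bool& hasAccent) {
    word.clear();
    hasAccent = false;
    char raw;
    while (in.get(raw)) {                        // a failed read ends the loop
        unsigned char b = static_cast<unsigned char>(raw);
        if (IsHighByte(b)) {
            word += raw;                         // keep the byte unchanged
            hasAccent = true;
        } else if (std::isalpha(b) || b == '\'') {
            word += static_cast<char>(std::tolower(b));
        } else if (!word.empty()) {
            return true;                         // delimiter after a word
        }
        // a delimiter before any word is simply skipped
    }
    return !word.empty();                        // last word before end of file
}

// Hypothetical convenience wrapper: splits a whole string into words.
std::vector<std::string> SplitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    bool accent = false;
    while (GetNextWord(in, w, accent)) {
        words.push_back(w);
    }
    return words;
}
```

Unlike the original, this version skips over runs of delimiters instead of returning a "NOT A WORD!" flag for each one, and it reports the end of input through its return value. It does not lowercase accented letters; doing that correctly depends on knowing the encoding.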

Nathan Ernst
It works on files with numbers, punctuation, and capital and lowercase letters; so far, accented letters are the only thing causing me issues. I agree with your idea that it is going into an infinite loop, though. I am using namespace std;
Chase Sawyer
Does that help you any?
Chase Sawyer
MattSmith