I'm writing a program that follows a set of rules to count the words, syllables, and sentences in a given text file.

A sentence is a collection of words separated by whitespace that ends in a . or ! or ? However, this is also a sentence:

Greetings, earthlings..

The way I've approached this program is to scan through the text file one character at a time using getchar(). I am prohibited from holding the entire text file in memory; it must be processed one character or word at a time.

Here is my dilemma: using getchar() I can find out what the current character is, and I keep calling getchar() in a loop until it returns EOF. But if a sentence has multiple periods at the end, it is still a single sentence, which means I need to know the character before the one I'm analyzing and the one after it. As far as I can tell, that would mean another getchar() call, but that would cause problems when I go to read the next character (it has now skipped a character).

Does anyone have a suggestion as to how I could determine that the sentence above is indeed a single sentence?

Thanks, and if you need clarification or anything else, let me know.

+3  A: 

You just need to implement a very simple state machine. Once you've found the end of a sentence, you remain in that state until you find the start of a new sentence (normally a non-whitespace character other than a terminator such as ., !, or ?).
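A minimal sketch of that idea, not a definitive implementation. The state names, `is_terminator`, and `count_sentences` are hypothetical names chosen for illustration. The sketch assumes a sentence only counts once a terminator has been seen, so trailing text with no terminator is not counted. It uses getc() on a FILE * so it can be tried on any stream; getchar() is equivalent to getc(stdin).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical state names for this sketch. */
enum state { BETWEEN_SENTENCES, IN_SENTENCE };

/* The three sentence terminators named in the question. */
static int is_terminator(int c)
{
    return c == '.' || c == '!' || c == '?';
}

/* Counts sentences on `in`, reading one character at a time.
 * Assumption: text with no terminator before EOF is not counted. */
long count_sentences(FILE *in)
{
    enum state st = BETWEEN_SENTENCES;
    long sentences = 0;
    int c;

    assert(in != NULL);
    while ((c = getc(in)) != EOF) {
        if (st == IN_SENTENCE) {
            /* The first terminator ends the sentence. */
            if (is_terminator(c)) {
                sentences++;
                st = BETWEEN_SENTENCES;
            }
        } else if (!is_terminator(c) && !isspace(c)) {
            /* Extra terminators and whitespace are skipped until
             * a new sentence starts. */
            st = IN_SENTENCE;
        }
    }
    return sentences;
}
```

Calling count_sentences(stdin) from main would count sentences on standard input. Because the state already records that a sentence has just ended, no look-ahead or second getchar() call is needed for input like "Greetings, earthlings..".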

Paul R
That's actually a great idea, thanks a lot. I understand exactly what you mean, I'm surprised I didn't think of that. Thanks!
Blackbinary
I also suggest you read in blocks of characters, either by line or by quantity. In general, reading from memory is faster and usually easier to debug (you can see past and future letters).
Thomas Matthews
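A rough sketch of the block-reading suggestion, under stated assumptions. It reads fixed-size chunks with fgets() and looks back one character in the buffer. The function name `count_sentence_ends`, the 64-byte buffer, and the rule "a terminator counts only if the character before it is not a terminator" are all assumptions made for illustration, and a buffer like this may not be allowed under the one-character-at-a-time restriction in the question.

```c
#include <assert.h>
#include <stdio.h>

static int is_terminator(int c)
{
    return c == '.' || c == '!' || c == '?';
}

/* Counts runs of terminators, reading the stream in chunks.
 * Assumption: a run such as ".." or "??" marks one sentence end. */
long count_sentence_ends(FILE *in)
{
    char buf[64];      /* arbitrary chunk size for this sketch */
    int carry = ' ';   /* last character of the previous chunk */
    long ends = 0;

    assert(in != NULL);
    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t i;
        for (i = 0; buf[i] != '\0'; i++) {
            /* Look back inside the buffer, or into the previous
             * chunk when at the first position. */
            int before = (i == 0) ? carry : (unsigned char)buf[i - 1];
            if (is_terminator((unsigned char)buf[i]) && !is_terminator(before))
                ends++;
        }
        if (i > 0)
            carry = (unsigned char)buf[i - 1];
    }
    return ends;
}
```

The `carry` variable is there so that a run of terminators split across two chunks is still counted once.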
A: 

You need an extensible grammar. Look, for example, at regular expressions and try to build one.

Generally, human language is diverse and not easily parseable, especially if you have colloquial speech or different languages to analyze. In some languages it may not even be clear what the distinction between a word and a sentence is.

Thorsten79
That sounds much more complex than what I'm attempting. There is a finite set of rules to define sentences, words, and syllables, which I can cover with if statements.
Blackbinary