tags:

views:

187

answers:

4

hello everyone. I'm a beginner to C programming. I'm trying to learning how to code a spell checker that looks through all the words in a dictionary file, compare them with an article, print out all the words that do not exist in the dictionary file onto the console. Since I'm studying malloc in class, I've lowercased every word, removed all the punctuations in the article, and string copied them into malloc. I don't know what should the next step be, would someone give me a hint? Thanks

MAIN.C

#include <stdio.h>
#include <stdlib.h>
char dictionary[1000000];
char article[100000];

void spellCheck(char[], char[]);

int main(void) {
    FILE* dict_file;
    FILE* article_file;
    int bytes_read;
    char* p;
    dict_file = fopen("american-english.txt", "r");
    if (dict_file == 0) {
     printf("unable to open dictionary file \"american-english.txt\"\n");
     return -1;
    }

    article_file = fopen("article.txt", "r");
    if (article_file == 0) {
     printf("unable to open file \"article.txt\"\n");
     return -1;
    }

    /* read dictionary */
    p = dictionary;
    p = fgets(p, 100, dict_file);
    while (p != 0) {
     while (*p != '\0') { 
      p += 1; 
     }
     p = fgets(p, 100, dict_file);
    }

    /* read article */
    p = article;
    bytes_read = fread(p, 1, 1000, article_file);
    p += bytes_read;
    while (bytes_read != 0) {
     bytes_read = fread(p, 1, 1000, article_file);
     p += bytes_read;
    }
    *p = 0;

    spellCheck(article, dictionary);
}

PROJECT.C

void spellCheck(char article[], char dictionary[]) {
    int len = strlen(article) + 1;
    int i;
    char* tempArticle;
    tempArticle = malloc(len);

    if (tempArticle == NULL) {
        printf("spellcheck: Memory allocation failed.\n");
        return;
    }

    for(i = 0; i < len; i++)
        tempArticle[i] = tolower(article[i]);


    i=0;

    while (article[i] != '\0'){
     if (article[i] >= 33 && article[i] <= 64)
      article[i] = ' ';
    }

    printf("%s", tempArticle);

    free(tempArticle);
}
+1  A: 

The next step for your code would be to for each article word compare to to every word in the dictionary. The comparison is easily performed with strcmp, but the way you store the dictionary will force you to mess around with pointers to find the start of each new word in the dictionary.

Without any major changes you could do the comparison something like this, but it will require that you somehow determine when you've compared against all words in the dictionary, for example by counting how many words there are in the dictionary when you read it in from the file.

char* dictionary_word = dictionary;
int not_found = 1;
int i = 0;
for (; i < dictionary_word_count; ++i) {
    if ((not_found = strcmp(tempArticle, dictionary_word)) == 0) {
        break; /* Word found, we're done */
    }
    /* Add code to move dictionary_word to the next word here */
}

The problem with your current program is moving dictionary_word to the next word in a good way. It's possible to do so simply by advancing the pointer one character at a time and checking if you've found the next word. I would instead recommend you to create another array of char pointers and have them point to the beginning of each word and assign these as you read the words from the dictionary file. That would allow you to do something like dictionary_word = dictionary_word_pointers[i]; at the start of the for loop to get it to point to the correct word, instead of using a while loop to find the start of the next word. It would also have the added benefit of being easy to sort.

You can sort the dictionary beforehand and use binary search to speed up the dictionary lookups if the dictionary is large and searching through it using linear search is too slow.

Zareth
The above is C++, not C.
Kinopiko
What does in_dictionary do?
Kinopiko
+3  A: 

How you organize your data structures will be important.

You may want to not only put your dictionary into a binary tree, as Zareth mentioned, but do the same with the article, so you can remove all duplicate words and have them sorted.

This way when you start to search through the dictionary, if you go past the letters that your word starts with then you can quit, as the dictionary is sorted.

James Black
The processing involved in creating a binary tree from the article is going to be much more than the processing involved in just reading the article and checking each word against the dictionary though. And why is a tree structure necessary?
Kinopiko
The tree structure is to help get rid of duplicate words quickly, since he should be splitting the article up into words and putting them into a dynamic list of some sort, so why not sort it, to quickly remove the duplicates.
James Black
But removing the duplicates involves reading the article from beginning to end and parsing each word against the existing tree of article words. That is exactly the same amount of work as parsing the article from beginning to end and parsing each word against the dictionary. You're not saving any time at all by doing that.
Kinopiko
Except that to parse you have to take your word, run it against the dictionary, and do that each and every time. This depends on the dictionary being larger than the article, in terms of the number of words.
James Black
I see your point. Your reasoning is that each dictionary search is expensive, so you are trying to minimize total dictionary lookups.
Kinopiko
@Kinopiko - Correct, the idea is to have this search be as fast, and to avoid any unnecessary searches. If you want to get fancy then you can build the tree up so each letter is a node, so you don't do a strcmp, but just compare the correct letter, which is fastest, but will build a huge tree.
James Black
A: 

Congratulations, you have loaded the data into memory and you did everything right with checking the status of the system calls. Now you need to do more things with your dictionary data:

  1. Create an array of char * pointers, one pointing to each word.

    char * words[100000]; /* make sure you have enough space. */

  2. For each word in your dictionary, make an entry in words. There are various ways to do this, for example you could use strndup to copy each word from dictionary after finding its length using isspace or strcspn.

  3. Sort words (see qsort).
  4. Read the article, word by word, using the same method as in step 2.
  5. Search the dictionary (see bsearch) for the word.
  6. Put the misspelled words into another array similar to words.

If you want to get fancy, you might want to look into using stat to get the size of your files and allocating the memory for dictionary and article using malloc instead of using "magic numbers" or "very big numbers". For industrial strength C, you definitely need to do that.

Kinopiko
A: 

Is the 'dictionary' organized with one word per line? You could sensibly use 'strlen()` instead of the loop with 'p += 1'. Presumably the dictionary is also sorted?

Once you have the dictionary in memory, you don't need to read the whole of the article into memory. You could read one word at a time with 'fscanf()', then eliminate any punctuation so "t'other" appears as words "t" and "other" and "doesn't" appears as "doesn" and "t" - if you like. Or you could decide that isn't helpful. On the other hand, you probably do want to remove characters like question marks and double quotes.

Does your dictionary provide all variants on a word, or do you need to get involved in stemming? "Antidisestablishmentarianism" can be stemmed into "anti", "dis", "establish", "ment", "arian", "ism", I think, as an example.

You also need to consider whether it is correct to lower-case everything. You might decide that "IBM" is OK and "ibm" is not, for example; likewise with "ICBM" and "icbm" (and both "Ibm" and "Icbm" are bad under any reasonable definition of 'proper spelling').

You should be exploiting the fact that your dictionary is sorted to reduce the search time using a binary search or some similar mechanism.

Jonathan Leffler