The text to be checked is in Greek, but I would like to know whether it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, can you think of something better than Excel VBA?

+2  A: 

There is an open-source GNU spell checker called Aspell, with support for various languages. It is a command-line program that I mainly use to scan batches of text files at once (the output is simply printed to the console).
It also has a C API and, perhaps more interesting for you, a pipe mode that accepts a stream of text on standard input and writes its results to standard output.
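As a rough illustration of the pipe mode, the sketch below calls the command-line program from R with `system2()`. It assumes an `aspell` executable is on the PATH and that a dictionary for the requested language is installed; the function names `parse_aspell_pipe` and `check_with_aspell` are made up for this example. In pipe mode (`-a`), Aspell prints a version banner, then one result line per word: `*` for a correct word, and a line starting with `&` or `#` for a misspelled one.

```r
# Extract the misspelled words from `aspell -a` output.
# Result lines starting with "&" (suggestions found) or "#" (no suggestions)
# carry the offending word as their second space-separated field.
parse_aspell_pipe <- function(out) {
  flagged <- out[grepl("^[&#] ", out)]
  vapply(strsplit(flagged, " ", fixed = TRUE), `[`, character(1), 2)
}

# Hypothetical wrapper: sends one word per line to Aspell's pipe mode.
# Assumes `aspell` is on the PATH with a dictionary for `lang` installed.
check_with_aspell <- function(words, lang = "en") {
  out <- system2(
    "aspell",
    c("-a", paste0("--lang=", lang)),
    input = paste0("^", words),  # "^" tells pipe mode to treat the line as text
    stdout = TRUE
  )
  parse_aspell_pipe(out)
}
```

With a working installation, `check_with_aspell(c("persnickety", "sqwrzib"))` should return the words Aspell does not recognise. For Greek, the appropriate dictionary would have to be installed and passed as `lang`.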

Hope this helps.

Henrik
Thank you. Is there a Windows binary for Aspell?
gd047
Yes, there is, and the Windows binary is what I am using: http://aspell.net/win32/
Henrik
Is there a way to use it from R? I saw this http://www.omegahat.org/Aspell/ but I read that `There is currently no binary version for Windows`
gd047
I think Hunspell should be used instead of Aspell today; it certainly works on Windows, but you may need to compile it yourself.
mbq
Sorry, but I haven't heard of any R version. And true, Hunspell is the more up-to-date option, but since you just need a spell check, Aspell is probably enough, provided you can get it to work for your problem.
Henrik
+4  A: 

Alternatively, OpenOffice ships with a dictionary whose entries are stored in a text file. You can read that file and remove the word definitions to create your word list.

This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.

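library(stringr)
# Path is specific to one OpenOffice.org 3 install; adjust it for your machine and dictionary
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
# Lines starting with "(" hold the definitions; every other line is an entry
is_word <- str_detect(dict, "^[^(]")
# Split each entry line at the first pipe and keep only the part before it
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]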

This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
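A minimal sketch of that clean-up, using a small stand-in vector in place of the real `words` so it runs on its own. The names `first_tokens`, `word_list` and `out_file` are only illustrative, and the temporary file location is an assumption; in practice you would pick a permanent path.

```r
# Stand-in for the `words` vector extracted from the dictionary file
words <- c("persnickety", "ice cream", "ice", "a cappella", "zebra")

# Keep only the text before the first space, then drop duplicates
first_tokens <- sub(" .*$", "", words)
word_list <- sort(unique(first_tokens))

# Save the list so the dictionary file does not have to be parsed again
out_file <- file.path(tempdir(), "word_list.txt")
writeLines(word_list, out_file)

# In a later session, reload it with a single call
word_list <- readLines(out_file)
```

Note that keeping only the first token of a phrase can admit fragments that are not standalone words, so it may be worth inspecting the result.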

Once this is done, checking a word is as easy as

c("persnickety", "sqwrzib") %in% words      # TRUE FALSE
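The same `%in%` test extends to a whole passage once it is split into words. The sketch below is one possible approach rather than part of the tested answer: it uses a tiny stand-in `words` vector, lower-cases the text (so the word list would need to be lower-cased too), and relies on `[[:alpha:]]`, whose handling of Greek letters depends on the session locale.

```r
# Stand-in word list; in practice use the vector built from the dictionary
words <- c("the", "quick", "brown", "fox")

text <- "The quick brwon fox, the quikc fox."

# Split on runs of non-letters and drop any empty strings
tokens <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
tokens <- tokens[nzchar(tokens)]

# Report each unknown word once, in order of first appearance
misspelled <- unique(tokens[!tokens %in% words])
```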
Richie Cotton