I'm trying to isolate the individual words in a PDF file, but when I read the file with the pdf-reader gem the text arrives fractured, like this:

"A lit"
"tle "
"bit of tex"
"t"

So I'm planning to put these together using some heuristics. For this, I need a library that checks whether a given string is a valid English word, like:

"tree".is_english? # => true
"askdjfah".is_english? # => false

Does this exist? Ideally, it would also work with German text.

If not, is there a freely available dictionary online? I guess I could write my own tree structure to do the lookup, if I had to.

A: 

I don't know of any library that does what you want, but there are plain word-list dictionaries available. They shouldn't be hard to find on Google. For example this.
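
Once you have such a word list, the lookup itself is short. Here is a minimal sketch, assuming a UTF-8 text file with one word per line; the WordList class and its methods are hypothetical names made up for illustration, not part of any gem. It uses a Set rather than a hand-written tree, which is usually enough for simple membership checks.

```ruby
require "set"

# Hypothetical helper: loads a one-word-per-line file into a Set so that
# membership checks take constant time on average.
class WordList
  def initialize(path)
    @words = Set.new
    File.foreach(path) do |line|
      word = line.strip
      # Store lowercase so lookups are case-insensitive.
      @words << word.downcase unless word.empty?
    end
  end

  # True if the word (ignoring case) appears in the loaded list.
  def include?(word)
    @words.include?(word.to_s.downcase)
  end
end
```

With a list at, say, /usr/share/dict/words (an assumed path, it varies by system), WordList.new("/usr/share/dict/words").include?("tree") would be true whenever "tree" is in that file.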

klew
+3  A: 

You can check out raspell, or even invoke aspell manually, with any dictionary you like.
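
For the manual route, here is a rough sketch, assuming aspell and an English dictionary are installed. It relies on "aspell list", which reads text from standard input and prints the words it considers misspelled. The known_word? method and its command parameter are hypothetical names chosen for this example.

```ruby
require "open3"

# Hypothetical helper: returns true when the spell checker reports no
# misspellings for `word`. `aspell list` echoes each misspelled word it
# reads from standard input, so empty output means the word was accepted.
def known_word?(word, command: ["aspell", "list", "--lang=en"])
  output, _errors, status = Open3.capture3(*command, stdin_data: word)
  # A non-zero exit usually means a missing dictionary or a bad option.
  raise "spell checker failed: #{command.join(' ')}" unless status.success?
  output.strip.empty?
end
```

For German you could pass command: ["aspell", "list", "--lang=de"], provided the matching aspell dictionary is installed. Spawning a process per word is slow, so for a whole document it is probably better to send many words in one call and collect the rejected ones.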

Mladen Jablanović
+1  A: 

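If you have the Unix tool look installed on your system, you can easily check whether a string is a dictionary word. Example:

strings = %w{ cat dog tree trees treez }

strings.each do |string|
  # look prints every dictionary entry that starts with the given prefix,
  # so require an exact match. The array form of popen bypasses the shell.
  matches = IO.popen(["look", string], err: File::NULL, &:readlines)
  if matches.any? { |line| line.chomp.casecmp(string).zero? }
    puts "#{string} is a word"
  else
    puts "#{string} is not a word"
  end
end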

Here's more information on look: http://docstore.mik.ua/orelly/unix/upt/ch27_18.htm

Since look reads its words from a dictionary file (usually /usr/share/dict/words on current systems, /usr/dict/words on older ones), I think it's possible to install a German word dictionary as well. Look for the wgerman package in Debian. I'm not sure how to install it on other systems.

dan