tags:

views:

126

answers:

5

What is the best way to validate a gets input against a very long word list (a list of all the English words available)?

I am currently playing with readlines to manipulate the text, but before there's any manipulation, I would like to first validate the entry against the list.

+3  A: 

A Bloom Filter probably isn't the simplest way, but it will be one of the more efficient. The additional check to weed out false positives is from Pesto's answer.

#!/usr/bin/env ruby

require 'rubygems'
require 'bloomfilter'

puts 'Loading...'

# Should probably tailor the size of the filter to the dictionary file and not 
# assume a Unix environment...
bf = BloomFilter.new(10000000)
dictionary = '/usr/share/dict/words'

File.foreach(dictionary) { |word| bf.add(word.chomp) }

puts 'Ready!'

begin
    input = gets.chomp
    if bf.include?(input)
        if File.foreach(dictionary) { |word| break word if word.chomp == input }
            puts 'valid'
        else
            puts 'false positive'
        end
    else
        puts 'invalid'
    end
end while not input.empty?
Mark Johnson
I have not heard of bloom filters before. Thank you.
cloudhead
A: 

are you reading the list from a file? can't you have it all in memory? maybe a finger tree may help you if not, there's not more than "read a chunk of data from the file and grep into"

ZeD
+2  A: 

The simplest way, but by no means the fastest, is to simply search against the word list each time. If the word list is in an array:

if word_list.index word
    #manipulate word
end

If, however, you had the word list as a separate file (with each word on a separate line), then we'll use File#foreach to find it:

if File.foreach("word.list") {|x| break x if x.chomp == word}
   #manipulate word
end

Note that foreach does not strip off the trailing newline character(s), so we get rid of them with String#chomp.

Pesto
This did the trick... i was able to wrap my program and create a error message if the word isn't matched. Thanks.
cloudhead
+2  A: 

Here's a simple example using a Set, though Mark Johnson is right, a bloom filter would be more efficient.

require 'set'

WORD_RE = /\w+/

# Read in the default dictionary (from /usr/share/dict/words),
# and put all the words into a set
WORDS = Set.new(File.read('/usr/share/dict/words').scan(WORD_RE))

# read the input line by line
STDIN.each_line do |line|
  # find all the words in the line that aren't contained in our dictionary
  unrecognized = line.scan(WORD_RE).find_all { |term| not WORDS.include? term }

  # if none were found, the line is valid
  if unrecognized.empty?
    puts "line is valid"
  else # otherwise, the line contains some words not in our dictionary
    puts "line is invalid, could not recognize #{unrecognized.inspect}"
  end
end
rampion
you obviously have much more experience than me :) i will study this
cloudhead
I'll mark up my example to make it a bit better study guide.
rampion
A: 

Read the word list into memory, and for each word, make an entry into a hash table:

def init_word_tester
    @words = {}

    File.foreach("word.list") {|word| 
        @words[word.chomp] = 1
    }
end

now you can just check every word against your hash:

def test_word word
    return @words[word]
end
Aftermathew