views:

327

answers:

5

After hearing the latest Stack Overflow podcast, Peter Norvig's compact Python spell-checker intrigued me, so I decided to implement it in Scala if I could express it well in the functional Scala idiom, and also to see how many lines of code it would take.

Here's the whole problem. (Let's not compare lines of code yet.)

(Two notes: You can run this in the Scala interpreter, if you wish. If you need a copy of big.txt, or the whole project, it's on GitHub.)

import scala.io.Source

val alphabet = "abcdefghijklmnopqrstuvwxyz"

def train(text:String) = {
  "[a-z]+".r.findAllIn(text).foldLeft(Map[String, Int]() withDefaultValue 1)
    {(a, b) => a(b) = a(b) + 1}
}

val NWORDS = train(Source.fromFile("big.txt").getLines.mkString.toLowerCase)

def known(words:Set[String]) =
  {Set.empty ++ (for(w <- words if NWORDS contains w) yield w)}

def edits1(word:String) = {
  Set.empty ++
  (for (i <- 0 until word.length) // Deletes
    yield (word take i) + (word drop (i + 1))) ++ 
  (for (i <- 0 until word.length - 1) // Transposes
    yield (word take i) + word(i + 1) + word(i) + (word drop (i + 2))) ++ 
  (for (i <- 0 until word.length; j <- alphabet) // Replaces
    yield (word take i) + j + (word drop (i+1))) ++ 
  (for (i <- 0 until word.length; j <- alphabet) // Inserts
    yield (word take i) + j + (word drop i))
}

def known_edits2(word:String) = {Set.empty ++ (for (e1 <- edits1(word);
  e2 <- edits1(e1) if NWORDS contains e2) yield e2)}

def correct(word:String) = {
  val options = Seq(() => known(Set(word)), () => known(edits1(word)),
    () => known_edits2(word), () => Set(word))
  val candidates = options.foldLeft(Set[String]())
    {(a, b) => if (a.isEmpty) b() else a}
  candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b}
}

Specifically, I'm wondering if there's anything cleaner I can do with the correct function. In the original Python, the implementation is a bit cleaner:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or
      known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Apparently in Python, an empty set will evaluate to Boolean False, so only the first of the candidates to return a non-empty set will be evaluated, saving potentially expensive calls to edits1 and known_edits2.

The only solution I would come up with is the version you see here, where the Seq of anonymous functions are called until one returns a non-empty Set, which the last one is guaranteed to do.

So experienced Scala-heads, is there a more syntactically concise or better way to do this? Thanks in advance!

+4  A: 

Would this work? The _ syntax is a partially applied function and by using a (lazy) Stream, I ensure that the evaluations in the reduceLeft (which I think is more appropriate than foldLeft here) only happen as required!

def correct(word:String) = {
  Stream(known(Set(word)) _, 
         known(edits1(word)) _, 
         known_edits2(word) _, 
         Set(word) _
         ).find( !_().isEmpty ) match {
    case Some(candidates) =>
      candidates.reduceLeft {(res, n) => if (NWORDS(res) > NWORDS(n)) res else n}
    case _ => "" //or some other value

}

I've probably made some syntax errors here, but I think the Stream approach is a valid one

oxbow_lakes
+6  A: 

I'm not sure why you're attempting to use lazy evaluation for known rather than simply using a stream as oxbow_lakes illustrated. A better way of doing what he did:

def correct(word: String) = {
  import Stream._

  val str = cons(known(Set(word)), 
            cons(known(edits1(word)),
            cons(known_edits2(word),
            cons(Set(word), empty))))

  str find { !_.isEmpty } match {
    case Some(candidates) =>
      candidates.foldLeft(Set[String]()) { (res, n) =>
        if (NWORDS(res) > NWORDS(n)) res else n
      }

    case None => Set()
  }
}

The exploits the fact that Stream.cons is lazy already and so we don't need to wrap everything up in a thunk.

If you're really in the mood for nice syntax though, we can add some syntactic sugar to all of those conses:

implicit def streamSyntax[A](tail: =>Stream[A]) = new {
  def #::(hd: A) = Stream.cons(hd, tail)
}

Now our previously-ugly str definition falls into the following:

def correct(word: String) = {
  val str = known(Set(word)) #:: known(edits1(word)) #::
            known_edits2(word) #:: Set(word) #:: Stream.empty

  ...
}
Daniel Spiewak
Thanks for the answers and the sugary addition.A slightly modified version of this works and is much nicers than tossing anonymous functions around, but it's not quite as performant as it could be, because one too many function invocations happens (that is to say, if known(Set(word)) returns a non-empty set, known(edits1(word)) still gets executed.)What would be the best way to get rid of that?
Steven Merrill
I would probably switch around the order so that Set(word) gets added first. That doesn't carry a performance hit to speak of, so it should be alright to eagerly evaluate that one term.
Daniel Spiewak
Unfortunately, that breaks the program logic, since no correction will ever be returned, since Set(word) will never be the empty set.
Steven Merrill
Ah, good point. Well, if you think about it though, the first member of the stream *must* be evaluated whether or not you've wrapped it in a function value. The first thing we do with our stream is invoke `find`. There are two cases: either we find the non-empty set on the first one, in which case we have evaluated the first cell; or we don't find it on the first one, in which case we have evaluated the first cell for testing and must move on to the second. Either way, you can't avoid that overhead.
Daniel Spiewak
+2  A: 

Iterators are also lazy (although not very functional since you can only iterate over them once.) So, you could do it like this:

  def correct(word: String) = {
    val sets = List[String => Set[String]](
      x => known(Set(x)), x => known(edits1(x)), known_edits2
    ).elements.map(_(word))

    sets find { !_.isEmpty } match {
      case Some(candidates: Set[String]) => candidates.reduceLeft { (res, n) => if (NWORDS(res) > NWORDS(n)) res else n }
      case None => word
    }
  }

As a bonus, Iterator's find() method doesn't force evaluation of the next element.

David Winslow
A-ha! This is the first answer that doesn't cause any extra invocations of the candidates.
Steven Merrill
This will still *always* evaluate the first candidate. In other words, it doesn't do any better than my answer.
Daniel Spiewak
In a short-circuited chain, the first thing should always be evaluated. What would be the criterion for avoiding evaluation of the first term?
David Winslow
+3  A: 

Scala 2.7-ready (including implicit performance work-around):

class Or[A](one: Set[A]) {
  def or(other: => Set[A]): Set[A] = if (one.isEmpty) other else one 
}

implicit def toOr[A](one: Set[A]) = new Or(one)

def correct(word: String) = {
  candidates = known(Set(word)) or known(edits1(word)) or known_edits2(word) or Set(word)
  candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b}
}

Scala 2.8-goodness:

implicit def toOr[A](one: Set[A]) = new AnyRef { 
  def or(other: => Set[A]): Set[A] = if (one.isEmpty) other else one 
}

def correct(word: String) = {
  candidates = known(Set(word)) or known(edits1(word)) or known_edits2(word) or Set(word)
  candidates.max(Ordering.fromLessThan[String](NWORDS(_) < NWORDS(_)))
}

That said, I pretty much upvoted everyone else's. I hadn't consider a Stream.

EDIT

It seems Ordering.fromLessThan can lead to twice the necessary comparisons. Here is an alternate version for that line:

  candidates.max(new Ordering[String] { def compare(x: String, y: String) = NWORDS(x) - NWORDS(y) })
Daniel
Thanks so much! This implementation is beautiful, but it's not performant, unfortunately, since testing shows it to cause n^2 evaluations of the candidates. I think I'll still have to use either anonymous or partially-applied functions in some capacity.
Steven Merrill
You mean `max`? How strange.
Daniel
Note that the definition of `or` doesn't allow unnecessary evaluations. `one` is passed by value, so it won't reevaluate. `other` is passed by name, so it will only be evaluated when used, and it is only used if `one` is empty. If you had trouble with `or`, you probably missed `=>` in `other: => Set[A]`.
Daniel
I have looked at the max question, and, while inefficient, it is now n^2, but n*2. I'll provide a better version anyway, as well as open a bug fix to slightly improve `min` and `max`. As for `fromLessThan`, it is broken by design, as it always test twice if "a > b". :-(
Daniel
Ticket #2697 open, if you want to check the details.
Daniel
I was using the 2.7 version. Let me see if I can reproduce it.
Steven Merrill
Daniel,Check out a version with two printlns added and its outputs:http://gist.github.com/242386
Steven Merrill
I can't begin to imagine why this is happening, but I know how to stop it. The problem is create a new class within the implicit, which causes reflection calls. One has to split the class and the implicit, as shown in the modified example.
Daniel
And, by the way, it doesn't happen on Scala 2.8.
Daniel
A: 

I've tried to implement a short Scala implementation of the spelling corrector. It's 15 lines without imports. The shortest replacement for Python's or is simple call by name parameter:

def or[T](candidates : Seq[T], other : => Seq[T]) = if(candidates.isEmpty) other else candidates
def candidates(word: String) = or(known(List(word)), or(known(edits1(word)), known(edits2(word))))

In a real world scenario I would use the implicit conversion Daniel proposed.

Thomas Jung