views:

630

answers:

5

I'm using some Java code to do fast prefix lookups, using java.util.TreeSet, could I be using scala's TreeSet instead? Or a different solution?

/** A class that uses a TreeSet to do fast prefix matching
 */
class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String): List[String] = {
    val matches = new ListBuffer[String]
    val tailSet = _set.tailSet(prefix)
    for ( tail <- tailSet.toArray ) {
      val tailString = tail.asInstanceOf[String]
      if ( tailString.startsWith(prefix) ) 
        matches += tailString
      else
        return matches.toList    
    }

    matches.toList
  }
}
+2  A: 

As I understand it, the Scala TreeSet is backed by the Java TreeSet, but using the Scala variant would allow you to shorten up the code using a sequence comprehension (http://www.scala-lang.org/node/111) giving you an implementation that looked something like (for Scala 2.7):

import scala.collection.jcl.TreeSet;

class PrefixMatcher 
{
    private val _set = new TreeSet[String]

    def add(s: String) = _set.add(s)

    def findMatches(prefix: String): Iterable[String] =
        for (s <- _set.from(prefix) if s.startsWith(prefix)) yield s
}

object Main
{
    def main(args: Array[String]): Unit =
    {
        val pm = new Trie()

        pm.add("fooBar")
        pm.add("fooCow")
        pm.add("barFoo")

        pm.findMatches("foo").foreach(println)
    }
}

Apologies for any bad Scala style on my part, I'm just getting used to the language myself.

pyo
@pyo: It *looks* like your for loop loops through all strings in the set, is that right? In the java code I call tailSet, which does something like: return an iterator to the nodes that start with the prefix specified in order, allowing us to find the matching items much faster than comparing every item in the set.
Alex Black
@Alex Black: indeed, I missed out a call to .from in the sequence comprehension, edited in now.
pyo
Ah, that looks much better, thx, let me try that out...
Alex Black
@pyo: one more question, will your implementation break when s.startsWith(prefix) returns false as the one I posted does?
Alex Black
Nope, it will test the startsWith on every String 'from' prefix to the end of the set
pyo
My big objection to this code is that this is *not* a Trie.
Ken Bloom
Truth be told I was unaware of the significance of the term, I merely kept it in from the original question. I shall duly edit it now ;)
pyo
+2  A: 

The from & takeWhile approach:

class PrefixMatcher {
    private val _set = new TreeSet[String]
    def add(s: String) = _set.add(s)
    def findMatches(prefix: String): Iterable[String] =
        _set from prefix takeWhile(_ startsWith prefix)
}

An alternative is to select a subset from prefix to prefix++ (the smallest string after the prefix). This selects only the range of the tree that actually starts with the given prefix. Filtering of entries is not necessary. The subSet method will create a view of the underlying set.

There's still some work (overflow and empty strings won't work) in the increment method but the intent should be clear.

class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String) : Set[String] = {
    def inc(x : String) = { //ignores overflow
       assert(x.length > 0) 
       val last = x.length - 1
       (x take last) + (x(last) + 1).asInstanceOf[Char]
    }
   _set.subSet(prefix, inc(prefix))
  }
}

The same works with the scala jcl.TreeSet#range implementation.

Thomas Jung
Thanks for the suggestion, I think the from syntax looks simpler for this scenario than subSet.
Alex Black
Iterating over the set is even simpler. :-) The difference is the runtime behavior. Selecting a subset for the prefix is quite selective and will skip most of the tree.
Thomas Jung
Heh. I'm comparing using the method 'from' to get a subset with no end (so you have to check to see if you've reached the end), to your suggestion of using subSet, both skip most of the tree I think.
Alex Black
Yes you're right. I've added a from version that does abort when the end of prefix is reached (using takeWhile). I suppose this has the same runtime behavior as your implementation and the subSet implementation. It's quite readable.
Thomas Jung
Great. I tried to upvote you, and turns out I'd already upvoted you, so it downvoted you, and now it says too old to change, sorry!!
Alex Black
Maybe if you edit the answer one more time I'll be able to change my vote back to up.
Alex Black
After 5 downvotes on your answers I'll feel better. :-)
Thomas Jung
5 downvotes? how do you mean? (sorry If I am missing the sarcasm or something!)
Alex Black
There we go, back to upvoted.
Alex Black
5 * 2 ~ 10. It was supposed to be ironic. Does not always work between different cultures.
Thomas Jung
Ah, makes sense now.. My paranoia that maybe I'd downvoted you 5 times, not just 1 time, prevented me from catching what you meant :)
Alex Black
+1  A: 

I blogged about finding matches for a combination of prefixes a while ago. It's a harder problem, as you don't know when one prefix ends and the other begins. It might interest you. I'll even post below the code that I did not blog (yet, hopefully :), though it is stripped of all comments, none of which were made in English:

package com.blogspot.dcsobral.matcher.DFA

object DFA {
  type Matched = List[(String, String)]
  def words(s : String) = s.split("\\W").filter(! _.isEmpty).toList
}

import DFA._
import scala.runtime.RichString

class DFA {
  private val initialState : State = new State(None, "")
  private var currState : State = initialState
  private var _input : RichString = ""
  private var _badInput : RichString = ""
  private var _accepted : Boolean = true

  def accepted : Boolean = _accepted
  def input : String = _input.reverse + _badInput.reverse 

  def transition(c : Char) : List[(String, Matched)] = {
    if (c == '\b') backtrack
    else {
      if (accepted) {
        val newState = currState(c)
        newState match {
          case Some(s) => _input = c + _input; currState = s
          case None => _badInput = c + _badInput; _accepted = false
        }
      } else {
        _badInput = c + _badInput
      }
      optionList
    }
  }

  def transition(s : String) : List[(String, Matched)] = {
    s foreach (c => transition(c))
    optionList
  }

  def apply(c : Char) : List[(String, Matched)] = transition(c)
  def apply(s : String) : List[(String,Matched)] = transition(s)

  def backtrack : List[(String, Matched)] = {
    if(_badInput isEmpty) {
      _input = _input drop 1
      currState.backtrack match {
        case Some(s) => currState = s
        case None =>
      }
    } else {
      _badInput = _badInput drop 1
      if (_badInput isEmpty) _accepted = true
    }
    optionList
  }

  def optionList : List[(String, Matched)] = if (accepted) currState.optionList else Nil

  def possibleTransitions : Set[Char] = if (accepted) (currState possibleTransitions) else Set.empty

  def reset : Unit = {
    currState = initialState
    _input = ""
    _badInput = ""
    _accepted = true
  }

  def addOption(s : String) : Unit = {
    initialState addOption s
    val saveInput = input
    reset
    transition(saveInput)
  }
  def removeOption(s : String) : Unit = {
    initialState removeOption s
    val saveInput = input
    reset
    transition(saveInput)
  }
}

class State (val backtrack : Option[State],
             val input : String) {
  private var _options : List[PossibleMatch] = Nil
  private val transitions : scala.collection.mutable.Map[Char, State] = scala.collection.mutable.Map.empty
  private var _possibleTransitions : Set[Char] = Set.empty

  private def computePossibleTransitions = {
    if (! options.isEmpty)
      _possibleTransitions = options map (_.possibleTransitions) reduceLeft (_++_)
    else
      _possibleTransitions = Set.empty
  }

  private def computeTransition(c : Char) : State = {
    val newState = new State(Some(this), input + c)
    options foreach (o => if (o.possibleTransitions contains c) (o computeTransition (newState, c)))
    newState
  }

  def options : List[PossibleMatch] = _options
  def optionList : List[(String, Matched)] = options map (pm => (pm.option, pm.bestMatch))
  def possibleTransitions : Set[Char] = _possibleTransitions

  def transition(c : Char) : Option[State] = {
    val t = c.toLowerCase
    if (possibleTransitions contains t) Some(transitions getOrElseUpdate (t, computeTransition(t))) else None
  }

  def apply(c : Char) : Option[State] = transition(c)

  def addOption(option : String) : Unit = {
    val w = words(option)
    addOption(option, w.size, List(("", w.head)), w)
  }

  def addOption(option : String, priority : Int, matched : Matched, remaining : List[String]) : Unit = {
    options find (_.option == option) match {
      case Some(pM) => 
        if (!pM.hasMatchOption(matched)) {
          pM.addMatchOption(priority, matched, remaining)
          if (priority < pM.priority) {
            val (before, _ :: after) = options span (_ != pM)
            val (highPriority, lowPriority) = before span (p => p.priority < priority || 
              (p.priority == priority && p.option < option))
            _options = highPriority ::: (pM :: lowPriority) ::: after
          }
          transitions foreach (t => pM computeTransition (t._2, t._1))
        }
      case None => 
        val (highPriority, lowPriority) = options span (p => p.priority < priority || 
              (p.priority == priority && p.option < option))
        val newPM = new PossibleMatch(option, priority, matched, remaining)
        _options = highPriority ::: (newPM :: lowPriority)
        transitions foreach (t => newPM computeTransition (t._2, t._1))
    }
    computePossibleTransitions
  }

  def removeOption(option : String) : Unit = {
    options find (_.option == option) match {
      case Some(possibleMatch) =>
        val (before, _ :: after) = options span (_ != possibleMatch)
        (possibleMatch.possibleTransitions ** Set(transitions.keys.toList : _*)).toList foreach (t => 
          transition(t).get.removeOption(option))
        _options = before ::: after
        computePossibleTransitions
      case None =>
    }
  }
}

class PossibleMatch (val option : String, 
                     thisPriority : Int, 
                     matched : Matched, 
                     remaining : List[String]) {
  private var _priority = thisPriority
  private var matchOptions = List(new MatchOption(priority, matched, remaining))
  private var _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
  private def computePossibleTransitions = {
    _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
  }

  def priority : Int = _priority
  def hasMatchOption(matched : Matched) : Boolean = matchOptions exists (_.matched == matched)
  def addMatchOption(priority : Int, matched : Matched, remaining : List[String]) : Unit = {
    if (priority < _priority) _priority = priority
    val (highPriority, lowPriority) = matchOptions span (_.priority < priority)
    val newMO = new MatchOption(priority, matched, remaining)
    matchOptions = highPriority ::: (newMO :: lowPriority)
    computePossibleTransitions
  }
  def bestMatch : Matched = matchOptions.head.matched.reverse.map(p => (p._1.reverse.toString, p._2)) ::: 
    remaining.tail.map(w => ("", w))
  def possibleTransitions : Set[Char] = _possibleTransitions

  def computeTransition(s: State, c : Char) : Unit = {
    def computeOptions(state : State,
                       c : Char, 
                       priority : Int, 
                       matched : Matched, 
                       remaining : List[String]) : Unit = {
      remaining match {
        case w :: ws => 
          if (!w.isEmpty && w(0).toLowerCase == c.toLowerCase) {
            val newMatched = (w(0) + matched.head._1, matched.head._2.substring(1)) :: matched.tail
            val newPriority = if (matched.head._1 isEmpty) (priority - 1) else priority

            if (w.drop(1) isEmpty)
              s.addOption(option, newPriority - 1, ("", ws.head) :: newMatched , ws)
            else
              s.addOption(option, newPriority, newMatched, w.substring(1) :: ws)
          }
          if (ws != Nil) computeOptions(s, c, priority, ("", ws.head) :: matched, ws)
        case Nil =>
      }
    }

    if(possibleTransitions contains c)
      matchOptions foreach (mO => computeOptions(s, c, mO.priority, mO.matched, mO.remaining))
  }
}

class MatchOption (val priority : Int,
                   val matched : Matched,
                   val remaining : List[String]) {
  lazy val possibleTransitions : Set[Char] = Set( remaining map (_(0) toLowerCase) : _* )
}

It really needs some refactoring, though. I always do it when I'm start to explain it for the blog.

Daniel
Interesting stuff, I don't need to match on combinations of prefixes at the moment, I'll keep it in mind.
Alex Black
Be grateful! I don't allow code that ugly on other's sight that easily. ;-)
Daniel
A: 

Ok, I just realized what you want is pretty much what a friend of mine suggested for another problem. So, here is his answer, simplified for your needs.

class PrefixMatcher {
  // import scala.collection.Set // Scala 2.7 needs this -- and returns a gimped Set
  private var set = new scala.collection.immutable.TreeSet[String]()
  private def succ(s : String) = s.take(s.length - 1) + ((s.charAt(s.length - 1) + 1)).toChar

  def add(s: String) = set += s

  def findMatches(prefix: String): Set[String] = 
    if (prefix.isEmpty) set else set.range(prefix, succ(prefix))
}
Daniel
+2  A: 

Use a Trie. Nobody's actually posted a Trie here yet, despite the fact that some people have posted sorted TreeMap data structures that they have misnamed as tries. Here is a fairly representative sample of a Trie implementation in Java. I don't know of any in Scala. See also an explanation of Tries on Wikipedia.

Ken Bloom