views:

195

answers:

5

I'm trying to learn Scala and tried to write a sequence comprehension that extracts unigrams, bigrams and trigrams from a sequence. E.g., [1,2,3,4] should be transformed to (not Scala syntax)

[1; _,1; _,_,1; 2; 1,2; _,1,2; 3; 2,3; 1,2,3; 4; 3,4; 2,3,4]

In Scala 2.8, I tried the following:

def trigrams(tokens : Seq[T]) = {
  var t1 : Option[T] = None
  var t2 : Option[T] = None
  for (t3 <- tokens) {
    yield t3
    yield (t2,t3)
    yield (t1,t2,Some(t3))
    t1 = t2
    t2 = t3
  }
}

But this doesn't compile as, apparently, only one yield is allowed in a for-comprehension (no block statements either). Is there any other elegant way to get the same behavior, with only one pass over the data?

+1  A: 

I guess I should have given this more thought.

def trigrams(tokens : Seq[T]) : Seq[(Option[T],Option[T],T)] = {
  var t1 : Option[T] = None
  var t2 : Option[T] = None
  for (t3 <- tokens)
    yield {
      val tri = (t1,t2,t3)
      t1 = t2
      t2 = Some(t3)
      tri
    }
}

Then extract the unigrams and bigrams from the trigrams. But can anyone explain to me why 'multi-yields' are not permitted, and if there's any other way to achieve their effect?

larsmans
to achieve their effect, try to add another nested level of iteration, as in my first proposed solution to your problem.
Ken Bloom
Multiple yields (if you see them as exit points which they aren’t) would turn the simple closure into a coroutine which is one reason they are not so easy to deal with. There are some solutions with the continuations plugin, however.
Debilski
+1  A: 
val basis = List(1, 2, 3, 4)
val nGrams = basis.map(x => (x)) ::: (for (a <- basis; b <- basis) yield (a, b)) ::: (for (a <- basis; b <- basis; c <- basis) yield (a, b, c))
nGrams: List[Any] = ...
nGrams foreach (println(_))
1
2
3
4
(1,1)
(1,2)
(1,3)
(1,4)
(2,1)
(2,2)
(2,3)
(2,4)
(3,1)
(3,2)
(3,3)
(3,4)
(4,1)
(4,2)
(4,3)
(4,4)
(1,1,1)
(1,1,2)
(1,1,3)
(1,1,4)
(1,2,1)
(1,2,2)
(1,2,3)
(1,2,4)
(1,3,1)
(1,3,2)
(1,3,3)
(1,3,4)
(1,4,1)
(1,4,2)
(1,4,3)
(1,4,4)
(2,1,1)
(2,1,2)
(2,1,3)
(2,1,4)
(2,2,1)
(2,2,2)
(2,2,3)
(2,2,4)
(2,3,1)
(2,3,2)
(2,3,3)
(2,3,4)
(2,4,1)
(2,4,2)
(2,4,3)
(2,4,4)
(3,1,1)
(3,1,2)
(3,1,3)
(3,1,4)
(3,2,1)
(3,2,2)
(3,2,3)
(3,2,4)
(3,3,1)
(3,3,2)
(3,3,3)
(3,3,4)
(3,4,1)
(3,4,2)
(3,4,3)
(3,4,4)
(4,1,1)
(4,1,2)
(4,1,3)
(4,1,4)
(4,2,1)
(4,2,2)
(4,2,3)
(4,2,4)
(4,3,1)
(4,3,2)
(4,3,3)
(4,3,4)
(4,4,1)
(4,4,2)
(4,4,3)
(4,4,4)
Randall Schulz
No, this gives a list of (what I think are called) multiset combinations. I'm looking for n-grams, substrings. It also passes over `basis` three times. What is `:::`, by the way?
larsmans
`basis` is iterated six times. `:::` is `List` concatenation.
Randall Schulz
`:::` is kinda slow.
Ken Bloom
Performance wasn't a part of the question, was it? `:::` O(n) where n is the length of its RHS argument.
Randall Schulz
@Randall: `:::` is O(n) where n is the length of its LHS argument.
Ken Bloom
+2  A: 
scala> val basis = List(1, 2, 3, 4)
basis: List[Int] = List(1, 2, 3, 4)

scala> val nGrams = (basis sliding 1).toList ::: (basis sliding 2).toList ::: (basis sliding 3).toList
nGrams: List[List[Int]] = ...

scala> nGrams foreach (println _)
List(1)
List(2)
List(3)
List(4)
List(1, 2)
List(2, 3)
List(3, 4)
List(1, 2, 3)
List(2, 3, 4)
Randall Schulz
This is going to be a little slow on account of the `:::` operator.
Ken Bloom
The concatenation doesn't have to happen. I just wanted to illustrate that `sliding n` does the essential work of extracting `n`-grams. And whether it's really "slow" depends on other parameters.
Randall Schulz
+5  A: 

You can't have multiple yields in a for loop because for loops are syntactic sugar for the map (or flatMap) operations:

for (i <- collection) yield( func(i) )

translates into

collection map {i => func(i)}

Without a yield at all

for (i <- collection) func(i)

translates into

collection foreach {i => func(i)}

So the entire body of the for loop is turned into a single closure, and the presence of the yield keyword determines whether the function called on the collection is map or foreach (or flatMap). Because of this translation, the following are forbidden:

  1. Using imperative statements next to a yield to determine what will be yielded.
  2. Using multiple yields

(Not to mention that your proposed verison will return a List[Any] because the tuples and the 1-gram are all of different types. You probably want to get a List[List[Int]] instead)

Try the following instead (which put the n-grams in the order they appear):

val basis = List(1,2,3,4)
val slidingIterators = 1 to 4 map (basis sliding _)

for {onegram <- basis
     ngram <- slidingIterators if ngram.hasNext}
     yield (ngram.next)

or

val basis = List(1,2,3,4)
val slidingIterators = 1 to 4 map (basis sliding _)
val first=slidingIterators head
val buf=new ListBuffer[List[Int]]

while (first.hasNext)
   for (i <- slidingIterators)
      if (i.hasNext)
         buf += i.next

If you prefer the n-grams to be in length order, try:

val basis = List(1,2,3,4)
1 to 4 flatMap { basis sliding _ toList }
Ken Bloom
+1  A: 

You could try a functional version without assignments:

def trigrams[T](tokens : Seq[T]) = {
  val s1 = tokens.map { Some(_) }
  val s2 = None +: s1
  val s3 = None +: s2
  s1 zip s2 zip s3 map {
    case ((t1, t2), t3) => (List(t1), List(t1, t2), List(t1, t2, t3))
  }
}
venechka