views:

263

answers:

4

I'm trying to read a zip file, check that it has some required files, and then write all valid files out to another zip file. The basic introduction to java.util.zip has a lot of Java-isms and I'd love to make my code more Scala-native. Specifically, I'd like to avoid the use of vars. Here's what I have:

val fos = new FileOutputStream("new.zip");
val zipOut = new ZipOutputStream(new BufferedOutputStream(fos));

while (zipIn.available == 1) {
  val entry = zipIn.getNextEntry
  if (entryIsValid(entry)) {
    zipOut.putNewEntry(new ZipEntry("subdir/" + entry.getName())
    // read data into the data Array
    var data = Array[Byte](1024)
    var count = zipIn.read(data, 0, 1024)
    while (count != -1) {
      zipOut.write(data, 0, count)
      count = zipIn.read(data, 0, 1024)
    }
  }
  zipIn.close
}
zipOut.close

I should add that I'm using Scala 2.7.7.

+1  A: 

Without tail-recursion, I'd avoid recursion. You would run the risk to a stack overflow. You could wrap zipIn.read(data) in an scala.BufferedIterator[Byte] and go from there.

sblundy
Ok... So are you suggesting that there isn't any better approach?
pr1001
Sorry, took me a few minutes to think of something.
sblundy
Hehe, fair enough!
pr1001
+1  A: 

Using scala2.8 and tail recursive call :

def copyZip(in: ZipInputStream, out: ZipOutputStream, bufferSize: Int = 1024) {
  val data = new Array[Byte](bufferSize)

  def copyEntry() {
    in getNextEntry match {
      case null =>
      case entry => {
        if (entryIsValid(entry)) {
          out.putNextEntry(new ZipEntry("subdir/" + entry.getName()))

          def copyData() {
            in read data match {
              case -1 =>
              case count => {
                out.write(data, 0, count)
                copyData()
              }
            }
          }
          copyData()
        }
        copyEntry()
      }
    }
  }
  copyEntry()
}
Patrick
Thanks, that looks quite nice. Unfortunately I should have specified I'm still on 2.7.7.
pr1001
Also, how will this blow up in 2.7.7 versus 2.8? I'm not very conversant on the subject of tail recursion. Thanks.
pr1001
@pr1001 Scala 2.8 optimize tail call when possible so you will avoid stackoverflow. For an introduction to what is tail call i suggest you read this entry for example : http://blog.richdougherty.com/2009/04/tail-calls-tailrec-and-trampolines.html
Patrick
+18  A: 

I don't think there's anything particularly wrong with using Java classes that are designed to work in imperative fashion in the fashion they were designed. Idiomatic Scala includes being able to use idiomatic Java as it was intended, even if the styles do clash a bit.

However, if you want--perhaps as an exercise, or perhaps because it does slightly clarify the logic--to do this in a more functional var-free way, you can do so. In 2.8, it's particularly nice, so even though you're using 2.7.7, I'll give a 2.8 answer.

First, we need to set up the problem, which you didn't entirely, but let's suppose we have something like this:

import java.io._
import java.util.zip._
import scala.collection.immutable.Stream

val fos = new FileOutputStream("new.zip")
val zipOut = new ZipOutputStream(new BufferedOutputStream(fos))
val zipIn = new ZipInputStream(new FileInputStream("old.zip"))
def entryIsValid(ze: ZipEntry) = !ze.isDirectory

Now, given this we want to copy the zip file. The trick we can use is the continually method in collection.immutable.Stream. What it does is perform a lazily-evaluated loop for you. You can then take and filter the results to terminate and process what you want. It's a handy pattern to use when you have something that you want to be an iterator, but it isn't. (If the item updates itself you can use .iterate in Iterable or Iterator--that's usually even better.) Here's the application to this case, used twice: once to get the entries, and once to read/write chunks of data:

val buffer = new Array[Byte](1024)
Stream.continually(zipIn.getNextEntry).
  takeWhile(_ != null).filter(entryIsValid).
  foreach(entry => {
    zipOut.putNextEntry(new ZipEntry("subdir/"+entry.getName))
    Stream.continually(zipIn.read(buffer)).takeWhile(_ != -1).
      foreach(count => zipOut.write(buffer,0,count))
  })
}
zipIn.close
zipOut.close

Pay close attention to the . at the end of some lines! I would normally write this on one long line, but it's nicer to have it wrap so you can see it all here.

Just in case it isn't clear, let's unpack one of the uses of continually.

Stream.continually(zipIn.read(buffer))

This asks to keep calling zipIn.read(buffer) for as many times as necessary, storing the integer that results.

.takeWhile(_ != -1)

This specifies how many times are necessary, returning a stream of indefinite length but which will quit when it hits a -1.

.foreach(count => zipOut.write(buffer,0,count))

This processes the stream, taking each item in turn (the count), and using it to write the buffer. This works in a slightly sneaky way, since you rely upon the fact that zipIn has just been called to get the next element of the stream--if you tried to do this again, not on a single pass through the stream, it would fail because buffer would be overwritten. But here it's okay.

So, there it is: a slightly more compact, possibly easier to understand, possibly less easy to understand method that is more functional (though there are still side-effects galore). In 2.7.7, in contrast, I would actually do it the Java way because Stream.continually isn't available, and the overhead of building a custom Iterator isn't worth it for this one case. (It would be worth it if I was going to do more zip file processing and could reuse the code, however.)


Edit: The looking-for-available-to-go-zero method is kind of flaky for detecting the end of the zip file. I think the "correct" way is to wait until you get a null back from getNextEntry. With that in mind, I've edited the previous code (there was a takeWhile(_ => zipIn.available==1) that is now a takeWhile(_ != null)) and provided a 2.7.7 iterator based version below (note how small the main loop is, once you get through the work of defining the iterators, which do admittedly use vars):

val buffer = new Array[Byte](1024)
class ZipIter(zis: ZipInputStream) extends Iterator[ZipEntry] {
  private var entry:ZipEntry = zis.getNextEntry
  def hasNext = entry != null
  def next = { val previous = entry; entry = zis.getNextEntry; previous }
}
class DataIter(is: InputStream, ab: Array[Byte]) extends Iterator[(Int,Array[Byte])] {
  private var count = 0
  private var waiting = false
  def hasNext = { 
    if (!waiting && count != -1) { count = is.read(ab); waiting=true }
    count != -1
  }
  def next = { waiting=false; (count,ab) }
}
(new ZipIter(zipIn)).filter(entryIsValid).foreach(entry => {
  zipOut.putNextEntry(new ZipEntry("subdir/"+entry.getName))
  (new DataIter(zipIn,buffer)).foreach(cb => zipOut.write(cb._2,0,cb._1))
})
zipIn.close
zipOut.close
Rex Kerr
Thanks, Rex, that's a very nice answer.
pr1001
Thanks for the continually trick
Patrick
+1  A: 

I'd try something like this (yes, pretty much the same idea sblundy had):

Iterator.continually {
  val data = new Array[Byte](100)
  zipIn.read(data) match {
    case -1 => Array.empty[Byte]
    case 0  => new Array[Byte](101) // just to filter it out
    case n  => java.util.Arrays.copyOf(data, n)
  }
} filter (_.size != 101) takeWhile (_.nonEmpty)

It could be simplified like below, but I'm not very fond of it. I'd prefer for read not to be able to return 0...

Iterator.continually {
  val data = new Array[Byte](100)
  zipIn.read(data) match {
    case -1 => new Array[Byte](101)
    case n  => java.util.Arrays.copyOf(data, n)
  }
} takeWhile (_.size != 101)
Daniel