Should I use RegexParsers or StandardTokenParsers, or are these not suitable at all for parsing this kind of syntax? An example of the syntax can be found here.

+4  A: 

This format was designed to be easy to parse; you can do it without any regular expressions and without tokenizing your input. Just go line by line and look at the first couple of characters. The file header and chunk headers will require a little more attention, but it's nothing you can't do with split.
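The line-by-line approach can be sketched roughly as follows. This is only an illustration: the names `DiffLine`, `LineByLine`, `classify` and `range` are made up for the example, the file header is assumed to separate the path from the timestamp with a tab, and a missing hunk size is assumed to mean 1.

```scala
// Hypothetical types for the example; none of these names come from the answer.
sealed trait DiffLine
final case class FileHeader(kind: String, path: String) extends DiffLine
final case class HunkHeader(oldStart: Int, oldSize: Int, newStart: Int, newSize: Int) extends DiffLine
final case class Added(text: String) extends DiffLine
final case class Removed(text: String) extends DiffLine
final case class Context(text: String) extends DiffLine
final case class Other(text: String) extends DiffLine

object LineByLine {
  // "-12,5" -> (12, 5); "+12" -> (12, 1). Assumes an omitted size means 1.
  private def range(s: String): (Int, Int) = {
    val parts = s.drop(1).split(",")
    (parts(0).toInt, if (parts.length > 1) parts(1).toInt else 1)
  }

  // Looks only at the first few characters; file headers are checked
  // before plain '+' and '-' lines because they share the same prefix.
  def classify(line: String): DiffLine =
    if (line.startsWith("--- ")) FileHeader("old", line.drop(4).split("\t")(0))
    else if (line.startsWith("+++ ")) FileHeader("new", line.drop(4).split("\t")(0))
    else if (line.startsWith("@@ ")) {
      val fields = line.split(" ") // "@@", "-a,b", "+c,d", "@@", ...
      val (oldStart, oldSize) = range(fields(1))
      val (newStart, newSize) = range(fields(2))
      HunkHeader(oldStart, oldSize, newStart, newSize)
    }
    else if (line.startsWith("+")) Added(line.drop(1))
    else if (line.startsWith("-")) Removed(line.drop(1))
    else if (line.startsWith(" ")) Context(line.drop(1))
    else Other(line)
}
```

Because the check is stateless, a removed line whose own text starts with `-- ` would be mistaken for a file header; tracking how many lines the current hunk still expects would avoid that.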

Of course, if you want to learn how to use some parsing libraries, then go for it.

Radomir Dopieralski
Years ago, in my first week of programming, I started creating patches for Nethack. Not knowing about `diff`, I started writing the damn things by hand. You can imagine my embarrassment when someone in the newsgroup politely informed me that I might be out of my mind. Well, anyway not only are unified diffs easy to parse, they're actually not so hard to write by hand either. :)
guns
I wouldn't want to keep the state explicitly myself, and learning to use parser combinators is also one of my goals. Most of the examples parse some programming-language kind of syntax, but I was wondering whether parser combinators could also be used for parsing diff syntax or even binary formats.
JtR
+2  A: 

I'd use regex. It simplifies a few things, and makes the rest standard.

def process(src: scala.io.Source): Unit = {
  import scala.util.matching.Regex

  // Expects the timestamp wrapped in ''; adjust if your diffs separate path and timestamp differently.
  val FilePattern = """(.*) ''(.*)''"""
  // '+' is a regex metacharacter, so literal plus signs must be escaped.
  val OriginalFile = new Regex("--- "+FilePattern, "path", "timestamp")
  val NewFile = new Regex("""\+\+\+ """+FilePattern, "path", "timestamp")
  val Chunk = new Regex("""@@ -(\d+),(\d+) \+(\d+),(\d+) @@""", "orgStarting", "orgSize", "newStarting", "newSize")
  val AddedLine = """\+(.*)""".r
  val RemovedLine = """-(.*)""".r
  val UnchangedLine = """ (.*)""".r

  src.getLines() foreach {
    case OriginalFile(path, timestamp) => println("Original file: "+path)
    case NewFile(path, timestamp) => println("New file: "+path)
    // Captured groups are Strings, so format them with %s rather than %d.
    case Chunk(l1, s1, l2, s2) => println("Modifying %s lines at line %s, to %s lines at %s" format (s1, l1, s2, l2))
    case AddedLine(line) => println("Adding line "+line)
    case RemovedLine(line) => println("Removing line "+line)
    case UnchangedLine(line) => println("Keeping line "+line)
    case other => println("Skipping line "+other) // avoids a MatchError on anything else
  }
}
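As a small self-contained check of the extractor technique, the sketch below matches a single chunk header and converts the captured groups to numbers. The names `HunkHeaderDemo` and `describe` are invented for the example, and the trailing `(.*)` group is an assumption to tolerate any text some tools print after the closing `@@`.

```scala
import scala.util.matching.Regex

object HunkHeaderDemo {
  // The '+' before the new-file range is escaped; unescaped it would act as a quantifier.
  // The last group swallows any text after the closing @@ (an assumption, see above).
  val Hunk: Regex = """@@ -(\d+),(\d+) \+(\d+),(\d+) @@(.*)""".r

  def describe(line: String): String = line match {
    case Hunk(l1, s1, l2, s2, _) =>
      // Captured groups are Strings, so convert them before formatting with %d.
      "Modifying %d lines at line %d, to %d lines at %d".format(s1.toInt, l1.toInt, s2.toInt, l2.toInt)
    case _ =>
      "Not a chunk header: " + line
  }
}
```

A regex used as an extractor has to match the whole line, which is why the pattern accounts for everything up to the end of the header.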
Daniel
I was hoping to be able to use parser combinators to get rid of keeping the state myself. I'm building an object graph out of the patch details: a Patch contains FileModifications, which in turn contain Chunks. It seems that parser combinators could offer an easier way to create objects out of the parsed input, instead of building the object graph in some variables along the way and tracking the parsing state.
JtR
Btw, I didn't know that regexps can be used in pattern matching like that, very neat!
JtR
+1  A: 

Here is a solution using RegexParsers.

import scala.util.parsing.combinator.RegexParsers

object UnifiedDiffParser extends RegexParsers {

  // case classes representing the data of the diff
  case class UnifiedDiff(oldFile: File, newFile: File, changeChunks: List[ChangeChunk])
  case class File(name: String, timeStamp: String)
  case class ChangeChunk(rangeInformation: RangeInformation, changeLines: List[String])
  case class RangeInformation(oldOffset: Int, oldLength: Int, newOffset: Int, newLength: Int)

  override def skipWhitespace = false

  def unifiedDiff: Parser[UnifiedDiff] = oldFile ~ newFile ~ rep1(changeChunk) ^^ {
    case of ~ nf ~ l => UnifiedDiff(of, nf, l)
  }   

  def oldFile: Parser[File] = ("--- " ~> filename) ~ ("""\s+""".r ~> timestamp <~ newline) ^^ {
    case f~t => File(f, t)
  }   
  def newFile: Parser[File] = ("+++ " ~> filename) ~ ("""\s+""".r ~> timestamp <~ newline) ^^ {
    case f~t => File(f, t)
  }   
  def filename: Parser[String] = """[\S]+""".r
  def timestamp: Parser[String] = """.*""".r

  def changeChunk: Parser[ChangeChunk] = rangeInformation ~ (newline ~> rep1(lineChange)) ^^ {
    case ri ~ l => ChangeChunk(ri, l)
  }   
  def rangeInformation: Parser[RangeInformation] = ("@@ " ~> "-" ~> number) ~ ("," ~> number) ~ (" +" ~> number) ~ ("," ~> number) <~ " @@" ^^ {
    case a ~ b ~ c ~ d => RangeInformation(a, b, c, d)
  }   

  def lineChange: Parser[String] = contextLine | addedLine | deletedLine
  def contextLine: Parser[String] = """ .*""".r <~ newline
  def addedLine: Parser[String] = """\+.*""".r <~ newline
  def deletedLine: Parser[String] = """-.*""".r <~ newline

  def newline: Parser[String] = """\n""".r
  def number: Parser[Int] = """\d+""".r ^^ {_.toInt}

  def main(args: Array[String]): Unit = {
    val reader = { 
      if (args.length == 0) {
        // read from stdin
        Console.in
      } else {
        new java.io.FileReader(args(0))
      }   
    }   
    println(parseAll(unifiedDiff, reader))
  }   
}   
michael.kebe
Thanks, just what I was looking for!
JtR