views:

443

answers:

3

Hi,

I'm writing a parser in Emacs Lisp. It's a parser for text files looking like this:

rule:
  int: 1, 2, 3, ...
  string: and, or, then, when
  text:
  ----------
  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque
  in tellus. In pharetra consequat augue. In congue. Curabitur
  pellentesque iaculis eros. Proin magna odio, posuere sed, commodo nec,
  varius nec, tortor.
  ----------
  more: ...

rule:
  ...

I don't really care about the key (int, string, ...). I want the value. So for the file above int has value "1, 2, 3, ...", string "and, or, then, when" and text "Lorem ..." (excluding the dashes).

I'm thinking about two different solutions, but I don't which one to use. Should I:

  1. create a simple parser that loops through all lines and for each line matches it with some regex and then group the parts I want out?

  2. do a more sophisticated parser with a lexer and a parser?

Right now the files are quite simple and I guess I don't need to do something as advance as the second option. But these files may get a bit more complicated, so I want to make it easy to extend.

How would you solve this?

+3  A: 

for parser stuff look to the Semantic library from CEDET project

Alex Ott
I did check that out. It seemed fairly overkill though. I guess it's much to learn to be able to do anything useful with it.
rejeep
CEDET really is the way to go if you think it is going to get at all complex. CEDET has been added to the upcoming GNU Emacs 23.2, so it is the officially sanctioned way forward. It all depends on how complex the grammar is and how much you expect the format to expand. Unless you're very sure the grammar won't get much more complicated, I would probably go with CEDET's Semantic.
haxney
+3  A: 

There is a relatively simple parser you can find on the Emacs Wiki: ParserCompiler

The Parser Compiler for Emacs creates Recursive Descent parsers in pure elisp.

The goal of the project is to create a useful Parser Compiler that is both innovative and practically useful. This is an original work created by Mike Mattie - [email protected]

Parsers are compiled by a Macro that translates a parser definition DSL to pure elisp. The syntax supports the PEG grammar class currently.

Trey Jackson
So if I use a Parser Compiler, do I have to include that library with my code? I would like to avoid external libraries and write the parser by hand.
rejeep
@rejeep Yes you would have to include the library.
Trey Jackson
+5  A: 

Are you already familiar with recursive descent parsers? They're relatively easy to write by hand in your favourite programming language, which would include Emacs Lisp. For very simple parsing, you can often get by with looking-at and search-forward. These would also form the basis of any tokenizing routines that would be called by your recursive descent parser, or any other style of parser.

[11 Feb 2009] I added an example recursive descent parser in emacs lisp below. It parses simple arithmetic expressions including addition, subtraction, multiplication, division, exponentiation, and parenthesized sub-expressions. Right now, it assumes all tokens are in the global variable *tokens*, but if you modify gettok and peektok as necessary you can have them walk through a buffer. To use it as is, just try out the following:

(setq *token* '( 3 ^ 5 ^ 7 + 5 * 3 + 7 / 11))
(rdh/expr)
=> (+ (+ (^ 3 (^ 5 7)) (* 5 3)) (/ 7 11))

The parsing code follows.

(defun gettok ()
  (and *token* (pop *token*)))
(defun peektok ()
  (and *token* (car *token*)))

(defun rdh/expr ()
  (rdh/expr-tail (rdh/factor)))

(defun rdh/expr-tail (expr)
  (let ((tok (peektok)))
    (cond ((or (null tok)
           (equal tok ")"))
       expr)
      ((member tok '(+ -))
       (gettok)
       (let ((fac (rdh/factor)))
         (rdh/expr-tail (list tok expr fac))))
      (t (error "bad expr")))))

(defun rdh/factor ()
  (rdh/factor-tail (rdh/term)))

(defun rdh/factor-tail (fac)
  (let ((tok (peektok)))
    (cond ((or (null tok)
           (member tok '(")" + -)))
       fac)
      ((member tok '(* /))
       (gettok)
       (let ((term (rdh/term)))
         (rdh/factor-tail (list tok fac term))))
      (t (error "bad factor")))))

(defun rdh/term ()
  (let* ((prim (rdh/prim))
         (tok (peektok)))
    (cond ((or (null tok)
               (member tok '(")" + - / *)))
           prim)
          ((equal tok '^)
           (gettok)
           (list tok prim (rdh/term)))
          (t (error "bad term")))))

(defun rdh/prim ()
  (let ((tok (gettok)))
    (cond ((numberp tok) tok)
      ((equal tok "(")
       (let* ((expr (rdh/expr))
          (tok (peektok)))
         (if (not (equal tok ")"))
         (error "bad parenthesized expr")
           (gettok)
           expr)))
      (t (error "bad prim")))))
Dale Hagglund
I found the recursive descent parsers page earlier. To bad the c-example wasn't complete so that I could test it though. But I guess this is a good way to do the parser in my case. Do you know of any Lisp example?
rejeep
I don't know of any particular lisp examples, I'm afraid, but as far as I could tell all that's missing from the example is the tokenizing routines.
Dale Hagglund
Don't know almost any c, so it would have been a lot better if the example was working from scratch. But I'll google for other examples. Thanks!
rejeep
I added an toy elisp recursive descent parser to my response.
Dale Hagglund
Thanks a lot for the example! Cant make it run though: Debugger entered--Lisp error: (wrong-type-argument symbolp (tok (peektok)))
rejeep
Sorry about that. I had tried it out, honest, but then I changed just one last thing and forgot to test it again. I've updated the code so it should work.
Dale Hagglund
Don't know if I'm doing something wrong. But I still cant make it work. In the function rdh/term, what is the term variable?
rejeep
Once again, I apologize. I don't know if I just plain forgot to save my updates, or did something wrong so they were lost. Anyway, try it now...
Dale Hagglund
Nice, this time it worked perfectly fine! :) Thanks! I will play with it and see if I can make it work any good with my files.
rejeep