ansaurus

Question

Answer 1

+6 A:

First off, if you want to learn about parsing, then write your own recursive descent parser. The language you've defined only requires a handful of productions. I suggest using Python's tokenize library to spare yourself the boring task of converting a stream of bytes into a stream of tokens.

For practical parsing options, read on...

A quick and dirty solution is to use python itself:

NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`

rules = {
  '*': {     # Applies to all data
    'isYummy': {},      # Everything must be yummy

    'chocolate': {        # To validate, say `validate("chocolate", object)`
      'sweet': {},        # chocolate must be sweet (but not necessarily chocolate.*)

      'lindt': {          # To validate, say `validate("chocolate.lindt", object)`
        'tasty':{}        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        '*': {            # Applies to all data under chocolate.lindt
          'smooth': {}  # Could also be written smooth()
          'creamy': 1   # Level 1 creamy
        },
# ...
    }
  }
}

There are several ways to pull off this trick, e.g., here's a cleaner (albeit somewhat unusual) approach using classes:

class _:
    class isYummy: pass

    class chocolate:
        class sweet: pass

        class lindt:
            class tasty: pass

            class _:
                class smooth: pass
                class creamy: level = 1
# ...

As an intermediate step to a full parser, you can use the "batteries-included" Python parser, which parses Python syntax and returns an AST. The AST is very deep with lots of (IMO) unnecessary levels. You can filter these down to a much simpler structure by culling any nodes that have only one child. With this approach you can do something like this:

import parser, token, symbol, pprint

_map = dict(token.tok_name.items() + symbol.sym_name.items())

def clean_ast(ast):
    if not isinstance(ast, list):
        return ast
    elif len(ast) == 2: # Elide single-child nodes.
        return clean_ast(ast[1])
    else:
        return [_map[ast[0]]] + [clean_ast(a) for a in ast[1:]]

ast = parser.expr('''{

'*': {     # Applies to all data
  isYummy: _,    # Everything must be yummy

  chocolate: {        # To validate, say `validate("chocolate", object)`
    sweet: _,        # chocolate must be sweet (but not necessarily chocolate.*)

    lindt: {          # To validate, say `validate("chocolate.lindt", object)`
      tasty: _,        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

      '*': {            # Applies to all data under chocolate.lindt
        smooth: _,  # Could also be written smooth()
        creamy: 1   # Level 1 creamy
      }
# ...
    }
  }
}

}''').tolist()
pprint.pprint(clean_ast(ast))

This approach does have its limitations. The final AST is still a bit noisy, and the language you define has to be interpretable as valid python code. For instance, you couldn't support this...

*:
    isYummy

...because this syntax doesn't parse as python code. Its big advantage, however, is that you control the AST conversion, so it is impossible to inject arbitrary Python code.

Marcelo Cantos 2010-01-10 05:58:27

Hmmm, interesting. This would make parsing very easy, but I'd really like to learn how to do the parsing myself. Also, this file is editable through a web interface, so I'd rather not allow arbitrary Python code to be entered, even if it is only editable by admins ;-)

Cameron 2010-01-10 06:06:21

+1, yes dict seems the optimal solution here and can be used easily on javascript side and passed as json

Anurag Uniyal 2010-01-10 06:17:52

Yes, one of the reasons that I invented this pseudo-language was to have a custom format that could be "compiled" into both Python and JavaScript objects.

Cameron 2010-01-10 06:29:33

Thanks all for the comments. I've added a section at the front to cover learning to parse, and one at the end for a cleaner (and in some ways dirtier) practical solution.

Marcelo Cantos 2010-01-10 07:32:17

+1: Don't invent a DSL. Just use Python. It's so much simpler and it already works. When you have a problem that's solved by a DSL, you now have two problems. The original problem plus support for the DSL.

S.Lott 2010-01-10 14:23:10

I think I'll go with a recursive descent [email protected]: Lol, you're right, now I have two problems, but I'd rather have a programming-language-independent format than one that's tied to Python. darren's answer points out that there are some such formats (e.g. XML) already out there with many parsers to choose from (in any number of programming languages), but I'd like to have my own that I could learn to parse (plus, IMO, XML has too much cruft for something that can be represented so simply)

Cameron 2010-01-11 04:58:13

Good move, @Cameron. I'd still suggest at least starting with Python's `tokenize`, since it deals with the whole indentation mess for you.

Marcelo Cantos 2010-01-12 02:47:29

Answer 2

A:

There are libraries and tools to make parsing easier. One of the more well known is lex / yacc. There's a python library called 'lex' and a tutorial on using it.

SpliFF 2010-01-10 06:18:55

See also David Beazley's PLY: http://www.dabeaz.com/ply/

gotgenes 2010-03-02 00:05:48

Answer 3

A:

As 'Marcelo Cantos' suggested you can use python dict, benefit is that you do not have to parse any thing, you can use same rules on server side as python dict and on client side using javascript objects, and can pass them from server to client or viceversa as JSON.

If you really want to do parsing yourself see this http://nedbatchelder.com/text/python-parsers.html

but I am not sure you will be easily able to parse a indented language.

Anurag Uniyal 2010-01-10 06:20:51

Thanks for the link. I'm doing this project more to learn than to actually get things done (though that would be nice too)

Cameron 2010-01-10 06:28:14

Answer 4

+1 A:

The language you've shown an example for is probably too complex to write a simple (and bug-free) parsing fuction for. I'd suggest reading up on parsing techniques such as recursive-descent or table-driven parsing such as LL(1), LL(k), etc.

But that may be too general and/or complicated. It might be easier to simplify your rules language to something simple like delimited text.

For example, something like

chocolate:sweet
chocolate.lindt:tasty
chocolate.lindt.*:smooth,creamy(1)

This would be easier to parse and could be done without formal parsers.

Sam Post 2010-01-10 06:23:55

Answer 5

+2 A:

If your goal is to learn about parsing, I'd highly recommend an OO style library like PyParsing. They are not as fast as the more sophisticated antler, lex, yac options, but you get started with the parsing right away.

Shane Holloway 2010-01-10 07:01:41

At PyCon 2010, Andrew Dalke showed that, while PyParsing can be slow, PLY is actually quite a fast alternative. His slides are up at http://us.pycon.org/2010/conference/schedule/event/139/ and video of his talk should be up shortly. His blog writings are also not to be missed if you're into parsing. http://www.dalkescientific.com/writings/diary/

gotgenes 2010-03-02 00:04:30

@gotgenes I would have liked to attend that talk!

Shane Holloway 2010-03-02 17:03:31

Answer 6

A:

what is the motivation for the customized file structure? Would it be possible to remodel your data into a better known structure like XML? If so you could use one of a multitude to parse your file. Using an accepted parsing tool may save you a lot of time debugging, and it may make your file more readable if that is a consideration

darren 2010-01-10 07:02:57

Answer 7

+1 A:

Again not teaching you about parsing, but your format is so close to legal YAML that you might want to just redefine your language as a subset of YAML and use a standard YAML parser.

Beni Cherniavsky-Paskin 2010-03-01 23:48:39

+1! I realized this myself a little while ago, and started implementing it this way. It worked great, though I didn't learn much about parsing. I've since gone in a different direction that does not use a DSL, but thanks for the reply!

Cameron 2010-03-02 15:19:12

ansaurus

tags:

views:

answers:

Tips on how to parse custom file format

related questions