I'm looking to write a programming language for fun. However, most of the resources I have seen are for writing a context-free language, whereas I want to write a language that, like Python, uses indentation, which to my understanding means it can't be context-free.

+2  A: 

I don't know of any tutorials/guides, but you could try looking at the source for tinypy; it's a very small implementation of a Python-like language.

Ryan
+5  A: 

You might want to read this rather well-written essay on parsing Python: "Python: Myths about Indentation".

While I haven't tried to write a parser for an indentation-based language using something like yacc, I think it may be possible with a conditional lexer that returns indentation-change tokens, as described in that essay.
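
As a rough sketch of that idea (not taken from the essay; the INDENT/DEDENT token names and the helper function are invented here for illustration), a lexer pass in Python might track an indentation stack like this:

def indentation_tokens(lines):
    # Stack of active indentation widths; changes in leading whitespace
    # become INDENT/DEDENT tokens so the grammar proper can stay context-free.
    stack = [0]
    for line in lines:
        if not line.strip():
            continue                      # ignore blank lines
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            yield ('INDENT', width)
        while width < stack[-1]:
            stack.pop()
            yield ('DEDENT', width)
        yield ('LINE', line.strip())
    while len(stack) > 1:                 # close blocks still open at end of input
        stack.pop()
        yield ('DEDENT', 0)

for token in indentation_tokens(["if x:", "    y = 1", "print(x)"]):
    print(token)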

By the way, here is the official Python grammar from python.org: http://www.python.org/doc/current/ref/grammar.txt

Jason Dagit
+2  A: 

Using indentation in a language doesn't necessarily mean that the language's grammar cannot be context-free. That is, the indentation determines the scope in which a statement exists, but a statement is still a statement no matter which scope it is defined within (scope can often be handled by a different part of the compiler/interpreter, generally during semantic analysis).

That said, a good resource is the ANTLR tool (http://www.antlr.org). The author of the tool has also written a book on creating parsers for languages using ANTLR (http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference). There is pretty good documentation, and there are lots of example grammars.

Michael Barker
A: 

Just because a language uses significant indentation doesn't mean that it is inherently context-sensitive. As an example, Haskell makes use of significant indentation, and (to my knowledge) its grammar is context-free.

An example of source requiring a context-sensitive grammar could be this snippet from Ruby:

my_essay = <<END_STR
This is within the string
END_STR

class << self
  def other_method
    ...
  end
end

Another example would be Scala's XML mode:

def doSomething() = {
  val xml = <code>def val <tag/> class</code>
  xml
}

As a general rule, context-sensitive languages are slightly harder to imagine in any precise sense and thus far less common. Even Ruby and Scala don't really count, since their context-sensitive features encompass only a minor subset of the language. If I were you, I would formulate my grammar as inspiration dictates and then worry about parsing methodologies at a later date. I think you'll find that whatever you come up with will be naturally context-free, or very close to it.

As a final note, if you really need context-sensitive parsing tools, you might try some of the less rigidly formal techniques. Parser combinators are used in Scala's parsing. They have some annoying limitations (no lexing), but they aren't a bad tool. LL(*) tools like ANTLR also seem to be more adept at expressing such "ad hoc" parsing escapes. Don't try to use Yacc or Bison with a context-sensitive grammar; they are far too strict to express such concepts easily.

Daniel Spiewak
A: 

I would recommend that you write your parser by hand, in which case having significant whitespace should not present any real problems.

The main problem with using a parser generator is that it is difficult to get good error recovery in the parser. If you plan on implementing an IDE for your language, then having good error recovery is important for getting things like IntelliSense to work. IntelliSense always works on incomplete syntactic constructs, and the better the parser is at figuring out what construct the user is trying to type, the better an IntelliSense experience you can deliver.

If you write a hand-written top-down parser, you can pretty much implement whatever rules you want, wherever you want to. This is what makes it easy to provide error recovery. It will also make it trivial for you to implement significant whitespace: you can simply store the current indentation level in a variable inside your parser class, and stop parsing blocks when you encounter a token on a new line whose column position is less than the current indentation level.

Also, chances are that you are going to run into ambiguities in your grammar. Most "production" languages in wide use have syntactic ambiguities. A good example is generics in C#: there are ambiguities around "<" in an expression context, where it can be either a "less-than" operator or the start of a "generic argument list". In a hand-written parser, resolving ambiguities like that is trivial. You can just add a little bit of non-determinism where you need it, with relatively little impact on the rest of the parser.
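
A minimal Python sketch of the indentation-tracking part of that approach (the (column, text) token shape is invented here, and real statement parsing is stubbed out to a single line per statement):

class Parser:
    def __init__(self, lines):
        # Pre-compute (column, text) pairs for every non-blank line.
        self.tokens = [(len(l) - len(l.lstrip(' ')), l.strip())
                       for l in lines if l.strip()]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_block(self, indent):
        # Keep parsing statements until a token starts in a column to the
        # left of the block's indentation level (or the input runs out).
        stmts = []
        while self.peek() is not None and self.peek()[0] >= indent:
            stmts.append(self.parse_statement())
        return stmts

    def parse_statement(self):
        col, text = self.tokens[self.pos]
        self.pos += 1
        if text.endswith(':'):            # a header line opens a nested block
            nxt = self.peek()
            body_indent = nxt[0] if nxt and nxt[0] > col else col + 1
            return (text, self.parse_block(body_indent))
        return text

print(Parser(["if x:", "    y = 1", "print(x)"]).parse_block(0))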

Furthermore, because you are designing the language yourself, you should assume its design is going to evolve rapidly (for languages with standards committees, like C++, this is not the case). Making changes to automatically generated parsers, either to handle ambiguities or to evolve the language, may require you to do significant refactoring of the grammar, which can be both irritating and time-consuming. Changes to hand-written parsers, particularly top-down parsers, are usually pretty localized.

I would say that parser generators are only a good choice if:

  1. You never plan on writing an IDE ever,
  2. The language has really simple syntax, or
  3. You need a parser extremely quickly, and are ok with a bad user experience
Scott Wisniewski
C was built using a parser generator (yacc). Scala is built using combinators, which are sort-of like an implicit parser generator. Building a parser by hand is error-prone and difficult to maintain in the non-trivial case. I strongly disagree with all three of your reasons. :-)
Daniel Spiewak
For command line compilers, parser generators are OK. A batch parser does not need much error recovery. For interactive scenarios, however, I really think you need a hand-written parser. In my experience hand-written parsers are pretty easy to write, and are also relatively easy to maintain.
Scott Wisniewski
In some sense I can see your point. Editors are very different from compilers. However, generally speaking, I think an editor parser would only need to be top-down (as opposed to the yacc bottom-up) due to the need to unambiguously select production rules.
Daniel Spiewak
The choice has nothing to do with theory. Writing the parser by hand yields a better user experience. Once you decide to write the parser by hand, the choice of top-down is obvious. Top-down parsers are easy to write. Bottom-up parsers are hard to write.
Scott Wisniewski
+1  A: 

Have you read Aho, Sethi, and Ullman, "Compilers: Principles, Techniques, and Tools"? It is the classic reference book on compilers.

/Allan

Allan Wind
A: 

If you've never written a parser before, start with something simple. Parsers are surprisingly subtle, and you can get into all sorts of trouble writing them if you've never studied the structure of programming languages.

Reading Aho, Sethi, and Ullman (it's known as "The Dragon Book") is a good plan. Contrary to other contributors, I say you should play with simpler parser generators like Yacc and Bison first, and only when you get burned because you can't do something with those tools should you go on to try to build something with an LL(*) tool like ANTLR.

Kent Quirk
A: 

A context-sensitive language? This one's non-indented: Protium (http://www.protiumble.com)

boost
+4  A: 

I would familiarize myself with the problem first by reading up on some of the literature that's available on the subject. The classic Compilers book by Aho et al. may be heavy on the math and comp sci, but a much more approachable text is the "Let's Build a Compiler" series of articles by Jack Crenshaw. Mr. Crenshaw wrote these back in the late '80s, and it's the most under-appreciated text on compilers ever written. The approach is simple and to the point: Mr. Crenshaw shows one approach that works. You can easily go through the content in the span of a few evenings and have a much better understanding of what a compiler is all about.

A couple of caveats are that the examples in the text are written in Turbo Pascal and the compilers emit 68K assembler. The examples are easy enough to port to a more current programming language, and I recommend Python for that. But if you want to follow along as the examples are presented, you will at least need Turbo Pascal 5.5 and a 68K assembler and emulator. The text is still relevant today, and using these old technologies is really fun. I highly recommend it as anyone's first text on compilers.

The great news is that languages like Python and Ruby are open source, so you can download and study their C source code in order to better understand how it's done.

Jonas Gorauskas
+16  A: 

A context-free grammar is, simply, one that doesn't require a symbol table in order to correctly parse the code. A context-sensitive grammar does.

The D programming language is an example of a context-free grammar; C++ is a context-sensitive one. (For example, is T*x declaring x to be a pointer to T, or is it multiplying T by x? We can only tell by looking up T in the symbol table to see whether it is a type or a variable.)

Whitespace has nothing to do with it.

D uses a context-free grammar in order to greatly simplify parsing it, and so that simple tools (such as syntax-highlighting editors) can parse it.
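
A toy Python sketch of that symbol-table dependency (the table and the tuple output are invented here for illustration; a real C or C++ parser is of course far more involved): the same tokens T * x ; parse differently depending on what the parser already knows about T.

def parse_star_statement(tokens, symbols):
    # tokens is e.g. ('T', '*', 'x', ';'); the parse depends on whether
    # the symbol table says the first name is a type or a variable.
    name, _star, operand, _semi = tokens
    if symbols.get(name) == 'type':
        return ('declare-pointer', operand, 'pointer to ' + name)
    return ('multiply', name, operand)

tokens = ('T', '*', 'x', ';')
print(parse_star_statement(tokens, {'T': 'type'}))      # declaration of x
print(parse_star_statement(tokens, {'T': 'variable'}))  # expression T * x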

Walter Bright
T*x by itself, or on the left of '=', is assumed to be a pointer declaration both in D and under GCC. I happened to run into this a while back.
BCS
T* x; by itself is valid if it's a declaration and just as valid if it's a multiplication. I don't think it's inherently correct to assume that it's always a pointer declaration. (I can write 2*3; so I could also, just as easily, write something like int T=2; int x=3; T*x; even if the result is lost.)
Dan
A: 

If you're really going to take a whack at language design and implementation, you might want to add the following to your bookshelf:

  • Programming Language Pragmatics, Scott et al.
  • Design Concepts in Programming Languages, Turbak et al.
  • Modern Compiler Design, Grune et al. (I sacrilegiously prefer this to "The Dragon Book" by Aho et al.)

Gentler introductions such as:

  • Crenshaw's tutorial (as suggested by @'Jonas Gorauskas' here)
  • The Definitive ANTLR Reference by Parr
  • Martin Fowler's recent work on DSLs

You should also consider your implementation language. This is one of those areas where different languages vastly differ in what they facilitate. You should consider languages such as LISP, F# / OCaml, and Gilad Bracha's new language Newspeak.

Larry OBrien
+2  A: 

"Context-free" is a relative term. Most context-free parsers actually parse a superset of the language which is context-free and then check the resulting parse tree to see if it is valid. For example, the following two C programs are valid according to the context-free grammar of C, but one quickly fails during context-checking:

int main()
{
    int i;
    i = 1;
    return 0;
}

int main()
{
    int i;
    i = "Hello, world";
    return 0;
}

Free of context, i = "Hello, world"; is a perfectly valid assignment, but in context you can see that the types are all wrong. If the context were char* i; it would be okay. So the context-free parser will see nothing wrong with that assignment. It's not until the compiler starts checking types (which are context dependent) that it will catch the error.

Anything that can be produced with a keyboard can be parsed as context-free; at the very least you can check that all the characters used are valid (the set of all strings containing only displayable Unicode characters is a context-free language). The only limitation is how useful your grammar is and how much context-sensitive checking you have to do on your resulting parse tree.

Whitespace-dependent languages like Python make your context-free grammar less useful and therefore require more context-sensitive checking later on (much of this is done at runtime in Python through dynamic typing). But there is still plenty that a context-free parser can do before context-sensitive checking is needed.
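
A small Python sketch of that split (the parse-tree shape and the type table are invented here for illustration): the context-free parse accepts both assignments from the C example above, and only a later context-sensitive pass rejects the second one.

# Both assignments parsed fine; the type check is the context-sensitive part.
declared_types = {'i': 'int'}

parse_trees = [
    ('assign', 'i', ('int-literal', 1)),           # i = 1;
    ('assign', 'i', ('string-literal', 'Hello')),  # i = "Hello, world";
]

def literal_type(expr):
    return {'int-literal': 'int', 'string-literal': 'char*'}[expr[0]]

for _, name, value in parse_trees:
    expected, actual = declared_types[name], literal_type(value)
    if expected == actual:
        print('ok:', name, '=', repr(value[1]))
    else:
        print('type error: cannot assign', actual, 'to', name, '(' + expected + ')')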

Imagist