views: 362
answers: 7

So I'm reading this interview with Walter Bright about the D language in Bitwise (http://www.bitwisemag.com/copy/programming/d/interview/d_programming_language.html), and I come across this really interesting quote about language parsing:

From a theoretical perspective, however, being able to generate a good diagnostic requires that there be redundancy in the syntax. The redundancy is used to make a guess at what was intended, and the more redundancy, the more likely that guess will be correct. It's like the English language - if we misspell a wrod now and then, or if a word missing, the redundancy enables us to correctly guess the meaning. If there is no redundancy in a language, then any random sequence of characters is a valid program.

And now I'm trying to figure out what the heck he means when he says "redundancy".

I can barely wrap my head around the last part, where he mentions that it is possible to have a language in which "any random sequence of characters is a valid program." I was taught that there are three kinds of errors: syntactic, run-time, and semantic. Are there languages in which the only possible errors are semantic? Is assembly like that? What about machine code?

+2  A: 

It means that the syntax contains more information than necessary to encode a working program. An example is function prototypes. As K&R C shows us, they're redundant because the compiler can just let the caller push whatever arguments you want on, then let the function pop the correct arguments off. But C++ and other languages mandate them, because they help the compiler check that you're calling the function the right way.

Another example is the requirement to declare variables before using them. Some languages have this requirement, while others don't. It is clearly redundant, but it often helps prevent errors (e.g. misspelling a variable, or using one that has been removed).

Matthew Flaschen
Basically correct, though the example is not the best: headers only "help" the compiler in that they make single-pass compilers possible (because forward references are forbidden).
delnan
@delnan, I don't follow. In C, without headers, the compiler can't check your function call at all. At best, you'll get a linker error. The redundant information in headers thus lets the compiler protect the programmer from certain errors.
Matthew Flaschen
Forward declarations != headers. Forward declarations make single-pass compilers possible. Headers make separate compilation possible when type information is not part of the object file format.
dan04
It is entirely possible to check function calls in the compiler without headers or similar, though that requires a multi-pass compiler and forward references. And of course if the header is wrong, you don't get a compiler error, but again at most a linker error. Edit: @dan04, nobody is talking about forward *declarations*, only about forward *references*.
delnan
@delnan, as @dan noted, object files (ELF, etc.) don't preserve type information. So I don't understand how a multi-pass compiler (without function prototypes, typically in headers) helps. It could check that your function calls are internally consistent, but not that those to external library functions are correct.
Matthew Flaschen
Ah, yes, of course it requires other means of getting the type information. Stupid me, just taking that for granted. Guess how modern languages handle that ;) the compiler looks in the file that supplies the implementation. Yeah, that prevents separate compilation (less of an issue if the language is easier to parse, or with a precompiled-header equivalent like D interfaces) and doesn't take already-compiled static libraries (i.e. no source) into account. And btw, headers don't guarantee much either - if source and header are out of sync, the compiler is still happy.
delnan
@Matthew: I think you're right: a multipass compiler is irrelevant here. What you need is either an "object" format that preserves type information (e.g., a Java class file) or else eliminate separate compilation, and instead use a "global" compiler that looks at everything at once.
Jerry Coffin
+6  A: 

Well, to use an example from C# (since I don't know D). If you have a class with an abstract method, the class itself must be marked abstract:

public abstract class MyClass
{
    public abstract void MyFunc();
}

Now, it would be trivial for the compiler to mark MyClass as abstract automatically (which is how C++ handles it), but in C# you must do it explicitly, so that your intentions are clear.

Similarly with virtual methods. In C++, if you declare a method virtual in a base class, it is automatically virtual in all derived classes. In C#, the overriding method must nevertheless be explicitly marked override, so there is no confusion about what you wanted.

James Curran
But these examples don't help the parser/compiler to give more insight, right? I mean it knows exactly as well in both cases what's going on and could respond exactly the same way.
Mark Peters
The compiler knows what *it* thinks the code means; it does not know what *you* think the code means.
James Curran
Sorry, what I'm saying is this: If you implement a subclass of MyClass and don't implement MyFunc, the compiler can tell you in both cases exactly what's wrong: you didn't implement an abstract method. If you try to instantiate MyClass the compiler can tell you exactly what's wrong: MyClass is abstract (it can even say "due to MyFunc being abstract"). The redundancy here doesn't have anything to do with what Bright was talking about: parsing/compiling hints and communication back to the user. It's just redundancy for those looking at the code which is different.
Mark Peters
BTW, ironically, you mispelled "abstract" in your snippit :-).
Mark Peters
@Mark: Ironically, you misspelled "snippet" in your observation. The likelihood that someone playing the role of Grammar Nazi makes a grammar or spelling error of their own is inordinately high, it seems, so it's best to just ignore all but a few exceptional cases.
Jon Purdy
+8  A: 

nglsh nclds ll srts of xtr ltrs t mk it ezr t read

Doug Currie
Very good example :)
James Gaunt
In fact, it's psibolse to raearnrge all but the fsrit and lsat ltteers of a wrod and raiten at lseat smoe snsee. And dha rizn fonetisaizd Ingglish haznt cot on iz dhat it's notoriasly mor difikalt tu rid widhaut dha vizhual kyuz uv dha kurent orthografi.
Jon Purdy
+6  A: 

Assembly language (most assembly languages, anyway) is not like that at all -- they have quite a rigid syntax, and most random strings would be diagnosed as errors.

Machine code is a lot closer. Since there's no translation from "source" to "object" code involved, all errors are semantic, not syntactic. Most processors do have various inputs they'd reject (e.g., by executing a "bad opcode" trap/interrupt). You could argue that some of these are syntactic errors (e.g., an opcode that isn't recognized at all) while others are semantic (e.g., a set of operands that isn't allowed for that instruction).

For those who remember it, TECO was famous (notorious?) for assigning some meaning to almost any possible input, so it was pretty much the same way. An interesting challenge was to figure out what would happen if you typed in (for one example) your name.

Jerry Coffin
Re. TECO: there's a similar challenge involving predicting the effect of holding down Control or Meta and typing your name in Emacs.
Jon Purdy
+1 for bringing it old school. Thanks, this was just the kind of thing I was wondering about.
tel
You have a TECO link?
BCS
Yeah, it took me a minute to find it too (kept getting hits on google for Tampa electric): http://en.wikipedia.org/wiki/Text_Editor_and_Corrector
tel
+14  A: 

I'll focus on why (I think) Walter Bright thinks redundancy is good. Let's take XML as an example. This snippet:

<foo>...</foo>

has redundancy: the closing tag is redundant. If we use S-expressions instead:

(foo ...)

It's shorter, and the programmer doesn't have to type foo more often than necessary to make sense of that snippet. Less redundancy. But it has downsides, as an example from http://www.prescod.net/xml/sexprs.html shows:

(document author: "[email protected]"
    (para "This is a paragraph " (footnote "(better than the one under there)" ".")
    (para "Ha! I made you say \"underwear\"."))


<document author="[email protected]">
<para>This is a paragraph <footnote>(just a little one).</para>
<para>Ha! I made you say "underwear".</para>
</document>

In both, the end tag/closing paren for footnote is missing. The XML version is plainly invalid as soon as the parser sees </para>. The S-expression one is only invalid by the end of the document, and only if you don't have an unneeded closing paren somewhere else. So redundancy does help, in some cases, to understand what the writer meant (and to point out errors in his way of expressing it).

delnan
Thanks for the very complete answer. Do you know anywhere I could start looking for a more technical/theoretical discussion on how parsers actually implement this kind of thing? I guess that it's pretty easy to figure out how the XML parser would specifically identify the error when it sees </para> when it was expecting </footnote>, but, in general, how do parsers hunt down the column, row, and type of an error? Is there like a "General Theory of Error-Handling for Dummies"?
tel
As far as I know there isn't a "General Theory of Error-Handling" at all. It seems to be more of an art.
BCS
I agree with BCS, I don't think there is such a thing. But I suppose (some of) the usual compiler resources (most of which focus heavily on the frontend, i.e. the parser) include *one* way of doing so.
delnan
+2  A: 

I think a better example of redundancy is something like `int a[10] =`. At this point, the compiler knows what should come next (an int array initializer) and can produce an appropriate error message if what follows isn't one. If the language syntax said that anything could follow `int a[10]`, it would be a lot harder for the compiler to pinpoint the problem.

David Thornley
+3  A: 

I think he was talking about syntactical structures in the language and how they can be interpreted. As an example, consider the humble "if" statement, rendered in several languages.

In bash (shell script), it looks like this:

if [ cond ]; then
  stmts;
elif [ other_cond ]; then
  other_stmts;
else
  other_other_stmts;
fi

in C (w/ single statements, no curly braces):

if (cond)
  stmt;
else if (other_cond)
  other_stmt;
else
  other_other_stmt;

You can see that in bash, there is a lot more syntactical structure to the if statement than there is in C. In fact, all control structures in bash have their own closing delimiters (e.g. if/then/fi, for/do/done, case/in/esac,...), whereas in C the curly brace is used everywhere. These unique delimiters disambiguate the meaning of the code, and thereby provide context from which the interpreter/compiler can diagnose error conditions and report them to the user.

There is, however, a tradeoff. Programmers generally prefer terse syntax (a la C, Lisp, etc.) to verbose syntax (a la Pascal, Ada, etc.). However, they also prefer descriptive error messages containing line/column numbers and suggested resolutions. These goals are of course at odds with each other--you can't have your cake and eat it too (at least, while keeping the internal implementation of the compiler/interpreter simple).

Drew Hall
It's pretty neat to think about all of these design tradeoffs that underlie the layout and syntax of all of these languages. It really makes you realize that to write off aspects of a language, like, for example, its choice of syntax, as arbitrary or merely a function of "style" is somewhat naive. All of these choices were made consciously and (I guess at least in the case of good languages) rationally. This idea that there are tradeoffs between all of these conflicting goals also helps to explain why there are like 1000+ languages out there :)
tel
There's also `while;do;done`. I don't think it's really by design either; just an artifact of traditional sh. And then, of course, there's Java, with incredibly verbose identifiers and a handful of braces.
tc.
@tc: Agreed--I didn't mean to imply that it had been done on purpose to introduce redundancy, just that it had had that effect. Honestly I feel a bit guilty picking on sh/bash for redundancy given its usual succinctness (one character special variables, etc.). It was just an obvious example of a verbose version of a construct (if/else) that can be expressed more succinctly (less redundantly) in another language.
Drew Hall