Hello,

I'm developing a small Python-like language using flex and byacc (for lexing and parsing) and C++, but I have a few questions regarding scope control.

Just like Python, it uses whitespace (spaces or tabs) for indentation. On top of that, I want to implement indexed breaking: for instance, if you type "break 2" inside a while loop that is inside another while loop, it would break out of not only the inner loop but the outer loop as well (hence the number 2 after break), and so on.

example:

while 1
    while 1
        break 2
        'hello world'!! #will never reach this. "!!" outputs with a newline
    end
    'hello world again'!! #also will never reach this. again "!!" used for cout
end
#after break 2 it would jump right here

But since I don't have an "anti-tab" character to check when a scope ends (in C, for example, I would just use the '}' char), I was wondering if this method would be the best:

I would define a global variable, like "int tabIndex", in my yacc file and access it in my lex file using extern. Every time I find a tab character in my lex file, I would increment that variable by 1. When parsing in my yacc file, if I find a "break" keyword, I would decrement tabIndex by the amount typed after it. If I reach EOF after compiling and tabIndex != 0, I would output a compilation error.

Now the problem is: what's the best way to see if the indentation got reduced? Should I read \b (backspace) chars in lex and then reduce the tabIndex variable (when the user doesn't use break)?

Is there another method to achieve this?

Also, just another small question: I want every executable to have its starting point in a function called start(). Should I hardcode this into my yacc file?

Sorry for the long question; any help is greatly appreciated. Also, if someone could provide a yacc file for Python, that would be nice as a guideline (I tried looking on Google and had no luck).

Thanks in advance.

+1  A: 

Very interesting exercise. Can't you use the end keyword to check when the scope ends?

On a different note, I have never seen a language that allows you to break out of several nested loops at once. There may be a good reason for that...

Dima
Our much-maligned friend 'goto' will allow that in C/C++ ... :)
Jeremy Friesner
...and there are good reasons why goto is considered harmful... :)
Dima
Now, now, @Jeremy and @Dima, don't get us all riled up! :)
Kevin Little
Technically, you can do this in JavaScript by wrapping the inner loops in an immediately invoked function and returning from it: for (..) { (function(){ for (..) { for (..) { return; } } })(); }
Christopher Done
+2  A: 

I am currently implementing a programming language rather similar to this (including the multilevel break, oddly enough). My solution was to have the tokenizer emit indent and dedent tokens based on indentation. E.g.:

while 1: # colons help :)
    print('foo')
    break 1

becomes:

["while", "1", ":",
    indent,
    "print", "(", "'foo'", ")",
    "break", "1",
    dedent]

It makes the tokenizer's handling of '\n' somewhat complicated, though. Also, I wrote the tokenizer and parser from scratch, so I'm not sure whether this is feasible in lex and yacc.
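To make the transformation above concrete, here is a minimal line-based sketch in Python. It is not the tokenizer from my language: the function name `tokenize`, the token strings "indent"/"dedent", the regex, and the naive comment stripping are all assumptions made for illustration. It also only counts leading spaces, so tabs would need their own rule.

```python
import re

INDENT, DEDENT = "indent", "dedent"  # hypothetical token names


def tokenize(source):
    """Return a flat token list, with INDENT/DEDENT marking block boundaries."""
    levels = [0]  # stack of indentation widths that are currently open
    tokens = []
    for line in source.splitlines():
        # Naive comment stripping: this would break on a '#' inside a string.
        body = line.split("#", 1)[0].rstrip()
        if not body.strip():
            continue  # blank or comment-only lines never change the level
        width = len(body) - len(body.lstrip(" "))
        if width > levels[-1]:
            # Deeper than the current block: open exactly one new block.
            levels.append(width)
            tokens.append(INDENT)
        while width < levels[-1]:
            # Shallower: close blocks until we are back at a known width.
            levels.pop()
            tokens.append(DEDENT)
        if width != levels[-1]:
            raise SyntaxError("dedent does not match any outer indentation")
        # Quoted strings, words/numbers, or single punctuation characters.
        tokens.extend(re.findall(r"'[^']*'|\w+|[^\s\w]", body))
    # Close every block that is still open at end of input.
    tokens.extend(DEDENT for _ in levels[1:])
    return tokens
```

Calling `tokenize` on the three-line `while 1:` example gives the token list shown above. The key design choice is the stack of widths: a single counter cannot tell a dedent to a valid outer level from a dedent to a column that was never opened.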

Edit:

Semi-working pseudocode example:

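level = 0        # indentation width of the current block
levels = []      # stack of enclosing indentation widths
while (c = getc()) != EOF:
    if c == '\n':
        emit('\n')
        n = 0
        while (c = getc()) == ' ':
            n += 1
        if n > level:
            emit(indent)
            push(levels, level)   # remember the enclosing level
            level = n
        while n < level:
            emit(dedent)
            level = pop(levels)
            if level < n:
                error tokenize    # dedent to a width that was never opened
        # fall through
    emit(c) # lazy example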
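On the consuming side, indent and dedent can be treated like '{' and '}'. The following Python sketch shows one way a parser and a tiny evaluator could handle that, including a multilevel break. Everything here is hypothetical and simplified: `parse_block`, `run`, and the `Break` exception are names I made up for the example, the `while` condition is a single token that is ignored (always true), and any other token is treated as a one-token "say" statement.

```python
class Break(Exception):
    """Carries how many enclosing loops still have to be exited."""

    def __init__(self, levels):
        super().__init__(levels)
        self.levels = levels


def parse_block(tokens, pos=0):
    """Parse statements until a 'dedent' or end of input.

    Returns (block, pos), where pos is the index of the stopping token.
    """
    block = []
    while pos < len(tokens) and tokens[pos] != "dedent":
        if tokens[pos] == "while":
            # Expected shape: while <cond> : indent <body...> dedent
            if tokens[pos + 2:pos + 4] != [":", "indent"]:
                raise SyntaxError("expected ': indent' after while condition")
            body, pos = parse_block(tokens, pos + 4)
            block.append(("while", body))
            pos += 1  # skip the dedent that closed the body
        elif tokens[pos] == "break":
            block.append(("break", int(tokens[pos + 1])))
            pos += 2
        else:
            block.append(("say", tokens[pos]))
            pos += 1
    return block, pos


def run(block, out):
    """Execute a parsed block, appending 'say' arguments to out."""
    for kind, arg in block:
        if kind == "say":
            out.append(arg)
        elif kind == "break":
            raise Break(arg)
        else:  # "while": the condition is always true in this sketch
            try:
                while True:
                    run(arg, out)
            except Break as b:
                if b.levels > 1:
                    # Not done yet: keep unwinding through the outer loop.
                    raise Break(b.levels - 1)
```

Each loop consumes one level of the break and re-raises the rest, so `break 2` inside two nested loops resumes at the first statement after the outer loop. No `end` keyword is needed because the dedent token already closes the body.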
David X
Thanks for the answer. So basically every scope will need an "end" of some sort (while ... end, if ... endif, func() ... end, etc.) so that I can return a dedent token from the tokenizer. I was trying to avoid that, at least for functions, but I guess I will have to go that way. That will also mean my language won't support bare scopes (without whiles, ifs, or fors) like C does. Or am I not understanding this correctly?
sap
You have to have `end` keywords if you have the tokenizer keep track of indent level (see example). `indent` and `dedent` are used just like `'{'` and `'}'` are for C.
David X
Grrrr, rather, you *don't* have to have `end` if you do that.
David X