views:

529

answers:

3

Hi guys, I'm very new to the concept of writing an assembler and even after reading a great deal of material, I'm still having difficulties wrapping my head around a couple of concepts.

1) What is the process to actually break up a source file into tokens? I believe this process is called lexing and I've searched high and low for a real code examples that make sense but I can't find a thing so simple code examples very welcome ;)

2) When parsing, does information ever need to be passed up or down the tree, the reason I ask is as follows, take:

LD BC, nn

It needs to be turned into the following parse tree once tokenized???

  ___ LD ___
  |        |
 BC        nn

Now, when this tree is traversed it needs to produce the following machine code:

01 n n

If the instruction had been:

LD DE,nn

Then the output would need to be:

11 n n

Meaning that it raises the question, does the LD node return something different based on the operand or is it the operand that returns something? and how is this achieved? More simple code examples would be excellent if time permits.

I'm most interested in learning some of the raw processes here rather than looking at advanced existing tools so please bare that in mind before sending me to Yacc or Flex.

Best regards, GP

+2  A: 

It is not necessary to build a parse tree. Z80 op codes are very simple. They consist of the op code amd 0, 1 or 2 operands, separated by commas. You Just need to split the opcode up into the (maximum of 3) components with a very simple parser - no tree is needed.

anon
What about relative memory addresses> surely some kind of symbolic tree is needed for this? perhaps even a table
Darknight
The questioner was asking about parsing the OP codes. You certainly need a symbol table to assist in creating the assembler's actual machine code output, but that is a completely different issue.
anon
@Darknight: Would there still not be a valid point to creating a parse tree. For example, other occurrences could play a part such as here are some ASM code strings that have variations in the syntax:Labels, Comments, Brackets, Global Variables, .DB, .DS, .DWWould these not also become more simplified if a parse tree was used?@Neil:Could you elaborate?Regards,GP
Gary Paluk
@Gary You could do that if you want to complicate your life. I've written a couple of assemblers (though not one for the Z80) and they are normally easily implemented via a simple tokenising scheme (as I suggested in my answer) and an expression evaluator to evaluate address and other expressions. The expression evlauator may use a tree (though a stack based one is enough in my experience) but there is no need to make the parse tree your basic data structure.
anon
@Neil Thanks for your feedback... You don't happen to have a code example do you? Would probably clarify alot! :)
Gary Paluk
@Gary Sorry, the last assembler I wrote was in REXX on IBM's VM/CMS mainframe OS. Even if I had the code, I don't think it would be useful to you.
anon
+4  A: 

Well, the structure of the tree you really want for an instruction that operates on a register and an memory addressing mode involing an offset displacement and an index register would look like this:

    INSTRUCTION-----+
    |      |        |
  OPCODE  REG     OPERAND
                  |     |
                OFFSET  INDEXREG

And yes, you want want to pass values up and down the tree. A method for formally specifying such value passing is called "attribute grammars", and you decorate the grammar for your langauge (in your case, your assembler syntax) with the value-passing and the computations over those values. For more background, see Wikipedia on attribute grammars.

In a related question you asked, I discussed a tool, DMS, which handles expression grammars and building trees. As language manipulation tool, DMS faces exactly these same up-and-down the tree information flows issues. It shouldn't surprise you, that as a high-end language manipulation tool, it can handle attribute grammar computations directly.

Ira Baxter
This was intended to be illustrative of the trees he needs to build, rather than matching a specific Z80 instrution.I coded a LOT of Z80 assembler back in the 1980s; I may know more about it than you think.
Ira Baxter
Then demonstrate your knowledge. Have you written a Z80 assembler? I Haven't, but I have written 8080 and 6809 assemblers, and your advice seems to me to be dead wrong.
anon
Your experience is probably different than mine. I've coded assemblers for the Collins 8311 (circa 1968), a 12 and 16 bit machines that didn't make it commercially so you havent heard of them, microcode assemblers, and the 6809 assembler you saw in the other message. I've also built a variety of compiler front ends(check out my bio information and website). My early assemblers were built by ad hoc means and worked. Stuff since the 80s I've built using lexers and parsers because it is far easier to specify, to maintain, to extend, you name it.
Ira Baxter
My experience may be qualitatively different than yours along another axis. I've always wanted tools to help me specify/analyze/manipulate languages. I've spent the last 25 years figuring how to go far beyond lex and yacc, and now have tools that let me (and others if they wish) do so. I use them daily, and they are effective. Without experiencing such tools, you might not be convinced. OK, let many flowers bloom.
Ira Baxter
Hmm. I'm totally in favour of tools, but some tools are fairly crappy. Along with Bjarne Stroustrup, I'm a big fan of recursive descent when it comes to parsing as opposed to table driven tools like yacc/bison. And when it comes to parsing assembler mnemonics, I'm an even bigger fan of ad hoc parsing. Assembler mnemonics are really not that regular. But you obviously do know what you are talking about, so I've deleted the initial comment (which was a bit rude) and removed the downvote.
anon
When recursive descent works, it works great. But it only works well on modest sized grammars that don't change fast (when you build DSLs sometimes the grammar does change fast) and which don't require any lookahead or any type information to resolve the parse (your beloved C++ has a bad lookahead problem here). When you aren't in that window, more powerful (read "full context free") parsing engines are better, and explicit specifications of the full grammar make it a lot easier to revise and maintain. And once you have such machinery, its easy to kill gnats, too, and you should.
Ira Baxter
anon
DMS has a full C++ parser (even handles GNU and MS dialects) and uses a table driven scheme (called GLR, a generalization of LR) based off exactly the same grammar style as you see in the related question to this one. That C++ parser has been used to carry out massive automated transformations on large embedded software systems. The parser is as beautiful as the C++ reference manual will allow it to be :-} I speak as a front end builder.
Ira Baxter
A: 

Actually, the opcodes have not a byte base but an octal base. The best description i know is here: http://www.z80.info/decoding.htm

Steeve