I've often fantasized about trying to build (yet another) high-level computer language. The goal would be to push the envelope on both speed of development and performance of the result. I would try to build libraries of minimal, highly optimized operations, and then develop the language rules in such a way that any statement or expression expressible in the language would result in optimal code... unless what was being expressed was just inherently suboptimal.
It would compile to byte code, which would be distributed, and then to machine code when installed, or whenever the processor environment changed. So when an executable loaded, a loader piece would check the processor against a few bytes of control data in the object; if the two matched, the executable part of the object could be loaded straight away, but if not, the byte code for that object would have to be recompiled and the executable part updated. (So it's not Just-In-Time compilation - it's On-Program-Install or On-CPU-Changed compilation.) The loader part would be very short and sweet, written in '386 code so it wouldn't itself need to be compiled. It would load the byte-code compiler only if it needed to, and if so, it would load a compiler object that was small and tight, and optimized for the detected architecture. Ideally, the loader and the compiler would stay resident once loaded, and there would only be one instance of each.
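Roughly like this, in C. The header layout and every name here are just made up for illustration; this is a sketch of the check, not a real object format:

```c
/* Sketch of the load-time check. The header fields and helper
   functions are hypothetical, invented for this example. */
#include <stdint.h>

struct obj_header {
    uint32_t cpu_tag;         /* CPU the native section was compiled for */
    uint32_t native_offset;   /* file offset of the machine-code section */
    uint32_t bytecode_offset; /* file offset of the portable byte code */
};

extern uint32_t detect_cpu(void);                     /* probes the running processor */
extern void load_native(const struct obj_header *);
extern void recompile_and_patch(struct obj_header *); /* invokes the resident compiler */

void load_object(struct obj_header *hdr)
{
    if (hdr->cpu_tag == detect_cpu()) {
        load_native(hdr);           /* tags match: run the cached machine code */
    } else {
        recompile_and_patch(hdr);   /* CPU changed: recompile from byte code */
        load_native(hdr);
    }
}
```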
Anyway, I wanted to respond to the idea that you have to have at least two passes - I don't think I quite agree. Yes, I would use a second pass through the compiled code, but not through the source code.
What you do is, when you come across a symbol, check your symbol hash table; if there's no entry there, create one, and store a 'forward reference' marker in your compiled code with a pointer to the table entry. As you come across the definitions for labels and symbols, update the symbol table (or add new entries).
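A bare-bones sketch of that, again in C. The table layout, the 0xFF marker byte, and the function names are all my own inventions; a real encoding would need to escape literal 0xFF bytes in the output:

```c
/* One-pass symbol handling: an unknown symbol gets a table entry plus a
   forward-reference marker in the output; a definition fills the entry in.
   Simplified: fixed-size table, no overflow handling. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct symbol {
    char     name[32];
    uint32_t value;     /* address/value once defined */
    int      defined;   /* 0 until the definition is seen */
};

#define TABLE_SIZE 1024
static struct symbol table[TABLE_SIZE];

static size_t hash(const char *s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Find or create the entry for a name (open addressing). */
struct symbol *lookup(const char *name)
{
    size_t i = hash(name);
    while (table[i].name[0] && strcmp(table[i].name, name) != 0)
        i = (i + 1) % TABLE_SIZE;
    if (!table[i].name[0])
        strncpy(table[i].name, name, sizeof table[i].name - 1);
    return &table[i];
}

/* On a use: emit a marker byte followed by a pointer to the entry. */
void emit_reference(uint8_t *out, size_t *pos, const char *name)
{
    struct symbol *sym = lookup(name);
    out[(*pos)++] = 0xFF;                  /* forward-reference marker */
    memcpy(out + *pos, &sym, sizeof sym);  /* pointer to the table entry */
    *pos += sizeof sym;
}

/* On a definition: fill in the entry; everything pointing at it resolves later. */
void define_symbol(const char *name, uint32_t value)
{
    struct symbol *sym = lookup(name);
    sym->value = value;
    sym->defined = 1;
}
```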
Individual compiled objects should never be so large that they take up very much memory, so all of the compiled code can safely be held in memory until the whole thing is ready to be written out. The way you keep your memory footprint small is simply by dealing with only one object at a time, and by never keeping more than one small buffer of source code in memory at a time. Maybe 64K or 128K, something like that. (Something large enough that the overhead involved in making the call to load the buffer from disk is small in comparison with the time it takes to read the data from disk, so that the streaming is optimized.)
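The streaming part is nothing fancy; a sketch, where the buffer size and the consume callback are just illustrative choices:

```c
/* Stream the source through one fixed buffer, sized so the per-call
   overhead of hitting the disk is amortized over a large read. */
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (64 * 1024)   /* 64K: large relative to call overhead */

void compile_stream(FILE *src, void (*consume)(const char *, size_t))
{
    char *buf = malloc(BUF_SIZE);
    size_t n;
    if (!buf) return;
    while ((n = fread(buf, 1, BUF_SIZE, src)) > 0)
        consume(buf, n);        /* the compiler eats one buffer at a time */
    free(buf);
}
```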
So: one pass through the source stream for an object, then you string your pieces together, collecting the necessary forward-reference info from the hash table as you go, and if the data is not there, that's a compile error. That's the process I would be tempted to try.
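That second pass over the compiled buffer (not the source) might look something like this, reusing the hypothetical marker format from the earlier sketch:

```c
/* Walk the compiled buffer, copying it out while replacing each
   forward-reference marker with the resolved value from the table.
   An entry still undefined at this point is a compile error. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct symbol { char name[32]; uint32_t value; int defined; };

int link_pieces(const uint8_t *in, size_t len, uint8_t *out, size_t *outlen)
{
    size_t i = 0, o = 0;
    while (i < len) {
        if (in[i] == 0xFF) {                   /* forward-reference marker */
            struct symbol *sym;
            memcpy(&sym, in + i + 1, sizeof sym);
            if (!sym->defined) {
                fprintf(stderr, "undefined symbol: %s\n", sym->name);
                return -1;                     /* the data wasn't there */
            }
            memcpy(out + o, &sym->value, sizeof sym->value);
            o += sizeof sym->value;
            i += 1 + sizeof sym;
        } else {
            out[o++] = in[i++];                /* ordinary code byte */
        }
    }
    *outlen = o;
    return 0;
}
```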