views: 114 · answers: 3
I am writing a programming language text parser, out of curiosity. Say I want to define an immutable (at runtime) graph of tokens as vertices/nodes. These are naturally of different types - some tokens are keywords, some are identifiers, and so on. However, they all share the common trait that each token in the graph points to another. This property lets the parser know what may follow a particular token - and so the graph defines the formal grammar of the language. My problem is that I stopped using C++ on a daily basis some years ago and have used a lot of higher-level languages since then, so my head is completely fragmented with regard to heap allocation, stack allocation and such. Alas, my C++ is rusty.

Still, I would like to climb the steep hill at once and set myself the goal of defining this graph in this imperative language in the most performant way. For instance, I want to avoid allocating each token object separately on the heap using 'new', because I think that if I allocate the entire graph of tokens back-to-back, so to speak (in a linear fashion, like elements in an array), this would benefit performance per the locality-of-reference principle - I mean, when the entire graph is compacted to take up minimal space along a 'line' in memory, rather than having all its token objects at random locations, that is a plus? Anyway, as you can see, this is a bit of an open question.

class token
{
};

class word : public token
{
    const char* chars;

public:
    word(const char* s) : chars(s)
    {
    }
};

class ident : public token
{
    /// haven't thought about these details yet
};

template<int N> class composite_token : public token
{
    token tokens[N]; // note: storing base-class objects by value slices away derived types
};

class graph
{
    token* p_root_token;
};

The immediate question is: what would be the procedure for creating this graph object? It's immutable, and its structure is known at compile time, which is why I can and want to avoid copying stuff by value and so on - it should be possible to compose this graph out of literals? I hope I am making sense here... (wouldn't be the first time I didn't.) The graph will be used by the parser at runtime, as part of a compiler. And since this is C++, I would be happy with a C solution as well. Thank you very much in advance.
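To illustrate what I mean by "composing out of literals", here is roughly what I imagine - an untested sketch, where `tok` and `grammar` are just placeholder names of mine:

struct tok
{
    const char* chars;
    int next; // index of a token that may follow this one, -1 for none
};

// the entire graph laid out back-to-back in one static array,
// with the "may-be-followed-by" links expressed as array indices
static const tok grammar[] =
{
    { "if",   1 },  // grammar[0] may be followed by grammar[1]
    { "(",    2 },
    { "expr", -1 },
};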

+3  A: 

My C++ is rusty as well, so I probably don't know the best solution for this. But since nobody else stepped forward...

You are right in that allocating all nodes in one block would give you the best locality. However, if you dynamically allocate the graph at program start, chances are that your heap allocations will also cluster together closely.

To allocate all nodes in a single memory block, two possibilities come to mind: create and populate a std::vector<> at startup (with the drawback that you then have the graph information twice in memory), or use a static array initializer: "Node graph[] = { ... };".

For either approach, the biggest obstacle is that you want to create your graph from heterogeneous objects. One obvious solution is "don't": you could make your node a superset of all possible fields and distinguish the types with an explicit 'type' member.

If you want to keep the various node classes, you will have to use multiple arrays/vectors: one for each type.

Either way, the connections between the nodes will initially have to be defined in terms of array indices (Node[3] is followed by Node[10]). For better parsing performance, you could of course create direct object pointers at program startup based on these indices, as sketched below.
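A minimal sketch of what I mean, combining the "superset node with a type member" idea with index-based connections (all names here are illustrative, and I haven't compiled this):

#include <cstddef>

enum NodeType { KEYWORD, IDENT };

// one node type that is a superset of all possible fields,
// with an explicit 'type' member to tell the kinds apart
struct Node
{
    NodeType    type;
    const char* chars;    // used by KEYWORD nodes, NULL otherwise
    int         next_idx; // connection expressed as an array index, -1 for none
    Node*       next;     // direct pointer, filled in at startup
};

// static array initializer: the whole graph in one memory block
static Node graph[] =
{
    { KEYWORD, "if",   1,  NULL }, // graph[0] is followed by graph[1]
    { IDENT,   NULL,   2,  NULL },
    { KEYWORD, "then", -1, NULL },
};

// at program startup, turn the indices into direct object pointers
void link_graph()
{
    for (std::size_t i = 0; i < sizeof graph / sizeof graph[0]; ++i)
        graph[i].next = (graph[i].next_idx >= 0) ? &graph[graph[i].next_idx] : NULL;
}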

I would not put literal strings into any node ('word' in your case): recognizing keywords, identifiers and other lexical elements should be done in a lexer module separate from the parser. I think it would also help to distinguish, in terminology, between the tokens the lexer generates from the program's input and the grammar graph nodes your program uses to parse that input.

I hope this helps.

Lars
+1, especially for the second paragraph.
Gunslinger47
Thank you very much for so much valuable information (it's always an insight when someone else shares their experience!). I am aware that validating tokens is in fact the lexer's domain, but one of the points of my exercise is to create a compiler so compact that the entire job, save perhaps for generating target code, can be done in one step :-) A sort of 'fail fast' principle implementation - i.e. I want to avoid rerunning a lexer on a generated token sequence. You're right on the money about not being able to store my nodes in a vector - that's sort of the issue. I'll sleep on this :-)
amn
@amn - I think you are not going to achieve your goal. My suspicion is that you lack sufficient traditional compiler background to be able to think clearly about how you can achieve building such a "compact compiler". (In your question, you appear to focus on efficiency, which usually conflicts with compact, so now I'm not even sure that your goal is consistent). What I'd recommend is that you Run (don't walk) to buy any of the classic compiler books (e.g., Aho/Sethi/Ullman "Dragon" book), build a classic compiler first, and then revisit your vision.
Ira Baxter
Personally I'm fond of Wirth's "Compiler Construction" and Fraser/Hanson's "lcc - A Retargetable Compiler". Both are almost devoid of theory, but their hands-on approach gives a good background to then move on to books like the Dragon book.
Lars
At the university where I currently study, we have a 4-month course that teaches people familiar with nothing beyond pretty basic Java programming to successfully develop, from scratch (in Java), an x86 compiler for a subset of the C language. It is expected that more than half of the participants accomplish the task during the course (the course has been taught since the '80s, albeit to Fortran/Cobol/Simula students rather than Java ones).
amn
+2  A: 

I don't see how you will define a "graph" of tokens that defines the syntax of any practical programming language, especially if the relation between tokens is "allowed-to-follow".

The usual way to represent the grammar of a programming language is Backus-Naur Form (BNF), or extended versions of it termed EBNF.

If you wanted to represent an EBNF as an immutable graph, this SO answer discusses how to do that in C#. The ideas have direct analogs in C++.
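For a taste of what the C++ analog might look like (a sketch only, assuming C++11 aggregate initialization; the linked answer has the worked-out C# version, and these names are mine):

#include <string>
#include <vector>

// grammar rules as an immutable graph of nodes, in the spirit
// of the linked C# answer; names are illustrative only
struct GrammarNode
{
    enum Kind { TERMINAL, NONTERMINAL, SEQUENCE, ALTERNATIVE };
    Kind kind;
    std::string symbol;                       // for TERMINAL / NONTERMINAL
    std::vector<const GrammarNode*> children; // for SEQUENCE / ALTERNATIVE
};

// expr ::= term "+" term   (a deliberately tiny fragment)
static const GrammarNode plus = { GrammarNode::TERMINAL,    "+",    {} };
static const GrammarNode term = { GrammarNode::NONTERMINAL, "term", {} };
static const GrammarNode expr = { GrammarNode::SEQUENCE,    "",     { &term, &plus, &term } };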

The bad news is that most parsing engines can't use the EBNF directly, because it is simply too inefficient in practice. It is hard to build an efficient parser from the direct representation of the grammar rules; this is why people invented parser generators. So the need to put these rules into a memory structure at all, let alone an "efficient" one, is unclear unless you intend to write a parser generator.

Finally, even if you do pack the grammar information somehow optimally, it probably won't make an ounce of difference in actual performance. Most of a parser's time is spent grouping characters into lexemes, sometimes even to the point of just doing blank suppression.

Ira Baxter
I wondered about that as well. But the OP did mention that he was writing this parser out of curiosity, so handcrafting a parser instead of letting a generator do the work may be the goal itself; just like people write OS kernels from scratch.
Lars
@Lars: If he wants to construct a parser by his own methods, that's fine. He won't get one if his model of the language grammar, however encoded, isn't sufficient to capture its context-free properties, and his proposal IMHO is not.
Ira Baxter
@Lars: ... with "follows" being the encoded relationship, what he has captured is essentially a finite state machine;
Ira Baxter
Well, it's like you said yourself - because EBNF, or BNF for that matter, is not really a layout that a parser can use efficiently at runtime, there needs to be an extra indirection, so to speak. I have no problem with BNF though - I am already considering writing a BNF file for the language and having either the compiler read it in order to load the graph, or another program generate the parser, which I believe is how a professional might approach this. I am definitely flagging your answer as useful though; it gives me food for thought.
amn
+1  A: 

I don't think many small allocations for the tokens will be a bottleneck; if they do become one, you can always switch to a memory pool, as sketched below.
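A minimal sketch of such a pool (a bump allocator, really; it ignores alignment and never recycles memory, just to show the idea):

#include <cstddef>
#include <new>

// carve all tokens out of one contiguous block
// instead of many separate heap allocations
class pool
{
    char*       buffer;
    std::size_t capacity;
    std::size_t used;

public:
    explicit pool(std::size_t bytes)
        : buffer(new char[bytes]), capacity(bytes), used(0) {}

    ~pool() { delete[] buffer; }

    void* allocate(std::size_t bytes)
    {
        if (used + bytes > capacity)
            throw std::bad_alloc();
        void* p = buffer + used;
        used += bytes;
        return p;
    }
};

// usage with placement new:
//   pool p(4096);
//   token* t = new (p.allocate(sizeof(token))) token;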

On to the problem: since all tokens hold similar data (a pointer to the next token, and perhaps some enum value for which kind of token we're dealing with), you could put that similar data in one std::vector. This gives contiguous data in memory, which is very efficient to loop over.

While looping, you retrieve the kind of information you need. I bet the tokens themselves would ideally contain only "actions" (member functions), such as: if the previous and next tokens are numbers, and I'm a plus sign, we should add the numbers together.

So the data is stored in one central place, while the tokens are allocated separately (but might not contain much data themselves) and operate on the data in that central place. This is in fact a data-oriented design.

The vector could look like:

#include <cstddef>
#include <vector>

class token; // forward declaration, needed by TokenData

struct TokenData
{
    token *previous, *current, *next;
    token_id id; // some enum?
    // ... more data that is similar
};

std::vector<TokenData> token_data;

class token
{
    std::vector<TokenData> *token_data;
    std::size_t index;

public:
    TokenData &data()
    {
        return (*token_data)[index];
    }

    const TokenData &data() const
    {
        return (*token_data)[index];
    }
};

// class plus_sign : public token
// if (data().previous->data().id == NUMBER && data().next->data().id == NUMBER)

for (std::size_t i = 0; i < token_data.size(); i++)
{
    token_data[i].current->do_work();
}

It's an idea.

Daevius
But to be honest, if you want to do this all at compile time (with metaprogramming), I wouldn't pick C++. It's doable, but you are really battling the compiler. I really like the way D handles metaprogramming - very intuitive. If you don't mind a suboptimal tool chain, documentation and resources... the language itself is really great and powerful.
Daevius
@Daevius - yes, you are right here, absolutely. But this is precisely why I want to write this in C++ (with as little battling as possible.) I guess OCaml or Haskell or Scala would make perfect languages for writing a compiler, but I really want to crack this nut :-) Thanks for the heads-up on D though!
amn