ansaurus

Question

How to efficiently implement an immutable graph of heterogenous immutable objects in C++?

Answer 1

+3 A:

My C++ is rusty as well, so I probably don't know the best solution for this. But since nobody else stepped forward...

You are right in that allocating all nodes in one block would give you the best locality. However, if you dynamically allocate the graph at program start, chances are that your heap allocations will also cluster together closely.

To allocate all nodes in a single memory block, two possibilities come to my mind: create and populate a Vector<> at startup (with the drawback that now you have the graph information twice in memory), or use a static array initializer "Node[] graph = { ... };" .

For either approach, the biggest obstacle is that you want to create your graph of heterogenous objects. One obvious solution is "Don't": you could make your node a superset of all possible fields, and distinguishing the types with an explicit 'type' member.

If you want to keep the various node classes, you will have to use multiple arrays/vectors: one for each type.

Either way, the connections between the nodes will have to be initially defined in terms of array indices (Node[3] is followed by Node[10]). For better parsing performance, you could create direct object pointers at program startup based on these indices, of course.

I would not put literal strings into any node ('word' in your case): the recognition of keywords, identifiers and other lexical elements should be done in a lexer module separate from the parser. I think it would also help if you distinguish in terminalogy between the tokens generated by the Lexer based on the program's input, and the grammar graph nodes your program uses to parse the input.

I hope this helps.

Lars 2010-10-28 19:57:31

+1, especially for the second paragraph.

Gunslinger47 2010-10-28 21:55:14

Thank you very much for so much valuable information (it's always an insight when someone else shares their experience!). I am aware that the validation of tokens is in fact lexers domain, but one of the points of my exercise is to create a compiler so compact that in fact the entire job sans of perhaps generating target code can be done in one step :-) A sort of 'fail fast' principle implementation. I.e. I want to avoid to rerun a lexer on a generated token sequence. You're right on money about not being able storing my nodes in a vector - that's sort of the issue. I'll sleep on this :-)

amn 2010-10-29 11:50:18

@amn - I think you are not going to achieve your goal. My suspicion is that you lack sufficient traditional compiler background to be able to think clearly about how you can achieve building such a "compact compiler". (In your question, you appear to focus on efficiency, which usually conflicts with compact, so now I'm not even sure that your goal is consistent). What I'd recommend is that you Run (don't walk) to buy any of the classic compiler books (e.g., Aho/Sethi/Ullman "Dragon" book), build a classic compiler first, and then revisit your vision.

Ira Baxter 2010-10-29 14:12:58

Personally I'm fond of Wirth's "Compiler Construction" and Fraser/Hanson's "lcc - A Retargetable Compiler". Both are almost devoid of theory, but their hands-on approach gives a good background to then move on to books like the Dragon book.

Lars 2010-10-29 15:36:39

At the university where I currently study, we have a 4 month course which teaches people unfamiliar with anything else except pretty basic Java programming, to successfully develop an x86 program compiler for a subset of C language, from scratch (in Java). It is expected (the course has been taught since 80's, albeit for Fortran/Cobol/Simula students, not Java) that more than half of participants accomplish the task during the course.

amn 2010-10-29 19:56:19

Answer 2

+2 A:

I don't see how you will define a "graph" of tokens that defines the syntax of any practical programming language, especially if the relation betweens tokens is "allowed-to-follow".

The usual way to represent the grammar of programming language is using Backus-Naur Form (BNF) or Extended versions of this termed "EBNF".

If you wanted to represent an EBNF ("as an immutable graph"), this SO answer discusses how to do that in C#. The ideas have direct analogs in C++.

The bad news is that most parsing engines can't use the EBNF for directly because it is simply too inefficient in practice. It is hard to build an efficient parser using the direct representation of the grammar rules; this is why people invented parser generators. So the need to put these rules into a memory structure at all, let alone an "efficient" one, is unclear unless you intend to write a parser generator.

Finally, even if you do pack the grammar-information somehow optimally, it probably won't make an ounce of difference in actual performance. Most of a parser's time is spent in grouping characters in lexemes, sometime even to the point of just doing blank supression.

Ira Baxter 2010-10-28 22:46:20

I wondered about that as well. But the OP did mention that he was writing this parser out of curiosity, so handcrafting a parser instead of letting a generator do the work may be the goal itself; just like people write OS kernels from scratch.

Lars 2010-10-28 23:09:50

@Lars: If he wants to construct a parser by his own methods, that's fine. He won't get one if his model of the language grammar, however encoded, isn't sufficient to capture its context free properties, and his proposal IMHO is not.

Ira Baxter 2010-10-28 23:25:19

@Lars: ... with "follows" being the encoded relationship, what he has captured is essentially a finite state machine;

Ira Baxter 2010-10-29 02:28:32

Well it's like you said yourself - because EBNF or BNF for that matter are not really layouts that can be used efficiently at runtime by a parser, there needs to be an extra indirection so to speak. I have no problem with BNF though - I am already considering writing a BNF file for the language and have either the compiler read it in order to load the graph, or another program generate the parser, which I believe is how a professional might approach this. I am definitely flagging your answer as useful though, it gives me food for thought.

amn 2010-10-29 11:56:39

Answer 3

+1 A:

I don't think many small allocations of the tokens will be a bottleneck, if it does you can always choose a memory pool.

Onto the problem; since all tokens have similar data (having a pointer to the next, and perhaps some enum value for what token we're dealing with) you could put the similar data in one std::vector. This will be continuous data in memory, and very efficient to loop over.

While looping, you retrieve the kind of information you need. I bet the tokens themselves would ideally only contain "actions" (member-functions), such as: if previous and next tokens are numbers, and I'm a plus sign, we should add the numbers together.

So, the data is stored in one central place, the tokens are allocated (but might not contain much data themselves actually) and work onto the data at the central place. This is actually a data-oriented design.

The vector could look like:

struct TokenData
{
    token *previous, *current, *next;
    token_id id; // some enum?
    ... // more data that is similar
}

std::vector<TokenData> token_data;

class token
{
    std::vector<TokenData> *token_data;
    size_t index;

    TokenData &data()
    {
        return (*token_data)[index];
    }

    const TokenData &data() const
    {
        return (*token_data)[index];
    }
}

// class plus_sign: token
// if (data().previous->data().id == NUMBER && data().next->data().id == NUMBER)

for (size_t i = 0; i < token_data.size(); i++)
{
    token_data[i].current->do_work();
}

It's an idea.

Daevius 2010-10-28 23:01:38

But to be honest, if you want to do this all at compile time (with meta programming), I wouldn't pick C++. It's do-able, but you are really battling the compiler. I really like the way D handles meta programming, very intuitive, if you don't mind suboptimal tool chain, documentation and resources...the language itself is really great and powerful.

Daevius 2010-10-28 23:15:53

@Daevius - yes, you are right here, absolutely. But this is precisely why I want to write this in C++ (with as little battling as possible.) I guess OCaml or Haskell or Scala would make perfect languages for writing a compiler, but I really want to crack this nut :-) Thanks for heads up on D though!

amn 2010-10-29 11:52:53

ansaurus

tags:

views:

answers:

How to efficiently implement an immutable graph of heterogenous immutable objects in C++?

related questions