views:

129

answers:

3

Hi,

I'm sorry for the generic title of this question but I wish I was able to articulate it less generically. :-}

I'd like to write a piece of software (in this case, using C++) which translates a stream of input tokens into a stream of output tokens. There are just five input tokens (lets call them 0, 1, 2, 3, 4) and each of them can have a few different attributes (like, There might be an 4.x property or 0.foo). There are a few more output tokens, about ten, let's call them (Out0..Out9) each of them also has a few properties.

Now, we've been working on a mapping of sequences from input tokens to possible output tokens, like this:

01 -> Out0
34 -> Out1
0101 -> Out3

...so different sequences of input tokens map to a single output token.

In my scenario, the set of input tokens is fixed, but the set of output tokens is not - we might decide to introduce new 'productions'. My question is:

Does anybody know whether there are good patterns and/or idioms which help in such a situation?

Right now I have a set of 'Compressor' object, each of which can eat the input tokens and eventually produces the output tokens. The problem is that some input tokens clash, consider 'Out0' and 'Out3' in the above case. The input '0101' should yield Out3 but not Out0. However, the input '0104 should yield Out0 and then leave 0 and 4 in the queue.

I'm wondering whether there are maybe patterns from data compression or other areas which might be beneficial.

This work of 'reducing' an input of lowlevel tokens to highlevel tokens and dealing with possible conflicts is common among parser writers, no? Are there are useful patterns there?

UPDATE:

A bit more information:

  1. in my concrete case, the input tokens are C structs, and the output tokens are C++ objects. I have no control whatsoever over the input stream of tokens, but I can queue them and then modify the queue in case that is beneficial.

  2. I solved clashes (like Out3 (0101) vs. Out0 (01)) by trying to match Out3 first and then Out0, but it's a bit ugly. The possible productions are in a list and I simply try to apply them to the input stream, one after the other

  3. The list of possible productions can be extended by the user, so I cannot generate one huge DAG and then have a state machine which implements that to handle every possible transition. Of course, this means that the user can add clashes, but that's just the way it is.

+1  A: 

When designing a data compression algorithm, you need to take care that the beginning of one code can't be mistaken for another shorter code. This is the basis for Hamming code. The other alternative is to have a delimiter character separating your tokens, like in Morse code (which uses a short pause between characters).

Bill the Lizard
+1  A: 

Years ago I would have immediately said look at Bison http://www.gnu.org/software/bison/ or Yacc but I haven't done anything like this for some time so don't know if there is anything better.

Using them might be a bit over the top for what you are doing but the idioms used might be useful.

Dipstick
Yes, it feels a lot like parsing, I need something more generic though - something which can parse a set of input structs into a list of output objects.
Frerich Raabe
+1  A: 

You could define a graph, where each node contains an input token and an associated output. The links of each node describe the possible next tokens. Thus, a path in the graph describe a possible transformation rule.

To transform the data, start from the node corresponding to the first input token, and try to navigate the graph on the longest path possible, matching the next input token to the nodes linked to the current node. When no linked node matches the next input node, take the output associated with the current node as the result.

Cătălin Pitiș
Yes, this 'the longest match wins' idea is an alternative. So far, I manually sorted the order in which the rules are tried - which worked well enough for me (but maybe it doesn't scale well enough, I don't know).
Frerich Raabe
See the updated question for why I cannot define one big graph.
Frerich Raabe