What JavaCC grammar can parse lines like these:

[b]content[/b]
content[/b]
[b]content

The JavaCC parser needs to parse all of these lines, but it must distinguish between correct and incorrect tagging.

Correct tags look like the 1st line: they have both an open and a close tag. When the tags are matched, the output is bold formatted text.

Incorrect tags look like lines 2 and 3: they have no matching open or close tag. When these occur, they are written to the output as-is and are not interpreted as tags.

I have tried the JavaCC code below (LOOKAHEAD = 999999). The problem is that this grammar always matches everything as invalidTag() instead of bold(). How can I make sure that the JavaCC parser matches bold() whenever possible?

String parse() :
{}
{
    body() <EOF>
    { return buffer; }
}

void body() :
{}
{
    (content())*
}

void content() :
{}
{ 
    (text()|bold()|invalidTag())
}

void bold() :
{}
{
    { buffer += "<b>";  }
    <BOLDSTART>(content())*<BOLDEND>
    { buffer += "</b>"; }
}

void invalidTag() :
{
}
{
    (<BOLDSTART> | <BOLDEND>)
    { // todo: just output token
    }
}

TOKEN :
{
    <TEXT : (<LETTER>|<DIGIT>|<PUNCT>|<OTHER>)+ >
    |<BOLDSTART : "[b]" >
    |<BOLDEND : "[/b]" >

    |<LETTER : ["a"-"z","A"-"Z"] >
    |<DIGIT : ["0"-"9"] >
    |<PUNCT : [".", ":", ",", ";", "\t", "!", "?", " "] >
    |<OTHER : ["*", "'", "$", "|", "+", "(", ")", "{", "}", "/", "%", "_", "-", "\"", "#", "<", ">", "=", "&", "\\"]     >
}
+2  A: 

Your grammar is ambiguous. This is probably not your fault, since it is likely to be very difficult to write an unambiguous grammar for the problem you are trying to solve.

An LL(k) parser is probably not the best tool for this job.

However, the tokenizer may still be useful, and using a stack to find matched and unmatched tags may be a suitable alternative.
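
A minimal sketch of that idea in plain Java, assuming a `java.util.regex` pattern stands in for the JavaCC tokenizer. The class name `BoldTagStack` and the methods `tokenize` and `render` are made up for illustration and are not part of JavaCC. Open tags are pushed on a stack; a close tag pops its partner, and anything never paired is written out unchanged.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoldTagStack {
    // Either a tag, or a run of characters up to the next tag.
    private static final Pattern TOKEN = Pattern.compile(
        "\\[b\\]|\\[/b\\]|(?:(?!\\[b\\]|\\[/b\\]).)+", Pattern.DOTALL);

    // Hypothetical helper: splits the input into tag and text tokens.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static String render(String input) {
        List<String> tokens = tokenize(input);
        boolean[] matched = new boolean[tokens.size()];
        Deque<Integer> open = new ArrayDeque<>(); // indices of unclosed [b] tokens

        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("[b]")) {
                open.push(i);
            } else if (t.equals("[/b]") && !open.isEmpty()) {
                // Pair this close tag with the most recent unclosed open tag.
                matched[open.pop()] = true;
                matched[i] = true;
            }
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (matched[i]) {
                out.append(t.equals("[b]") ? "<b>" : "</b>");
            } else {
                out.append(t); // text and unmatched tags pass through as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("[b]content[/b]")); // <b>content</b>
        System.out.println(render("content[/b]"));    // content[/b]
        System.out.println(render("[b]content"));     // [b]content
    }
}
```

The first pass only decides which tags have a partner; the second pass produces the output, so an unclosed [b] is never emitted as a tag by mistake.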

jamesh
I reached the same conclusion, but I find JavaCC a bit of overkill for just a tokenizer. Do you know of a 100% Java tokenizer? (So I don’t need extra build tools?)
Kdeveloper
+1  A: 

Some time ago I learnt that some trivial problems can easily be solved at the semantic or lexical level, while proving very difficult or impossible at the syntactic level.

Note: I'm not too familiar with JavaCC, but I've worked with multiple compiler generators in the past (my favorite being SableCC).

You could probably just define your "content" as something like this:

(text()|boldstart()|boldend()|invalidTag())

Here boldstart() would just blindly output a start tag, and boldend() an end tag.

If, however, you want to filter all that and only produce correctly closed tags, then I'd suggest writing some sort of stateful automaton for it: feed it the opening and ending tags, track whether (say) bold should start, stop or continue (possibly including the nesting depth), and depending on that output a start tag, a stop tag or no tag. This would be really easy to implement compared with using the syntactic or lexical tools you have in JavaCC.
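
One possible shape for such an automaton, sketched in Java under the assumption that the input has already been split into tag and text tokens. The class `BoldTagCounter` and its `render` method are hypothetical names, and the two-pass counter is one way to realize the idea rather than a JavaCC feature. A forward pass tracks nesting depth to spot close tags without an opener; a backward pass counts pending close tags to spot open tags that are never closed.

```java
import java.util.Arrays;
import java.util.List;

public class BoldTagCounter {
    public static String render(List<String> tokens) {
        // literal[i] == true means token i is a tag that must be output as-is.
        boolean[] literal = new boolean[tokens.size()];

        // Forward pass: a close tag seen at depth 0 has no opener.
        int depth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("[b]")) {
                depth++;
            } else if (t.equals("[/b]")) {
                if (depth == 0) {
                    literal[i] = true;
                } else {
                    depth--;
                }
            }
        }

        // Backward pass: an open tag with no pending valid close is unmatched.
        int pendingCloses = 0;
        for (int i = tokens.size() - 1; i >= 0; i--) {
            String t = tokens.get(i);
            if (t.equals("[/b]") && !literal[i]) {
                pendingCloses++;
            } else if (t.equals("[b]")) {
                if (pendingCloses == 0) {
                    literal[i] = true;
                } else {
                    pendingCloses--;
                }
            }
        }

        // Output: valid tags become HTML, everything else is copied through.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (!literal[i] && t.equals("[b]")) {
                out.append("<b>");
            } else if (!literal[i] && t.equals("[/b]")) {
                out.append("</b>");
            } else {
                out.append(t);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList("[b]", "content", "[/b]"))); // <b>content</b>
        System.out.println(render(Arrays.asList("content", "[/b]")));        // content[/b]
        System.out.println(render(Arrays.asList("[b]", "content")));         // [b]content
    }
}
```

The state here is just two integers, so no grammar-level lookahead is needed; the cost is that the whole token list must be available before any output is written.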

inkredibl
Thanks, do you have a link to examples of this? Or do you know of a Java tokenizer tool?
Kdeveloper