views:

206

answers:

3

I am trying to put together a proof of concept of an XSS-safe string interpolation scheme.

Given a string with substitutions,

"Hello <b>$planetoid</b>!"

I want break it into literal portions and substitutions ("Hello<b>" planetoid "</b>!") and then run a state machine left to right over the literal portions. When I reach an interpolated value (planetoid in the above), I need to be able to get from the state to an appropriate escaping function.

Does anyone know of any examples of how to use lex/yacc/bison to derive a state machine and be able to associate labels in the grammar with output states? I want to derive a state machine that I can use both in javascript, and to try and replace PHP's underlying string implementation.

My reasons for doing this are described here.

cheers, mike

+3  A: 

In general, it is not possible to create a state machine for grammars that can be represented in BNF. State machines can only recognize regular languages and BNF can specify context-free languages. Yacc can create parsers. Would that be sufficient?

PeterAllenWebb
+1, more precise than me about chomsky hierarchy, I should study it again.
Jack
I want to do this purely at a lexical level. HTML and CSS have regular grammars. JS does not have a regular lexical grammar, but I can come up with a lexically regular subset if I can just transition to an error state when I see something too widgy.
Mike Samuel
I think if you're willing to cut corners just searching for $ followed by a valid variable name and then replacing would suffice in most cases. Also, crazy as it might seem, HTML is not a regular language, since its nesting level is potentially unlimited.
PeterAllenWebb
I did not mean to say is regular -- the lexical grammar is regular.I have no problem finding interpolations. What I want is a state machine that I can use to determine the appropriate escaping convention for a substitution at a point in the language.I want to create a combined HTML/CSS/JS grammar, assign names to certain states, derive a state machine, and use the assigned names to map states to escaping functions.When I run the state machine over "<" foo " href=" bar ">"I want to know that when I see foo, I'm in a EXPECT_TAGNAME state, and bar is in a EXPECT_ATTR_VALUE.
Mike Samuel
Thanks for your help.
Mike Samuel
Ah. I see your point now. But I think that you'll have trouble doing this with a state machine alone. You will probably have to parse the HTML, CSS, and JS separately. Fortunately there are already tools for that.
PeterAllenWebb
A: 

It looks like I can put markers in the grammar, so if I use two different production types, one with no side effect, and that consumes characters, and one that consumes no characters, but updates a state variable

ST_EXPECT_TAG_NAME : { state = TAG_NAME };
TAG_BODY
    : '<' ST_EXPECT_TAG_NAME TAG_NAME ATTRS SPACES '>' ST_OUT_OF_TAG
    ;

the compiled output associates state names in a switch statement

YY_REDUCE_PRINT (yyn);
switch (yyn)
  {
      case 118:
#line 74 "tmp/html-combo.y"
    { state = TAG_NAME ;}
    break;

There may be a way to extract the tables without parsing C, but I'm so ignorant of yacc/bison.

Mike Samuel
A: 

You can use yacc/bison for this. Initially looking at bison its hard to tell where you can implement a state machine. Rules in bison are resolved left-right-up. Ie; If you have a rule (called rule0) which derives rule1 rule2 rule3, the actions are called in this order: rule1,rule2,rule3,rule0. You can combine that using a global state machine, with dynamic return values for rules (I use a union with different types like a string, an int or even a container for return values).

Alasdair
I don't want to implement a state machine though. I want to specify a grammar and get a state machine out.
Mike Samuel