I'm writing an HTML parser for my own amusement and wanted to try out M.

I'm basing this work on the HTML 4.01 standard, which says:

Although the STYLE and SCRIPT elements use CDATA for their data model, for these elements, CDATA must be handled differently by user agents. Markup and entities must be treated as raw text and passed to the application as is. The first occurrence of the character sequence "</" (end-tag open delimiter) is treated as terminating the end of the element's content. In valid documents, this would be the end tag for the element.

I thought about it for a while, and what I really want to do is something like this:

syntax Main 
    = "<script>" Script "</script>"
    ;
token Script
    = TakeWhileNot("</") // this is not valid M grammar
    ;

I find myself wanting some kind of tokenization rule that matches until it reaches an open angle bracket < followed by a forward slash /.

If the terminating sequence were a single character, this would not be a problem, because then I could have written this:

token Script
    = ScriptEscape+
    ;
token ScriptEscape
    = !"<"
    ;

And that would work. I'm not sure if I'm going about this the right way. The problem is that one language is embedded in another, but I don't care about the script language in this case, so I simply want to skip ahead.

A: 

I figured out this neat trick, which wasn't entirely obvious:

syntax Main 
    = "<script>" Script* "</script>"
    ;
token Script
    = !('<')
    | '<' !('/')
    ;

Now that's valid MGrammar, which translates into:

  • Take any character that is NOT '<', OR take a '<' that is NOT followed by '/'

This consumes everything up to the first </ sequence without consuming the </ itself.
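
For comparison, the same "anything except `<`, or a `<` not followed by `/`" idea can be sketched outside of M. This is only an illustration in Python, not MGrammar: the names `SCRIPT_BODY` and `take_script_body` are made up for this example, and it assumes the check on the character after `<` should be a lookahead that consumes nothing.

```python
import re

# Either any character other than '<', or a '<' that is not followed by '/'.
# The negative lookahead (?!/) inspects the next character without consuming it.
SCRIPT_BODY = re.compile(r"(?:[^<]|<(?!/))*")


def take_script_body(text: str, start: int = 0) -> str:
    """Return the raw text from `start` up to (not including) the first "</"."""
    # The pattern can match the empty string, so match() never returns None.
    return SCRIPT_BODY.match(text, start).group(0)


source = "<script>if (a < b) { run(); }</script>"
body = take_script_body(source, len("<script>"))
print(body)  # if (a < b) { run(); }
```

The `<` inside `a < b` is kept because it is followed by a space, while scanning stops in front of `</script>`, which is left for the enclosing rule to match.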

John Leidegren