lexing

Expression parsing: how to tokenize

I'm looking to tokenize Java/JavaScript-like expressions in JavaScript code. My input will be a string containing the expression, and the output needs to be an array of tokens. What's the best practice for doing something like this? Do I need to iterate over the string, or is there a regular expression that will do this for me? I need this t...
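
One common approach is a single "master" regular expression built from named alternatives, applied repeatedly at the current position. The sketch below is in Python rather than JavaScript, and the token names and patterns (NUMBER, IDENT, STRING, OP, WS) are assumptions chosen for illustration, not a complete Java or JavaScript lexical grammar. The same loop can be written in JavaScript with a sticky (`y`) regular expression.

```python
import re

# Hypothetical token set for a small Java/JavaScript-like expression language.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("STRING", r'"(?:\\.|[^"\\])*"'),
    ("OP", r"===|!==|==|!=|<=|>=|&&|\|\||[-+*/%<>=!.,?:()\[\]]"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)  # anchored at pos, never skips input
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Matching strictly at the current offset means an unrecognized character is reported as an error instead of being silently skipped, which a plain "find all matches" call would do.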

Python 3.0 - tokenize and untokenize

I am using something similar to the following simplified script to parse snippets of Python from a larger file:
    import io
    import tokenize
    src = 'foo="bar"'
    src = bytes(src.encode())
    src = io.BytesIO(src)
    src = list(tokenize.tokenize(src.readline))
    for tok in src:
        print(tok)
    src = tokenize.untokenize(src)
Although the code is not...
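
For reference, here is a minimal sketch of the tokenize/untokenize round trip the excerpt describes, using only the standard library. The helper name `roundtrip` is hypothetical, and because the excerpt is cut off, this does not try to address whatever specific problem the asker ran into.

```python
import io
import tokenize


def roundtrip(source: str) -> str:
    """Tokenize source text and rebuild it from the full token tuples."""
    readline = io.BytesIO(source.encode("utf-8")).readline
    tokens = list(tokenize.tokenize(readline))
    # The first token produced by tokenize.tokenize is ENCODING, so
    # untokenize returns bytes encoded with that encoding.
    rebuilt = tokenize.untokenize(tokens)
    return rebuilt.decode("utf-8")
```

Passing the complete token tuples (including positions) lets untokenize restore the original spacing; passing only (type, string) pairs switches it to a compatibility mode where spacing may differ.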

Generate C++ code for BNF grammar

I have looked at the following software tools: Ragel, ANTLR, BNF Converter, Boost::Spirit, Coco/R, and YACC. ANTLR seems the most straightforward; however, its documentation is lacking. Ragel looks possible too, but I do not see an easy way to convert BNF into its syntax. What other tools are available that can take BNF input and generate a ...

Comment lexer rule

I'm new to ANTLR and I've come up with this lexer rule to parse out comments. Will it work? COMMENT_LINE : (COMMENT (. - LINE_ENDING)* LINE_ENDING){$channel=hidden}; (I couldn't find anything regarding syntax such as this in the docs.) ...

ANTLRWorks error compiling grammar: "syntax error: invalid char literal: <INVALID>"

I wrote a stub for a grammar (it only matches comments so far), and it's giving me the error "syntax error: invalid char literal: <INVALID>". I've tracked the error down to the following fragment: ... ~LINE_ENDING* ... LINE_ENDING : ( '\n' | '\r' | '\r\n'); Can someone help me fix this? ...

Character Consumption Question

If I have a subrule like the following: .. (~']' ~']')* ... will it only match an even number of characters? ...
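
The pairing behaviour can be tried out with an analogous regular expression. The sketch below uses Python's `re` module, not ANTLR, so it only illustrates the idea; ANTLR's lookahead and prediction rules may behave differently in a full grammar. The names `PAIRS` and `consumed` are hypothetical.

```python
import re

# Analogue of (~']' ~']')* : zero or more PAIRS of characters that are not ']'.
PAIRS = re.compile(r"(?:[^\]][^\]])*")


def consumed(text: str) -> int:
    """Return how many leading characters the paired loop consumes."""
    return PAIRS.match(text).end()
```

With this pattern, `consumed("abcd]")` and `consumed("abcde]")` both stop after four characters: each iteration must take two non-`]` characters, so an odd trailing character is left unconsumed.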

How do I add parens to this rule?

I have a left-recursive rule like the following: EXPRESSION : EXPRESSION BINARYOP EXPRESSION | UNARYOP EXPRESSION | NUMBER; I need to add parens to it, but I'm not sure how to make a left paren depend on a matching right paren while keeping both optional. Can someone show me how? (Or am I trying to do entirely too much in lexing, and shoul...

How to exclude more than one character in rule?

I'm trying to write a string-matching rule in ANTLRWorks, and I need to match either escaped quotes or any non-quote character. I can match escaped quotes, but I'm having trouble with the other part: ~'\'' | ~'\"' will end up matching everything, and ~'\'\"' seems to be ignored by the grammar generator (at least in the visual display). What s...

How to get ANTLR to output hierarchical ASTs?

I have a Lua grammar (with minor modifications to get it to output for C#, just namespace directives and a couple of option changes), and when I run it on some sample input, it gives me back a tree with a root "nil" node whose children look like a tokenized version of the input code. It looks like ANTLR's tree grammars operate on hierar...

Java Scanner with empty delimiter

I'd like to parse some text using a hand-written recursive descent parser. I used Scanner with the following delimiter: "\\s*". Unfortunately, the fact that this pattern matches the empty string seems to make every hasNextFoo and nextFoo call stop matching anything. The docs don't say anything about possibly empty delimiters. ...

How to evaluate a matched number later in a regex? - Lexing FORTRAN 'H' edit descriptor with Ply

I am using Ply to interpret a FORTRAN format string. I am having trouble writing a regex to match the 'H' edit descriptor, which is of the form xHccccc ... where x specifies the number of characters to read in after the 'H'. Ply matches tokens with a single regular expression, but I am having trouble using a regular expression to perform ...
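
A single regular expression cannot use a number it has just matched as a repeat count, so one workaround is to match only the count and the 'H' with a regex and then slice the following characters by hand. The sketch below is plain Python with the standard `re` module and does not use Ply; the function name `read_hollerith` and its return shape are assumptions for illustration. Something similar could presumably be done inside a Ply token function by advancing the lexer position, but that is not shown here.

```python
import re

# Matches only the prefix "<count>H"; the payload length depends on <count>,
# so it is read by slicing rather than by the regular expression.
_COUNT_H = re.compile(r"(\d+)[Hh]")


def read_hollerith(text, pos=0):
    """Read an xHccccc descriptor starting at pos.

    Returns (payload, next_pos), or None if no descriptor starts at pos.
    """
    match = _COUNT_H.match(text, pos)
    if match is None:
        return None
    count = int(match.group(1))
    start = match.end()
    end = start + count
    if end > len(text):
        raise ValueError(f"descriptor at offset {pos} needs {count} characters")
    return text[start:end], end
```

For the input "5HHELLO,I3" this reads the five characters "HELLO" and reports offset 7 as the place to resume lexing.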

How to write a text transformer?

Suppose I have a text that I can easily parse. It consists of plain text and special identifiers. After parsing, I get a list of tokens that correspond to the text and special identifiers. My problem is how to transform this token list into some other form. I can't understand how to approach this problem. I tri...

Best way to implement a meta language compiling down to PHP.

I've been working on the specification / kitchen sink for a meta language that can compile down to PHP for some time now. Now I want to begin building the thing. Previously I have implemented tiny DSLs using PHP_LexerGenerator and PHP_ParserGenerator, and they have worked very well, but I have never built anything at this scale before. I would ap...

Recognize Identifiers in Chinese characters by using Lex/Yacc

How can I use Lex/Yacc to recognize identifiers written in Chinese characters? Thanks for any help. ...

Return multiple tokens in ocamllex

Is there any way to return multiple tokens in ocamllex? I'm trying to write a lexer and parser for an indentation-based language, and I would like my lexer to return multiple DEDENT tokens when it notices that the indentation level is less than it previously was. This will allow it to notify the parser when multiple blocks have ended...
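
A common way around a one-token-per-call lexer interface is to wrap the lexer in a small buffer: the wrapper keeps a stack of indentation widths and a queue of pending tokens, and hands them out one at a time. The sketch below shows the idea in Python rather than OCaml; the class name `IndentLexer`, the token names, and the line-at-a-time input are all assumptions for illustration (tabs and comments are ignored).

```python
from collections import deque


class IndentLexer:
    """Yields one token per call, buffering extra INDENT/DEDENT tokens."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._stack = [0]        # indentation widths of the open blocks
        self._pending = deque()  # tokens produced but not yet returned
        self._done = False

    def next_token(self):
        while not self._pending:
            if self._done:
                return ("EOF", None)
            self._fill()
        return self._pending.popleft()

    def _fill(self):
        line = next(self._lines, None)
        if line is None:
            # End of input closes every block that is still open.
            while len(self._stack) > 1:
                self._stack.pop()
                self._pending.append(("DEDENT", None))
            self._done = True
            return
        body = line.lstrip(" ")
        if not body:
            return  # blank lines do not affect indentation
        width = len(line) - len(body)
        if width > self._stack[-1]:
            self._stack.append(width)
            self._pending.append(("INDENT", None))
        while width < self._stack[-1]:
            # One DEDENT per closed block, possibly several for one line.
            self._stack.pop()
            self._pending.append(("DEDENT", None))
        if width != self._stack[-1]:
            raise IndentationError(f"inconsistent dedent to column {width}")
        self._pending.append(("LINE", body))
```

The parser still sees a plain "give me the next token" interface; only the wrapper knows that a single input line can expand to several tokens.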

How to use JRuby's org.jruby.lexer.yacc.RubyYaccLexer

I'm using ripper to lex Ruby code in mri-1.9. and would like to do the same thing in JRuby. I noticed that org.jruby.lexer.yacc.RubyYaccLexer is used in org.jruby.parser.DefaultRubyParser, and I'm thinking that I can use it to do what ripper does in mri-1.9., though definitely at a lower level than ripper. Being a noo...