lexer

Is there a working C++ grammar file for ANTLR?

Are there any existing C++ grammar files for ANTLR? I'm looking to lex, not parse, some C++ source code files. I've looked on the ANTLR grammar page and it looks like there is one listed created by Sun Microsystems here. However, it seems to be a generated parser. Can anyone point me to a C++ ANTLR lexer or grammar file? ...

How to write a text transformer?

Suppose I have a text that I can easily parse. It consists of text and special identifiers. After parsing I get a list of tokens that correspond to text and special identifiers in the text. The problem I am having is how do I transform it from this token list into some other form? I can't understand how to approach this problem. I tri...

ANTLR grammar: parser- and lexer literals

What's the difference between this grammar: ... if_statement : 'if' condition 'then' statement 'else' statement 'end_if'; ... and this: ... if_statement : IF condition THEN statement ELSE statement END_IF; ... IF : 'if'; THEN: 'then'; ELSE: 'else'; END_IF: 'end_if'; .... ? If there is any difference, as this impacts on performa...

Lexer antlr3 token problem

Can I construct a token ENDPLUS: '+' ( options {greedy=false;} : . )* '+' ; that is considered by the lexer only if it is preceded by a token PRE, without including PRE in ENDPLUS? PRE: '<<' ; Thanks. ...

problem string recursion antlr lexer token

How do I build a lexer token that can handle recursion inside it, as in this string: ${*anything*${*anything*}*anything*} ? Thanks ...
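A plain regular expression cannot match arbitrarily nested delimiters, so one common approach is to scan the input by hand with a depth counter. The sketch below is in Python rather than ANTLR and only illustrates the idea; the function name `scan_nested` is hypothetical, and it assumes a token opens with `${` and closes with the matching `}`.

```python
def scan_nested(src, pos=0):
    """Return the balanced ${...} token starting at pos, or None if none starts there."""
    if not src.startswith("${", pos):
        return None
    depth = 0
    i = pos
    while i < len(src):
        if src.startswith("${", i):
            depth += 1          # entering a nested ${ ... }
            i += 2
        elif src[i] == "}":
            depth -= 1          # leaving one nesting level
            i += 1
            if depth == 0:
                return src[pos:i]
        else:
            i += 1              # ordinary character inside the token
    raise ValueError("unterminated ${...} token")

# scan_nested("${a${b}c} tail") returns "${a${b}c}"
```

In a generated lexer the same effect is usually obtained by letting the token rule refer to itself or by keeping a counter in lexer actions, depending on what the tool allows.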

Island grammar antlr3...

What is an "island grammar", and how do I use one in antlr3? ...

Ruby regex match specific string with special conditions

I'm currently trying to parse a document into tokens with the help of regex. Currently I'm trying to match the keywords in the document. For example I have the following document: Func test() Return blablaFuncblabla EndFunc The keywords that need to be matched are Func, Return and EndFunc. I've come up with the following regex: (...
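The excerpt is cut off before the asker's regex, so the following is only a guess at the requirement: match Func, Return and EndFunc as whole words, without matching the "Func" buried inside blablaFuncblabla. A minimal sketch using word boundaries, written in Python rather than Ruby (the helper name `find_keywords` is hypothetical):

```python
import re

# \b asserts a word boundary, so "Func" inside "blablaFuncblabla" is skipped.
KEYWORD = re.compile(r"\b(?:EndFunc|Func|Return)\b")

def find_keywords(text):
    """Return the keywords found in text, in order of appearance."""
    return KEYWORD.findall(text)

# find_keywords("Func test() Return blablaFuncblabla EndFunc")
# returns ["Func", "Return", "EndFunc"]
```

Ruby's regex engine also supports `\b`, so the same pattern should carry over largely unchanged.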

ANTLR lexer mismatches tokens

I have a simple ANTLR grammar, which I have stripped down to its bare essentials to demonstrate this problem I'm having. I am using ANTLRworks 1.3.1. grammar sample; assignment : IDENT ':=' NUM ';' ; IDENT : ('a'..'z')+ ; NUM : ('0'..'9')+ ; WS : (' '|'\n'|'\t'|'\r')+ {$channel=HIDDEN;} ; Obviously, thi...

Performance of tokenizing CSS in PHP

This is a noob question from someone who hasn't written a parser/lexer ever before. I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft). It's a list of 21 possible tokens, that all (but two) cannot be represented a...

Unable to compile output of lex

When I attempt to compile the output of this trivial lex program: # lex.l integer printf("found keyword INT"); using: $ gcc lex.yy.c I get: Undefined symbols: "_yywrap", referenced from: _yylex in ccMsRtp7.o _input in ccMsRtp7.o "_main", referenced from: start in crt1.10.6.o ld: symbol(s) not found collect2...

hand coding a parser

For all you compiler gurus, I wanna write a recursive descent parser and I wanna do it with just code. No generating lexers and parsers from some other grammar and don't tell me to read the dragon book, I'll come around to that eventually. I wanna get into the gritty details about implementing a lexer and parser for a reasonably simple ...

lexer/parser ambiguity

How does a lexer solve this ambiguity? /*/*/ How is it that it doesn't just say, oh yeah, that's the beginning of a multi-line comment, followed by another multi-line comment. Wouldn't a greedy lexer just return the following tokens? /* /* / I'm in the midst of writing a shift-reduce parser for CSS and yet this simple comment th...
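One way to see why `/*/*/` comes out as a single comment: lexers usually treat the whole comment as one token, so after consuming the opening `/*` the scanner looks only for the first closing `*/` and never re-reads the opener's `*`. A minimal sketch in Python, assuming C-style non-nesting comments (the function name `lex_comment` is hypothetical):

```python
def lex_comment(src, pos=0):
    """Return the /* ... */ comment starting at pos, or None if none starts there."""
    if not src.startswith("/*", pos):
        return None
    # Search for the terminator only after the two-character opener,
    # so the '*' of "/*" cannot double as the '*' of "*/".
    end = src.find("*/", pos + 2)
    if end == -1:
        raise ValueError("unterminated comment")
    return src[pos:end + 2]

# lex_comment("/*/*/") returns "/*/*/": the second "/*" is just comment text.
```

By the same logic, `/*/` alone is an unterminated comment rather than an empty one.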

Writing re-entrant lexer with Flex

I'm a newbie to flex. I'm trying to write a simple re-entrant lexer/scanner with flex. The lexer definition goes below. I get stuck with compilation errors as shown below (yyg issue): reentrant.l: /* Definitions */ digit [0-9] letter [a-zA-Z] alphanum [a-zA-Z0-9] identifier [a-zA-Z_][a-zA-Z0-9_]+ integer ...

Lexing newlines in scala StdLexical?

I'm trying to lex (then parse) a C-like language. In C there are preprocessor directives where line breaks are significant, then the actual code where they are just whitespace. One way of doing this would be to do a two-pass process like early C compilers - have a separate preprocessor for the # directives, then lex the output of that. Ho...

Parsing C# code to evaluate expressions (basically, implementing Intellisense)

I'm trying to evaluate C# code as it gets typed, think of it as if I'm trying to write an IDE. So a person types code, and I want to find out what code they just wrote: String x = ""; I now want to register that x is of type String. And now every time the user types x again, I want to show them all the things they can do with x, basic...

GNU Flex, multiline rule

Hi there, I have a flex rule inside my lexer definition: operators "[]"|"[]="|"[]<"|".."|"."|".="|"+"|"+="|"-"|"-="|"/"|"/="|"*"|"*="|"%"|"%="|"++"|"--"|"^"|"^="|"~"|"&"|"&="|"|"|"|="|"<<"|"<<="|">>"|"!"|"<"|">"|">="|"<="|"=="|"!="|"&&"|"||"|"~=" Is there any way to split this rule across multiple lines to keep it clearer? I tried with \ ju...

Call methods on native Javascript types without wrapping with ()

In Javascript, we can call methods on string literals directly without enclosing them in round brackets. But not for other types such as numbers, or functions. It is a syntax error, but is there a reason why the Javascript lexer needs these other types to be enclosed in round brackets? For example, if we extend Number, String, a...

lexers vs parsers

Are lexers and parsers really that different in theory? It seems fashionable to hate regular expressions: coding horror, another blog post. However, popular lexing-based tools: pygments, geshi, or prettify, all use regular expressions. They seem to lex anything... When is lexing enough, and when do you need EBNF? Has anyone used t...

How to efficiently build an interpreter (lexer+parser) in C?

I'm trying to make a meta-language for writing markup code (such as xml and html) which can be directly embedded into C/C++ code. Here is a simple sample written in this language, I call it WDI (Web Development Interface): /* * Simple wdi/html sample source code */ #include <mySite> string name = "myName"; string toCapital(strin...

Antlr Lexer Quoted String Predicate

I'm trying to build a lexer to tokenize lone words and quoted strings. I got the following: STRING: QUOTE (options {greedy=false;} : . )* QUOTE ; WS : SPACE+ { $channel = HIDDEN; } ; WORD : ~(QUOTE|SPACE)+ ; For the corner cases, it needs to parse: "string" word1" word2 As three tokens: "string" as STRING and word1" an...