lexer

Lexing partial SQL in C#

I need to parse partial SQL queries (it's for a SQL injection auditing tool). For example, '1' AND 1=1-- should break down into tokens like [0] => [SQL_STRING, '1'] [1] => [SQL_AND] [2] => [SQL_INT, 1] [3] => [SQL_AND] [4] => [SQL_INT, 1] [5] => [SQL_COMMENT] [6] => [SQL_QUERY_END]. Are there at least any lexers for SQL that I base...
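For illustration only, here is a minimal Python sketch of the kind of token stream described in this question (the question itself asks about C#). The token names come from the excerpt where it gives them; SQL_EQUALS for `=` and SQL_UNKNOWN are names assumed here, and the regular expressions cover only this example, not real SQL.

```python
import re

# Token names follow the question where it gives them.
# SQL_EQUALS and SQL_UNKNOWN are assumptions added for this sketch.
TOKEN_SPEC = [
    ("SQL_STRING", r"'(?:[^']|'')*'?"),  # closing quote optional: input may be partial
    ("SQL_COMMENT", r"--.*"),            # comment runs to the end of the line
    ("SQL_INT", r"\d+"),
    ("SQL_AND", r"(?i:and)\b"),
    ("SQL_EQUALS", r"="),
    ("SKIP", r"\s+"),
    ("SQL_UNKNOWN", r"."),               # anything not recognised above
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(fragment):
    """Return a list of (kind, value) pairs; value is None for tokens without a payload."""
    tokens = []
    for match in MASTER.finditer(fragment):
        kind = match.lastgroup
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "SQL_INT":
            tokens.append((kind, int(text)))
        elif kind in ("SQL_STRING", "SQL_UNKNOWN"):
            tokens.append((kind, text))
        else:
            tokens.append((kind, None))
    tokens.append(("SQL_QUERY_END", None))
    return tokens


if __name__ == "__main__":
    for index, token in enumerate(tokenize("'1' AND 1=1--")):
        print(index, token)
```

Because every character falls into some alternative (SQL_UNKNOWN as a last resort), the tokenizer never rejects a fragment, which seems useful when the input is an injected partial query rather than a complete statement.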

Parser generator for JavaME

First: I have looked at this SO question, but unfortunately there is no mention of JavaME. I am looking for a parser/lexer generator that produces code that can run on the Blackberry and its (obnoxious) JavaME. E.g. at first I thought I could use ANTLR; however, it seems the run-time library is not compatible with JavaME. TIA ...

Does C# have a (direct) flex/yacc port? Or what lexer/parser do people use for C#?

I might be wrong, but it looks like there's no direct flex/bison (lex/yacc) port for C#/.NET so far. For an LALR parser, I found GPPG/GPLEX, and for an LL parser, there is the famous ANTLR. But I want to reuse my flex/bison grammar as much as possible. Is there any direct port of flex/bison for C#? What lexer/parser do people normally ...

What lexer to build a lexer/parser in Scala

Hi there, I'm currently looking for a lexer/parser that generates Scala code from a BNF grammar (an ocamlyacc file with precedence and associativity), and I'm quite confused to find almost nothing. For parsing, I found scala-bison (which I have a lot of trouble dealing with). All the other tools are just Java parsers imported into Scala (l...

Simple XML parser in bison/flex

I would like to create a simple XML parser using bison/flex. I don't need validation, comments, arguments, only <tag>value</tag>, where value can be a number, a string or another <tag>value</tag>. So for example: <div> <mul> <num>20</num> <add> <num>1</num> <num>5</num> </add> </mul> <id>test</id> </div> If it h...
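The question asks for bison/flex, so the following is not that. It is a small Python sketch of the same lexer/parser split, assuming the grammar is roughly `element : OPEN content* CLOSE` and `content : TEXT | element`. The token names OPEN, CLOSE and TEXT and the tuple-based tree are choices made here for illustration.

```python
import re

# One alternative per token class: <name>, </name>, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9_]*)>|([^<]+)")


def lex(text):
    """Yield ('OPEN', name), ('CLOSE', name) or ('TEXT', value) tokens."""
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = match.end()
        slash, name, value = match.groups()
        if name is not None:
            yield ("CLOSE" if slash else "OPEN", name)
        elif value.strip():  # whitespace between tags is dropped
            yield ("TEXT", value.strip())


def parse_element(tokens, i):
    """Parse one element starting at tokens[i]; return (node, next_index)."""
    if i >= len(tokens) or tokens[i][0] != "OPEN":
        raise SyntaxError("expected an opening tag")
    name = tokens[i][1]
    i += 1
    children = []
    while i < len(tokens) and tokens[i][0] != "CLOSE":
        if tokens[i][0] == "TEXT":
            children.append(tokens[i][1])
            i += 1
        else:
            child, i = parse_element(tokens, i)
            children.append(child)
    if i >= len(tokens) or tokens[i] != ("CLOSE", name):
        raise SyntaxError(f"missing </{name}>")
    return (name, children), i + 1


def parse(text):
    """Parse a single root element into a (name, children) tuple."""
    tokens = list(lex(text))
    node, end = parse_element(tokens, 0)
    if end != len(tokens):
        raise SyntaxError("trailing input after the root element")
    return node
```

The recursive function plays the role that a recursive grammar rule would play in bison; checking that the closing name equals the opening name is the part a context-free grammar alone cannot express and would go into a semantic action.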

Why does 'a'..'z' in ANTLR match wildcards like $ or £?

When I run the following grammar: test : WORD+; WORD : ('a'..'z')+; WS : ' '+ {$channel = HIDDEN;}; and I give the input "?test", why does ANTLR accept this as valid input? I thought the ('a'..'z') would only match characters within the lowercase alphabet? ...

Design guidelines for parser and lexer?

I'm writing a lexer (with re2c) and a parser (with Lemon) for a slightly convoluted data format: CSV-like, but with specific string types at specific places (alphanumeric chars only, alphanumeric chars and minus signs, any char except quotes and comma but with balanced braces, etc.), strings inside braces and strings that look like funct...

Adding a new lexer to scintilla/scite (...and eventually wxPython StyledTextCtrl)

Have any of you successfully added a lexer to Scintilla? I have been following the short instructions at http://www.scintilla.org/SciTELexer.html - and even discovered the secret extra instructions at http://www.scintilla.org/ScintillaDoc.html#BuildingScintilla (Changing Set of Lexers). Everything compiles, and I can add the lexer t...

Why is ANTLR parsing skipping a token?

I'm trying to do some very basic C++ function declaration parsing. Here is my rule for parsing an input parameter: arg : 'const'? 'unsigned'? t=STRING m=TYPEMOD? n=STRING -> ^(ARG $n $t $m?) ; STRING : ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'::')+ ; TYPEMOD : ('*' | '&')+ ; The problem is I'm trying to pass it something like: int *pa...

How to parse template languages in Ragel?

I've been working on a parser for a simple template language. I'm using Ragel. The requirements are modest. I'm trying to find [[tags]] that can be embedded anywhere in the input string. I'm trying to parse a simple template language, something that can have tags such as {{foo}} embedded within HTML. I tried several approaches to parse...
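This is not Ragel, but as a rough Python sketch of the scanning step the question describes: split the input into literal text and {{tag}} tokens. The function name `split_template` and the TEXT/TAG labels are made up here, and the tag syntax is assumed to be a single word between double braces.

```python
import re

# Assumed tag syntax: {{name}} with optional spaces inside the braces.
TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def split_template(source):
    """Return a list of ('TEXT', literal) and ('TAG', name) pairs in input order."""
    parts = []
    pos = 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            # Everything between the previous tag and this one is literal text.
            parts.append(("TEXT", source[pos:match.start()]))
        parts.append(("TAG", match.group(1)))
        pos = match.end()
    if pos < len(source):
        parts.append(("TEXT", source[pos:]))
    return parts
```

Anything that does not form a complete tag, such as a lone `{`, simply stays inside the surrounding TEXT part, which matches the requirement that tags can be embedded anywhere in otherwise arbitrary HTML.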

Basic problem with yacc/lex

Hello, I have some problems with a very simple yacc/lex program. I may have forgotten some basic steps (it's been a long time since I've used these tools). In my lex program I give some basic values like: word [a-zA-Z][a-zA-Z]* %% ":" return(PV); {word} { yylval = yytext; printf("yylval = %s\n",yylva...

ANTLR: MismatchedTokenException with similar literals

I have the following rule : A B; A : 'a_e' | 'a'; B : '_b'; Input: a_b //doesn't work a_e_b //works Why is the lexer having trouble matching this? When ANTLR matches the 'a_' in 'a_b', shouldn't it backtrack or use lookahead or something to see it can't match a token A and then decide to match token A as 'a' and then proceed to matc...

Is there a Javascript lexer / tokenizer (in PHP)?

I've seen a couple of Python Javascript tokenizers and a cryptic document on Mozilla.org about a Javascript Lexer but can't find any Javascript tokenizers for PHP specifically. Are there any? Thanks ...

Parser vs. lexer and XML

I'm reading about compiler and parser architecture now and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here and what would be the tokens? I've read that tokens are like words prepared for parsing by the lexer. Although I don't have a problem with finding tokens...

ANTLR: Unicode Character Scanning

Problem: Can't get Unicode character to print correctly. Here is my grammar: options { k=1; filter=true; // Allow any char but \uFFFF (16 bit -1) charVocabulary='\u0000'..'\uFFFE'; } ANYCHAR :'$' | '_' { System.out.println("Found underscore: "+getText()); } | 'a'..'z' { System.out.println("Found alpha: "+getText()); } | '\u...

Most Efficient way to 'look up' Keywords

Alright so I am writing a function as part of a lexical analyzer that 'looks up' or searches for a match with a keyword. My lexer catches all the obvious tokens such as single and multi character operators (+ - * / > < = == etc) (also comments and whitespace are already taken out) so I call a function after I've collected a stream of onl...
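One common approach to this kind of look-up is a hash table keyed by the keyword text, sketched below in Python. The keyword set and the token names (KW_IF, IDENTIFIER and so on) are invented for the example; whether this is the most efficient option depends on the implementation language and the number of keywords.

```python
# Hypothetical keyword table: maps the lexeme to a token kind.
KEYWORDS = {
    "if": "KW_IF",
    "else": "KW_ELSE",
    "while": "KW_WHILE",
    "return": "KW_RETURN",
}


def classify(lexeme):
    """Return the keyword token kind for lexeme, or IDENTIFIER if it is not a keyword."""
    # A dict look-up hashes the string once; average cost does not grow
    # with the number of keywords, unlike a linear scan over a list.
    return KEYWORDS.get(lexeme, "IDENTIFIER")


if __name__ == "__main__":
    for word in ("while", "whilst", "return"):
        print(word, classify(word))
```

For small, fixed keyword sets, alternatives such as a sorted array with binary search or a generated perfect hash are also used; the dict version is just the shortest to write.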

ANTLR - emitting multiple tokens for a lexer rule

Hi, I wanted to know if ANTLR supports emitting multiple tokens for a lexer rule, given the target language is JavaScript. I have found that it supports multiple tokens in other target languages, such as Java and CSharp, but could not find any documentation on this feature being supported in JavaScript. If anyone could point me to any...

ANTLR (lexer): matching the right token

In my ANTLR3 grammar, I have several "overlapping" lexer rules, like this: NAT: ('0' .. '9')+ ; INT: ('+' | '-')? ('0' .. '9')+ ; BITVECTOR: ('0' | '1')* ; Although tokens like 100110 and 123 can be matched by more than one of those rules, it is always determined by context which of them it has to be. Example: s: a | b | c ; a: '<' N...

ANTLR3 lexer precedence

I want to create a token from '..' in the ANTLR3 lexer which will be used to string together expressions like a..b // [1] c .. x // [2] 1..2 // [3] 3 .. 4 // [4] So, I have added, DOTDOTSEP : '..' ; The problem is that I already have a rule: FLOAT : INT (('.' INT (('e'|'E') INT)? 'f'?) | (('e'|'E') INT)? ('...
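To make the intended tokenization concrete, here is a Python sketch (not ANTLR3) in which FLOAT only matches when a digit follows the dot, so 1..2 falls back to INT, DOTDOTSEP, INT. The FLOAT pattern is a simplified stand-in for the truncated rule in the question, and ID and WS are assumed token names.

```python
import re

# FLOAT requires a digit after '.', so the first '.' of '..' can never be
# consumed as part of a number. FLOAT is a simplified assumption, not the
# full rule from the question.
TOKEN = re.compile(
    r"""
    (?P<FLOAT>\d+\.\d+(?:[eE]\d+)?f?)
  | (?P<INT>\d+)
  | (?P<DOTDOTSEP>\.\.)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<WS>\s+)
    """,
    re.VERBOSE,
)


def lex(text):
    """Return a list of (kind, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for sample in ("a..b", "c .. x", "1..2", "3 .. 4", "1.5"):
        print(sample, lex(sample))
```

The regex engine backtracks out of the FLOAT alternative when no digit follows the dot, which is the behaviour the question is looking for from the ANTLR3 lexer.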

What is the name of the character that designates literals when lexing an input sequence?

I want to know the 'terminology name' of the character that designates the start of a literal in a lexing process. For example: a string starts and ends with a " character. a regular expression literal - with a / character. ...