lexer

Haskell lexer problems

I'm writing a lexer in Haskell. Here's the code: lexer :: String -> [Token] lexer s | s =~ whitespace :: Bool = let token = s =~ whitespace :: String in lex (drop (length token) s) | s =~ number :: Bool = let token = s =~ number :: String in Val (read token) : lex (drop (length token) s) | s =~ operator...

Modify PL/SQL statement strings in C++

Hello all, This is my use case: Input is a string representing an Oracle PL/SQL statement of arbitrary complexity. We may assume it's a single statement (not a script). Now, several bits of this input string have to be rewritten. E.g. table names need to be prefixed, aggregate functions in the selection list that don't use a column ali...

Is there a mechanism in Antlr to allow the lexer to match a token only during certain rules?

I'd like to add a keyword to my language. This keyword would only have to be matched during one particular parser grammar rule. For backward compatibility I'd like to allow this keyword to continue to be used as a variable name, i.e. it can be matched by the lexer rule that determines if a token is suitable for a variable name. The ...

Several lexers for one parser with PLY?

Hi, I'm trying to implement a parser in Python using PLY for the Kconfig language used to generate the configuration options for the Linux kernel. There's a keyword called source which performs an inclusion, so when the lexer encounters this keyword, I change the lexer state to create a new lexer which is going to lex th...

Lexer written in JavaScript?

I have a project where a user needs to define a set of instructions for a UI that is completely written in JavaScript. I need to have the ability to parse a string of instructions and then translate them into instructions. Are there any libraries out there for parsing that are 100% JavaScript? Or a generator that will generate in javascri...

Is C++ code generation in ANTLR 3.2 ready?

Hi, I was trying hard to make ANTLR 3.2 generate parser/lexer in C++. It was fruitless. Things went well with Java & C though. I was using this tutorial to get started: http://www.ibm.com/developerworks/aix/library/au-c%5Fplusplus%5Fantlr/index.html When I checked the *.stg files, I found that: CPP has only ./tool/src/main/resources/...

Remove whitespace, but do so last?

I am attempting to parse Lua, which depends on whitespace in some cases due to the fact that it doesn't use braces for scope. I figure that throwing out whitespace only if no other rule matches is the best way, but I have no clue how to do that. Can someone help me? ...

Using Scanner/Parser/Lexer for script collation

I'm working on a JavaScript collator/compositor implemented in Java. It works, but there has to be a better way to implement it and I think a Lexer may be the way forward, but I'm a little fuzzy. I've developed a meta syntax for the compositor which is a subset of the JavaScript language. As far as a typical JavaScript interpreter is c...

Lexers / parsers for (un)structured text documents

There are lots of parsers and lexers for scripts (i.e. structured computer languages). But I'm looking for one which can break an (almost) non-structured text document into larger sections e.g. chapters, paragraphs, etc. It's relatively easy for a person to identify them: where the Table of Contents, acknowledgements, or where the main...

How to write a Bison file to automatically use a token enumeration list defined in a C header file?

Hi everyone, I am trying to build a parser with Bison/Yacc to be able to parse a stream of tokens produced by another module. The different token IDs are already listed in an enumeration type as follows: // C++ header file enum token_id { TokenType1 = 0x10000000, TokenType2 = 0x11000000, TokenType3 = 0x1110000...

What is a suitable lexer generator that I can use to strip identifiers from many language source files?

I'm working on a group project for my university which is going to be used for plagiarism detection in Computer Science. My group is primarily going off the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detect...

How can I keep track of original character positions in a string across transformations?

I'm working on an anti-plagiarism project for my CS class. This involves detecting plagiarism in computer science courses (programming assignments), through a technique described in "Winnowing: Local Algorithms for Document Fingerprinting." Basically, I'm taking a group of programming assignments. Let's say one of the assignments looks lik...
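A minimal sketch of one way to approach the question in the title, assuming the transformation is whitespace removal (the excerpt does not say which transformations are applied). The function name strip_whitespace_with_map is hypothetical. The idea is to carry a parallel list of original indices alongside the transformed text, so any match found in the transformed string can be mapped back to the source.

```python
def strip_whitespace_with_map(text):
    """Remove whitespace from `text` and remember where each kept character came from.

    Returns (stripped, positions), where positions[i] is the index in `text`
    of the character that ended up at stripped[i].
    """
    kept_chars = []
    positions = []
    for index, char in enumerate(text):
        if not char.isspace():
            kept_chars.append(char)
            positions.append(index)
    return "".join(kept_chars), positions


# Hypothetical input, not taken from the question.
source = "int x = 1;"
stripped, positions = strip_whitespace_with_map(source)

# A match spanning stripped[start:end] corresponds to the original range
# source[positions[start] : positions[end - 1] + 1].
```

Further transformations can be chained the same way, as long as each step produces its own index list and the lists are composed in order.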

Why do some languages require functions to be declared in code before calling?

Suppose you have this pseudo-code do_something(); function do_something(){ print "I am saying hello."; } Why do some programming languages require the call to do_something() to appear below the function declaration in order for the code to run? ...

Antlr3 - HIDDEN token in the parser

Can you use a token defined in the lexer in a hidden channel in a single rule of the parser as if it were a normal token? The generated code is Java... thanks ...

Is there a JFlex specification of Java string literals somewhere?

And by string literals I mean those containing \123-like characters too. I've written something but I don't know if it's perfect: <STRING> { \" { yybegin(YYINITIAL); return new Token(TokenType.STRING,string.toString()); } \\[0-3][0-7][0-7] { string.append( ...

Regular expression token antlrV3

Can I write a rule where the initial token is partly fixed and partly generic? rule: ID '=' NUMBER ; ID: ('A'..'Z' | 'a'..'z')+ ; NUMBER: ('0'..'9')+ ; But only if the token ID is in the form var* (var is fixed) Thanks ...

Combined grammar ANTLR option filter

I have a combined grammar (lexer and parser in the same file). How do I set filter = true on the lexer? Thanks ...

How can I modify the text of tokens in a CommonTokenStream with ANTLR?

I'm trying to learn ANTLR and at the same time use it for a current project. I've gotten to the point where I can run the lexer on a chunk of code and output it to a CommonTokenStream. This is working fine, and I've verified that the source text is being broken up into the appropriate tokens. Now, I would like to be able to modify the...

Lexers/tokenizers and character sets

When constructing a lexer/tokenizer, is it a mistake to rely on functions (in C) such as isdigit/isalpha/...? They are dependent on locale as far as I know. Should I pick a character set and concentrate on it and make a character mapping myself from which I look up classifications? Then the problem becomes being able to lex multiple chara...
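The "make a character mapping myself" option raised in the question can be sketched roughly as follows. This is an illustration in Python, not C, and it assumes the lexer commits to plain ASCIIfor its digit and letter classes. The names is_ascii_digit and is_ascii_alpha are hypothetical. The built-in str.isdigit() plays the role of the environment-dependent classifier here, since it also accepts non-ASCII digits.

```python
import string

# Explicit classification tables over a fixed character set (ASCII only).
ASCII_DIGITS = frozenset(string.digits)
ASCII_LETTERS = frozenset(string.ascii_letters)


def is_ascii_digit(ch):
    """True only for the ten ASCII digits, regardless of locale or Unicode."""
    return ch in ASCII_DIGITS


def is_ascii_alpha(ch):
    """True only for A-Z and a-z."""
    return ch in ASCII_LETTERS


# The built-in classifier is broader than the table:
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
builtin_says = arabic_three.isdigit()     # True
table_says = is_ascii_digit(arabic_three)  # False
```

The trade-off is the one the question already hints at: the table is predictable, but supporting other character sets means building and selecting further tables.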

Best way to tokenize and parse programming languages in my application

I'm working on a tool that will perform some simple transformations on programs (like extract method). To do this, I will have to perform the first few steps of compilation (tokenizing, parsing and possibly building a symbol table). I'm going to start with C and then hopefully extend this out to support multiple languages. My question i...