lexical-analyser

Recommendations for a good C#/.NET-based lexical analyser

Can anyone recommend a good .NET-based lexical analyser, preferably written in C#? ...

Simple regex-based lexer in Python

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general analyzer in Python, and came up with:

    import re
    import sys

    class Token(object):
        """ A simple Token structure.
            Contains the token type, value and position.
        """
        def __init__(self, type, val, pos):
            self.ty...
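Since the excerpt is cut off, here is a minimal sketch of what a regex-driven lexer of this kind could look like. Only `Token` and its `type`, `val`, `pos` fields come from the excerpt; `tokenize`, `LexerError` and `RULES` are hypothetical names chosen for illustration, and this is not the original author's code.

```python
import re


class Token:
    """A simple token: type, matched text and start position."""

    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __repr__(self):
        return f"Token({self.type!r}, {self.val!r}, {self.pos})"


class LexerError(Exception):
    """Raised when no rule matches at the current position (hypothetical helper)."""

    def __init__(self, pos):
        super().__init__(f"no rule matches at position {pos}")
        self.pos = pos


def tokenize(text, rules, skip_whitespace=True):
    """Try each (pattern, type) rule in order at the current position."""
    compiled = [(re.compile(pattern), type_) for pattern, type_ in rules]
    whitespace = re.compile(r"\s+")
    tokens = []
    pos = 0
    while pos < len(text):
        if skip_whitespace:
            m = whitespace.match(text, pos)
            if m:
                pos = m.end()
                continue
        for regex, type_ in compiled:
            m = regex.match(text, pos)
            # Ignore empty matches so the loop always makes progress.
            if m and m.end() > pos:
                tokens.append(Token(type_, m.group(), pos))
                pos = m.end()
                break
        else:
            raise LexerError(pos)
    return tokens


# Hypothetical rule set, first match wins.
RULES = [
    (r"\d+", "NUMBER"),
    (r"[A-Za-z_]\w*", "IDENTIFIER"),
    (r"[+\-*/=()]", "OP"),
]

tokens = tokenize("x = 42 + y", RULES)
```

Rule order matters in a design like this: the first pattern that matches wins, so more specific rules (keywords, multi-character operators) would need to be listed before more general ones.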

C#/.NET Lexer Generators

I'm looking for a decent lexical scanner generator for C#/.NET -- something that supports Unicode character categories, and generates somewhat readable & efficient code. Anyone know of one? EDIT: I need support for Unicode categories, not just Unicode characters. There are currently 1421 characters in just the Lu (Letter, Uppercase)...

Good APIs for scope analyzers

I'm working on some code generation tools, and a lot of complexity comes from doing scope analysis. I frequently find myself wanting to know things like:

- What are the free variables of a function or block?
- Where is this symbol declared?
- What does this declaration mask?
- Does this usage of a symbol potentially occur before initialization?...
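As one illustration of what such an API can look like, Python's standard `symtable` module answers the free-variable question for Python source. This is only a sketch of the idea, assuming Python is an acceptable stand-in; the question does not say which language is being analysed, and the sample source and the names `outer`/`inner` are made up.

```python
import symtable

# Made-up sample: `inner` uses `x` from the enclosing function and `y` from nowhere local.
SOURCE = (
    "def outer():\n"
    "    x = 1\n"
    "    def inner():\n"
    "        return x + y\n"
    "    return inner\n"
)

top = symtable.symtable(SOURCE, "<example>", "exec")
outer = top.get_children()[0]    # symbol table for `outer`
inner = outer.get_children()[0]  # symbol table for `inner`

# Free variables: referenced in `inner`, bound in an enclosing function scope.
free_vars = sorted(s.get_name() for s in inner.get_symbols() if s.is_free())

# Names that resolve to the global scope instead.
global_refs = sorted(s.get_name() for s in inner.get_symbols() if s.is_global())
```

Here `free_vars` contains `x` and `global_refs` contains `y`. Questions such as use-before-initialization need flow analysis on top of this and are not covered by `symtable`.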

Implement word boundary states in flex/lex (parser-generator)

I want to be able to predicate pattern matches on whether they occur after word characters or after non-word characters. In other words, I want to simulate the \b word break regex char at the beginning of the pattern which flex/lex does not support. Here's my attempt below (which does not work as desired): %{ #include <stdio.h> %} %x ...

Python regular expressions - how to capture multiple groups from a wildcard expression?

I have a Python regular expression that contains a group which can occur zero or many times, but when I retrieve the list of groups afterwards, only the last one is present. Example:

    re.search("(\w)*", "abcdefg").groups()

This returns the list ('g',). I need it to return ('a','b','c','d','e','f','g',). Is that possible? How can I do i...
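With the standard `re` module a repeated group only keeps its last match, so one common workaround is to match each repetition separately. A small sketch, using the string from the question (the variable names are just for illustration):

```python
import re

text = "abcdefg"

# A repeated group keeps only its final capture.
last_only = re.search(r"(\w)*", text).groups()

# Option 1: let findall collect every single-character match.
all_chars = tuple(re.findall(r"\w", text))

# Option 2: match the whole run once, then split it into characters.
all_from_run = tuple(re.search(r"\w*", text).group())
```

Both options produce the seven one-character strings; which is preferable depends on whether the repeated part is embedded in a larger pattern.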

How to make a flex (lexical scanner) to read UTF-8 characters input?

It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII char, it stops scanning as if it had reached EOF. Is there a way to force flex to eat my UTF-8 chars? I don't want it to actually match UTF-8 chars, just eat them when using the '.' pattern. Any suggestions? EDIT The simplest solution would be: ...

Where might I obtain a lexical analyzer capable of reporting for-loop errors in C or C++?

I need a simple lexical analyzer that reports for-loop errors in C/C++. ...

Textual analysis of large documents

I have a project where I need to compare multi-chapter documents to a second document to determine their similarity. The issue is that I have no idea how to go about doing this, what approaches exist, or if there are any libraries available. My first question is... what is similar? The number of words that match, the number of consecutive wo...

What is the use of tokens.h when I am programming a lexer?

I am programming a lexer in C and I read somewhere about the header file tokens.h. Is it there? If so, what is its use? ...

Handling error conditions in Lex rather than Yacc?

Suppose I have a lex regular expression like

    [aA][0-9]{2,2}[pP][sS][nN]? { return TOKEN; }

If a user enters

    A75PsN
    A75PS

it will match. But if a user enters something like

    A75PKN

I would like it to raise an error saying "Character K not recognized, expecting S". What I am doing right now is just writing it like

    let [a-zA-Z]
    num [0-9]
    {l...
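The desired behaviour can be sketched outside lex by checking the input position by position, so the failing character and the expected one are both known when the match breaks. This is a Python illustration of the idea only, not a lex solution; `REQUIRED` and `check` are hypothetical names.

```python
# One entry per required position of [aA][0-9]{2,2}[pP][sS]:
# (allowed characters, how to name the expectation in the message).
REQUIRED = [
    ("aA", "A"),
    ("0123456789", "a digit"),
    ("0123456789", "a digit"),
    ("pP", "P"),
    ("sS", "S"),
]


def check(text):
    """Return None if text matches, otherwise a descriptive error message."""
    for i, (allowed, name) in enumerate(REQUIRED):
        if i >= len(text):
            return f"Input ended early, expecting {name}"
        if text[i] not in allowed:
            return f"Character {text[i]} not recognized, expecting {name}"
    rest = text[len(REQUIRED):]
    # The trailing [nN]? is optional.
    if rest[:1] in ("n", "N"):
        rest = rest[1:]
    if rest:
        return f"Character {rest[0]} not recognized, expecting end of input"
    return None


ok_long = check("A75PsN")
ok_short = check("A75PS")
bad = check("A75PKN")
```

For the three inputs in the question, the first two are accepted and the third yields the message the question asks for.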

What's the difference between a parser and a scanner?

I already made a scanner, now I'm supposed to make a parser. What's the difference? ...
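Roughly, a scanner turns characters into tokens, and a parser checks how those tokens fit a grammar and builds structure from them. A minimal sketch of the split, assuming a toy grammar `expr : NUM | expr PLUS NUM` that is invented here for illustration (as are `scan` and `parse`):

```python
import re

TOKEN_RE = re.compile(r"(?P<NUM>\d+)|(?P<PLUS>\+)|(?P<WS>\s+)")


def scan(text):
    """Scanner: characters in, flat list of (kind, lexeme) tokens out."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse(tokens):
    """Parser: tokens in, nested tuple tree out, for expr : NUM | expr PLUS NUM."""
    if not tokens or tokens[0][0] != "NUM":
        raise SyntaxError("expected a number")
    tree = int(tokens[0][1])
    i = 1
    while i < len(tokens):
        if (
            tokens[i][0] != "PLUS"
            or i + 1 >= len(tokens)
            or tokens[i + 1][0] != "NUM"
        ):
            raise SyntaxError(f"unexpected token {tokens[i][1]!r}")
        tree = ("+", tree, int(tokens[i + 1][1]))
        i += 2
    return tree


tokens = scan("1 + 2 + 3")
tree = parse(tokens)
```

Note that `scan("1 2")` succeeds, because both lexemes are valid tokens; it is `parse` that rejects the sequence, since the grammar has no rule for two adjacent numbers.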

What's wrong with this lex file?

I have a Makefile so that when I type make the following commands run:

    yacc -d parser.y
    gcc -c y.tab.c
    flex calclexer.l
    gcc -c lex.yy.c

But then after this I get the following error messages:

    calclexer.l:10: error: parse error before '[' token
    calclexer.l:10: error: stray '\' in program
    calclexer.l:15: error: stray '\' in program
    cal...

When is the symbol table for this program built?

When I run make on the following Makefile, when is the symbol table built, if it even is?

    LEX = flex
    YACC = yacc
    CC = gcc

    calcu: y.tab.o lex.yy.o
    	$(CC) -o calcu y.tab.o lex.yy.o -ly -lfl

    y.tab.c y.tab.h: parser.y
    	$(YACC) -d parser.y

    y.tab.o: y.tab.c parser.h
    	$(CC) -c y.tab.c

    lex.yy.o: y.tab.h lex.yy.c
    	$(CC) -c lex.y...

Does this program grammar only recognize variables with the name 'ID'?

I need to make a scanner in lex/flex to find tokens and a parser in yacc/bison to process those tokens based on the following grammar. When I was in the middle of making the scanner, it appeared to me that variables, functions, and arrays in this language can only have the name 'ID'. Am I misreading this yacc file? /* C-Minus BNF Gram...

How does a lexer return a semantic value that the parser uses?

Is it always necessary to do so? What does it look like? ...
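In lex/yacc the scanner conventionally stores the value in `yylval` before returning the token code. The same idea can be sketched in Python as a lexer that yields (token type, semantic value) pairs; the token names and the `lex` function below are assumptions made for illustration, not part of any particular tool.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<NUM>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<PUNCT>[;\[\]])|(?P<WS>\s+)"
)
KEYWORDS = {"int": "INT", "void": "VOID"}


def lex(text):
    """Yield (token_type, semantic_value) pairs."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "WS":
            continue
        if kind == "NUM":
            yield ("NUM", int(lexeme))         # value: the number itself
        elif kind == "ID":
            if lexeme in KEYWORDS:
                yield (KEYWORDS[lexeme], None)  # keywords carry no value
            else:
                yield ("ID", lexeme)            # value: the identifier's name
        else:
            yield (lexeme, None)                # punctuation is its own token type


pairs = list(lex("int a[10];"))
```

This also suggests an answer to the "always necessary" part: tokens such as keywords and punctuation are fully described by their type, so only tokens like numbers and identifiers need a value attached.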

Why do I get a syntax error in my program made with flex and yacc?

I made a program that is supposed to recognize a simple grammar. When I input what I think is supposed to be a valid statement, I get an error. Specifically, if I type int a; int b; it doesn't work. After I type int a; the program echoes ; for some reason. Then when I type int b; I get syntax error. The lex file: %{ #include <st...

How to return literals from flex to yacc?

In my yacc file I have things like the following:

    var_declaration : type_specifier ID ';'
                    | type_specifier ID '[' NUM ']' ';'
                    ;

    type_specifier : INT
                   | VOID
                   ;

ID, NUM, INT, and VOID are tokens that get returned from flex, so yacc has no problems recognizing them. The problem is that in the above there are things like ...

Using the regular expression [\[\];] in flex

If I have the following in my flex file, what does it do? [\\[\\];] { return yytext[0]; } ...

The program I made with flex/yacc doesn't always recognize identifiers

I made a program that is supposed to recognize a simple grammar. When I input what I think is supposed to be a valid statement, I get an error. Specifically, if I start out with an identifier, I automatically get a syntax error. However, I noticed that using an identifier won't generate an error if it is preceded by 'int'. If a is an...