lexical-analysis

How would you go about implementing the off-side rule?

I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule. In short: in this context, the off-side rule means that indentation is recognized as a syntactic element. Here is the off-side rule in pseudocode for making tokenizers that capture indentation in usable form...
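
For readers unfamiliar with the idea, here is a minimal Python sketch of one common way to turn indentation into tokens: keep a stack of indentation widths and emit INDENT/DEDENT markers when the width changes. It is not the asker's generator; the function name `offside_tokens` and the token shapes are hypothetical, and the sketch assumes indentation uses spaces only.

```python
def offside_tokens(source):
    """Yield ("INDENT",), ("DEDENT",) and ("LINE", text) tuples.

    Assumption: indentation consists of spaces only; blank lines are ignored.
    """
    stack = [0]  # widths of the currently open indentation levels
    for raw in source.splitlines():
        if not raw.strip():
            continue  # blank lines carry no indentation information
        stripped = raw.lstrip(" ")
        width = len(raw) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current level: open exactly one new block.
            stack.append(width)
            yield ("INDENT",)
        else:
            # Shallower: close blocks until we reach a matching level.
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT",)
            if width != stack[-1]:
                raise ValueError(f"inconsistent dedent: {raw!r}")
        yield ("LINE", stripped)
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT",)
```

A parser can then treat INDENT and DEDENT like opening and closing braces.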

Good APIs for scope analyzers

I'm working on some code generation tools, and a lot of complexity comes from doing scope analysis. I frequently find myself wanting to know things like: What are the free variables of a function or block? Where is this symbol declared? What does this declaration mask? Does this usage of a symbol potentially occur before initialization?...

Flex: Is there a way to return multiple tokens at once?

In flex, I want to return multiple tokens for one match of a regular expression. Is there a way to do this? ...

What are alternatives to regexes for syntax highlighting?

While editing this and that in Vim, I often find that its syntax highlighting (for some filetypes) has some defects. I can't remember any examples at the moment, but someone surely will. Usually, it consists of strings badly highlighted in some cases, some things with arithmetic and boolean operators and a few other small things as well....

Where does the compiler spend most of its time during parsing?

I read in Sebesta's book that the compiler spends most of its time lexing source code. So, optimizing the lexer is a necessity, unlike the syntax analyzer. If this is true, why does the lexical analysis stage take so much time compared to syntax analysis in general? By syntax analysis I mean the derivation process. ...

Start states in Lex / Flex

I'm using Flex and Bison for a parser generator, but having problems with the start states in my scanner. I'm using exclusive rules to deal with commenting, but this grammar doesn't seem to match quoted tokens: %x COMMENT // { BEGIN(COMMENT); } <COMMENT>[^\n] ; <COMMENT>\n { BEGIN(INITIAL); } "==" ...

Parsing Meaning from Text

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like: "Manny Ramirez makes his return for the Dodgers today against the Houston Astros", what's a lightweight/easy way of getting the nouns out of a s...

Writing a Z80 assembler - Lexing ASM and building a parse tree using composition?

Hi guys, I'm very new to the concept of writing an assembler and even after reading a great deal of material, I'm still having difficulties wrapping my head around a couple of concepts. 1) What is the process to actually break up a source file into tokens? I believe this process is called lexing and I've searched high and low for a real...

Best way to create word stream in C#

I'd like to be able to write something like the following. Can someone show me how to write a clean WordReader class in C#? (A word is [a-zA-Z]+.) public List<string> GetSpecialWords(string text) { string word; List<string> specialWords = new List<string>(); using (WordReader wr = new WordReader(text)) ...

Lexical analysis and Macros

I'm writing a demo compiler for a toy programming language in C. What problems can arise if we do macro processing in a separate phase between reading the program and lexical analysis? ...

What are some examples of errors a lexical analyzer could detect?

What are some examples of errors a lexical analyzer could detect in a given piece of code in a language like Java, C++ or C? ...

Lexical Analysis libraries

I would like to make a piece of software able to recognize whether a sentence is positive or negative. Are there any lexical analysis libraries around? I don't really know where I should start. ...

Eliminating left recursion for E := EE+|EE-|id

How to eliminate left recursion for the following grammar? E := EE+|EE-|id Using the common procedure: A := Aa|b translates to: A := b|A' A' := ϵ| Aa Applying this to the original grammar we get: A = E, a = (E+|E-) and b = id Therefore: E := id|E' E' := ϵ|E(E+|E-) But this grammar seems incorrect because ϵE+ -> ϵ id + w...

Lexical Analysis of Python Programming Language

Does anyone know where a FLEX or LEX specification file for Python exists? For example, this is a lex specification for the ANSI C programming language: http://www.quut.com/c/ANSI-C-grammar-l-1998.html FYI, I am trying to write code highlighting into a Cocoa application. Regex won't do it because I also want grammar parsing to fold code...

Matching multiple regex groups and removing them

I have been given a file that I would like to extract the useful data from. The format of the file goes something like this: LINE: 1 TOKENKIND: somedata TOKENKIND: somedata LINE: 2 TOKENKIND: somedata LINE: 3 etc... What I would like to do is remove LINE: and the line number as well as TOKENKIND: so I am just left with a string that ...
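
As a rough illustration of the kind of substitution being asked about, here is a minimal Python sketch. It assumes the markers appear literally as `LINE: <number>` and `TOKENKIND:` (the excerpt is truncated, so the real token-kind names may differ), and the helper name `strip_markers` is hypothetical.

```python
import re

# Assumption: markers look exactly like "LINE: <digits>" or "TOKENKIND:".
MARKERS = re.compile(r"LINE:\s*\d+|TOKENKIND:")


def strip_markers(text):
    """Remove the markers and collapse the leftover whitespace."""
    without_markers = MARKERS.sub(" ", text)
    # split() with no arguments drops runs of whitespace, so join() gives
    # single spaces between the remaining data items.
    return " ".join(without_markers.split())
```

If the token kinds are really different uppercase names, the second alternative in the pattern would need to be widened accordingly.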

Possible typos in ECMAScript 5 specification?

Does anybody know why, at the end of section 7.6 of the ECMA-262, 5th Edition specification, the nonterminals UnicodeLetter, UnicodeCombiningMark, UnicodeDigit, UnicodeConnectorPunctuation, and UnicodeEscapeSequence are not followed by two colons? From section 5.1.6: Nonterminal symbols are shown in italic type. The definition of ...

Can the lexical analysis stage check grammar rules during compilation?

Hi guys, sorry for such a silly question, but I had an argument with my pals about lexical analysis and we've decided to ask the community. The question is: would the statement "int some_variable = ;" be rejected as invalid during the lexical analysis stage or during the syntax analysis stage of a C compiler? Thanks ...

Python - lexical analysis and tokenization

I'm looking to speed along my discovery process here quite a bit, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I'll describe my problem: I've got very large properties files (in the order of 1,000 properties), which when distilled, are really just about 15 important properties...

categorize websites - open source LSI?

I'm looking to categorize lots of websites (millions). I can use Nutch to crawl them and get the content of the sites, but I am looking for the best (and cheapest or free) tool to categorize them. One option is to create regular expressions that look for certain keywords and categorize the sites, but there are also high-end LSI type too...

How do I lex this input?

I currently have a working, simple language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP. For example: Lorem ipsum dolor sit amet <% print('consectetur adipiscing elit'); %> Phasellus volutpat dignissim sapien. I anticipate that the resulting token stream would look somethi...
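
One way to think about this kind of input is as a pre-pass that separates plain text from `<% ... %>` code islands before the embedded language's own lexer runs. The Python sketch below illustrates that idea only; it is not ANTLR code, the function name `split_template` and the `TEXT`/`CODE` token names are hypothetical, and it assumes `%>` never appears inside a string literal in the embedded code.

```python
import re

# Non-greedy match so each island ends at the first "%>" that follows it.
ISLAND = re.compile(r"<%(.*?)%>", re.DOTALL)


def split_template(source):
    """Split source into ("TEXT", s) and ("CODE", s) tuples, in order."""
    tokens = []
    pos = 0
    for match in ISLAND.finditer(source):
        if match.start() > pos:
            # Plain text between the previous island and this one.
            tokens.append(("TEXT", source[pos:match.start()]))
        tokens.append(("CODE", match.group(1).strip()))
        pos = match.end()
    if pos < len(source):
        tokens.append(("TEXT", source[pos:]))
    return tokens
```

Each `CODE` chunk could then be handed to the existing lexer, while `TEXT` chunks pass through unchanged.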