tokenizer

How do I tokenize input using Java's Scanner class and regular expressions?

Just for my own purposes, I'm trying to build a tokenizer in Java where I can define a regular grammar and have it tokenize input based on that. The StringTokenizer class is deprecated, and I've found a couple of functions in Scanner that hint at what I want to do, but no luck yet. Does anyone know a good way of going about this? ...

How to parse, persist and retrieve a string with tags separated by spaces?

My database consists of 3 tables (one for storing all items, one for the tags, and one for the relation between the two): Table: Post, Columns: PostID, Name, Desc; Table: Tag, Columns: TagID, Name; Table: PostTag, Columns: PostID, TagID. What is the best way to save a space-separated string (e.g. "smart funny wonderful") into the 3 databas...

Tokenizing strings in C

I have been trying to tokenize a string using SPACE as the delimiter, but it doesn't work. Does anyone have a suggestion as to why it doesn't work? Edit: tokenizing using: strtok(string, " "); the code is like the following: pch = strtok (str," "); while (pch != NULL) { printf ("%s\n",pch); pch = strtok (NULL, " "); } ...

Reversed offset tokenizer

I have a string to tokenize. Its form is HHmmssff, where H, m, s, f are digits. It's supposed to be tokenized into four 2-digit numbers, but I need it to also accept shorthand forms, like sff, so that it is interpreted as 00000sff. I wanted to use boost::tokenizer's offset_separator, but it seems to work only with positive offsets and I'd lik...

Looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other and used?

Hello, I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through c/h source files to extract data declarations and definitions. I have been looking for examples and can find so...

Using Boost Tokenizer escaped_list_separator with different parameters

Hello, I have been trying to get a tokenizer to work using the boost library tokenizer class. I found this tutorial in the boost documentation: http://www.boost.org/doc/libs/1_36_0/libs/tokenizer/escaped_list_separator.htm The problem is I can't get the arguments to escaped_list_separator("","",""); but if I modify the boost/tokenizer.hpp...

Pythonic way to implement a tokenizer

Hi, I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice? I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best practices. Listing Token Types: In Java, for example, I would have a list of fields...
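For orientation, one common way to structure a small tokenizer in Python is a table of (token type, regular expression) pairs compiled into a single pattern with named groups. The sketch below is only an illustration under stated assumptions: the token types (NUMBER, IDENT, OP, SKIP), the Token tuple and the tokenize function are hypothetical names, not taken from the question.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # name of the token type that matched
    text: str   # the matched source text
    pos: int    # offset of the match in the input


# Hypothetical token types for illustration; earlier entries win on ties.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]

# One alternation of named groups; m.lastgroup reports which type matched.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text: str) -> Iterator[Token]:
    """Yield tokens left to right, skipping whitespace."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            yield Token(m.lastgroup, m.group(), pos)
        pos = m.end()
```

Writing tokenize as a generator lets a caller consume tokens lazily, for example `for tok in tokenize("x = 42"): print(tok.kind, tok.text)`.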

Scanner vs. StringTokenizer vs. String.split

I just learned about Java's Scanner class and now I'm wondering how it compares/competes with the StringTokenizer and String.split. I know that the StringTokenizer and String.split only work on Strings, so why would I want to use the Scanner for a String? Is Scanner just intended to be one-stop shopping for splitting? ...

What PHP HTML tokenizers can I use?

I need to process HTML submitted in my web application and don't want to munge the whole thing with regular expressions. What tokenizer approach and/or software should I use? ...

How do I read input character-by-character in Java?

I am used to the C-style getchar(), but it seems like there is nothing comparable for Java. I am building a lexical analyzer, and I need to read in the input character by character. I know I can use the Scanner to scan in a token or line and parse through the token char-by-char, but that seems unwieldy for strings spanning multiple line...

Need to construct a XML representation for C# code

Hi, I need to convert C# code to an equivalent XML representation. I plan to convert the C# code (C# 2.0 code snippets, no generics or nullable types) to an AST and then convert the AST to XML. Looking for a simple lexer/parser for C# which outputs an AST. Any pointers on converting C# code to an XML representation (which can be convert...

Is it Better to Design a Language that Utilizes White Space Instead of Symbols to Group Code?

I have found myself designing a language for fun that is a cross between Ruby and Java, and as I work on the compiler / interpreter I find myself pondering using whitespace as a terminator, like: class myClass extends baseClass function someFunction(arg) value eq firstValue value2 eq anotherValue x = 2 The alter...

C Tokenizer (and it returns empty too when fields are missing. yay!)

See also: Is this a good substr() for C? strtok() and friends skip over empty fields, and I do not know how to tell it not to skip but rather return empty in such cases. Similar behavior from most tokenizers I could see, and don't even get me started on sscanf() (but then it never said it would work on empty fields to begin with). I...

Using escaped_list_separator with boost split

I am playing around with the boost strings library and have just come across the awesome simplicity of the split method. string delimiters = ","; string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\""; // If we didn't care about delimiter characters within a quoted section we could us vector<s...

What is more efficient: a switch case or a std::map?

I'm thinking about the tokenizer here. Each token calls a different function inside the parser. What is more efficient: a map of std::functions/boost::functions, or a switch case? I thank everyone in advance for their answers. ...

C++ tokenize a string using a regular expression

Hi, I'm trying to teach myself some C++ from scratch at the moment. I'm well-versed in Python, Perl and JavaScript but have only encountered C++ briefly, in a classroom setting in the past. Please excuse the naivete of my question. I would like to split a string using a regular expression but have not had much luck finding a cle...

Term extraction: Generating tags out of text

How to get the same results as http://developer.yahoo.com/search/content/V1/termExtraction.html This question has been asked quite a few times before. http://stackoverflow.com/questions/1078766/best-approach-to-analyze-text-in-php http://stackoverflow.com/questions/711062/what-is-a-good-keyword-extraction-web-service http://stackoverf...

Int tokenizer

I know there are string tokenizers, but is there an "int tokenizer"? For example, I want to split the string "12 34 46" and have: list[0]=12 list[1]=34 list[2]=46. In particular, I'm wondering if Boost::Tokenizer does this, although I couldn't find any examples that didn't use strings. ...

Tokenizing blocks of code in Python

I have this string: [a [a b] [c e f] d] and I want a list like this: lst[0] = "a" lst[1] = "a b" lst[2] = "c e f" lst[3] = "d". My current implementation, which I don't think is elegant/pythonic, is two recursive functions (one splitting on '[' and the other on ']'), but I am sure it can be done using list comprehensions or regula...
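As a rough illustration of the transformation described here, a single pass with a depth counter is one possible alternative to two recursive functions. This is a sketch under assumptions, not the asker's method: split_blocks is a hypothetical name, top-level words are assumed to become separate items, and each bracketed group becomes one string (deeper nesting is kept verbatim inside its group).

```python
def split_blocks(text: str) -> list:
    """Split '[a [a b] [c e f] d]' into ['a', 'a b', 'c e f', 'd']."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected input wrapped in [ ]")

    items = []   # finished items
    buf = []     # characters of the item being built
    depth = 0    # bracket depth inside the outer pair

    def flush():
        # Move any pending characters into items.
        if buf:
            items.append("".join(buf).strip())
            buf.clear()

    for ch in text[1:-1]:
        if ch == "[":
            if depth == 0:
                flush()             # a word directly before '[' ends here
            else:
                buf.append(ch)      # keep deeper brackets verbatim
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']'")
            if depth == 0:
                # Closing a top-level group: emit it even if it is empty.
                items.append("".join(buf).strip())
                buf.clear()
            else:
                buf.append(ch)
        elif ch.isspace() and depth == 0:
            flush()                 # whitespace separates top-level words
        else:
            buf.append(ch)

    if depth != 0:
        raise ValueError("unbalanced '['")
    flush()
    return items
```

Calling `split_blocks("[a [a b] [c e f] d]")` returns `['a', 'a b', 'c e f', 'd']`.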

Comparing Values in a file with an array

I have a .txt file with integers on each line, e.g. 1 4 5 6. I want to count the occurrences in the file of the values that are in an array. My code extract is this: String s = null; FileReader fr = new FileReader(file); BufferedReader br = new BufferedReader(fr); while ((s = br.readLine()) !=null) { StringTokeni...