tokenizer

How can I tokenize this with a regex?

Suppose I have strings like the following: OneTwo ThreeFour AnotherString DVDPlayer CDPlayer. I know how to tokenize the camel-case ones, except for "DVDPlayer" and "CDPlayer". I know I could tokenize them manually, but can you show me a regex that handles all the cases? EDIT: the expected tokens are: OneTwo -> One Two ......
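A minimal sketch of one possible regex, written in Python. The expected output in the question is cut off, so this assumes that a run of capitals followed by a capitalized word is an acronym (DVDPlayer -> DVD Player); the names `CAMEL_TOKEN` and `split_camel` are illustrative only.

```python
import re

# Alternatives are tried left to right:
#   1. a run of capitals that is followed by a capital + lowercase ("DVD" in "DVDPlayer")
#   2. an optional capital followed by lowercase letters ("Player", "One")
#   3. any remaining run of capitals (a trailing acronym)
CAMEL_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def split_camel(word):
    """Split a camel-case word into its parts (assumed acronym rule, see above)."""
    return CAMEL_TOKEN.findall(word)


for sample in ["OneTwo", "ThreeFour", "AnotherString", "DVDPlayer", "CDPlayer"]:
    print(sample, "->", " ".join(split_camel(sample)))
```

The lookahead in the first alternative is what keeps the last capital of "DVDP" attached to "Player" rather than to the acronym.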

Parsing Classes, Functions and Arguments in PHP

I want to create a function which receives a single argument that holds the path to a PHP file and then parses the given file and returns something like this: class NameOfTheClass function Method1($arg1, $arg2, $arg2) private function Method2($arg1, $arg2, $arg2) public function Method2($arg1, $arg2, $arg2) abstract class Anot...

How to get data between quotes in Java?

I have these lines of text, where the number of quotes can vary, like: Here just one "comillas" But I also could have more "mas" values in "comillas" and that "is" the "trick" I was thinking in a method that return "a" list of "words" that "are" between "comillas" How do I obtain the data between the quotes? The result should be: www.eg....
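One way to sketch the idea, shown here in Python rather than Java: a regex with a capture group that matches everything between a pair of double quotes. The function name `quoted_parts` is hypothetical, and the sketch assumes quotes are always balanced and never escaped.

```python
import re

# "([^"]*)" captures any run of non-quote characters between two double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def quoted_parts(line):
    """Return the list of substrings found between double quotes."""
    return QUOTED.findall(line)


line = 'But I also could have more "mas" values in "comillas" and that "is" the "trick"'
print(quoted_parts(line))
```

The same pattern should carry over to `java.util.regex`, where the captured group is read from each successive match.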

How to create a BMP file from RGB data stored in a txt file?

Hi guys, I have to create a BMP image from two txt files. The first one is an m x n array: * * * * * * * * m n c11 c21 .. cm1 ... c1n c2n .. cmn * * * * * * * * * * * * * * * * 6 5 .7 .7 .6 1.0 1.2 .1 .9 .3 .7 1.1 .7 .2 1 1.1 1.2 1.3 1.7 .6 .5 .6 .5 .4 .9 .1101 2 .1 .1 .1 2.1 1.1 * * * * * * * * The second txt file is a color scale, like thi...

Tokenizer, Stop Word Removal, Stemming in Java

Hi there, I am looking for a class or method that takes a long string of many hundreds of words and tokenizes it, removes the stop words and stems the rest for use in an IR system. For example: "The big fat cat, said 'your funniest guy i know' to the kangaroo..." the tokenizer would remove the punctuation and return an ArrayList of words the stop wo...
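The three stages described above can be sketched roughly as follows, in Python rather than Java. Everything here is an assumption for illustration: `STOP_WORDS` is a tiny made-up list, and `crude_stem` is a deliberately naive suffix stripper, not the Porter algorithm or any library's stemmer.

```python
import re

# Hypothetical, very small stop-word list; a real IR system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "to", "i", "of", "and"}


def crude_stem(word):
    """Strip one common suffix if at least three characters remain (naive placeholder)."""
    for suffix in ("iest", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop punctuation, remove stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(index_terms("The big fat cat, said 'your funniest guy i know' to the kangaroo..."))
```

In practice the stemming step is usually delegated to an established stemmer; the point of the sketch is only the order of the stages.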

The intricacy of a string tokenization function in C

To brush up on my C, I'm writing some useful library code. When it comes to reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous). Having written this function, I'm amazed at its intricacy. To tell the truth, I'm almost convi...

Dealing with Tokens in C#

I have the following assignment for homework. Requirements: design a class called TokenGiver with the following elements: a default constructor; a parameterized constructor that takes an int; a method that adds a specified number of tokens to the number of tokens; a method that subtracts exactly ONE token from your number of tokens; a m...

JavaScript lexer / tokenizer (in Python?)

Does anyone know of a JavaScript lexical analyzer or tokenizer (preferably in Python)? Basically, given an arbitrary JavaScript file, I want to grab the tokens. e.g. foo = 1 becomes something like: variable name : "foo" whitespace operator : equals whitespace integer : 1 ...
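To make the desired output concrete, here is a minimal regex-driven sketch in Python. It is nowhere near a real JavaScript lexer (no strings, comments, regex literals or multi-character operators); the token kinds in `TOKEN_SPEC` and the function `lex` are assumptions chosen to mirror the `foo = 1` example.

```python
import re

# Each entry is (token kind, regex). Order matters when patterns could overlap.
TOKEN_SPEC = [
    ("whitespace", r"\s+"),
    ("integer", r"\d+"),
    ("name", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("operator", r"[=+\-*/]"),
]
# One alternation of named groups; match.lastgroup tells us which kind matched.
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def lex(source):
    """Return a list of (kind, text) pairs, or raise on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, text in lex("foo = 1"):
    print(kind, repr(text))
```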

String Tokenizer

Can anybody help me understand how this string tokenizer works by adding some comments into the code? I would very much appreciate any help thanks! public String[] split(String toSplit, char delim, boolean ignoreEmpty) { StringBuffer buffer = new StringBuffer(); Stack stringStack = new Stack(); for (int i = 0; i < toSplit....
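The Java code in the question is cut off, so the following is not a transcription of it. It is a commented Python sketch of what a method with that signature (`toSplit`, `delim`, `ignoreEmpty`) plausibly does, under the assumption that it collects characters into a buffer and emits a token at each delimiter.

```python
def split(to_split, delim, ignore_empty):
    """Split to_split on the single character delim.

    When ignore_empty is true, empty tokens (from adjacent, leading or
    trailing delimiters) are dropped.
    """
    tokens = []
    buffer = []  # characters of the token currently being built
    for ch in to_split:
        if ch == delim:
            # A delimiter ends the current token.
            token = "".join(buffer)
            if token or not ignore_empty:
                tokens.append(token)
            buffer = []
        else:
            buffer.append(ch)
    # Whatever is left after the last delimiter is the final token.
    token = "".join(buffer)
    if token or not ignore_empty:
        tokens.append(token)
    return tokens


print(split("a,,b,", ",", True))
print(split("a,,b,", ",", False))
```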

Lexers/tokenizers and character sets

When constructing a lexer/tokenizer, is it a mistake to rely on functions (in C) such as isdigit/isalpha/...? They are dependent on locale as far as I know. Should I pick a character set and concentrate on it and make a character mapping myself from which I look up classifications? Then the problem becomes being able to lex multiple chara...

Java: string tokenizer and assign to 2 variables?

Hi, let's say I have a time hh:mm (e.g. 11:22) and I want to use a string tokenizer to split it. After it's split I am able to get, for example, 11 and on the next line 22. But how do I assign 11 to a variable named "hour" and 22 to another variable named "min"? Also another question: how do I round up a number? Even if it's 2.1 I want it to rou...
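Both parts of the question can be illustrated with a short Python sketch (the question itself is about Java, where `String.split(":")` and `Math.ceil` play the same roles). The helper name `parse_time` is hypothetical, and the input is assumed to be well-formed hh:mm.

```python
import math


def parse_time(text):
    """Split 'hh:mm' on the colon and return (hour, minute) as integers."""
    hour_text, minute_text = text.split(":")
    return int(hour_text), int(minute_text)


hour, minute = parse_time("11:22")
print(hour, minute)

# Rounding up: ceil returns the smallest integer that is >= the argument.
print(math.ceil(2.1))
```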

Wind blowing on String

I have a basic idea of how to do this task, but I'm not sure if I'm doing it right. So we have a class WindyString with a method blow. After using it: System.out.println(WindyString.blow( "Abrakadabra! The second chance to pass has already BEGUN! ")); we should obtain something like this: e ...

Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain. Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. Performance is very important; I will parse many gigabytes of text. Edit: Please note that the ...

A string tokenizer in C++ that allows multiple separators

Possible Duplicate: C++: How to split a string? Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done: string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmptyEntries); ...
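For comparison, the behavior being asked for (split on any of several separators and drop empty tokens) looks like this as a Python sketch; the function name `tokenize` is illustrative. A C++ version would follow the same logic, for example by scanning with `std::string::find_first_of`.

```python
import re


def tokenize(text, separators):
    """Split text on any of the given separator strings, discarding empty tokens."""
    # re.escape keeps separators such as "." or "|" from being read as regex syntax.
    pattern = "|".join(re.escape(separator) for separator in separators)
    return [token for token in re.split(pattern, text) if token]


print(tokenize("adsl, dkks; dk", [",", " ", ";"]))
```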

Is there a tokenizer for a cpp file?

I have a cpp file with a huge class implementation. Now I have to modify the source file itself. For this, is there a library/API/tool that will tokenize this file for me and give me one token each time I request it? My requirement is as below. OpenCPPFile() While (!EOF) token = GetNextToken(); process something based on this token...

What is the difference between EdgeNGramTokenizerFactory and EdgeNGramFilterFactory in Solr?

What is the difference between these two filters? They seem to have the same effect? Can anyone supply an example of how they are applied to some text? Thanks ...
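As a rough mental model only (this is plain Python, not Solr code, and real behavior depends on the field type configuration): a tokenizer sees the whole input value, while a filter sees each token produced by an upstream tokenizer. The sketch assumes the upstream tokenizer splits on whitespace; all function names are hypothetical.

```python
def edge_ngrams(text, min_gram, max_gram):
    """Prefixes of text with lengths from min_gram to max_gram (front edge only)."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


def tokenizer_style(value, min_gram, max_gram):
    """Model of an edge n-gram tokenizer: n-grams are taken from the whole input."""
    return edge_ngrams(value, min_gram, max_gram)


def filter_style(value, min_gram, max_gram):
    """Model of an edge n-gram filter applied after a whitespace tokenizer."""
    grams = []
    for token in value.split():
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


print(tokenizer_style("hello world", 1, 3))
print(filter_style("hello world", 1, 3))
```

Under this model the two give the same result for single-word input, which may be why they seem to have the same effect, and differ once the input contains several words.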

C++: what is the advantage of lex and bison over a self-made tokenizer / parser?

Hi, I would like to do some parsing and tokenizing in C++ for learning purposes. I often come across bison/yacc and lex when reading about this subject online. Would there be any major benefit of using those over, for instance, a tokenizer/parser written using STL or boost::regex or maybe even just C? ...

How do I parse a complex file format in Delphi? (Not CSV, XML, etc)

It's been a few years since I've had to parse any files which were harder than CSV or XML, so I am out of practice. I've been given the task of parsing a file format called NeXus in a Delphi application. The problem is I just don't know where to start: do I use a tokenizer, regex, etc? Maybe even a tutorial might be what I need at this...

C Tokenizer - How does it work?

How does this work? I know to use it you pass in: start: string (e.g. "Item 1, Item 2, Item 3") delim: delimiter string (e.g. ",") tok: reference to a string which will hold the token nextpos (optional): reference to the position in the original string where the next token starts sdelim (optional): pointer to a character which will ...

Using multiple tokenizers in Solr

Excuse me if this is a dumb question. I was just thrown into this task, so I don't know much about Solr, indexing, etc. But basically what we want to be able to do is perform a query and get results back that are not case sensitive and that match partial words from the index. We have a Solr schema set up at the moment that has been mo...