questions about tokenizing | ansaurus

tokenizing

Tokenize from a textfile reading into an array in C

How do you tokenize when you read from a file in C?

textfile:
PES 2009;Konami;DVD 3;500.25; 6
Assasins Creed;Ubisoft;DVD;598.25; 3
Inferno;EA;DVD 2;650.25; 7

char *tokenPtr;
fileT = fopen("DATA2.txt", "r"); /* this will not work */
tokenPtr = strtok(fileT, ";");
while(tokenPtr != NULL ) {
    printf("%s\n", tokenPtr);
    tokenPtr ...

Recursive woes - reducing an input string

Hi all, I'm working on a portion of code that is essentially trying to reduce a list of strings down to a single string recursively. I have an internal database built up of matching string arrays of varying length (say array lengths of 2-4). An example input string array would be: {"The", "dog", "ran", "away"} And for further examp...

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:

the quick "brown fox" jumps over the "lazy dog"

I would like to have a string array with the following tokens:

the
quick
brown fox
jumps
over
the
lazy dog

As you can see, the tokens preserve the spaces within ...
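A minimal sketch of one way to get that token list, in Python with the standard `re` module. The function name `tokenize_query` is made up for illustration, and the handling of edge cases (empty quotes are dropped, an unbalanced quote stays attached to its word) is an assumption, since the question is cut off before it covers them.

```python
import re

# Either a double-quoted phrase (captured without the quotes)
# or a run of non-whitespace characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def tokenize_query(query):
    """Split a search query into words and quoted phrases."""
    tokens = []
    for quoted, bare in _TOKEN.findall(query):
        token = quoted if quoted else bare
        if token:  # skip empty phrases such as ""
            tokens.append(token)
    return tokens
```

Calling `tokenize_query('the quick "brown fox" jumps over the "lazy dog"')` keeps `brown fox` and `lazy dog` together as single tokens.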

Is there anything like PPI or Perl::Critic for C?

PPI and Perl::Critic allow programmers to detect certain things in the syntax of their Perl programs. Is there anything like it that will tokenize/parse C and give you a chance to write a script to do something with that information? ...

auto-tokenize user agents strings for statistics?

We keep track of user agent strings in our website. I want to do some statistics on them, to see how many IE6 users we have (so we know what we have to develop against), and also how many mobile users we have. So we have log entries like this:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)
Mozilla/4.0 (compatible; ...
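One rough way to get such counts is to classify each string with a couple of patterns and tally the results. The sketch below is in Python; the names `classify` and `summarize` and the list of mobile hints are assumptions for illustration, not a complete user agent parser.

```python
import re
from collections import Counter

# Matches the major version in strings such as "MSIE 7.0".
MSIE = re.compile(r"MSIE (\d+)")

# Assumed substrings that hint at a mobile browser; extend as needed.
MOBILE_HINTS = ("Mobile", "iPhone", "Android", "BlackBerry")


def classify(user_agent):
    """Return a coarse bucket such as 'IE6', 'mobile' or 'other'."""
    match = MSIE.search(user_agent)
    if match:
        return "IE" + match.group(1)
    if any(hint in user_agent for hint in MOBILE_HINTS):
        return "mobile"
    return "other"


def summarize(user_agents):
    """Count how many log entries fall into each bucket."""
    return Counter(classify(ua) for ua in user_agents)
```

`summarize(lines)` returns a `Counter`, so `summarize(lines)["IE6"]` gives the number of IE6 entries.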

text-processing

Lucene Query WITHOUT Operators

I am trying to use Lucene to search for names in a database. However, some of the names contain words like "NOT" and "OR" and even "-" minus symbols. I still want the different tokens inside the names to be broken up using an Analyzer and searched upon as a boolean combination of terms, but I do not want Lucene to interpret any of the "N...

Split Strings and arrange db to display products in PHP

Hello, I'm new to PHP. Could you please help me find a way to properly handle the following task:

Table "Products"
id - details
1 - 1-30,2-134:6:0;;2-7:55:0;;1-2,2-8:25:0 - where this string can be very long
2 -
3 - 1-360:17:0;;1-361:185:0

Every product (1, 2, 3, ...) is stored in the db in one row, although product is addition...

Java Shell wildcard tokenizer

My Java is extremely rusty and I'm stuck trying to make a user interface that simplifies the execution of shell scripts or batch files depending on whether it's Linux or Win32 respectively. The files have the following naming convention. module-verb-object-etc [args-list] mysql-connect-grid mysql-connect-rds mysql-dump-grid m...

directory-listing

java-util-scanner

import phone numbers from a string in vb.net

There has got to be an easier way to do this. I am trying to write a function for a Phone number class called "import phone number". It should take any string with 10 digits in it somewhere (and allow for an extension), and import them into its own properties: AreaCode, Prefix, Suffix, and Extension (aaa-ppp-ssss-xxxx...). I check the...

token_get_all whitespaces behaviour

I don't know if someone can help me, but I'll ask anyway. I'm creating a function like PHP's token_get_all, written in JavaScript. This function should "tokenize" a given piece of PHP code, but I have some problems with whitespace. Executing the token_get_all function in PHP, I see that only some whitespaces are considered tokens, the other one...

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source file parsing to an existing tool that generated output files from complex command line arguments. The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it were a very large command line, but the syntax was still awkward. So I added the ability...

Bad Pointer? - C++

Hi there, I'm writing a string tokenization program that uses pointers for a homework assignment in C++. However, when I run and debug it, it says that my pointer pStart is invalid. I have a feeling that my problem resides in my parameterized constructor. I've included both the constructor and the object creation below. I would appreciate it...

Access Violation With Pointers? - C++

Hi, I've written a simple string tokenizing program using pointers for a recent school project. However, I'm having trouble with my StringTokenizer::Next() method, which, when called, is supposed to return a pointer to the first letter of the next word in the char array. I get no compile-time errors, but I get a runtime error which stat...

Can Regular Expressions Achieve This?

Hi, I'm trying to split a string into tokens (via regular expressions) in the following way:

Example #1
input string: 'hello'
first token: '
second token: hello
third token: '

Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '

Example #3
input string: hello world
first token: hello
second t...

Why isn't C++ strtok() working for me?

The program is supposed to receive an input through cin, tokenize it, and then output each one to show me that it worked properly. It did not. The program compiles with no errors, and takes an input, but fails to output anything. What am I doing wrong? int main(int argc, char* argv[]) { string input_line; while(std::cin >> input_...

how to remove whitespace while scanning text in java

I've implemented several different "scanners" in Java, from the Scanner class to simply using String.split("\ss+"), but when there are several whitespaces in a row like "the_quick____brown___fox" they all tokenize certain white spaces (imagine the underscores are whitespaces). Any suggestions? ...
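The usual idea here is to treat a whole run of whitespace as one separator and to throw away any empty tokens. The sketch below shows that idea in Python rather than Java; `split_on_whitespace` is a made-up name, and it only illustrates the pattern, not the asker's scanners.

```python
import re


def split_on_whitespace(text):
    """Split on runs of whitespace, dropping empty tokens."""
    # \s+ matches one or more whitespace characters, so several
    # spaces in a row count as a single separator. Leading or
    # trailing whitespace produces empty strings, which are filtered out.
    return [token for token in re.split(r"\s+", text) if token]
```

With the input `"the quick    brown   fox"` this yields four tokens and no empty strings.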

How to rewrite a stream of HTML tokens into a new document?

Suppose I have an HTML document that I have tokenized. How could I transform it into a new document or apply some other transformations? For example, suppose I have this HTML:

<html> <body> <p><a href="/foo">text</a></p> <p>Hello <span class="green">world</span></p> </body> </html>

What I have currently written is a tokenizer t...

Calculation Expression Parser with Nesting and Variables in ActionScript

Hi There, I'm trying to enable dynamic fields in the configuration file for my mapping app, but I can't figure out how to parse the "equation" passed in by the user, at least not without writing a whole parser from scratch! I'm sure there is some easier way to do this, and so I'm asking for ideas! Basic idea: public var testString:Stri...

Parsing/Tokenizing a String Containing a SQL Command

Are there any open source libraries (any language, python/PHP preferred) that will tokenize/parse an ANSI SQL string into its various components? That is, if I had the following string SELECT a.foo, b.baz, a.bar FROM TABLE_A a LEFT JOIN TABLE_B b ON a.id = b.id WHERE baz = 'snafu'; I'd get back a data structure/object somethin...
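Short of a full parser, a small regex-based lexer already splits such a statement into typed tokens. The sketch below is in Python with only the standard library; the token categories, the keyword list, and the name `tokenize_sql` are assumptions chosen to cover the example query, not a complete ANSI SQL grammar.

```python
import re

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("STRING", r"'(?:[^']|'')*'"),        # single-quoted literal, '' escapes a quote
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"<>|<=|>=|[=<>*+\-/]"),
    ("PUNCT", r"[(),.;]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),                   # anything else is an error
]

# Assumed, deliberately short keyword list.
KEYWORDS = {"SELECT", "FROM", "LEFT", "JOIN", "ON", "WHERE", "AND", "OR"}

MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize_sql(sql):
    """Return a list of (kind, text) pairs for a SQL string."""
    tokens = []
    for match in MASTER.finditer(sql):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {text!r}")
        if kind == "IDENT" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens
```

The resulting flat token list could then be grouped into clauses (select list, tables, conditions) by scanning for the `KEYWORD` entries.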

Token replacement

Hey, I currently implement a replace function in the page render method which replaces commonly used strings - such as replace [cfe] with the root to the customer front end. This is because the value may be different based on the version of the site - for example the root to the image folder ([imagepath]) is /Images on development and l...
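For reference, the kind of bracketed token replacement described here can be done in a single pass with a lookup table. This is a Python sketch under assumptions: the function name `render_tokens` and the value used for `cfe` are hypothetical, and unknown tokens are simply left in place.

```python
import re

# A token is a word wrapped in square brackets, e.g. [imagepath].
_TOKEN = re.compile(r"\[(\w+)\]")


def render_tokens(text, values):
    """Replace each [name] in text with values[name], if present."""
    def substitute(match):
        # Leave tokens without a configured value untouched.
        return values.get(match.group(1), match.group(0))

    return _TOKEN.sub(substitute, text)


# Per-environment values; "/Images" for development comes from the
# question, the "cfe" value is a made-up placeholder.
DEVELOPMENT = {"imagepath": "/Images", "cfe": "/customer"}
```

`render_tokens('<img src="[imagepath]/logo.png">', DEVELOPMENT)` then produces the development image path, and swapping in a different dictionary covers another version of the site.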

stringtokenizer

string-replacement
