tokenizing

Problem with using getline and strtok together in a program

In the program below, I intend to read each line in a file into a string, break down the string, and display the individual words. The problem I am facing is that the program now outputs only the first line in the file. I do not understand why this is happening. #include<iostream> #include<string> #include<fstream> #include<cstdio> using ...

how to convert csv to table in oracle

How can I make a package that returns results in table format when passed CSV values? select * from table(schema.mypackage.myfunction('one, two, three')) should return one two three I tried something from Ask Tom, but that only works with SQL types. I am using Oracle 11g. Is there something built-in? ...

How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first and then parses it. What does tokenize mean? Does the parser read each character, building up a multidimensional array to store the structure? For example, does it read a < and then begin to capture the element, and then once it meets ...
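To make "tokenize" concrete: the lexer turns the character stream into discrete tokens (tag-open, text, tag-close), and the parser then builds a tree from those tokens rather than a multidimensional array. A toy C sketch of such a lexer (illustrative only, not a real HTML spec; it assumes well-formed `<name>` tags):

```c
#include <stdio.h>
#include <string.h>

/* Token kinds a minimal HTML-like lexer might emit. */
typedef enum { TOK_TAG_OPEN, TOK_TAG_CLOSE, TOK_TEXT, TOK_END } TokKind;

typedef struct {
    TokKind kind;
    char text[64];   /* the token's lexeme */
} Token;

/* Scan the next token starting at *pos; advances *pos past it.
 * Assumes well-formed input: every '<' has a matching '>'. */
Token next_token(const char *input, size_t *pos) {
    Token t = { TOK_END, "" };
    const char *p = input + *pos;
    if (*p == '\0') return t;
    if (*p == '<') {                           /* '<' starts a tag token */
        int closing = (p[1] == '/');
        const char *start = p + 1 + closing;
        const char *end = strchr(start, '>');
        size_t n = (size_t)(end - start);
        memcpy(t.text, start, n);
        t.text[n] = '\0';
        t.kind = closing ? TOK_TAG_CLOSE : TOK_TAG_OPEN;
        *pos += (size_t)(end - p) + 1;
    } else {                                   /* plain text runs up to the next '<' */
        const char *end = strchr(p, '<');
        size_t n = end ? (size_t)(end - p) : strlen(p);
        memcpy(t.text, p, n);
        t.text[n] = '\0';
        t.kind = TOK_TEXT;
        *pos += n;
    }
    return t;
}
```

Feeding it `"<b>hi</b>"` yields three tokens in order: tag-open `b`, text `hi`, tag-close `b` — the parser would consume that stream and produce a tree node for `b` containing the text child.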

how to split a string in shell and get the last field

Hi, suppose I have the string "1:2:3:4:5" and I want to get its last field ("5" in this case). How do I do that using Bash? I tried cut, but I don't know how to specify the last field with -f. ...

Problem with UI Layout plugin in combination with Tokenizing autocomplete plugin, MVC .NET application

Hello, I have a problem with the jQuery Layout plugin in combination with the Tokenizing Autocomplete plugin. When I click the close bar on one of the panes, inside which sits an input text box with the tokenizing autocomplete plugin, the div closes. But when I reopen it, I find 2 input text boxes with the tokenizing plugin. Does anyone have a solution for this? Need to preve...

C Tokenizer - How does it work?

How does this work? I know that to use it you pass in: start: string (e.g. "Item 1, Item 2, Item 3") delim: delimiter string (e.g. ",") tok: reference to a string which will hold the token nextpos (optional): reference to the position in the original string where the next token starts sdelim (optional): pointer to a character which will ...
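The original tokenizer's code is not shown, but a hypothetical C implementation consistent with the described interface might look like this (the name `get_token` and the exact signature are illustrative assumptions, not the original API):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the described interface: copies the next token
 * from `start + *nextpos` into `tok`, and advances *nextpos past the
 * delimiter so the next call resumes there. Returns 1 if a token was
 * produced, 0 at end of input. */
int get_token(const char *start, const char *delim,
              char *tok, size_t toksize, size_t *nextpos) {
    const char *p = start + *nextpos;
    if (*p == '\0') return 0;
    size_t len = strcspn(p, delim);            /* span up to next delimiter char */
    size_t n = (len < toksize) ? len : toksize - 1;
    memcpy(tok, p, n);
    tok[n] = '\0';
    *nextpos += len;
    if (start[*nextpos] != '\0') (*nextpos)++; /* step over one delimiter */
    return 1;
}
```

The essential mechanism is the `nextpos` cursor: unlike strtok, which hides its position in static state, this style keeps the scan position explicit, so the caller can tokenize several strings concurrently.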

Trying to tokenize a string separated by commas

I'm a novice, so be gentle. I'm trying to read a file of strings such as "1,Duck_Soup,1933,Comedy,5,12" and tokenize it into different vars, one for each of the tokens between the commas. That's my code, and I keep getting "segmentation fault" no matter what I try. Please help me fix the code, thank you. For starters I want to make it print the tok...

Syntax-aware substring replacement

I have a string containing a valid Clojure form. I want to replace a part of it, just like with assoc-in, but processing the whole string as tokens. => (assoc-in [:a [:b :c]] [1 0] :new) [:a [:new :c]] => (assoc-in [:a [:b,, :c]] [1 0] :new) [:a [:new :c]] => (string-assoc-in "[:a [:b,, :c]]" [1...

Splitting strings/tokens

Hi, is there a better way to read tokens from a file in Java? I am currently using StringTokenizer for splitting the tokens, but it can be quite inefficient in most cases, as you have to read token by token. Thank you ...

Parser vs. lexer and XML

I'm reading about compiler and parser architecture now, and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here, and what would be the tokens? I've read that tokens are like words prepared for parsing by the lexer. Although I don't have a problem with finding tokens...

Question regarding regex and tokenizing

I need to make a tokenizer that is able to tokenize English words. Currently, I'm stuck with characters that can be part of a URL expression. For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't really segment them. My question is, can this be expressed in regex? I have the regex \b(?:(?:https?|ftp|file)://|www\.|ft...

Lucene.NET: Camel case tokenizer?

I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat the whole camel case source code identifier name as a single token. I'm looking for a way to treat camel case identifiers like MaxWidth into three tok...
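Independent of the Lucene.NET API, the splitting rule itself is simple: start a new token at every uppercase letter. A minimal C sketch of that rule (the function name and buffer sizes are illustrative; this is not how Lucene tokenizers are written, just the logic they would apply):

```c
#include <ctype.h>
#include <string.h>

/* Split a camel-case identifier into sub-tokens: a new token starts at
 * each uppercase letter. Writes up to `max` tokens into `out` and
 * returns how many were produced. Sketch only -- not the Lucene.NET API. */
int split_camel(const char *ident, char out[][32], int max) {
    int ntok = 0, len = 0;
    for (const char *p = ident; *p; p++) {
        if (isupper((unsigned char)*p) && len > 0 && ntok < max) {
            out[ntok][len] = '\0';   /* close the current token */
            ntok++;
            len = 0;
        }
        if (ntok < max && len < 31)
            out[ntok][len++] = *p;
    }
    if (len > 0 && ntok < max) {     /* close the final token */
        out[ntok][len] = '\0';
        ntok++;
    }
    return ntok;
}
```

In Lucene terms, this logic would live in a custom tokenizer or token filter that emits `Max` and `Width` (typically lowercased downstream) in place of the single `MaxWidth` token.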

Tokenization of a text file with frequency and line occurrence. Using C++.

Hello, once again I ask for help. I haven't coded anything for some time! Now I have a text file filled with random gibberish. I already have a basic idea of how I will count the number of occurrences per word. What really stumps me is how I will determine what line the word is in. Gut instinct tells me to look for the newline character...

Is there a function to split a string in plsql?

I need to write a procedure to normalize a record that has multiple tokens concatenated by one char. I need to obtain these tokens by splitting the string and insert each one as a new record in a table. Does Oracle have something like a "split" function? ...

Split column to multiple rows

Hi! I have a table with a column that contains multiple values separated by a comma (,) and would like to split it so I get each Site on its own row, but with the same Number in front. So my select would work from this input table Sitetable Number Site 952240 2-78,2-89 ...

Java : The constructor JSONTokener(InputStreamReader) is undefined

Hi, I have quite a strange issue with Java. I'm getting an error on some machines only, and I would like to know if there is any way I can avoid that. This is the line of code concerned: JSONTokener jsonTokener = new JSONTokener( new InputStreamReader(is, "UTF-8")); This is the error I get on some machines The file *.j...

MALLET tokenizer

Hi, I want to use MALLET's topic modeling, but can I provide my own tokenizer or a tokenized version of the text documents when I import the data into MALLET? I find MALLET's tokenizer inadequate for my usage... ...

How to use Stanford NLP API to retrieve phrases or tokens from NL query?

I need phrases returned from the Stanford parser to use in my program. ...

Indexing n-word expressions as a single term in Lucene

I want to index a "compound word" like "New York" as a single term in Lucene not like "new", "york". In such a way that if someone searches for "new place", documents containing "new york" won't match. I think this is not the case for N-grams (actually NGramTokenizer), because I won't index just any n-gram, I want to index only some spe...

Why is my string parsed differently via strtok on Windows and Linux?

In my program I'm cutting up my char* with strtok. When I check on Windows it's cut like I want, but when I do the same thing on Linux, it comes out wrong. Example: Windows: my char* (line) is "1,21-344-32,blabla"; the first time I call strtok I get "1", the second time I get "21-344-32". Linux: my char* (lin...