I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer that will split them into tokens and return the token type of each one, e.g. for the script
print "Hello, World!\n";
it would return something like this:
keyword 5 bytes
whitespace 1 byte
double-quoted-string 17 bytes
semicolon 1 byte
whitespace 1 b...
I've seen a couple of JavaScript tokenizers written in Python and a cryptic document on Mozilla.org about a JavaScript lexer, but I can't find any JavaScript tokenizers written in PHP specifically. Are there any?
Thanks
...
I want to use a string tokenizer on a C++ string (std::string), but all I could find works on char*.
Is there anything similar for a C++ string?
Thanks in advance
...
Does anyone here have experience with writing custom FTS3 (the SQLite full-text-search extension) tokenizers? I'm looking for a tokenizer that will ignore HTML tags.
Thanks.
...
I need to make a tokenizer that is able to extract English words.
Currently, I'm stuck on characters that can be part of a URL expression.
For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't really segment on them.
My question is: can this be expressed in a regex? I have the regex
\b(?:(?:https?|ftp|file)://|www\.|ft...
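A rough sketch of one way this can work, shown here in Java (the URL branch of the pattern is only a simplified placeholder for the fuller expression above): rather than splitting on punctuation, scan with an alternation that tries the URL pattern first and falls back to plain words, so ':', '?' and '=' survive when they occur inside a URL match.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UrlAwareTokenizer {
        // Alternation: try the URL pattern first, then fall back to plain words.
        // The URL branch is a simplified stand-in for the fuller
        // https?/ftp/file/www pattern quoted above.
        private static final Pattern TOKEN = Pattern.compile(
                "(?:https?|ftp|file)://\\S+|www\\.\\S+|\\w+");

        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // The URL comes back as one token; ':', '?' and '=' inside it are not split on.
            System.out.println(tokenize(
                    "See http://example.com/search?q=x&lang=en for details."));
        }
    }
...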
I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat a whole camel-case source-code identifier as a single token.
I'm looking for a way to split camel-case identifiers like MaxWidth into three tok...
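Not Lucene.NET-specific, but the splitting step itself can be sketched in a few lines; a rough Java illustration of the camel-case boundary logic that a custom analyzer or token filter could apply (the boundary pattern is only an approximation):

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class CamelCaseSplitter {
        // Zero-width boundaries: before an upper-case letter that follows a
        // lower-case letter, or between an acronym and the word after it.
        private static final Pattern CAMEL_BOUNDARY =
                Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

        public static List<String> split(String identifier) {
            return Arrays.asList(CAMEL_BOUNDARY.split(identifier));
        }

        public static void main(String[] args) {
            System.out.println(split("MaxWidth"));       // [Max, Width]
            System.out.println(split("maxLineWidth"));   // [max, Line, Width]
            System.out.println(split("XMLHttpRequest")); // [XML, Http, Request]
        }
    }
...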
Hi,
Sorry for the kind of noob question, but I'm having issues trying to get a tokenizer working. I tried this example, but on the line with the Tokenize() call I get a "Type mismatch" error. I've also tried to use Split with a very similar outcome.
The server is running IIS and is pretty old, if that helps at all. Sorry, I've never used ASP / .NET before.
...
Hi, I want to use MALLET's topic modeling, but can I provide my own tokenizer or a pre-tokenized version of the text documents when I import the data into MALLET? I find MALLET's tokenizer inadequate for my usage...
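For what it's worth, MALLET's Java API does let you control tokenization at import time by assembling the pipe sequence yourself, along the lines of its importing-data examples; a rough sketch (the file name docs.tsv, the token regex, and the field layout are placeholders to adapt):

    import java.io.File;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.CharSequenceLowercase;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.InstanceList;

    public class CustomTokenizationImport {
        public static void main(String[] args) throws Exception {
            ArrayList<Pipe> pipes = new ArrayList<Pipe>();
            pipes.add(new CharSequenceLowercase());
            // Put your own token pattern here, or replace this pipe with a
            // custom Pipe that maps a CharSequence to a TokenSequence.
            pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipes.add(new TokenSequence2FeatureSequence());

            InstanceList instances = new InstanceList(new SerialPipes(pipes));
            // docs.tsv is a placeholder: one document per line, name<TAB>label<TAB>text.
            instances.addThruPipe(new CsvIterator(
                    new FileReader("docs.tsv"),
                    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                    3, 2, 1)); // regex groups for data, label, name
            instances.save(new File("docs.mallet"));
        }
    }

The saved docs.mallet file can then be passed to the train-topics command. Alternatively, if you pre-tokenize the documents yourself and join the tokens with spaces, importing with the command-line tool's --token-regex option (e.g. '\S+') should keep your tokenization intact.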
...