I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer that will split them into tokens and return the token type of each one, e.g. for the script
print "Hello, World!\n";
it would return something like this:
keyword 5 bytes
whitespace 1 byte
double-quoted-string 17 bytes
semicolon 1 byte
whitespace 1 b...
I've seen a couple of JavaScript tokenizers written in Python and a cryptic document on Mozilla.org about a JavaScript lexer, but I can't find any JavaScript tokenizers written in PHP specifically. Are there any?
Thanks
...
I want to use a string tokenizer on a C++ string (std::string), but all I could find works on char*.
Is there anything similar for a C++ string?
Thanks in advance
...
Does anyone here have experience with writing custom FTS3 (the SQLite full-text-search extension) tokenizers? I'm looking for a tokenizer that will ignore HTML tags.
Thanks.
...
I need to make a tokenizer that is able to extract English words.
Currently, I'm stuck on characters that can be part of a URL expression.
For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't really segment on them.
My question is: can this be expressed in a regex? I have the regex
\b(?:(?:https?|ftp|file)://|www\.|ft...
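A rough sketch of one way this can work, shown here in Java (the URL branch of the pattern is only a simplified placeholder for the fuller expression above): rather than splitting on punctuation, scan with an alternation that tries the URL pattern first and falls back to plain words, so ':', '?' and '=' survive when they occur inside a URL match.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UrlAwareTokenizer {
        // Alternation: try the URL pattern first, then fall back to plain words.
        // The URL branch is a simplified stand-in for the fuller
        // https?/ftp/file/www pattern quoted above.
        private static final Pattern TOKEN = Pattern.compile(
                "(?:https?|ftp|file)://\\S+|www\\.\\S+|\\w+");

        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // The URL comes back as one token; ':', '?' and '=' inside it are not split on.
            System.out.println(tokenize(
                    "See http://example.com/search?q=x&lang=en for details."));
        }
    }
...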
I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat a whole camel-case source-code identifier as a single token.
I'm looking for a way to split camel-case identifiers like MaxWidth into three tok...
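Not Lucene.NET-specific, but the splitting step itself can be sketched in a few lines; a rough Java illustration of the camel-case boundary logic that a custom analyzer or token filter could apply (the boundary pattern is only an approximation):

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class CamelCaseSplitter {
        // Zero-width boundaries: before an upper-case letter that follows a
        // lower-case letter, or between an acronym and the word after it.
        private static final Pattern CAMEL_BOUNDARY =
                Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

        public static List<String> split(String identifier) {
            return Arrays.asList(CAMEL_BOUNDARY.split(identifier));
        }

        public static void main(String[] args) {
            System.out.println(split("MaxWidth"));       // [Max, Width]
            System.out.println(split("maxLineWidth"));   // [max, Line, Width]
            System.out.println(split("XMLHttpRequest")); // [XML, Http, Request]
        }
    }
...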
Hi,
Sorry for the kind of noob question, but I'm having issues trying to get a tokenizer working. I tried this example, but on the line with the Tokenize() call I get a "Type mismatch" error. I've also tried to use Split with a very similar outcome.
The server is running IIS and is pretty old, if that helps at all. Sorry, I've never used ASP / .NET before.
...
Hi, I want to use MALLET's topic modeling, but can I provide my own tokenizer or a pre-tokenized version of the text documents when I import the data into MALLET? I find MALLET's tokenizer inadequate for my usage...
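For what it's worth, MALLET's Java API does let you control tokenization at import time by assembling the pipe sequence yourself, along the lines of its importing-data examples; a rough sketch (the file name docs.tsv, the token regex, and the field layout are placeholders to adapt):

    import java.io.File;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.CharSequenceLowercase;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.InstanceList;

    public class CustomTokenizationImport {
        public static void main(String[] args) throws Exception {
            ArrayList<Pipe> pipes = new ArrayList<Pipe>();
            pipes.add(new CharSequenceLowercase());
            // Put your own token pattern here, or replace this pipe with a
            // custom Pipe that maps a CharSequence to a TokenSequence.
            pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipes.add(new TokenSequence2FeatureSequence());

            InstanceList instances = new InstanceList(new SerialPipes(pipes));
            // docs.tsv is a placeholder: one document per line, name<TAB>label<TAB>text.
            instances.addThruPipe(new CsvIterator(
                    new FileReader("docs.tsv"),
                    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                    3, 2, 1)); // regex groups for data, label, name
            instances.save(new File("docs.mallet"));
        }
    }

The saved docs.mallet file can then be passed to the train-topics command. Alternatively, if you pre-tokenize the documents yourself and join the tokens with spaces, importing with the command-line tool's --token-regex option (e.g. '\S+') should keep your tokenization intact.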
...