Hello!
I want to implement some applications with n-grams (preferably in PHP).
Which type of n-gram is better suited to most purposes: word-level or character-level? How could you implement an n-gram tokenizer in PHP?
First, I would like to know what N-grams exactly are. Is this correct? It's how I understand n-grams...
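For reference, here is a minimal sketch of the two kinds of n-gram the question contrasts. It is written in Python rather than PHP, the function names `char_ngrams` and `word_ngrams` are hypothetical, and splitting words on whitespace is an assumption; a real tokenizer would probably also need to handle punctuation and case.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return every contiguous run of n words, splitting on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("tool", 2))
print(word_ngrams("the quick brown fox", 2))
```

Both functions return an empty list when the input is shorter than `n`.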
I'm curious if anyone understands, knows or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. However, if you know of any other application that can do the same please post your answer too.
If you do not know what I am writing about, here is a link to an example of Popular...
A spectrum kernel function operates on strings by counting the n-grams that two strings have in common. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2 ('oo' and 'ol' in common).
How can I write a MATLAB function that could calculate this metric?
...
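The metric described above can be sketched as follows, in Python rather than MATLAB. The function name `spectrum_kernel` is hypothetical. The sketch assumes the usual formulation of the kernel as the inner product of the two n-gram count vectors, which matches the 'tool'/'fool' example whenever no n-gram repeats within a string.

```python
from collections import Counter


def spectrum_kernel(s, t, n=2):
    """Inner product of the character n-gram count vectors of s and t."""
    grams_s = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    grams_t = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    # Counter returns 0 for n-grams that t does not contain,
    # so only shared n-grams contribute to the sum.
    return sum(count * grams_t[gram] for gram, count in grams_s.items())


print(spectrum_kernel("tool", "fool"))  # 2: 'oo' and 'ol' are shared
```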
I am implementing a variation of a spell checker. After taking various routes (for improving the time efficiency) I am planning to try out a component which would involve the use of an n-gram model. So essentially I want to prune the list of likely candidates for further processing. Would you guys happen to know if using one value of n (say 2) w...
I have a Rails app with a Postgres backend.
I need to add full text search which would allow fuzzy searches based on Levenshtein distance or other similar metrics. Add the fact that the lexer/stemmer has to work with non-English words (it would be ok to just switch language-dependent features off when lexing, to not mess with the target l...
Hello.
I'm working on Markov Chains and I would like to know of efficient algorithms for constructing probabilistic transition matrices (of order n), given a text file as input.
I am not after one algorithm, but I'd rather like to build a list of such algorithms. Papers on such algorithms are also more than welcome, as any tips on ter...
Hi! I recently found out about n-grams and the cool possibility of comparing the frequency of phrases in a text body with them. Now I'm trying to make a VB.NET app that simply gets a text body and returns a list of the most frequently used phrases (where n >= 2).
I found a C# example of how to generate an n-gram from a text body so I started ...
Drupal's core search module only searches for whole keywords, e.g. "sandwich". Can I make it search with a substring, e.g. "sandw", and return my sandwich results?
Maybe there is a plugin that does that?
...
As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams.
1) I would like to receive feedback about my function: can this be written more simply or more efficiently?
2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale)...
Let's say I have a sentence of text:
$body = 'the quick brown fox jumps over the lazy dog';
and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords; I have the following to get single word keywords:
$words{$_}++ for $body =~ m/(\w+)/g;
After this is complete, I have a hash that looks like...
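One way to extend the single-word count to multi-word keywords is to count every run of 1 to `max_n` consecutive words. The sketch below is in Python rather than Perl; `keyword_counts` and `max_n` are hypothetical names, and treating a multi-word keyword as a contiguous word n-gram is an assumption about what the question is after.

```python
import re
from collections import Counter


def keyword_counts(body, max_n=2):
    """Count every run of 1..max_n consecutive words in body."""
    words = re.findall(r"\w+", body)  # same \w+ tokenization as the Perl line
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


counts = keyword_counts("the quick brown fox jumps over the lazy dog")
print(counts["the"], counts["the quick"], counts["lazy dog"])
```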
Hi all,
I have been working on a project about sentence similarity. I know it has been asked many times on SO, but I just want to know whether my problem can be solved with the method I am using, or whether I should change my approach to the problem. Roughly speaking, the system is supposed to split all sentences of an ar...