n-gram

N-grams: Explanation + 2 applications

Hello! I want to implement some applications with n-grams (preferably in PHP). Which type of n-gram is more suitable for most purposes: word-level or character-level? How could you implement an n-gram tokenizer in PHP? First, I would like to know what n-grams exactly are. Is this correct? It's how I understand n-grams...

How was the Google Books' Popular passages feature developed?

I'm curious whether anyone understands, knows, or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. If you know of any other application that can do the same, please post your answer too. If you do not know what I am writing about, here is a link to an example of Popular...

How to implement a spectrum kernel function in MATLAB?

A spectrum kernel function operates on strings by counting the n-grams that two strings have in common. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2 ('oo' and 'ol' in common). How can I write a MATLAB function that could calculate this metric? ...
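The 'tool'/'fool' example above can be sketched as follows. This is a minimal illustration in Python rather than MATLAB, and the function names `char_ngrams` and `spectrum_kernel` are hypothetical. It assumes the kernel is the dot product of the two n-gram count vectors, which equals the number of shared n-grams when no n-gram repeats within either string.

```python
from collections import Counter


def char_ngrams(s, n):
    """Return the contiguous character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def spectrum_kernel(a, b, n=2):
    """Dot product of the n-gram count vectors of a and b (assumed definition)."""
    counts_a = Counter(char_ngrams(a, n))
    counts_b = Counter(char_ngrams(b, n))
    # A Counter returns 0 for n-grams it has never seen, so unshared grams add nothing.
    return sum(c * counts_b[g] for g, c in counts_a.items())


print(char_ngrams("tool", 2))              # ['to', 'oo', 'ol']
print(spectrum_kernel("tool", "fool", 2))  # 2
```

Counting with multiplicities is a design choice: a string that repeats a shared n-gram scores higher than under a plain set intersection.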

What is the empirically found best value for n in an n-gram model?

I am implementing a variation of a spell checker. After taking various routes (to improve time efficiency), I am planning to try out a component that would involve the use of an n-gram model. So essentially I want to prune the list of likely candidates for further processing. Would you guys happen to know if using one value of n (say 2) w...

Recommendation needed: Rails, Postgres and fuzzy full text search

I have a Rails app with a Postgres backend. I need to add full text search which would allow fuzzy searches based on Levenshtein distance or other similar metrics. Add the fact that the lexer/stemmer has to work with non-English words (it would be ok to just switch language-dependent features off when lexing, to not mess with the target l...

Probability transition matrix

Hello. I'm working on Markov Chains and I would like to know of efficient algorithms for constructing probabilistic transition matrices (of order n), given a text file as input. I am not after one algorithm, but I'd rather like to build a list of such algorithms. Papers on such algorithms are also more than welcome, as any tips on ter...

N-gram function in vb.net -> create grams for words instead of characters

Hi! I recently found out about n-grams and the cool possibility to compare the frequency of phrases in a text body with them. Now I'm trying to make a vb.net app that simply gets a text body and returns a list of the most frequently used phrases (where n >= 2). I found a C# example of how to generate an n-gram from a text body so I started ...

Can Drupal's search module search for a substring? (Partial Search)

Drupal's core search module only searches for keywords, e.g. "sandwich". Can I make it search with a substring, e.g. "sandw", and return my sandwich results? Maybe there is a plugin that does that? ...

N-gram split function for string similarity comparison

As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams. 1) I would like to receive feedback about my function: can this be written more simply or more efficiently? 2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale)...

Perl paragraph n-gram

Let's say I have a sentence of text: $body = 'the quick brown fox jumps over the lazy dog'; and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords; I have the following to get single word keywords: $words{$_}++ for $body =~ m/(\w+)/g; After this is complete, I have a hash that looks like...
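One way to extend the single-word count to multi-word keywords is sketched below. It is written in Python rather than Perl, and the function name `word_ngram_counts` and the `max_n` parameter are hypothetical. It assumes that a "multi-word keyword" means a run of up to `max_n` consecutive words.

```python
import re
from collections import Counter


def word_ngram_counts(text, max_n=2):
    """Count every run of 1..max_n consecutive words in text."""
    words = re.findall(r"\w+", text)  # same \w+ tokenization as the Perl one-liner
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


body = "the quick brown fox jumps over the lazy dog"
counts = word_ngram_counts(body, max_n=2)
print(counts["the"])          # 2
print(counts["quick brown"])  # 1
```

Keys are the words joined by a single space, so single words and phrases share one table, much like the hash in the question.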

n-gram sentence similarity with cosine similarity measurement

Hi all, I have been working on a project about sentence similarity. I know it has been asked many times on SO, but I just want to know whether my problem can be solved with the method I am using, or whether I should change my approach to the problem. Roughly speaking, the system is supposed to split all sentences of an ar...
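For reference, cosine similarity over n-gram counts can be sketched as below. The excerpt is cut off before the method is described, so this is only an assumption about the general technique named in the title, not the asker's approach. The function names and the lowercase, whitespace-based tokenization are illustrative choices.

```python
import math
from collections import Counter


def word_ngrams(sentence, n):
    """Return the word n-grams of a sentence as tuples (lowercased, split on whitespace)."""
    words = sentence.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def cosine_similarity(a, b, n=1):
    """Cosine of the angle between the n-gram count vectors of sentences a and b."""
    va = Counter(word_ngrams(a, n))
    vb = Counter(word_ngrams(b, n))
    dot = sum(c * vb[g] for g, c in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sentence shorter than n words has no n-grams
    return dot / (norm_a * norm_b)
```

With n=1 this compares bag-of-words vectors. Larger n makes the measure sensitive to word order, at the cost of sparser vectors.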