tags:

views:

346

answers:

2

A spectrum kernel function operates on strings by counting the same n-grams in between two strings. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2. ('oo' and 'ol' in common).

How can I write a MATLAB function that could calculate this metric?

+1  A: 

The first step would be to create a function that can generate an n-gram for a given string. One way to do this in a vectorized fashion is with some clever indexing.

function subStrings = n_gram(fullString,N)
  nString = numel(fullString);
  index = repmat((0:(nString-N))',1,N)+...
          repmat(1:N,(nString-N+1),1);
  subStrings = unique(cellstr(fullString(index)));
end

This uses the function REPMAT to first create a matrix of indices that will select each set of unique N-length substrings from the given string. Indexing the given string with this index matrix will create a character array with one N-length substring per row. The function CELLSTR then places each row of the character array into a cell of a cell array. The function UNIQUE then removes repeated substrings.

With the above function you can then easily count the number of n-grams shared between two strings using the INTERSECT function:

subStrings1 = n_gram('tool',2);
subStrings2 = n_gram('fool',2);
sharedStrings = intersect(subStrings1,subStrings2);
nShared = numel(sharedStrings);
gnovice
You should add subStrings = intersect(subStrings, subStrings);to the function n_gram to avoid repetition. For instance, n_gram('hubbub', 2) would have a repeated 'ub' member in the original formulation.
mtrw
@mtrw: Good point, but the function UNIQUE would work better. I'll update accordingly.
gnovice
Oof. Yeah, UNIQUE would be quite a bit more appropriate.
mtrw
A: 

What you're looking for is called the Hamming distance, you can get a better description of it if you do doc pdist.

A=['Marcin'; 'Martin'; 'Marsha']  %data

squareform(pdist(A, 'hamming'))  returns

         0    0.1667    0.5000

    0.1667         0    0.5000

    0.5000    0.5000         0

This form shows you how many letters are different. Difference between 'Marcin' and 'Martin' is 1 out of 6 letters, so you get 1/6=0.1667 'Marcin' vs 'Marsha' has 3 of 6, so 3/6=0.5
If you want the actual number of letters that are different, just multiply the whole matrix by length(A).

Marcin
This doesn't quite sound like what the question describes.
gnovice

related questions