ansaurus

Question

Answer 1

What you need is a distance function that can handle strings. Check out the Levenshtein distance (edit distance). There are plenty of implementations out there.
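For illustration, here is a minimal sketch of such a function (not part of the original answer). It assumes the name `levenshtein_distance` used in the code further down, assumes the inputs are plain character vectors, and uses the standard dynamic-programming (Wagner-Fischer) formulation with unit cost for every edit. Save it as `levenshtein_distance.m`:

```matlab
function d = levenshtein_distance(s, t)
%LEVENSHTEIN_DISTANCE  Edit distance between two character vectors.
%   Minimum number of single-character insertions, deletions and
%   substitutions needed to turn s into t.
    m = numel(s);
    n = numel(t);

    %# D(i+1,j+1) holds the distance between s(1:i) and t(1:j)
    D = zeros(m+1, n+1);
    D(:,1) = (0:m)';    %# turning s(1:i) into '' takes i deletions
    D(1,:) = 0:n;       %# turning '' into t(1:j) takes j insertions

    for i = 1:m
        for j = 1:n
            cost = double(s(i) ~= t(j));          %# 0 if the characters match
            D(i+1,j+1) = min([D(i,j+1) + 1, ...   %# deletion
                              D(i+1,j) + 1, ...   %# insertion
                              D(i,j) + cost]);    %# substitution (or match)
        end
    end
    d = D(m+1, n+1);
end
```

For example, `levenshtein_distance('I have a pen.', 'I have a pencil.')` returns 3, since three characters have to be inserted.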

Alternatively, you could extract some interesting features (e.g. number of vowels, length of string, etc.) to build a vector space representation, and then apply any of the usual distance measures (Euclidean, ...) to the new representation.
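A rough sketch of this feature-based route might look as follows. The two features (string length and vowel count) are just the examples mentioned above, not a recommendation, and `pdist`/`linkage` assume the Statistics Toolbox is available:

```matlab
%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# one row per string: [length of string, number of vowels]
X = zeros(numStr, 2);
for i = 1:numStr
    s = lower(str{i});
    X(i,1) = numel(s);
    X(i,2) = sum(ismember(s, 'aeiou'));
end

%# any of the usual distance measures now applies to the rows of X
D = pdist(X, 'euclidean');
T = linkage(D, 'single');
```

With only these two features, 'I have a pen.' and 'I have a cat.' map to the same point ([13 5]), so a real application would probably need richer features.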

EDIT

The problem with your code is that LINKAGE expects its input distances in the same format as the output of PDIST, namely a row vector of pairwise distances in the order 1-vs-2, 1-vs-3, 2-vs-3, etc., which is basically the lower half of the complete distance matrix (since it's supposed to be symmetric, as dist(1,2) == dist(2,1)):

%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# create and fill upper half only of distance matrix
%# (levenshtein_distance is a user-supplied function, not a MATLAB built-in)
D = zeros(numStr,numStr);
for i=1:numStr
    for j=i+1:numStr
        D(i,j) = levenshtein_distance(str{i},str{j});
    end
end
D = D + D';       %'# symmetric distance matrix

%# linkage expects its input in the same format as the output of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)
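To see which pair each element of that row vector refers to, a small check with made-up distances can help. The matrix `M` below is purely illustrative, and `squareform` again assumes the Statistics Toolbox:

```matlab
%# made-up symmetric distance matrix for four observations
M = [0 1 2 3
     1 0 4 5
     2 4 0 6
     3 5 6 0];

v = squareform(M, 'tovector');   %# 1x6 row vector, same layout as PDIST output
pairs = nchoosek(1:4, 2);        %# all unordered pairs, one per row

%# print each pair next to its distance
for k = 1:numel(v)
    fprintf('%d-vs-%d: %g\n', pairs(k,1), pairs(k,2), v(k));
end
```

Here `v` comes out as `[1 2 3 4 5 6]`, i.e. the pairs 1-vs-2, 1-vs-3, 1-vs-4, 2-vs-3, 2-vs-4 and 3-vs-4 in that order.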

Please refer to the documentation of the functions in question for more information...

Amro 2010-09-05 17:26:37

Levenshtein distance is different from HAC, isn't it? I'm looking for HAC only. I couldn't find whether HAC extracts a vector using any special feature. Can I convert texts to vectors (all characterless)? Any hints on how to build a vector space?

Tinglin 2010-09-05 17:54:21

Levenshtein distance is not a clustering algorithm; it's a distance function between two strings. This is needed because hierarchical clustering starts by computing the distance matrix between all pairs of instances (`PDIST`), and then merges them in a bottom-up (agglomerative) fashion (`LINKAGE`).

Amro 2010-09-05 17:57:43

I'm just reading the link you gave. Will update on what I could do. Thanks, Amro.

Tinglin 2010-09-05 18:06:42

Just read the documents and understood. Copied the code and ran it too. If I understand correctly, the first loop takes one string at a time, and in the second loop it is compared with the rest. I defined one string as S1 and want to read the others from a file. I'm updating the code, which I just edited following your code. I'm getting an error and wondering whether I can do it this way or not. Thanks very much, help appreciated.

Tinglin 2010-09-06 03:41:53

@Tinglin: what is it you're trying to do? Hierarchical clustering takes **all** instances and computes their distances among each other (not one in particular vs all others), and then it groups similar instances together to form the clusters. As to the loop, it will compare the instances in the order I explained before: 1-vs-2, 1-vs-3, 1-vs-4, 2-vs-3, 2-vs-4, and 3-vs-4 (basically all possible unordered pairs).

Amro 2010-09-06 03:55:34

@Amro: I understood clearly what you explained. What I'm trying to do is this: I have one string, say S1. In a file I have previously saved 4 strings (S2, S3, S4, S5). Assume they may or may not have been clustered before. I would like to compare (S1-S2), (S1-S3), (S1-S4), (S1-S5) to see if S1 can be clustered with any of the four strings. If yes, I would replace, say, S2 with S1 and save the file. Is that possible? Example: 'I have a pen' would replace the previous 'I have a pencil', or 'stock exchange fails' would replace 'failure in stock exchange'. Or is the whole concept wrong? Thanks very much.

Tinglin 2010-09-06 04:00:19

@Tinglin: I'm afraid I don't follow your logic. The whole point of clustering is that it discovers the intrinsic grouping of your data. This means that given a bunch of instances (strings in your case), it will group them for you such that similar instances are placed together (or in this case a hierarchy of groups). It might help if you read more on the subject: http://en.wikipedia.org/wiki/Cluster_analysis#Hierarchical_clustering

Amro 2010-09-06 04:18:05

@Amro: I read all the documents I had and the wiki. I think I have to modify what I'm trying to do, and here it goes: I received a piece of information S1 = 'I have a pen' and stored it on the PC. Now I have received a new piece of information S2 = 'I have a pencil'. Before I store S2, I want to determine whether S2 is the same as or similar to S1. If yes, I either want to truncate S2 or replace S1 with S2. But if I find S2 is very different from S1, I store S2 as well. HAC groups similar items, but I don't think it is appropriate for my scenario. The same goes for LD. Have a look at the code. Thanks.

Tinglin 2010-09-06 08:30:00

@Tinglin: Clearly clustering does not fit your problem. Still, this is way too vague: how would you determine if two strings are similar/different? Perhaps you should post a new question, but make sure to clearly explain the task you are trying to achieve (based on your description, I suspect you're not sure of the concept yourself!)

Amro 2010-09-06 18:58:09

@Amro, you are dead right. I'm still thinking about the concept and modifying it. I came up with a decent one and posted a new question. LD works up to some extent though. Thanks again.

Tinglin 2010-09-07 03:42:55

I'm linking to your new question for posterity: http://stackoverflow.com/questions/3655612/how-to-compute-similarity-between-two-sentences-syntactical-and-semantical

Amro 2010-09-07 19:12:11

Clustering text in MATLAB
