similarity

How to implement "related articles?"

How do I write code that would find related (similar) articles to the one that the user is currently reading? For example, suppose I have articles:
Python programming tips
Python programming for newbies
Programming in Python, ActionScript and Flash
Programming in the Jungle
Tarzan saves newbie Judy from using Fortran programming langua...

Collaborative Filtering: Non-Personalized item-to-item similarity

I'm trying to compute item-to-item similarity along the lines of Amazon's "Customers who viewed/purchased X have also viewed/purchased Y and Z". All of the examples and references I've seen are for either computing item similarity for ranked items, for finding user-user similarity, or for finding recommended items based on the current u...

Detecting similar words among n text documents

Hi, I have n documents and want to find common words that are included in these documents. For example, I want to say that (n-3) documents include the word "web". Certainly I can do this with basic data structures, but there may be an efficient algorithm, or a way to handle the same word with different suffixes. Is there any algorithm for such purposes?...
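The "basic data structures" approach mentioned in this question can be sketched as a document-frequency count. This is only an illustration: the function name `document_frequency` and the sample documents are made up here, and tokenization is a plain lowercase split on letters and digits. Handling words with different suffixes would need a separate stemming step, which this sketch leaves out.

```python
import re
from collections import Counter

def document_frequency(documents):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for text in documents:
        # A set ensures each word is counted once per document,
        # no matter how often it repeats inside that document.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(words)
    return counts

# Hypothetical sample documents, not taken from the question.
docs = [
    "The web is large",
    "Web pages link to other web pages",
    "A spider spins a web",
    "Nothing relevant here",
]
df = document_frequency(docs)
print(df["web"], "of", len(docs), "documents include the word 'web'")
```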

Is there some algorithm to compare the DOM similarity of different pages?

Does anyone have experience with this? ...

How to determine which file in the previous version a text block in the current version came from?

The problem is described below: Suppose I have a list of files in one version (say A, B, C, D). In the next version I have the following files (A, E, F, G). There are some similarities in their contents. The files in the later version come from the previous version by file renaming, content addition, deletion or partial modification or wi...

Find cosine similarity in R

I'm wondering if there is a built-in function in R that can find the cosine similarity (or cosine distance) between two arrays? Currently, I have implemented my own function, but I can't help but think that R should already come with one :) Thanks, Derek ...

Lucene numDocs and docFreq on custom similarity class

Hi all, I'm building an application with Lucene (I'm a noob with it) and I'm facing some problems. My application uses the Lucene 2.4.0 library with a custom similarity implementation (the jar is imported). In my app I'm calculating docFreq and numDocs manually (I'm adding the values of all indexes and then I calculate a global value in order to u...

Cosine Similarity Measure: Multiple results

My program uses clustering to produce subsets of similar items and then uses the cosine similarity measure to determine how similar the clusters are. For instance, if user 1 has 3 clusters and user 2 has 3 clusters, then every cluster of one user is compared against every cluster of the other, so 9 results using the cosine similarity measure will be produc...

Advice on String Similarity Metrics (Java). Distance, sounds like or combo?

Hello, part of a process requires applying string similarity algorithms. The results of this process will be stored and produce, let's say, SS_Dataset. Based on this dataset, further decisions will have to be made. My questions are: Should I apply one or more string similarity algorithms to produce SS_Dataset? Any comparisons...

Very fast document similarity

Hello, I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new...

About curse of dimensionality

My question is about this topic I've been reading about a bit. Basically my understanding is that in higher dimensions all points end up being very close to each other. The doubt I have is whether this means that calculating distances the usual way (euclidean for instance) is valid or not. If it were still valid, this would mean that wh...

About cosine similarity

Hi, I'm finding the cosine similarity between documents. I did it like this: D1 = (8,0,0,1), where 8,0,0,1 are the tf-idf scores of the terms t1, t2, t3, t4, and D2 = (7,0,0,1). cos(theta) = (56 + 0 + 0 + 1) / sqrt(64 + 49) sqrt(1 + 1), which comes out to be cos(theta) = 5. Now what do I evaluate from this value? I don't get what cos(theta) = 5 s...
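A cosine always lies between -1 and 1, so a value of 5 points to an arithmetic slip rather than a property of the documents. The denominator should be the product of each vector's own norm, sqrt(64 + 1) for D1 times sqrt(49 + 1) for D2, rather than a grouping of squares by term. A minimal sketch using the two vectors from the question (the function name `cosine_similarity` is chosen here for illustration, and it assumes neither vector is all zeros):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # length of a
    norm_b = math.sqrt(sum(y * y for y in b))  # length of b
    return dot / (norm_a * norm_b)

d1 = (8, 0, 0, 1)
d2 = (7, 0, 0, 1)

# 57 / (sqrt(65) * sqrt(50)): very close to 1, meaning the two
# vectors point in almost the same direction.
similarity = cosine_similarity(d1, d2)
print(similarity)
```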

Measuring similarity between documents using Jaccard coefficient

Hi, I'm finding the similarity between documents, and to measure that I used the Jaccard coefficient. I did it like this: D1 = (8,0,0,1), where 8,0,0,1 are the tf-idf scores of the terms t1, t2, t3, t4, and D2 = (7,0,0,0). Jaccard coefficient = dotproduct(d1,d2) / (|d1| + |d2| - dotproduct(d1,d2)), and the answer comes out to be "-1.367931". What does i...
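For real-valued vectors, the extended Jaccard (Tanimoto) coefficient uses the squared norms in the denominator: dot / (|d1|^2 + |d2|^2 - dot). Using the plain norms instead reproduces the negative value reported in the question. A short sketch with the question's vectors (the function name `tanimoto` is chosen here for illustration, and it assumes the vectors are not both all zeros):

```python
import math

def tanimoto(a, b):
    """Extended Jaccard (Tanimoto) coefficient for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)  # squared norm of a
    sq_b = sum(y * y for y in b)  # squared norm of b
    return dot / (sq_a + sq_b - dot)

d1 = (8, 0, 0, 1)
d2 = (7, 0, 0, 0)

correct = tanimoto(d1, d2)  # 56 / (65 + 49 - 56) = 56 / 58

# The same ratio with unsquared norms, shown only to illustrate
# where a negative result can come from.
dot = sum(x * y for x, y in zip(d1, d2))
norm1 = math.sqrt(sum(x * x for x in d1))
norm2 = math.sqrt(sum(y * y for y in d2))
unsquared = dot / (norm1 + norm2 - dot)

print(correct, unsquared)
```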

Converting python collaborative filtering code to use Map Reduce

Using Python, I'm computing cosine similarity across items. Given event data that represents a purchase (user,item), I have a list of all items 'bought' by my users. Given this input data (user,item): X,1 X,2 Y,1 Y,2 Z,2 Z,3, I build a Python dictionary {1: ['X','Y'], 2: ['X','Y','Z'], 3: ['Z']}. From that dictionary, I generate a...
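The dictionary-building step described in this question, followed by an item-to-item cosine over binary purchase data, might look like the sketch below. It is an illustration rather than the asker's code: the helper names `users_by_item` and `item_cosine` are invented here, and sets are used instead of lists so that shared users can be found by intersection. With binary data, the cosine between two items reduces to the number of shared users divided by the square root of the product of the two user counts.

```python
import math
from collections import defaultdict
from itertools import combinations

# (user, item) purchase events from the question.
events = [("X", 1), ("X", 2), ("Y", 1), ("Y", 2), ("Z", 2), ("Z", 3)]

def users_by_item(events):
    """Map each item to the set of users who bought it."""
    index = defaultdict(set)
    for user, item in events:
        index[item].add(user)
    return dict(index)

def item_cosine(index):
    """Cosine similarity for every unordered pair of items."""
    sims = {}
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])  # users who bought both items
        sims[(a, b)] = shared / math.sqrt(len(index[a]) * len(index[b]))
    return sims

index = users_by_item(events)
similarities = item_cosine(index)
for pair, score in similarities.items():
    print(pair, round(score, 4))
```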

'Similarity' in Data Mining

In the field of Data Mining, is there a specific sub-discipline called 'Similarity'? If yes, what does it deal with? Any examples, links, or references will be helpful. Also, being new to the field, I would like the community's opinion on how closely related Data Mining and Artificial Intelligence are. Are they synonyms, is one the subset of...

Solr search score in the range from 0 to 1

Hi, is it possible to configure Solr so that the document similarity score falls in a fixed range, for example from 0 (no match) to 1 (complete document and query match)? Thanks! ...

Java: JPQL search -similar- strings

What methods are there to get JPQL to match similar strings? By similar I mean:
Contains: the search string is found within the string of the matched entity
Case-insensitive
Small misspellings: e.g. "arow" matches "arrow"
I suspect the first two will be easy; however, I would appreciate help with the last one. Thank you ...

Computer Science taxonomy

I am developing a web application where users have a collection of tags. I need to create a suggestion list for users based on the similarity of their tags. For example, when a user logs in to the system, the system gets his tags, searches for these tags in the DB of users, and shows users who have similar tags. For instance if User 1 has followi...

Finding the closest match

I have an object with a set of parameters like:
var obj = new {Param1 = 100, Param2 = 212, Param3 = 311, param4 = 11, Param5 = 290};
On the other side I have a list of objects:
var obj1 = new {Param1 = 1221, Param2 = 212, Param3 = 311, param4 = 11, Param5 = 290};
var obj3 = new {Param1 = 35, Param2 = 11, Param3 = 319, param4 = 211, Pa...

What is the paper "Oliver [1993]" describing a PHP algorithm to calculate text similarity?

There is a function similar_text() in the PHP library. The documentation (http://php.net/manual/en/function.similar-text.php) tells me that "This calculates the similarity between two strings as described in Oliver [1993]." Despite extensive searching, I can't find the paper that "Oliver [1993]" is referring to; nor any candidate for w...