similarity

Package to compare LSA, TFIDF, Cosine metrics and Language Models

Hi, I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform inter-document similarity testing with various metrics, such as TF-IDF, Okapi, language models, LSA, etc. As a result I want a document similarity matrix, i.e. doc1 is x% similar to doc2, and so on. This is for research purposes, not for pr...

How to get similar texts from a lot of pages?

Get the x texts most similar to a given text from a large collection of texts. Converting each page to plain text first may be better. You should not compare the text to every other text, because that is too slow. ...

Detecting image equality at different resolutions

I'm trying to build a script to go through my original, high-res photos and replace the old, low-res ones I uploaded to Flickr before I had a pro account. For many of them I can just use Exif info such as date taken to determine a match. But some are really old, and either the original file didn't have Exif info, or it got clobbered by ...

Saying "C & C# are equal by functionality, but not by concept"

An argument has been raised in my class regarding C and C#. I stated that it's correct to say that C & C# are the same (meant: same by functionality, but not by concept). Different by concept: C# is meant to be easier to program with than C. C is more descriptive. Same by functionality: Everything you make with C# - you can also make with...

Speeding up self-similarity in an image

I'm writing a program that will generate images. One measurement that I want is the amount of "self-similarity" in the image. I wrote the following code that looks for the countBest-th best matches for each sizeWindow * sizeWindow window in the picture: double Pattern::selfSimilar(int sizeWindow, int countBest) { std::vector<int> ...

Similarity Between Users Based On Votes

Let's say I have a set of users, a set of songs, and a set of votes on each song:

=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user...

Get cosine similarity between two documents in Lucene

Hi, I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index. For example, from a previously opened IndexReader ir I am getting the documents with ids 2 and 4. Document d1 = ir.document(2); Document d2 = ir.document(4); How can I ge...

Get similarity score between two document termfreqvectors

Hi, I would like to extract a similarity score between two document termfreqvectors. I found that if I submit the first one as a query and look for the second in the result set, I cannot get the precise score that Lucene gives for these two vectors. Any help? ...

Euclidean distance between posts based on tags

I am playing with the Euclidean distance example from the book Programming Collective Intelligence:

    # Returns a distance-based similarity score for person1 and person2
    def sim_distance(prefs,person1,person2):
      # Get the list of shared_items
      si={}
      for item in prefs[person1]:
        if item in prefs[person2]: si[item]=1
      # ...
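
A distance-based similarity score of this kind can be sketched as below. This is a minimal sketch and not the book's code: the `1 / (1 + distance)` mapping, the choice to return 0.0 when nothing is shared, and the sample `prefs` data (posts scored by tag) are all assumptions made for illustration.

```python
import math


def sim_distance(prefs, person1, person2):
    """Return a similarity in [0, 1] from the Euclidean distance over shared items."""
    # Only items rated by both sides contribute to the distance.
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0  # assumption: nothing in common means no similarity
    distance = math.sqrt(
        sum((prefs[person1][item] - prefs[person2][item]) ** 2 for item in shared)
    )
    # Assumed mapping: distance 0 gives 1.0, larger distances approach 0.
    return 1.0 / (1.0 + distance)


# Hypothetical data: each post maps tag -> weight.
prefs = {
    "post_a": {"python": 1.0, "lucene": 1.0},
    "post_b": {"python": 1.0, "lucene": 0.0, "php": 1.0},
    "post_c": {"java": 1.0},
}

print(sim_distance(prefs, "post_a", "post_b"))
```

Note that only shared keys are compared, so tags that appear in one post but not the other do not affect the score in this form.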

Fast similarity detection

I have a large collection of objects and I need to figure out the similarities between them. To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the sm...

Compare 5000 strings with PHP Levenshtein

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with Levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999? Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to fi...
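
One way to cut down the number of comparisons, sketched here in Python rather than PHP, relies on the fact that the Levenshtein distance between two strings is never smaller than the difference of their lengths. Sorting by length lets whole runs of pairs be skipped. The function names, the `max_distance` threshold and the sample addresses are assumptions for illustration; the worst case (all strings of similar length) is still quadratic.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar_pairs(strings, max_distance):
    """Return sorted (i, j, distance) triples for pairs within max_distance."""
    # Visit indices in order of string length so length gaps only grow.
    order = sorted(range(len(strings)), key=lambda k: len(strings[k]))
    pairs = []
    for pos, i in enumerate(order):
        for next_pos in range(pos + 1, len(order)):
            j = order[next_pos]
            if len(strings[j]) - len(strings[i]) > max_distance:
                break  # every later string is at least this long, so stop
            distance = levenshtein(strings[i], strings[j])
            if distance <= max_distance:
                pairs.append((min(i, j), max(i, j), distance))
    return sorted(pairs)


# Hypothetical sample data.
addresses = ["12 Main Street", "12 Main Stret", "98 Oak Ave", "12 Main Street North"]
print(similar_pairs(addresses, max_distance=2))
```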

Java: Equalator? (removing duplicates from a collection of objects)

I have a bunch of objects of a class Puzzle. I have overridden equals() and hashCode(). When it comes time to present the solutions to the user, I'd like to filter out all the Puzzles that are "similar" (by the standard I have defined), so the user only sees one of each. Similarity is transitive. Example: Result of computations: A ...

Cosine Similarity

Hi all, thank you to all the great people here for helping people like me :) I just need a small hint. I calculated the tf/idf values of two documents. The tf/idf values are: 1.txt 0.0 0.5; 2.txt 0.0 0.5. The documents are: 1.txt => dog cat; 2.txt => cat elephant. Now that I have the tf/idf values, can anybody tell me how to use these valu...
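
Cosine similarity between two weighted term vectors is the dot product divided by the product of the vector lengths. The sketch below stores each document as a term-to-weight dictionary; the function name and the sample weights are assumptions for illustration and are not the tf/idf values from the question.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term -> weight vectors."""
    # Terms missing from one vector contribute 0 to the dot product.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical weights for "dog cat" and "cat elephant".
doc1 = {"dog": 0.5, "cat": 0.2}
doc2 = {"cat": 0.2, "elephant": 0.5}

print(cosine_similarity(doc1, doc2))
```

If the only shared term carries a weight of 0 in either document, the dot product is 0 and so is the cosine, whatever the other weights are.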

PHP similar_text() in java

Do you know any strictly equivalent implementation of the PHP similar_text function in Java? ...

Computing degree of similarity among a group of sets

Suppose there are 4 sets: s1={1,2,3,4}; s2={2,3,4}; s3={2,3,4,5}; s4={1,3,4,5}; Is there any standard metric to present the similarity degree of this group of 4 sets? Thank you for the suggestion of the Jaccard method. However, it seems to be pairwise. How can I compute the similarity degree of the whole group of sets? ...
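
Two possible ways to extend Jaccard from a pair to a group are sketched below: averaging the pairwise scores, or dividing the size of the intersection of all sets by the size of their union. Neither is claimed to be the standard answer; the function names are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Pairwise Jaccard index: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(sets):
    """Average of the Jaccard index over every unordered pair of sets."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def group_jaccard(sets):
    """Elements common to all sets divided by elements present in any set."""
    if not sets:
        raise ValueError("need at least one set")
    union = set().union(*sets)
    common = set.intersection(*sets)
    return len(common) / len(union) if union else 1.0


s1, s2, s3, s4 = {1, 2, 3, 4}, {2, 3, 4}, {2, 3, 4, 5}, {1, 3, 4, 5}
group = [s1, s2, s3, s4]

print(mean_pairwise_jaccard(group))
print(group_jaccard(group))  # {3, 4} out of {1, 2, 3, 4, 5} -> 0.4
```

The intersection-over-union form is stricter: a single outlier set can drive it to 0, while the pairwise mean degrades gradually.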

Image comparison with php + gd

What's the best approach to comparing two images with php and the Graphic Draw Library? This is the scenario: I have an image, and I want to find which image of a given set is the most similar to it. The most similar image is in fact the same image, not pixel perfect match but the same image. I've dramatised the difference between th...

Appropriate similarity metrics for multiple sets of 2d co-ordinates

Pardon me if this has been asked/answered before...my searches did not bring it up. I have a collection of 2D co-ordinate sets (on the scale of 100K-500K points in each set) and I am looking for the most efficient way to measure the similarity of one set to another. I know of the usuals: Cosine, Jaccard/Tanimoto etc. However hoping f...

Compare int arrays for 'similarity' - more accurate than weighted average?

Say there are a number of arrays of length 12, containing signed integers in a range of roughly ±100. How can I compare the 'signature' or 'harmonic content' of these arrays to each other in a way that is more accurate than a simple weighted average? Would I have to look into neural networks (if this would even be suitable, I don't kn...

Grouping strings by similarity

I have an array of strings, not many (maybe a few hundred) but often long (a few hundred chars). Those strings are generally nonsense and differ from one another, but within a group of those strings, maybe 5 out of 300, there is great similarity. In fact they are the same string; what differs is formatting, punctuation and a few wor...

How do you efficiently implement a document similarity search system?

How do you implement a "similar items" system for items described by a set of tags? In my database, I have three tables, Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article I want to find the five most similar articles to implement an "if you like this article you wil...
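
One possible approach, sketched in memory rather than in SQL, is to build an inverted index from tag to articles so that only articles sharing at least one tag are ever scored, then rank them by the Jaccard index of their tag sets. The `article_tags` dictionary (article id to set of tag names), the function name and the sample data are assumptions for illustration; loading them from the Article, ArticleTag and Tag tables is left out.

```python
from collections import defaultdict


def most_similar(article_tags, article_id, k=5):
    """Return up to k (other_id, score) pairs ranked by tag-set Jaccard index."""
    # Inverted index: tag -> ids of the articles carrying that tag.
    tag_index = defaultdict(set)
    for aid, tags in article_tags.items():
        for tag in tags:
            tag_index[tag].add(aid)

    # Count shared tags, touching only articles with at least one tag in common.
    own_tags = article_tags[article_id]
    shared_counts = defaultdict(int)
    for tag in own_tags:
        for other in tag_index[tag]:
            if other != article_id:
                shared_counts[other] += 1

    scored = []
    for other, shared in shared_counts.items():
        union_size = len(own_tags) + len(article_tags[other]) - shared
        scored.append((other, shared / union_size))

    # Highest score first; ties broken by article id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical data.
article_tags = {
    1: {"python", "nlp", "search"},
    2: {"python", "nlp"},
    3: {"search", "lucene"},
    4: {"cooking"},
}

print(most_similar(article_tags, 1))
```

Articles with no tag in common, such as article 4 here, never appear in the result, which also means fewer than `k` entries can come back.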