similarity

How to calculate Mahalanobis distance between two time series of equal dimensions?

I am doing some data mining on time series data. I need to calculate the distance or similarity between two series of equal dimensions. It was suggested that I use Euclidean distance, cosine similarity, or Mahalanobis distance. The first two didn't give any useful information. I cannot seem to understand the various tutorials on the web. So, Gi...

Optimize algorithm for creating a list of items rated together, in Python.

Given a list of purchase events (customer_id, item): 1-hammer, 1-screwdriver, 1-nails, 2-hammer, 2-nails, 3-screws, 3-screwdriver, 4-nails, 4-screws, I'm trying to build a data structure that tells how many times an item was bought with another item. Not bought at the same time, but bought since I started saving data. The result would look like ...
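A minimal sketch of one way to build such a structure, assuming "bought with" means "bought by the same customer at any point". The function name `co_purchase_counts` and the tuple layout of `events` are illustrative assumptions, not part of the original question. It groups items per customer and then counts every unordered pair of items within each group:

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(events):
    """Count, for each unordered item pair, how many customers bought both items."""
    # Group the distinct items bought by each customer.
    baskets = {}
    for customer_id, item in events:
        baskets.setdefault(customer_id, set()).add(item)

    # Count each unordered pair once per customer; sorting makes the
    # pair key canonical, so (a, b) and (b, a) are never counted separately.
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


# The purchase events from the excerpt, as (customer_id, item) tuples.
events = [
    (1, "hammer"), (1, "screwdriver"), (1, "nails"),
    (2, "hammer"), (2, "nails"),
    (3, "screws"), (3, "screwdriver"),
    (4, "nails"), (4, "screws"),
]
counts = co_purchase_counts(events)
print(counts[("hammer", "nails")])  # hammer and nails share customers 1 and 2
```

Because the pair keys are sorted tuples, a lookup has to use the same order, for example `counts[tuple(sorted((a, b)))]`.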

Cosine Similarity of Vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying: #len(u)==201, len(v)==246 cosine_distance(u, v) ValueError: objects are not aligned #this works though: cosine_distan...
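One plausible reading of the error above is that each document's vector was built over its own vocabulary, so the two vectors have different lengths (201 and 246) and their positions do not refer to the same terms. A minimal sketch of a workaround, assuming each document is stored as a hypothetical term-to-weight dictionary rather than a dense array; the function and variable names are illustrative and do not come from the question:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an empty or all-zero document is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF weights for two documents with different vocabularies.
doc_a = {"cat": 1.0, "dog": 2.0}
doc_b = {"dog": 2.0, "fish": 1.0}
similarity = cosine_similarity(doc_a, doc_b)
```

Missing terms are treated as weight zero, which is equivalent to padding both documents out to a shared vocabulary before comparing them.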

Efficient item similarity search using sphinx

Is it possible to perform document similarity search efficiently using sphinx search? My index consists of 500k documents, each of which is tagged with 5-30 different short, all-lowercase stemmed words, which are the data to search through. For simplicity, all tags in the database have equal weights and I'm not using phrase searching. My first a...

How to find similar users using their interests

I am trying to create a system which would be able to find users with similar favourite movies/books/interests/etc., much like neighbours on last.fm. Users sharing the most mutual interests would have the highest match and would be displayed in user profiles (5 best matches or so). Is there any reasonably fast way to do this? The obviou...

hash function to index similar text

I'm looking for a kind of hash function to index similar text. For example, if we have two very long texts called "A" and "B", where A and B do not differ much, then the hash function (called H) applied to A and B should return the same number. So H(A) = H(B) where A and B are similar texts. I tried the "DoubleMetaphone" (I use ital...

Adding documents to a scored TF-IDF collection?

I have a large collection of documents that already have their TF-IDF computed. I'm getting ready to add some more documents to the collection, and I am wondering if there is a way to add TF-IDF scores to the new documents without re-processing the entire database? ...

Similarity search between time series in MATLAB: is it possible? I can't find an R-tree implementation in MATLAB

Hi there, I would like to implement similarity search in MATLAB. I want to know whether it is possible. My plan is to use two popular similarity measures, Euclidean distance and dynamic time warping. Both of these will be applied to a time series dataset. My question at this point is how can I evaluate both of these two measurem...

How to find similar results and sort by similarity?

How do I query for records ordered by similarity? E.g. searching for "Stock Overflow" would return: Stack Overflow, SharePoint Overflow, Math Overflow, Politic Overflow, VFX Overflow. E.g. searching for "LO" would return: pabLO picasso, michelangeLO, jackson polLOck. What I need help with: Using a search engine to index & search a MySQ...

Generating bigram combinations from grouped data in Pig

Given my input data in userid,itemid format: raw: {userid: bytearray,itemid: bytearray} dump raw; (A,1) (A,2) (A,4) (A,5) (B,2) (B,3) (B,5) (C,1) (C,5) grpd = GROUP raw BY userid; dump grpd; (A,{(A,1),(A,2),(A,4),(A,5)}) (B,{(B,2),(B,3),(B,5)}) (C,{(C,1),(C,5)}) I'd like to generate all of the combinations (order not important) of ...

Similarity measurement and similarity search: what is the difference?

What is the difference between similarity measurement in time series and similarity search in time series? I am a bit confused by these two terms. To my understanding, similarity search is the process of obtaining similar time series using a similarity measure such as Euclidean distance, DTW, EDR, EDP, etc. Then what is similarity...

Find similar ASCII characters in Unicode

Does anyone know an easy way to find characters in Unicode that are similar to ASCII characters? An example is the "CYRILLIC SMALL LETTER DZE (ѕ)". I'd like to do a search and replace for similar characters. By similar I mean similar to a human reader: you can't see a difference by looking at them. ...

Algorithm to find if one document is included in another, when those two documents are similar.

I'm looking for an algorithm that finds whether two text documents are similar, where one document is included in the other document. I thank you in advance. ...

Algorithm to find if one document is included in another, when those two documents are similar

Hi, I'm using an algorithm based on cosine similarity with TF-IDF to find whether document X is similar to document Y. But if they are similar, I want to know more about X and Y: I want to know whether X contains Y, that is, whether the information in Y is included within X. The most important aspect for me is the semantics of the inclusion, not only th...

Hash function that hashes similar strings in the same bucket

Hello everybody! I'm searching for a "bad" hash function: I'd like to hash strings and put similar strings in one bucket. Can you give me a hint where to start my research? Some methods or algorithm names... Thanks in advance! Sebastian ...

Adjusted Cosine Similarity

Hello there. Can you help me, please? I am having trouble writing PHP code for adjusted cosine similarity. I have built the data like this: $data[UserID][ItemID] = Rating. Data example: $data[1][1] = 5; $data[1][2] = 3; $data[1][3] = 4; $data[2][1] = 3; $data[2][2] = 2; $data[2][4] = 3; $data[2][5] = 3; $data[3][1] = 4; $data[3][3] = 3; $d...
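A minimal sketch of the usual item-based adjusted cosine formula, written in Python rather than PHP for illustration. It assumes the standard definition: each rating is first centred on that user's mean rating, and the cosine is then taken over the users who rated both items. The function name, the fallback value of 0.0, and the use of only the ratings visible in the truncated excerpt are assumptions:

```python
import math


def adjusted_cosine(data, item_i, item_j):
    """Adjusted cosine similarity between two items.

    data maps user_id -> {item_id: rating}, mirroring $data[UserID][ItemID].
    """
    numerator = 0.0
    sum_sq_i = 0.0
    sum_sq_j = 0.0
    for ratings in data.values():
        # Only users who rated both items contribute.
        if item_i not in ratings or item_j not in ratings:
            continue
        # Centre on the user's mean over all of that user's ratings.
        user_mean = sum(ratings.values()) / len(ratings)
        dev_i = ratings[item_i] - user_mean
        dev_j = ratings[item_j] - user_mean
        numerator += dev_i * dev_j
        sum_sq_i += dev_i * dev_i
        sum_sq_j += dev_j * dev_j
    denominator = math.sqrt(sum_sq_i) * math.sqrt(sum_sq_j)
    if denominator == 0:
        # Assumption: no co-raters or no variation means "no evidence".
        return 0.0
    return numerator / denominator


# Only the ratings visible in the excerpt; the rest of the data is truncated.
data = {
    1: {1: 5, 2: 3, 3: 4},
    2: {1: 3, 2: 2, 4: 3, 5: 3},
    3: {1: 4, 3: 3},
}
sim_1_2 = adjusted_cosine(data, 1, 2)
```

Subtracting the user mean is what distinguishes this from plain cosine similarity: it compensates for users who rate everything high or everything low.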

Calculating similarity between and centroid of Lucene documents

In order to perform a simple clustering algorithm on results that I get from Lucene, I have to calculate Cosine similarity between 2 documents in Lucene, I also need to be able to make a centroid document to represent the centroid of each cluster. All I can think of doing is building my own Vector Space model with tf-idf weighting, usi...

Please Optimize My Pearson Code

Hello there... I was wondering if you could help optimize my code, because when I run it on localhost it takes about "17 MINUTES" (a calculation with 100000 queries). The data looks like this: $data[UserID][ItemID] = Rating ==> $data[1][1] = 5; This is my code: <?php include "......."; set_time_limit(0); ...

How to array_merge a dynamic array based on one of its value similarity

Good day, I am retrieving information from various websites using cURL and various parsing techniques. I made the code so I can, if desired, add additional websites I scan information from. The information retrieved is as follows: (Please note that the information may be inaccurate and may not reflect real price/name) Array ( [web...

Audio similarity library

I'm trying to find something like an audio similarity library for a school project. Something simple and well documented, preferably written in Python or Java, that could extract features from audio files and estimate any form of similarity based on these. Something like this code could also be fine but I think I don't have the skill to ...