I am doing some data mining on time series data. I need to calculate the distance or similarity between two series of equal dimensions. It was suggested that I use Euclidean distance, cosine similarity, or Mahalanobis distance. The first two didn't give any useful information, and I cannot make sense of the various tutorials on the web.
So,
Gi...
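For reference, a minimal sketch of two of the measures named above, for two series of equal length. The function names are illustrative and not taken from the question. Mahalanobis distance is left out because it also needs a covariance estimate, which the excerpt does not describe.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length series (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero series
    return dot / (norm_a * norm_b)
```

Note that cosine similarity ignores magnitude, so a series and a scaled copy of it score close to 1.0 even though their Euclidean distance can be large.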
given a list of purchase events (customer_id,item)
1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws
I'm trying to build a data structure that tells how many times an item was bought with another item: not bought at the same time, but bought by the same customer at any point since I started saving data. The result would look like
...
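One possible way to build such counts, sketched in Python with the sample events above. It assumes "bought with" means "bought by the same customer at any time"; the function name `co_purchase_counts` is made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

events = [
    (1, "hammer"), (1, "screwdriver"), (1, "nails"),
    (2, "hammer"), (2, "nails"),
    (3, "screws"), (3, "screwdriver"),
    (4, "nails"), (4, "screws"),
]

def co_purchase_counts(events):
    """Count, for every unordered item pair, how many customers bought both."""
    baskets = defaultdict(set)
    for customer_id, item in events:
        baskets[customer_id].add(item)

    counts = Counter()
    for items in baskets.values():
        # sorted() gives each pair one canonical order, e.g. ("hammer", "nails")
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = co_purchase_counts(events)
```

With the sample data, `counts[("hammer", "nails")]` is 2, because customers 1 and 2 both bought the two items.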
I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying:
#len(u)==201, len(v)==246
cosine_distance(u, v)
ValueError: objects are not aligned
#this works though:
cosine_distan...
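The differing lengths (201 and 246) suggest that each vector was built over its own document's vocabulary, although the excerpt does not confirm this. Under that assumption, one way around the error is to keep each document as a term-to-weight mapping and compare only on shared terms. The sketch below is not the `cosine_distance` function from the question; it is a stand-alone illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse TF-IDF vectors given as {term: weight}."""
    shared_terms = u.keys() & v.keys()  # terms missing from one side contribute 0
    dot = sum(u[t] * v[t] for t in shared_terms)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance form: 0.0 for identical direction, 1.0 for no shared terms."""
    return 1.0 - cosine_similarity(u, v)
```

Because the vectors are keyed by term, documents with different vocabulary sizes can be compared directly.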
Is it possible to perform document similarity search efficiently using Sphinx search? My index consists of 500k documents, each of which is tagged with 5-30 short, all-lowercase stemmed words, which are the data to search through. For simplicity, all tags in the database have equal weight and I'm not using phrase searching. My first a...
I am trying to create a system which would be able to find users with similar favourite movies/books/interests/etc., much like neighbours on last.fm. Users sharing the most mutual interests would have the highest match and would be displayed in user profiles (5 best matches or so).
Is there any reasonably fast way to do this? The obviou...
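As a baseline for this kind of matching, here is a brute-force sketch using Jaccard similarity over sets of favourites. The user names and data are invented for illustration, and comparing every user against every other user may well be too slow for a large site.

```python
def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(user, favourites, n=5):
    """Return up to n (other_user, score) pairs, highest score first."""
    mine = favourites[user]
    scored = [
        (other, jaccard(mine, theirs))
        for other, theirs in favourites.items()
        if other != user
    ]
    # Sort by descending score, then by name so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]

# Hypothetical sample data.
favourites = {
    "ann": {"Alien", "Brazil", "Heat"},
    "bob": {"Alien", "Heat"},
    "cy": {"Up"},
}
matches = best_matches("ann", favourites)
```

Here `matches` lists "bob" first (two shared favourites out of three in total) and "cy" second with a score of 0.0.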
I'm looking for a kind of hash function to index similar texts. For example, if we have two very long texts called "A" and "B" that differ only slightly, then the hash function (called H) applied to A and B should return the same number.
So H(A) = H(B) where A and B are similar texts.
I tried the "DoubleMetaphone" (I use ital...
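One technique sometimes used for this kind of problem is SimHash, a locality-sensitive hash. It is named here as a possible direction, not as something the question mentions. It does not guarantee H(A) = H(B); instead, similar texts tend to get fingerprints that differ in few bits. The sketch below assumes whitespace tokenisation and uses MD5 only as a convenient stable per-token hash.

```python
import hashlib

BITS = 64

def simhash(text):
    """64-bit SimHash fingerprint of the whitespace-separated tokens in text."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts would then be treated as similar when `hamming_distance(simhash(a), simhash(b))` falls below a threshold chosen for the data.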
I have a large collection of documents that already have their TF-IDF computed. I'm getting ready to add some more documents to the collection, and I am wondering if there is a way to add TF-IDF scores to the new documents without re-processing the entire database?
...
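A point worth keeping in mind: the IDF part depends on the whole collection, so adding documents changes the scores of existing documents too. One way to avoid re-reading the old text is to store raw term counts and document frequencies, then derive TF-IDF on demand. The class below is a sketch under those assumptions; the name `TfidfIndex` and the plain `tf * log(N / df)` weighting are choices made for illustration.

```python
import math
from collections import Counter

class TfidfIndex:
    """Stores counts only; TF-IDF weights are computed when requested."""

    def __init__(self):
        self.term_counts = {}      # doc_id -> Counter of term occurrences
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add(self, doc_id, tokens):
        """Add one new document; assumes doc_id has not been added before."""
        counts = Counter(tokens)
        self.term_counts[doc_id] = counts
        self.doc_freq.update(counts.keys())

    def tfidf(self, doc_id):
        """Return {term: weight} for one document using current collection stats."""
        n_docs = len(self.term_counts)
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        return {
            term: (count / total) * math.log(n_docs / self.doc_freq[term])
            for term, count in counts.items()
        }
```

Adding a document is then a cheap update of two tables, and any document's weights reflect the collection as it stands at query time.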
Hi there,
I would like to implement similarity search in MATLAB. I want to know whether that is possible.
My plan is to use two popular similarity measures, Euclidean distance and Dynamic Time Warping. Both will be applied to a time series dataset. My question at this point is how can I evaluate both of these two measurem...
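To make the second measure concrete, here is a minimal Dynamic Time Warping sketch. It is written in Python for illustration, not MATLAB, and uses the absolute difference as the local cost with no warping window; both are assumptions. Unlike Euclidean distance, it accepts series of different lengths.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched to b[j-1], previous a consumed
                cost[i][j - 1],      # b[j-1] also matched to a[i-1], previous b consumed
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[len(a)][len(b)]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0, because the repeated 2 can be absorbed by warping, whereas Euclidean distance is not even defined for these two lengths.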
How do I query for records ordered by similarity?
E.g. searching for "Stock Overflow" would return:
Stack Overflow
SharePoint Overflow
Math Overflow
Politic Overflow
VFX Overflow
E.g. searching for "LO" would return:
pabLO picasso
michelangeLO
jackson polLOck
What I need help with:
Using a search engine to index & search a MySQ...
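The first example ("Stock Overflow") is fuzzy matching, which can be illustrated in memory with Python's standard `difflib`. This is only a sketch of the ranking idea and does not address the search engine or MySQL side of the question. The helper name `rank_by_similarity` is made up.

```python
from difflib import SequenceMatcher

def rank_by_similarity(query, candidates):
    """Return candidates sorted from most to least similar to query."""
    query_lower = query.lower()

    def score(candidate):
        # ratio() is 2 * matches / total length, between 0.0 and 1.0
        return SequenceMatcher(None, query_lower, candidate.lower()).ratio()

    return sorted(candidates, key=score, reverse=True)

names = [
    "Math Overflow",
    "VFX Overflow",
    "Stack Overflow",
    "SharePoint Overflow",
    "Politic Overflow",
]
ranked = rank_by_similarity("Stock Overflow", names)
```

With these names, "Stack Overflow" comes out first. The second example ("LO") is closer to substring matching and would need a different scoring rule.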
given my input data in userid,itemid format:
raw: {userid: bytearray,itemid: bytearray}
dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)
grpd = GROUP raw BY userid;
dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})
I'd like to generate all of the combinations (order not important) of ...
What is the difference between similarity measurement in time series and similarity search in time series? I am a bit confused by these two terms.
To my understanding, similarity search is the process of obtaining similar time series using a similarity measure such as Euclidean distance, DTW, EDR, EDP, etc.
Then what is similarity...
Does someone know an easy way to find characters in Unicode that are similar to ASCII characters? An example is the "CYRILLIC SMALL LETTER DZE (ѕ)". I'd like to do a search and replace for similar characters. By similar I mean similar to a human reader: you can't see a difference by looking at it.
...
I'm looking for an algorithm that finds whether two text documents are similar, where one document is included in the other document.
I thank you in advance.
...
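One common way to test for this kind of inclusion is a containment score over word shingles: the fraction of the smaller document's shingles that also occur in the larger one. The sketch below assumes whitespace tokenisation and 3-word shingles; the function names are illustrative.

```python
def shingles(text, k=3):
    """Set of all k-word windows in text (the whole text if it is shorter than k)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(small, large, k=3):
    """Fraction of small's shingles found in large: 1.0 means fully included."""
    small_shingles = shingles(small, k)
    if not small_shingles:
        return 0.0
    return len(small_shingles & shingles(large, k)) / len(small_shingles)
```

Unlike a symmetric similarity, this score is directional: `containment(a, b)` can be 1.0 while `containment(b, a)` is much lower.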
Hi, I'm using an algorithm based on cosine similarity with TF-IDF to find whether document X is similar to document Y. But if they are similar, I want to know more about X and Y: I want to know whether X contains Y, that is, whether the information in Y is included within X. The most important aspect for me is the semantics of the inclusion, not only th...
Hello everybody!
I'm searching for a "bad" hash function:
I'd like to hash strings and put similar strings in one bucket.
Can you give me a hint where to start my research?
Some methods or algorithm names...
Thanks in advance!
Sebastian
...
Hello there. Can you help me, please? I'm having trouble writing PHP code for adjusted cosine similarity.
I have built data like this: $data[UserID][ItemID] = Rating
Data example:
$data[1][1] = 5;
$data[1][2] = 3;
$data[1][3] = 4;
$data[2][1] = 3;
$data[2][2] = 2;
$data[2][4] = 3;
$data[2][5] = 3;
$data[3][1] = 4;
$data[3][3] = 3;
$d...
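For reference, adjusted cosine similarity between two items subtracts each user's mean rating before taking the cosine, and only users who rated both items contribute. The sketch below is in Python, not PHP, and uses only the ratings visible above (the rest of the data is cut off). Taking each user's mean over all of that user's ratings is the usual convention, assumed here.

```python
import math

# Ratings copied from the visible part of the example: data[user_id][item_id]
data = {
    1: {1: 5, 2: 3, 3: 4},
    2: {1: 3, 2: 2, 4: 3, 5: 3},
    3: {1: 4, 3: 3},
}

def adjusted_cosine(data, item_a, item_b):
    """Adjusted cosine similarity between two items over co-rating users."""
    numerator = sum_sq_a = sum_sq_b = 0.0
    for ratings in data.values():
        if item_a in ratings and item_b in ratings:
            user_mean = sum(ratings.values()) / len(ratings)
            dev_a = ratings[item_a] - user_mean
            dev_b = ratings[item_b] - user_mean
            numerator += dev_a * dev_b
            sum_sq_a += dev_a * dev_a
            sum_sq_b += dev_b * dev_b
    if sum_sq_a == 0 or sum_sq_b == 0:
        return 0.0  # no co-rating users, or no variation to compare
    return numerator / math.sqrt(sum_sq_a * sum_sq_b)

similarity_1_2 = adjusted_cosine(data, 1, 2)
```

With these three users, items 1 and 2 come out at roughly -0.92, because both co-rating users rated item 1 above their own average and item 2 below it.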
In order to perform a simple clustering algorithm on results that I get from Lucene, I have to calculate Cosine similarity between 2 documents in Lucene, I also need to be able to make a centroid document to represent the centroid of each cluster.
All I can think of doing is building my own Vector Space model with tf-idf weighting, usi...
Hello there... I was wondering if you could help me optimize my code.
When I ran it on localhost, the running time was about "17 MINUTES" (a calculation with 100000 queries).
The data looks like this: $data[UserID][ItemID] = Rating ==> $data[1][1] = 5;
This is my code:
<?php
include ".......";
set_time_limit(0);
...
Good day,
I am retrieving information from various websites using cURL and various parsing techniques. I wrote the code so that I can, if desired, add additional websites to scan information from.
The information retrieved is as follows:
(Please note that the information may be inaccurate and may not reflect real price/name)
Array
(
[web...
I'm trying to find something like an audio similarity library for a school project. Something simple and well documented, preferably written in Python or Java, that could extract features from audio files and estimate any form of similarity based on these. Something like this code could also be fine but I think I don't have the skill to ...