I've had some success comparing strings using the PHP levenshtein function.
However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings.
For example:
levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences
are treated as having less in common ...
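One possible workaround (not taken from the truncated question) is to compare the strings as collections of words rather than as raw character sequences. The sketch below is in Python rather than PHP, and `sorted_token_distance` is a hypothetical helper name. It lowercases and sorts the words of each string before running an ordinary Levenshtein distance, so swapped words no longer count as edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def sorted_token_distance(a: str, b: str) -> int:
    """Edit distance after sorting each string's words (hypothetical helper)."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return levenshtein(norm_a, norm_b)
```

For the two example strings the sorted forms are identical, so `sorted_token_distance` returns 0. The trade-off is that word order is ignored entirely, which may or may not be acceptable.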
Hello!
To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance".
My question: Do you have experience with these algorithms? Which one gives you better results?
In addition to that: Could you...
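To make the difference between the two measures concrete, here is a minimal Python sketch, assuming whitespace tokenization and raw term counts (both are assumptions, not details from the question). Cosine similarity compares the direction of the two frequency vectors, while Hamming distance counts the vocabulary positions where the counts differ.

```python
import math
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Raw term counts; tokenization by whitespace is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a: Counter, tf_b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hamming_distance(tf_a: Counter, tf_b: Counter) -> int:
    """Number of vocabulary positions where the two count vectors differ."""
    vocabulary = tf_a.keys() | tf_b.keys()
    return sum(1 for t in vocabulary if tf_a[t] != tf_b[t])
```

With these definitions, a document and the same document repeated twice have cosine similarity 1 but a non-zero Hamming distance, so cosine is usually the less length-sensitive choice for term-frequency vectors.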
I am currently using similar_text to compare a string against a list of ~50,000 strings. It works, but because of the number of comparisons it is very slow: comparing ~500 unique strings takes around 11 minutes.
Before running this I do check the databases to see whether it has been processed in the past, so every time after the initial run i...
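One general way to cut the cost of this kind of scan is to reject most candidates with a cheap upper bound before running the full comparison. The sketch below swaps in Python's `difflib.SequenceMatcher` for PHP's `similar_text` (a different algorithm, used here only for illustration); `best_match` and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(query, candidates, threshold=0.6):
    """Return (candidate, score) for the most similar candidate, or (None, 0.0)."""
    matcher = SequenceMatcher(autojunk=False)
    matcher.set_seq2(query)  # SequenceMatcher caches details about seq2
    best, best_score = None, 0.0
    for candidate in candidates:
        matcher.set_seq1(candidate)
        floor = max(best_score, threshold)
        # Both quick ratios are upper bounds on ratio(), so a candidate
        # that fails them cannot beat the current best.
        if matcher.real_quick_ratio() < floor or matcher.quick_ratio() < floor:
            continue
        score = matcher.ratio()
        if score >= threshold and score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

The pre-filter only helps when most candidates are clearly dissimilar; how much time it saves on a real list of ~50,000 strings would need to be measured.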
Hello, I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:
addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'
addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'
I'm planning on applying some string transformation to make long words abbrev...
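A minimal Python sketch of that kind of normalization follows. The abbreviation table and the `normalize_address` name are hypothetical, and a real table would be much longer. It strips punctuation, drops ordinal suffixes such as "TH", and maps long words to abbreviations, after which duplicates can be detected by plain equality or by a similarity score.

```python
import re

# Hypothetical abbreviation table; extend as needed.
ABBREVIATIONS = {
    "SOUTH": "S", "NORTH": "N", "EAST": "E", "WEST": "W",
    "AVENUE": "AVE", "STREET": "ST",
}


def normalize_address(addr: str) -> str:
    """Canonicalize an address string before comparing it."""
    addr = addr.upper()
    addr = re.sub(r"[^A-Z0-9 ]", " ", addr)               # drop '#', '-', etc.
    addr = re.sub(r"(\d+)(ST|ND|RD|TH)\b", r"\1", addr)   # 48TH -> 48
    tokens = [ABBREVIATIONS.get(token, token) for token in addr.split()]
    return " ".join(tokens)
```

With this table, addr_1 and addr_2 both normalize to `3 FAIRMONT LINK S`, and addr_3 and addr_4 both normalize to `5703 48 AVE`.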
Hi,
I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform interdocument similarity testing in various metrics, like TF-IDF, Okapi, language models, LSA, etc.
I want as a result a document similarity matrix, i.e. doc1 is x% similar to doc2, etc... This is for research purposes, not for pr...
Hi All,
I am using TF/IDF to calculate similarity. For example, suppose I have the following two docs.
Doc A => cat dog
Doc B => dog sparrow
I would expect their similarity to be 50%, but when I calculate the TF/IDF I get the following:
Tf values for Doc A
dog tf = 0.5
cat tf = 0.5
Tf values for Doc B
dog tf = 0.5
sparrow tf = 0.5
IDF values for D...
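The rest of the question is cut off, so the following is only a sketch under a common textbook definition, idf = ln(N / df), which is an assumption about what the asker used. With only two documents, a term that appears in both ("dog") gets idf = ln(2/2) = 0, so the one shared term contributes nothing to the TF-IDF vectors.

```python
import math
from collections import Counter

docs = {"A": ["cat", "dog"], "B": ["dog", "sparrow"]}


def tf(tokens):
    """Relative term frequency within one document."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def idf(all_docs):
    """idf = ln(N / df); this exact variant is an assumption."""
    n = len(all_docs)
    df = Counter(term for tokens in all_docs.values() for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


weights = idf(docs)
tf_a, tf_b = tf(docs["A"]), tf(docs["B"])
tfidf_a = {t: w * weights[t] for t, w in tf_a.items()}
tfidf_b = {t: w * weights[t] for t, w in tf_b.items()}

tf_only_similarity = cosine(tf_a, tf_b)       # 0.5
tfidf_similarity = cosine(tfidf_a, tfidf_b)   # 0.0, since idf("dog") = 0
```

Under these assumptions, the raw term-frequency cosine is 0.5 while the TF-IDF cosine is 0. Smoothed IDF variants, or a larger corpus, avoid zeroing out the shared term.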
I'm writing a piece of Java software that has to make the final judgement on the similarity of two documents encoded in UTF-8.
The two documents are very likely to be the same, or slightly different from each other, because they have many features in common like date, location, creator, etc., but their text is what decides if they reall...
Hi!
I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.
Is there some kind of framework which can perform such a search? I have something in mind where the search algorithm will r...
I have the following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";
Is there any idea or standard algorithm to calculate the percentage of similarity?
For instance, in the above case, the similarity estimated...
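One standard measure that yields a percentage is the Jaccard index over the sets of words: shared words divided by all distinct words. The sketch below is in Python rather than Java, and lowercasing plus whitespace tokenization are assumptions.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Shared distinct words divided by all distinct words (case-insensitive)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty strings are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"
b = "Web Crawler computer program browses the World Wide Web"
similarity = jaccard_similarity(a, b)  # 8 shared words / 13 distinct words
```

For the two strings above this gives 8/13, roughly 61.5%. Other measures such as cosine similarity or edit distance will give different percentages for the same pair.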
Hi;
I have n documents and want to find common words that are included in these documents.
For example I want to say (n-3) documents include the word "web".
Certainly I can do this with basic data structures, but there may be an efficient algorithm or a way to handle the same words with different suffixes.
Is there any algorithm for such purposes?...
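The usual names for these two pieces are document frequency (in how many documents a word occurs) and stemming (reducing words to a common base form). Below is a minimal Python sketch; `crude_stem` is a deliberately naive, hypothetical stand-in for a real stemmer such as Porter's.

```python
from collections import Counter


def crude_stem(word: str) -> str:
    """Strip a few English suffixes; a naive placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def document_frequency(documents):
    """Map each stem to the number of documents that contain it."""
    df = Counter()
    for text in documents:
        # A set ensures each stem is counted once per document.
        stems = {crude_stem(word) for word in text.lower().split()}
        df.update(stems)
    return df
```

Given n documents, `document_frequency(docs)["web"]` is then the number of documents containing "web" or a suffixed variant such as "webs", and it can be compared against n - 3 or any other cut-off.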
Hello,
Part of a process requires applying string similarity algorithms.
The results of this process will be stored and produce, let's say, SS_Dataset.
Based on this Dataset, further decisions will have to be made.
My questions are:
Should I apply one or more string similarity algorithms to produce SS_Dataset?
Any comparisons...
I have a need to match cold leads against a database of our clients.
The leads come from a third party provider in bulk (thousands of records) and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to an established client.
Obviously, there are misspellings in the leads. Charles become...
I am using getSimilarity(String s1, String s2) from the library : uk.ac.shef.wit.simmetrics.similaritymetrics.CosineSimilarity; to get the cosine similarity between two strings.
Well, the problem is that when I pass two strings to compare directly from the XML, it just hangs; the program doesn't exit. The same thing I do by assigning the ...
I'm using this piece of Java code to find similar strings:
if( str1.indexOf(str2) >= 0 || str2.indexOf(str1) >= 0 ) .......
But with str1 = "pizzabase" and str2 = "namedpizzaowl" it doesn't work.
How do I find the common substrings, i.e. "pizza"?
...
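This is the longest common substring problem, which has a well-known dynamic-programming solution. The sketch below is in Python rather than Java (the same table-filling logic ports directly), and the function name is hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of characters that appears in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

For "pizzabase" and "namedpizzaowl" this returns "pizza". Running time is proportional to the product of the two lengths, which is fine for short strings.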
I have a string that I want to compare against an array of strings, and return the array element that most closely matches.
I can write a sliding correlator that counts the number of matching characters at each step and returns the max correlation. But is there a better way?
For example:
control_string = drv_probability_1_max
List:...
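As one alternative to a hand-written sliding correlator, Python's standard library has `difflib.get_close_matches`, which ranks candidates by a sequence-similarity ratio. The list in the question is truncated, so the candidates in the usage note below are hypothetical, as is the `closest` helper name.

```python
from difflib import get_close_matches


def closest(control: str, candidates):
    """Return the candidate most similar to control, or None if there are none."""
    # cutoff=0.0 keeps every candidate in play; n=1 returns only the best one.
    matches = get_close_matches(control, candidates, n=1, cutoff=0.0)
    return matches[0] if matches else None
```

For example, with hypothetical candidates `["drv_probability_1_min", "engine_temp_max", "brake_pressure"]`, `closest("drv_probability_1_max", ...)` picks `drv_probability_1_min`.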
Hello!
I wrote this line of code to highlight the words that are searched for.
$tekst = preg_replace("/($searchstr)/i", '<span style="color: #8fb842; font-weight: bold;">$1</span>', $tekst);
But my problem is that when I set $searchstr = '?';, the span gets inserted between every letter in the $tekst string.
The whole script is:
//...
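The likely cause is that the search string is used as a regular expression without escaping, so characters like `?` are read as regex syntax instead of literal text. PHP provides `preg_quote` for this; the sketch below shows the same idea in Python with `re.escape`. The `highlight` helper is hypothetical, and the `#` in the colour is added on the assumption that a hex colour was intended.

```python
import re


def highlight(text: str, search: str) -> str:
    """Wrap case-insensitive literal matches of search in a styled span."""
    if not search:
        return text  # an empty pattern would match between every character
    # re.escape makes characters such as '?' match literally.
    pattern = re.compile("(" + re.escape(search) + ")", re.IGNORECASE)
    replacement = r'<span style="color: #8fb842; font-weight: bold;">\1</span>'
    return pattern.sub(replacement, text)
```

With the escape in place, searching for `?` only wraps actual question marks in the text.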
I'll explain my problem:
I have a database table called country. It has two columns: ID and name.
When I search for 'paris' but misspell the word as 'pares' ('e' instead of 'i'), I get no results from the DB.
I want the system to suggest similar words that could help in the search.
So, I am looking for help writing a s...
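A minimal sketch of such a suggestion step, assuming the names have already been loaded from the table into a list (the sample names, the `suggest` helper, and the 0.7 cutoff are all hypothetical). It uses Python's `difflib.get_close_matches` and compares case-insensitively.

```python
from difflib import get_close_matches


def suggest(term: str, names, limit: int = 3, cutoff: float = 0.7):
    """Return up to limit stored names that resemble the search term."""
    by_lower = {name.lower(): name for name in names}
    hits = get_close_matches(term.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]
```

For instance, `suggest("pares", ["Paris", "Peru", "Poland", "Panama"])` returns `["Paris"]`. For large tables, doing the fuzzy match inside the database would avoid loading every name into memory.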