string-similarity

Levenshtein distance: how to better handle words swapping positions?

I've had some success comparing strings using the PHP levenshtein function. However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings. For example: levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences are treated as having less in common ...
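One way to make the comparison above less sensitive to word order is to sort the words before measuring edit distance. The following is a minimal Python sketch rather than the PHP function from the question: `levenshtein` and `token_sort_distance` are hypothetical helper names, and lowercasing plus whitespace splitting are assumptions about what counts as a word.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def token_sort_distance(a: str, b: str) -> int:
    """Edit distance after lowercasing and sorting the words (assumed normalisation)."""
    def normalise(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return levenshtein(normalise(a), normalise(b))


# Raw distance is large because whole words have moved.
print(levenshtein("The quick brown fox", "brown quick The fox"))
# After sorting the words, the two strings are identical.
print(token_sort_distance("The quick brown fox", "brown quick The fox"))  # 0
```

Sorting discards word order entirely, so this only suits cases where two strings that differ purely in word order should count as the same.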

Cosine similarity vs Hamming distance

Hello! To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance". My question: Do you have experience with these algorithms? Which one gives you better results? In addition to that: Could you...

Speeding up levenshtein / similar_text in PHP

I am currently using similar_text to compare a string against a list of ~50,000, which works, although due to the number of comparisons it's very slow. It takes around 11 minutes to compare ~500 unique strings. Before running this I do check the databases to see whether it has been processed in the past, so every time after the initial run i...

strategies for finding duplicate mailing addresses

Hello, I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses: addr_1 = '# 3 FAIRMONT LINK SOUTH' addr_2 = '3 FAIRMONT LINK S' addr_3 = '5703 - 48TH AVE' addr_4 = '5703- 48 AVENUE' I'm planning on applying some string transformation to make long words abbrev...

Package to compare LSA, TFIDF, Cosine metrics and Language Models

Hi, I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform interdocument similarity testing in various metrics, like tfidf, okapi, language models, lsa, etc. I want as a result a document similarity matrix, i.e. doc1 is x% similar to doc2, etc... This is for research purposes, not for pr...

tf idf similarity problem

Hi all, I am using TF/IDF to calculate similarity. For example, suppose I have the following two docs: Doc A => cat dog, Doc B => dog sparrow. Normally their similarity would be 50%, but when I calculate their TF/IDF, it is as follows. TF values for Doc A: dog tf = 0.5, cat tf = 0.5. TF values for Doc B: dog tf = 0.5, sparrow tf = 0.5. IDF values for D...
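The 50% expectation above can be checked against the plain term-frequency vectors quoted in the question. This is a minimal Python sketch of cosine similarity over sparse vectors stored as dictionaries; `cosine_similarity` is a hypothetical helper, and the IDF weights are left out because the excerpt is cut off before giving them.

```python
import math


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_u * norm_v)


# TF values exactly as quoted in the question.
doc_a = {"cat": 0.5, "dog": 0.5}
doc_b = {"dog": 0.5, "sparrow": 0.5}

print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.5
```

With TF alone the score is 0.5. Multiplying each term by an IDF weight changes the vectors, so the TF/IDF score generally differs from this figure.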

Text similarity function for strict document similarity

I'm writing a piece of Java software that has to make the final judgement on the similarity of two documents encoded in UTF-8. The two documents are very likely to be the same, or slightly different from each other, because they have many features in common like date, location, creator, etc., but their text is what decides if they reall...

Comparing strings with tolerance

Hi! I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on. Is there some kind of framework which can perform such a search? I have in mind a search algorithm that will r...

Percentage Similarity Analysis (Java)

I have the following situation: String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"; String b = "Web Crawler computer program browses the World Wide Web"; Is there any standard algorithm to calculate the percentage of similarity? For instance, in the above case, the similarity estimated...
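One common way to turn two sentences like these into a percentage is the Jaccard index over their word sets. The sketch below is in Python rather than the Java of the question and is only one possible measure. `jaccard_similarity` is a hypothetical helper, and case folding plus whitespace tokenisation are assumptions.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Shared distinct words divided by all distinct words (case-insensitive)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty strings are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"
b = "Web Crawler computer program browses the World Wide Web"

print(f"{jaccard_similarity(a, b):.1%}")  # 61.5%
```

Here the two strings share 8 distinct words out of 13 overall. The measure ignores word order and repeated words, so a different metric is needed if those matter.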

Detecting similar words among n text documents

Hi, I have n documents and want to find common words that are included in these documents. For example, I want to say that (n-3) documents include the word "web". Certainly I can do this with basic data structures, but there may be an efficient algorithm or a way to handle the same words with different suffixes. Is there any algorithm for such purposes?...

Advice on String Similarity Metrics (Java). Distance, sounds like or combo?

Hello, part of a process requires applying string similarity algorithms. The results of this process will be stored and produce, let's say, SS_Dataset. Based on this dataset, further decisions will have to be made. My questions are: Should I apply one or more string similarity algorithms to produce SS_Dataset? Any comparisons...

Is there a way to filter a django queryset based on string similarity (a la python difflib)?

I need to match cold leads against a database of our clients. The leads come from a third-party provider in bulk (thousands of records) and sales is asking us to (in their words) "filter out our clients" so they don't try to sell our service to an established client. Obviously, there are misspellings in the leads. Charles become...

Java Cosine Similarity Error

I am using getSimilarity(String s1, String s2) from the library uk.ac.shef.wit.simmetrics.similaritymetrics.CosineSimilarity to get the cosine similarity between two strings. The problem is that when I pass two strings to compare directly from the XML, it just hangs and the program doesn't exit. The same thing I do by assigning the ...

Method to find similar substrings from two strings

I'm using this piece of Java code to find similar strings: if( str1.indexOf(str2) >= 0 || str2.indexOf(str1) >= 0 ) ....... but with str1 = "pizzabase" and str2 = "namedpizzaowl" it doesn't work. How do I find the common substrings, i.e. "pizza"? ...
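The containment check above only succeeds when one string sits entirely inside the other. Finding a shared fragment such as "pizza" is the longest common substring problem. The sketch below is a minimal Python version of the usual dynamic-programming approach, not the Java from the question, and `longest_common_substring` is a hypothetical helper name.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears contiguously in both strings."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside `a`
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_len:best_end]


print(longest_common_substring("pizzabase", "namedpizzaowl"))  # pizza
```

This runs in time proportional to the product of the two lengths and returns the first longest match found. A minimum length threshold on the result is one way to decide whether two strings count as similar.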

How can I find out if two strings are mostly equal (in perl)?

I have a string that I want to compare against an array of strings, and return the array element that most closely matches. I can write a sliding correlator that counts the number of matching characters at each step and returns the max correlation. But is there a better way? For example: control_string = drv_probability_1_max List:...

Problems with preg_replace and ? (question mark) - what to do?

Hello! I wrote this line of code to highlight the words that are searched for. $tekst = preg_replace("/($searchstr)/i", '<span style="color: 8fb842; font-weight: bold;">$1</span>', $tekst); But my problem is that when I set $searchstr = '?'; the highlighting is inserted between every letter in the $tekst string. The whole script is: //...

How to find a similar word for a misspelled one in PHP?

I'll explain my problem: I have a database table called country. It has two columns: ID and name. When I want to search for 'paris' but misspell the word as 'pares' ('e' instead of 'i'), I won't get any result from the DB. I want the system to suggest similar words that could help in the search. So, I am looking for help writing a s...
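The kind of suggestion described above can be sketched with Python's standard `difflib` module instead of PHP. The candidate list stands in for the `name` column and is hypothetical, as is the `suggest` helper. The 0.6 cutoff is `difflib`'s default and is not tuned for this data.

```python
import difflib

# Hypothetical stand-in for the values of the `name` column.
names = ["paris", "london", "rome", "madrid"]


def suggest(query: str, candidates: list, limit: int = 3, cutoff: float = 0.6) -> list:
    """Return up to `limit` candidates whose similarity ratio to `query` is at least `cutoff`."""
    return difflib.get_close_matches(query.lower(), candidates, n=limit, cutoff=cutoff)


print(suggest("pares", names))  # ['paris']
```

Comparing the query against every row works for small tables. For larger ones, narrowing the candidates first (for example by first letter or length) keeps the number of comparisons manageable.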