similarity

Identifying if 2 HTML pages are similar

I'm trying to identify differences between a base case and a supplied case. I'm looking for a library that tells me the similarity as a percentage or something like that. For example: I have 10 different HTML pages. All of them are 404 responses with only 1 or 2 lines of random code (such as the time or a quote of the day). Now when I supply a new 404 pag...

Algorithm to find similar text

I have many articles in a database (with title and text). I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question. I tried googling for this but only found pages about other "similar text" issues, something like comparing every article with all the others...

Word comparison algorithm

I am building a CSV import tool for the project I'm working on. The client needs to be able to enter the data in Excel, export it as CSV, and upload it to the database. For example, I have this CSV record: 1, John Doe, ACME Comapny (the typo is on purpose). Of course, the companies are kept in a separate table and linked with...

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings. Example: file:///home/gms8994/Music/t.A.T.u./ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/ Ideally, I'd get back file:///home/gms8994/Music/, because that's th...
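A minimal sketch of one way to approach this, in Python, assuming the shared portion is always a leading prefix (as in the example paths above) rather than an arbitrary substring. The helper name `common_directory_prefix` is hypothetical; `os.path.commonprefix` is a standard-library call that compares character by character, so the result is trimmed back to the last "/" to avoid returning half a directory name.

```python
import os.path

# Example paths copied from the question.
paths = [
    "file:///home/gms8994/Music/t.A.T.u./",
    "file:///home/gms8994/Music/nina%20sky/",
    "file:///home/gms8994/Music/A%20Perfect%20Circle/",
]


def common_directory_prefix(strings):
    """Return the longest shared leading portion, cut at a "/" boundary."""
    # commonprefix works on characters, not path components.
    prefix = os.path.commonprefix(strings)
    # Trim to the last "/" so a partial name such as ".../Music/A" is never returned.
    # If there is no "/", rfind gives -1 and the slice is the empty string.
    return prefix[: prefix.rfind("/") + 1]


result = common_directory_prefix(paths)
print(result)
```

If the common portion can sit in the middle of the strings, this prefix approach does not apply and a longest-common-substring method would be needed instead.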

Textual Irregularities

Does anybody know of a library or piece of software out there that will locate irregularities in text? For example, let's say I have... 1. Name 1, Comment 2. Name 2, Comment 3. Name 3 , Comment 5. Name 10, Comment This software or library would first cut up portions of text that it would find similar (much like a piece of compression...

Calculating Binary Data Similarity

I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbit...

A better similarity ranking algorithm for variable length strings

I'm looking for a string similarity algorithm that yields better results on variable-length strings than the ones that are usually suggested (Levenshtein distance, Soundex, etc.). For example, given string A: "Robert", string B: "Amy Robertson" would be a better match than string C: "Richard". Also, preferably, this algorithm sh...

Determining if two or more summaries are similar

The problem is as follows: I have one summary, usually between 20 and 50 words, that I'd like to compare to other relatively similar summaries. The general category and the geographical location to which the summary refers are already known. For instance, if people from the same area are writing about building a house, I'd like to be...

Is there a way to determine the similarity of two PDFs without a detailed content comparison?

I want to know the similarity of two PDF files, but I don't want to do a detailed content comparison. Is there any solution that works just from their external structure? Is it possible? Thanks! ...

Algorithm for similarity (of topic) of news items

I want to determine the similarity of the content of two news items, similar to Google News but different in the sense that I want to be able to determine what the basic topics are and then determine what topics are related. So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business...

C# comparing similar strings

Hi, I have a generic list with some filenames (LIST1) and another big generic list with a full list of names (LIST2). I need to match names from LIST1 to similar ones in LIST2. For example: LIST1 - **MAIZE_SLIP_QUANTITY_3_9.1.aif** LIST2 1- TUTORIAL_FAILURE_CLINCH_4.1.aif 2- **MAIZE_SLIP_QUANTITY_3_5.1.aif** 3- **MAIZE_SLIP_QUANTITY_3_9.2.aif** ...

Cosine similarity vs Hamming distance

Hello! To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "Cosine similarity" and "Hamming distance". My question: Do you have experience with these algorithms? Which one gives you better results? In addition to that: Could you...

Visual similarity search algorithm

I'm trying to build a utility like this http://labs.ideeinc.com/multicolr, but I don't know which algorithm they are using. Does anyone know? ...

How can I measure the similarity between 2 strings?

Given two strings text1 and text2 public SOMEUSABLERETURNTYPE Compare(string text1, string text2) { // DO SOMETHING HERE TO COMPARE } Examples: First String: StackOverflow Second String: StaqOverflow Return: Similarity is 91% The return can be in % or something like that. First String: The simple text test Second String: Th...
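One common way to get a percentage-style score is to normalize an edit distance by the length of the longer string. The Python sketch below assumes Levenshtein distance as the metric; the question does not say which metric produced its 91% figure, so the numbers from this sketch will not necessarily match it. The function names `levenshtein` and `similarity` are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity(text1: str, text2: str) -> float:
    """Score in [0, 1]: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(text1), len(text2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(text1, text2) / longest


print(f"{similarity('StackOverflow', 'StaqOverflow'):.0%}")
```

Normalizing by the longer string keeps the score between 0 and 1; other normalizations (for example, by the sum of both lengths) give different percentages for the same pair.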

Similarity of two texts (adaptive local alignment of keywords?)

Hi! I have 2 texts (max 4000 characters) of different lengths, and I need to get a similarity rate based on (partial) paraphrasing. Please note that the same portion of text can be in a different position in each text (so Levenshtein is not the solution). The comparison process should also: not increase exponentially with text size, be performance ...

Pearson Similarity Score, how can I optimise this further?

I have an implementation of Pearson's similarity score for comparing two dictionaries of values. More time is spent in this method than anywhere else (potentially many millions of calls), so this is clearly the critical method to optimise. Even the slightest optimisation could have a big impact on my code, so I'm keen to explore even the s...
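For reference, a baseline Pearson score over two dictionaries might look like the Python sketch below. This is not the asker's implementation, which is not shown in the excerpt; it assumes the score is computed only over keys present in both dictionaries and that values are numeric. The function name `pearson` and the sample ratings are made up for illustration.

```python
from math import sqrt


def pearson(a: dict, b: dict) -> float:
    """Pearson correlation over the keys that both dictionaries share."""
    shared = [key for key in a if key in b]
    n = len(shared)
    if n == 0:
        return 0.0  # assumption: no overlap is treated as no correlation

    # Single pass over the shared keys to accumulate the required sums.
    sum_a = sum_b = sum_a2 = sum_b2 = sum_ab = 0.0
    for key in shared:
        x, y = a[key], b[key]
        sum_a += x
        sum_b += y
        sum_a2 += x * x
        sum_b2 += y * y
        sum_ab += x * y

    numerator = sum_ab - (sum_a * sum_b) / n
    var_a = sum_a2 - (sum_a * sum_a) / n
    var_b = sum_b2 - (sum_b * sum_b) / n
    # Clamp at zero so rounding error never feeds sqrt a negative value.
    denominator = sqrt(max(var_a * var_b, 0.0))
    if denominator == 0:
        return 0.0  # assumption: a constant series has no defined correlation
    return numerator / denominator


# Hypothetical sample data.
ratings_1 = {"x": 1, "y": 2, "z": 3}
ratings_2 = {"x": 2, "y": 4, "z": 6}
print(pearson(ratings_1, ratings_2))
```

Accumulating all five sums in one loop avoids repeated passes over the shared keys, which is usually the first thing to check before looking at finer optimisations.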

How to spot and analyse similar patterns like Excel does?

You know the functionality in Excel where you type 3 rows with a certain pattern and drag the column all the way down, and Excel tries to continue the pattern for you. For example, type... test-1 test-2 test-3 and Excel will continue it with: test-4 test-5 test-n... The same works for some other patterns such as dates and so on. I'm trying...

Find a similarity of two vector shapes

Looking for any information/algorithms relating to comparing vector graphics. E.g. say there are two point collections or vector files with two almost identical figures. I want to determine that the first figure is about 90% similar to the second one. ...

Speed up text comparisons (feature vectors) with spatial MySQL features

I have a function that takes two arrays containing the tokens/words of two texts and returns the cosine similarity value, which shows the relationship between the two texts. The function takes an array $tokensA (0=>house, 1=>bike, 2=>man) and an array $tokensB (0=>bike, 1=>house, 2=>car) and calculates the similarity which is given back ...
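To make the calculation concrete, here is a minimal Python sketch of cosine similarity over token lists, using the two example arrays from the question. It is written in Python rather than the question's PHP and is not the asker's function; it assumes raw term frequencies as vector components, and the name `cosine_similarity` is illustrative.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = sqrt(sum(count * count for count in freq_a.values()))
    norm_b = sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Token arrays taken from the question's example.
score = cosine_similarity(["house", "bike", "man"], ["bike", "house", "car"])
print(round(score, 4))
```

For these inputs two of the three tokens overlap, so the dot product is 2 and each vector has length sqrt(3), giving 2/3.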

Tips to show similarities in files

In a project, I found some CSS files that "smell" like there are copy-pasted rules in them. I wonder what your strategies are for detecting copy-pasted stuff in files. Just out of curiosity, I'd like to hear your tips and tricks for showing file similarities! ...