deduplication

Single Instance Storage layers

Hi, I have a data storage requirement which is an excellent candidate for single instance storage and deduplication. Can anyone suggest any .NET-compatible libraries or systems which handle SIS and deduplication, either with SQL Server as an actual back end or with their own high-performance storage engine? What have people's experiences bee...

T-SQL Query Results Not as Expected Deduplication

Hi Guys, I am attempting to get all records where an Id field exists more than once. The trouble is that my query is returning nothing and I have no idea why, and this is the only method I know. Some more information: There are up to 8 of the same Order Numbers. Each set is grouped by ProcessOrder, I require the lowest value of these b...

Scrape unique image URLs from HTML

Using PHP to curl a web page (some URL entered by user, let's assume it's valid). Example: http://www.youtube.com/watch?v=Hovbx6rvBaA I need to parse the HTML and extract all de-duplicated URLs that seem like an image. Not just the ones in img src="" but any URL ending in jpe?g|bmp|gif|png, etc. on that page. (In other words, I don't ...
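The extraction described above can be sketched as follows. This is a minimal illustration in Python rather than the asker's PHP, and the names `IMAGE_URL_RE` and `unique_image_urls` are hypothetical. It assumes a regular expression over the raw HTML is acceptable (it matches any token ending in one of the listed extensions, not only `img src` values) and that relative URLs should be resolved against the page URL before de-duplication.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: a run of non-delimiter characters ending in an image
# extension, followed by a delimiter, a query/fragment marker, or end of input.
IMAGE_URL_RE = re.compile(
    r"""[^\s"'<>()]+\.(?:jpe?g|bmp|gif|png)(?=$|[\s"'<>()?#])""",
    re.IGNORECASE,
)


def unique_image_urls(html, base_url):
    """Return image-like URLs found in html, de-duplicated, in first-seen order."""
    seen = set()
    result = []
    for match in IMAGE_URL_RE.finditer(html):
        # Resolve relative references so "/a.png" and the absolute form compare equal.
        url = urljoin(base_url, match.group(0))
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

A set tracks what has been seen while a list preserves the order of first appearance; if order does not matter, returning the set directly would also work.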

Advice and tools to help normalize a database

I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes the same school appears in multiple tables, again, with small differences in its name or ...

How to store bidirectional relationships

I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance. However, I am not sure how to store the relationships. I use databases all the time but have never come across this situation and wondered if someone could point me in the right direction. What confuses me is how to store the...
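One common way to store a symmetric relationship is to keep each pair only once in a canonical order (smaller id first), so that (A, B) and (B, A) map to the same row. The sketch below shows that idea using Python's built-in `sqlite3` module. The table name `duplicate_pair`, its columns, and the helper functions are all hypothetical, and it assumes customers are identified by integer ids.

```python
import sqlite3


def canonical_pair(id_a, id_b):
    """Order a pair so the smaller id always comes first."""
    if id_a == id_b:
        raise ValueError("a record cannot be a duplicate of itself")
    return (id_a, id_b) if id_a < id_b else (id_b, id_a)


def create_table(conn):
    # The CHECK constraint enforces the canonical order at the database level,
    # and the primary key prevents the same pair from being stored twice.
    conn.execute(
        """CREATE TABLE duplicate_pair (
               low_id   INTEGER NOT NULL,
               high_id  INTEGER NOT NULL,
               distance INTEGER NOT NULL,
               PRIMARY KEY (low_id, high_id),
               CHECK (low_id < high_id)
           )"""
    )


def record_pair(conn, id_a, id_b, distance):
    low, high = canonical_pair(id_a, id_b)
    conn.execute(
        "INSERT OR IGNORE INTO duplicate_pair VALUES (?, ?, ?)",
        (low, high, distance),
    )


def duplicates_of(conn, customer_id):
    """Look on both sides of the pair, since the id may be stored in either column."""
    rows = conn.execute(
        "SELECT high_id FROM duplicate_pair WHERE low_id = ? "
        "UNION SELECT low_id FROM duplicate_pair WHERE high_id = ?",
        (customer_id, customer_id),
    )
    return sorted(row[0] for row in rows)
```

The trade-off is that every lookup has to check both columns. Storing both directions as separate rows would simplify reads at the cost of doubling the rows and keeping them in sync.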

postgresql: Finding the ids of rows that contain case-insensitive string duplication.

I want to select and then delete a list of entries in my tables that have case-insensitive duplications. In other words, there are these rows that are unique... ..but they're not unique if you ignore case. They got in while I wasn't watching. So how can I s...
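The selection logic can be illustrated outside the database as follows. This is a sketch in Python, not a PostgreSQL query. The function name `case_insensitive_duplicate_ids` is hypothetical, and it assumes each row is an `(id, text)` pair and that the row with the lowest id in each group is the one to keep. Python's `str.casefold` may also not match PostgreSQL's `lower()` for every character.

```python
from collections import defaultdict


def case_insensitive_duplicate_ids(rows):
    """Given (id, text) pairs, return the ids to delete.

    Rows are grouped by their case-folded text; within each group the
    lowest id is kept (an assumption) and the rest are reported.
    """
    groups = defaultdict(list)
    for row_id, text in rows:
        groups[text.casefold()].append(row_id)

    to_delete = []
    for ids in groups.values():
        ids.sort()
        to_delete.extend(ids[1:])  # everything after the kept (lowest) id
    return sorted(to_delete)
```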

De-dupe NSArray of NSDictionaries based on specific keys

Hi, I am attempting to de-dupe an NSArray of NSDictionaries based on specific keys in the dictionaries. What I have looks something like this: NSDictionary *person1 = [NSDictionary dictionaryWithObjectsAndKeys:@"John", @"firstName", @"Smith", @"lastName", @"7898", @"employeeID", nil]; NSDictionary *person2 = [NSDictionary dictionaryWithObject...
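The key-based de-duplication described above can be sketched as follows. This is an illustration in Python rather than Objective-C, and the function name `dedupe_by_keys` is hypothetical. It assumes the first record seen for a given combination of key values is the one to keep and that the values under those keys are hashable.

```python
def dedupe_by_keys(records, keys):
    """Keep the first record for each distinct combination of the given keys."""
    seen = set()
    unique = []
    for record in records:
        # Build an identity from only the keys that matter for equality.
        identity = tuple(record.get(key) for key in keys)
        if identity not in seen:
            seen.add(identity)
            unique.append(record)
    return unique
```

For example, calling it with `keys=("firstName", "lastName")` treats two people with the same name as duplicates even when their `employeeID` values differ.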