You can try either SOUNDEX
or DIFFERENCE
functions to help match string literals.
Example:
select difference('peter.green', 'Green, Peter')
returns 2
, whereby:
The integer returned is the number of
characters in the SOUNDEX values that
are the same. The return value ranges
from 0 through 4: 0 indicates weak or
no similarity, and 4 indicates strong
similarity or the same values.
See SOUNDEX and DIFFERENCE topics on MSDN.
Update:
Soundex & Difference may not function well when the order of words are considered, but if you have full text indexing capabilities installed, you don't need to create an index to use the word breaking and parsing capabilities of the full text engine. Assuming you're using SQL Server 2008, the following function will return you a list of normalised terms:
SELECT * FROM sys.dm_fts_parser('"Peter Green"', 1033, 0, 0)
Through which you can CROSS APPLY
to the remainder of your query.
See the sys.dm_fts_parser topic & Section K. Using Apply within the FROM topic for more info.
Example: (SQL Server Enterprise 2008 with Full Text Engine enabled)
if not OBJECT_ID('Names1', 'Table') is null drop table names1
if not OBJECT_ID('Names2', 'Table') is null drop table names2
create table Names1
(
id int identity(0, 1),
name nvarchar(128)
)
insert into Names1 (name) values ('Green, Peter')
insert into Names1 (name) values ('Smith, Peter')
insert into Names1 (name) values ('Aadland, Beverly')
insert into Names1 (name) values ('Aalda, Mariann')
insert into Names1 (name) values ('Aaliyah')
insert into Names1 (name) values ('Aames, Angela')
insert into Names1 (name) values ('Aames, Willie')
insert into Names1 (name) values ('Aaron, Caroline')
insert into Names1 (name) values ('Aaron, Quinton')
insert into Names1 (name) values ('Aaron, Victor')
insert into Names1 (name) values ('Abbay, Peter')
insert into Names1 (name) values ('Abbott, Dorothy')
insert into Names1 (name) values ('Abbott, Bruce')
insert into Names1 (name) values ('Abbott, Bud')
insert into Names1 (name) values ('Abbott, Philip')
insert into Names1 (name) values ('Abdoo, Rose')
insert into Names1 (name) values ('Abdul, Paula')
insert into Names1 (name) values ('Abel, Jake')
insert into Names1 (name) values ('Abel, Walter')
insert into Names1 (name) values ('Abeles, Edward')
insert into Names1 (name) values ('Abell, Tim')
insert into Names1 (name) values ('Aber, Chuck')
create table Names2
(
id int identity(200, 1),
name nvarchar(128)
)
insert into Names2 (name) values (LOWER('Peter.Green'))
insert into Names2 (name) values (LOWER('Peter.Smith'))
insert into names2 (name) values (LOWER('Beverly.Aadland'))
insert into names2 (name) values (LOWER('Mariann.Aalda'))
insert into names2 (name) values (LOWER('Aaliyah'))
insert into names2 (name) values (LOWER('Angela.Aames'))
insert into names2 (name) values (LOWER('Willie.Aames'))
insert into names2 (name) values (LOWER('Caroline.Aaron'))
insert into names2 (name) values (LOWER('Quinton.Aaron'))
insert into names2 (name) values (LOWER('Victor.Aaron'))
insert into names2 (name) values (LOWER('Peter.Abbay'))
insert into names2 (name) values (LOWER('Dorothy.Abbott'))
insert into names2 (name) values (LOWER('Bruce.Abbott'))
insert into names2 (name) values (LOWER('Bud.Abbott'))
insert into names2 (name) values (LOWER('Philip.Abbott'))
insert into names2 (name) values (LOWER('Rose.Abdoo'))
insert into names2 (name) values (LOWER('Paula.Abdul'))
insert into names2 (name) values (LOWER('Jake.Abel'))
insert into names2 (name) values (LOWER('Walter.Abel'))
insert into names2 (name) values (LOWER('Edward.Abeles'))
insert into names2 (name) values (LOWER('Tim.Abell'))
insert into names2 (name) values (LOWER('Chuck.Aber'));
with ftsNamesFirst (id, term) as
(
select id, terms.display_term
from names1 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
), ftsNamesSecond (id, term) as
(
select id, terms.display_term
from names2 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
)
select * from
(
select
ROW_NUMBER() over (partition by nfirst.id order by sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) desc) ranking,
sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) Confidence,
nFirst.id Names1ID,
nFirst.name Names1Name,
nSecond.id Names2ID,
nSecond.name Names2Name
from
ftsNamesFirst cross join ftsNamesSecond
left outer join names1 nFirst on nFirst.id = ftsNamesFirst.id
left outer join names2 nSecond on nSecond.id = ftsNamesSecond.id
where DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term) = 4
group by
nFirst.id, nFirst.name, nSecond.id, nSecond.name
) MatchedNames
where ranking = 1
Outputs:
Where the matches with highest confidence take precedence (all others are filtered out using a windowed ranking query).
Confidence Names1ID Names1Name Names2ID Names2Name
8 0 Green, Peter 200 peter.green
8 1 Smith, Peter 201 peter.smith
8 2 Aadland, Beverly 202 beverly.aadland
8 3 Aalda, Mariann 203 mariann.aalda
4 4 Aaliyah 204 aaliyah
8 5 Aames, Angela 205 angela.aames
8 6 Aames, Willie 206 willie.aames
It's not perfect, but this is a nice starting point from where it can be tweaked to give a higher probability of success.