ansaurus

Question

Check if one column value is contained in another column value (TSQL)?

Answer 1

+1 A:

There are several possible solutions depending on exactly what you need: Use could create auxiliary table to store keyword for every record

Use auxiliary table to store keywords for every record or record and field. E.g. table_helper(id int primary key, record_id int, keyword varchar), record_id - link to source table. Fill this table in trigger for table1, table2. Query to find common rows is a simple intersection of table_helper with itself. You could create one helper for table1 and table2 or use separate tables.
Use full-text indexes.

2010-08-03 08:03:07

Answer 2

+2 A:

You can try either SOUNDEX or DIFFERENCE functions to help match string literals.

Example:

select difference('peter.green', 'Green, Peter')

returns 2, whereby:

The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.

See SOUNDEX and DIFFERENCE topics on MSDN.

Update:

Soundex & Difference may not function well when the order of words are considered, but if you have full text indexing capabilities installed, you don't need to create an index to use the word breaking and parsing capabilities of the full text engine. Assuming you're using SQL Server 2008, the following function will return you a list of normalised terms:

SELECT * FROM sys.dm_fts_parser('"Peter Green"', 1033, 0, 0)

Through which you can CROSS APPLY to the remainder of your query.

See the sys.dm_fts_parser topic & Section K. Using Apply within the FROM topic for more info.

Example: (SQL Server Enterprise 2008 with Full Text Engine enabled)

if not OBJECT_ID('Names1', 'Table') is null drop table names1
if not OBJECT_ID('Names2', 'Table') is null drop table names2

create table Names1 
(
    id int identity(0, 1),
    name nvarchar(128)
)
insert into Names1 (name) values ('Green, Peter')
insert into Names1 (name) values ('Smith, Peter')
insert into Names1 (name) values ('Aadland, Beverly')
insert into Names1 (name) values ('Aalda, Mariann')
insert into Names1 (name) values ('Aaliyah')
insert into Names1 (name) values ('Aames, Angela')
insert into Names1 (name) values ('Aames, Willie')
insert into Names1 (name) values ('Aaron, Caroline')
insert into Names1 (name) values ('Aaron, Quinton')
insert into Names1 (name) values ('Aaron, Victor')
insert into Names1 (name) values ('Abbay, Peter')
insert into Names1 (name) values ('Abbott, Dorothy')
insert into Names1 (name) values ('Abbott, Bruce')
insert into Names1 (name) values ('Abbott, Bud')
insert into Names1 (name) values ('Abbott, Philip')
insert into Names1 (name) values ('Abdoo, Rose')
insert into Names1 (name) values ('Abdul, Paula')
insert into Names1 (name) values ('Abel, Jake')
insert into Names1 (name) values ('Abel, Walter')
insert into Names1 (name) values ('Abeles, Edward')
insert into Names1 (name) values ('Abell, Tim')
insert into Names1 (name) values ('Aber, Chuck')

create table Names2
(
    id int identity(200, 1),
    name nvarchar(128)
)
insert into Names2 (name) values (LOWER('Peter.Green'))
insert into Names2 (name) values (LOWER('Peter.Smith'))
insert into names2 (name) values (LOWER('Beverly.Aadland'))
insert into names2 (name) values (LOWER('Mariann.Aalda'))
insert into names2 (name) values (LOWER('Aaliyah'))
insert into names2 (name) values (LOWER('Angela.Aames'))
insert into names2 (name) values (LOWER('Willie.Aames'))
insert into names2 (name) values (LOWER('Caroline.Aaron'))
insert into names2 (name) values (LOWER('Quinton.Aaron'))
insert into names2 (name) values (LOWER('Victor.Aaron'))
insert into names2 (name) values (LOWER('Peter.Abbay'))
insert into names2 (name) values (LOWER('Dorothy.Abbott'))
insert into names2 (name) values (LOWER('Bruce.Abbott'))
insert into names2 (name) values (LOWER('Bud.Abbott'))
insert into names2 (name) values (LOWER('Philip.Abbott'))
insert into names2 (name) values (LOWER('Rose.Abdoo'))
insert into names2 (name) values (LOWER('Paula.Abdul'))
insert into names2 (name) values (LOWER('Jake.Abel'))
insert into names2 (name) values (LOWER('Walter.Abel'))
insert into names2 (name) values (LOWER('Edward.Abeles'))
insert into names2 (name) values (LOWER('Tim.Abell'))
insert into names2 (name) values (LOWER('Chuck.Aber'));

with ftsNamesFirst (id, term) as
(
    select id, terms.display_term
        from names1 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
), ftsNamesSecond (id, term) as
(
select id, terms.display_term
        from names2 cross apply sys.dm_fts_parser('"' + name + '"', 1033, 0, 0) terms
)
select * from 
(
    select 
    ROW_NUMBER() over (partition by nfirst.id order by sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) desc) ranking,
    sum(DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term)) Confidence,
    nFirst.id Names1ID,
    nFirst.name Names1Name, 
    nSecond.id Names2ID,
    nSecond.name Names2Name
    from 
    ftsNamesFirst cross join ftsNamesSecond 
    left outer join names1 nFirst on nFirst.id = ftsNamesFirst.id
    left outer join names2 nSecond on nSecond.id = ftsNamesSecond.id 
    where DIFFERENCE(ftsNamesFirst.term, ftsNamesSecond.term) = 4
    group by 
        nFirst.id, nFirst.name, nSecond.id, nSecond.name
) MatchedNames 
where ranking = 1

Outputs:

Where the matches with highest confidence take precedence (all others are filtered out using a windowed ranking query).

Confidence Names1ID Names1Name Names2ID Names2Name
8   0   Green, Peter    200 peter.green
8   1   Smith, Peter    201 peter.smith
8   2   Aadland, Beverly    202 beverly.aadland
8   3   Aalda, Mariann  203 mariann.aalda
4   4   Aaliyah 204 aaliyah
8   5   Aames, Angela   205 angela.aames
8   6   Aames, Willie   206 willie.aames

It's not perfect, but this is a nice starting point from where it can be tweaked to give a higher probability of success.

Rabid 2010-08-03 09:57:16

ansaurus

tags:

views:

answers:

Check if one column value is contained in another column value (TSQL)?

related questions