views:

1010

answers:

3

I am interested in algorithm in T-SQL calculating Levenshtein distance.

+3  A: 

Consider using CLR Stored Procedure if those are available. If not, see this and this.

Anton Gogolev
It seems I have googled a newer version of the Arnold Fribble's implementation.
Alexander Prokofyev
+3  A: 

Arnold Fribble proposes this one:

SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO

CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
  WHILE @j <= @tl
    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
  WHILE @i <= @sl
  BEGIN
    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
    WHILE @j <= @tl
    BEGIN
      SET @c = @c + 1
      SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
      IF @c > @c1 SET @c = @c1
      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
      IF @c > @c1 SET @c = @c1
      IF @c < @cmin SET @cmin = @c
      SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
    END
    IF @cmin > @d BREAK
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO
Alexander Prokofyev
@Alexander, it seems to work but I would change your variable names to something more meaningfull. Also, I'd get rid of @d, you know the length of the two strings in your input.
Lieven
@Lieven: It isn't my implementation, the author is Arnold Fribble. @d parameter is a maximal allowed difference between strings after reaching which they are considered too diverse and function returns -1. It's added because the algorithm in T-SQL works too slowly.
Alexander Prokofyev
+3  A: 

IIRC, with SQL Server 2005 and later you can write stored procedures in any .NET language: Using CLR Integration in SQL Server 2005. With that it shouldn't be hard to write a procedure for calculating Levenstein distance.

A simple Hello, World! extracted from the help:

using System;
using System.Data;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public class HelloWorldProc
{
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void HelloWorld(out string text)
    {
        SqlContext.Pipe.Send("Hello world!" + Environment.NewLine);
        text = "Hello world!";
    }
}

Then in your SQL Server run the following:

CREATE ASSEMBLY helloworld from 'c:\helloworld.dll' WITH PERMISSION_SET = SAFE

CREATE PROCEDURE hello
@i nchar(25) OUTPUT
AS
EXTERNAL NAME helloworld.HelloWorldProc.HelloWorld

And now you can test run it:

DECLARE @J nchar(25)
EXEC hello @J out
PRINT @J

Hope this helps.

Leandro López