views:

163

answers:

5

i have some customers with a particular address that the user is searching for:

123 generic way

There are 5 rows in the database that match:

ResidentialAddress1
=============================
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY

i run a FT query to look for these rows. i'll show you each step as i add more criteria to the search:

SELECT ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.ResidentialAddress1, '"123*"')

ResidentialAddress1
=========================
123 MAPLE STREET
12345 TEST
123 MINE STREET
123 GENERIC WAY
123 FAKE STREET
...

(30 row(s) affected)

Okay, so far so good, now adding the word "generic":

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"generic*"')

ResidentialAddress1
=============================
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY

(5 row(s) affected)

Excellent. And now i'l add the final keyword that the user wants to make sure exists:

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"generic*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"way*"')


ResidentialAddress1            
------------------------------ 

(0 row(s) affected)

Huh? No rows? What if i query for just "way*":

SELECT ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.ResidentialAddress1, '"way*"')

ResidentialAddress1            
------------------------------ 

(0 row(s) affected)

At first i thought that perhaps it's because of the *, and it's requiring that the root way have more characters after it. But that's not true:

  • Searching for "123*" matches "123"
  • Searching for "generic*" matches "generic"
  • Books online says, The asterisk matches zero, one, or more characters

What if i remove the * just for s&g:

SELECT ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.ResidentialAddress1, '"way"')

Server: Msg 7619, Level 16, State 1, Line 1
A clause of the query contained only ignored words. 

So one might think that you are just not allowed to even search for way, either alone, or as a root. But this isn't true either:

SELECT * FROM Patrons
WHERE CONTAINS(Patrons.*, '"way*"')

AccountNumber FirstName Lastname
------------- --------- --------
33589         JOHN      WAYNE                    

So sum up, the user is searching for rows that contain all the words:

123 generic way

Which i, correctly, translate into the WHERE clauses:

SELECT * FROM Patrons
WHERE CONTAINS(Patrons.*, '"123*"')
AND CONTAINS(Patrons.*, '"generic*"')
AND CONTAINS(Patrons.*, '"way*"')

which returns no rows. Tell me this just isn't going to work, that it's not my fault, and SQL Server is crazy.

Note: i've emptied the FT index and rebuilt it.

Update One

SELECT Lastname, ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.*, '"gen*"')

Lastname                  ResidentialAddress1            
------------------------- ------------------------------ 
SAVE                      123 GENERIC WAY
Genders                   
SAVE                      123 GENERIC WAY
Patron                    123 GENERIC WAY
SAVE                      123 GENERIC WAY
SAVE                      234 GENERIC WAY
SAVE                      123 GENERIC WAY

(7 row(s) affected)

Update Two

Pretending the user typed in:

123 generic wa

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"generic*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"wa*"')

ResidentialAddress1            
------------------------------ 

(0 row(s) affected)

The real problem is that the user is typing in something perfectly valid, and they would expect to see what any human being would expect to see.


Update Three

Someone asked for all this, it's not my fault!:

CREATE TABLE [dbo].[Patrons] (
    [PatronGUID]  uniqueidentifier ROWGUIDCOL  NOT NULL ,
    [AccountNumber] [bigint] NULL ,
    [FirstName] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [MiddleInitial] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [Lastname] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [EyeColor] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [HairColor] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [Gender] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [Birthday] [datetime] NULL ,
    [Height] [int] NULL ,
    [Weight] [int] NULL ,
    [FacialHair] [tinyint] NULL ,
    [Nationality] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [IdentifyingMarks] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseNumber] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseCountry] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseExpires] [datetime] NULL ,
    [DriversLicenseDateVerified] [datetime] NULL ,
    [PassportNumber] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PassportRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PassportCountry] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PassportExpires] [datetime] NULL ,
    [PassportDateVerified] [datetime] NULL ,
    [OtherIdentificationNumber] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationCountry] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationExpires] [datetime] NULL ,
    [OtherIdentificationDateVerified] [datetime] NULL ,
    [ResidentialAddress1] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialAddress2] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialAddress3] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialCity] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialZipCode] [varchar] (15) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialCountry] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialPhoneNumber] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [CountryOfResidence] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessAddress1] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessAddress2] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessAddress3] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessCity] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessZipCode] [varchar] (15) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessCountry] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessName] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessPhone] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PositionWithFirm] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [EmployerTelephone] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [MemberCardType] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PlayerStatusCode] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [AccountType] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [AccountStatus1] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [AccountStatus2] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [IsVIPExchangeRate] [tinyint] NULL ,
    [ChangedUserGUID_Depricated] [uniqueidentifier] NULL ,
    [ChangedUser] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ChangedDate] [datetime] NULL ,
    [ChangedWorkstation] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PendingUpdates_Depricated] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL 
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Patrons] ADD 
    CONSTRAINT [DF_Patrons_PatronGUID] DEFAULT (newid()) FOR [PatronGUID],
    CONSTRAINT [PK_Patrons] PRIMARY KEY  NONCLUSTERED 
    (
        [PatronGUID]
    ) WITH  FILLFACTOR = 90  ON [PRIMARY] 
GO

if (select DATABASEPROPERTY(DB_NAME(), N'IsFullTextEnabled')) <> 1 
exec sp_fulltext_database N'enable' 

GO

if not exists (select * from dbo.sysfulltextcatalogs where name = N'TheFullTextCatalog')
exec sp_fulltext_catalog N'TheFullTextCatalog', N'create' 

GO

exec sp_fulltext_table N'[dbo].[Patrons]', N'create', N'TheFullTextCatalog', N'PK_Patrons'
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'FirstName', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'MiddleInitial', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'Lastname', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'EyeColor', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'IdentifyingMarks', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialAddress1', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialAddress2', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialAddress3', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialCity', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialZipCode', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialRegion', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialCountry', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialPhoneNumber', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'CountryOfResidence', N'add', 1033  
GO

exec sp_fulltext_table N'[dbo].[Patrons]', N'activate'  
GO

Here's the screenshots for the guy who didn't believe me:

The query that should work but doesn't: alt text

The query that works, but isn't useful: alt text

The query that works, but isn't useful, with the proof content: alt text


Update Four

The query cannot be written as

CONTAINS(Patrons.*, 'words...')

Since there are items not logically or physically covered by the FT index. e.g. the user queries for:

6/4/2010 ian boyd 619

Presents four keywords:

  • 6/4/2010
  • ian
  • boyd
  • 619

This means they want all the conditions to hold true, with pseudo-code being:

WHERE 6/4/2010 is in the row
AND ian is in the row
AND boyd is in the row
AND 619 is in the row

Which is translated into a partial query of:

WHERE --Keyword 1: 6/4/2010
(
   ((ChangedDate >= '20100604') AND (ChangedDate < '20100605'))
   OR 
   ((LastTransactionDate >= '20100604') AND (LastTransactionDate < '20100605'))
   OR 
   (CONTAINS(Patrons.*, '"6/4/2010*"')
)
AND --Keyword 2: ian
(
    CONTAINS(Patrons.*, '"ian*"')
)
AND --Keyword 3: boyd
(
    CONTAINS(Patrons.*, '"boyd*"')
)
AND --Keyword 4: 619
(
    (AccountNumber IN (SELECT CAST(619 AS bigint)))
    OR
    (CONTAINS(Patrons.*, '"619*"'))
)

One of the answerers was looking at the simplified example presented in the original question; not the real world. To say that it is incorrect to have multiple AND clauses is nieve.

A: 

The message is telling you that "way" is a stopword, which means it's ignored and not indexed. That's why you can find "wayne" but not "way".

So, no, it's not crazy and neither are you. There's just a simple misunderstanding.

Steven Sudit
What do you suggest? ie. How do i tell SQL Server to either search for way when it can search for way, but not search for way when it can't search for way.
Ian Boyd
@Ian Boyd: Have you tried '"wa*"'?
Ardman
@Ardman Tried "wa*", results added to question. Doesn't work.
Ian Boyd
@Ian: I upvoted two answers that had more detail, so try those. The short answer is that you need to remove "way" from your stop list and reindex.
Steven Sudit
/facepalm i wish there was **any** other solution, than the one requiring hand-editing of text files on machines i don't have access to. In the end the solution might have to involve turning `CONTAINS(table.*)` into a few dozen `LIKE` clauses...
Ian Boyd
@Ian: In my experience, projects that start off using the FT search capabilities of a database tend to end up offloading that aspect to engines such as Lucene and DTSearch, precisely because relational databases are limited and frustrating.
Steven Sudit
A: 

Perhaps it requires more than three alphabet characters. Try another three letter word, like gen*.

RedFilter
"gen*" works, which makes sense, since "123*" also works. Will add to question anyway
Ian Boyd
+2  A: 

http://www.codinghorror.com/blog/2008/11/stop-me-if-you-think-youve-seen-this-word-before.html

In that post, Jeff mentions you can turn them off.

Josh Smeaton
+1  A: 

Solution 1:

You want to try the Transform Noise Word option (SQL 2008).

Turning this off, should stop word removal.

example:

sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'transform noise words', 1
RECONFIGURE
GO

Edit 1:

Hopefully there may be something similar for older versions of MS SQL?

Darknight
+2  A: 

You probably used the system stoplist when you created the FT index. The word way happens to be in there. You can see it with this query:

SELECT *
FROM sys.fulltext_system_stopwords
WHERE stopword = 'way'
AND language_id = 1033

You can turn off the stoplist or create a custom one, but a better solution would be to write the query properly; don't use multiple WHERE CONTAINS clauses, combine them into one. Otherwise SQL Server might not be able to use the FT index as effectively.

Your query should look like this instead:

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*" AND "generic*" AND "way*"')

If you do it this way, the stop word simply gets ignored; it'll still return all of the same results it would have returned if you hadn't included the term way*.


Edit: Just noticed that you tagged this sql-server-2000, so the first query might not work. In SQL 2000, they are "noise words" and I believe that the configuration is global, you don't have individual stoplists. Nevertheless, you'll still get results if you write a single WHERE CONTAINS clause instead of several.

To edit the noise words in SQL Server 2000, you have to edit the language-specific file in the SQL Server FTDATA configuration folder. More details are here: SQL Server Full Text Search Noise Words and Thesaurus Configurations.

Aaronaught
Running `SELECT * FROM Patrons WHERE CONTAINS(Patrons.*, '"123*" AND "generic*" AND "way*"')` returns no rows. And yes: 2000, noisewords, not queryable.
Ian Boyd
@Ian: That doesn't seem right at all. Can we get a repro including the table schema and FT DDL?
Aaronaught
That's a hellavalotta ddl, and scripts, for nobody to use.
Ian Boyd
Also writing the query as a single Contains, as they are fields no covered by the FT index (e.g. tinyint) that require a separate clause. So the solution checking for the presence of **all** keywords is correct.
Ian Boyd
@Ian: It shouldn't be a "hellavalotta" - if you're not able to reproduce it on a small scale then something else is wrong. And yeah, obviously if you're looking up numeric values in other fields then you have to write `AND` conditions, but all of the fulltext criteria should be in a single search string. Writing several `CONTAINS` clauses *isn't* correct unless you need to check different fields for *each*.
Aaronaught