views:

772

answers:

5

I have 2 tables to search. Searching photos for keywords, title and description. The keywords have been split off into a separate table. My advanced search will allow searching on all 3 but the basic will just be the keyword table.

Basic table setup:

PHOTO Table

  • PhotoID
  • Name
  • Title
  • Description

WORD2PHOTO Table

  • WordID
  • PhotoID
  • Word

Been trying views and stored procs, and not getting any good results. I get my pics from the views, but in multiple records. I would do filtering and that stuff on the app side, but I'm using subsonic and would like to use the built in paging; that's a whole other issue.

First, How do I search on multiple keywords? Second, how do I add in searching on the title and description?

I have a function (f_Split) that will return a temp table of my current words to search,

DECLARE @Words TABLE (Word varchar(20))

INSERT INTO @Words (Word)
SELECT Keyword FROM dbo.f_Split('cars|auto|red|fast','|')

Now how do I use the table it generates to get the photo records? Been struggling for days now? Thanks for any help.

A: 

It isn't entirely clear what you mean, but it sounds like like you just want:

SELECT /* some columns */
FROM @Words #w
INNER JOIN WORD2PHOTO wp ON wp.Word = #w.Word
INNER JOIN PHOTO p NO p.PhotoID = wp.PhotoID

Re the title and description; well, you could do something kludgy involving LIKE, but as an alternative, why don't you simply disassemble the title and description (split on whitespace/punctuation), and place these into the WORD2PHOTO table (with a marker to indicate that they are from the title/description) - then this becomes:

SELECT /* some columns */
FROM @Words #w
INNER JOIN WORD2PHOTO wp
   ON wp.Word = #w.Word
   AND wp.Source IN ('K','T','D') -- keywords/title/description
INNER JOIN PHOTO p NO p.PhotoID = wp.PhotoID

And just include different combinations of K/T/D to suit...

You'd just need a trigger so that when you INSERT/UPDATE the title/description, it removes all existing T/D entries and replaces with the new ones.

Marc Gravell
That actually sounds like a good idea to seperate out the title and description as well. A side question. If I do place all the words in the Word2Photo table, how bad will that effect perfomance? That will eventually be alot of words in that table?
Well, you can filter out some words ("the", "a", etc). But databases are *good* at that type of thing...
Marc Gravell
Re the number of words - it is still *almost* the same as just storing the description+title twice (a little extra overhead). Given the advantages of normalizing it, I'd say it is worth a try
Marc Gravell
+1  A: 

For Postgres or MySQL you might check out Sphinx for full text search at

http://www.sphinxsearch.com/

There are nice adapters/plugins for various web frameworks. ThinkingSphinx, for example, is excellent in Ruby on Rails

http://github.com/freelancing-god/thinking-sphinx

Sphinx supports full text search on fields of your choice, delta indexing, and scales well.

mobileAgent
+1  A: 

You need to decide how the multiple keywords are connected. If someone types "keyword1 keyword2" in the search, are they looking for both keywords to be associated with the same photo (an AND operation) or are they looking for either keyword (or both) to be associated with the same photo (an OR operation). What about providing both? And what about "this keyword but not that other keyword", etc...

I'm not clear what the WordID column provides - other than expenditure of disk space. If you had a table with 'WordID, Word' as the columns, and the cross-reference table had 'PhotoID, WordID' columns, that makes one sensible design. Another sensible design has 'PhotoID, Word'. Having a table with 'WordID, PhotoID, Word' is not particularly sensible; it'll work, but the WordID column is effectively unused. You would need a unique constraint on the combination PhotoID, Word to ensure you don't have repetition in that table.

Given your @Words (temporary) table, you can do this to get the AND option:

SELECT P.PhotoID, P.Name, P.Title, P.Description
    FROM Photo P, Word2Photo W
    WHERE P.PhotoID = W.PhotoID
    GROUP BY P.PhotoID, P.Name, P.Title, P.Description
    HAVING COUNT(*) = (SELECT COUNT(*) FROM @Words L, Word2Photo M
                           WHERE M.Word = L.Word
                             AND M.PhotoID = P.PhotoID
                      )

It ensures that the number of entries in the Word2Photo table is the same as the number of entries in the @Words table for the given photo. It is a correlated sub-query; it is not efficient but it is effective. The useful thing is that the structure can be repeated mostly for the OR option:

SELECT P.PhotoID, P.Name, P.Title, P.Description
    FROM Photo P, Word2Photo W
    WHERE P.PhotoID = W.PhotoID
    GROUP BY P.PhotoID, P.Name, P.Title, P.Description
    HAVING 1 <= (SELECT COUNT(*) FROM @Words L, Word2Photo M
                    WHERE M.Word = L.Word
                      AND M.PhotoID = P.PhotoID
                )

This looks for photos having at least one of the words in the list of words.

There probably are other ways to do it, but the symmetry is appealing. Clearly, if you get into more complex criteria (mixing AND and OR, or adding NOT), then the structure changes.

Caveat

Untested code.

Jonathan Leffler
+3  A: 
Max Gontar
Awesome answer. :)
Kjensen
+1  A: 

I did this on my website some years ago. What I did was shave off all the things SQL is not good at to the application. From memory, it was something like:

table photos (
    photoid        number unique indexed,
    name           varchar2,
    title          varchar2,
    description    varchar2,
    keywords       varchar2,
    ... etc
);

table photosearch (
    wordid      number indexed,  -- ID of word, more or less
    photoid     number,          -- ref photos.photoid
    context     number,          -- 9=title, 7=name, 5=desc, ..
    ... etc 
)

When a photo was inserted/updated, the basic algorithm was:

photoid = INSERT INTO PHOTOS VALUES (...)

foreach field in (name title description keywords) 
    int weight = getweight(field)
    foreach word in ( value(field) ) 
        # Discard useless words, e.g. "and, or, but, yes, ..."
        stem = word-stem-algorithm(word)
        key  = hash-to-number(stem)
        INSERT INTO PHOTOSEARCH VALUES 
            (key, photoid, weight)

The generic search was then something like:

keys [] = hash(stem(word)) foreach word in query

SELECT photoid, sum(context) FROM photosearch
 WHERE wordid IN keys[]
 GROUP BY photoid
 ORDER BY 2 DESC

The trick of using context==unique_weight allowed me to easily do "field contains word" searches (left as exercise to reader ;), and allowed me to "tune" the result order by varying the weighting of the fields.

Darren Holloway