I'm creating a site that allows users to submit quotes. How would I go about creating a (relatively simple?) search that returns the most relevant quotes?

For example, if the search term was "turkey" then I'd return quotes where the word "turkey" appears twice before quotes where it only appears once.

(I would add a few other rules to help filter out irrelevant results, but my main concern is that.)

A: 

I'd go with Full Text Search, look at it here: http://hockinson.com/fulltext-search-of-mysql-database-table.html

Filip Ekberg
+4  A: 

Use Google Custom Site Search. I've heard they know a thing or two about searching.

nickf
Haha funny :) +1
Filip Ekberg
I'd like to make my own. Each quote doesn't get its own page, so I don't think Google would work very well, anyway.
stalepretzel
+2  A: 

Stack Overflow plans to use the Lucene search engine. There is a PHP port of it, written for the Zend Framework, that can be downloaded on its own without all the ZF bloat. It is called Zend_Search_Lucene, documentation for which can be found here.

DavidM
+1  A: 

If you want to write your own, take a look at phpBB's implementation. They have two tables: the first is a unique list of all the words that appear in entries, and the second is a many-to-many reference between the words and the entries. You could then do a group and count to sort the entries in the manner you're looking for.

It's a lot more work than implementing a third-party search engine (or full text search), but it will allow you greater control over the results.
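
A rough sketch of that two-table idea, using Python's built-in sqlite3 module so it runs anywhere. The table names (quotes, words, word_matches), the tokenizer, and the choice to store one row per occurrence are all assumptions made for illustration; this is not phpBB's actual schema.

```python
import re
import sqlite3

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(conn, quotes):
    """Create the three tables and index every quote in the list."""
    conn.executescript("""
        CREATE TABLE quotes (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
        -- Many-to-many link; one row per occurrence of a word in a quote.
        CREATE TABLE word_matches (
            word_id INTEGER NOT NULL REFERENCES words(id),
            quote_id INTEGER NOT NULL REFERENCES quotes(id)
        );
    """)
    for body in quotes:
        quote_id = conn.execute(
            "INSERT INTO quotes (body) VALUES (?)", (body,)
        ).lastrowid
        for word in tokenize(body):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            word_id = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                "INSERT INTO word_matches (word_id, quote_id) VALUES (?, ?)",
                (word_id, quote_id),
            )

def search(conn, query):
    """Return (body, score) pairs, highest occurrence count first."""
    terms = tokenize(query)
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    return conn.execute(
        f"""
        SELECT q.body, COUNT(*) AS score
        FROM word_matches AS m
        JOIN words AS w ON w.id = m.word_id
        JOIN quotes AS q ON q.id = m.quote_id
        WHERE w.word IN ({placeholders})
        GROUP BY q.id, q.body
        ORDER BY score DESC, q.id
        """,
        terms,
    ).fetchall()
```

Calling build_index(conn, list_of_quotes) once and then search(conn, "turkey") returns the quotes containing "turkey", with the ones that mention it most often first. Storing a per-pair occurrence count instead of one row per occurrence would be a reasonable variation.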

Chris Pebble
A: 

Google Custom Site Search is great if you don't query it much (I think you get 1k queries/day for free) or if you're willing to pay.

MySQL's fulltext search is also a great resource (as has been mentioned previously).

Yahoo's BOSS is an intriguing project -- I'm going to give it a shot during my next search project.

And, finally, Lucene is a great resource if you need more power than fulltext, but want to tweak your own search engine. http://lucene.apache.org

Travis Leleu
+2  A: 

Your SQL for that will look something like this (where you're trying to find quotes with 'turkey' in them):

SELECT * FROM Quotes
WHERE the_quote LIKE '%turkey%';

From there you can figure out what to do with whatever it spits out at you.

Be careful to properly handle cases where a malicious user might inject malicious SQL into your database, especially if you're planning on putting this on the www. If you're doing this for fun though, I guess it's just about what you want to learn.
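
One common way to handle that is to pass the search term as a bound parameter instead of pasting it into the SQL string. Below is a minimal sketch with Python's built-in sqlite3 module; the Quotes table and the_quote column come from the query above, while the id column, the find_quotes name, and the wildcard escaping are assumptions added for illustration.

```python
import sqlite3

def find_quotes(conn, term):
    """Return quotes containing `term`, using a bound parameter."""
    # Escape LIKE wildcards so a user-typed "%" or "_" matches literally.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    pattern = f"%{escaped}%"
    rows = conn.execute(
        # The "?" placeholder keeps user input out of the SQL text itself.
        "SELECT the_quote FROM Quotes "
        "WHERE the_quote LIKE ? ESCAPE '\\' "
        "ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]
```

With this shape, a term such as "'; DROP TABLE Quotes; --" is treated as plain text to search for, not as SQL.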

If you're new to databases and SQL, I recommend SQLite over MySQL. It's much easier to set up and work with, as in there's no setup at all. It'll get you around the potential headaches of having to install and set up MySQL for the first time.

Jeffrey Martinez
This doesn't give you a rank (e.g. rank being the number of occurrences of turkey)
dcousineau
Wait, what? You mean I should be careful of SQL Injections? No way! :) I was thinking of using the approach you described, but I don't think it's so easy to turn this into a ranking system.
stalepretzel
Just wasn't sure of your skill level. If you want an easy ranking system, you could have the user create tags for quotes and search the tags. Since quotes are not huge bodies of text, you might have to analyze nouns/verbs to determine a rank. Or just count the number of occurrences of a word.
Jeffrey Martinez
I don't fully understand the 'tags for quotes' comment, but would recommend against hacking together something like this when there are so many existing robust options out there. Including in MySQL out of the box...
Alabaster Codify
+19  A: 

Everyone is suggesting MySQL fulltext search, however you should be aware of a HUGE caveat. The Fulltext search engine is only available for the MyISAM engine (not InnoDB, which is the most commonly used engine due to its referential integrity and ACID compliance).

So you have a few options:

1. The simplest approach is outlined by Particle Tree. You can actually get ranked searches out of pure SQL (no fulltext, no nothing). The SQL query below will search a table and rank results based on the number of occurrences of a string in the search fields:

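SELECT
    p.id,
    -- characters removed by deleting a word, divided by that word's length
    -- ('term' = 4, 'search' = 6), gives how many times the word occurs
    SUM(((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'term', '')))/4) +
        ((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'search', '')))/6))
    AS Occurrences
FROM
    posts AS p
GROUP BY
    p.id
ORDER BY
    Occurrences DESC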

(I edited their example to provide a bit more clarity.)

Variations on the above SQL query, adding WHERE statements (WHERE p.body LIKE '%whatever%you%want'), etc. will probably get you exactly what you need.
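
For a single search term, the same counting trick can be tried out with Python's built-in sqlite3 module. This is only a sketch: the posts table and body column follow the query above, but the rank_by_occurrences name, the lowercasing, and the filter that drops zero-occurrence rows are assumptions. It counts substrings, so "turkeys" also counts as a hit for "turkey". Note that LENGTH counts characters in SQLite but bytes in MySQL, where CHAR_LENGTH is the character-based function.

```python
import sqlite3

def rank_by_occurrences(conn, term):
    """Return (id, body, occurrences) rows, most occurrences first."""
    if not term:
        return []  # avoid dividing by LENGTH('') = 0
    sql = """
        SELECT id, body, occurrences
        FROM (
            SELECT p.id, p.body,
                   -- characters removed / length of term = occurrence count
                   (LENGTH(p.body)
                    - LENGTH(REPLACE(LOWER(p.body), LOWER(:term), '')))
                   / LENGTH(:term) AS occurrences
            FROM posts AS p
        )
        WHERE occurrences > 0
        ORDER BY occurrences DESC, id
    """
    return conn.execute(sql, {"term": term}).fetchall()
```

Because the term is bound as a parameter, its length no longer has to be hard-coded the way the 4 and 6 are in the query above.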

2. You can alter your database schema to support full text. To keep InnoDB's referential integrity, ACID compliance, and speed without having to install plugins like the Sphinx Fulltext Search Engine for MySQL, what is often done is to split the quote data into its own table. Basically, you would have an InnoDB table Quotes that, rather than holding your TEXT field "data" itself, holds a reference "quote_data_id" pointing to the ID of a row in a Quote_Data table, which is a MyISAM table. You can do your fulltext search on the MyISAM table, join the IDs returned with your InnoDB tables, and voila, you have your results.

3. Install Sphinx. Good luck with this one.

Given what you described, I would HIGHLY recommend you take the 1st approach I presented, since you have a simple database-driven site. The 1st solution is simple and gets the job done quickly. Lucene will be a bitch to set up, especially if you want to integrate it with the database, as Lucene is designed mainly to index files, not databases. Google custom site search just makes your site lose tons of reputation (makes you look amateurish and hacked), and MySQL fulltext will most likely cause you to alter your database schema.

dcousineau
This is very helpful.
stalepretzel
Option 1 is interesting! Never seen that before. Due to the complexity of the query, running benchmarks on live data would be essential, but could be a nice alternative to Sphinx/MyISAM if you have a small dataset
Alabaster Codify
+1  A: 

As an alternative to Sphinx and Lucene, a relatively simple search engine can be created using the Xapian library.

+ Supports many advanced search features (such as relevancy ranking)
+ Fast

- You would need to learn the API to create your interface
- Requires a PHP extension to be installed

Note also that Xapian stores its data in an index separate from MySQL.

You might also be interested in Forage which is a wrapper for Solr, Xapian and Lucene.

The Xapian people also created the Omega search engine which is a frontend to Xapian, and can be called via cgi.

menko
A: 

I came across the Zoom Search Engine a few days ago and think this might be the simplest search engine I have ever used.

The Windows-based tool creates a database of the site, then asks which language (PHP, ASP.NET, JavaScript, etc.) you want to use. I picked PHP and it built the PHP code for me. All I had to do then was upload the files to the server and (optionally) customize the template, and site search was working.

This is free for small sites, and the only con I can find is that the spider tool (database builder) has to run on Windows.

meme