views:

806

answers:

12

Hello.

I want to write a web application using Google App Engine (so the reference language would be Python). My application needs a simple search engine, so the users would be able to find data specifying keywords.

For example, if I have one table with those rows:

1 Office space
2 2001: A space odyssey
3 Brazil

and the user queries for "space", rows 1 and 2 would be returned. If the user queries for "office space", the result should be rows 1 and 2 too (row 1 first).

What are the technical guidelines/algorithms to do this in a simple way?
Can you give me good pointers to the theory behind this?

Thanks.

Edit: I'm not looking for anything complex here (say, indexing tons of data).

+3  A: 

I found these two books very useful when I used to build full-text search engines.

Information Retrieval

Managing Gigabytes

Ferruccio
I haven't read/seen Information Retrieval, but I have used Managing Gigabytes, and quite aside from covering basically everything that is necessary to make a full text search and retrieval system it is actually very readable.
olliej
+2  A: 

As always start in wikipedia. First start is usually building an inverted index.

Goran
A: 

See also a question I asked: How-to: Ranking Search Results.

Surely there are more approaches, but this is the one I'm using for now.

warren
+1  A: 

First build your index. Go through the input, split into words
For each word check if it is already in the index, if it is add the current record number to the index list, if not add the word and record number.
To look up a word go to the (possibly sorted) index and return all the record numbers for that word.
It's very esy to do this for a reasoable size list using Python's builtin storage types.

As an extra refinement you only want to store the base part of a word, eg 'find' for 'finding' - look up stemming algorithms.

Martin Beckett
If by "index" you mean "mapping" or "dictionary", then this is the job. If you do mean that, then take out the word "sorted" since the Python dictionary is a hash from key to value.
S.Lott
I meant index in the general (book/DB) sense. What container in python is best to implement it is a separate question.
Martin Beckett
Other than a Python dictionary, there aren't any other choices. Hence the suggestion that you consider dropping "sorted".
S.Lott
A: 

Honestly, smarter people than I have figured this stuff out. I'd load up the solr app and make json calls from my appengine app and let solr take care of indexing.

+3  A: 

Here's an original idea:

Don't build an index. Seriously.

I was faced with a similar progblem some time ago. I needed a fast method to search megs and megs of text that came from documentation. I needed to match not just words, but word proximity in large documents (is this word near that word). I just ended up writing it in C, and the speed of it surprised me. It was fast enough that it didn't need any optimizing or indexing.

With the speed of today's computers, if you write code that runs straight on the metal (compiled code), you often don't need an order log(n) type algorithm to get the performance you need.

Matthias Wandel
This appeals to my don't over optimize sensibilities. Up modded.
Justin Bozonier
+4  A: 

Read Tim Bray's series of posts on the subject.

  • Background
  • Usage of search engines
  • Basics
  • Precision and recall
  • Search engne intelligence
  • Tricky search terms
  • Stopwords
  • Metadata
  • Internationalization
  • Ranking results
  • XML
  • Robots
  • Requirements list
Mark Cidade
+2  A: 

I would not build it yourself, if possible.

App Engine includes the basics of a Full Text searching engine, and there is a great blog post here that describes how to use it.

There is also a feature request in the bug tracker that seems to be getting some attention lately, so you may want to hold out, if you can, until that is implemented.

Chuck
I've tried it, and it works, even if has some glitches; for example, it doesn't seem to support queries shorter than 3 chars. Anyway, a good compromise.
friol
A: 

I just found this article this weekend: http://www.perl.com/pub/a/2003/02/19/engine.html

Looks not too complicated to do a simple one (though it would need heavy optimizing to be an enterprise type solution for sure). I plan on trying a proof of concept with some data from Project Gutenberg.

If you're just looking for something you can explore and learn from I think this is a good start.

Justin Bozonier
+2  A: 

Lucene or Autonomy! These are not out of the box solutions for you. You will have to write wrappers on top of their interfaces.
They certainly do take care of the stemming, grammar , relational operators etc

Cherian
+1  A: 

The book Introduction to Information Retrieval provides a good introduction to the field.

A dead-tree version is published by Cambridge University Press, but you can also find a free online edition (in HTML and PDF) following the link above.

Xavier Martinez-Hidalgo
A: 

Look into the book "Managing Gigabytes" it covers storage and retrieval of huge amounts of plain text data -- eg. both compression and actual searching, and a variety of the algorithms that can be used for each.

Also for plain text retrieval you're best off using a vector based search system rather than a keyword->document indexing system as vector based systems can be much faster, and, more importantly can provide relevancy ranking relatively trivially.

olliej