We have OCRed thousands of pages of newspaper articles. The newspaper, issue, date, page number and OCRed text of each page have been put into a MySQL database.

We now want to build a Google-like search engine in PHP to find the pages given a query. It's got to be fast, and take no more than a second for any search.

How should we do it?

A: 

Check this Lucene port for PHP:

CMS
+6  A: 

There are some interesting search engines for you to take a look at. I don't know what you mean by "Google like" so I'm just going to ignore that part.

  • Take a look at the Lucene engine. The original is high performance but written in Java. There is a port of Lucene to PHP (already mentioned elsewhere) but it is too slow.
  • Take a serious look at the Xapian Project. It's fast. It's written in C++, so you'll most probably have to build it for your target server(s), but it has PHP bindings.
Glenn
A: 

Your scenario suggests that you'd like to roll your own; good starting points for a general search engine would include:

If you want to use an off-the-shelf solution:

Silver Dragon
Wow. Why write your own? I really don't see what about the OP's situation makes it worthwhile to re-implement what has recently become a commodity feature.
Alabaster Codify
The OP said "We now want to build"
Artelius
A: 

You might want to check Sphider. In my experience it is quite fast and does the indexing automatically. It is also open source so you could take the code and modify it for your needs.

Darryl Hein
+2  A: 

Why don't you try something like Google Search Appliance or Google Enterprise? There is a cost associated with it, but it will save you from re-inventing the wheel and give you "Google-like" search.

Pradeep
We would prefer to stick with PHP and MySQL because the database serves other purposes as well and needs to be integrated with the rest of our website.
lkessler
+4  A: 

You can also try out SphinxSearch. Craigslist uses Sphinx, and it can connect to both MySQL and PostgreSQL.

cnu
+2  A: 

If MySQL's fulltext search is taking 20 seconds per query, you either have it misconfigured or are running it on underpowered hardware - some big sites are successfully using plain old MyISAM searching.
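
For reference, here is a rough sketch of what a plain fulltext lookup might look like. The table name `pages`, the columns `issue_date`, `page_number` and `ocr_text`, the index name, and the helper `build_fulltext_query` are all made up for illustration, not the OP's actual schema. It's in Python only to keep the sketch self-contained; the SQL is what matters and would be issued the same way from PHP. `IN NATURAL LANGUAGE MODE` assumes MySQL 5.1 or later.

```python
# One-time setup, run once against MySQL. Table, column and index names
# here are hypothetical.
CREATE_INDEX_SQL = "ALTER TABLE pages ADD FULLTEXT INDEX ft_ocr_text (ocr_text)"

# Parameterized search, ranked by MySQL's own relevance score.
# %s is the placeholder style used by common Python MySQL drivers.
SEARCH_SQL = (
    "SELECT newspaper, issue, issue_date, page_number, "
    "MATCH(ocr_text) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
    "FROM pages "
    "WHERE MATCH(ocr_text) AGAINST (%s IN NATURAL LANGUAGE MODE) "
    "ORDER BY score DESC LIMIT %s"
)


def build_fulltext_query(user_query, limit=20):
    """Return (sql, params) for a parameterized fulltext search."""
    # Collapse runs of whitespace; reject queries that end up empty.
    cleaned = " ".join(user_query.split())
    if not cleaned:
        raise ValueError("empty search query")
    # The search string is bound twice: once for the score, once for the filter.
    return SEARCH_SQL, (cleaned, cleaned, int(limit))


if __name__ == "__main__":
    sql, params = build_fulltext_query("  great   fire of 1871 ")
    print(sql)
    print(params)
```

The returned pair would be passed to a driver's `cursor.execute(sql, params)`. If a query like this still takes many seconds with the index in place, that points at configuration or hardware rather than at the query itself.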

My vote goes for Solr, however. It's based on Lucene, so you get all the richness and performance of that best-of-breed product, but with a RESTful API, making it very easy to use from PHP. There's even a dW article.
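
To give an idea of how little client code the HTTP API needs, here is a minimal sketch of querying Solr's `select` handler. The base URL, the core name `pages`, and the function names are assumptions for illustration; a real setup depends on how Solr is installed and what the schema looks like. It's in Python to stay self-contained, but the equivalent in PHP is just building the same URL and decoding the JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base_url, core, query, rows=10):
    """Build the URL for a Solr select request that returns JSON."""
    # q, rows and wt are standard Solr query parameters.
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def solr_search(base_url, core, query, rows=10, timeout=1.0):
    """Run the query and return the list of matching documents."""
    url = solr_select_url(base_url, core, query, rows)
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # Solr's JSON response keeps the hits under response -> docs.
    return payload["response"]["docs"]


if __name__ == "__main__":
    # Hypothetical local instance and core name.
    print(solr_select_url("http://localhost:8983/solr", "pages", "great fire"))
```

The one-second timeout mirrors the OP's speed requirement; whether Solr meets it depends on index size and hardware.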

Alabaster Codify