This is a college project:

I have a database (MySQL or PostgreSQL, it doesn't matter) with 100 billion posts, and I need to search it for an arbitrary keyword as fast as possible.

Every post contains 500-1000 keywords.

This isn't only a database issue but also a software issue (indexing, or something else).

How can I do that?

I could use some advanced search-engine technology, but I don't know which.

+2  A: 

You might want to check out Sphinx. It's a full-text search engine that handles distributed indexes: parts of the data can be spread over many machines, and a query sent to a single server is forwarded to the others, with the results gathered from each. Its speed is quite good, but you probably can't do 100 billion posts on a single machine.

You probably won't be able to do something like this in MySQL or PostgreSQL alone. While you could store all the data, they lack the full-text indexing and search speed that a dedicated full-text engine will bring you.

MySQL can be compiled with support for the Sphinx storage engine. Although the data is still stored in Sphinx, separate from MySQL, you can then query the Sphinx search engine from anything that connects to MySQL, and even join the results against other tables in your MySQL database. However, if you just want to perform simple searches on the documents and don't need to join against other data, you can use the native PHP interface instead.
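To make the "query Sphinx with anything that connects to MySQL" point concrete, here is a minimal sketch of the kind of SphinxQL statement such a client would send. The index name `posts_index` is an assumption; the `MATCH()` operator is how Sphinx's full-text engine is invoked, while the rest is plain SQL.

```python
def sphinxql_search(index: str, keyword: str, limit: int = 10) -> str:
    """Build a SphinxQL full-text query for a single keyword.

    The resulting string would be sent to the Sphinx daemon over the
    MySQL wire protocol by any ordinary MySQL client library.
    """
    escaped = keyword.replace("'", "\\'")  # naive quoting for the sketch
    return f"SELECT id FROM {index} WHERE MATCH('{escaped}') LIMIT {limit};"

print(sphinxql_search("posts_index", "database"))
# -> SELECT id FROM posts_index WHERE MATCH('database') LIMIT 10;
```

In a real deployment you would execute this against the Sphinx listener (by default a separate port from MySQL itself) rather than against the MySQL server storing the rest of your data.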

Kibbee
But MySQL and PostgreSQL have full-text indexing.
Simon
But not the kind that can handle such large amounts of data, and they have no built-in way of distributing the data over many machines. A full-text search engine such as Sphinx or Lucene will offer far more performance, and many more features, than you could get from MySQL or PostgreSQL.
Kibbee
+2  A: 

Try Sphinx http://sphinxsearch.com/

Sabeen Malik
+11  A: 

Have you considered using Apache Lucene?

It doesn't work directly with your SQL database; you'll have to write some code that feeds the documents into it in order to build an index, which you can then query.

I don't know, however, how much additional space that would require or how long it would take.
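The "feed documents in, build an index, then query it" workflow boils down to an inverted index: a map from each term to the set of documents containing it. This is a toy, self-contained sketch of that idea (Lucene itself adds analysis, ranking, compression, and on-disk segments on top); the document ids and tokenizer are illustrative only.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build an inverted index: term -> set of document ids."""
    index: defaultdict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # crude whitespace tokenizer
            index[token].add(doc_id)
    return index

def search(index: dict[str, set[int]], keyword: str) -> set[int]:
    """Answer a single-keyword query with one dictionary lookup."""
    return index.get(keyword.lower(), set())

docs = {1: "fast keyword search", 2: "database indexing", 3: "keyword database"}
idx = build_index(docs)
print(sorted(search(idx, "keyword")))  # -> [1, 3]
```

The key property, and the reason engines like Lucene and Sphinx scale, is that query time depends on the size of the posting list for the term, not on the total number of documents.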

Andre Holzner
I can recommend Lucene; it's very powerful. Not so easy to learn, but doable.
Tyzak
I used to use Apache Solr (which is based on Lucene), and it is awesome: lightning fast and quite easy to work with.
Yorirou
+2  A: 

Have you tried your database's built-in full-text indexing functions? You should try them and prove they don't work before deciding they're unsuitable and looking for something else.
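The question mentions MySQL and PostgreSQL, but the built-in full-text pattern is the same shape everywhere: declare a full-text index, then query it with the engine's match operator (`MATCH ... AGAINST` in MySQL, `tsvector @@ tsquery` in PostgreSQL). As a self-contained illustration, this sketch uses SQLite's built-in FTS5 module, since it runs without a server; table and column names are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# An FTS5 virtual table maintains a full-text index over its columns.
con.execute("CREATE VIRTUAL TABLE posts USING fts5(body)")
con.executemany("INSERT INTO posts(body) VALUES (?)",
                [("full text search at scale",),
                 ("a post about databases",),
                 ("another search post",)])
# MATCH consults the full-text index rather than scanning every row.
rows = con.execute(
    "SELECT rowid FROM posts WHERE posts MATCH ?", ("search",)
).fetchall()
print(sorted(r[0] for r in rows))  # -> [1, 3]
```

Whether such a built-in index holds up at 100 billion posts is exactly the thing worth measuring before reaching for a dedicated engine.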

Wade Williams
+1  A: 

Well, there are about 6.8 billion people on the planet, each of whom can read about 1 post per minute (on average).

If everyone contributes, 100 billion posts divided by 6.8 billion readers works out to about 14.7 minutes to have every post read.

So:

1) Conquer Earth.
2) Make everyone your slave.
3) Have posts read.
4) ???
5) Profit!

Coronatus
What the heck? :D
hey
Sorry, I saw "100 billion" and had to write an equally ridiculous answer.
Coronatus
+1 for perfect parody on the opening post. Don't be too dull!
Col. Shrapnel
How is anybody supposed to find or read any of the posts if you don't index them? Sure, many of them may go unread, but some of them will be read.
Kibbee
+1: For effort and making my day on SO.
shamittomar
-1 to keep things in balance.
InsertNickHere
+8  A: 

Sell the "100 billion" post data to Google. :)

They will index it for you for free, and you will make money.

shamittomar
Don't be silly -- Google already has the entire universe in memory!
Coronatus
But maybe they are missing his '100 billion' posts; otherwise he could have searched them directly :)
shamittomar
+1  A: 

Use Google Custom Search. Plus you will earn a little bit and save a lot of hosting resources.

hey
A: 

First of all, are we talking about keywords in separate fields or in the post contents?

If separate fields, that's sort of OK. Just build a table of keyword-post relationships and do a simple search: SELECT post_id FROM ... WHERE keyword = 'X'.

If we are talking about full-text indexing, you'd be best off using custom indexing software, such as suggested in the other answers.
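The separate-fields case above can be sketched end to end with a plain relational table and an ordinary index; here SQLite stands in for MySQL/PostgreSQL, and the table and column names are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per (keyword, post) pair, with an index on the keyword column
# so lookups don't scan the whole table.
con.execute("CREATE TABLE post_keywords (keyword TEXT, post_id INTEGER)")
con.execute("CREATE INDEX idx_keyword ON post_keywords(keyword)")
con.executemany("INSERT INTO post_keywords VALUES (?, ?)",
                [("X", 1), ("Y", 1), ("X", 2), ("Z", 3)])
# The simple search from the answer above:
post_ids = [row[0] for row in con.execute(
    "SELECT post_id FROM post_keywords WHERE keyword = ?", ("X",))]
print(sorted(post_ids))  # -> [1, 2]
```

At 100 billion posts with 500-1000 keywords each, this table would hold tens of trillions of rows, which is why the full-text case really does call for dedicated indexing software.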

Gnudiff