Hi, how would you guys go about creating a "real-time" search engine on the .NET platform? Near-real-time search of the web is so popular nowadays, and I was hoping you guys would help me brainstorm some ideas. I might try to build a prototype eventually, but mostly this is just "mental training".

The requirements are:

  1. .NET platform, IIS, MS SQL Server or Lucene.Net (file system)
  2. input data to be indexed are only keywords plus some meta information - no further processing required
  3. data are grouped by keywords and ordered by number of occurrences of the keywords
  4. no historic data are kept (data older than some fixed amount of time are discarded or moved to some other data store)

Not knowing much about the subject matter, this is what I've come up with so far:

Data are fed to the system through a web service. Since the data are already in the form of keywords, no further processing is performed. The web service saves the data to the database. A SELECT query is run at fixed time intervals to return the data (for example, we query the incoming data for the past hour and run the query every second). Grouping and sorting are performed in memory to offload SQL Server. Old data in the database are discarded every couple of minutes. I'm not sure how SQL Server would handle that if many new rows were being added constantly. The grouped and sorted data are then displayed.
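To make the "group and sort in memory" step concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea carries over to .NET). The function name `top_keywords` and the shape of `rows` as (keyword, received_at) tuples are assumptions for illustration, standing in for whatever the periodic SELECT returns.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_keywords(rows, now, window=timedelta(hours=1), limit=10):
    """Group rows by keyword and order them by number of occurrences.

    rows: iterable of (keyword, received_at) tuples (hypothetical shape
    of the rows returned by the periodic query).
    """
    cutoff = now - window
    # Count only rows that fall inside the time window.
    counts = Counter(kw for kw, received_at in rows if received_at >= cutoff)
    # most_common returns (keyword, count) pairs, highest count first.
    return counts.most_common(limit)


now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    ("dotnet", now - timedelta(minutes=10)),
    ("lucene", now - timedelta(minutes=5)),
    ("dotnet", now - timedelta(minutes=1)),
    ("old", now - timedelta(hours=2)),  # outside the one-hour window
]
print(top_keywords(rows, now))
```

Re-running this every second over a full hour of rows repeats a lot of work, so it probably only suits low volumes.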

I'm sure you guys have more experience and better ideas for this kind of thing.

Regards,

Ondrej

A: 

This site is not really for brainstorming, or to help you design applications.

You may want to post this on http://answers.onstartups.com/ to see what requirements and suggestions people have for this idea, and whether a real-time web search makes any business sense.

But, you would need to determine how you can go faster than Google.

James Black
I appreciate your input, but I think you slightly misunderstood my question. I'm not asking whether there is any business sense in it or how I could beat Google. I'm simply asking, given the requirements above, what would be the best way to implement such a system.
Ondrej Stastny
+1  A: 

From your description of the system, a bare-bones database schema might look like the following:

keyword
  - id (primary key)
  - keyword (unique)

input
  - id (primary key)
  - data (text)

input_keyword
  - id (primary key)
  - input_id (foreign key)
  - keyword_id (foreign key)
  - count (integer; the number of times the keyword with id keyword_id appears in the input with id input_id)
  - expiration_date (timestamp; at regular intervals, all entries that have expired need to be deleted)

Data operations would be as follows:

  1. Writes: Whenever an input operation is performed, your database engine will have to handle a write operation that writes to all three tables.
  2. Reads: Whenever a search operation is performed, your database engine will need to handle a read operation across all three tables.
  3. Deletes: At regular intervals, you'll need to remove expired entries from the input_keyword table and, if desired, the keyword table.
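As a rough illustration of the schema and the three operations above, here is a sketch using Python's built-in sqlite3 module. SQLite is only a stand-in so the example is self-contained; the question targets MS SQL Server, where the DDL and upsert syntax would differ. The function names (`open_db`, `write_input`, `search`, `purge_expired`) and the choice to store expiration_date as a Unix timestamp are assumptions, not part of the original design.

```python
import sqlite3

SCHEMA = """
CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT UNIQUE NOT NULL);
CREATE TABLE input (id INTEGER PRIMARY KEY, data TEXT);
CREATE TABLE input_keyword (
    id INTEGER PRIMARY KEY,
    input_id INTEGER NOT NULL REFERENCES input(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    count INTEGER NOT NULL,
    expiration_date REAL NOT NULL  -- Unix timestamp (assumption)
);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def write_input(conn, data, keyword_counts, expires_at):
    """Write: touches all three tables inside one transaction."""
    with conn:
        input_id = conn.execute(
            "INSERT INTO input (data) VALUES (?)", (data,)
        ).lastrowid
        for kw, n in keyword_counts.items():
            conn.execute("INSERT OR IGNORE INTO keyword (keyword) VALUES (?)", (kw,))
            (kw_id,) = conn.execute(
                "SELECT id FROM keyword WHERE keyword = ?", (kw,)
            ).fetchone()
            conn.execute(
                "INSERT INTO input_keyword (input_id, keyword_id, count, expiration_date)"
                " VALUES (?, ?, ?, ?)",
                (input_id, kw_id, n, expires_at),
            )
    return input_id


def search(conn, keyword, now):
    """Read: unexpired inputs containing the keyword, most occurrences first."""
    return conn.execute(
        """SELECT i.data, ik.count
           FROM input_keyword ik
           JOIN keyword k ON k.id = ik.keyword_id
           JOIN input i ON i.id = ik.input_id
           WHERE k.keyword = ? AND ik.expiration_date > ?
           ORDER BY ik.count DESC""",
        (keyword, now),
    ).fetchall()


def purge_expired(conn, now):
    """Delete: remove expired input_keyword rows; returns the number removed."""
    with conn:
        return conn.execute(
            "DELETE FROM input_keyword WHERE expiration_date <= ?", (now,)
        ).rowcount
```

For example, after `write_input(conn, "doc a", {"search": 3}, expires_at=100.0)`, a call to `search(conn, "search", now=50.0)` returns that row, and `purge_expired(conn, now=150.0)` removes it. Orphaned rows in input and keyword are left in place here; cleaning them up would be a separate step.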

On a highly trafficked system, your database will be hit quite often. Since you are really only using the database for the convenience of performing SELECT operations across these tables, and since the data is very short-lived, you might be better off just using an in-memory data structure to replace the "keyword" and "input_keyword" tables to eliminate hits to disk. This may require more complex application code, but it may be worth it on a busy system.
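One possible shape for such an in-memory structure is a sliding-window counter: a queue of timestamped events plus a running count per keyword. The sketch below is only an illustration under assumptions of my own: the class name `SlidingKeywordCounts` is made up, events are assumed to arrive in non-decreasing timestamp order, and there is no locking, which a real multi-threaded web service would need.

```python
from collections import Counter, deque


class SlidingKeywordCounts:
    """Keyword counts over the last `window_seconds`, held entirely in memory."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()    # (timestamp, keyword), oldest first
        self.counts = Counter()  # keyword -> occurrences inside the window

    def add(self, keyword, timestamp):
        # Assumes timestamps arrive in non-decreasing order.
        self.events.append((timestamp, keyword))
        self.counts[keyword] += 1

    def expire(self, now):
        # Drop events that have aged out of the window and undo their counts.
        cutoff = now - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, keyword = self.events.popleft()
            self.counts[keyword] -= 1
            if self.counts[keyword] == 0:
                del self.counts[keyword]

    def top(self, now, limit=10):
        # Expire first so the result reflects only the current window.
        self.expire(now)
        return self.counts.most_common(limit)
```

Each event is appended once and removed once, so the cost of expiry is spread evenly over the writes, and no periodic bulk DELETE is needed. The trade-off is that the data are lost on restart, which seems acceptable given that no historic data are kept.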

jkndrkn