views:

39

answers:

2

The speed of Eudora and GMail for instance in looking through thousands of emails and finding the right set of messages amazes me. I use Eudora and the search is so blazing fast at running through ten years of emails within a few seconds.

So my question is, how do they store and retrieve messages? What data structures to store the data, the indices, what algorithms? How are the messages stored on disk/database?

A: 

I would be surprised if this search was slow. Let's say, you have n=10000 emails, m=1000 characters each. Any decent substring-detection algorithm will give you O(n*m) speed. For provided values of n and m, it's under a second on modern PC.

Talking about storage, the clients I know put all emails in one big file, each client using their own format. This lets you to read all messages from disk reasonably fast.

If you're interested, this is a classical substring-search algorithm (there're many more):
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

edit
I'm not claiming any email app uses simple substring search, just that using it would be fast enough already.

Nikita Rybak
A: 

Both use the same secret sauce, albeit in a completely different technology : indexes.

Eudora uses the mbox format for every mailbox and folder which is basically a big file with all mails one after the other. If you check these files you see a smaller file with the same name and extension .IDX or somthing. This is an index which allows fast looking to where the individual emails start. Another smart move of Eudora is to remove the attachements fom the eamils which reduces the bulk of the mailboxes by an order fo magnitude, speeding up the managment in the process. This causes Eudora to be able to manage mailboxes with an order of magnitude more mails than most other clients.

Google is the master of indexes, they ahve been indexing the complete web for decades, so they applied their trade to your mailbox, allowing blazingly fast access to the mails because all relevant facts are separately indexed. They also have special technology to quickly retrieve documents like emails.

Peter Tillemans