Are there any languages or algorithms I can look at? Or is there an open source application I can build on?
Basically, you'll need the following things:
- a web crawler (spider) which collects the content of web pages by parsing the HTML and following links
- a full text search engine to store and index the data
- a web interface for querying the full text index
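To make the first two pieces concrete, here is a minimal Python sketch of a crawler feeding a tiny in-memory index. It is only an illustration, not how any real engine does it: `fetch` is a hypothetical function you supply (it takes a URL and returns the HTML, or None), `max_pages` is an arbitrary limit, and the parser naively treats all text in the page, including scripts and styles, as content.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class PageParser(HTMLParser):
    """Collects the text and the href targets of one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` is a hypothetical hook returning HTML or None."""
    index = defaultdict(set)  # word -> set of URLs containing that word
    seen = {start_url}
    queue = deque([start_url])
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        crawled += 1
        parser = PageParser()
        parser.feed(html)
        # Index every word of the page text under the page's URL.
        for word in re.findall(r"\w+", " ".join(parser.text_parts).lower()):
            index[word].add(url)
        # Follow links, resolving relative hrefs against the current URL.
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

In practice `fetch` would wrap an HTTP client and should respect robots.txt and rate limits; passing it in as a parameter also lets you try the crawler against a dictionary of canned pages.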
If you're familiar with .NET, you could look at this implementation and series of articles:
Searcharoo.net: ASP.NET Search with C#
If you want to do it seriously, then you'll need a lot of resources in the form of bandwidth and disk space to acquire and store all the data.
Take a look at http://lucene.apache.org/java/docs/, the Java search library. You won't be able to get the kind of performance and effectiveness that Google has (because their technology is proprietary, obviously, and it's damned good), but to learn about search, Lucene is a good start.
Look at Google's tech papers, where the concept of their own search engine is explained, plus some more.
There are two ways you could become remotely successful in making a public search engine.
- You have a lot of money (for marketing, research and server costs), or
- You have a brilliant idea for how to make a search engine that delivers better results.
You ask about open source applications, implying you don't have a lot of money, and you ask about algorithms, indicating you don't have any brilliant ideas.
Still, in answer to the question:
A search engine fundamentally needs to accept some sort of search query (for instance, a series of words) and then return web resources that it thinks are appropriate to the query. The way it decides what is appropriate to the query can be very important. One can use a simple algorithm like counting how many times any of the search words appears on a page. Google uses sophisticated techniques, including investigating the pages that link to a given page.
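The simple word-counting approach could look something like this in Python. This is only a sketch of that naive idea; the function names `score` and `rank` and the `pages` dictionary (URL mapped to page text) are made up for the example.

```python
import re
from collections import Counter


def score(page_text, query):
    """Count how many times any of the query words occurs in the page text."""
    counts = Counter(re.findall(r"\w+", page_text.lower()))
    query_words = set(re.findall(r"\w+", query.lower()))
    return sum(counts[word] for word in query_words)


def rank(pages, query):
    """Return (url, score) pairs with a non-zero score, best match first."""
    scored = [(url, score(text, query)) for url, text in pages.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

A scheme like this is trivial to game by repeating a word many times on a page, which is one reason link-based signals are attractive.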
However, the web has billions of pages, and whatever method is used must also be computationally feasible at that scale.
I believe Google's PageRank algorithm involves finding an eigenvector for an enormous sparse matrix. You may find this document relevant.
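To give a feel for the eigenvector idea, here is a small Python sketch that approximates PageRank-style scores by power iteration on a toy link graph. It is not Google's implementation: the damping factor of 0.85 is the commonly quoted value, while the fixed iteration count and the even spreading of rank from pages without outgoing links are assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank; `links` maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank on in equal shares to the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks
```

Repeating the update is the same as repeatedly multiplying a rank vector by the link matrix, so the ranks converge towards its dominant eigenvector. Storing only the links that actually exist, as the dictionary does here, is what makes the sparse matrix manageable.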
Google itself offers some tools for embedding its technology in your own projects:
- Google Custom Search searches only a specific portion of the web, with custom rules
- Google AJAX Search lets you put a customized search form in your pages
- Google Enterprise and Google Search Appliance are scalable solutions that use Google's technology to search private intranets or index a company's documents
There's a great book by Toby Segaran, Programming Collective Intelligence, that covers most of the algorithms needed for recommendation engines, searching and ranking, etc. Highly recommended!
The chapter about search results ranking used to be available as a sample chapter on O'Reilly's site; unfortunately, it seems this is no longer the case. See if you can find it somewhere on the net. You can also download the code examples (Python).
Edit: you can get the sample chapter here: http://www.oreilly.de/catalog/9780596529321/chapter/ch04.pdf
So search engines are very complex beasts. However, just like building game engines, they are certainly worth tackling if you are interested. Be warned that they present both difficult software engineering and computer science challenges. I started by building my own from scratch for a very specific part of the web. This took the better part of a year of full-time work.
I would recommend that you start with Nutch, which is written in Java. It is a search engine that uses Lucene and Hadoop and is also an Apache project. It will scale up to a reasonable size on a cluster, although it lacks many capabilities of a modern search engine. However, since it is based on Hadoop, you can fill in a lot of the gaps: PageRank, BrowseRank, spam detection with SVMs, crawler scaling, etc.
Understanding core search technology is definitely worthwhile. It will open your eyes to new applications that might be built. You will definitely learn a lot while building it. There are lots of applications of search technologies to things other than public search engines such as Google. I know, since I work on a really cool one that I would never have gotten the opportunity to work on if I had not built my own from scratch before.
So set your expectations to something realistic and get building. I recommend working on something focused like tech blogs or news related to a specific industry. Keeping it focused will allow you to better understand the problem space. Happy coding.