views: 168
answers: 4

As part of my final-year project in undergraduate Computer Science I am looking to write a search engine that ranks websites on how well they conform to standards. I've read several articles and posts about writing a search engine and how difficult it can be, but I'm feeling ambitious and hope that within 5-6 months I would be able to have a working search engine: obviously not commercially viable or working for real users, but enough to crawl my own domain of several hundred pages and rank them on their subject and on how well they conform to W3C standards.

I'd like to give this a try in a language like C#, PHP or Python, but before I get ahead of myself I'd like to know what knowledge one must obtain to be able to undertake such a project and whether it is doable in half a year.

For me this is purely a learning exercise to test what I am capable of. I know that there are many open-source search libraries available, like Lucene.NET, that I could use in a real-world situation, but I'd like to give writing one a shot. Do you think that a final-year undergrad is capable of writing a functional search engine for a small website?

EDIT: This search engine would be an online search engine, usable through a web page front-end. I'd only want to crawl the web pages on a dummy website I've put up, consisting of no more than fifty pages for now. The idea is to use several metrics to determine which website is best from a design perspective, most notably by using a code validator.
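To make the validator metric concrete, this is roughly what I had in mind (a rough sketch in Python; I'm assuming the W3C Nu checker's JSON interface accepts a doc URL plus an out=json parameter, which I'd still double-check against its documentation):

```python
# Rough sketch of the "standards validity" metric: ask the W3C Nu checker
# to validate a page by URL and count the error messages it reports.
# Assumption: the checker at validator.w3.org/nu accepts doc=<url>&out=json
# and returns a JSON object with a "messages" list.
import json
import urllib.parse
import urllib.request

def validation_error_count(page_url):
    query = urllib.parse.urlencode({"doc": page_url, "out": "json"})
    req = urllib.request.Request(
        "https://validator.w3.org/nu/?" + query,
        headers={"User-Agent": "student-search-engine-prototype"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return sum(1 for m in result.get("messages", []) if m.get("type") == "error")

# A page with fewer errors would rank higher on the standards metric.
print(validation_error_count("http://example.com/"))
```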

+6  A: 

It depends on the undergrad. A couple of grad students did pretty well at it. But especially with the wealth of prior art and research out there to refer to, I'd think that in that much time someone with a reasonable skill set should be able to put together something non-commercial-grade.

T.J. Crowder
+3  A: 

Should be doable. Essentially you would be wiring something similar to the W3C's standards checker up to a web crawler like "wget" or "curl" and storing the results in some sort of database for analysis.

Most of the essential components are out there, and there is a good deal of literature on the subject. Also, as far as I know, nothing exactly like your proposal for a standards-checking crawler exists, so the results would be both original and interesting.
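For the "some sort of database" part, SQLite would be plenty at a few hundred pages. A minimal sketch of the storage and analysis side (the error and warning counts would come from whatever checker you wire in):

```python
# Store one row per crawled page; rank later with a simple query.
import sqlite3

def save_result(db, url, error_count, warning_count):
    db.execute(
        "INSERT OR REPLACE INTO pages (url, errors, warnings) VALUES (?, ?, ?)",
        (url, error_count, warning_count),
    )
    db.commit()

db = sqlite3.connect("crawl.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, errors INTEGER, warnings INTEGER)"
)
save_result(db, "http://example.com/", 3, 1)

# "Best" page first, where best simply means fewest validation errors.
for url, errors in db.execute("SELECT url, errors FROM pages ORDER BY errors ASC"):
    print(errors, url)
```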

Good Luck!

James Anderson
I actually found the idea on Y Combinator. My goal was originally to rank websites on how "good" their design is, but that is probably too ambitious for half a year of work for an undergrad. A lot of niche search engines seem to be taking off (Powerset being an example), so I decided that writing a search engine would be both challenging and easy to relate to, as well as sounding impressive on a grad-school application form.
EnderMB
+2  A: 

Sounds perfectly feasible on the ~1000 page scale. That's not really a whole lot of data.

Once you've got an algorithm prototype, I think the main problem you'd have with scaling up to Google-like size is the sheer amount of storage and processing power needed (and especially the funding for it).

wefwfwefwe
Once the basic search and classification were sorted out, it could be distributed as a screen saver which would search the 'local' network and report the results back, something like SETI@home. There are gigaflops of unused computing power sitting in back bedrooms.
James Anderson
I'd have limited funding as it would be a university project. A lot of my lecturers are very hardware oriented so I'm sure it wouldn't be a problem to find an old server somewhere that could be used.
EnderMB
+3  A: 

I think what you need to be aware of is the expected size and scope of the target domain.

A search engine has two parts:

  • Part 1 essentially crawls over a set of web pages and builds some form of index of them.

  • Part 2 takes a user query and looks at the index to locate pages which match (and ranks the matches in order of importance)

Both of these tasks could be fairly simple for very small domains. For example, if your search space only included 2 pages, your index could simply consist of the entire textual content of each page (probably ordered alphabetically), and your search simply returns matches based on a matching word count.
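In code, the tiny-domain version of both parts really is only a dozen lines or so. A Python sketch just to show the shape of it (the page texts are made up):

```python
# Toy version of both parts: build a word -> pages index, then rank
# query results by how many of the query words each page contains.
from collections import defaultdict

pages = {
    "page1.html": "standards compliant html page about cats",
    "page2.html": "a page about html validation and standards",
}

# Part 1: index each page's words.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Part 2: rank pages by the number of query words they contain.
def search(query):
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("html validation"))  # page2.html scores 2, page1.html scores 1
```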

It quickly becomes harder as you (a) increase search space, and (b) increase complexity of the index - and thus complexity of the search algorithm.

As you increase the search space, your index has to become cleverer, because you can no longer afford to store the pages in full: a search across the full text would take too long. This problem gets hard very quickly, and you will soon start to hit limits on practical storage space and retrieval times. (This is why search engines like Google have re-invented basic things like the database and the file system to cope with the sheer quantity and speed of data access required.)

I would say that a very simple search engine is certainly possible for a good 3rd-year student in 6 months, provided you don't set your expectations on search size too high too early. Start low: implement something that just crawls and searches a few pages (say 5-10) on a single domain, perhaps starting from a list of known pages to search. Then expand: add link following, then increase the search size. Slowly build up until you reach the end of the project.
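As a starting point, here is a minimal sketch of that first crawling stage in Python, using nothing but the standard library (it ignores robots.txt and error handling, which a real crawler shouldn't):

```python
# Minimal "start low" crawler: begin from one known page, follow links,
# and stay on a single domain, stopping after a handful of pages.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    domain = urllib.parse.urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if urllib.parse.urlparse(absolute).netloc == domain:
                queue.append(absolute)
    return seen

print(crawl("http://example.com/"))
```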

Good luck.

Simon P Stevens