views:

37

answers:

1

Hello all,

I am developing a component that stores URLs and maintains a set of keywords associated with each URL. For example:

URL: http://www.imdb.com Keywords: search, movies, movie-index, reviews

The keywords themselves are not restricted in number. The number of URLs may be large, ranging from 10K to 100K. What's the best approach to associating and storing the URLs with their keywords? It should support search by keyword and listing based on keyword combinations. I have a feeling a relational DB is not a good approach for this.

Maybe my question boils down to "how does a search engine work?". But I am looking for more specific information, like: are there tools available to store the keywords and index them? I have heard of Apache Lucene, but that seems to be more of a full-text search engine.

What does stackoverflow use internally to associate the keywords with articles? :)

Thanks,
-Keshav

+1  A: 

You really should be using a fully normalised database design.

A table of URLS, each with a unique URLID; a table of KEYWORDS, each with a unique KEYWORDID; and finally a junction table, TABLE_URL_KEYWORDS, with the columns URLID and KEYWORDID, describing the many-to-many relationship between them (each URL has many keywords, and each keyword can belong to many URLs).
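A minimal sketch of this three-table design, here using SQLite via Python (the table and column names, helper functions, and sample data are illustrative, not from the original answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised schema: one row per URL, one row per keyword,
# and a junction table linking them many-to-many.
cur.executescript("""
CREATE TABLE urls (
    url_id INTEGER PRIMARY KEY,
    url    TEXT UNIQUE NOT NULL
);
CREATE TABLE keywords (
    keyword_id INTEGER PRIMARY KEY,
    keyword    TEXT UNIQUE NOT NULL
);
CREATE TABLE url_keywords (
    url_id     INTEGER NOT NULL REFERENCES urls(url_id),
    keyword_id INTEGER NOT NULL REFERENCES keywords(keyword_id),
    PRIMARY KEY (url_id, keyword_id)
);
""")

def add_url(url, keywords):
    # Insert the URL and its keywords, reusing existing rows where possible.
    cur.execute("INSERT OR IGNORE INTO urls(url) VALUES (?)", (url,))
    url_id = cur.execute("SELECT url_id FROM urls WHERE url = ?",
                         (url,)).fetchone()[0]
    for kw in keywords:
        cur.execute("INSERT OR IGNORE INTO keywords(keyword) VALUES (?)", (kw,))
        kw_id = cur.execute("SELECT keyword_id FROM keywords WHERE keyword = ?",
                            (kw,)).fetchone()[0]
        cur.execute("INSERT OR IGNORE INTO url_keywords VALUES (?, ?)",
                    (url_id, kw_id))

add_url("http://www.imdb.com", ["search", "movies", "movie-index", "reviews"])
add_url("http://www.example.com", ["search", "demo"])

def urls_with_all(kws):
    # Keyword-combination search: URLs matching ALL of the given keywords,
    # found by counting distinct keyword hits per URL.
    placeholders = ",".join("?" * len(kws))
    rows = cur.execute(f"""
        SELECT u.url
        FROM urls u
        JOIN url_keywords uk ON uk.url_id = u.url_id
        JOIN keywords k ON k.keyword_id = uk.keyword_id
        WHERE k.keyword IN ({placeholders})
        GROUP BY u.url_id
        HAVING COUNT(DISTINCT k.keyword_id) = ?
    """, (*kws, len(kws))).fetchall()
    return [r[0] for r in rows]

print(urls_with_all(["search", "movies"]))  # only imdb has both keywords
print(urls_with_all(["search"]))            # both URLs carry "search"
```

The `HAVING COUNT(DISTINCT ...)` trick is one common way to express "matches all of these keywords" against a junction table; swapping `HAVING` out entirely gives an "any keyword" search instead.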

You do not need to worry about performance: good database systems are exceptionally fast at these sorts of operations, having been designed specifically with the normalised format in mind. 100K URLs, 100K keywords, and 400K relation records are perfectly manageable for any of the popular database systems out there.

If you go down a different route, i.e. un-normalised, it's going to become a huge mess that will cost you, or someone else, a lot of time to sort out later down the line. I would put a lot of money on StackOverflow storing its data in a fundamentally identical format to the one described.

Tom Gullen
Well said... although the poster is going to have issues re: Google, now that httpS://www.google.com is alive.
NinjaCat
For the OP: I think he can strip the protocol off the URL, as it probably doesn't change the description of the site. However, if he decides it does matter, he can either treat each protocol as a separate URL entry (simplest and probably best) or normalise fully, and there are several ways to do that, all of which become a little more complex.
Tom Gullen
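The protocol-stripping the comment describes can be sketched like this (a deliberately simple normalisation that drops only the scheme; the function name is illustrative):

```python
from urllib.parse import urlparse

def strip_protocol(url):
    # Drop the scheme so http:// and https:// variants of the
    # same site map onto a single stored URL entry.
    parsed = urlparse(url)
    return parsed.netloc + parsed.path

print(strip_protocol("http://www.google.com"))   # www.google.com
print(strip_protocol("https://www.google.com"))  # www.google.com
```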