How do I write code that would find related (similar) articles to the one that the user is currently reading?

For example, suppose I have articles:

Python programming tips
Python programming for newbies
Programming in Python, ActionScript and Flash
Programming in the Jungle
Tarzan saves newbie Judy from using Fortran programming language

(I came up with these titles right now.)

How could I query the database and find that they are all related?

I'd appreciate any suggestions.

Thanks, Boda Cydo.

A: 

Which database are you using? Full-text search may help you, and MySQL has it built in. Search for "MySQL full-text search" to get started.
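
As a rough sketch of what that could look like from Python: the snippet below only builds a parameterized MySQL `MATCH ... AGAINST` query. The table name `articles`, the columns `id`, `title` and `body`, and the helper `build_related_query` are all assumptions made up for illustration, and the table would need a full-text index first (for example `ALTER TABLE articles ADD FULLTEXT(title, body);`).

```python
def build_related_query(article_id, article_text, limit=5):
    """Build a MySQL full-text query for articles similar to article_text.

    Assumes a hypothetical table `articles(id, title, body)` that has a
    FULLTEXT index on (title, body). Returns (sql, params) for a DB-API
    driver that uses the %s placeholder style.
    """
    sql = (
        "SELECT id, title, "
        "MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM articles "
        "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "AND id <> %s "          # do not return the article being read
        "ORDER BY score DESC "   # most relevant first
        "LIMIT %s"
    )
    params = (article_text, article_text, article_id, limit)
    return sql, params
```

You would then pass both values to your driver, e.g. `cursor.execute(sql, params)`, using the title (or title plus body) of the current article as `article_text`.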

Jaú
I'm using MySQL. Gonna google, thanks!
bodacydo
A: 

I suggest you take a look at cosine similarity and tf-idf.

Cosine similarity is a simple way to measure the similarity between two vectors (documents, but not only documents), and it can take as input vectors of words weighted using tf-idf.
Basically, the tf-idf weight of a word is higher if the word is frequent in the current document (term frequency, tf) but rare in the others (inverse document frequency, idf).
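
To make that concrete, here is a minimal pure-Python sketch of the idea, not a tuned implementation. It assumes a naive tokenizer (lowercase letters only, no stemming or stop words), uses tf = count / document length and idf = log(N / df), and the function names are just illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase alphabetic words only (an assumption).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, docs):
    """Return (score, doc_index) pairs for all other docs, best first."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[index], v), i)
              for i, v in enumerate(vectors) if i != index]
    return sorted(scores, reverse=True)
```

Note that with the five example titles from the question, "programming" occurs in every title, so its idf is log(1) = 0 and it contributes nothing; only rarer shared words such as "python" make two titles look related. Running this on full article text rather than titles should give more useful scores.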

f4
A: 

This book contains some tips for that; more specifically, this sounds like a Collaborative Filtering problem.

There are several approaches to the problem. One is tagging: rely on readers and contributors to tag the articles, and then match an article's keywords against those tags, for example.

Another approach could be to combine search with analytics, i.e. a Google-style approach. You show results for a search query and users click on them; over time, you will see that users who clicked on certain articles also clicked on related ones, and you can establish a relationship between those articles.
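
A minimal sketch of that click-based idea, assuming you can already log which articles each user (or session) clicked. The input format, a list of lists of article ids, and the function names are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def co_click_counts(sessions):
    """Count how often each pair of articles was clicked in the same session."""
    counts = Counter()
    for clicked in sessions:
        # Sort and de-duplicate so each pair is counted once per session
        # and always stored in the same (a, b) order.
        for a, b in combinations(sorted(set(clicked)), 2):
            counts[(a, b)] += 1
    return counts


def related_by_clicks(article, sessions, top_n=3):
    """Return the articles most often clicked together with `article`."""
    scores = Counter()
    for (a, b), n in co_click_counts(sessions).items():
        if a == article:
            scores[b] += n
        elif b == article:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]
```

In a real site you would probably keep these counts in the database and update them incrementally instead of recomputing them from the raw click log every time.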

Ariel
A: 

If your case is really a content-driven website, then asking the editors to add tags to every article is probably the best way. That's the way it's done all over the web (e.g. WordPress).

Additionally, there are ways to do it with language processing, but since you use Python I will leave that to the Python experts...

Hans Westerbeek
A: 

One suggestion is to add tags to all your articles. The related articles are the ones with similar tags.
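
One possible way to score "similar tags" is the Jaccard index (shared tags divided by all tags of both articles). The sketch below assumes the tags are already loaded into a dict mapping article to a set of tags; that structure and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two tag collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_by_tags(article, tags_by_article, top_n=3):
    """Return up to top_n other articles that share tags, best match first."""
    target = tags_by_article[article]
    scored = [
        (jaccard(target, tags), other)
        for other, tags in tags_by_article.items()
        if other != article
    ]
    # Drop articles with no tag in common, then sort by score (descending)
    # and by name so that ties come out in a stable order.
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]
```

For example, an article tagged `{"python", "programming"}` would rank one tagged `{"python", "programming", "beginner"}` above one tagged `{"python", "actionscript", "flash"}`.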

dale
I will consider this approach, thanks!
bodacydo