This may be a project well beyond my current skills, but I have about a full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. But I want something more advanced: duplicates removed and a customized presentation (that is, being able to define and change the format in which the news headlines are displayed).

I've played a bit with Yahoo Pipes and some other tools, and I am facing two big problems:

  1. Some sources don't provide RSS feeds. How do I create one for them?
  2. What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether they match by more than, say, 50%. Is that good practice, though?

Please also mention anything else (problems, suggestions, whatever) that I might not have considered.

+1  A: 

You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
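If you end up scraping outside of Pipes, the same idea can be sketched with Python's standard library instead of YQL. This is only an illustration under assumptions: it supposes the page puts each headline in a link inside an `<h2>` tag, which will differ from site to site, and the names `HeadlineParser` and `extract_headlines` are made up for the example.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects title/link pairs from <a> tags nested inside <h2> tags.

    The <h2><a href=...> layout is an assumption about the page markup.
    """

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current_href = None   # href of the link being read, if any
        self.text_parts = []       # text fragments of the current link
        self.items = []            # collected {"title", "link"} dicts

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.text_parts).strip()
            self.items.append({"title": title, "link": self.current_href})
            self.current_href = None
        elif tag == "h2":
            self.in_h2 = False


def extract_headlines(html_text):
    """Return a list of {"title", "link"} dicts found in the HTML text."""
    parser = HeadlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.items
```

The resulting list of items can then be treated like the entries of a feed, for example merged with the items coming from real RSS sources.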

About duplicates, take a look at this pipe.
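If you'd rather do the duplicate check yourself, the "more than 50% match" idea from the question can be sketched roughly like this. It uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure, which is my choice for illustration and not what the pipe does; the function names and the 0.5 threshold are assumptions you would want to tune against real headlines.

```python
from difflib import SequenceMatcher


def normalize(headline):
    """Lowercase the headline and collapse runs of whitespace."""
    return " ".join(headline.lower().split())


def is_duplicate(a, b, threshold=0.5):
    """True if the two headlines are more similar than the threshold.

    ratio() returns a value between 0.0 (nothing in common) and 1.0 (identical).
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() > threshold


def dedupe(headlines, threshold=0.5):
    """Keep the first occurrence of each group of similar headlines."""
    kept = []
    for headline in headlines:
        if not any(is_duplicate(headline, k, threshold) for k in kept):
            kept.append(headline)
    return kept
```

Note that this compares every headline against every one already kept, so it is fine for a few hundred items but gets slow for large batches. A character-level ratio of 50% is also fairly loose for short headlines, so a higher threshold may work better in practice.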

Customized presentation: if you want it truly customized, you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate that with JavaScript, or process it server-side.
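As a rough sketch of the server-side option, here is one way to turn JSON results into headlines with a format you can change at will. The JSON shape (a list of objects with `title`, `source` and `link` keys) is an assumption for the example, not the actual structure of a pipe's output, and `render_headlines` is a hypothetical helper name.

```python
import json
from html import escape


def render_headlines(json_text, template):
    """Format each news item with a user-defined template.

    Assumes json_text is a JSON list of objects; the template refers to
    their keys, e.g. "{title} ({source})". Values are HTML-escaped so the
    output can be dropped into a page.
    """
    items = json.loads(json_text)
    lines = []
    for item in items:
        safe = {key: escape(str(value)) for key, value in item.items()}
        lines.append(template.format(**safe))
    return lines
```

Changing the presentation is then just a matter of passing a different template, for example `'<a href="{link}">{title}</a>'` for a linked list or `"{title} ({source})"` for plain text.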

Mauricio Scheffer
+1  A: 
MrAnonymous