I'm writing a special crawler-like application that needs to retrieve the main content of various pages. Just to clarify: I need the real "meat" of the page (provided there is one, naturally).
I have tried various approaches:
- Many pages have RSS feeds, so I can read the feed and get the content specific to that page.
- Many pages describe their content in meta tags (via the "content" attribute).
- In a lot of cases, the object presented in the middle of the screen is the main "content" of the page.
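
To make the meta tag approach concrete, here is a minimal sketch of roughly what I mean, using only Python's standard `html.parser`. The names `MetaContentParser` and `extract_meta_content` are made up for illustration, and looking only at the `description` tag by default is an assumption; real pages vary a lot in which tags they fill in.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collects the content attribute of selected <meta> tags (illustrative helper)."""

    def __init__(self, wanted=("description",)):
        super().__init__()
        # Labels to look for; the default is an assumption, not a standard list.
        self.wanted = {name.lower() for name in wanted}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        # Pages label meta tags with either name= or property=.
        label = (attr_map.get("name") or attr_map.get("property") or "").lower()
        # Keep only the first occurrence of each wanted label.
        if label in self.wanted and label not in self.found:
            self.found[label] = attr_map.get("content", "").strip()


def extract_meta_content(html, wanted=("description",)):
    """Return a dict mapping each wanted meta label to its content string."""
    parser = MetaContentParser(wanted)
    parser.feed(html)
    return parser.found
```

Calling `extract_meta_content(page_html)` returns an empty dict when the page has no matching tag, which is exactly the case where this approach falls short.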
However, these methods don't always work, and I've noticed that Facebook does a mighty fine job of exactly this (when you attach a link, it shows you the content it found on the linked page).
So, do you have any tips on an approach I've overlooked?
Thanks!