What open source projects can be used for extracting relevant content from various webpages?

For example, the YouTube video ID from a YouTube page, a tweet ID from a Twitter page, or a Facebook UID from a Facebook profile.

You don't need an open source project for that. Lifting the ID from the page is usually a matter of parsing the URL that got you there. In YouTube's case, the "v" query string parameter holds the video ID. The other examples have similar answers.
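As a rough illustration of the URL-parsing approach, here is a minimal Python sketch using only the standard library. The function name `youtube_video_id` and the sample URL are made up for the example, and it only covers watch-style URLs that carry the ID in the "v" parameter.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the value of the "v" query string parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)  # e.g. {"v": ["..."], "t": ["42"]}
    values = query.get("v")
    return values[0] if values else None


# Hypothetical URL, used only to show the call.
print(youtube_video_id("https://www.youtube.com/watch?v=abc123XYZ_0&t=42"))
```

Other sites would need their own rule (a path segment rather than a query parameter, for instance), which is why this does not generalize on its own.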

Scott Stafford 2010-07-14 02:55:56

Scott, doing it for YouTube alone is easy. What if I want to do that for 100 site types?

David Haddad 2010-07-14 03:47:50

@David Haddad: Can you clarify your question then? What exactly do you want to extract, in a generic way, from arbitrary web pages? Just the identifying ID? Semantic information?

Scott Stafford 2010-07-14 04:17:24

@Scott Stafford It's kind of hard to explain. The main content of a page differs from one page type to another. Say you pass it the link to a tweet page: the main output would be the tweet ID, the author, and the tweet text. If you do the same with a YouTube video link, it would return the YouTube video ID, title, and so on. The output would vary from one site to the next.

David Haddad 2010-07-14 19:23:26

@David Haddad: I am pretty sure you're not going to find any prewritten project that knows the specific formats of every popular social networking/web 2.0 site and can parse them for you.

Scott Stafford 2010-07-15 02:54:49

The oEmbed protocol specifies a way to retrieve structured data about the content behind a URL. embed.ly is a company that provides an API based on that standard.

http://www.oembed.com/ http://embed.ly
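To make the oEmbed idea concrete, here is a small Python sketch of the consumer side: build the request URL (the spec defines a `url` parameter and an optional `format` parameter), then pick a few fields out of a JSON response. The endpoint `https://provider.example/oembed`, the helper names, and the sample response body are all assumptions for illustration. Real endpoints differ per provider, and no network call is made here.

```python
import json
from urllib.parse import urlencode


def oembed_request_url(endpoint, page_url, fmt="json"):
    """Build an oEmbed consumer request: <endpoint>?url=<page>&format=<fmt>."""
    return endpoint + "?" + urlencode({"url": page_url, "format": fmt})


def summarize_oembed(body):
    """Pick a few fields defined by the oEmbed spec out of a JSON response."""
    data = json.loads(body)
    keys = ("type", "title", "author_name", "provider_name")
    return {key: data.get(key) for key in keys}


# Hypothetical provider endpoint and page URL.
request_url = oembed_request_url(
    "https://provider.example/oembed",
    "https://provider.example/watch?v=abc123",
)

# Made-up response body in the shape the spec describes.
sample_body = json.dumps({
    "version": "1.0",
    "type": "video",
    "title": "Example clip",
    "author_name": "someone",
    "provider_name": "Example Provider",
})

print(request_url)
print(summarize_oembed(sample_body))
```

The appeal over per-site URL parsing is that the consumer code stays the same for every provider that exposes an endpoint. Only the endpoint address changes.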

David Haddad 2010-10-11 17:29:02
