views:

267

answers:

1

I'm having problems getting to the rss link that tells the browser where the rss is for the site. The link is found in the <head> tag of the html here is an example of what the link looks like.

<link rel="alternate" type="application/rss+xml" title="CNN - Top Stories [RSS]" href="http://rss.cnn.com/rss/cnn_topstories.rss" />

My original approach was to treat the site like an XML file and look through the tags, but most sites have an arbitrary number of <meta> tags that forget to have a ending /> so the <link> tag I'm looking for becomes a child of a random <meta> tag.

Now I'm thinking of just treating the site like a string and looking for the <link> tag in it, but this causes problems since the <link> tag can have its attributes in any order possible. Of course I can work around this, but I would prefer something a bit neater than look for type="application/rss+xml" then look to the left and right of it for the first href it sees.

+1  A: 

HTML parsing is hard! Even if you find a solution that works for one site, it will likely break in another. if you can find a library to help you your life will be a lot easier.

If you can not find a html parser for actionscript 2, maybe you could set up a server script to it for you? Like:

myXML.load("http://yourserver.com/cgi-bin/findrss?url=foo.com");

and then have it return the url as xml

If you try this approach, I recommend the python library Beautiful Soup. I've used it before and, in my opinion, it's amazing. It will work on any website you give it, no matter how horrible the markup is.

It would look something like this:

#!/usr/bin/python
import cgi
import cgitb; cgitb.enable() # Optional; for debugging only
import urllib2
from BeautifulSoup import BeautifulSoup

def getRssFromUrl(url):
    try:
        Response = urllib2.urlopen(url)
    except Exception:
        print "<error>error getting url</error>"
        return []
    html = Response.read()
    soup = BeautifulSoup(html)
    rssFeeds = soup.findAll('link', attrs={"type" : "application/rss+xml"})
    return rssFeeds

print "Content-type: text/xml\n\n"
form = cgi.FieldStorage()
if form.has_key("url") is True:
    url = form["url"].value
else:
    url = ""
print "<xml>"
rssFeeds = getRssFromUrl(url)
for feed in rssFeeds:
    print ("<url>%s</url>" % feed["href"])
print "</xml>"
asperous.us
Unfortunately I cannot set up a server script to do it for me, but thank you for the good response :D
Anton