I am experimenting with scraping certain pages from an RSS feed using curl and PHP. The page scraping was working fine when I used the actual links, but not with links taken from the RSS feed. I realize now that links in RSS feeds are usually just redirects to the actual page (at least that is what it seems like), because when I scrape a page using the RSS link, it doesn't actually get the information I am looking for.

Has anyone encountered this and knows of a workaround? Is there any way to see where the RSS link is redirecting to and capture that value?

A: 

I think you might need to use the -L switch to tell curl to follow redirects. I'm not sure whether you can do this directly from PHP or whether you need to take the approach described here: http://php.net/manual/en/function.curl-setopt.php#95027. It is also possible that the site you are scraping blocks by user agent or something similar. Maybe try one of the links in a browser while running Fiddler or a similar tool to see whether any redirection is actually taking place.

Martin Smith
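
For what it's worth, in PHP's curl extension the equivalent of the command-line -L switch is CURLOPT_FOLLOWLOCATION, and curl_getinfo() can report the URL the request finally landed on. Below is a minimal sketch along those lines; the feed URL and user-agent string are placeholders, and on older PHP versions CURLOPT_FOLLOWLOCATION is refused when open_basedir or safe_mode is in effect, which is the situation the linked manual comment works around.

<?php
// Minimal sketch: fetch one RSS item link, follow redirects, and capture
// the final URL. The URL and user-agent string below are placeholders.
$url = 'http://example.com/some-rss-item-link';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // PHP equivalent of curl's -L switch
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);          // guard against redirect loops
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; FeedScraper/1.0)'); // some sites block the default agent

$html = curl_exec($ch);

// The URL the content was actually served from, after all redirects.
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

echo "Final URL: $finalUrl\n";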
Thanks, yeah, I managed to find a script that loops through the redirects and finds the last one. It seems like most sites don't block by user agent, but some do.
pfunc
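
For reference, here is a rough sketch of the kind of redirect-chasing loop pfunc describes, assuming the feed links redirect with ordinary 3xx responses and Location headers. The function name and example URL are made up for illustration; a real version would also need to resolve relative Location values against the current URL, and some servers treat the HEAD-style request used here differently from a normal GET.

<?php
// Rough sketch: follow redirects one hop at a time and return the last URL.
// Handy when CURLOPT_FOLLOWLOCATION is unavailable (e.g. open_basedir set).
// resolve_redirects() is a hypothetical helper name, not from the thread.
function resolve_redirects($url, $maxHops = 10)
{
    for ($i = 0; $i < $maxHops; $i++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, skip the body
        curl_setopt($ch, CURLOPT_HEADER, true);          // include headers in the returned string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $headers = curl_exec($ch);
        $code    = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        // Stop if the request failed or we got a non-redirect response.
        if ($headers === false || $code < 300 || $code >= 400) {
            return $url;
        }

        // Pull the next hop out of the Location header (assumes an absolute URL).
        if (!preg_match('/^Location:\s*(\S+)/mi', $headers, $m)) {
            return $url;
        }
        $url = trim($m[1]);
    }
    return $url; // gave up after $maxHops hops
}

echo resolve_redirects('http://example.com/some-rss-item-link'), "\n";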