For whatever reason, IBM uses https (without requiring credentials) for their RSS feeds. I'm trying to consume https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en with a .NET 4 SyndicationFeed. I can open this feed in a browser and it loads just fine. Here's the code:

        using (XmlReader xml = XmlReader.Create("https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en"))
        {
            var items = from item in SyndicationFeed.Load(xml).Items
                        select item;
        }

Here's the exception:

System.Net.WebException was unhandled by user code
Message=The remote server returned an error: (500) Internal Server Error.
Source=System
StackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at System.Xml.XmlDownloadManager.GetNonFileStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlDownloadManager.GetStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlUrlResolver.GetEntity(Uri absoluteUri, String role, Type ofObjectToReturn)
   at System.Xml.XmlReaderSettings.CreateReader(String inputUri, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri, XmlReaderSettings settings, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri)
   at EDN.Util.Test.FeedAggTest.LoadFeedInfoTest() in D:\cdn\trunk\CDN\Dev\Shared\net\EDN.Util\EDN.Util.Test\FeedAggTest.cs:line 126

How do I configure the reader to work with an https feed?

+2  A: 

I don't think it has anything to do with security. A 500 error is a server-side error. Something in the request generated by XmlReader.Create(url) is confusing the IBM site. If it were simply a security issue, as your question suggests, you'd expect a 403 "Authorization Denied" error. But you got a 500, which is an application error.

Even so, maybe there's something the client app can do, to avoid confusing the server.

I looked at the outgoing HTTP request headers, using Fiddler. For a request generated by IE, the headers look like this:

GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
Accept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-silverlight, application/x-shockwave-flash, application/x-silverlight-2-b2, */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; .NET CLR 3.5.30729;)
Accept-Encoding: gzip, deflate
Host: www.ibm.com
Connection: Keep-Alive
Cookie: UnicaNIODID=Ww06gyvyPpZ-WPl6K7y; conxnsCookie=en; IBMPOLLCOOKIE=""; UnicaNIODID=QridYHCNf7M-WYM8Usr

For a request from XmlReader.Create(url), the headers look like this:

GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
Host: www.ibm.com
Connection: Keep-Alive

Quite a difference. Also, the 500 response to the latter request included a Set-Cookie header, which wasn't present in the response to IE.

Based on that I theorized that it was the difference in request headers, in particular the cookie, that was confusing ibm.com.


I don't know how to convince XmlReader.Create() to embed all the request headers I wanted, including the cookie. But I know how to do that with an HttpWebRequest. So I used that.

There were a few hurdles I had to clear.

  1. I needed the persistent cookie for ibm.com. For that I had to resort to a p/invoke of the Win32 InternetGetCookie function. See the PersistentCookies class attached in the user-contributed content at the bottom of the doc page for WebRequest for how to do that; a rough sketch of such a helper also appears just after this list. After attaching the cookie, I was no longer getting 500 errors. Hooray!

  2. But the resulting stream could not be read by XmlReader.Create(). It looked binary to me. I realized I needed to decompress the gzipped or deflated content. I first thought I'd have to wrap a GZipStream or DeflateStream around the received response stream and hand the decompressing stream to the XmlReader, but the simpler fix is to set the AutomaticDecompression property on the HttpWebRequest. I could have avoided the need for this by not including "gzip, deflate" in the Accept-Encoding header of the outbound request; in fact, after setting the AutomaticDecompression property, that header is set implicitly on the outbound HTTP request.

  3. When I did that, I got actual text. But some of the byte codes were off. Next I needed to use the proper text encoding in the TextReader, as indicated in the HttpWebResponse.

  4. After doing that, I got a sensible string, but the resulting decompressed rss stream caused the XmlReader to choke, with

    ReadElementString method can only be called on elements with simple or empty content. Line 11, position 25.

    I looked and found a small <script> block at that location, inside the <copyright> element of the rss document. It seems IBM is trying to get the browser to "localize" the copyright date with client-side script that formats it. Seems like overkill to me, or even a bug by IBM. But because markup embedded in the text content of an element bothers the XmlReader, I removed the script block with a Regex replace.
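
For reference, here is a minimal sketch of what such a PersistentCookies helper might look like. It is not the exact class from that doc page, just an illustration of the InternetGetCookie p/invoke approach; the class and method names were chosen to match the call in the code further down.

using System;
using System.Net;
using System.Runtime.InteropServices;
using System.Text;

// Sketch only: reads the WinINet (IE) persistent cookies for a URL
// into a CookieContainer that HttpWebRequest can use.
public static class PersistentCookies
{
    // p/invoke of the Win32 InternetGetCookie function in wininet.dll
    [DllImport("wininet.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern bool InternetGetCookie(
        string url, string cookieName, StringBuilder cookieData, ref int size);

    public static CookieContainer GetCookieContainerForUrl(string url)
    {
        var container = new CookieContainer();
        int size = 0;

        // First call with a null buffer just reports the required buffer size.
        InternetGetCookie(url, null, null, ref size);
        if (size > 0)
        {
            var sb = new StringBuilder(size);
            if (InternetGetCookie(url, null, sb, ref size))
            {
                // SetCookies expects comma-separated cookies, but
                // InternetGetCookie returns them separated by semicolons.
                container.SetCookies(new Uri(url), sb.ToString().Replace(';', ','));
            }
        }
        return container;
    }
}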


After clearing those hurdles, it worked. The .NET app was able to read the RSS stream from that https url.

I didn't do any further testing - to see if varying the Accept header or the Accept-Encoding header would change the behavior. That's for you to figure out, if you care.
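
If you do want to experiment, an untested sketch of that kind of probe follows; it reuses the hypothetical PersistentCookies helper sketched above and assumes the same using directives as the code below.

string url = "https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en";
foreach (string accept in new[] { "text/xml, */*", "*/*", null })
{
    var req = (HttpWebRequest) WebRequest.Create(url);
    req.CookieContainer = PersistentCookies.GetCookieContainerForUrl(url);
    if (accept != null)
        req.Accept = accept;   // try each candidate Accept header in turn
    req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    try
    {
        using (var resp = (HttpWebResponse) req.GetResponse())
            Console.WriteLine("Accept={0} => {1}", accept ?? "(none)", (int) resp.StatusCode);
    }
    catch (WebException ex)
    {
        // A 500 comes back as a WebException; report the status code if we got one.
        var resp = ex.Response as HttpWebResponse;
        Console.WriteLine("Accept={0} => {1}", accept ?? "(none)",
                          resp == null ? ex.Status.ToString() : ((int) resp.StatusCode).ToString());
    }
}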

The resulting code is below. It's much uglier than your simple 3-liner. I don't know how to make it any simpler.

using System;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public void Run()
{
    string url = "https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en";

    HttpWebRequest hwr = (HttpWebRequest) WebRequest.Create(url);
    // attach persistent cookies
    hwr.CookieContainer =
        PersistentCookies.GetCookieContainerForUrl(url);
    hwr.Accept = "text/xml, */*";
    hwr.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us");
    hwr.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; .NET CLR 3.5.30729;)";
    hwr.KeepAlive = true;
    hwr.AutomaticDecompression = DecompressionMethods.Deflate |
                                 DecompressionMethods.GZip;

    using (var resp = (HttpWebResponse) hwr.GetResponse())
    {
        using(Stream s = resp.GetResponseStream())
        {            
            string cs = String.IsNullOrEmpty(resp.CharacterSet) ? "UTF-8" : resp.CharacterSet;
            Encoding e = Encoding.GetEncoding(cs);

            using (StreamReader sr = new StreamReader(s, e))
            {
                var allXml = sr.ReadToEnd();

                // remove any script blocks - they confuse XmlReader
                allXml = Regex.Replace( allXml,
                                        "(.*)<script type='text/javascript'>.+?</script>(.*)",
                                        "$1$2",
                                        RegexOptions.Singleline);

                using (XmlReader xmlr = XmlReader.Create(new StringReader(allXml)))
                {
                    var items = from item in SyndicationFeed.Load(xmlr).Items
                                select item;
                }
            }
        }
    }
}
Cheeso
What a fantastic effort! I appreciate your time on this. I suspected both of those issues: cookie and header values, but I was hoping it was something simple to resolve first. Dunno why IBM wants to make their feed consumption requirements so heavy-weight. I'm processing about 150 different feed URLs, and no one else does it this way. I'll see if I can make a generic solution out of your excellent effort. Thanks very much.
John Kaster
I was able to reproduce your successful test. Thanks again. Now to see if this works for the rest of the feeds I have to consume ...
John Kaster