foreach (var node in root.Find("a[href]"))
{
    var href = node.Attributes["href"].Value;
    Uri uri;
    try
    {
        uri = new Uri(item.Value.Uri, href);
    }
    catch(UriFormatException)
    {
        continue;
    }
    // *snip*
    try
    {
        if (_imageHosts.IsMatch(uri.Host)) // <--- problematic line
            priority--;
    }
    catch(UriFormatException)
    {
        MessageBox.Show(uri.OriginalString); // <--- gets displayed when I expected it wouldn't
        continue;
    }
    // *snip*
}

The message box shows up with an address like

mailto: webmaster [ @ ] somehost ?webmaster

Which is obviously malformed, but what I don't get is why it wasn't caught by the first catch block?

MSDN says it can only throw an InvalidOperationException. This is quite problematic, because it means my app can explode at any time then!

[[snip]]

+1  A: 

If you dig deep into the Uri.Host property (real deep), it can eventually call a static function GetException which returns UriFormatException objects for different conditions of invalid URIs. Print out the full UriFormatException you are getting and compare it to the ones generated by Uri.GetException. You might get more details out of it.

Matthew Ferreira
+4  A: 

First of all, it is not a good idea to use exceptions to check validity when you can use the Uri.TryCreate method instead. You can rewrite your code so that it does not depend on which exceptions can be thrown and caught.

So it is better to change your

Uri uri;
try
{
    uri = new Uri(item.Value.Uri, href);
}
catch(UriFormatException)
{
    continue;
}

to

Uri uri;
if (!Uri.TryCreate(item.Value.Uri, href, out uri)) continue;

But this is still not a complete check.

As for your question, the answer is relatively simple. You are wrong to assume that this is malformed:

mailto: webmaster [ @ ] somehost ?webmaster

URI stands for Uniform Resource Identifier, so its basic syntax

{scheme name} : {hierarchical part} [ ? {query} ] [ # {fragment} ]

is obviously valid for your input. You end up with a resource URI that uses the "mailto:" scheme.

When you access the Host property you assume the resource is an HTTP one, but the "mailto" scheme parser used by default cannot extract a host component from the original string, and hence an exception is raised.

So to write your check correctly you have to modify your code a bit:

Uri uri;
if (!Uri.TryCreate(item.Value.Uri, href, out uri)) continue;

if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;

Read up on UriParser for more details.


Here is an update based on @Mark's comments.

I'm pretty sure it threw an exception when I tried to get the AbsoluteUri property too..why should that fail?

You won't get past the Scheme check, since the scheme will be "mailto". Here is a quick test:

        var baseUri = new Uri("http://localhost");
        const string href = "mailto: webmaster [ @ ] somehost ?webmaster";

        Uri uri;
        if (!Uri.TryCreate(baseUri, href, out uri))
        {
            Console.WriteLine("Can't create");
            return;
        }

        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            Console.WriteLine("Wrong scheme");
            return;
        }

        Console.WriteLine("Testing uri: {0}", uri);

It ends with "Wrong scheme". Maybe I don't understand you correctly?

When you change href to:

        const string href = "http: webmaster [ @ ] somehost ?webmaster";

It passes the check, and the URI is automatically escaped to:

http://localhost/%20webmaster%20%5B%20@%20%5D%20somehost%20?webmaster

All of the URI's components will also be available to you.

The main problem I tried to explain in the first part is the following:

It seems to me that you incorrectly treat any Uniform Resource Identifier as an http(s)-based URL, but this is wrong. mailto:[email protected] or gopher://gopher.hprc.utoronto.ca/ or myreshandler://something@somewhere are also valid URIs which can be successfully parsed. Take a look at the official IANA-registered schemes.

So the Uri constructor's behaviour is expected and correct.

It tries to validate the incoming URI against the known schemes:

  • UriSchemeFile - Specifies that the URI is a pointer to a file.
  • UriSchemeFtp - Specifies that the URI is accessed through the File Transfer Protocol (FTP).
  • UriSchemeGopher - Specifies that the URI is accessed through the Gopher protocol.
  • UriSchemeHttp - Specifies that the URI is accessed through the Hypertext Transfer Protocol (HTTP).
  • UriSchemeHttps - Specifies that the URI is accessed through the Secure Hypertext Transfer Protocol (HTTPS).
  • UriSchemeMailto - Specifies that the URI is an email address and is accessed through the Simple Mail Transfer Protocol (SMTP).
  • UriSchemeNews - Specifies that the URI is an Internet news group and is accessed through the Network News Transport Protocol (NNTP).
  • UriSchemeNntp - Specifies that the URI is an Internet news group and is accessed through the Network News Transport Protocol (NNTP).

A basic URI parser is used when the scheme is not known (see the generic syntax for URI schemes).
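To make that concrete, here is a small sketch (not code from the question; the SchemeDemo class and SchemeOf helper are names made up for this example) that asks Uri.TryCreate for the scheme of a few absolute URIs, including one with a custom scheme that has no dedicated parser:

```csharp
using System;

static class SchemeDemo
{
    // Hypothetical helper: returns the scheme of an absolute URI string,
    // or null when the string cannot be parsed as an absolute URI.
    public static string SchemeOf(string text)
    {
        Uri uri;
        return Uri.TryCreate(text, UriKind.Absolute, out uri) ? uri.Scheme : null;
    }

    static void Main()
    {
        string[] samples =
        {
            "http://localhost/",
            "gopher://gopher.hprc.utoronto.ca/",
            "myreshandler://something@somewhere", // custom, unregistered scheme
        };

        foreach (var sample in samples)
        {
            // Print whatever scheme the parser reports for each sample.
            Console.WriteLine("{0} -> {1}", sample, SchemeOf(sample) ?? "(not parsed)");
        }
    }
}
```

The point is only that parsing succeeds for schemes other than http(s), so a successful parse says nothing about whether the link is one you can download.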


Basically, Uri.TryCreate() plus the scheme check is enough to get links which can be passed to .NET's HttpWebRequest, for example. You don't really need to check whether they are well-formed or not. If the links are bad (not well-formed, or they don't exist), you just get the corresponding HTTP error when you try to request them.

As for your example:

http://www.google.com/search?q=cheesy poof

it passes my check and becomes:

http://www.google.com/search?q=cheesy%20poof

You don't need to check whether it is well-formed or not. Just do the basic checks and try the request. Hope it helps.
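Put together, the two basic checks can be sketched as a small filter over the href values found on a page. This is only an illustration under assumptions: the LinkFilter class, the KeepHttpLinks method and the sample hrefs are invented for the example and are not taken from the question's code.

```csharp
using System;
using System.Collections.Generic;

static class LinkFilter
{
    // Hypothetical helper: resolves each href against baseUri and keeps
    // only the results that parse and use the http or https scheme.
    public static List<Uri> KeepHttpLinks(Uri baseUri, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();
        foreach (var href in hrefs)
        {
            Uri uri;
            // Check 1: the href must combine with the base into a URI at all.
            if (!Uri.TryCreate(baseUri, href, out uri)) continue;

            // Check 2: skip mailto:, ftp:, custom schemes and so on.
            if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;

            result.Add(uri);
        }
        return result;
    }

    static void Main()
    {
        var baseUri = new Uri("http://example.com/dir/");
        var hrefs = new[]
        {
            "page.html",
            "mailto:someone@example.com",
            "https://example.org/x",
            "/img/a.png",
        };

        foreach (var uri in KeepHttpLinks(baseUri, hrefs))
            Console.WriteLine(uri.AbsoluteUri);
    }
}
```

Anything that survives the filter is then simply requested, and a bad link shows up as an ordinary request error.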


Also, the string mailto: webmaster [ @ ] somehost ?webmaster is malformed. I literally mean, that string, with the stupid []s and everything in it

This string is malformed in the sense that it is not well-formed (it contains excluded characters according to RFC 2396), but it can still be considered valid because it conforms to the generic syntax of a URI scheme (check also how it is escaped when created with http:).

Nick Martyshchenko
I'm pretty sure it threw an exception when I tried to get the `AbsoluteUri` property too..why should that fail?
Mark
Also, the string `mailto: webmaster [ @ ] somehost ?webmaster` *is* malformed. I literally mean, that string, with the stupid []s and everything in it.
Mark
I figured mailtos would have an AbsoluteUri... I guess not. I'll just check the scheme like you suggested then! The Uris are being passed into `WebClient.DownloadData`, I basically want to allow anything that that can handle... "By default, the .NET Framework supports URIs that begin with http:, https:, ftp:, and file: scheme identifiers." -- guess I should include those too then.
Mark
It usually works for me pretty well. Hope for you too :)
Nick Martyshchenko
+1  A: 

Based on Nick's answer:

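// Schemes that WebClient handles by default. Contains() below needs "using System.Linq;".
private static readonly string[] SupportedSchemes = { Uri.UriSchemeHttp, Uri.UriSchemeHttps, Uri.UriSchemeFtp, Uri.UriSchemeFile };

// Parses an absolute URI and accepts it only if its scheme is supported.
private static bool TryCreateUri(string uriString, out Uri result)
{
    return Uri.TryCreate(uriString, UriKind.Absolute, out result) && SupportedSchemes.Contains(result.Scheme);
}

// Resolves a relative address against a base and accepts it only if its scheme is supported.
private static bool TryCreateUri(Uri baseAddress, string relativeAddress, out Uri result)
{
    return Uri.TryCreate(baseAddress, relativeAddress, out result) && SupportedSchemes.Contains(result.Scheme);
}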
Mark