First of all, it is not a good idea to use exceptions to check validity, because you can use the Uri.TryCreate method instead. That way your code does not depend on which exception may be thrown and caught.
So it is better to change your
Uri uri;
try
{
uri = new Uri(item.Value.Uri, href);
}
catch(UriFormatException)
{
continue;
}
to
Uri uri;
if (!Uri.TryCreate(item.Value.Uri, href, out uri)) continue;
But this is still not a complete check.
As for your question, the answer is relatively simple. You are wrong to assume that this is malformed:
mailto: webmaster [ @ ] somehost ?webmaster
URI stands for Uniform Resource Identifier, so its basic syntax
{scheme name} : {hierarchical part} [ ? {query} ] [ # {fragment} ]
is obviously valid for your input. You end up with a resource URI that has the "mailto:" scheme.
When you access the Host property you assume the resource is HTTP, but the "mailto" scheme parser used by default can't parse a host component out of the original string, and hence an exception is raised.
So to write your check correctly, you have to modify your code a bit:
Uri uri;
if (!Uri.TryCreate(item.Value.Uri, href, out uri)) continue;
if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;
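If you need this check in more than one place, the two conditions above can be folded into a single helper. This is only a minimal sketch: the names LinkFilter and TryGetHttpUri are made up for illustration and are not part of the .NET API.

```csharp
using System;

public static class LinkFilter
{
    // Hypothetical helper (not a framework method): returns true only when
    // href resolves against baseUri to an absolute http or https URI.
    public static bool TryGetHttpUri(Uri baseUri, string href, out Uri result)
    {
        if (Uri.TryCreate(baseUri, href, out result)
            && (result.Scheme == Uri.UriSchemeHttp || result.Scheme == Uri.UriSchemeHttps))
        {
            return true;
        }

        // Either the string could not be parsed or the scheme is not http(s).
        result = null;
        return false;
    }
}

public static class Program
{
    public static void Main()
    {
        var baseUri = new Uri("http://localhost");
        Uri uri;

        // A relative link resolves against the base and keeps the http scheme.
        Console.WriteLine(LinkFilter.TryGetHttpUri(baseUri, "/about", out uri));

        // A mailto link parses as a URI but is rejected by the scheme check.
        Console.WriteLine(LinkFilter.TryGetHttpUri(baseUri, "mailto:someone@example.com", out uri));
    }
}
```

In the loop from the question this would become a single `if (!LinkFilter.TryGetHttpUri(item.Value.Uri, href, out uri)) continue;`.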
Read some info about UriParser
Here is an update based on @Mark's comments.
I'm pretty sure it threw an exception when I tried to get the AbsoluteUri property too..why should that fail?
It can't pass the Scheme check, since the scheme will be "mailto". Here is a quick test:
var baseUri = new Uri("http://localhost");
const string href = "mailto: webmaster [ @ ] somehost ?webmaster";
Uri uri;
if (!Uri.TryCreate(baseUri, href, out uri))
{
Console.WriteLine("Can't create");
return;
}
if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
{
Console.WriteLine("Wrong scheme");
return;
}
Console.WriteLine("Testing uri: {0}", uri);
It ends with "Wrong scheme". Maybe I don't understand you correctly?
When you change href to:
const string href = "http: webmaster [ @ ] somehost ?webmaster";
It passes the check, and the URI is automatically escaped to:
http://localhost/%20webmaster%20%5B%20@%20%5D%20somehost%20?webmaster
and all of the URI's components will be available to you.
The main problem, which I tried to explain in the first part, is the following:
It seems to me that you incorrectly treat any Uniform Resource Identifier as an http(s)-based URL, but this is wrong. mailto:[email protected]
or gopher://gopher.hprc.utoronto.ca/
or myreshandler://something@somewhere
are also valid URIs which can be successfully parsed. Take a look at the official IANA-registered schemes.
So
the Uri constructor's behaviour is expected and correct.
It tries to validate the incoming URI against the known schemes:
UriSchemeFile
- Specifies that the URI is a pointer to a file.
UriSchemeFtp
- Specifies that the URI is accessed through the File Transfer Protocol (FTP).
UriSchemeGopher
- Specifies that the URI is accessed through the Gopher protocol.
UriSchemeHttp
- Specifies that the URI is accessed through the Hypertext Transfer Protocol (HTTP).
UriSchemeHttps
- Specifies that the URI is accessed through the Secure Hypertext Transfer Protocol (HTTPS).
UriSchemeMailto
- Specifies that the URI is an email address and is accessed through the Simple Mail Transfer Protocol (SMTP).
UriSchemeNews
- Specifies that the URI is an Internet news group and is accessed through the Network News Transport Protocol (NNTP).
UriSchemeNntp
- Specifies that the URI is an Internet news group and is accessed through the Network News Transport Protocol (NNTP).
The basic URI parser is used when the scheme is not known (see URI scheme generic syntax).
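To see which scheme the Uri class actually recognised for a given string, you can simply print its Scheme property. The following is a rough sketch, not a definitive test: SchemeProbe and GetScheme are hypothetical names introduced here, and the sample strings are the ones mentioned earlier in this answer.

```csharp
using System;

public static class SchemeProbe
{
    // Hypothetical helper: returns the scheme of an absolute URI string,
    // or null when the string cannot be parsed as an absolute URI.
    public static string GetScheme(string text)
    {
        Uri uri;
        return Uri.TryCreate(text, UriKind.Absolute, out uri) ? uri.Scheme : null;
    }
}

public static class Program
{
    public static void Main()
    {
        // The UriScheme* fields are plain string constants such as "mailto".
        Console.WriteLine(Uri.UriSchemeMailto);

        // A scheme with a built-in parser.
        Console.WriteLine(SchemeProbe.GetScheme("gopher://gopher.hprc.utoronto.ca/"));

        // A custom scheme, which falls back to the generic parser.
        Console.WriteLine(SchemeProbe.GetScheme("myreshandler://something@somewhere"));
    }
}
```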
Basically, Uri.TryCreate()
and the scheme checks are enough to get links which can be passed to .NET HttpWebRequest, for example. You don't really need to check whether they are well-formed or not. If links are bad (not well-formed or nonexistent), you just get the corresponding HTTP error when you try to request them.
As for your example:
http://www.google.com/search?q=cheesy poof
it passes my check and becomes:
http://www.google.com/search?q=cheesy%20poof
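If you want to see the escaping for yourself, a small sketch like the one below prints the normalised form through the AbsoluteUri property. UriEscapeDemo and Normalize are names invented for this illustration only.

```csharp
using System;

public static class UriEscapeDemo
{
    // Hypothetical helper: parses an absolute URI string and returns its
    // escaped AbsoluteUri form, or null when parsing fails.
    public static string Normalize(string text)
    {
        Uri uri;
        return Uri.TryCreate(text, UriKind.Absolute, out uri) ? uri.AbsoluteUri : null;
    }
}

public static class Program
{
    public static void Main()
    {
        // The space in the query is percent-encoded in the AbsoluteUri form.
        Console.WriteLine(UriEscapeDemo.Normalize("http://www.google.com/search?q=cheesy poof"));
    }
}
```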
You don't need to check whether it is well-formed or not. Just do the basic checks and try the request. Hope it helps.
Also, the string mailto: webmaster [ @ ] somehost ?webmaster is malformed. I literally mean, that string, with the stupid []s and everything in it
This string is malformed in the sense that it is not well-formed (since it contains excluded characters according to RFC 2396), but it can still be considered valid because it conforms to the generic syntax of the URI scheme (check also how it is escaped when created with http:).