The following snippet has worked for validating user-entered HTML snippets for ages; in the past day it has started rejecting everything.

public override bool IsValid(object value)
{
    var isValid = true;
    try
    {
        var doc = new XmlDocument();
        doc.LoadXml(string.Format(@"
            <!DOCTYPE html [<!ENTITY % xhtml-lat1 SYSTEM ""http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent"">
            <!ENTITY % xhtml-special SYSTEM ""http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent"">
            <!ENTITY % xhtml-symbol SYSTEM ""http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent"">
            %xhtml-lat1;
            %xhtml-special;
            %xhtml-symbol;]>
            <html>{0}</html>", value));
    }
    catch (XmlException)
    {
        isValid = false;
    }
    return isValid;
}
+1  A: 

It may not be *the* flaw, but one flaw I see immediately is that you depend on successfully connecting to the W3C's site and downloading the entity files. If that fails, you get an XmlException that you then treat as a validation failure, without ever examining it.

It's also wasteful of your resources, and a bit rude to the W3C, to add to the 130 million needless requests a day they complained about nearly three years ago. If anything, despite that complaint, the number of requests for DTDs, entities, XML Schemata, and even dereferenced namespace names has probably increased since then.

Use a local copy of the entities; it's quite clearly allowed in the MIT license they are released under.

Also, try to be more explicit in examining the exception raised here.
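As a sketch of the local-copy approach: you can plug a custom resolver into the XmlDocument so the three W3C entity URIs are served from disk instead of the network. The directory name and class name here are hypothetical; you would download the three `.ent` files once and ship them with your application.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical resolver: serves local copies of the XHTML entity files
// instead of fetching them from w3.org on every validation.
class LocalEntityResolver : XmlUrlResolver
{
    // Assumed directory holding xhtml-lat1.ent, xhtml-special.ent and
    // xhtml-symbol.ent, downloaded once from the W3C site.
    private readonly string _entityDir;

    public LocalEntityResolver(string entityDir)
    {
        _entityDir = entityDir;
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (absoluteUri.Host == "www.w3.org")
        {
            // Map http://www.w3.org/TR/xhtml1/DTD/<name>.ent to a local file.
            var fileName = Path.GetFileName(absoluteUri.AbsolutePath);
            return File.OpenRead(Path.Combine(_entityDir, fileName));
        }
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}
```

Wire it up before parsing, e.g. `doc.XmlResolver = new LocalEntityResolver(@"C:\dtd");` ahead of the `doc.LoadXml(...)` call, and the W3C servers are never contacted.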

Jon Hanna
+3  A: 

It is bad manners to download the DTD from w3.org every time you need to validate a document. Their servers are under crushingly heavy load and it is very expensive for them to pay for all the bandwidth, servers, and IT workers to manage it all. It has always been bad form to download the DTD excessively (per operation), and until recently W3 has been relying on the politeness of software developers and vendors to write their programs in such a way as to not download the DTD per-operation.

However, this reliance on good manners is no longer working. Recently W3 has been taking matters into their own hands by blocking DTD downloads based on User Agent matching rules, as well as other blocking rules like IP-based blocking for particularly bad offenders. Very recently I believe they started blocking DTD downloads with very broad User Agent string matching: Internet Explorer user agents, Java user agents, and .NET user agents, to name a few.

You should download the DTD just once, and have your validator reference the DTD from local disk, or at least host the DTD using your own server and bandwidth. All parsers worth a darn have features to help re-map a "DTD namespace" to a "physical DTD location."

"Many XML utilities have the ability to use an XML catalog to map URIs for external resources to a locally-cached copy of the files. For information on configuring XML applications to use a catalog, see Norman Walsh's Caching in with Resolvers article or Catalog support in libxml."
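In .NET specifically, one way to do this re-mapping (assuming you are on .NET 4 or later) is the XmlPreloadedResolver, which ships with built-in copies of the XHTML 1.0 DTD and its entity sets, so nothing is ever fetched from w3.org:

```csharp
using System.Xml;
using System.Xml.Resolvers; // .NET 4.0+

var doc = new XmlDocument();
// Resolve the XHTML 1.0 DTD and entity files from the preloaded,
// in-memory copies rather than over the network.
doc.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);
doc.LoadXml(xhtmlString);
```

On earlier framework versions you would need a custom XmlResolver or an XML catalog instead, but the idea is the same: keep the DTD lookup local.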

Also note others have recently started encountering problems with w3.org, DTDs, .NET and IE.

Mike Clark
that first link is a good read
mrnye