views:

416

answers:

3

I'm writing a microformats parser in C# and am looking for some refactoring advice. This is probably the first "real" project I've attempted in C# for some time (I program almost exclusively in VB6 at my day job), so I have the feeling this question may become the first in a series ;-)

Let me provide some background about what I have so far, so that my question will (hopefully) make sense.

Right now, I have a single class, MicroformatsParser, doing all the work. It has an overloaded constructor that lets you pass a System.Uri or a string containing a URI: upon construction, it downloads the HTML document at the given URI and loads it into an HtmlAgilityPack.HtmlDocument for easy manipulation by the class.

The basic API works like this (or will, once I finish the code...):

MicroformatsParser mp = new MicroformatsParser("http://microformats.org");
List<HCard> hcards = mp.GetAll<HCard>();

foreach(HCard hcard in hcards) 
{
    Console.WriteLine("Full Name: {0}", hcard.FullName);

    foreach(string email in hcard.EmailAddresses)
        Console.WriteLine("E-Mail Address: {0}", email);
}

The use of generics here is intentional. I got my inspiration from the way that the the Microformats library in Firefox 3 works (and the Ruby mofo gem). The idea here is that the parser does the heavy lifting (finding the actual microformat content in the HTML), and the microformat classes themselves (HCard in the above example) basically provide the schema that tells the parser how to handle the data it finds.

The code for the HCard class should make this clearer (note this is a not a complete implementation):

[ContainerName("vcard")]
public class HCard
{
    [PropertyName("fn")]
    public string FullName;

    [PropertyName("email")]
    public List<string> EmailAddresses;

    [PropertyName("adr")]
    public List<Address> Addresses;

    public HCard()
    { 
    }
}

The attributes here are used by the parser to determine how to populate an instance of the class with data from an HTML document. The parser does the following when you call GetAll<T>():

  • Checks that the type T has a ContainerName attribute (and it's not blank)
  • Searches the HTML document for all nodes with a class attribute that matches the ContainerName. Call these the "container nodes".
  • For each container node:
    • Uses reflection to create an object of type T.
    • Get the public fields (a MemberInfo[]) for type T via reflection
    • For each field's MemberInfo
      • If the field has a PropertyName attribute
        • Get the value of the corresponding microformat property from the HTML
        • Inject the value found in the HTML into the field (i.e. set the value of the field on the object of type T created in the first step)
        • Add the object of type T to a List<T>
    • Return the List<T>, which now contains a bunch of microformat objects

I'm trying to figure out a better way to implement the step in bold. The problem is that the Type of a given field in the microformat class determines not only what node to look for in the HTML, but also how to interpret the data.

For example, going back to the HCard class I defined above, the "email" property is bound to the EmailAddresses field, which is a List<string>. After the parser finds all the "email" child nodes of the parent "vcard" node in the HTML, it has to put them in a List<string>.

What's more, if I want my HCard to be able to return phone number information, I would probably want to be able to declare a new field of type List<HCard.TelephoneNumber> (which would have its own ContainerName("tel") attribute) to hold that information, because there can be multiple "tel" elements in the HTML, and the "tel" format has its own sub-properties. But now the parser needs to know how to put the telephone data into a List<HCard.TelephoneNumber>.

The same problem applies to FloatS, DateTimeS, List<Float>S, List<Integer>S, etc.

The obvious answer is to have the parser switch on the type of field, and do the appropriate conversions for each case, but I want to avoid a giant switch statement. Note that I'm not planning to make the parser support every possible Type in existence, but I will want it to handle most scalar types, and the List<T> versions of them, along with the ability to recognize other microformat classes (so that a microformat class can be composed from other microformat classes).

Any advice on how best to handle this?

Since the parser has to handle primitive data types, I don't think I can add polymorphism at the type level...

My first thought was to use method overloading, so I would have a series of a GetPropValue overloads like GetPropValue(HtmlNode node, ref string retrievedValue), GetPropValue(HtmlNode, ref List<Float> retrievedValue), etc. but I'm wondering if there is a better approach to this problem.

+4  A: 

Instead of a large switch statement you can construct a dictionary that maps the string to a delegate and look it up when you want to parse using the appropriate method.

Mehrdad Afshari
+4  A: 

Mehrdad's approach is basically the one I'd suggest to start with, but as the first step out of potentially more.

You can use a simple IDictionary<Type,Delegate> (where each entry is actually from T to Func<ParseContext,T> - but that can't be expressed with generics) for single types (strings, primitives etc) but then you'll also want to check for lists, maps etc. You won't be able to do this using the map, because you'd have to have an entry for each type of list (i.e. a separate entry for List<string>, List<int> etc). Generics make this quite tricky - if you're happy to restrict yourself to just certain concrete types such as List<T> you'll make it easier for yourself (but less flexible). For instance, detecting List<T> is straightforward:

if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(List<>))
{
    // Handle lists
    // use type.GetGenericArguments() to work out the element type
}

Detect whether a type implements IList<T> for some T (and then discovering T) can be a pain, especially as there could be multiple implementations, and the concrete type itself may or may not be generic. This effort could be worthwhile if you really need a very flexible library used by thousands of developers - but otherwise I'd keep it simple.

Jon Skeet
+1. I purposely wanted to avoid dealing with IList<T> knowing that would get complicated fast. For the purposes of this project, I will probably just enforce the notion that if you want more than one of something, it has to be a List<T>.
Mike Spross
This has me wondering though if the overall API design is flawed. Maybe I'm approaching the whole thing from the wrong angle. OTOH, I really like the idea of being able to define new microformats by just tacking attributes to a class. I think "is there a different way to do this?" is coming next :-)
Mike Spross
Accepted. It was a toss-up because all the answers here provide good information, but this one provides (a) a good suggestion for one way to implement the requirement and (b) potential pitfalls of that approach.
Mike Spross
+1  A: 

This is a very similar problem to what my serialization engine (protobuf-net) faces. I simply break it down into common sets of logic - IList<T> etc (although there is a big logic/type test to handle the various primitives). The approach I use is: only do it once... build an interface/base-class-based model that can handle properties of different types, and work from there. I do this in the static initializer of a generic cache class - i.e. Foo<T>; when T is HCard, I pre-compute a model (i.e. I create an object per-property that can parse/render that property, and store them) that lets me process HCard without any more thinking later on.

I'm not saying it is the best code on earth, but it seems to work OK.

Marc Gravell
Sounds interesting. I'll definitely take a look at this.
Mike Spross