I'm writing a microformats parser in C# and am looking for some refactoring advice. This is probably the first "real" project I've attempted in C# for some time (I program almost exclusively in VB6 at my day job), so I have the feeling this question may become the first in a series ;-)
Let me provide some background about what I have so far, so that my question will (hopefully) make sense.
Right now, I have a single class, MicroformatsParser, doing all the work. It has an overloaded constructor that lets you pass a System.Uri or a string containing a URI: upon construction, it downloads the HTML document at the given URI and loads it into an HtmlAgilityPack.HtmlDocument for easy manipulation by the class.
The basic API works like this (or will, once I finish the code...):
    MicroformatsParser mp = new MicroformatsParser("http://microformats.org");
    List<HCard> hcards = mp.GetAll<HCard>();

    foreach (HCard hcard in hcards)
    {
        Console.WriteLine("Full Name: {0}", hcard.FullName);

        foreach (string email in hcard.EmailAddresses)
            Console.WriteLine("E-Mail Address: {0}", email);
    }
The use of generics here is intentional. I got my inspiration from the way that the Microformats library in Firefox 3 works (and the Ruby mofo gem). The idea here is that the parser does the heavy lifting (finding the actual microformat content in the HTML), and the microformat classes themselves (HCard in the above example) basically provide the schema that tells the parser how to handle the data it finds.
The code for the HCard class should make this clearer (note that this is not a complete implementation):
    [ContainerName("vcard")]
    public class HCard
    {
        [PropertyName("fn")]
        public string FullName;

        [PropertyName("email")]
        public List<string> EmailAddresses;

        [PropertyName("adr")]
        public List<Address> Addresses;

        public HCard()
        {
        }
    }
The attributes here are used by the parser to determine how to populate an instance of the class with data from an HTML document. The parser does the following when you call GetAll<T>():
- Checks that the type T has a ContainerName attribute (and it's not blank)
- Searches the HTML document for all nodes with a class attribute that matches the ContainerName. Call these the "container nodes".
- For each container node:
  - Uses reflection to create an object of type T.
  - Gets the public fields (a MemberInfo[]) for type T via reflection
  - For each field's MemberInfo
    - If the field has a PropertyName attribute
      - Gets the value of the corresponding microformat property from the HTML
      - Injects the value found in the HTML into the field (i.e. sets the value of the field on the object of type T created in the first step)
  - Adds the object of type T to a List<T>
- Returns the List<T>, which now contains a bunch of microformat objects
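The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the definitions of ContainerNameAttribute and PropertyNameAttribute are not shown in the post, so the ones here are hypothetical; each container node is represented by a plain dictionary of property name to raw text instead of an HtmlAgilityPack node; DemoCard is a made-up type; and only string fields are populated, since other field types are exactly the open question.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical attribute definitions; the post uses these names but does not show them.
[AttributeUsage(AttributeTargets.Class)]
public sealed class ContainerNameAttribute : Attribute
{
    public string Name { get; }
    public ContainerNameAttribute(string name) { Name = name; }
}

[AttributeUsage(AttributeTargets.Field)]
public sealed class PropertyNameAttribute : Attribute
{
    public string Name { get; }
    public PropertyNameAttribute(string name) { Name = name; }
}

// Made-up microformat class used only to exercise the sketch.
[ContainerName("vcard")]
public class DemoCard
{
    [PropertyName("fn")]
    public string FullName;
}

public static class ParserSketch
{
    // Each dictionary stands in for one container node: it maps a microformat
    // property name to the raw text found for it (an assumption, not real HTML).
    public static List<T> GetAll<T>(IEnumerable<IDictionary<string, string>> containerNodes)
    {
        // Step 1: the type must carry a non-blank ContainerName.
        var container = typeof(T).GetCustomAttribute<ContainerNameAttribute>();
        if (container == null || string.IsNullOrEmpty(container.Name))
            throw new InvalidOperationException(typeof(T).Name + " has no ContainerName.");

        var results = new List<T>();
        foreach (var node in containerNodes)
        {
            // Create the object via reflection and walk its public fields.
            object item = Activator.CreateInstance(typeof(T));
            foreach (FieldInfo field in typeof(T).GetFields(BindingFlags.Public | BindingFlags.Instance))
            {
                var prop = field.GetCustomAttribute<PropertyNameAttribute>();
                if (prop == null) continue;

                // Only string fields are injected here; every other field type
                // would need its own conversion.
                if (field.FieldType == typeof(string) && node.TryGetValue(prop.Name, out string raw))
                    field.SetValue(item, raw);
            }
            results.Add((T)item);
        }
        return results;
    }
}
```

The `field.FieldType == typeof(string)` check is the spot where the per-type handling would have to go.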
I'm trying to figure out a better way to implement the step in bold. The problem is that the Type of a given field in the microformat class determines not only what node to look for in the HTML, but also how to interpret the data.
For example, going back to the HCard class I defined above, the "email" property is bound to the EmailAddresses field, which is a List<string>. After the parser finds all the "email" child nodes of the parent "vcard" node in the HTML, it has to put them in a List<string>.
What's more, if I want my HCard to be able to return phone number information, I would probably want to be able to declare a new field of type List<HCard.TelephoneNumber> (which would have its own ContainerName("tel") attribute) to hold that information, because there can be multiple "tel" elements in the HTML, and the "tel" format has its own sub-properties. But now the parser needs to know how to put the telephone data into a List<HCard.TelephoneNumber>.
The same problem applies to Floats, DateTimes, List<Float>s, List<Integer>s, etc.
The obvious answer is to have the parser switch on the type of field, and do the appropriate conversions for each case, but I want to avoid a giant switch statement. Note that I'm not planning to make the parser support every possible Type in existence, but I will want it to handle most scalar types, and the List<T> versions of them, along with the ability to recognize other microformat classes (so that a microformat class can be composed from other microformat classes).
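To make the concern concrete, here is a rough sketch of what the type-by-type branching might look like. It is an illustration only: SwitchSketch and ConvertRaw are hypothetical names, the raw values are passed as a list of strings instead of HTML nodes, and the culture-invariant parsing is my assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SwitchSketch
{
    // Converts the raw strings found for one property into a value of fieldType.
    // Every supported type needs its own branch, which is why this grows quickly.
    public static object ConvertRaw(Type fieldType, IList<string> raw)
    {
        if (fieldType == typeof(string))
            return raw.Count > 0 ? raw[0] : null;

        if (fieldType == typeof(float))
            return float.Parse(raw[0], CultureInfo.InvariantCulture);

        if (fieldType == typeof(DateTime))
            return DateTime.Parse(raw[0], CultureInfo.InvariantCulture);

        if (fieldType == typeof(List<string>))
            return new List<string>(raw);

        if (fieldType == typeof(List<float>))
        {
            var list = new List<float>();
            foreach (string s in raw)
                list.Add(float.Parse(s, CultureInfo.InvariantCulture));
            return list;
        }

        // Nested microformat classes and the remaining scalar and list types
        // would each add another branch here.
        throw new NotSupportedException("No conversion for " + fieldType);
    }
}
```

Even with only five types handled, each scalar type already needs a second, near-identical branch for its List<T> form.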
Any advice on how best to handle this?
Since the parser has to handle primitive data types, I don't think I can add polymorphism at the type level...
My first thought was to use method overloading, so I would have a series of GetPropValue overloads like GetPropValue(HtmlNode node, ref string retrievedValue), GetPropValue(HtmlNode node, ref List<Float> retrievedValue), etc., but I'm wondering if there is a better approach to this problem.
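For reference, the overload idea might look something like the sketch below. The assumptions: OverloadSketch is a hypothetical class name, an IList<string> of raw values stands in for the HtmlNode parameter so the example compiles without HtmlAgilityPack, and float is used where the signatures above say Float.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class OverloadSketch
{
    // Single string: take the first raw value, if there is one.
    public static void GetPropValue(IList<string> rawValues, ref string retrievedValue)
    {
        if (rawValues.Count > 0)
            retrievedValue = rawValues[0];
    }

    // List of strings: copy every raw value.
    public static void GetPropValue(IList<string> rawValues, ref List<string> retrievedValue)
    {
        retrievedValue = new List<string>(rawValues);
    }

    // List of floats: parse every raw value.
    public static void GetPropValue(IList<string> rawValues, ref List<float> retrievedValue)
    {
        retrievedValue = new List<float>();
        foreach (string s in rawValues)
            retrievedValue.Add(float.Parse(s, CultureInfo.InvariantCulture));
    }
}
```

One thing to keep in mind with this approach: C# picks an overload at compile time from the static type of the `ref` argument. Code that only has a FieldInfo knows the field's type at runtime, so it would still need some form of dispatch on FieldType to reach the right overload.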