Disclaimer: This is a personal project that I'm doing for fun. I'm not looking to use existing libraries since it would take some of the joy out of learning more about wheels.

That being said, I'm working on a web spider and I've come to the problem of how to represent HTML form elements with a single object.

What I want to do is have an "HTML Document" object, which contains an array of all form elements as one of its properties. The problem is that I can't figure out a way to represent <input /> tags, as well as <select /> tags, since select tags can have multiple child <option /> tags.

Is there any good way to represent both <input /> tags, which store basically only name/value pairs, and <select /> tags which have an array of name/value pairs in the same class?

The best idea I've come up with so far is to treat the <option /> tags of a <select /> tag as individual form fields, similar to how I would represent <input type="radio" /> or <input type="checkbox" />.

So I would have this:

class FormField {
    public string Name { get; set; }
    public string Value { get; set; }
    public string Type { get; set; }
}

And then a collection class for iterating over the fields, which would work like this:

  • The collection class would be an "array of arrays". The outer array would have a single inner array for each name in the HTML document.
  • Its indexer could get fields by Name. The indexer would return an array of FormField objects.
  • When enumerating over the entire document's form fields, each iteration would yield an array of FormField objects, since it would be an array of arrays.
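To make the idea concrete, here is a rough sketch of what such a collection might look like. The name FormFieldCollection and its Add method are hypothetical, it uses lists rather than raw arrays, and it assumes every field has a non-null Name.

```csharp
using System.Collections;
using System.Collections.Generic;

// Same shape as the FormField class above.
public class FormField
{
    public string Name { get; set; }
    public string Value { get; set; }
    public string Type { get; set; }
}

// Hypothetical collection: one inner list per distinct field name,
// kept in the order in which each name was first added.
public class FormFieldCollection : IEnumerable<IReadOnlyList<FormField>>
{
    private readonly List<List<FormField>> groups = new List<List<FormField>>();
    private readonly Dictionary<string, List<FormField>> byName =
        new Dictionary<string, List<FormField>>();

    // Assumes field.Name is not null (Dictionary rejects null keys).
    public void Add(FormField field)
    {
        List<FormField> group;
        if (!byName.TryGetValue(field.Name, out group))
        {
            group = new List<FormField>();
            byName.Add(field.Name, group);
            groups.Add(group);
        }
        group.Add(field);
    }

    // Indexer: every field sharing the given name, or an empty list if none.
    public IReadOnlyList<FormField> this[string name]
    {
        get
        {
            List<FormField> group;
            return byName.TryGetValue(name, out group) ? group : new List<FormField>();
        }
    }

    // Enumerating yields one group of same-named fields per iteration.
    public IEnumerator<IReadOnlyList<FormField>> GetEnumerator()
    {
        foreach (var group in groups)
        {
            yield return group;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

With this shape, a <select name="color"> with two options would show up as a single group of two FormField objects, and fields["color"] would return both.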

Is this the best solution, or is there a simpler way to represent this?

+1  A: 

I would view the entire document as a structure of parent-child links starting with the body tag. The body tag is the root, and any top-level div, p, form, etc. tags go into its Children collection. When you get to form elements like select, you can fill in the options of the select as more htmlObjects:

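using System.Collections.Generic;

class htmlObject {
    public string Name { get; set; }
    public string Value { get; set; }
    public string Type { get; set; }
    // Child elements, e.g. the <option> tags of a <select>.
    // Initialized here (C# 6+) so the list is never null.
    public List<htmlObject> Children { get; } = new List<htmlObject>();
}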

From your example, what you are missing is the Children property to represent the nested elements.

When you are ready to define the elements in greater detail, class htmlObject becomes interface IhtmlObject and you create a specialized class for each tag. Each specialized class can then implement whatever functionality you need to handle that tag's special conditions.
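As a rough illustration of that step, the interface and two specialized classes might look something like this. OptionElement, SelectElement, Selected and SelectedValues are hypothetical names, and the get-only initializers assume C# 6 or later.

```csharp
using System.Collections.Generic;
using System.Linq;

public interface IhtmlObject
{
    string Name { get; }
    string Type { get; }
    List<IhtmlObject> Children { get; }
}

// Hypothetical specialization for <option>: it has a value and a
// selected flag, but no name of its own.
public class OptionElement : IhtmlObject
{
    public string Name { get { return null; } }
    public string Type { get { return "option"; } }
    public string Value { get; set; }
    public bool Selected { get; set; }
    public List<IhtmlObject> Children { get; } = new List<IhtmlObject>();
}

// Hypothetical specialization for <select>: its options live in Children.
public class SelectElement : IhtmlObject
{
    public string Name { get; set; }
    public string Type { get { return "select"; } }
    public List<IhtmlObject> Children { get; } = new List<IhtmlObject>();

    // Values of the child options that are explicitly marked as selected.
    public IEnumerable<string> SelectedValues
    {
        get
        {
            return Children.OfType<OptionElement>()
                           .Where(o => o.Selected)
                           .Select(o => o.Value);
        }
    }
}
```

The tag-specific behavior (here, working out which option values are selected) sits in the specialized class, while generic tree-walking code only needs IhtmlObject.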

JDMX
Dan Herbert
That's fine... then build the structure with the form tag as the base element instead of the body tag. Everything else applies from that point on.
JDMX
+2  A: 

Are you trying to represent the DOM structure, or the values as they would appear in an HTTP POST?

Two inputs with the same name will cause that name to be posted twice, which reads much like several checked checkboxes sharing one name. And have you checked what happens if you have both a textbox and a checkbox with the same name? Can you tell them apart in the resulting HTTP POST? And what type should you use in that case?

Have you tried a simple NameValueCollection, which would allow you to store several string values per key?
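A minimal sketch of that idea, using System.Collections.Specialized.NameValueCollection. The PostedValuesSketch class and the field names are made up for illustration.

```csharp
using System.Collections.Specialized;

public static class PostedValuesSketch
{
    // Builds a collection the way a posted form might look:
    // the "color" key appears twice, "q" once.
    public static NameValueCollection Build()
    {
        var form = new NameValueCollection();
        form.Add("color", "red");   // first value for "color"
        form.Add("color", "blue");  // Add appends rather than overwrites
        form.Add("q", "spiders");
        return form;
    }
}
```

Here GetValues("color") returns both values as a string array, while the string indexer form["color"] joins them with commas. Note that the default constructor compares keys case-insensitively; if that matters for your field names, there is a constructor overload that takes an IEqualityComparer such as StringComparer.Ordinal.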

Simon Svensson
Dan Herbert
Web spiders usually don't follow forms or HTTP POSTs, or attempt to fake them.
Simon Svensson