ansaurus

Question

How much work should the constructor for an HTML parsing class do?

Answer 1

+5 A:

A constructor should do whatever is necessary to put that instance into a runnable, valid, ready-to-use state. If that means some validation or analysis, I'd say it belongs there. Just be careful about how much the constructor does.

There might be other places in your design where validation fits as well.

If the input values are coming from a UI, I'd say that it should have a hand in ensuring valid input.

If the input values are being unmarshalled from an incoming XML stream, I'd think about using schemas to validate it.

duffymo 2009-07-26 02:00:30

The HTML being parsed is not particularly well formed... The parser actually uses simple regex to find values within the HTML. I figured parsing within the constructor would allow me to throw an error early if the parsed data is invalid.

Ben S 2009-07-26 04:55:45

I think this sounds wrong headed to me. I'd say that the constructor should advertise exactly what it needs to receive. Let another class deal with the details of acquiring HTML, parsing the stream, validating the input, and giving the advertised values to the constructor. A constructor is responsible for creating a valid object; you're making it parse HTML, too. What happens if I pass in any old String that's not HTML? The constructor's doing too much in this case. Keep the contract narrow.

duffymo 2009-07-26 14:33:34

I agree. Unless you seal/finalize the class, you've forced all inheriting classes to accept the implementation that was included in the constructor. I think it might be better to have each process of the extraction in a separate method. If the regular expression detects an 'img' element, delegate the parsing to an img processing method (or class [by interface]). This way you all inheriting classes to override implementation.

Richard Clayton 2009-07-26 21:29:38

Answer 2

+2 A:

The constructor should create a valid object. If in your case that requires reading and parsing information, than so be it.

If the object can be used for other purposes without parsing the information first, than consider making two constructors, or a separate method.

SP 2009-07-26 02:00:40

If this was the case, I would consider a separate class. The parser class is useless without parsing. The purpose of it is to make the parsed information available.

Ben S 2009-07-26 02:21:18

Answer 3

+1 A:

A constructor should set up the object to be used.

So whatever that is. That may include taking action on some data or just setting fields. It will change from each class.

In the case you are speaking of an Html Parser, I would opt for creating the class, and then calling a Parse Html method. The reason for this is it gives you a furture opportunity to set items in the class for parsing the Html.

David Basarab 2009-07-26 02:01:28

Your comments seem contradictory. Having the constructor parse the HTML is the only way to have the object really be useful, but you suggests removing the parsing logic into a separate method.

Ben S 2009-07-26 02:16:39

Answer 4

+32 A:

I normally follow one easy principle:

Everything that is mandatory for the correct existence and behavior of the class instance should be passed and done into the constructor.

Every other activity is done by other methods.

The constructor should never:

use other methods of the class with the purpose of using overriding behavior
act on its private attributes via methods

Because I learned the hard way that while you are in the constructor, the object is in a incoherent, intermediate state which is too dangerous to handle. Some of this unexpected behavior could be expected from your code, some could be from the language architecture and compiler decisions. Never guess, stay safe, be minimal.

In your case, I would use a Parser::parseHtml(file) method. The instantiation of the parser and the parsing are two different operations. When you instance a parser, the constructor puts it in the condition to perform its job (parsing). Then you use its method to perform the parsing. You then have two choices:

Either you allow the parser to contain the results of the parsing, and give the clients an interface to retrieve the parsed information (e.g. Parser::getFooValue()). The methods will return Null if you haven't performed parsing yet, or if the parsing failed.
or your Parser::parseHtml() returns a ParsingResult instance, containing what the Parser found.

The second strategy grants you better granularity, as the Parser is now stateless, and the client needs to interact with the methods of the ParsingResult interface. The Parser interface remains sleek and simple. The internals of the Parser class will tend to follow the Builder pattern.

You comment: "I feel as though returning an instance of a parser that hasn't parsed anything (as you suggest), a constructor that's lost its purpose. There's no use in initializing a parser without the intent of actually parsing the information. So if parsing is going to happen for sure, should we parse as early as possible and report and errors early, such as during the construction of the parser? I feel as though initializing a parser with invalid data should result in an error being thrown."

Not really. If you return an instance of a Parser, of course it's going to parse. In Qt, when you instantiate a button, of course it's going to be shown. However, you have the method QWidget::show() to manually call before something is visible to the user.

Any object in OOP has two concerns: initialization, and operation (ignore finalization, it's not on discussion right now). If you keep these two operations together, you both risk trouble (having an incomplete object operating) and you lose flexibility. There are plenty of reasons why you would perform intermediate setup of your object before calling parseHtml(). Example: suppose you want to configure your Parser to be strict (so to fail if a given column in a table contains a string instead of an integer) or permissive. Or to register a listener object which is warned every time a new parsing is performed or ended (think GUI progress bar). These are optional information, and if your architecture puts the constructor as the übermethod that does everything, you end up having a huge list of optional method parameters and conditions to handle into a method which is inherently a minefield.

"Caching should not be the responsibility of a parser. If data is to be cached, a separate cache class should be created to provide that functionality."

On the opposite. If you know that you are going to use the parsing functionality on a lot of files, and there's a significant chance that the files are going to be accessed and parsed again later on, it is internal responsability of the Parser to perform smart caching of what it already saw. From the client perspective, it is totally oblivious if this caching is performed or not. He is still callling the parsing, and still obtaining a result object. but it is getting the answer much faster. I think there's no better demonstration of separation of concerns than this. You boost performance with absolutely no change in the contract interface or the whole software architecture.

However, note that I am not advocating that you should never use a constructor call to perform parsing. I am just claiming that it's potentially dangerous and you lose flexibility. There are plenty of examples out there where the constructor is at the center of the actual activity of the object, but there is also plenty of examples of the opposite. Example (although biased, it arises from C style): in python, I would consider very weird something like this

f = file()
f.setReadOnly()
f.open(filename)

instead of the actual

f = file(filename,"r")

But I am sure there are IO access libraries using the first approach (with the second as a sugar-syntax approach).

Edit: finally, remember that while it's easy and compatible to add in the future a constructor "shortcut", it is not possible to remove this functionality if you find it dangerous or problematic. Additions to the interface are much easier than removals, for obvious reasons. Sugary behavior must be weighted against future support you have to provide to that behavior.

Stefano Borini 2009-07-26 02:01:42

So, in my case, would you say it's correct to parse in the constructor? The only public members to the class are the constructor and the accessors. The design is setup to use helper methods for sub parsing, but only the constructor ever calls them.

Ben S 2009-07-26 02:13:28

The first strikes me as inelegant with the various null values. The second solution seems much cleaner to me, however, I fail to see how that's different than having a constructor return an object with accessors to the parsed values. The only difference is that the constructor is available statically, and your solution requires instantiating a stateless object. Isn't it better to have static calls for stateless methods? Constructors that parse are essentially static methods that return an parsed object.

Ben S 2009-07-26 02:31:15

well, elegant, inelegant. Depends. A parser who hasn't parsed contains no data. This could be an important message to the client, which could differentiate between parsers who have parsed and parser who hadn't.

Stefano Borini 2009-07-26 02:38:44

A constructor is more than a static method. A static method is a method that performs an operation and is inside the namespace of a class (effectively, it's using the class as a namespace), without having any instance association to it. A constructor is much more. It returns you an instance, something that a static method does not (unless is a factory method, but that's another story). If there's no meaning to return an instance, then the constructor looses its purpose.

Stefano Borini 2009-07-26 02:43:22

In the end, you have to find the tradeoff between where to put the parsing logic and where to put the accessor interface and state. Although you would find useless to have a stateless parser, it could still be advantageous to have an internal state for performance (e.g. caching of ParseResults objects so that they are returned unchanged if the md5 of the file is the same)

Stefano Borini 2009-07-26 02:47:02

I feel as though returning an instance of a parser that hasn't parsed anything (as you suggest), a constructor that's lost its purpose. There's no use in initializing a parser without the intent of actually parsing the information. So if parsing is going to happen for sure, should we parse as early as possible and report and errors early, such as during the construction of the parser? I feel as though initializing a parser with invalid data should result in an error being thrown.

Ben S 2009-07-26 02:52:51

Caching should not be the responsibility of a parser. If data is to be cached, a separate cache class should be created to provide that functionality.

Ben S 2009-07-26 02:53:47

I answer you in the actual answer, not in the comments. I need more space

Stefano Borini 2009-07-26 12:36:04

Answer 5

+5 A:

I'd probably just pass enough to initialize the object and then have a 'parse' method. The idea is that expensive operations should be as obvious as possible.

Rodrick Chapman 2009-07-26 02:01:47

This was the main thought I had for placing the parse logic in a separate method.

Ben S 2009-07-26 02:09:42

Answer 6

+17 A:

"Should the parsing code be placed within a void parseHtml() method and the accessors only return valid values once this method is called?"

Yes.

"The design of the class is such that the class' constructor does the parsing"

This prevents customization, extension, and -- most importantly -- dependency injection.

There will be times when you want to do the following

Construct a parser.
Add Features to the parser: Business Rules, Filters, Better Algorithms, Strategies, Commands, whatever.
Parse.

Generally, it's best to do as little as possible in a constructor so that you are free to extend or modify.

Edit

"Couldn't extensions simply parse the extra information in their constructors?"

Only if they don't have any kind of features that need to be injected. If you want to add features -- say a different strategy for constructing the parse tree -- your subclasses have to also manage this feature addition before they parse. It may not amount to a simple super() because the superclass does too much.

"Also, parsing in the constructor allows me to fail early"

Kind of. Failing during construction is a weird use case. Failing during construction makes it difficult to construct a parser like this...

class SomeClient {
    parser p = new Parser();
    void aMethod() {...}
}

Usually a construction failure means you're out of memory. There's rarely a good reason to catch construction exceptions because you're doomed anyway.

You're forced to build the parser in a method body because it has too complex arguments.

In short, you've removed options from the clients of your parser.

"It's inadvisable to inherit from this class to replace an algorithm."

That's funny. Seriously. It's an outrageous claim. No algorithm is optimal for all possible use cases. Often a high-performance algorithm uses a lot of memory. A client may want to replace the algorithm with a slower one that uses less memory.

You can claim perfection, but it's rare. Subclasses are the norm, not an exception. Someone will always improve on your "perfection". If you limit their ability to subclass your parser, they'll simply discard it for something more flexible.

"I don't see needing step 2 as described in the answer."

A bold statement. Dependencies, Strategies and related injection design patterns are common requirements. Indeed, they're so essential for unit testing that a design which makes it difficult or complex often turns out to be a bad design.

Limiting the ability to subclass or extend your parser is a bad policy.

Bottom Line.

Assume nothing. Write a class with as few assumptions about it's use cases as possible. Parsing at construction time makes too many assumptions about client use cases.

S.Lott 2009-07-26 02:02:56

Dependency injection is still possible through the constructor

Samuel Carrijo 2009-07-26 02:04:14

@samuelcarrijo: That generally makes the constructor horribly complex.

S.Lott 2009-07-26 02:08:29

Couldn't extensions simply parse the extra information in their constructors? For example, if an extension wanted to parse and extra piece of information X, couldn't it just add a `getX()` method and parse X in its constructor, and call `super(html)` to have my base class parse the rest of the information. I fail to see how parsing in the constructor prevents customization or extension.

Ben S 2009-07-26 02:19:42

Also, parsing in the constructor allows me to fail early, letting the client code know as soon as possible when the data is invalid.

Ben S 2009-07-26 02:27:10

The parsing of the base information is already being done in an optimal way and has been thoroughly performance tested. It's inadvisable to inherit from this class to replace an algorithm. Should it still have a separate parsing method?

Ben S 2009-07-26 02:46:03

The first reason not to do the initialisation in the constructor is if you could reasonably expect that it will not work on a regular basis. Exceptions should be kept for exceptional circumstances.Other reasons such as customisation may also come into play - though you have not indicated such and people seem to be adding this as a requirement. Given Den S example I don't see needing step 2 as described in the answer.

Greg Domjan 2009-07-26 06:16:17

Answer 7

+3 A:

You should try to keep the constructor from doing unnecessary work. In the end, it all depends on what the class should do, and how it should be used.

For instance, will all the accessors be called after constructing your object? If not, then you've processed data unnecessarily. Also, there's a bigger risk of throwing a "senseless" exception (oh, while trying to create the parser, I got an error because the file was malformed, but I didn't even ask it to parse anything...)

On second thought, you might need the access to this data fast after it is built, but you may take long building the object. It might be ok in this case.

Anyway, if the building process is complicated, I'd suggest using a creational pattern (factory, builder).

Samuel Carrijo 2009-07-26 02:03:15

My thought was that, creating an object is generally regarded as a fairly expensive thing to do. Which is the main reason why I include the parsing in the constructor. It's necessary that the parsing is done to have the accessors return a valid value, so they user code should assume the constructor will parse.

Ben S 2009-07-26 02:11:34

Answer 8

+3 A:

It is good rule of thumb to only initialize fields in constructors, and otherwise do as little as possible to initialize the Object. Using Java as an example, you could run into problems if you call methods in your constructor, especially if you subclass your Object. This is because, due to the order of operations in the instantiation of Objects, instance variables will not be evaluated until after the super constructor has finished. If you try to access the field during the super constructor's process, you will throw an Exception

Suppose you have a superclass

class Test {

   Test () {
      doSomething();
   }

   void doSomething() {
     ...
   }
 }

and you have a subclass:

class SubTest extends Test {
    Object myObj = new Object();

    @Override
    void doSomething() {
        System.out.println(myObj.toString()); // throws a NullPointerException          
    }
 }

This is an example specific to Java, and while different languages handle this sort of ordering differently, it serves to drive the point home.

edit as an answer to your comment:

Though I would normally shy away from methods in constructors, in this case you have a few options:

In your constructor, set the HTML string as a field in your Class, and parse every time your getters are called. This most likely will not be very efficient.
Set the HTML as a field on your object, and then introduce a dependency on parse(), with it needing to be called either right after the constructor is finished or include some sort of lazy parsing by adding something like 'ensureParsed()' at the head of your accessors. I dont like this all that much, as you could have the HTML around after you've parsed, and your ensureParsed() call could be coded to set all of your parsed fields, thereby introducing a side-effect to your getter.
You could call parse() from your constructor and run the risk of throwing an exception. As you say, you are setting the fields to initialize the Object, so this is really OK. With regard to the Exception, stating that there was an illegal argument passed into a constructor is acceptable. If you do this, you should be careful to ensure that you understand the way that your language handles the creation of Objects as discussed above. To follow up with the Java example above, you can do this without fear if you ensure that only private methods (and therefore not eligible for overriding by subclasses) are called from within a constructor.

akf 2009-07-26 02:14:26

So, in Java, you would agree with having my constructor parse, since the parsing is required to have the accessors return useful values, which might be used by sub-classes constructors?

Ben S 2009-07-26 02:23:01

I think the best advise anyone can give you in this case is that you should separate the parsing from the class and inject the parsing object. Program to the interface. I doubt any application you write (I write, heck anyone writes) is going to get the parsing right the first time. If you leave the parsing within the class, you will continually have to rewrite/compile the class each time you make a change. Use dependency injection or OSGi to define the parser that will be used at runtime. This way, you could use a parser that doesn't even exist within the same "jar" as your other object.

Richard Clayton 2009-07-26 03:34:42

This class implements the parsing interface my application uses. It implements the interface of transforming an HTML string into the information required. The discussion is about the implementation of that interface, not the surrounding application. Eventually a class needs to implement the parsing, this is about where the parsing code should be placed, within the constructor, or outside the constructor.

Ben S 2009-07-26 04:15:00

The third option you list sounds exactly like my current implementation.

Ben S 2009-07-26 05:30:44

Please tell me, are you going to throw the implementation for parsing every single html element into one class? How are you going to do this, a bunch of regular expressions? The point I think many people have tried to make is that maybe this class isn't as refined as it should be. We understand that you are asking about implementation in the constructor, but we're trying to help you not make some common programming mistakes (hence all the talk about dependency injection). Maybe you should review the suggestions after you've looked up some of these topics.

Richard Clayton 2009-07-26 21:21:11

Answer 9

A:

I would not do the parsing in the constructor. I would do everything necessary to validate the constructor parameters, and to ensure that the HTML can be parsed when needed.

But I'd have the accessor methods do the parsing if the HTML is not parsed by the time they need it to be. The parsing can wait until that time - it does not need to be done in the constructor.

Suggested code, for discussion purposes:

public class MyHtmlScraper {
    private TextReader _htmlFileReader;
    private bool _parsed;

    public MyHtmlScraper(string htmlFilePath) {
        _htmlFileReader = new StreamReader(htmlFilePath);
        // If done in the constructor, DoTheParse would be called here
    }

    private string _parsedValue1;
    public string Accessor1 {
        get {
            EnsureParsed();
            return _parsedValue1;
        }
    }

    private string _parsedValue2;
    public string Accessor2 {
        get {
            EnsureParsed();
            return _parsedValue2;
        }
    }

    private void EnsureParsed(){
        if (_parsed) return;
        DoTheParse();
        _parsed = true;
    }

    private void DoTheParse() {
        // parse the file here, using _htmlFileReader
        // parse into _parsedValue1, 2, etc.
    }
}

With this code in front of us, we can see there's very little difference between doing all the parsing in the constructor, and doing it on demand. There's a test of a boolean flag, and the setting of the flag, and the extra calls to EnsureParsed in each accessor. I'd be surprised if that extra code were not inlined.

This isn't a huge big deal, but my inclination is to do as little as possible in the constructor. That allows for scenarios where the construction needs to be fast. These will no doubt be situations you have not considered, like deserialization.

Again, it's not a huge big deal, but you can avoid doing the work in the constructor, and it's not expensive to do the work elsewhere. I admit, it's not like you're off doing network I/O in the constructor (unless, of course, a UNC file path is passed in), and you're not going to have to wait long in the constructor (unless there are networking problems, or you generalize the class to be able to read the HTML from places other than a file, some of which might be slow).

But since you don't have to do it in the constructor, my advice is simply - don't.

And if you do, it could be years before it causes an issue, if at all.

John Saunders 2009-07-26 02:18:51

This occurred to me as an optimization that was not useful in my implementation since constructing a parser object is a guarantee that at least one accessor will be called to retrieve a value. So I might as well perform the parsing in the constructor since it's going to get done at some point any way. Parsing in the constructor also allows be to throw an error if the parser was going to be invalid. Early failure is generally held as good practice.

Ben S 2009-07-26 02:26:10

Downvoter - please explain the reason for the downvote. I can't improve the answer if I don't know what you find lacking in it.

John Saunders 2009-07-26 13:18:07

John, he just doesn't like someone telling him that he is wrong by putting all that logic in the constructor.

Richard Clayton 2009-07-26 14:41:50

Sorry, "He" who?

John Saunders 2009-07-26 15:29:01

"Ben S", the person who posted the question. He's been down-voting everybody.

Richard Clayton 2009-07-26 17:01:45

Answer 10

A:

I think when you create a class ($obj = new class), the class should not affect the page at all, and should be relatively low processing.

For instance:

If you have a user class, it should be checking for incoming login/logout parameters, along with cookies, and assigning them to class variables.

If you have a database class, it should make the connection to the database so it is ready when you are going to start a query.

If you have a class that deals with a particular form, it should go get the form values.

In a lot of my classes, I check for certain parameters to define an 'action', like add, edit or delete.

All of these things don't really affect the page, so it wouldn't matter too much if you created them or not. They are simply ready for when you are going to call that first method.

Chacha102 2009-07-26 02:31:24

But calling that first method, in all cases will require initialization of the object, namely, parsing the HTML. So if all accessor calls will cause me to need parsing, shouldn't I parse as early as possible and report problems early?

Ben S 2009-07-26 02:39:12

I think that you should set up the file to be parsed. As in, the name or contents of the file, the things that will be returned, and the any other 'customizations' that are needed to be made, and then parse the the HTML with a runClass or parse function.

Chacha102 2009-07-26 02:43:36

This would allow you to make the object able to be used in other cases.

Chacha102 2009-07-26 02:45:49

In my case, the entire contents of the HTML file are passed through a String. The string is no longer required once it is parsed and is fairly large (a few hundred kilobytes). So it would be best to not keep it in memory. The object shouldn't be used for other cases. It was designed to parse a certain page. Parsing something else should prompt the creation of a different object to parse that.

Ben S 2009-07-26 03:03:40

I mean, the great part about programming is you can do it in so many different ways. If you believe that is the best way, go for it. You can always change it to act differently later if you find there is a big problem in doing that way. Seems like doing it the way you want to do it is going to suite your needs.

Chacha102 2009-07-26 03:20:23

I think the implementation I'm using will work well for me, but I was wondering if in some cases, it should be done differently, and if so, how should it be done in those cases.

Ben S 2009-07-26 04:53:58

Answer 11

A:

Why not just pass the parser to the constructor? This would allow you to change the implementation without changing the model:

public interface IParser
{
    Dictionary<string, object> ParseDocument(string document);
}

public class HtmlParser : IParser
{
    // Properties, etc...

    public Dictionary<string, object> ParseDocument(string document){
         //Do what you need to, return the collection of properties
         return someDictionaryOfHtmlObjects;
    }
}

public class HtmlScrapper
{
    // Properties, etc...

    public HtmlScrapper(IParser parser, string HtmlDocument){
         //Set your properties
    }

    public void ParseDocument(){
         this.myDictionaryOfHtmlObjects = 
                  parser.ParseDocument(this.htmlDocument);
    }

}

This should give you some flexibility in changing/improving how your application performs without needing to rewrite this class.

Richard Clayton 2009-07-26 03:28:42

This seems like poor design to me. HtmlScrapper is a IParser _and_ has a IParser, which it uses to be a an IParser. Then why shouldn't you just use the contained IParser without the extra useless level of abstraction? If I wanted to use the passed IParser, I would just use it rather than pass it to HtmlScrapper. This discussion is about whether or not the implementation of IParser should parse in its constructor, or separately in a multi-stage operation.

Ben S 2009-07-26 04:24:03

In my opinion, your class does to much. By abstracting the parser, you can interchange new versions or implementations as needed. That's the goal of object oriented design. But if you insist on containing all that logic in one class, have fun maintaining that code.

Richard Clayton 2009-07-26 14:40:05

Answer 12

A:

In my case, the entire contents of the HTML file are passed through a String. The string is no longer required once it is parsed and is fairly large (a few hundred kilobytes). So it would be best to not keep it in memory. The object shouldn't be used for other cases. It was designed to parse a certain page. Parsing something else should prompt the creation of a different object to parse that.

It sounds very much as though your object isn't really a parser. Does it just wrap a call to a parser and presents the results in a (presumably) more usable fashion? Because of this, you need to call the parser in the constructor as your object would be in a non-useful state otherwise.

I'm not sure how the "object-oriented" part helps here. If there's only one object and it can only process one specific page then it's not clear why it needs to be an object. You could do this just as easily in procedural (i.e. non-OO) code.

For languages that only have objects (e.g. Java) you could just create a static method in a class that had no accessible constructor and then invoke the parser and return all of the parsed values in a Map or similar collection

barrowc 2009-07-26 03:42:01

This object uses regex to parse information out of the HTML string and make it available using its accessor methods. Specifically, it parses the HTML from TV.com's show summary page. So parsing the HTML page would be done by calling: `ShowParser p = new ShowParser(htmlPage);` followed by getter calls such as `p.getGenre()` or `p.getSummary()`. Why would it be preferable to return a `Map`? If I were to use a `Map` the I would have to have a list of static Strings for each of the values ("Genre, Summary"). I would much rather remain OO and have an object with accessors instead of magic strings

Ben S 2009-07-26 04:31:39

Answer 13

A:

I personally put nothing in constructors and have a set of initialization functions. I find standard constructor methods have limited and cumbersome reuse.

2009-07-26 03:42:48

How are they limited and cumbersome? Do you have any examples?

Ben S 2009-07-26 04:47:59

Double initialisation for the sake of it? bleah

Greg Domjan 2009-07-26 06:10:00

For one thing, you can't re-initialize an object by calling its constructor again, you would have to create a new object or gut the constructor and put it in to a function that the constructor calls.To term this a different way, what *advantage* does the C#/Java initialization method have over normal methods? If you could call new MyObject.Initialize1() or new MyObject.Initialize2() where Initialize1 and Initialize2 are normal functions, when isn't that sufficient?

2009-07-26 16:09:14

Answer 14

A:

A possible option is to move the parsing code to a seperate function, make the constructor private, and have a static function parse( html ) that constructs the object and immediately calls the parse function.
This way you avoid the problems with parsing in the constructur (inconsistent state, problems when calling overridden functions, ...). But the client code still gets all the advantages (one call to get the parsed html or an 'early' error).

jos 2009-07-26 06:46:50

Answer 15

+1 A:

In this particular case, I would say there is two classes here: A parser and a parse result.

public class Parser {
    public Parser() {
        // Do what is necessary to construct a parser.
        // Perhaps we need to initialize a Unicode library, UTF-8 decoder, etc
    }
    public virtual ParseResult parseHTMLString(final string html) throws ParsingException
    {
        // Parser would do actual work here
        return new ParseResult(1, 2);
    }
}
public class ParseResult
{
    private int field1;
    private int field2;
    public ParseResult(int _field1, int _field2)
    {
        field1 = _field1;
        field2 = _field2;
    }
    public int getField1()
    {
        return field1;
    }
    public int getField2()
    {
        return field2;
    }
}

If your parser could work on partial sets of data, I'd suspect it would be suitable to add another class into the mix. Possibly a PartialParseResult?

rpetrich 2009-07-26 07:14:07

Answer 16

A:

As quite a few have commented the general rule is to only do initialization in constructors and never use say virtual methods (you will get a compiler warning if you try pay attention to that warning :) ). In you specific case I wouldn't go for the parHTML method either. an object should be in a valid state when it's constructed you should have to do stuff to the object before you can really use it.

Personally I'd go for a factory method. Exposing a class with no public constructors and create it using a factory method instead. Let you're factory method do the parsing and pass the parsed result to a private/protected constructor.

take a look at System.Web.WebRequest if you wanna see a sample of some similiar logic.

Rune FS 2009-07-26 09:53:30

Answer 17

A:

I agree with the posters here arguing minimal work in the constructor, really just putting the object into a non-zombie state, then have verb functions like parseHTML();

One point I'd like to make, although I don't want to cause a flame war, is consider the case of a non-exception environment. I know you're talking about C#, but I try to keep my programming models as similar as possible between c++ and c#. For various reasons, I don't use exceptions in C++ (think embedded video game programming), I use return code errors.

In this case, I can't throw exceptions in a constructor, so I tend to not have a constructor do anything which could fail. I leave that to the accessor functions.

Shane 2009-07-26 11:26:58

As I recall C++, constructors were _precisely_ the reason for exceptions!

John Saunders 2009-07-26 13:20:05

@John Saunders: I think you misunderstand me. Like I said, you can't always use exceptions in C++, for a variety of reasons, in which case you can't easily return a reasonable error from a constructor. That's why I, personally, don't do any more work in a constructor other than initialisation, then put code that could potentially fail in doWork() methods. I code the same way in Objective C too: i.e. alloc+init, then a function to do the work. It's also what I do in pure C, in that I allocate, initialise, then have functions to do the work. It's a common pattern for me. My 2c.

Shane 2009-07-26 23:18:13

Answer 18

+3 A:

Misko Hevery has a nice story on this subject, from a unit testing perspective, here.

jqno 2009-07-26 13:26:53

Answer 19

A:

In general, a constructor should:

Initialize all the fields.
Leave the resulting object in a valid state.

However, I would not use the constructor in the way you have. Parsing should be separated from using the parsing results.

Generally when I write a parser I write it as a singleton. I don't store any fields in the object except the single instance; instead, I only use local variables within the methods. Theoretically these could just be static (class-level) methods, but that would mean that I couldn't make them virtual.

Imagist 2009-07-29 05:33:17

ansaurus

tags:

views:

answers:

How much work should the constructor for an HTML parsing class do?

related questions