I am planning out some work to introduce Dependency Injection into what is currently a large monolithic library in an attempt to make the library easier to unit-test, easier to understand, and possibly more flexible as a bonus.

I have decided to use Ninject, and I really like Nate's motto of 'do one thing, do it well' (paraphrased), which seems to fit particularly well in the context of DI.

What I have been wondering now, is whether I should split what is currently a single large assembly into multiple smaller assemblies with disjoint feature sets. Some of these smaller assemblies will have inter-dependencies, but far from all of them, because the architecture of the code is pretty loosely coupled already.
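For illustration, the shape I have in mind is roughly one Ninject module per feature assembly, composed at the application entry point. This is just a sketch; all the type and module names below are made up:

    using Ninject;
    using Ninject.Modules;

    // Hypothetical feature-area contract and implementation, purely for illustration.
    public interface ILogger { void Write(string message); }
    public class FileLogger : ILogger { public void Write(string message) { /* ... */ } }

    // Each feature assembly describes its own bindings in a module.
    public class LoggingModule : NinjectModule
    {
        public override void Load()
        {
            Bind<ILogger>().To<FileLogger>();
        }
    }

    public static class Program
    {
        public static void Main()
        {
            // The application composes only the feature assemblies it actually needs.
            var kernel = new StandardKernel(new LoggingModule());
            var logger = kernel.Get<ILogger>();
            logger.Write("composed via Ninject");
        }
    }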

Note that these feature sets are not trivial or small unto themselves either... they encompass things like client/server communications, serialisation, custom collection types, file-IO abstractions, common routine libraries, threading libraries, standard logging, etc.

I see that a previous question: What is better, many small assemblies, or one big assembly? kind of addresses this issue, but at what seems to be an even finer granularity than this, which makes me wonder if the answers there still apply in this case?

Also, in the various questions that skirt close to this topic a common answer is that having 'too many' assemblies has caused unspecified 'pain' and 'problems'. I would really like to know concretely what the possible down-sides of this approach could be.

I agree that adding 8 assemblies when before only 1 was needed is 'a bit of a pain', but having to include a big monolithic library for every application is also not exactly ideal... plus adding the 8 assemblies is something you do only once, so I have very little sympathy for that argument (even though I would probably complain along with everyone else at first).

Addendum:
So far I have seen no convincing reasons against smaller assemblies, so I think I will proceed for now as if this is a non-issue. If anyone can think of good solid reasons with verifiable facts to back them up I would still be very interested to hear about them. (I'll add a bounty as soon as I can to increase visibility)

EDIT: Moved the performance analysis and results into a separate answer (see below).

A: 

Personally, I like the monolithic approach.

But sometimes you cannot help creating more assemblies; .NET remoting is normally responsible for this, since it requires a common interface assembly.

I am not sure how 'heavy' the overhead is of loading an assembly. (perhaps someone can enlighten us)

leppie
+3  A: 

There is a slight performance hit to loading each assembly (even more if they are signed), so that's one reason to tend to cluster commonly-used things together in the same assembly. I don't believe there's a big overhead once things are loaded (though there may be some static optimization stuff that the JIT may have a harder time performing when crossing an assembly boundary).

The approach I try to take is this: namespaces are for logical organization; assemblies are for grouping classes/namespaces that should be physically used together. I.e. if you don't expect to ever want ClassA without ClassB (or vice versa), they belong in the same assembly.
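If you want a rough feel for the per-assembly load cost on your own setup, a quick (and unscientific) sketch like the following will do; the assembly name is just a placeholder for one of your own:

    using System;
    using System.Diagnostics;
    using System.Reflection;

    public static class LoadTimer
    {
        public static void Main()
        {
            // Placeholder name; substitute one of your own assemblies.
            const string assemblyName = "MyCompany.Serialisation";

            var sw = Stopwatch.StartNew();
            Assembly asm = Assembly.Load(assemblyName);
            // Enumerating the types forces more of the load work to actually happen.
            Type[] types = asm.GetTypes();
            sw.Stop();

            Console.WriteLine("Loaded {0} ({1} types) in {2} ms",
                asm.FullName, types.Length, sw.ElapsedMilliseconds);
        }
    }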

Jonathan
Good point about whole program optimization :)
leppie
Although I would not take for granted that it cannot do these optimizations, because it should still be able to load all referenced assemblies into memory and then make the same kind of global optimization decisions. I'm not saying it will or won't... just that there is no technical reason that it cannot as far as I can see.
jerryjvl
Regarding the performance hit, it's not necessarily so "slight". I've had a project on Windows Mobile where reducing the number of assemblies and DLLs reduced the start time from 1.5 minutes to around 15 seconds. Of course, it's less of a problem if your target is a PC or server, but it's definitely something you should consider.
Tal Pressman
I haven't done any mobile development, but it's "slight" on the normal framework. Of course, 'slight' times 200 signed assemblies isn't very slight anymore :)
Jonathan
I second the performance hit part of it, and yes, I'm talking about normal framework. It all depends on how many of those assemblies you've got. To share a personal anecdote: when optimizing startup time of Visual Studio 2010, we actually had too many assembly loads as one of the contributing factors that came up during profiling.
Pavel Minaev
A: 

Monolithic monsters make reusing part of the code for later work more expensive than it needs to be, and they lead to coupling (often explicit) between classes that didn't need to be coupled, resulting in higher maintenance costs since testing and error correction become harder as a result.

A downside of having many projects is that (at least in VS) compilation takes quite some time compared to having only a few projects.

Rune FS
Do you have any references or measurements for the difference in compile time? Not that it's even remotely a deal-breaker, but I'd be interested to have a feel for this aspect.
jerryjvl
How it impacts the compile process/time depends on the build machine and the projects; the more you're able to build in parallel, the less the impact. Having 8 projects build fully in parallel on 8 cores might be faster than building one large project on just one of them. On the latest project I've tested this on, we decreased 19 projects to 8 and got something like a 30% decrease in compile time (building on one CPU core).
Rune FS
+1  A: 

I guess if you are only talking about a dozen, you should be ok. I'm working on an application with 100+ assemblies, and it is very painful.

If you don't have some way of managing the dependencies - knowing what is going to get broken if you modify assembly X - you are in trouble.

One 'nice' problem I have come across is when assembly A references assemblies B and C, and B references V1 of assembly D, while C references V2 of assembly D. ('Twisted diamond' would be quite a good name for that)

If you want to have an automated build, you're going to have fun maintaining the build script (which will need to build in reverse order of dependencies), or else have 'one solution to rule them all', which will be nearly impossible to use in Visual Studio if you have lots of assemblies.

EDIT I think the answer to your question depends very much on the semantics of your assemblies. Are different applications likely to share an assembly? Do you want to be able to update the assemblies for both applications separately? Are you intending to use the GAC? Or copy the assemblies next to the executables?

Benjol
I am not entirely sold on the 'knowing what is going to get broken' argument... I have unit-testing and coverage analysis to help there. Also, the 'twisted-diamond' problem is not likely, because our build system is required to always re-build all from source. I guess this to some extent limits our exposure to problems.
jerryjvl
I'd love to know what specifically might get problematic with build scripts or MSBuild/VS though?
jerryjvl
To be honest, we've yet to create a build script here, which is why doing it for 100+ assemblies seems a bit daunting. I'm still 'new' here, my approach has been to gradually integrate assemblies into a master solution as and when I need to modify them. I use the configuration manager to limit rebuilds, and solution folders to 'organise'. I'm at 50 projects now, and it's useable perf-wise, but a bit cluttered (See http://stackoverflow.com/questions/152053/structuring-projects-dependencies-of-large-winforms-applications-in-c/298462#298462)
Benjol
... So my comment on build scripts may be unfair. Personally, I would opt for a big solution, even if it was only ever used by night build. (though this then poses other potential problems with project references - http://stackoverflow.com/questions/177338/what-do-you-do-about-references-when-unloading-a-project-in-visual-studio). I'm still feeling my way on these questions.
Benjol
+6  A: 

I will give you a real-world example where the use of many (very) small assemblies has produced .Net DLL Hell.

At work we have a large homegrown framework that is long in the tooth (.NET 1.1). Aside from the usual framework-type plumbing code (including logging, workflow, queuing, etc.), there were also various encapsulated database access entities, typed datasets and some other business logic code. I wasn't around for the initial development and subsequent maintenance of this framework, but did inherit its use. As I mentioned, this entire framework resulted in numerous small DLLs. And, when I say numerous, we're talking upwards of 100 -- not the manageable 8 or so you've mentioned. Further complicating matters, the assemblies were all strongly-signed, versioned and deployed to the GAC.

So, fast-forward a few years and a number of maintenance cycles later, and what has happened is that the interdependencies between the DLLs and the applications they support have wreaked havoc. On every production machine there is a huge assembly redirect section in the machine.config file that ensures the "correct" assembly gets loaded by Fusion no matter which assembly is requested. This grew out of the difficulty of rebuilding every dependent framework and application assembly that took a dependency on one that was modified or upgraded. Great pains (usually) were taken to ensure that no breaking changes were made to assemblies when they were modified. The assemblies were rebuilt and a new or updated entry was made in the machine.config.
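For anyone who hasn't had the pleasure, each redirect entry looks roughly like this (the assembly name, public key token and version numbers here are made up); now imagine dozens of these stacked up in machine.config:

    <configuration>
      <runtime>
        <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
          <dependentAssembly>
            <!-- hypothetical framework assembly, for illustration only -->
            <assemblyIdentity name="Company.Framework.Data"
                              publicKeyToken="1a2b3c4d5e6f7a8b"
                              culture="neutral" />
            <bindingRedirect oldVersion="1.0.0.0-1.5.0.0" newVersion="2.0.0.0" />
          </dependentAssembly>
        </assemblyBinding>
      </runtime>
    </configuration>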

Here's where I will pause to listen to the sound of a huge collective groan and gasp!

This particular scenario is the poster child for what not to do. Indeed, in this situation you end up in a completely unmaintainable mess. I recall it took me 2 days to get my machine set up for development against this framework when I first started working with it -- resolving differences between my GAC and a runtime environment's GAC, machine.config assembly redirects, version conflicts at compile time due to incorrect references or, more likely, version conflicts due to directly referencing components A and B, where B also referenced A, but a different version than my application's direct reference. You get the idea.

The real problem with this specific scenario is that the assembly contents were far too granular, and this is ultimately what caused the tangled web of interdependencies. My guess is that the initial architects thought this would create a system of highly maintainable code -- only having to rebuild very small components of the system after a change. In fact, the opposite was true. Further, as some of the other answers here already note, when you get to this number of assemblies, loading a ton of assemblies does incur a performance hit -- definitely during resolution, and I would guess, though I have no empirical evidence, that runtime might suffer in some edge-case situations, particularly where reflection comes into play -- I could be wrong on that point.

You'd think I'd be scarred by all this, but I believe there are logical physical separations for assemblies -- and when I say "assemblies" here, I am assuming one assembly per DLL. What it all boils down to is the interdependencies. If I have an assembly A that depends on assembly B, I always ask myself whether I'll ever need to reference assembly B without assembly A, or whether there is a benefit to that separation. Looking at how assemblies are referenced is usually a good indicator as well. Suppose you were to divide your large library into assemblies A, B, C, D and E: if you referenced assembly A 90% of the time, and because of that you always had to reference assemblies B and C since A depends on them, then it's likely a better idea for assemblies A, B and C to be combined, unless there's a really compelling argument for keeping them separate. Enterprise Library is a classic example of this, where you nearly always have to reference 3 assemblies in order to use a single facet of the library -- in the case of Enterprise Library, however, the ability to build on top of core functionality, and code reuse, are the reason for its architecture.

Looking at the architecture is another good guideline. If you have a nicely stacked architecture, where your assembly dependencies are in the form of a stack, say "vertical", as opposed to a "web", which starts to form when you have dependencies in every direction, then separating assemblies on functional boundaries makes sense. Otherwise, look to roll things into one, or look to re-architect.

Either way, good luck!

Peter Meyer
+1 for a good example. It does occur to me that to a large extent the problems in this scenario did not stem from the many-small-assemblies itself, but rather from the fact that this granularity allowed other things to go wrong much more spectacularly? ... I'm not saying that is not a concern though, because risk should obviously be limited as much as possible.
jerryjvl
Agreed. That point wasn't clear from my diatribe, though I tried to express it -- I'm all for using multiple smaller assemblies as long as it makes sense. In other words, what goes where cannot be arbitrary, but rather should be well planned along architectural, maintenance and reuse lines.
Peter Meyer
+1  A: 

The biggest factor in your assembly organization should be your dependency graph, at a class as well as an assembly level.

Assemblies should not have circular references. That should be pretty obvious to start.

The classes that have the most dependencies on each other should be in a single assembly.

If a class A depends on class B, and B, while not depending directly on A, is unlikely to ever be used apart from A, then they should share an assembly.

You can also use assemblies to enforce separation of concerns - having your GUI code in one assembly while your business logic resides in another will provide some level of enforcement of your business logic being agnostic of your GUI.

Assembly separation based on where the code will be run is another point to consider - common code between executables should (generally) be in a common assembly, rather than having one .exe refer directly to another.

Perhaps one of the more important things you can use assemblies for is to differentiate between public APIs, and objects used internally to enable the public APIs to work. By putting an API in a separate assembly, you can enforce the opaqueness of its API.
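As a minimal sketch of that last point (all names here are invented for illustration): the consumer of the assembly sees only the public surface, while the implementation stays internal:

    using System.Runtime.CompilerServices;

    // Optional: let a (hypothetical) test assembly see the internals.
    [assembly: InternalsVisibleTo("MyCompany.Messaging.Tests")]

    namespace MyCompany.Messaging
    {
        // The public API of the assembly.
        public interface IMessageSender
        {
            void Send(string destination, string payload);
        }

        public static class MessagingApi
        {
            public static IMessageSender CreateSender()
            {
                return new TcpMessageSender();
            }
        }

        // Implementation detail: not visible outside this assembly.
        internal class TcpMessageSender : IMessageSender
        {
            public void Send(string destination, string payload)
            {
                // ... actual transport code would go here ...
            }
        }
    }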

kyoryu
This is all very true, but none of it addresses the question. I am assuming the same people will be doing the work either way, so the design is not going to be dramatically different either way. The only motivation for altering the size of the assemblies will be to alter how many features are implemented in each assembly.
jerryjvl
And to manage dependencies. If your assembly division is too fine-grained, consumers will have to reference many assemblies, and you'll have so many cross-assembly dependencies that you'll lose the separation advantage of having separate assemblies in the first place. While you've done some nice analysis below, I suspect that on a larger project (one that would need multiple assemblies), the organization benefits (or disadvantages) of multiple assemblies will far outweigh the compilation/solution loading impacts. Just my opinion, of course :)
kyoryu
+8  A: 

Since the performance analysis has become a little lengthier than expected, I've put it into its own separate answer. I will be accepting Peter's answer as official, even though it lacked measurements, because it was most instrumental in motivating me to perform the measurements myself, and gave me the most inspiration for what might be worth measuring.

Analysis:
The concrete downsides mentioned so far all seem to focus on performance of one kind or another, but actual quantitative data was missing, so I have done some measurements of the following:

  • Time to load solution in the IDE
  • Time to compile in the IDE
  • Assembly load time (time it takes the application to load)
  • Lost code optimisations (time it takes an algorithm to run)

This analysis completely ignores the 'quality of the design', which some people have mentioned in their answers, since I do not consider the quality a variable in this trade-off. I am assuming that the developer will first and foremost let their implementation be guided by the desire to get the best possible design. The trade-off here is whether it is worthwhile aggregating functionality into larger assemblies than the design strictly calls for, for the sake of (some measure of) performance.

Application structure:
The application I built is somewhat abstract because I needed a large number of solutions and projects to test with, so I wrote some code to generate them all for me.

The application contains 1000 classes, grouped into 200 sets of 5 classes that inherit from each other. Classes are named Axxx, Bxxx, Cxxx, Dxxx and Exxx. Class A is completely abstract, B-D are partially abstract, each overriding one of the methods of A, and E is concrete. The methods are implemented so that a call to one method on an instance of E performs multiple calls up the hierarchy chain. All method bodies are simple enough that they should theoretically all inline.
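To give an idea of the shape, here is a simplified sketch of one generated group (not the exact generated code; names and method bodies are illustrative):

    // Simplified reconstruction of one of the 200 generated groups.
    public abstract class A001
    {
        public abstract int MethodB();
        public abstract int MethodC();
        public abstract int MethodD();

        // One call on the concrete leaf walks the whole override chain.
        public int Call() { return MethodB() + MethodC() + MethodD(); }
    }

    public abstract class B001 : A001
    {
        public override int MethodB() { return 1; }
    }

    public abstract class C001 : B001
    {
        public override int MethodC() { return 2; }
    }

    public abstract class D001 : C001
    {
        public override int MethodD() { return 3; }
    }

    // Concrete class that gets instantiated and invoked in the measurements.
    public class E001 : D001
    {
    }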

These classes were distributed across assemblies in 8 different configurations along 2 dimensions:

  • Number of assemblies: 10, 20, 50, 100
  • Cutting direction: across the inheritance hierarchy (none of A-E are ever in the same assembly together), and along the inheritance hierarchy

Not all the measurements are exact; some were done by stopwatch and have a larger margin of error. The measurements taken are:

  • Opening the solution in VS2008 (stopwatch)
  • Compiling the solution (stopwatch)
  • In IDE: Time between start and first executed line of code (stopwatch)
  • In IDE: Time to instantiate one of Exxx for each of the 200 groups in the IDE (in code)
  • In IDE: Time to execute 100,000 invocations on each Exxx (in code; see the sketch after this list)
  • The last three 'In IDE' measurements, but from the prompt using the 'Release' build
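The in-code timings were taken with a pattern roughly like this (a simplified sketch rather than the exact harness; the real code repeats it for all 200 Exxx groups and both build configurations):

    using System;
    using System.Diagnostics;

    public static class Harness
    {
        public static void Main()
        {
            // E001 and Call() refer to the simplified group sketched above.
            var sw = Stopwatch.StartNew();
            var instance = new E001();
            sw.Stop();
            Console.WriteLine("new(): {0:F3} ms", sw.Elapsed.TotalMilliseconds);

            sw = Stopwatch.StartNew();
            int total = 0;
            for (int i = 0; i < 100000; i++)
            {
                total += instance.Call();
            }
            sw.Stop();
            Console.WriteLine("Execute: {0:F3} s (checksum {1})", sw.Elapsed.TotalSeconds, total);
        }
    }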

Results:
Opening the solution in VS2008

                               ----- in the IDE ------   ----- from prompt -----
Cut    Asm#   Open   Compile   Start   new()   Execute   Start   new()   Execute
Across   10    ~1s     ~2-3s       -   0.150    17.022       -   0.139    13.909
         20    ~1s       ~6s       -   0.152    17.753       -   0.132    13.997
         50    ~3s       15s   ~0.3s   0.153    17.119    0.2s   0.131    14.481
        100    ~6s       37s   ~0.5s   0.150    18.041    0.3s   0.132    14.478

Along    10    ~1s     ~2-3s       -   0.155    17.967       -   0.067    13.297
         20    ~1s       ~4s       -   0.145    17.318       -   0.065    13.268
         50    ~3s       12s   ~0.2s   0.146    17.888    0.2s   0.067    13.391
        100    ~6s       29s   ~0.5s   0.149    17.990    0.3s   0.067    13.415

Observations:

  • The number of assemblies (but not the cutting direction) seems to have a roughly linear impact on the time it takes to open the solution. This does not strike me as surprising.
  • At about 6 seconds, the time it takes to open the solution does not seem to me an argument to limit the number of assemblies. (I did not measure whether associating source control had a major impact on this time).
  • Compile time increases a little more than linearly in this measurement. I imagine most of this is due to the per-assembly overhead of compilation, and not inter-assembly symbol resolutions. I would expect less trivial assemblies to scale better along this axis. Even so, I personally don't find 30s of compile time an argument against splitting, especially when noting that most of the time only some assemblies will need re-compilation.
  • There appears to be a barely measurable, but noticeable increase in start-up time. The first thing the application does is output a line to the console, the 'Start' time is how long this line took to appear from start of execution (note these are estimates because it was too quick to measure accurately even in worst-case).
  • Interestingly, it appears that outside the IDE assembly loading is (very slightly) more efficient than inside the IDE. This probably has something to do with the effort of attaching the debugger, or some such.
  • Also note that re-start of the application outside the IDE reduced the start-up time a little further still in the worst-case. There may be scenarios where 0.3s for start-up is unacceptable, but I cannot imagine this will matter in many places.
  • Initialisation and execution times inside the IDE are stable regardless of the assembly split-up; this may be because the need to support debugging gives it an easier time resolving symbols across assemblies.
  • Outside the IDE, this stability continues, with one caveat... the number of assemblies does not matter for the execution, but when cutting across the inheritance hierarchy the execution time is a fraction worse than when cutting along it. Note that the difference appears too small to me to be systematic; it is probably just the extra time the run-time takes, once, to figure out how to do the same optimisations... frankly, although I could investigate this further, the differences are so small that I am not inclined to worry too much.

So, from all this it appears that the burden of more assemblies is predominantly borne by the developer, and then mostly in the form of compilation time. As I already stated, these projects were so simple that each took far less than a second to compile causing the per-assembly compilation overhead to dominate. I would imagine that sub-second assembly compilation across a large number of assemblies is a strong indication that these assemblies have been split further than is reasonable. Also, when using pre-compiled assemblies the major developer argument against splitting (compilation time) would also disappear.

In these measurements I can see very little if any evidence against splitting into smaller assemblies for the sake of run-time performance. The only thing to watch out for (to some extent) is to avoid cutting across inheritance whenever possible; I would imagine that most sane designs would limit this anyway because inheritance would typically only occur within a functional area, which would normally end up within a single assembly.

jerryjvl
+1 for detailed research
Jeremy McGee
This is a great analysis. Thanks for doing the research.
Peter Meyer
Thanks... in the spirit of Stack Overflow it seemed the right thing to do... now at least there is some concrete knowledge to work from ;) ... I'd love for someone to do some more in-depth testing especially in the area of run-time behaviour of optimizations, and to confirm that the performance drop outside the IDE when splitting across is only a step-change and not a scaling one.
jerryjvl
If nobody else does, I might eventually be motivated to dig deeper myself, but for now I think I have the answer that I need, in that with a properly designed set of libraries it is highly unlikely to really matter how many assemblies are involved.
jerryjvl