We unit test most of our business logic, but are stuck on how best to test some of our large service tasks and import/export routines. For example, consider the export of payroll data from one system to a 3rd party system. To export the data in the format the company needs, we need to hit ~40 tables, which creates a nightmare situation for creating test data and mocking out dependencies.

For example, consider the following (a subset of ~3500 lines of export code):

public void ExportPaychecks()
{
   var pays = _pays.GetPaysForCurrentDate();
   foreach (PayObject pay in pays)
   {
      WriteHeaderRow(pay);
      if (pay.IsFirstCheck)
      {
         WriteDetailRowType1(pay);
      }
   }
}

private void WriteHeaderRow(PayObject pay)
{
   //do lots more stuff
}

private void WriteDetailRowType1(PayObject pay)
{
   //do lots more stuff
}

We only have the one public method in this particular export class - ExportPaychecks(). That's really the only action that makes any sense to someone calling this class ... everything else is private (~80 private functions). We could make them public for testing, but then we'd need to mock them to test each one separately (i.e. you can't test ExportPaychecks in a vacuum without mocking the WriteHeaderRow function). This is a huge pain too.

Since this is a single export, for a single vendor, moving logic into the Domain doesn't make sense. The logic has no domain significance outside of this particular class. As a test, we built out unit tests which had close to 100% code coverage ... but this required an insane amount of test data typed into stub/mock objects, plus over 7000 lines of code due to stubbing/mocking our many dependencies.

As a maker of HRIS software, we have hundreds of exports and imports. Do other companies REALLY unit test this type of thing? If so, are there any shortcuts to make it less painful? I'm half tempted to say "no unit testing the import/export routines" and just implement integration testing later.

Update - thanks for the answers all. One thing I'd love to see is an example, as I'm still not seeing how someone can turn something like a large file export into an easily testable block of code without turning the code into a mess.

A: 

Have you looked into Moq?

Quote from the site:

Moq (pronounced "Mock-you" or just "Mock") is the only mocking library for .NET developed from scratch to take full advantage of .NET 3.5 (i.e. Linq expression trees) and C# 3.0 features (i.e. lambda expressions) that make it the most productive, type-safe and refactoring-friendly mocking library available.

Paul Sasik
We actually use Moq rather extensively, and did so in our testing for this class. But when you essentially have 3500 lines of procedural code (broken up into many short private functions), something like Moq doesn't help much.
Andrew
the problem here is not how to mock, but how to refactor in order to be able to test it, I think
Stephane
@serbrech - possibly, which is why I asked the question. We can refactor it such that all calls are public, but that seems a little silly. 1) It would be confusing to our developers (since there would be many public calls, but only 1 should ever be called by another program); 2) it means that we have to create test/mock objects to pass between the various functions, and these test objects would have no meaning outside the context of the export class. I feel like I'm either missing something or that this type of thing isn't tested. People at large enterprises I spoke with said the latter.
Andrew
What you can do is make the methods internal and virtual. This way you can override the class and test the overridden class, making sure the InternalsVisibleTo attribute (Google for this) is set on the assembly you want to test.
Burt
+2  A: 

I have nothing to do with C#, but I have some idea you could try here. If you split your code a bit, then you'll notice that what you have is basically a chain of operations performed on sequences.

First one gets pays for current date:

    var pays = _pays.GetPaysForCurrentDate();

Second one unconditionally processes the result:

    foreach (PayObject pay in pays)
    {
       WriteHeaderRow(pay);
    }

Third one performs conditional processing:

    foreach (PayObject pay in pays)
    {
       if (pay.IsFirstCheck)
       {
          WriteDetailRowType1(pay);
       }
    }

Now, you could make those stages more generic (sorry for pseudocode, I don't know C#):

    var all_pays = _pays.GetAll();

    var pwcdate = filter_pays(all_pays, current_date()) // filter_pays could also be made more generic, able to filter any sequence

    var pwcdate_ann =  annotate_with_header_row(pwcdate);       

    var pwcdate_ann_fc =  filter_first_check_only(pwcdate_ann);  

    var pwcdate_ann_fc_ann =  annotate_with_detail_row(pwcdate_ann_fc);   // this could be made more generic, able to annotate with arbitrary row passed as parameter

    (Etc.)

As you can see, now you have a set of unconnected stages that could be separately tested and then connected together in arbitrary order. Such connection, or composition, could also be tested separately. And so on (i.e. you can choose what to test).
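In C#, these stages should map naturally onto LINQ over IEnumerable&lt;PayObject&gt;. A rough sketch of the first two stages (treat it as a sketch; the PayDate property is an assumption, since the question doesn't show how pays carry their date):

using System;
using System.Collections.Generic;
using System.Linq;

public static class PayPipeline
{
    // Stage 1: filter to the pays for a given date.
    public static IEnumerable<PayObject> ForDate(IEnumerable<PayObject> pays, DateTime date)
    {
        return pays.Where(p => p.PayDate.Date == date.Date); // PayDate is assumed
    }

    // Stage 2: keep only first checks, using the property from the question.
    public static IEnumerable<PayObject> FirstChecksOnly(IEnumerable<PayObject> pays)
    {
        return pays.Where(p => p.IsFirstCheck);
    }
}

Each stage can be unit tested against a plain in-memory list, and the composition - e.g. PayPipeline.FirstChecksOnly(PayPipeline.ForDate(allPays, DateTime.Today)) - can be tested separately as well.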

Tomasz Zielinski
+1  A: 

I maintain some reports similar to what you describe, but not as many of them and with fewer database tables. I use a 3-fold strategy that might scale well enough to be useful to you:

  1. At the method level, I unit test anything I subjectively deem to be 'complicated'. This includes 100% of bug fixes, plus anything that just makes me feel nervous.

  2. At the module level, I unit test the main use cases. As you have encountered, this is fairly painful since it does require somehow mocking the data. I have accomplished this by abstracting the database interfaces (i.e. no direct SQL connections within my reporting module). For some simple tests I have typed the test data by hand, for others I have written a database interface that records and/or plays back queries, so that I can bootstrap my tests with real data (see the sketch after this list). In other words, I run once in record mode and it not only fetches real data but it also saves a snapshot for me in a file; when I run in playback mode, it consults this file instead of the real database tables. (I'm sure there are mocking frameworks that can do this, but since every SQL interaction in my world has the signature Stored Procedure Call -> Recordset it was quite simple just to write it myself.)

  3. I'm fortunate to have access to a staging environment with a full copy of production data, so I can perform integration tests with full regression against previous software versions.
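For what it's worth, here is a minimal sketch of the record/playback idea from point 2. It assumes every interaction really is "stored procedure call -> recordset"; the interface and the snapshot handling are simplified for illustration:

using System.Collections.Generic;
using System.Data;
using System.Linq;

public interface IDatabase
{
    DataTable Execute(string procedure, IDictionary<string, object> args);
}

public class RecordPlaybackDatabase : IDatabase
{
    private readonly IDatabase _real; // set in record mode, null in playback mode
    private readonly IDictionary<string, DataTable> _snapshot;

    public RecordPlaybackDatabase(IDatabase real, IDictionary<string, DataTable> snapshot)
    {
        _real = real;
        _snapshot = snapshot;
    }

    public DataTable Execute(string procedure, IDictionary<string, object> args)
    {
        // Key each call by procedure name plus arguments.
        string key = procedure + "|" +
            string.Join("|", args.Select(a => a.Key + "=" + a.Value).ToArray());

        if (_real != null) // record mode: hit the real database and save the result
        {
            DataTable result = _real.Execute(procedure, args);
            _snapshot[key] = result; // persisting the snapshot to a file is omitted here
            return result;
        }
        return _snapshot[key]; // playback mode: answer from the saved snapshot
    }
}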

Eric
Does this mean you make all your methods public so that they can be testable?
Andrew
In the case of reporting code, which is not reused between applications or even assemblies, yes, I just make it public. If it were library code, I would mark methods that I do want to test but don't want to be externally visible as internal and then explicitly allow access to the test assembly using `InternalsVisibleTo`
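For reference, the attribute goes in the production assembly (the test assembly name below is a placeholder):

// e.g. in AssemblyInfo.cs of the assembly under test
using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo("Payroll.Exports.Tests")]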
Eric
+1  A: 

I think Tomasz Zielinski has a piece of the answer. But if you say you have 3500 lines of procedural code, then the problem is bigger than that. Cutting it into more functions will not help you test it. However, it's a first step to identify responsibilities that could be extracted further to another class (if you have good names for the methods, that can be obvious in some cases).

I guess with such a class you have an incredible list of dependencies to tackle just to be able to instantiate it in a test. Michael Feathers' book "Working Effectively with Legacy Code" answers such questions very well. The first goal in being able to test that code well should be to identify the roles of the class and to break it into smaller classes. Of course that's easy to say, and the irony is that it's risky to do without tests to secure your modifications...

You say you have only 1 public method in that class. That should ease the refactoring, as you don't need to worry about callers of all the private methods. Encapsulation is nice, but if you have so much stuff private in that class, that probably means it doesn't belong there, and you should extract different classes from that monster that you will eventually be able to test. Piece by piece, the design will look cleaner, and you will be able to test more of that big piece of code. Your best friend if you start this will be a refactoring tool; it will help you avoid breaking logic while extracting classes and methods.

Again, Michael Feathers' book seems to be a must-read for you :) http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052

ADDED EXAMPLE:

This example comes from Michael Feathers' book and illustrates your problem well, I think:

RuleParser  
public evaluate(string)  
private branchingExpression  
private causalExpression  
private variableExpression  
private valueExpression  
private nextTerm()  
private hasMoreTerms()   
public addVariables()  

Obviously here, it doesn't make sense to make the methods nextTerm and hasMoreTerms public. Nobody should see these methods; the way we move to the next item is definitely internal to the class. So how do we test this logic?

Well, if you recognize that this is a separate responsibility and extract a class (Tokenizer, for example), these methods will suddenly be public within the new class, because that's its purpose. It then becomes easy to test that behaviour...
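A sketch of what that extraction might look like in C# (the term-scanning logic is simplified to whitespace-splitting, just to make the shape concrete):

public class Tokenizer
{
    private readonly string _input;
    private int _position;

    public Tokenizer(string input)
    {
        _input = input;
        _position = 0;
    }

    // Formerly private details of RuleParser - now the public purpose of this class.
    public bool HasMoreTerms()
    {
        return _position < _input.Length;
    }

    public string NextTerm()
    {
        int start = _position;
        while (_position < _input.Length && !char.IsWhiteSpace(_input[_position]))
            _position++;
        string term = _input.Substring(start, _position - start);
        while (_position < _input.Length && char.IsWhiteSpace(_input[_position]))
            _position++; // skip the separator so the next call starts on a term
        return term;
    }
}

Now HasMoreTerms and NextTerm can be unit tested directly, without going through evaluate() at all.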

So if you apply that to your huge piece of code, extracting pieces of it into other classes with fewer responsibilities, where it feels more natural to make these methods public, you will be able to test them easily too. You said you are accessing about 40 different tables to map them. Why not break that into classes for each part of the mapping?

It's a bit hard to reason about code I can't read. You may have other issues that prevent you from doing this, but that's my best try at it.

Hope this helps. Good luck :)

Stephane
This isn't legacy code, and all of our other code is easily testable. A lot of people say 'just refactor it', but nobody has been able to provide good direction on what this means for a complex task that requires many steps to complete. I'd love to see some type of example.
Andrew
Well, you say you basically have a 3500-line method. The fact that it's cut into smaller private methods doesn't really change that long piece of code. I did give you hints in my answer about the refactoring. It is impossible that a class with so many lines of code holds only one responsibility. If you separate the responsibilities, you will end up with more classes that have different public methods, which become testable. I'll edit my answer for more details.
Stephane
Thinking more about the code you posted in the question, you could have a PayHeaderWriter class where it would make sense to make the WriteHeaderRow method public?
Stephane
Thanks, these are some good thoughts. I wonder if people test custom ETL code? Sometimes it can be quite complex, but often runs into the thousands of lines. I'd love to see an example in a book or on the web with what other people did ... I'll keep looking.
Andrew
But is the whole problem (or most of it) about something being private? I don't know about C#, but in C++ you have the `friend` keyword, which can be used to expose everything to tests.
Tomasz Zielinski
There's an attribute/tag that can be added, but C# has no native language support for 'friend'
Andrew
You also have something like this in C#, but if you need to test a private method, that's a design smell...
Stephane
serbrech - your example is very good, thank you!
Andrew
+4  A: 

What you should have initially are integration tests. These will test that the functions perform as expected and you could hit the actual database for this.

Once you have that safety net you can start refactoring the code to be more maintainable and introducing unit tests.

As mentioned by serbrech, Working Effectively with Legacy Code will help you to no end, and I would strongly advise reading it even for greenfield projects.

http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052

The main question I would ask is: how often does the code change? If it is infrequent, is it really worth the effort trying to introduce unit tests? If it is changed frequently, then I would definitely consider cleaning it up a bit.

Burt
I guess my big question is: how do I refactor such a thing? I read the 'Working Effectively ...' book a while back, but applying such logic to a complex file export seems harder than just using a big procedure. For example - my 80 or so functions correlate to about 20 export types in this file, utilizing ~40 domain objects. I can split it into easily tested classes, but then we have a ton of classes for a single database. When your code is 'do step 1, then step 2, 3, 4, ... 20', the easiest to maintain approach often mirrors those steps. Or maybe I'm just dense.
Andrew
@Andrew: I don't work with anything nearly as huge as your code base, but I wrote a few not-that-small importers/exporters in the past. They were also code blobs with one public method - and from today's perspective I clearly see that breaking them into pieces and even some basic automated testing would have saved a lot of time spent on debugging.
Tomasz Zielinski
Have a read at the following: http://stackoverflow.com/questions/1620855/unit-testing-a-large-method It sounds like you have a god method (Google around for it) on your hands that does everything. It does need splitting up, and there are patterns out there for exactly this sort of thing.
Burt
@Burt - have you seen examples where people have followed these patterns in practice? I can certainly split things up into classes, but there's no way to abstract away that a single paycheck has dozens of dependencies that would need to be set up/maintained. I've asked this of dozens of people outside of SO, and the answer is always along the lines of "oh, there's a pattern for that", but nobody can provide a real example of how a LARGE export/report procedure was split up into easy-to-maintain/test blocks of code.
Andrew
Are your dependencies sitting behind interfaces? If so, you could provide either mocks or stubs to return information instead of hitting the actual dependency. Are you aware of dependency injection? It could help you when it comes to mocking out the dependencies. There are also ways you can test private methods that you do not want to expose; for example, if you have Step 1, Step 2, ... pulled out into separate methods, then you can test those steps individually: http://kurtschindler.net/blog/?tag=/internalsvisibleto
Burt
+9  A: 

This style of (attempted) unit testing, where you try to cover an entire huge code base through a single public method, always reminds me of surgeons, dentists or gynaecologists who have to perform complex operations through small openings. Possible, but not easy.

Encapsulation is an old concept in object-oriented design, but some people take it to such extremes that testability suffers. There's another OO principle called the Open/Closed Principle that fits much better with testability. Encapsulation is still valuable, but not at the expense of extensibility - in fact, testability is really just another word for the Open/Closed Principle.

I'm not saying that you should make your private methods public, but what I am saying is that you should consider refactoring your application into composable parts - many small classes that collaborate instead of one big Transaction Script. You may think it doesn't make much sense to do this for a solution to a single vendor, but right now you are suffering, and this is one way out.

What will often happen when you split up a single method in a complex API is that you also gain a lot of extra flexibility. What started out as a one-off project may turn into a reusable library.


Here are some thoughts on how to perform a refactoring for the problem at hand: Every ETL application must perform at least these three steps:

  1. Extract data from the source
  2. Transform the data
  3. Load the data into the destination

(hence the name ETL). As a start for refactoring, this gives us at least three classes with distinct responsibilities: Extractor, Transformer and Loader. Now, instead of one big class, you have three with more targeted responsibilities. Nothing messy about that, and already a bit more testable.

Now zoom in on each of these three areas and see where you can split up responsibilities even more.

  • At the very least, you will need a good in-memory representation of each 'row' of source data. If the source is a relational database, you may want to use an ORM, but if not, such classes need to be modeled so that they correctly protect the invariants of each row (e.g. if a field is non-nullable, the class should guarantee this by throwing an exception if a null value is attempted). Such classes have a well-defined purpose and can be tested in isolation.
  • The same holds true for the destination: You need a good object model for that.
  • If there's advanced application-side filtering going on at the source, you could consider implementing these using the Specification design pattern. Those tend to be very testable as well.
  • The Transform step is where a lot of the action happens, but now that you have good object models of both source and destination, transformation can be performed by Mappers - again testable classes.

If you have many 'rows' of source and destination data, you can further split this up into Mappers for each logical 'row', etc.

It never needs to become messy, and the added benefit (besides automated testing) is that the object model is now way more flexible. If you ever need to write another ETL application involving one of the two sides, you already have at least one third of the code written.
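To make the shape concrete, here is a skeletal sketch of the decomposition (all names are illustrative, not a prescription):

using System;
using System.Collections.Generic;

// Each interface owns exactly one ETL concern and can be faked in tests.
public interface IExtractor { IEnumerable<SourceRow> Extract(); }
public interface ITransformer { DestinationRow Transform(SourceRow row); }
public interface ILoader { void Load(IEnumerable<DestinationRow> rows); }

// The export itself shrinks to pure composition.
public class PaycheckExport
{
    private readonly IExtractor _extractor;
    private readonly ITransformer _transformer;
    private readonly ILoader _loader;

    public PaycheckExport(IExtractor extractor, ITransformer transformer, ILoader loader)
    {
        _extractor = extractor;
        _transformer = transformer;
        _loader = loader;
    }

    public void Run()
    {
        var rows = new List<DestinationRow>();
        foreach (SourceRow row in _extractor.Extract())
            rows.Add(_transformer.Transform(row));
        _loader.Load(rows);
    }
}

// A source row that protects its invariants, as described above.
public class SourceRow
{
    public string EmployeeId { get; private set; }

    public SourceRow(string employeeId)
    {
        if (employeeId == null)
            throw new ArgumentNullException("employeeId"); // non-nullable field
        EmployeeId = employeeId;
    }
}

public class DestinationRow { /* destination object model goes here */ }

PaycheckExport.Run can now be unit tested with three hand-rolled fakes, and each concrete Extractor, Mapper and Loader gets its own focused tests.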

Mark Seemann
Conceptually I understand, but I've never seen an example where someone created a complex export routine, translation process, or something similar in anything but a 'Transaction Script'. Have you seen any examples? A procedural approach is very straightforward and easy to read/write as long as logic is broken out into different functions. When you are writing to a 3rd party specification that is laid out as a series of steps/requirements, converting that into a series of testable objects/classes seems far more difficult to maintain than just writing a procedure.
Andrew
Yes, it may be a lot easier to do it like that... until it comes to testing it :) Back in 2003-2004 when I worked for Microsoft Services I TDD'ed a very complex ETL application using this approach, so I have definitely seen examples :)
Mark Seemann
Edited my answer to include refactoring examples.
Mark Seemann
@Mark- good answer. Pipes and filters is another good pattern to mention. Filters are very easy to unit test- http://www.eaipatterns.com/PipesAndFilters.html
RichardOD
@RichardOD: Good link, thanks :)
Mark Seemann
+4  A: 

Something general that came to my mind about refactoring:

Refactoring does not mean you take your 3.5k LOC and divide it into n parts. I would not recommend making some of your 80 methods public or anything like that. It's more like slicing your code vertically:

  • Try to factor out self-standing algorithms and data structures like parsers, renderers, search operations, converters, special-purpose data structures ...
  • Try to figure out if your data is processed in several steps and can be built as a kind of pipes-and-filters mechanism, or a tiered architecture. Try to find as many layers as possible.
  • Separate technical (files, database) parts from logical parts.
  • If you have many of these import/export monsters, see what they have in common, and factor those parts out and reuse them.
  • Expect in general that your code is too dense, i.e. it contains too many different functionalities next to each other in too few LOC. Visit the different "inventions" in your code and think about whether they are in fact tricky facilities that are worth having their own class(es).
    • Both LOC and number of classes are likely to increase when you refactor.
    • Try to make your code real simple ('baby code') inside classes and complex in the relations between the classes.

As a result, you won't have to write unit tests that cover the whole 3.5k LOC at all. Only small fractions of it are covered in a single test, and you'll have many small tests that are independent from each other.


EDIT

Here's a nice list of refactoring patterns. Among those, one shows my intention quite nicely: Decompose Conditional.

In the example, certain expressions are factored out into methods. Not only does the code become easier to read, but you also gain the opportunity to unit test those methods.

Even better, you can lift this pattern to a higher level and factor out those expressions, algorithms, values etc. not only to methods but also to their own classes.
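For example, applied to code like the question's, Decompose Conditional might look like this (PayDate and IsVoided are invented properties, purely for illustration):

using System;

// Before: the rule is buried inline in the export loop.
//     if (pay.PayDate == DateTime.Today && !pay.IsVoided) { Write(pay); }

// After: the condition is a named method that can be unit tested on its own.
public static class PayRules
{
    public static bool IsPayableToday(PayObject pay)
    {
        return pay.PayDate.Date == DateTime.Today && !pay.IsVoided;
    }
}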

Wolfgang
Thanks Wolfgang. These are some good suggestions. I'd still love to see an example where someone took what is essentially a large procedural Transaction Script (PoEAA) and refactored it into something that is easily testable. Our units are relatively small, but they are all private so testing them is very difficult (far more so than writing the code to begin with).
Andrew
My idea was that your 80 methods are likely not to be the units usable for testing. Testable units might cut across those methods and might not be recognized yet. I'll see if I can find an example later.
Wolfgang
@Wolfgang - we actually went through the effort of refactoring this to be testable, but ended up making something that seemed less clear to our devs. We basically extracted methods at first, but then needed to extract classes and use Dependency Injection for those classes to make them testable (otherwise the original method couldn't be tested by itself - i.e. ExportPaychecks above). This ended up creating a ton of classes that were testable, but most only had 1 or 2 methods in them.
Andrew
+1  A: 

I really find it hard to accept that you've got multiple ~3.5 Kline data-export functions with no common functionality at all between them. If that's in fact the case, then maybe unit testing is not what you need to be looking at here. If there really is only one thing that each export module does, and it's essentially indivisible, then maybe a snapshot-comparison, data-driven integration test suite is what's called for.

If there are common bits of functionality, then extract each of them out (as separate classes) and test them individually. Those little helper classes will naturally have different public interfaces, which should reduce the problem of private APIs that can't be tested.

You don't give any details about what the actual output formats look like, but if they're generally tabular, fixed-width or delimited text, then you ought at least to be able to split the exporters up into structural and formatting code. By which I mean, instead of your example code up above, you'd have something like:

public void ExportPaychecks(HeaderFormatter h, CheckRowFormatter f)
{
   var pays = _pays.GetPaysForCurrentDate();
   foreach (PayObject pay in pays)
   {
      h.FormatHeader(pay);
      f.WriteDetailRow(pay);
   }
}

The HeaderFormatter and CheckRowFormatter abstract classes would define a common interface for those types of report elements, and the individual concrete subclasses (for the various reports) would contain logic for removing duplicate rows, for example (or whatever a particular vendor requires).
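For instance, the base classes could be as thin as this (the duplicate-row rule below is an invented example of a vendor-specific requirement):

public abstract class HeaderFormatter
{
    public abstract void FormatHeader(PayObject pay);
}

public abstract class CheckRowFormatter
{
    public abstract void WriteDetailRow(PayObject pay);
}

// One vendor's concrete formatter:
public class AcmeCheckRowFormatter : CheckRowFormatter
{
    private PayObject _previous;

    public override void WriteDetailRow(PayObject pay)
    {
        if (ReferenceEquals(pay, _previous))
            return; // vendor-specific duplicate suppression
        _previous = pay;
        // ... write the row in this vendor's format ...
    }
}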

Another way to slice this is to separate data extraction and formatting from each other. Write code that extracts all the records from the various databases into an intermediate representation that's a super-set of the needed representations, then write relatively simple-minded filter routines that convert from the uber-format down to the required format for each vendor.


After thinking about this a little more, I realize you've identified this as an ETL application, but your example seems to combine all three steps together. That suggests that a first step would be to split things up such that all the data is extracted first, then translated, then stored. You can certainly test at least those steps separately.

Mark Bessey
Thanks, this is definitely ETL code
Andrew
+3  A: 

It sounds like integration tests may be sufficient, especially if these export routines don't change once they're done, or are only used for a limited time. Just get some sample input data with some variations, and have a test that verifies the final result is as expected.

A concern with your tests was the amount of fake data you had to create. You may be able to reduce this by creating a shared fixture (http://xunitpatterns.com/Shared%20Fixture.html). For unit tests the fixture may be an in-memory representation of the business objects to export; for integration tests it may be the actual databases initialized with known data. The point is that however you generate the shared fixture, it is the same in each test, so creating new tests is just a matter of making minor tweaks to the existing fixture to trigger the code you want to test.
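As a sketch, with xUnit.net (2.x and later) the in-memory flavour could look like this - IClassFixture makes the framework build the fixture once and hand it to every test in the class (the data and names are placeholders):

using System.Collections.Generic;
using Xunit;

public class PayrollFixture
{
    public List<PayObject> Pays { get; private set; }

    public PayrollFixture()
    {
        // A known baseline of business objects, typed in once and shared.
        Pays = new List<PayObject>();
    }
}

public class ExportTests : IClassFixture<PayrollFixture>
{
    private readonly PayrollFixture _fixture;

    public ExportTests(PayrollFixture fixture)
    {
        _fixture = fixture; // injected by xUnit
    }

    [Fact]
    public void FirstCheckGetsDetailRow()
    {
        // arrange from _fixture.Pays (tweaking a copy), act, assert ...
    }
}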

So should you use integration tests? One barrier is how to set up the shared fixture. If you can duplicate the databases somewhere, you could use something like DbUnit to prepare the shared fixture. It might be easier to break the code into pieces (import, transform, export). Then use the DbUnit-based tests to test import and export, and use regular unit tests to verify the transform step. If you do that, you don't need DbUnit to set up a shared fixture for the transform step. If you can break the code into 3 steps (extract, transform, export), at least you can focus your testing efforts on the part that's likely to have bugs or change later.

Frank Schwieterman
+1  A: 

This is one of those areas where the concept of mocking everything falls over. Certainly testing each method in isolation would be a "better" way of doing things, but compare the effort of making test versions of all your methods to that of pointing the code at a test database (reset at the start of each test run if necessary).

That is the approach I'm using with code that has a lot of complex interactions between components, and it works well enough. As each test will run more code, you are more likely to need to step through with the debugger to find exactly where something went wrong, but you get the primary benefit of unit tests (knowing that something went wrong) without putting in significant additional effort.

Tom Clarkson
Thanks Tom ... some of the other answers were very good, but we still end up with huge blocks of code to test with very little benefit. In the end, we'd still need integration tests, so why not just use them from the start?
Andrew