How would you go about converting a reasonably large (>300K), fairly mature C codebase to C++?

The kind of C I have in mind is split into files roughly corresponding to modules (i.e. less granular than a typical OO class-based decomposition), using internal linkage in lieu of private functions and data, and external linkage for public functions and data. Global variables are used extensively for communication between the modules. There is a very extensive integration test suite available, but no unit (i.e. module-level) tests.

I have in mind a general strategy:

  1. Compile everything in C++'s C subset and get that working.
  2. Convert modules into huge classes, so that all the cross-references are scoped by a class name, but leaving all functions and data as static members, and get that working.
  3. Convert huge classes into instances with appropriate constructors and initialized cross-references; replace static member accesses with indirect accesses as appropriate; and get that working.
  4. Now, approach the project as an ill-factored OO application, and write unit tests where dependencies are tractable, and decompose into separate classes where they are not; the goal here would be to move from one working program to another at each transformation.
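To make steps 2 and 3 concrete, here is a minimal sketch of the step-3 end state for a hypothetical "diag" module (all names invented). In step 2, every member below would still be static, so that the old `diag_error()` call simply becomes the scoped `Diag::error()`; step 3 then drops `static` and threads an instance through the callers:

```cpp
#include <string>
#include <vector>

// Hypothetical "diag" module after step 3. In step 2 these members would
// all be static; step 3 drops 'static' and adds a (trivial) constructor.
class Diag {
public:
    void error(const std::string& msg) { errors_.push_back(msg); }
    int  error_count() const { return static_cast<int>(errors_.size()); }
private:
    std::vector<std::string> errors_;   // was: file-scope static array + count
};
```

The point of going through the static-member stage first is that every cross-reference gets scoped and checked by the compiler before any data actually moves.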

Obviously, this would be quite a bit of work. Are there any case studies / war stories out there on this kind of translation? Alternative strategies? Other useful advice?

Note 1: the program is a compiler, and probably millions of other programs rely on its behaviour not changing, so wholesale rewriting is pretty much not an option.

Note 2: the source is nearly 20 years old, and has perhaps 30% code churn ((lines modified + lines added) / previous total lines) per year. It is heavily maintained and extended, in other words. Thus, one of the goals would be to increase maintainability.

[For the sake of the question, assume that translation into C++ is mandatory, and that leaving it in C is not an option. The point of adding this condition is to weed out the "leave it in C" answers.]

+4  A: 

I would write C++ classes over the C interface. Not touching the C code reduces the chance of breaking something and speeds the process up significantly.

Once you have your C++ interface up, copying and pasting the code into your classes becomes a fairly mechanical task. As you mentioned, it is vital to do unit testing during this step.
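As a rough sketch of what such a wrapper might look like (the C functions here are invented stand-ins for the real interface, defined inline so the example is self-contained):

```cpp
#include <cstring>
#include <string>

// Stand-in for an existing C module (names invented for illustration).
// In the real codebase this would live in an untouched .c file.
extern "C" {
    static int  g_value;          // the kind of global the question describes
    static char g_name[64];

    int sym_add(const char* name, int value) {
        std::strncpy(g_name, name, sizeof g_name - 1);
        g_value = value;
        return 0;
    }
    int sym_find(const char* name, int* out) {
        if (std::strcmp(g_name, name) != 0) return -1;
        *out = g_value;
        return 0;
    }
}

// Thin C++ facade over the C interface: no C code is modified, the class
// only scopes and forwards, which is what keeps this step low-risk.
class SymbolTable {
public:
    bool add(const std::string& name, int value) {
        return sym_add(name.c_str(), value) == 0;
    }
    bool find(const std::string& name, int& out) const {
        return sym_find(name.c_str(), &out) == 0;
    }
};
```

Unit tests can then be written against the facade first, so they survive unchanged when the C body behind it is later replaced.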

nlaq
The "C interface" starts and ends with "main()". I think you may have left a few steps out... :)
Barry Kelly
+7  A: 

What about:

  1. Compile everything in C++'s C subset and get that working, and
  2. Implement a set of facades, leaving the C code unaltered?

Why is "translation into C++ mandatory"? You can wrap the C code without the pain of converting it into huge classes and so on.

Federico Ramponi
One of the points of making the code more modular, converting to C++, and adding unit tests, is to make it more maintainable. Just putting a facade over the front simply won't do.
Barry Kelly
The "translation to C++ mandatory" is to weed out those answers that say "leave the C unaltered".
Barry Kelly
A: 

If you have a small or academic project (say, less than 10,000 lines), a rewrite is probably your best option. You can factor it however you want, and it won't take too much time.

If you have a real-world application, I'd suggest getting it to compile as C++ (which usually means primarily fixing up function prototypes and the like), then working on refactoring and OO wrapping. Of course, I don't subscribe to the philosophy that code needs to be OO-structured in order to be acceptable C++ code. I'd do a piece-by-piece conversion, rewriting and refactoring as needed (for functionality or for incorporating unit testing).

Nick
+2  A: 

Your list looks okay except I would suggest reviewing the test suite first and trying to get that as tight as possible before doing any coding.

Paul Nathan
The test suite is pretty tight, trust me. 20 years of QA with tens of thousands of logged bugs with QA-written test cases tends to do that.
Barry Kelly
+2  A: 

Let me throw in another (possibly stupid) idea:

  1. Compile everything in C++'s C subset and get that working.
  2. Start with a module, convert it into a huge class, then into an instance, and build a C interface (identical to the one you started from) out of that instance. Let the remaining C code work with that C interface.
  3. Refactor as needed, growing the OO subsystem out of C code one module at a time, and drop parts of the C interface when they become useless.
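A sketch of step 2 for a single hypothetical module (all names invented): the module becomes a C++ class, and an extern "C" shim rebuilds the old interface on top of one instance, so the remaining C callers compile and link unchanged:

```cpp
#include <map>
#include <string>

// The former "symtab" module as a C++ class.
class SymTab {
public:
    void insert(const std::string& name, int value) { table_[name] = value; }
    bool lookup(const std::string& name, int* out) const {
        auto it = table_.find(name);
        if (it == table_.end()) return false;
        *out = it->second;
        return true;
    }
private:
    std::map<std::string, int> table_;
};

// Single instance standing in for the old file-scope statics.
static SymTab g_symtab;

// The rebuilt C interface (function names invented), to be dropped once
// the last C caller has been converted.
extern "C" int symtab_insert(const char* name, int value) {
    g_symtab.insert(name, value);
    return 0;
}
extern "C" int symtab_lookup(const char* name, int* out) {
    return g_symtab.lookup(name, out) ? 0 : -1;
}
```

The shim is deliberately boring: each C entry point is a one-line forward, so the risk of the conversion is concentrated in the class body, where unit tests can cover it.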
Federico Ramponi
Barry Kelly
+6  A: 

Having just started on pretty much the same thing a few months ago (on a ten-year-old commercial project, originally written with the "C++ is nothing but C with smart structs" philosophy), I would suggest using the same strategy you'd use to eat an elephant: take it one bite at a time. :-)

As much as possible, split it up into stages that can be done with minimal effects on other parts. Building a facade system, as Federico Ramponi suggested, is a good start -- once everything has a C++ facade and is communicating through it, you can change the internals of the modules with fair certainty that they can't affect anything outside them.

We already had a partial C++ interface system in place (due to previous smaller refactoring efforts), so this approach wasn't difficult in our case. Once we had everything communicating as C++ objects (which took a few weeks, working on a completely separate source-code branch and integrating all changes to the main branch as they were approved), it was very seldom that we couldn't compile a totally working version before we left for the day.

The change-over isn't complete yet -- we've paused twice for interim releases (we aim for a point-release every few weeks), but it's well on the way, and no customer has complained about any problems. Our QA people have only found one problem that I recall, too. :-)

Head Geek
sounds scary... You should write up a more detailed article about this procedure, I bet it would be well read.
Ape-inago
I've written a couple of blog articles about specific parts of the conversion, http://geekblog.oakcircle.com/2008/07/19/ascii-unicode-and-windows/ and http://geekblog.oakcircle.com/2009/03/15/superbug/ . I'm not an amusing-enough writer to make the whole thing interesting though.
Head Geek
+2  A: 

Besides how you want to start, there are probably two more things to consider: what you want to focus on, and where you want to stop.

You state that there is a lot of code churn; this may be a key to focusing your efforts. I suggest you pick the parts of your code where a lot of maintenance is needed; the mature/stable parts are apparently working well enough, so it is better to leave them as they are, except perhaps for some window dressing with facades etc.

Where you want to stop depends on what the reason is for wanting to convert to C++. This can hardly be a goal in itself. If it is due to some 3rd party dependency, focus your efforts on the interface to that component.

The software I work on is a huge, old code base which was 'converted' from C to C++ years ago. I think it was because the GUI was converted to Qt. Even now it still mostly looks like a C program with classes. Breaking the dependencies caused by public data members, and refactoring the huge classes with procedural monster methods into smaller methods and classes, has never really taken off, I think for the following reasons:

  1. There is no need to change code that is working and that does not need to be enhanced. Doing so introduces new bugs without adding functionality, and end users don't appreciate that;
  2. It is very, very hard to refactor reliably. Many pieces of code are so large and so vital that people hardly dare touch them. We have a fairly extensive suite of functional tests, but sufficient code-coverage information is hard to get. As a result, it is difficult to establish whether there are enough tests in place to detect problems during refactoring;
  3. The ROI is difficult to establish. The end user will not benefit from refactoring, so the gain must come from reduced maintenance cost, which will initially increase because refactoring introduces new bugs into mature, i.e. fairly bug-free, code. And the refactoring itself will be costly as well ...

NB. I suppose you know the book "Working Effectively with Legacy Code"?

andreas buykx
Yes, I have the book. Unfortunately, it is almost entirely applicable only to unit-testable code. The chief suggestion for people using non-OO code, barely more than a paragraph as I recall, was to use an OO variant of the language.
Barry Kelly
The information about mostly still looking like C, I can live with. There is significant churn, like I said, so being able to use C++ for the rewritten pieces moving forward would still be a win in terms of modularity.
Barry Kelly
There aren't really any "stable" parts, per se, apart from the memory manager. The primary goals would be to increase the abstraction level of newly written source code through careful use of templates and classes, and to reduce cross-dependencies, particularly those caused by global variables.
Barry Kelly
The code churn is significant; it's not that rare that certain subsets of functionality are rewritten fairly comprehensively. E.g. code generator needs to be rewritten for 64-bit, scanner adapted for Unicode, overloading adjusted for generic method inferencing, etc. etc.
Barry Kelly
A: 

Here's what I would do:

  • Since the code is 20 years old, scrap the parser/syntax analyzer and replace it with C++ code based on one of the newer lex/yacc/bison-style tools (or anything similar); it will be much more maintainable and easier to understand, and faster to develop too if you have a BNF handy.
  • Once this is retrofitted to the old code, start wrapping modules into classes. Replace global/shared variables with interfaces.
  • Now what you have will be a compiler in C++ (not quite though).
  • Draw a class diagram of all the classes in your system, and see how they are communicating.
  • Draw another one using the same classes and see how they ought to communicate.
  • Refactor the code to transform the first diagram to the second. (this might be messy and tricky)
  • Remember to use C++ code for all new code added.
  • If you have some time left, try replacing data structures one by one to use the more standardized STL or Boost.
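For the last point, a typical before/after shape, with the original C structure invented for illustration:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Before (typical hand-rolled C):
//   struct node { char* name; struct node* next; };
// with a file-scope head pointer, manual malloc/free, and a hand-written
// traversal loop at every use site.
//
// After: the standard container owns the memory, and <algorithm>
// replaces the traversal loops.
using NameList = std::vector<std::string>;

inline bool contains(const NameList& names, const std::string& name) {
    return std::find(names.begin(), names.end(), name) != names.end();
}
```

Conversions like this can usually be done one data structure at a time, behind the module's existing interface, which keeps each step testable.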
Sridhar Iyer
I don't think you appreciate all the subtleties of compilers. Commercial compilers use hand-written lexers and parsers for many reasons, performance being only one. Secondly, don't get too addicted to classes. CLOS-style multiple dispatch on functions would more often be useful than virtual methods.
Barry Kelly
For example, how do you change the class of an instance on the fly? How do you create new classes at runtime? You end up adding levels of indirection, and losing much of the usual OO benefit. Actually, pattern matching, not just type matching, in the multiple dispatch would be even better.
Barry Kelly
Re parser: the language is not LL(1), nor is it LALR(1), it is context sensitive in ways that ad-hoc semantic and syntactic predicates can resolve. This is the price of flexible language extension over the years.
Barry Kelly
Re parser 2: Similarly, providing good intellisense / editor experience works best with parser integration. The compiler knows the types of everything, knows what's in scope, knows exactly where in the syntax the cursor (a special token) is placed, etc.
Barry Kelly
Re parser 3: This is easily reasoned about in hand-written recursive descent, but not easily tractable at all with shift/reduce state machine transitions. Since this cursor token can appear anywhere, it would complicate a grammar, as you can imagine - it's like explicit WS that needs an action.
Barry Kelly
@Barry; That sounds nasty.
Ape-inago
+3  A: 

Your application has lots of folks working on it, and a need to not be broken. If you are serious about a large-scale conversion to an OO style, what you need is massive transformation tools to automate the work.

The basic idea is to designate groups of data as classes, and then get the tool to refactor the code to move that data into classes, move functions on just that data into those classes, and revise all accesses to that data to calls on the classes.

You can do an automated pre-analysis to form statistical clusters to get some ideas, but you'll still need an application-aware engineer to decide what data elements should be grouped.

A tool that is capable of doing this task is our DMS Software Reengineering Toolkit. DMS has strong C parsers for reading your code, captures the C code as compiler abstract syntax trees, and (unlike a conventional compiler) can compute flow analyses across your entire 300K SLOC. DMS has a C++ front end that can be used as the "back" end; one writes transformations that map C syntax to C++ syntax.

A major C++ reengineering task on a large avionics system gives some idea of what using DMS for this kind of activity is like. See the technical papers at www.semdesigns.com/Products/DMS/DMSToolkit.html, specifically "Re-engineering C++ Component Models Via Automatic Program Transformation".

This process is not for the faint of heart. But then, anybody who would consider manually refactoring a large application is already not afraid of hard work.

Yes, I'm associated with the company, being its chief architect.

Ira Baxter
good post, but you may want to add that you are affiliated with the mentioned product and company, otherwise people will start pointing this out by calling your posting a covert ad ;-)
none
This does sound like the way to go... and having avionics code means it has to be damn well sure it works.
Ape-inago
@none - I know Ira from comp.compilers. This conversion is probably off the table for us, not just because of cost / risk, but because we're exploring other avenues. However, the answer is useful for other folks with similar problems...
Barry Kelly
I'm always surprised by the "cost/risk" argument people apply to mass-change technology, because they fail to compare it to the cost/risk of doing the same job by manual methods. The cost of manually objectifying such an application has to be huge, and the risk, of course, is that the programmers will shuffle the code about incorrectly. The risk with this technology lies in whether it shuffles the code about accurately, and we're largely past that point. Now, it is true that the cost/risk of using tools for this is way higher than standing pat, but standing pat has costs of its own.
Ira Baxter
I do think DMS is a very interesting product, but it seems to be targeted at very specific uses/users (possibly mainly because of the costs involved)? Also, it would probably be easier to get an impression of its potential if the webpage provided some sort of web-based access to a restricted version of it (for example, allowing people to do a handful of specific C<->C++ transformations like "rename symbol" on 5-10 kbytes of code uploaded via a web form).
none
@none: If you can hand-hack your way through a problem in a day or a week, you don't need DMS. If you are contemplating man-months or more to get the job done, and/or you need to be very sure you didn't break the code, you need a tool like DMS.
Ira Baxter
+1  A: 

GCC is currently in mid-transition from C to C++. They started by moving everything into the common subset of C and C++, obviously. As they did so, they added warnings to GCC for everything they found, collected under -Wc++-compat. That should get you through the first part of your journey.

For the latter parts, once you actually have everything compiling with a C++ compiler, I would focus on replacing things that have idiomatic C++ counterparts. For example, if you're using lists, maps, sets, bit vectors, hash tables, etc., defined using C macros, you will likely gain a lot by moving these to C++. Likewise with OO, you'll likely find benefits where you are already using a C OO idiom (like struct inheritance), and where C++ will afford greater clarity and better type checking on your code.
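To illustrate the "struct inheritance" idiom, here is a sketch of the usual C shape and one possible C++ counterpart (all type names invented):

```cpp
#include <memory>

// C idiom: 'derived' structs embed the base as their first member and
// rely on casts -- legal, but the compiler can't check the tag for you:
//
//   struct expr  { int kind; };
//   struct binop { struct expr base; struct expr *lhs, *rhs; };
//   ... ((struct binop*)e)->lhs   /* hope e->kind was right */
//
// C++ counterpart: the hierarchy is explicit, and dynamic_cast (or a
// visitor) replaces the unchecked cast.
struct Expr {
    virtual ~Expr() = default;
};

struct IntLit : Expr {
    explicit IntLit(int v) : value(v) {}
    int value;
};

struct BinOp : Expr {
    BinOp(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    std::unique_ptr<Expr> lhs, rhs;
};
```

Whether dynamic_cast, a kind enum plus switch, or a visitor is the right replacement depends on how hot the dispatch paths are; the type-checking win is the same either way.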

Paul Biggar
+1  A: 

You mention that your tool is a compiler, and that: "Actually, pattern matching, not just type matching, in the multiple dispatch would be even better".

You might want to take a look at maketea. It provides pattern matching for ASTs, as well as the AST definition from an abstract grammar, visitors, transformers, etc.

Paul Biggar