A carelessly written template here, some excessive inlining there - it's all too easy to write bloated code in C++. In principle, refactoring to reduce that bloat isn't too hard. The problem is tracing the worst offending templates and inlines - tracing those items that are causing real bloat in real programs.

With that in mind, and because I'm certain that my libraries are a bit more bloat-prone than they should be, I was wondering whether there are any tools that can track down the worst offenders automatically - i.e. identify the items that contribute most (including all their repeated instantiations and calls) to the size of a particular target.

I'm not much interested in performance at this point - it's all about the executable file size.

Are there any tools for this job, usable on Windows, and compatible with either MinGW GCC or Visual Studio?

EDIT - some context

I have a set of multiway-tree templates that act as replacements for the red-black tree standard containers. They are written as wrappers around non-typesafe, non-template code, but they were also written a long time ago, partly as a "will better cache friendliness boost real performance" experiment. The point being, they weren't really written for long-term use.

Because they support some handy tricks, though (search based on custom comparisons/partial keys, efficient subscripted access, search for the smallest unused key), they ended up in use just about everywhere in my code. These days, I hardly ever use std::map.

Layered on top of those, I have some more complex containers, such as two-way maps. On top of those, I have tree and digraph classes. On top of those...

Using map files, I could track down whether non-inline template methods are causing bloat. That's just a matter of finding all the instantiations of a particular method and adding the sizes. But what about unwisely inlined methods? The templates were, after all, meant to be thin wrappers around non-template code, but historically my ability to judge whether something should be inlined or not hasn't been very reliable. The bloat impact of those template inlines isn't so easy to measure.
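
To make that concrete, here is a rough sketch of the shape of these wrappers - the names and methods are invented for illustration, not lifted from the actual code. The question is essentially how much each instantiation really costs once the forwarding methods get inlined into every call site.

    // Hypothetical sketch of the thin-wrapper pattern described above.
    // RawTree is the non-typesafe, non-template core; Tree<Key, Value> is the
    // typesafe template layered on top. In real code the RawTree methods live
    // in a .cpp file and are compiled exactly once.
    class RawTree
    {
    public:
        typedef int (*Comparator)(const void*, const void*);

        void* Find(const void* key, Comparator cmp) const
        {
            (void)key; (void)cmp;
            return 0;                           // stub - the real search lives in the .cpp
        }
        void Insert(void* key, void* value, Comparator cmp)
        {
            (void)key; (void)value; (void)cmp;  // stub
        }
    };

    template <class Key, class Value>
    class Tree
    {
    public:
        // Each instantiation generates only these small forwarding methods.
        // If they stay as out-of-line calls, the per-type cost is tiny; if
        // they are inlined at every call site, every caller pays for the
        // casts and comparator plumbing - that is the bloat in question.
        Value* Find(const Key& key) const
        {
            return static_cast<Value*>(impl_.Find(&key, &Compare));
        }
        void Insert(Key* key, Value* value)
        {
            impl_.Insert(key, value, &Compare);
        }

    private:
        static int Compare(const void* a, const void* b)
        {
            const Key& ka = *static_cast<const Key*>(a);
            const Key& kb = *static_cast<const Key*>(b);
            return ka < kb ? -1 : (kb < ka ? 1 : 0);
        }
        RawTree impl_;
    };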

I have some idea which methods are heavily used, but that's the well-known optimization-without-profiling mistake.

+3  A: 

Map file analysis

I ran into a problem like this some time ago, and I ended up writing a custom tool which analysed the map file (the Visual Studio linker can be instructed to produce one). The tool's output was:

  • a list of functions, sorted by code size in descending order, listing only the first N
  • a list of source files, sorted by code size in descending order, listing only the first N

Parsing the map file is relatively easy (a function's code size can be computed as the difference between its address and the address on the following line); the hardest part is probably handling the mangled names in a reasonable way. You might find ready-to-use libraries for both of these - I did it a few years ago and do not know the current situation.

Here is a short excerpt from a map file, so that you know what to expect:

Address         Publics by Value              Rva+Base       Lib:Object

0001:0023cbb4       ?ApplyScheme@Input@@QAEXPBVParamEntry@@@Z 0063dbb4 f   mainInput.obj
0001:0023cea1       ?InitKeys@Input@@QAEXXZ    0063dea1 f   mainInput.obj
0001:0023cf47       ?LoadKeys@Input@@QAEXABVParamEntry@@@Z 0063df47 f   mainInput.obj
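
For illustration, here is a rough C++ sketch of such a parser, assuming the "Publics by Value" layout above (the default input file name and the top-20 cut-off are arbitrary; a real tool would also demangle and group the names):

    // Rough map-file analyser (sketch only, C++11): estimates each function's
    // size as the gap to the next symbol in the same section, then prints the
    // largest offenders. Assumes the MSVC "Publics by Value" layout shown above.
    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Symbol {
        unsigned long section;
        unsigned long offset;
        std::string   name;    // still decorated - demangle separately
        std::string   object;  // the Lib:Object column
        unsigned long size;    // estimated afterwards
    };

    int main(int argc, char** argv)
    {
        std::ifstream in(argc > 1 ? argv[1] : "program.map");
        std::vector<Symbol> syms;
        std::string line;
        while (std::getline(in, line)) {
            // Lines of interest look like:
            //   0001:0023cea1  ?InitKeys@Input@@QAEXXZ  0063dea1 f  mainInput.obj
            std::istringstream ss(line);
            std::string addr, name, rva, flag, object;
            if (!(ss >> addr >> name >> rva)) continue;
            ss >> flag >> object;                      // the "f" column is optional
            std::size_t colon = addr.find(':');
            if (colon == std::string::npos) continue;  // skip headers etc.
            Symbol s;
            s.section = std::stoul(addr.substr(0, colon), nullptr, 16);
            s.offset  = std::stoul(addr.substr(colon + 1), nullptr, 16);
            s.name    = name;
            s.object  = (flag == "f") ? object : flag;
            s.size    = 0;
            syms.push_back(s);
        }

        // Estimated size = distance to the next symbol in the same section.
        std::sort(syms.begin(), syms.end(), [](const Symbol& a, const Symbol& b) {
            return a.section != b.section ? a.section < b.section : a.offset < b.offset;
        });
        for (std::size_t i = 0; i + 1 < syms.size(); ++i)
            if (syms[i].section == syms[i + 1].section)
                syms[i].size = syms[i + 1].offset - syms[i].offset;

        // Report the top 20 by estimated size.
        std::sort(syms.begin(), syms.end(), [](const Symbol& a, const Symbol& b) {
            return a.size > b.size;
        });
        for (std::size_t i = 0; i < syms.size() && i < 20; ++i)
            std::cout << syms[i].size << '\t' << syms[i].object << '\t'
                      << syms[i].name << '\n';
        return 0;
    }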

Symbol Sort

As posted in Ben Straub's answer, Symbol Sort is a ready-to-use command line utility (it comes with complete C# source) which does all of this, the only difference being that it analyses pdb/exe files rather than map files.

Suma
At first sight it answers the question "which functions are largest?", not "which items cause the most bloat in other functions?". Still interesting, of course, and for the name mangling there's always Agner Fog.
Steve314
I often parse map files in this way to find over-sized functions in embedded C code. Once you get the hang of doing it by hand, you should be able to write a script to do it for you (the mapfile format is compiler-specific, unfortunately, but the idea is the same). Be aware that some compilers don't generate map files by default, but most should have (if nothing else) a command-line option to do so.
bta
You can group (sum) items by any criteria you wish, not only by source file. Once you have demangled the name, you can e.g. group based on the template class name and ignore the rest.
Suma
As for demangling, there is `undname.exe` for script-based and `UnDecorateSymbolName()` for programmatic solutions (a minimal sketch follows below) - the more annoying part is filtering out noise.
Georg Fritzsche
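
A minimal sketch of that programmatic route - the decorated name is taken from the map excerpt above, and the buffer size is arbitrary:

    // Demangling with DbgHelp (link against Dbghelp.lib).
    #include <windows.h>
    #include <dbghelp.h>
    #include <iostream>

    int main()
    {
        const char* decorated = "?InitKeys@Input@@QAEXXZ";  // from the map excerpt
        char plain[1024];
        if (UnDecorateSymbolName(decorated, plain, sizeof(plain), UNDNAME_COMPLETE))
            std::cout << plain << '\n';  // e.g. "public: void __thiscall Input::InitKeys(void)"
        else
            std::cout << "demangling failed\n";
    }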
+1  A: 

Basically, you are looking for costly things that you don't need. Suppose there is some category of functions that you don't need, taking up some large percentage of the space - say 20%. Then if you picked 20 random bytes out of the image, on average 4 of them (20 * 20%) would fall in that category, and you would be able to see them. So basically, you take those samples, look at them, and if you see an obvious pattern of functions that you don't really need, remove them. Then do it again, because other categories of routines that used less space are now taking a higher percentage.

So I agree with Suma that parsing the map file is a good start. Then I would write a routine to walk through it, and every 5% of the way (space-wise) print the routine I am in. That way I get 20 samples. Often I find that a large chunk of object space results from a very small number (like 1) of lines of source code that I could easily have done another way.
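
Here is a sketch of that walk, assuming the symbol names and estimated sizes have already been pulled out of the map file (for example with a parser like the one in Suma's answer); the sample data in main() is made up purely for illustration:

    // Sampling walk (sketch): given symbols with estimated sizes, print the
    // routine that each 5% point of the total code space falls into. Large
    // routines - or template methods instantiated many times over - show up
    // on several samples.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Sym { std::string name; unsigned long size; };

    void PrintSamples(const std::vector<Sym>& syms)
    {
        unsigned long long total = 0;
        for (std::size_t i = 0; i < syms.size(); ++i) total += syms[i].size;
        if (total == 0) return;

        unsigned long long cumulative = 0;
        int next = 1;                                   // next 5% boundary to report
        for (std::size_t i = 0; i < syms.size(); ++i) {
            cumulative += syms[i].size;
            // Report this symbol for every 5% boundary it covers.
            while (next <= 20 && cumulative * 20 >= total * next) {
                std::cout << next * 5 << "%\t" << syms[i].name << '\n';
                ++next;
            }
        }
    }

    int main()
    {
        // Hypothetical data - in practice, feed in the parsed map-file symbols.
        std::vector<Sym> syms = {
            {"Tree<int,Foo>::Insert", 4000}, {"Tree<Key,Bar>::Insert", 3800},
            {"ParseConfig",           1200}, {"main",                   300},
        };
        PrintSamples(syms);
    }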

You are also worried about too much inlining making functions larger than they need to be. To figure that out, I would take each of those samples, and since it represents a specific address in a specific function, I would trace it back to the line of code it is in. That way, I can tell if it is in an inline-expanded function. This is a bit of work, but doable.

A similar problem is how to find tumors when disks get full. The idea there is the same: walk the directory tree, adding up the file sizes. Then walk it again, and as you pass each 5% point, print out the path of the file you are in. This tells you not only whether you have large files, but also whether you have large numbers of small files, and it doesn't matter how deeply they are buried or how widely they are scattered. When you clean out one category of files you don't need, you can do it again to get the next category, and so on.

Good luck.

Mike Dunlavey
I've been giving this some thought, and my conclusion seems to be that Ben Straub's answer may be more useful than I thought (possibly - not sure yet). A good idea - but maybe I can get the same results more quickly.
Steve314
@Steve314: If I'm only doing it once, I just do it by hand - get the total size, divide it into 20 evenly separated addresses, then look up each one in the map. It's a little tedious, but typically I don't need to look up all of them before finding something to fruitfully eliminate. If some wasteful category shows up on at least 2 samples, I know it takes enough space to be worth eliminating.
Mike Dunlavey
+2  A: 

Check out Symbol Sort. I used it a while back to figure out why our installer had grown by a factor of 4 in six months (it turns out the answer was static linking of the C runtime and libxml2).

Ben Straub
Looks like this will give me roughly the same information as the already-suggested map-file approach, but the work's already done. Definitely worth a +1.
Steve314
This utility is really great. It even does the template merging.
Suma
This looks like both my short-term just-get-something-done-now answer and (along with Georg's comment on Suma's answer) a long-term answer if I ever decide to invest substantial time. That is - I like the fact that the source is available, and that it apparently (unchecked, but it looks obvious) uses the DbgHelp library, which means it should be pretty easy to extend without getting bogged down in file format issues.
Steve314
+2  A: 

So what I'm reading based on your question and your comments is that the library is not actually too large.

The only tool you need to determine this is a command shell, or Windows File Explorer. Look at the file size. Is it so big that it causes real, actual problems (unacceptable download times, won't fit in memory on the target platform, that kind of thing)?

If not, then you should worry about code readability and maintainability and nothing else. And the tool for that is your eyes. Read the code, and take the actions needed to make it more readable if necessary.

If you can point to an actual reason why the executable size is a problem, please edit that into your question, as it is important context.

However, assuming the file size is actually a problem:

Inlined functions are generally not a problem, because the compiler, and no one else, chooses which functions to inline. Simply marking something `inline` does not mean the generated code actually gets inlined. The compiler inlines a call if it determines the trade-off between larger code and less call overhead to be worth it. If a function is called from many places, inlining it everywhere would dramatically increase code size, which itself hurts performance, so the compiler will usually leave it out of line.

If you're worried that inlined functions cause code bloat, simply compile with the "optimize for size" flag (-Os with GCC, /O1 with Visual C++). The compiler will then restrict inlining to the cases where it doesn't noticeably affect executable size.

For finding out which symbols are biggest, parse the map file as @Suma suggested.

But really, you said it yourself when you mentioned "the well-known optimization-without-profiling mistake."

The very first act of profiling you need to do is to ask: is the executable size actually a problem? In the comments you said that you "have a feeling", which, in a profiling context, is useless, and can be translated as "no, the executable size is not a problem".

Profile. Gather data and identify the trouble spots. Before worrying about how to bring down the executable size, find out what the executable size is, and determine whether or not it is actually a problem. You haven't done that yet. You read in a book that "code bloat is a problem in C++", and so you assume that code bloat is a problem in your program. But is it? Why? How do you determine that it is?

jalf
That's the thing - the compiled library is big. But I don't know how big it *should* be. Even a few minutes seeing what difference refactoring the absolute worst offender makes would be a big clue - if it's easy to make significant reductions, that's a fair indication that there's real bloat. Little or no gain - fine, revert the pointless change and be happy. Gathering data and identifying trouble spots is precisely what I'm trying to do! As for a good reason - when thinking of airing their source in public, I suspect many people worry a bit more about how dirty it is.
Steve314
On the inline-doesn't-mean-inline thing, good point. I know, of course, but somehow that doesn't mean I always take it into account.
Steve314
"the compiled library is big" - Compiled library? I though you care about executable size. What kind of library is this? Dll, or libary intended for static linking? If DLL, the same approach as for exe still applies (DLL is basically an exe). As for static libs, I do not thing their size is relevant in any respect.
Suma
"If not, then you should worry about code readability and maintainability and nothing else." +1
Brian Hooper