Hi,

I don't properly understand compilation and linking of C++ programs. Is there a way I can look at the object files generated by compiling a C++ program (in an understandable format)? This should help me understand the format of object files, how C++ classes are compiled, and what information the compiler needs to generate object files. It should also help me understand statements like:

If a class is used only as an input parameter or return type, we don't need to include the whole class header file; a forward declaration is enough. But if a derived class derives from a base class, we need to include the file containing the definition of the base class (taken from "Exceptional C++").
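The rule quoted above can be tried out in a single file. This is only a sketch: Widget, Gadget, make_widget, consume and name_of are hypothetical names invented for the illustration, and the commented-out line marks the spot where a compiler would reject inheritance from a class it has only seen forward-declared.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Forward declaration only: the compiler knows Widget is a class,
// but not its size or its members.
class Widget;

// Function declarations may use an incomplete type as a parameter or
// return type, by value, by pointer, or by reference.
Widget make_widget(const std::string& name);
void consume(Widget w);
const std::string& name_of(const Widget& w);

// class Gadget : public Widget {};  // error here: the base class is incomplete

// From this point on the full definition is visible, as it would be
// after including the (hypothetical) header that defines Widget.
class Widget {
public:
    explicit Widget(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// Deriving works now, because the size and members of Widget are known.
class Gadget : public Widget {
public:
    Gadget() : Widget("gadget") {}
};

// Definitions of the functions need the complete type as well.
Widget make_widget(const std::string& name) { return Widget(name); }
void consume(Widget) {}
const std::string& name_of(const Widget& w) { return w.name(); }

// Small usage example for the declarations above.
void forward_declaration_demo() {
    Gadget g;
    assert(name_of(g) == "gadget");
    consume(make_widget("w"));
}
```

Calling forward_declaration_demo() from a main function exercises all three declarations.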

I am reading the book "Linking and Loading" to understand the format of object files, but I would prefer something specifically tailored to C++ source code.

Thanks,

Jagrati

Edit:

I know that with nm I can look at symbols present in the object files, but I am interested in knowing more about the object files.

A: 

Have you tried inspecting your binaries with readelf (provided you're on a Linux platform)? This provides pretty comprehensive information on ELF object files.

Honestly, though, I'm not sure how much this would help with understanding compilation and linking. I think the right tack is probably to get a handle on how C++ code maps to assembly pre- and post-linking.

Borealid
A: 

You normally don't need to know the internal format of the Obj files in detail, since they are generated for you. All you need to know is that for every source (CPP) file you compile, the compiler generates an Obj file, which contains the machine code for the classes and functions defined in that file, targeted at the platform you are compiling for. Then the next step, linking, puts together the object files your program needs into a single EXE or DLL (or whatever the equivalent format is on non-Windows OSes). It could also be an EXE plus several DLLs, depending on your wishes.

The most important thing is that you separate the interface (declaration) and implementation (definition) of your class.

Always put only the interface declarations of your class in the header file. Nothing else: no implementations here. Also avoid member variables of custom types that are not pointers, because forward declarations are not enough for them and you need to include other headers in your header. If you have includes in your header, the design smells and the build process slows down.

All implementations of the class methods or other functions should be in the CPP file. This guarantees that code which includes your header depends only on the declarations, not on the implementation, and that includes of other people's headers can stay in the CPP files only.
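A minimal sketch of this header/source split, shown in one file so it can be compiled as is. Car, Engine and the file names in the comments are hypothetical, and std::unique_ptr is my choice for the pointer member; a raw pointer or another smart pointer would give the header the same property.

```cpp
#include <cassert>
#include <memory>

// ---- car.h (hypothetical header) ----
class Engine;  // forward declaration; the header does not include engine.h

class Car {
public:
    Car();
    ~Car();                  // only declared here, see the definition below
    int horsepower() const;
private:
    // A pointer member has a known size even though Engine is incomplete.
    std::unique_ptr<Engine> engine_;
};

// ---- car.cpp (hypothetical source file) ----
// In a real project this definition would come from #include "engine.h".
class Engine {
public:
    int horsepower() const { return 150; }
};

Car::Car() : engine_(new Engine) {}

// The destructor is defined where Engine is complete, because
// std::unique_ptr needs the full type in order to delete it.
Car::~Car() = default;

int Car::horsepower() const { return engine_->horsepower(); }

// Small usage example: client code only needs what car.h declares.
int demo_horsepower() {
    Car car;
    assert(car.horsepower() == 150);
    return car.horsepower();
}
```

With this layout, a change to Engine only forces car.cpp to be recompiled, not every file that includes car.h.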

But why bother? The answer is that with such a separation the build is faster, because each implementation is compiled once, into a single Obj file. Also, if you change your class, only a small number of other object files will change during the next build.

If you have includes in the header, then when the compiler generates the Obj file for your class it must first process the headers of the other classes included in your header, which may in turn include other headers, and so on. There could even be a circular dependency, and then you cannot compile! Or, if you change something in your class, the compiler will need to regenerate a lot of other Obj files, because without this separation they become very tightly dependent on each other over time.

m_pGladiator
RE: "Avoid also member variables, with custom types" -- how can you avoid that w/o using raw pointers? My guess would be smart pointers, but any other ideas?
msiemeri
@msiemeri: I think that advice is overblown anyway. You might want to do that in a few instances to break a dependency cycle, but it's bad as general advice. And yes, in that case one should use a scoped_ptr or similar.
peterchen
A: 

nm is a unix tool which will show you the names of the symbols in an object file.

objdump is a GNU tool which will show you more information.

Both tools, however, show you fairly raw information that is meant for the linker and not designed to be read by human beings. That will probably not help you better understand what happens at the C++ level.
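For anyone who wants to try these tools anyway, here is a small file to experiment with. It is only a sketch: the file name, Counter, add_one and count_twice are made up for the example, and the commands in the comments assume a GNU toolchain on an ELF platform such as Linux.

```cpp
// inspect.cpp (hypothetical file name)
//
// Compile to an object file without linking, then look inside:
//   g++ -c inspect.cpp -o inspect.o
//   nm -C inspect.o           # symbol table, with C++ names demangled
//   objdump -d -C inspect.o   # disassembly, with C++ names demangled
//   readelf -S inspect.o      # section headers of the ELF file
#include <cassert>

class Counter {
public:
    void increment();                     // defined out of line below
    int value() const { return count_; }  // defined in the class, so implicitly inline
private:
    int count_ = 0;
};

// An out-of-line member function; its symbol name is mangled by the compiler.
void Counter::increment() { ++count_; }

// extern "C" switches off C++ name mangling, so the symbol is plain add_one.
extern "C" int add_one(int x) { return x + 1; }

// A free function that uses the class.
int count_twice() {
    Counter c;
    c.increment();
    c.increment();
    assert(c.value() == 2);
    return c.value();
}
```

Running nm without -C on the same object file shows the mangled names next to the unmangled add_one, which is a quick way to see what the linker actually works with.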

dolmen
+1  A: 

First things first. Disassembling the compiler output will most probably not help you in any way to understand any of the issues you have. The output of the compiler is no longer a C++ program but plain assembly, and that is really hard to read if you do not know what the memory model is.

On the particular issue of why the definition of base is required when you declare it to be a base class of derived, there are a few different reasons (and probably more that I am forgetting):

  1. When an object of type derived is created, the compiler must reserve memory for the full instance, including all base class subobjects: it must know the size of base.
  2. When you access a member attribute, the compiler must know its offset from the implicit this pointer, and that offset requires knowledge of the size taken by the base subobject.
  3. When an identifier is parsed in the context of derived and is not found in the derived class, the compiler must know whether it is defined in base before looking for the identifier in the enclosing namespaces. The compiler cannot know whether foo(); is a valid call inside derived::function() if foo() is declared in the base class.
  4. The number and signatures of all virtual functions defined in base must be known when the compiler defines the derived class. It needs that information to build the dynamic dispatch mechanism (usually a vtable), and even to know whether a member function in derived is bound for dynamic dispatch or not: if base::f() is virtual, then derived::f() will be virtual regardless of whether the declaration in derived has the virtual keyword.
  5. Multiple inheritance adds a few other requirements, like relative offsets from each baseX that must be rewritten before final overriders for the methods are called. A pointer of type base2 that points to an object of multiplyderived does not point to the beginning of the instance, but to the beginning of the base2 subobject in the instance, which might be offset by other bases declared before base2 in the inheritance list.
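Several of the points above can be observed from within a program. The sketch below uses hypothetical types Base1, Base2 and MultiplyDerived; the pointer comparison reflects how mainstream compilers lay out objects and is not something the standard promises.

```cpp
#include <cassert>

struct Base1 {
    virtual ~Base1() = default;
    virtual int f() const { return 1; }
    int helper() const { return 7; }
    int a = 1;
};

struct Base2 {
    virtual ~Base2() = default;
    virtual int g() const { return 2; }
    int b = 2;
};

struct MultiplyDerived : Base1, Base2 {
    // No virtual keyword here, but both are virtual because the
    // base class versions are (point 4).
    int f() const { return 10; }
    int g() const { return 20; }
    // helper is not declared in this class; lookup finds it in Base1 (point 3).
    int use_helper() const { return helper(); }
    int c = 3;
};

// The derived object contains its base subobjects (points 1 and 2).
// This holds on mainstream ABIs.
static_assert(sizeof(MultiplyDerived) >= sizeof(Base1),
              "derived object must have room for the Base1 subobject");

// Point 5: a Base2* into a MultiplyDerived usually does not hold the same
// address as the object itself, since Base1 is laid out first.
bool base2_pointer_is_adjusted() {
    MultiplyDerived d;
    Base2* p2 = &d;
    // Casting back applies the reverse adjustment, so this always holds.
    assert(static_cast<MultiplyDerived*>(p2) == &d);
    return static_cast<void*>(p2) != static_cast<void*>(&d);
}

// Calls through base references reach the final overriders in the derived class.
int call_through_bases() {
    MultiplyDerived d;
    const Base1& r1 = d;
    const Base2& r2 = d;
    return r1.f() + r2.g();
}
```

None of this could be compiled if Base1 and Base2 were only forward-declared, since the sizes, offsets and virtual function lists all come from their definitions.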

To the last question in the comments:

So can't instantiation of objects (except for global ones) wait until runtime, so that the size, offsets, etc. could wait until link time, and we wouldn't necessarily have to deal with them at the time we are generating object files?

void f() {
   derived d;
   //...
}

The previous code allocates an object of type derived on the stack. The compiler will add assembler instructions to reserve some amount of memory for the object on the stack. After the compiler has parsed the code and generated the assembly, there is no trace of the object. In particular (assuming a trivial constructor for a POD type, i.e. nothing is initialized), that code and void f() { char array[ sizeof(derived) ]; } will produce exactly the same assembler. When the compiler generates the instruction that reserves the space, it needs to know how much.
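The size argument can be checked directly with sizeof. This is a sketch with made-up types (base, derived, wide_base, derived_from_wide); it compares the number of bytes involved rather than the generated assembly, which depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

struct base { int x; };
struct derived : base { int y; };

// Bytes the compiler has to reserve for a local object of type derived.
std::size_t bytes_for_object() {
    derived d;  // trivially constructed, nothing is initialized
    (void)d;
    return sizeof d;
}

// Bytes reserved for a plain buffer of the same size.
std::size_t bytes_for_buffer() {
    char array[sizeof(derived)];
    (void)array;
    return sizeof array;
}

// A second hierarchy whose base carries more data.
struct wide_base { int x; double extra; };
struct derived_from_wide : wide_base { int y; };

// The size of a derived class depends on its base, which is why the
// compiler needs the definition of the base and not just its name.
bool size_depends_on_base() {
    return sizeof(derived_from_wide) > sizeof(derived);
}

void check_same_reservation() {
    assert(bytes_for_object() == bytes_for_buffer());
}
```

Both functions report the same number of bytes, and that number is fixed when the translation unit is compiled, long before the linker runs.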

David Rodríguez - dribeas
A: 

I'm reading "http://www.network-theory.co.uk/docs/gccintro/" - "Introduction to GCC". This has given me good insight into linking and compiling. It's at a beginner's level, but I don't care.

mslot