
Many bad things happened and continue to happen (or not, who knows, anything can happen) due to undefined behavior. I understand that this was introduced to leave some wiggle-room for compilers to optimize, and maybe also to make C++ easier to port to different platforms and architectures. However, the problems caused by undefined behavior seem to be too large to be justified by these arguments. What are other arguments for undefined behavior? If there are none, why does undefined behavior still exist?

Edit: To add some motivation for my question: due to several bad experiences with less C++-crafty co-workers, I have gotten used to making my code as safe as possible: assert every argument, rigorous const-correctness, and stuff like that. I try to leave as little room as possible to use my code the wrong way, because experience shows that if there are loopholes, people will use them, and then they will call me about my code being bad. I consider making my code as safe as possible good practice. This is why I do not understand why undefined behavior exists. Can someone please give me an example of undefined behavior that cannot be detected at compile time or runtime without considerable overhead?

+6  A: 

The problems are not caused by undefined behaviour, they are caused by writing the code that leads to it. The answer is simple - don't write that kind of code - not doing so is not exactly rocket science.

As for:

an example of undefined behavior that cannot be detected at runtime or compile time without considerable overhead

A real world issue:

int * p = new int;
// call loads of stuff which may create an alias to p called q:
int * q = p;
delete p;

// call more stuff, somewhere in which you do:
delete q;   // undefined behavior: the same memory is deleted twice

Detecting this at compile time is impossible. At run time it is merely extremely difficult, and would require the memory allocation system to do far more book-keeping (i.e. be slower and take up more memory) than is the case if we simply say the second delete is undefined. If you don't like this, perhaps C++ is not the language for you - why not switch to Java?
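
To see what "far more book-keeping" means, here is a toy sketch, assuming we replace the global allocation functions; it is illustrative only, not production code:

#include <cstdlib>
#include <new>

// Toy bookkeeping: remember every live allocation so delete can be checked.
// A real implementation would need a growable table, thread safety,
// operator new[]/delete[], and so on.
namespace {
    const std::size_t MAX_LIVE = 10000;
    void* live[MAX_LIVE];
    std::size_t count = 0;
}

void* operator new(std::size_t n) throw(std::bad_alloc)   // pre-C++11 spec
{
    void* p = std::malloc(n ? n : 1);
    if (!p) throw std::bad_alloc();
    if (count < MAX_LIVE) live[count++] = p;   // extra memory per allocation
    return p;
}

void operator delete(void* p) throw()
{
    if (!p) return;
    for (std::size_t i = 0; i < count; ++i) {  // extra O(n) search per delete
        if (live[i] == p) {
            live[i] = live[--count];
            std::free(p);
            return;
        }
    }
    std::abort();                              // double or invalid delete caught
}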

anon
That is true. However, if a "feature" leads to so many wasted hours of work, you've got to wonder why it is not just removed. There has to be a pretty impressive upside to justify all that.
Space_C0wb0y
undefined behavior is not a "feature"
Carson Myers
Why then, what exactly is it?
Space_C0wb0y
@Space_C0wb0y it's behavior that has not been defined because defining it was inconvenient for the committee, would harm the portability of the language, or would make the compilers impossibly hard to write. I mean... it's not a feature, it's a lack of certain features that weren't needed or were implausible.
Carson Myers
So it comes down to politics? That is disappointing but expected.
Space_C0wb0y
Your answer is from the programmer perspective, while the question is posed from a language design perspective. Q:"Why does this bad thing exist?" A:"Avoid it"
Ross
I don't really see it as politics, as far as the committee thing goes. I mean there's only so much time to write the language spec, and it already takes a really long time. Defining all the things you're not supposed to do anyway seems like a silly waste of time.
Carson Myers
If one were to compile statistics about the amount of money wasted due to bugs that can be traced back to undefined behavior, I am certain you could pay a lot of smart people to write very detailed specs with that.
Space_C0wb0y
No, it is not politics, but rather engineering. Not everything can be checked within reasonable terms. Say that dereferencing an invalid pointer is changed from undefined behavior to a known error. Then the standard would be **requiring** all implementations to perform checks around each and every pointer dereference to produce that error. And I am not just talking about dereferencing null, but **all** pointers. Whenever you see `*p`, you would have to verify that `p` is a pointer to a valid block of memory, requiring the runtime to track all allocated memory for that check.
David Rodríguez - dribeas
I second David. Think of all the money which is saved by not doing those extra checks. Think of all the things that are possible by not doing those extra checks. Millions are spent on programmer errors... but a NullPointerException is an error too, and money is lost if your program was not built for an exception-throwing environment. So in the end, UB does not affect the money lost; careless programming does, whatever the language... and it's just hard to program correctly :)
Matthieu M.
On the amount of money that is lost tracking bugs: Undefined Behavior from the standard's point of view does not mean that it has to be undefined in your particular implementation. Many implementations have specific code in debug builds to diagnose errors. Different implementations will offer greater diagnostics support to try and grab a bigger piece of the market. Not having it standardized means that the same implementation can do bounds checking on iterators in debug mode and at the same time have a fast unchecked release version.
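
Concretely, assuming GCC's libstdc++ (the macro is real; other implementations offer similar debug switches):

// Build with g++ -D_GLIBCXX_DEBUG and the out-of-range access below
// aborts with a diagnostic; a normal build performs no check at all.
// Same source, same standard -- exactly what "undefined" permits.
#include <vector>

int main()
{
    std::vector<int> v(3);
    return v[5];   // UB per the standard; caught only by the checked build
}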
David Rodríguez - dribeas
+6  A: 

My take on undefined behavior is this:

The standard defines how the language is to be used, and how the implementation is supposed to react when used in the correct manner. However, it would be a lot of work to cover every possible use of every feature, so the standard just leaves it at that.

However, in a compiler implementation you can't just "leave it at that": the code has to be turned into machine instructions, and you can't just leave blank spots. In many cases the compiler can throw an error, but that's not always feasible: there are some instances where it would take extra work to check whether the programmer is doing the wrong thing (for instance: calling a destructor twice -- to detect this, the compiler would have to count how many times certain functions have been called, or add extra state, or something). So if the standard doesn't define it, and the compiler just lets it happen, odd things can sometimes happen, maybe, if you're unlucky.
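
For instance, a minimal sketch of the double-destruction case; catching the second call would require the implementation to track per-object "already destroyed" state:

struct Resource {
    ~Resource() { /* release something */ }
};

int main()
{
    Resource r;
    r.~Resource();   // explicit destructor call: fine on its own
}                    // scope exit destroys r again -- the second call is UB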

Carson Myers
Well, Java for instance provides a reference implementation. This is as clear a definition as it gets. Why is this not done here? If it has to be defined at some point, why not define it as early as possible?
Space_C0wb0y
I'm not going to pretend to be an expert on all C++ internals, but the fact that C++ programs run closer to the metal than Java is probably a huge reason. The language spec has to leave room for implementations on vastly different hardware.
Carson Myers
It's more a matter of philosophy. C++ means speed, and throw caution to the wind. Extra checks don't mesh in this philosophy.
Matthieu M.
@Matthieu: Not precisely. C++ means allowing speed, and having caution as an option. As Stroustrup put it, you can build safety on top of a fast implementation, but you can't build speed on top of a safe implementation, assuming that safety and speed conflict. If you want fast access to elements of a vector, you use `[]`. If you want checked access, you use `.at()`. They're both in the Standard.
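
Concretely, a minimal sketch of the two access forms, both of which are in the Standard:

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(3, 7);

    int fast = v[1];     // unchecked: out of range here would be UB, zero overhead
    int safe = v.at(1);  // checked: out of range throws std::out_of_range

    std::cout << fast + safe << "\n";
}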
David Thornley
I don't argue with the idea of offering both; I argue with the fact that most developers don't need speed but still use the idiomatic way of accessing an index, `[]`, which is also unsafe... And thus I would prefer having a safe idiomatic way, and an unsafe alternative for those who really need speed.
Matthieu M.
+1  A: 

Many things that are defined as undefined behavior would be hard if not impossible to diagnose by the compiler or runtime environment.

The ones that are easy have already turned into de facto defined behavior. Consider calling a pure virtual method: it is undefined behavior, but most compilers/runtime environments will report the error in the same terms: "pure virtual method called". The de facto standard is that calling a pure virtual method is a runtime error in all environments I know of.
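
A minimal sketch of the classic way to trigger that diagnostic, calling a virtual function (indirectly) from a constructor:

struct Base {
    Base() { init(); }        // during construction the dynamic type is Base
    virtual ~Base() {}
    virtual void doIt() = 0;
    void init() { doIt(); }   // UB: the Derived override does not exist yet
};

struct Derived : Base {
    void doIt() {}
};

int main()
{
    Derived d;   // typically aborts with "pure virtual method called"
}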

David Rodríguez - dribeas
And why does nobody make the defacto standard a real standard?
Space_C0wb0y
What difference would that make? That is, the implementation is standard-compliant (well, anything is compliant with UB), and users get the information that they need. What need is there to modify the standard? If it ain't broken, don't fix it -- you might just break it in a different way.
David Rodríguez - dribeas
Keep in mind that if you want to "standardize" it, you're rapidly creating more problems than you solve. Must the message go to std::cout or std::cerr? What if they're redirected? What if the pure virtual call happens inside the redirecting `streambuf`? Must the message be in English or can it be localized? And most damning: my application's users won't understand it anyway.
MSalters
A: 

There are times when undefined behavior is good. Take a big int for example.

union BitInt
{
    __int64 Whole;
    struct
    {
        int Upper;
        int Lower; // or maybe it's lower upper. Depends on architecture
    } Parts;
};

The spec says if we last read or wrote to Whole then reading/writing from Parts is undefined.

Now, that's just a tad silly to me because if we couldn't touch any other parts of the union then there is no point in having the union in the first place, right?

But anyway, maybe some functions will take __int64 while other functions take the two separate ints. Rather than convert every time, we can just use this union. Every compiler I know treats this undefined behavior in a pretty clear way, so in my opinion undefined behavior isn't so bad here.
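
For comparison, a sketch of a fully defined alternative (assuming the same MSVC-style __int64 as above): extract the halves by value instead of aliasing the storage, which also makes the endianness question disappear:

unsigned __int64 join(unsigned upper, unsigned lower)
{
    return (static_cast<unsigned __int64>(upper) << 32) | lower;
}

void split(unsigned __int64 whole, unsigned& upper, unsigned& lower)
{
    lower = static_cast<unsigned>(whole & 0xFFFFFFFFu);  // low 32 bits
    upper = static_cast<unsigned>(whole >> 32);          // high 32 bits
}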

ProgramMax
As you note in the comment yourself, this behaviour is architecture-dependent regarding endianness (and also regarding padding and size of fields). So it works on some platforms and does not work on others. If you stay on a platform where it works, fine for you; but then you are leaning on that specific platform architecture and compiler implementation.
Péter Török
Yeah, this is true. But you get what I'm saying about "Then why even have union if the only benefit it provides is undefined behavior anyway?" What I'm trying to point out is the union keyword exists because not all undefined behavior is bad.
ProgramMax
Unions compress multiple types into a single block of memory. This usage is quite useful, and does not require any dependence on undefined behavior.
Dennis Zickefoose
Hrmm. The C standard section 6.5 / 7 says that once you write to one part of a union access to all others is undefined. But looking at the C++ spec section 9.5 about unions doesn't say anything about that.
ProgramMax
@ProgramMax: 9.5 does actually speak to this, only in riddles you must decipher: 9.5.1: "In a union, at most one of the data members can be active at any time, that is, the value of at most one of the data members can be stored in a union at any time."
John Dibling
Ah, thanks John. :D I jumped to 9.5 "Unions" and then skimmed for "undefined" and didn't find it. But you are right, it is there...just in hints. :D
ProgramMax
There's tons of undefined behavior in C++, and most of it is undefined by omission.
MSalters
+1  A: 

I asked myself that same question a few years ago. I stopped asking it right away, when I tried to provide a proper definition for the behavior of a function that writes through a null pointer.

Not all devices have a concept of protected memory, so you can't possibly rely on the system to protect you via a segfault or similar. Not all devices have read-only memory, so you can't possibly say that the write simply does nothing. The only other option I could think of is to require that the application raise an exception [or abort, or something] without help from the system. But in that case, the compiler has to insert code before every single memory write to check for null, unless it can guarantee that the pointer has not changed since the last check. That is clearly unacceptable.
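
A sketch of what that inserted code would amount to; checked_store is a hypothetical helper, not anything a real compiler emits by default:

#include <cstdlib>

void checked_store(int* p, int value)
{
    if (p == 0)        // the check the compiler would have to insert...
        std::abort();  // ...and some mandated, defined reaction
    *p = value;
}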

So, leaving the behavior undefined was the only logical decision I could come to, without saying "Compliant C++ compilers can only be implemented on platforms with protected memory."

Dennis Zickefoose
"That is clearly unacceptable". I don't think it's all that clear. I've worked with a Java JIT that did exactly this on certain CPUs. Performance was fine. C and C++ programmers are pretty much defined as people to whom this is unacceptable, but there's no particular reason that they (we - I include myself for some projects) should be (a) numerous, or (b) always correct to rule it out ;-)
Steve Jessop
@Steve: I agree, most of the time it's perfectly acceptable to check; the only issue is that in a number of tight loops we need unchecked versions to make things run faster (or be doomed). Unfortunately programmers think in binary terms and often use the same idiom everywhere, thus writing the same unchecked calls even when speed isn't required :'(
Matthieu M.
The Java approach to this is to let the JIT hoist bounds checks outside the loop where possible. You're right, there are some cases where the programmer knows the bounds will not be exceeded, but the proof is too hard for the compiler/JIT to produce, *and* the performance cost would be significant. So, OK, if there did not exist a language in which the checks could be omitted, that would be unacceptable and someone would invent one. And quickly use it to omit bounds checks in cases where their proof that the bounds will not be exceeded is unavailable to the compiler because it's *wrong* ;-)
Steve Jessop
+3  A: 

I think the heart of the concern comes from the C/C++ philosophy of speed above all.

These languages were created at a time when raw computing power was scarce and you needed all the optimizations you could get just to have something usable.

Specifying how to deal with UB would mean detecting it in the first place, and then of course specifying the handling proper. However, detecting it goes against the speed-first philosophy of the languages!

Today, do we still need fast programs? Yes, for those of us working either with very limited resources (embedded systems) or with very harsh constraints (on response time or transactions per second), we do need to squeeze out as much as we can.

I know the motto: throw more hardware at the problem. We have an application where I work:

  • expected time for an answer? Less than 100ms, with DB calls in the midst (say thanks to memcached).
  • number of transactions per second? 1200 on average, with peaks at 1500/1700.

It runs on about 40 monsters: 8 dual-core Opterons (2800MHz) with 32GB of RAM. It gets difficult to be "faster" with more hardware at this point, so we need optimized code, and a language that allows it (we did refrain from throwing assembly code in there).

I must say that I don't care much for UB anyway. If you get to the point that your program invokes UB, then it needs fixing, whatever behavior actually occurred. Of course it would be easier to fix such bugs if they were reported straight away: that's what debug builds are for.

So perhaps, instead of focusing on UB, we should learn to use the language:

  • don't use unchecked calls
  • (for experts) don't use unchecked calls
  • (for gurus) are you sure you really need an unchecked call here?

And everything is suddenly better :)

Matthieu M.
+1  A: 

Compilers and programming languages are one of my favorite topics. In the past I did some research related to compilers, and I have run into undefined behavior many, many times.

C++ and Java are very popular. That does not mean that they have a great design. They are widely used because they took risks, to the detriment of their design quality, just to gain acceptance. Java went for garbage collection, a virtual machine, and a pointer-free appearance. They were partly pioneers and could not learn from many previous projects.

In the case of C++, one of the main goals was to bring object-oriented programming to C users: even C programs should compile with a C++ compiler. That left a lot of nasty open points, and C already had many ambiguities. The C++ emphasis was power and popularity, not integrity. Not many languages give you multiple inheritance; C++ gives you that, although not in a very polished way. Undefined behavior will always be there to support its glory and backwards compatibility.

If you really want a robust and well-defined language, you must look somewhere else. Sadly, that is not the main concern of most people. Ada, for example, is a great language where clear and defined behavior is important, but hardly anyone cares about it because of its narrow user base. I am biased with the example because I really like that language. I posted something on my blog, but if you want to learn more about how a language definition can help you have fewer bugs even before you compile, have a look at these slides

I am not saying C++ is a bad language! It just has different goals, and I love working with it. You also get a large community, great tools, and much more great stuff such as the STL, Boost and Qt. But your doubt is also the root of becoming a great C++ programmer: if you want to be great with C++, this should be one of your concerns. I would encourage you to read the previous slides and also this critique. They will help you a lot in understanding those times when the language is not doing what you expect.

And by the way: undefined behavior goes totally against portability. In Ada, for example, you have control over the layout of data structures (in C and C++ it can change depending on machine and compiler), and threads are part of the language. So porting C and C++ software will give you more pain than pleasure

Francisco Garcia
Strange then that vast amounts of highly portable software has been written in C and C++ - more than in any other languages, I would estimate.
anon
@Neil: With C/C++, if you switch compilers you might alter the layout of your structures (worse when switching microprocessors). If you use threads on Linux, you will have to use a different library when going to Windows. Ada does not have those problems (nor many others), but hardly anyone uses it because of its difficulty and lack of popularity. That you can easily shoot your foot (C) or blow it away (C++) is what people love, because both give immediate low-level access to all the power of their CPUs. Portability could be easier with other languages, but even I would choose C/C++ just because of their popularity
Francisco Garcia
At the time of this comment I have noticed that, after clicking the "undefined behavior" tag, all the questions on Stack Overflow (except one) are tagged with either C or C++
Francisco Garcia
A: 

Here's my favourite: after you've done delete on a non-null pointer, any use of it (not only dereferencing, but also casting, etc.) is UB (see this question).

How you can run into UB:

{
    char* pointer = new char[10];
    delete[] pointer;
    // some other code
    printf( "deleted %p\n", (void*)pointer );   // even reading the freed pointer's value is UB
}

Now, on all architectures I know, the code above will run fine. Teaching the compiler or runtime to analyse such situations is very hard and expensive. Don't forget that sometimes there might be millions of lines of code between the delete and the use of the pointer. Setting pointers to null immediately after delete can be costly, so it's not a universal solution either.
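
For what it's worth, the usual partial mitigation looks like this; note that it does nothing for copies of the pointer held elsewhere, which is the point above:

char* pointer = new char[10];
delete[] pointer;
pointer = 0;   // later use through 'pointer' is now at least deterministic,
               // but copies of the old value elsewhere still dangle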

That's why there's the concept of UB. You don't want UB in your code: maybe it works, maybe it doesn't; it works on this implementation and breaks on another.

sharptooth
A: 

The standard leaves "certain" behaviour undefined in order to allow a variety of implementations, without burdening those implementations with the overhead of detecting "certain" situations, or burdening the programmer with constraints required to prevent those situations arising in the first place.

There was a time when avoiding this overhead was a major advantage of C and C++ for a huge range of projects.

Computers are now several thousand times faster than they were when C was invented, and the overheads of things like checking array bounds all the time, or having a few megabytes of code to implement a sandboxed runtime, don't seem like a big deal for most projects. Furthermore, the cost of (e.g.) overrunning a buffer has increased by several factors, now that our programs handle many megabytes of potentially-malicious data per second.

It is therefore somewhat frustrating that there is no language which has all of C++'s useful features, and which in addition has the property that the behaviour of every program which compiles is defined (subject to implementation-specific behaviour). But only somewhat - it's not actually all that difficult in Java to write code whose behaviour is so confusing that from the POV of debugging, if not security, it might as well be undefined. It's also not at all difficult to write insecure Java code - it's just that the insecurity usually is limited to leaking sensitive information or granting incorrect privileges over the app, rather than giving up complete control of the OS process the JVM is running in.

So the way I see it is that good software engineering requires discipline in all languages; the difference is what happens when our discipline fails, and how much we're charged by other languages (in performance, footprint, and C++ features you like) for insurance against that. If the insurance provided by some other language is worth it for your project, take it. If the features provided by C++ are worth paying for with the risk of undefined behaviour, take C++.

I don't think there's much mileage in trying to argue, as if it were a global property that's the same for everyone, whether the benefits of C++ "justify" the costs. They're justified within the terms of reference for the design of the C++ language, which are that you don't pay for what you don't use. Hence, correct programs should not be made slower just so that incorrect programs get a useful error message instead of UB, and most of the time the behaviour in unusual cases (e.g. << 32 of a 32-bit value) should not be defined (e.g. to result in 0) if that would require the unusual case to be checked for explicitly on hardware on which the committee wants to support C++ "efficiently".
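
To make the shift example concrete, here is a sketch of why defining the result would cost something:

// On x86 the variable shift count is masked to 5 bits, so the hardware
// computes x << (n % 32); defining "x << 32 == 0" would force compilers
// to emit an explicit range test on every variable-count shift.
unsigned shl(unsigned x, unsigned n)
{
    return x << n;   // UB when n >= 32 (assuming 32-bit unsigned)
}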

Look at another example: I don't think the performance benefits of Intel's professional C and C++ compiler justify the cost of buying it. Hence, I haven't bought it. Doesn't mean others will make the same calculation I made, or that I will always make the same calculation in future.

Steve Jessop
+2  A: 

The main source of undefined behaviour is pointers, and that's why C and C++ have a lot of undefined behaviour.

Consider this code:

char * r = reinterpret_cast<char *>(0x012345ff);  // a cast is needed: an integer does not implicitly convert to a pointer
std::cout << r;                                   // reads from that address as a C string

This code looks very bad, but should it issue an error? What if that address is indeed readable, i.e. it's a value I obtained somehow (maybe a device address, etc.)?

In cases like this, there's no way to know if the operation is legal or not, and if it isn't, its behaviour is indeed unpredictable.

Apart from this: in general C++ was designed with "The zero overhead rule" in mind (see The Design and Evolution of C++), so it couldn't possibly impose any burden on implementations to check for corner cases etc. You should always keep in mind that this language was designed and is indeed used not only on the desktop but also in embedded systems with limited resources.

UncleZeiv
+1  A: 

It's important to be clear on the difference between undefined behavior and implementation-defined behavior. Implementation-defined behavior gives compiler writers the opportunity to add extensions to the language in order to leverage their platform. Such extensions are necessary in order to write code that works in the real world.

UB, on the other hand, exists in cases where it is difficult or impossible to engineer a solution without imposing major changes to the language or big differences from C. One example, taken from a page where Bjarne Stroustrup talks about this, is:

int a[10];
a[100] = 0; // range error
int* p = a;
// ...
p[100] = 0; // range error (unless we gave p a better value before that assignment)

The range error is UB. It is an error, but how precisely the platform should deal with it is left undefined by the Standard, because the Standard can't define it: each platform is different. It can't be mandated to be a diagnosed error, because that would necessitate including automatic range checking in the language, which would require a major change to the language's feature set. The p[100] = 0 error is even more difficult for the language to generate a diagnostic for, either at compile time or run time, because the compiler can't know what p really points to without run-time support.
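
A sketch of what a "defined" range error costs when done in library form; this is essentially what std::vector::at already does, a comparison on every access:

#include <stdexcept>

struct CheckedArray {
    int data[10];
    int& at(int i)
    {
        if (i < 0 || i >= 10)
            throw std::out_of_range("range error");
        return data[i];
    }
};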

John Dibling