We overload the global new and delete operators where I work for many reasons:
- pooling all small allocations -- decreases overhead, decreases fragmentation, can increase performance for small-alloc-heavy apps
- framing allocations with a known lifetime -- ignore all the frees until the very end of this period, then free all of them together (admittedly we do this more with local operator overloads than global)
- alignment adjustment -- to cache-line boundaries, etc.
- alloc fill -- helping to expose usage of uninitialized variables
- free fill -- helping to expose usage of previously deleted memory
- delayed free -- increasing the effectiveness of free fill, occasionally increasing performance
- sentinels or fenceposts -- helping to expose buffer overruns, underruns, and the occasional wild pointer
- redirecting allocations -- to account for NUMA, special memory areas, or even to keep separate systems separate in memory (e.g. for embedded scripting languages or DSLs)
- garbage collection or cleanup -- again useful for those embedded scripting languages
- heap verification -- you can walk through the heap data structure every N allocs/frees to make sure everything looks ok
- accounting, including leak tracking and usage snapshots/statistics (stacks, allocation ages, etc)
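To make a few of the debugging items above concrete, here is a minimal sketch of a global replacement that does alloc fill, free fill, and a trailing fencepost. The `dbg` namespace, the fill bytes, the fence size, and the `g_corruption_seen` flag are all made up for illustration. It is single-threaded and ignores the aligned (`std::align_val_t`) overloads, so treat it as a starting point rather than a finished allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <new>

namespace dbg {

// Illustrative values only; any recognizable patterns would do.
constexpr unsigned char kAllocFill = 0xCD;  // written into fresh blocks
constexpr unsigned char kFreeFill  = 0xDD;  // written into released blocks
constexpr unsigned char kFence     = 0xFD;  // guard bytes after the user block
constexpr std::size_t   kFenceSize = 8;

// Header stored in front of every block. alignas keeps the user pointer
// aligned as strictly as malloc's own result.
struct alignas(std::max_align_t) Header {
    std::size_t size;  // size the caller asked for
};

bool g_corruption_seen = false;  // a real system would log or break here

void* allocate(std::size_t size) {
    void* raw = std::malloc(sizeof(Header) + size + kFenceSize);
    if (!raw) return nullptr;
    auto* header = static_cast<Header*>(raw);
    header->size = size;
    auto* user = reinterpret_cast<unsigned char*>(header + 1);
    assert(reinterpret_cast<std::uintptr_t>(user) % alignof(std::max_align_t) == 0);
    std::memset(user, kAllocFill, size);            // alloc fill
    std::memset(user + size, kFence, kFenceSize);   // fencepost
    return user;
}

// True if the guard bytes after the block are untouched.
bool fence_intact(const void* p) {
    auto* user = static_cast<const unsigned char*>(p);
    const Header* header = reinterpret_cast<const Header*>(user) - 1;
    for (std::size_t i = 0; i < kFenceSize; ++i)
        if (user[header->size + i] != kFence) return false;
    return true;
}

void release(void* p) noexcept {
    if (!p) return;
    if (!fence_intact(p)) g_corruption_seen = true;  // overrun detected
    Header* header = static_cast<Header*>(p) - 1;
    std::memset(p, kFreeFill, header->size);         // free fill
    std::free(header);
}

}  // namespace dbg

// Global replacements. The default new[]/delete[] forward to these.
void* operator new(std::size_t size) {
    if (void* p = dbg::allocate(size)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { dbg::release(p); }
void operator delete(void* p, std::size_t) noexcept { dbg::release(p); }
```

Because the free fill happens immediately before `std::free`, the underlying heap may reuse and overwrite the block right away; a delayed-free queue in front of `std::free` is what keeps the pattern visible for longer.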
The idea of new/delete accounting is really flexible and powerful: you can, for example, record the entire callstack for the active thread whenever an alloc occurs, and aggregate statistics about that. You could ship the stack info over the network if you don't have space to keep it locally for whatever reason. The types of info you can gather here are only limited by your imagination (and performance, of course).
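As a rough sketch of the simplest form of this accounting, the replacement below keeps running counts, live bytes, and a high-water mark. The `memstats` namespace and every name in it are hypothetical. Call-stack capture is left out because it is platform-specific; the place where it would hook in is marked with a comment.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

namespace memstats {

// Per-block header so the size is known again at free time.
struct alignas(std::max_align_t) Header {
    std::size_t size;
};

// Constant-initialized, so they are usable before any dynamic initializers run.
std::atomic<std::size_t> g_alloc_count{0};
std::atomic<std::size_t> g_free_count{0};
std::atomic<std::size_t> g_live_bytes{0};
std::atomic<std::size_t> g_peak_bytes{0};

void* tracked_alloc(std::size_t size) {
    auto* header = static_cast<Header*>(std::malloc(sizeof(Header) + size));
    if (!header) return nullptr;
    header->size = size;

    g_alloc_count.fetch_add(1, std::memory_order_relaxed);
    std::size_t live =
        g_live_bytes.fetch_add(size, std::memory_order_relaxed) + size;

    // Raise the high-water mark if this allocation set a new peak.
    std::size_t peak = g_peak_bytes.load(std::memory_order_relaxed);
    while (live > peak &&
           !g_peak_bytes.compare_exchange_weak(peak, live,
                                               std::memory_order_relaxed)) {
    }

    // A fuller system would record the call stack or an allocation tag here.
    return header + 1;
}

void tracked_free(void* p) noexcept {
    if (!p) return;
    Header* header = static_cast<Header*>(p) - 1;
    g_free_count.fetch_add(1, std::memory_order_relaxed);
    g_live_bytes.fetch_sub(header->size, std::memory_order_relaxed);
    std::free(header);
}

// Blocks allocated but not yet freed; nonzero at shutdown suggests a leak.
std::size_t outstanding() {
    return g_alloc_count.load(std::memory_order_relaxed) -
           g_free_count.load(std::memory_order_relaxed);
}

}  // namespace memstats

void* operator new(std::size_t size) {
    if (void* p = memstats::tracked_alloc(size)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { memstats::tracked_free(p); }
void operator delete(void* p, std::size_t) noexcept { memstats::tracked_free(p); }
```

Snapshotting `g_live_bytes` and `outstanding()` before and after a subsystem runs gives a cheap first-pass leak check for that subsystem.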
We use global overloads because it's convenient to hang lots of common debugging functionality there, as well as make sweeping improvements across the entire app, based on the statistics we gather from those same overloads.
We do still use custom allocators for individual types too; in many cases the speedup or capabilities you can get by providing a custom allocator for e.g. a single point-of-use of an STL data structure far exceed the general speedup you can get from the global overloads.
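For a flavor of the per-container approach, here is a minimal sketch of a bump ("frame") arena plugged into a single `std::vector`. `Arena`, `ArenaAllocator`, and `sum_with_arena` are hypothetical names, and the sketch assumes the caller supplies a buffer aligned to `std::max_align_t` and only requests power-of-two alignments.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Bump arena: allocating advances an offset, individual frees are ignored,
// and reset() releases everything at once.
class Arena {
public:
    // Assumes `buffer` is aligned to alignof(std::max_align_t).
    Arena(void* buffer, std::size_t capacity)
        : base_(static_cast<unsigned char*>(buffer)), capacity_(capacity) {}

    void* allocate(std::size_t bytes, std::size_t align) {
        // Round the offset up to `align` (assumed to be a power of two).
        std::size_t start = (offset_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) throw std::bad_alloc();
        offset_ = start + bytes;
        return base_ + start;
    }
    void reset() noexcept { offset_ = 0; }
    std::size_t used() const noexcept { return offset_; }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t offset_ = 0;
};

// Minimal C++11-style allocator that forwards to an Arena.
template <class T>
struct ArenaAllocator {
    using value_type = T;

    explicit ArenaAllocator(Arena& a) noexcept : arena(&a) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // freed in bulk by Arena::reset

    Arena* arena;
};

template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

// Example point-of-use: a scratch vector that never touches the global heap.
int sum_with_arena(Arena& arena, int n) {
    ArenaAllocator<int> alloc(arena);
    std::vector<int, ArenaAllocator<int>> values(alloc);
    values.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) values.push_back(i);
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}
```

The trade-off is that nothing is reclaimed until `reset()`, so this only suits data whose lifetime really does end together. On C++17 and later, `std::pmr::monotonic_buffer_resource` offers a standard version of the same idea.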
Take a look at some of the allocators and debugging systems that are out there for C/C++ and you'll rapidly come up with these and other ideas.
(One old but seminal book is Writing Solid Code, which discusses many of the reasons you might want to provide custom allocators in C, most of which are still very relevant.)
Obviously if you can use any of these fine tools you will want to do so rather than rolling your own.
But there are situations in which rolling your own is faster, easier, or less of a business/legal hassle, in which nothing is available for your platform yet, or in which it is simply more instructive. In those cases, dig in and write a global overload.