Ah, the good old days! When men were men, women were women, and we wrote code by carving it into the silicon wafers with a hammer and chisel.
I can't recall ever having written an entire program in assembly language. I might have done a small one, just for fun, on a TRS-80 in Z80 assembler back about 1979 or so.
However, back in the mid-80s, a lot of C compilers were terrible at code optimization. They made up for it by letting you embed assembly language in the middle of C. So a lot of times, I would find some nested loop that was chewing up a lot of time, and speed up the innermost part by recoding it from C into assembly.
Also in the same era, as you mention, assembly language could sometimes get you to special hardware features that weren't available from C.
For instance, I was on a team that got tipped off that the Mac II was coming out with a floating point unit on a chip. None of the existing Mac compilers or assemblers could use it, but I was able to hand-code some machine language instructions for it, and wrap those in 68000 assembly, and wrap those in C. I believe that was the first shipping, commercial Mac program that could use the FPU directly. It wound up being about 100x as fast as not having an FPU, and several times faster than using Apple's software wrappers for the FPU.
Later on, I worked on Mac AutoCAD when there was such a thing. The first revision had a bug, and they didn't want to replace the whole program. So I wrote a patcher that fit on a single floppy. It ran through the code and patched the buggy stuff by replacing it with a jump to some non-buggy code which it tacked on. That all had to be done in 68k assembly.
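For anyone curious what that kind of patch looks like, here is a toy sketch in Python. It is not the original patcher, just an illustration of the general trick under some loud assumptions: the program is a flat block of bytes loaded at a known base address, the buggy region is at least six bytes long, and the fix ends by jumping back to the first instruction after the buggy region. The function name `apply_patch` and its parameters are made up for this example. The opcode bytes are the 68k encodings for `JMP (xxx).L` and `NOP`.

```python
import struct

JMP_ABS_LONG = bytes([0x4E, 0xF9])  # 68k JMP (xxx).L, followed by a 32-bit address
NOP = bytes([0x4E, 0x71])           # 68k NOP, used to pad out the old code


def apply_patch(image, bug_start, bug_end, fix_code, base=0):
    """Return a patched copy of `image` (hypothetical helper, toy model only).

    The buggy bytes in [bug_start, bug_end) are overwritten with a jump to
    `fix_code`, which is appended to the end of the image together with a
    jump back to `bug_end`. `base` is the assumed load address of the image.
    """
    region = bug_end - bug_start
    if region < 6:
        raise ValueError("buggy region too small to hold a 6-byte JMP")
    if bug_start % 2 or region % 2 or len(image) % 2 or len(fix_code) % 2:
        raise ValueError("68k instructions must stay word-aligned")

    patched = bytearray(image)
    fix_addr = base + len(patched)  # the fix lands right after the old code

    # Overwrite the buggy region: jump out, then pad the remainder with NOPs
    # so every byte after the region keeps its original offset.
    jump_out = JMP_ABS_LONG + struct.pack(">I", fix_addr)
    padding = NOP * ((region - len(jump_out)) // 2)
    patched[bug_start:bug_end] = jump_out + padding

    # Tack on the replacement code, then jump back to just past the region.
    jump_back = JMP_ABS_LONG + struct.pack(">I", base + bug_end)
    patched += fix_code + jump_back
    return bytes(patched)


if __name__ == "__main__":
    program = NOP * 8            # stand-in for the shipped code
    replacement = NOP * 2        # stand-in for the corrected code
    result = apply_patch(program, 4, 12, replacement)
    print(result.hex(" ", 2))
```

Padding with NOPs rather than shrinking the region matters because anything else in the program that refers to later code by address or offset keeps working. A real patcher also has to cope with the application's actual on-disk format and relocation scheme, which this sketch ignores.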
Since then, most of my use of assembly has been just reading it in a debugger. I once reverse engineered the assembly for a DOS device driver, rewrote it in C, and ported it to the Mac.
Nowadays, CPUs have changed so much that it's very hard for a human to write assembly code as fast or as small as what a compiler can generate. When you combine things like RISC instruction sets, branch prediction, multiple levels of caching, and multi-stage pipelines with several operations in flight in the CPU at once, the most optimized code for a particular operation can be very unintuitive unless you're a real expert.