There are two steps to optimizing code.
First, you need to find out what's slow. That's where a profiler is handy. They're generally straightforward to use: you run your application through the profiler, and when it terminates, the profiler shows you how much time was spent in each function, both exclusive (time spent in the function itself, not counting time spent in functions it calls) and inclusive (time spent in the function including all child function calls).
In other words, you get a big call tree, and you just have to hunt down the big numbers. Usually, only a handful of functions consume more than 10% of the execution time. Locate those, and you know what to optimize.
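If you want a quick sanity check before (or alongside) a full profiler run, a crude manual timer around a suspected hotspot can already tell you whether it matters at all. A minimal C++ sketch, with `do_work` standing in for whatever function you suspect:

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

// Stand-in for whatever function you suspect is slow.
long long do_work() {
    std::vector<long long> v(10'000'000, 1);
    return std::accumulate(v.begin(), v.end(), 0LL);
}

int main() {
    using clock = std::chrono::steady_clock;

    auto start = clock::now();
    long long result = do_work();
    auto stop = clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::cout << "do_work returned " << result << " in " << ms << " ms\n";
}
```

A profiler is still the better tool, because it shows you the whole call tree instead of the one function you happened to guess at.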
Then comes the trickier part: how to optimize the code.
Of course, the most effective approach is often the high-level one. Does the problem have to be solved in this way? Does it have to be solved at all? Could it have been solved in advance and the result cached so it could be delivered instantly when the rest of the app needed it?
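As an illustration of the "compute once, cache the result" idea, here's a minimal C++ memoization sketch; `expensive_lookup` is a made-up placeholder for whatever pure, expensive computation you'd otherwise repeat:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical expensive, pure computation (placeholder body).
int expensive_lookup(const std::string& key) {
    // Imagine a slow parse, a disk read, or a heavy calculation here.
    return static_cast<int>(key.size()) * 42;
}

// Cache results so repeated requests can be served instantly.
int cached_lookup(const std::string& key) {
    static std::unordered_map<std::string, int> cache;
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;              // cache hit: no recomputation
    int value = expensive_lookup(key);  // cache miss: compute once
    cache[key] = value;
    return value;
}

int main() {
    std::cout << cached_lookup("report") << "\n";  // computed
    std::cout << cached_lookup("report") << "\n";  // served from the cache
}
```

This only works if the result doesn't go stale, of course; caching something that changes underneath you trades a performance problem for a correctness problem.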
Are there more efficient algorithms for solving the problem?
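A classic example of that kind of algorithmic improvement, sketched in C++: detecting duplicates by comparing every pair is O(n²), while remembering what you've already seen in a hash set is O(n) on average.

```cpp
#include <iostream>
#include <unordered_set>
#include <vector>

// O(n^2): compare every element against every other one.
bool has_duplicate_slow(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j])
                return true;
    return false;
}

// O(n) on average: remember what we've already seen.
bool has_duplicate_fast(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    for (int x : v)
        if (!seen.insert(x).second)  // insert fails => value already present
            return true;
    return false;
}

int main() {
    std::vector<int> v{3, 1, 4, 1, 5};
    std::cout << has_duplicate_fast(v) << "\n";  // prints 1
}
```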
If you can apply such high-level optimizations, do that, see if that improved performance sufficiently, and if not, profile again.
Sooner or later, you may have to dive into more low-level optimizations. This is tricky territory though. Today's computers are pretty complex, and the performance you get from them is not straightforward. The cost of a branch or a function call can vary widely depending on the context. Adding two numbers together may take anywhere from 1 to 100 clock cycles depending on whether both values are already in the CPU's registers, and so on. So the more you know about how the CPU works at a low level, the better. Still, there are a few trends that are often worth following:
I/O is expensive. CPU instructions are measured in fractions of a nanosecond. RAM access is on the order of tens to hundreds of nanoseconds. A hard drive access may take tens of milliseconds. So often, I/O will be what's slowing down your application.
Does your application perform a few large I/O reads (read a 20MB file in one big chunk), or countless small ones (read bytes 2,052 to 2,073 from one file, then read a couple of bytes from another file)? Fewer, larger reads can speed up your I/O by a factor of several thousand.
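For illustration, here's one way the "one big chunk" variant might look in C++ (a minimal sketch, with error handling mostly omitted):

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read the whole file in one big request instead of thousands of tiny ones.
std::vector<char> read_whole_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open positioned at the end
    if (!in) return {};
    std::vector<char> data(static_cast<std::size_t>(in.tellg()));  // size == file length
    in.seekg(0);
    in.read(data.data(), static_cast<std::streamsize>(data.size()));
    return data;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    auto data = read_whole_file(argv[1]);
    std::cout << "read " << data.size() << " bytes in one request\n";
}
```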
Page faults involve hard drive accesses too. In-memory pages have to be pushed to the pagefile, and paged-out ones have to be read back into memory. If this happens a lot, it's going to be slow. Can you improve the locality of your data so fewer pages are needed at the same time? Can you simply buy more RAM for the host computer to avoid having to page data out? (As a general rule, hardware is cheap. Upgrading the computer is a perfectly valid optimization - but make sure the upgrade will make a difference. Disk reads won't get a lot faster by buying a faster computer. And if everything already fits into RAM on your old system, there's no point in buying one with 8 times as much RAM.)
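A small C++ sketch of what data locality means in practice: the same sum over an N×N matrix, once walking memory in storage order and once striding across it. The second version touches far more cache lines (and, for large N, pages) per useful byte.

```cpp
#include <iostream>
#include <vector>

constexpr std::size_t N = 4096;  // N x N matrix stored row by row in one flat array

// Good locality: visits elements in the order they sit in memory,
// so each cache line and page is used fully before moving on.
long long sum_row_major(const std::vector<int>& m) {
    long long sum = 0;
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            sum += m[row * N + col];
    return sum;
}

// Poor locality: strides N elements at a time, touching a different
// cache line (and often a different page) on almost every access.
long long sum_column_major(const std::vector<int>& m) {
    long long sum = 0;
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            sum += m[row * N + col];
    return sum;
}

int main() {
    std::vector<int> m(N * N, 1);
    // Same result, very different memory-access pattern (and speed).
    std::cout << sum_row_major(m) << " " << sum_column_major(m) << "\n";
}
```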
Your database relies on hard drive accesses too. So can you get away with caching more data in RAM and only occasionally writing it out to the database? (Of course there's a risk there: what happens if the application crashes?)
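As a sketch of that idea, here's a simple write-behind buffer in C++. `write_batch_to_database` is a hypothetical stand-in for your actual database write, and as the comment notes, anything still buffered is lost if the application crashes.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the actual (slow) database write.
void write_batch_to_database(const std::vector<std::string>& records) {
    std::cout << "writing " << records.size() << " records\n";
}

// Collect records in RAM and only hit the database once per batch.
class WriteBehindBuffer {
public:
    explicit WriteBehindBuffer(std::size_t batch_size) : batch_size_(batch_size) {}

    void add(std::string record) {
        pending_.push_back(std::move(record));
        if (pending_.size() >= batch_size_)
            flush();                      // one big write instead of many small ones
    }

    void flush() {
        if (pending_.empty()) return;
        write_batch_to_database(pending_);
        pending_.clear();                 // note: anything still pending is lost on a crash
    }

    ~WriteBehindBuffer() { flush(); }

private:
    std::size_t batch_size_;
    std::vector<std::string> pending_;
};

int main() {
    WriteBehindBuffer buffer(1000);
    for (int i = 0; i < 2500; ++i)
        buffer.add("record " + std::to_string(i));
    // Destructor flushes the remaining 500 records.
}
```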
And then there's everyone's favorite: threading. A modern CPU has anywhere from 2 to 16 cores available. Are you using them all? Would you benefit from using them? Are there long-running operations that can be executed asynchronously? The application starts the operation in a separate thread and is then able to resume normal operation immediately, rather than blocking until the operation is complete.
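A minimal C++ sketch of that pattern using std::async; `generate_report` is a made-up placeholder for the long-running operation:

```cpp
#include <future>
#include <iostream>
#include <string>

// Stand-in for a long-running operation (report generation, a big query, ...).
std::string generate_report() {
    // ... heavy work here ...
    return "report contents";
}

int main() {
    // Kick the long-running work off on another thread.
    std::future<std::string> report = std::async(std::launch::async, generate_report);

    // The main thread carries on immediately instead of blocking here.
    std::cout << "still responsive while the report is being generated\n";

    // Block only at the point where the result is actually needed.
    std::cout << report.get() << "\n";
}
```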
So basically, use the profiler to understand your application. Where does it spend its execution time? Is memory consumption a problem? What are the I/O patterns (both hard drive and network accesses, as well as any other kind of I/O)?
Is the CPU just churning away all the time, or is it idle, waiting for some external event, such as I/O or timers?
And then understand as much as possible about the computer it's running on. Understand what resources it has available (CPU cache, multiple cores), and what each of them means for performance.
This is all pretty vague, because the tricks to optimizing a big database server are going to be very different from what you'd do to optimize some big number-crunching algorithm.