This is a bit tricky to answer without knowing what sort of "off the shelf" products you are trying to assess. Are you looking at UI responsiveness, throughput (e.g. email messages or transactions per second), startup time, etc.? Each of these has different criteria for which measures you should track and different tools for testing or evaluating them. But to answer some of your general questions:
Credibility - this is important. Try to make sure that whatever you are measuring has little run-to-run variance. Do several runs of the same scenario, discard the outliers (i.e. your lowest and highest), and evaluate the avg/max/min/median values. If you're doing some sort of throughput test, consider making it long-running so you have a good sample set. For example, if you are looking at something like Microsoft Exchange and thus are using its perf counters, try to make sure you are taking frequent samples (once per second or every few seconds) and have the test run for 20 minutes or so. Again, chop off the first few minutes and the last few minutes to eliminate any startup/shutdown noise.
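The multiple-runs-with-outlier-trimming approach could be sketched like this (the `scenario` callable and run count are placeholders for whatever workload you're timing):

```python
import statistics
import time

def measure_runs(scenario, runs=10):
    """Time `scenario` several times, drop the single lowest and highest
    run, and summarize what's left."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        scenario()
        samples.append(time.perf_counter() - start)
    samples.sort()
    trimmed = samples[1:-1]  # discard the outliers: lowest and highest run
    return {
        "min": min(trimmed),
        "max": max(trimmed),
        "avg": statistics.mean(trimmed),
        "median": statistics.median(trimmed),
    }

# Stand-in workload; substitute the scenario you actually care about.
stats = measure_runs(lambda: sum(range(100_000)), runs=10)
print(stats)
```

The trimming here is deliberately crude (one sample off each end); for noisier data you might trim a percentage instead.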
Heisenberg - tricky. In most modern systems, depending on what application/measures you are looking at, you can minimize this impact by being smart about what and how you measure. Sometimes (as in the Exchange example), you'll see near-zero impact. Try to use the least invasive tools possible. For example, if you're measuring startup time, consider using xperf and the events built into the kernel. If you're using perfmon, don't flood the system with extraneous counters that you don't care about. If you're doing some extremely long-running test, ratchet down your sampling interval.
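As a rough illustration of "ratchet down your sampling interval," you could pick the interval from the planned run length so the sample count (and hence the collection overhead) stays roughly constant. The target count and floor here are made-up numbers, not recommendations:

```python
def sampling_interval(duration_s, target_samples=1200, min_interval_s=1.0):
    """Pick a counter-sampling interval: aim for about `target_samples`
    over the whole run, but never sample faster than `min_interval_s`
    so the measurement itself stays cheap."""
    return max(min_interval_s, duration_s / target_samples)

# A 20-minute run samples once per second; a 24-hour soak backs off to ~72s.
print(sampling_interval(20 * 60))    # 1.0
print(sampling_interval(24 * 3600))  # 72.0
```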
Also try to eliminate any sources of environmental variability or possible noise. If you're doing something network-intensive, consider isolating the network. Disable any services or applications that you don't care about. Limit any disk I/O, memory-intensive operations, etc. If disk I/O might introduce noise into something that is CPU-bound, consider using an SSD.
When designing your tests, keep repeatability in mind. If you're doing some sort of microbenchmark-type testing (e.g. a perf unit test), have your infrastructure support running the same operation n times in exactly the same way. If you're driving UI, try not to physically drive the mouse; instead, use the underlying accessibility layer (MSAA, UIAutomation, etc.) to hit controls directly and programmatically.
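A minimal repeatable-microbenchmark harness might look like the sketch below. The key repeatability trick is a fixed seed, so every run of the harness exercises the operation on identical inputs (the `sorted` workload is just an example):

```python
import random
import time

def bench(op, make_input, n=1000, seed=42):
    """Repeatably benchmark `op`: a fixed seed means every run of this
    harness generates the exact same sequence of inputs."""
    rng = random.Random(seed)
    inputs = [make_input(rng) for _ in range(n)]  # build inputs up front,
    start = time.perf_counter()                   # outside the timed region
    for x in inputs:
        op(x)
    elapsed = time.perf_counter() - start
    return elapsed / n  # mean time per operation, in seconds

per_op = bench(sorted, lambda rng: [rng.random() for _ in range(100)])
print(f"{per_op * 1e6:.2f} us/op")
```

Generating the inputs before starting the clock keeps input construction out of the measurement; a real harness would also do a warm-up pass and repeat the whole thing several times per the outlier-trimming advice above.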
Again, this is just general advice. If you have more specifics, I can try to follow up with more relevant guidance.
Enjoy!