Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transaction or operations per second)? Is it latency (time it takes to execute a transaction)? Do you care about average performance? Do I care about worst case performance? Do you care about the absolute worst case or I care that 90%, 95% or some other percentile get adequate performance?
Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst case performance you will add load to the system until up to whatever you consider "realistic" (or whatever the system design says it should support.)
Second, you do not say if these modules are going to be CPU bound, I/O bound, if they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions you may find that your application benefits more from a great I/O subsystem vs. a huge number of CPUs.
Third, the best benchmark (and the hardest) is to put realistic load into the system. Meaning, you record data from a production environment, and put the new hardware solution through this data. Getting this done is harder than it sounds, often, this means adding all kinds of measure points in the system to see how it behaves (if you do not have them already,) modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.