I haven't found a way to profile CE apps; I use a brute force approach. Here are my recommendations:
2) Avoid divides and floating-point operations in your time-critical code, since the older ARM processors have no native instructions for them. A simple integer divide turns into roughly 100 clocks of runtime library code, and floating-point operations are even slower.
3) Write your "inner-loop" code in assembly language, since the compiler doesn't do a great job of optimizing it.
3) Use the internal timer (GetTickCount has a resolution of 1ms on WinCE) to time your own functions.
4) Selectively enable/disable sections of your code to measure how much time each section takes.
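Here's a rough sketch of how points 1, 3 and 4 can fit together. It is only an illustration under some assumptions: the names (now_ms, sum_divide, sum_shift, time_section, report_sections and the TIME_* switches) are made up for this example, the divisor is assumed to be a power of two so a shift gives the same answer as the divide, and printf stands in for whatever logging your device has. Since GetTickCount only resolves 1ms, each section is repeated many times and the total is timed. The non-Windows branch is just a fallback so the sketch also builds on a desktop.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
/* On the device: millisecond tick counter. */
static unsigned int now_ms(void) { return (unsigned int)GetTickCount(); }
#else
#include <time.h>
/* Assumption: desktop fallback using CPU time, converted to milliseconds. */
static unsigned int now_ms(void)
{
    return (unsigned int)((unsigned long long)clock() * 1000ULL / CLOCKS_PER_SEC);
}
#endif

/* Hypothetical inner loop, version A: divisor only known at run time,
   so the compiler has to emit a real divide (a library call on older ARM). */
unsigned int sum_divide(const unsigned short *v, int n, unsigned int divisor)
{
    unsigned int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += v[i] / divisor;
    return sum;
}

/* Version B: same result when the divisor is 1 << shift, with no divide. */
unsigned int sum_shift(const unsigned short *v, int n, unsigned int shift)
{
    unsigned int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (unsigned int)v[i] >> shift;
    return sum;
}

typedef unsigned int (*section_fn)(const unsigned short *, int, unsigned int);

/* Runs one section 'reps' times and returns the elapsed milliseconds.
   The accumulated result is stored so the work cannot be optimized away. */
unsigned int time_section(section_fn fn, const unsigned short *v, int n,
                          unsigned int arg, int reps, unsigned int *result)
{
    unsigned int start = now_ms();
    unsigned int acc = 0;
    int r;
    for (r = 0; r < reps; r++)
        acc += fn(v, n, arg);
    *result = acc;
    /* Unsigned subtraction stays correct if the 32-bit tick count wraps. */
    return now_ms() - start;
}

/* Set a switch to 0 to leave that section out of the run. */
#define TIME_DIVIDE_SECTION 1
#define TIME_SHIFT_SECTION  1

void report_sections(const unsigned short *v, int n, int reps)
{
    unsigned int result = 0;
    unsigned int ms = 0;
#if TIME_DIVIDE_SECTION
    ms = time_section(sum_divide, v, n, 8u, reps, &result);
    printf("divide: %u ms (checksum %u)\n", ms, result);
#endif
#if TIME_SHIFT_SECTION
    ms = time_section(sum_shift, v, n, 3u, reps, &result);
    printf("shift:  %u ms (checksum %u)\n", ms, result);
#endif
}
```

Call report_sections with a buffer of your own data and a rep count large enough that each section runs for at least a few hundred milliseconds; otherwise the 1ms tick resolution swamps the measurement. The two checksums should match, which is a cheap way to confirm the faster version still computes the same thing.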
Hope this helps,
L.B.