This is what I did today...
I debug HW/SW interaction, and it's often the case that logging (instrumentation) changes or hides the bug. Hence tests are performed "at-speed". I call these bugs "roaches" as they run away from any light I can shine on them.
So I have to:
Find the transaction that causes the bug. List the HW interaction via logging (this test passes, but it illustrates the flow).
Instrument before and after the bug to print state changes.
The bug I'm solving now is, of course, the worst case, as the HW locks up. The HW includes the CPU, so it's like being in a well-lit room when the power fails and it's pitch black.
I have a special backdoor view into memory, but of course this is locked up also. I tried power cycling in the hope that the memory would retain its contents long enough to re-enable the backdoor. No such luck, though this can work in principle.
I very, very carefully wrote down all the steps I went through to characterize this bug (what works, what fails, etc.). I sent this to developers with similar HW to verify it wasn't just me or my HW.
I took a few hours break to let this info settle and see if any lightbulbs lit elsewhere.
No replies, this bug is mine to solve...
This HW/SW interaction is a loop that does some setup, then enters a polling loop that checks whether the transaction has finished. Many transactions should occur. Which transaction fails? Is it the first one (indicating I can debug the transaction itself and not some noise in the HW)? Is it always the Nth transaction? What makes the Nth different from the first or the (N-1)th? The SW is single threaded and built to be predictable. No preemption, no interrupts enabled.
This SW has worked before, so what's new? All the HW is new. In this case all the silicon is new, as it's an ASIC. Even the embedded CPU is new and customized, so the ISA is new.
So I suspect everything and I'm blind. I'll have to sneak up on this roach.
I enabled just the log that reports how many times the SW polls the HW for completion. This way the first transaction runs at speed and I get an idea of how often I touch the HW in a tight polling loop. The test passes. I know it's the Nth transaction, and I recorded the peak number of polls for all transactions (perhaps meaningless data).
After modifying anything, I have to put it back the way it was to verify the bug still exists. After all, the earth has rotated and the solar winds are not as strong ;)
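For anyone who wants to picture the poll-count log, here is a rough sketch in C. It is not the real code: done_fn, poll_stats, poll_and_count and fake_done are names made up for illustration, and the "HW" is just a counter. The idea is that the loop only bumps counters, and any printing happens after the transaction is over, so the polling itself still runs at speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the HW "transaction done" status read. */
typedef bool (*done_fn)(void *ctx);

typedef struct {
    uint32_t last_polls;   /* polls used by the most recent transaction */
    uint32_t peak_polls;   /* largest poll count seen across all transactions */
    uint32_t transactions; /* transactions that completed */
} poll_stats;

/* Poll until the HW reports done or max_polls is reached.
 * Only counters are touched inside the loop; nothing is printed here. */
bool poll_and_count(done_fn is_done, void *ctx, uint32_t max_polls,
                    poll_stats *st)
{
    assert(is_done != NULL && st != NULL);

    uint32_t polls = 0;
    bool done = false;
    while (polls < max_polls) {
        polls++;
        if (is_done(ctx)) {
            done = true;
            break;
        }
    }

    st->last_polls = polls;
    if (polls > st->peak_polls)
        st->peak_polls = polls;
    if (done)
        st->transactions++;
    return done;
}

/* Simulated device (assumption, for demonstration only): ctx points to a
 * uint32_t holding how many polls must fail before "done" is reported. */
bool fake_done(void *ctx)
{
    uint32_t *remaining = ctx;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

After each transaction the caller can log st.last_polls, and st.peak_polls at the end of the run. Note that max_polls is a poll count, so how long it takes to hit still depends on CPU speed.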
Looked at all the checkins, saw a contractor changed some important setup parameters with no explanation. These (outsourced) people are still under evaluation. This will not help.
Found there was no spinwait in the polling loop. That's bad for the loop timeout, since without it the timeout depends on CPU speed. Added a spinwait, still no happiness.
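To show what I mean by a spinwait making the timeout independent of CPU speed, here is a minimal sketch in C. None of this is the real driver: poll_ops, poll_with_spinwait and the sim_* helpers are hypothetical, and delay_us is assumed to be some calibrated delay the platform provides. The timeout is counted in microseconds waited, not in loop iterations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: a status read and a calibrated delay. */
typedef struct {
    bool (*is_done)(void *ctx);
    void (*delay_us)(void *ctx, uint32_t us);
    void *ctx;
} poll_ops;

typedef enum { POLL_OK = 0, POLL_TIMEOUT = 1 } poll_result;

/* Poll for completion, waiting interval_us between reads. The timeout is
 * expressed in waited time, so it does not scale with CPU clock speed. */
poll_result poll_with_spinwait(const poll_ops *ops, uint32_t interval_us,
                               uint32_t timeout_us)
{
    assert(ops != NULL && interval_us > 0);

    uint64_t waited_us = 0; /* 64-bit so the sum cannot wrap */
    for (;;) {
        if (ops->is_done(ops->ctx))
            return POLL_OK;
        if (waited_us >= timeout_us)
            return POLL_TIMEOUT;
        ops->delay_us(ops->ctx, interval_us);
        waited_us += interval_us;
    }
}

/* Simulated device (assumption, for demonstration only): time advances
 * only when delay_us is called, and the device is done at done_at_us. */
typedef struct {
    uint64_t now_us;
    uint64_t done_at_us;
} sim_dev;

bool sim_is_done(void *ctx)
{
    const sim_dev *d = ctx;
    return d->now_us >= d->done_at_us;
}

void sim_delay_us(void *ctx, uint32_t us)
{
    sim_dev *d = ctx;
    d->now_us += us;
}
```

On real HW the delay would come from a timer or a calibrated busy loop. The spinwait also reduces how often the HW is touched, which by itself can change the behavior of a roach.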
Limited the number of transactions to see where it fails: somewhere before 1000.
Set up the HW to run slower, still hangs.
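Narrowing down the failing transaction can be done by bisecting on the transaction limit. The sketch below is in C and rests on a big assumption I have not proven yet: the failure is deterministic, so if a run limited to N transactions hangs, any larger limit hangs too. run_fn, first_failing_limit and sim_run_passes are hypothetical names. On the real HW each "run" means a power cycle and a manual pass/fail observation, so the callback is really me.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: run the test limited to `limit` transactions and
 * report whether it passed. */
typedef bool (*run_fn)(uint32_t limit, void *ctx);

/* Return the smallest limit in [1, hi] for which the run fails, or 0 if
 * even a run of hi transactions passes. Assumes failures are monotonic:
 * once a limit fails, every larger limit fails too. */
uint32_t first_failing_limit(run_fn run_passes, void *ctx, uint32_t hi)
{
    assert(run_passes != NULL);

    if (hi == 0 || run_passes(hi, ctx))
        return 0;

    uint32_t lo = 0; /* zero transactions trivially passes */
    /* Invariant: a run limited to lo passes, a run limited to hi fails. */
    while (hi - lo > 1) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (run_passes(mid, ctx))
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}

/* Simulated test run (assumption, for demonstration only): ctx points to
 * the index of the first bad transaction. */
bool sim_run_passes(uint32_t limit, void *ctx)
{
    const uint32_t *first_bad = ctx;
    return limit < *first_bad;
}
```

With an upper bound of 1000 this takes about ten runs after the first one. If the roach is not deterministic the result means little, and each limit has to be repeated several times.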
Hate to leave anyone reading this hanging too, but this diatribe will have to wait till tomorrow.