views:

442

answers:

3

i profiled a portion of my application using the Delphi Sampling Profiler. Like most people, i see a majority of the time spent inside ntdll.dll.

Note: i turned on the options to ignore Application.Idle time, and calls from System.pas. So it isn't inside ntdll because the application is idle:

alt text

After multiple runs, multiple times, the majority of the time seems to be spent inside ntdll.dll, but the odd thing is who the caller is: alt text

The caller is from the Virtual Treeview's:

PrepareCell(PaintInfo, Window.Left, NodeBitmap.Width);    

Note: The application is not inside ntdll.dll because the application is idle, because the caller isn't Application.Idle.

What confuses me is that it's this line itself (i.e. not something inside PrepareCell) is the caller into ntdll. Even more confusing is that:

  • not only is it not something inside PrepareCell()
  • it's not even the setup of PrepareCell (e.g. popping stack variables, setting up implicit exception frames, etc) that is the caller. Those things would show up in the profiler as a hotspot on the begin inside PrepareCell.

VirtualTrees.pas:

procedure TBaseVirtualTree.PrepareCell(var PaintInfo: TVTPaintInfo; WindowOrgX, MaxWidth: Integer);
begin
   ...
end;

So i'm trying to figure out how this line:

PrepareCell(PaintInfo, Window.Left, NodeBitmap.Width);    

is calling ntdll.dll.


The only other ways in are the three parameters:

  • PaintInfo
  • Window.Left
  • NodeBitmap.Width

Maybe one of those is a function, or a property getter, that would call into ntdll. So i put a breakpoint on the line, and look at the CPU window at runtime:

alt text

There is a line in there that might be the culprit:

call dword ptr [edx+$2c]

But when i follow that jump, it doesn't end up in ntdll.dll, but TBitmap.GetWidth:

alt text

Which, as you can see, doesn't call anywhere; and certainly not into ntdll.dll.


So how is the line:

PrepareCell(PaintInfo, Window.Left, NodeBitmap.Width);    

calling into ntdll.dll?


Note: i know full well it isn't really calling into ntdll.dll. So any valid answer will have to include the words "Sampling Profiler is misleading; that line is not calling into ntdll.dll." The answer will also have to either say that the majority of the time is not spent in ntdll.dll, or that the highlighted line is not the caller. Finally any answer will have to explain why Sampling Profiler is wrong, and how it can be fixed.

Update 2

What is ntdll.dll? Ntdll is Windows NT's native API set. The Win32 API is a wrapper around ntdll.dll that looks like the Windows API that existed in Windows 1/2/3/9x. In order to actually get into ntdll you have to call a function that uses ntdll directly or indirectly.

For example, when my Delphi application goes idle, it waits for a message by calling the user32.dll function:

WaitMessage;

When when you actually look at it is:

USER32.WaitMessage
  mov eax,$00001226
  mov edx,$7ffe0300
  call dword ptr [edx]
  ret

Calling the function specified at $7ffe0300 is the way Windows transitions into Ring0, calling the FunctionID specified in EAX. In this case, the System Function being called is 0x1226. On my operating system, Windows Vista, 0x1226 corresponds to the system function NtUserWaitMessage.

This is how to you get into ntdll.dll: you call it.

i was desperately trying to avoid a hand-waving non-answer when i worded the original question. By being very specific, carefully pointing out the reality of what i'm seeing, i was trying to prevent people from ignoring the facts, and trying to use a hand-waving argument.


Update Three

i converted the two parameters:

PrepareCell(PaintInfo, Window.Left, NodeBitmap.Width);

into stack variables:

_profiler_WindowLeft := Window.Left;
_profiler_NodeBitmapWidth := NodeBitmap.Width;
PrepareCell(PaintInfo, _profiler_WindowLeft, _profiler_NodeBitmapWidth);

To confirm that the bottleneck is not is the call to

  • Windows.Left, or
  • Nodebitmap.Width

Profiler still indicates that the line

PrepareCell(PaintInfo, _profiler_WindowLeft, _profiler_NodeBitmapWidth);

itself is the bottleneck; not anything inside PrepareCell. This must mean that it's something inside the setup of the call to prepare cell, or at the start of PrepareCell:

VirtualTrees.pas.15746: PrepareCell(PaintInfo, _profiler_WindowLeft, _profiler_NodeBitmapWidth);
   mov eax,[ebp-$54]
   push eax
   mov edx,esi
   mov ecx,[ebp-$50]
   mov eax,[ebp-$04]
   call TBasevirtualTree.PrepareCell

Nothing in that calls into ntdll. Now the pre-amble in PrepareCell itself:

VirtualTrees.pas.15746: begin
   push ebp
   mov ebp,esp
   add esp,-$44
   push ebx
   push esi
   push edi
   mov [ebp-$14],ecx
   mov [ebp-$18],edx
   mov [ebp-$1c],eax
   lea esi,[ebp-$1c]
   mov edi,[ebp-$18]

Nothing in there calls into ntdll.dll.


The questions still remain:

  • why is pushing of one variable onto the stack, and two others into registers the bottleneck?
  • why isn't anything inside PrepareCell itself the bottleneck?
+2  A: 

Well, this problem was actually my main reason to make my own sampling profiler:
http://code.google.com/p/asmprofiler/wiki/AsmProfilerSamplingMode

Maybe not perfect, but you could give it a try. Let me know what you think about it.

Btw, I think it has to do with the fact that almost all calls ends into calls to the kernel (memory requests, paint events, etc). Only calculations do not need to call the kernel. Most calls ends in waiting for kernel results:

ntdll.dll!KiFastSystemCallRet

You can see this in Process Explorer with thread stack view, or in Delphi, or using StackWalk64 API in my "Live view" of AsmProfiler:
http://code.google.com/p/asmprofiler/wiki/ProcessStackViewer

André
A: 

Have you tried breaking this line up and assigning Window.Left and NodeBitmap.Width to local variables?

Bruce McGee
Yes. The results are the same.
Ian Boyd
A: 

There are probably two things happening there.

The first is that SamplingProfiler identifies the caller by walking up the stack, until it encounters what looks like a valid call point into Delphi from Delphi code.

The thing is, some procedures may reserve a large amount of stack at once, without reinitializing it. This could result in a false positive. The only clue then would be that your false positive was recently invoked.

The second thing is the ntdll localization, that is known for certain, however, ntdll is your wait point in user-space, and as user197220, ntdll is where you'll end up waiting most of the time you're calling system stuff and waiting for the result.

In your case, unless you reduced the sampling rate, you're looking at 247ms of CPU work time, which could probably pass as idle if those 247 samples were collected over many seconds of real time. Since the false positive points to VirtualTree paint preparations, my bet would be that the ntdll time is actually paint time (driver or OS software). You can try commenting out the code that actually does the painting to be sure.

Eric Grange
i'd be really curious for you to expand, and clarify this answer. Because, honestly, i don't understand most of it. "The second thing is the ntdll localization, that is known for certain, however, ntdll is your wait point in user-sapce, and as user197220, ntdll is where you'll end up waiting most of the time you're calling system stuff and waiting for the result." What do you mean by "ntdll localization"? Certainly not anything do with with different locale's of Windows. "that is known for certain" What is known for certain, the ntdll localization, or that ntdll is my...
Ian Boyd
The whole "second thing" paragraph is one long sentence. i tried editing to break it up to make it clearer to me - but i don't know what phrase does with which subject. Where are you getting `user197220`, is that address `$197220`? i see the 247 samples that were inside ntdll, but how do you arrive at 247ms of CPU time? That would imply it takes 1ms of cpu time to take a sample - that's not right.
Ian Boyd
What would a valid "call point" from Delphi code to Delphi code look like? Did you mean that the procedure allocates a large amount of stack space without *initializing* it (in other words leaving whatever junk was there)? i don't know what a procedure would have to do to *re*-initialize stack variables.
Ian Boyd
Sorry, couple typos in there: a missing "said" after user197220, which is the person who answered you above. And instead of "localization", you should read "location".A valid call point is a stack return address pointing to a proper call and stack setup in Delphi, and yes, that's caused by a procedure allocating a lot of stack and leaving it untouched. This could be a Delphi procedure, but it could be in the windows DLLs too.
Eric Grange
The sampling frequency by default is 1kHz, 1000 samples per sec, statistically, 247 samples roughly translate into 247 ms if you're stuck in a call, or a series of call (as long as there is no resonance).
Eric Grange