Hi Friends,

I am using an ARM926EJ-S. In a memory-copy test I get about 20% more memory speed without Linux (running just as a Getting Started executable), but under Linux the same code runs about 20% slower.

The code is:

 
/// The code below performs a burst-mode memcopy test: it copies iSize
/// blocks of 8 words (32 bytes) from b to a using LDM/STM bursts.
void asmcpy(void *a, void *b, int iSize)
{
  do
  {
    asm volatile (
        "ldmia %1!, {r3-r10} \n\t"   /* load 8 words from b, post-increment */
        "stmia %0!, {r3-r10} \n\t"   /* store 8 words to a, post-increment  */
        : "+r"(a), "+r"(b)
        :
        : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "memory");
  } while (--iSize);
}

I verified that no other process is taking CPU time on Linux (I checked this with the time command; it shows that real time equals user time).

Please tell me what the problem with Linux could be.

Thanks & Regards.

ADDED:

My test code is:

int main()
{
  /// Two buffers of 320 * 120 ints (~150 KB each), allocated on the stack.
  int a[320 * 120], b[320 * 120];

  for (int i = 0; i != 10000; i++)
  {
    /// Size is divided by 8 because asmcpy performs 8 integer
    /// load/stores per iteration.
    asmcpy(a, b, (320 * 120) / 8);
  }
}

The Getting Started executable is a bin file that is sent to RAM over the serial port and executed directly by jumping to that address in RAM (without the need for an OS).

ADDED: I haven't seen such a performance difference on other processors. They were using SDRAM; this processor uses DDR RAM. Could that be the reason?

ADDED: The data cache is not enabled in the Getting Started code but is enabled under Linux, so ideally all data should be cached and accessed without any RAM latency. Yet Linux is still 20% slower.

ADDED: My microcontroller is the LPC3250. Both tests were run on the same external DDR RAM.

+8  A: 

This chip has an MMU, so Linux is likely using it to manage memory. Maybe just enabling it introduces some performance hit. Also, Linux uses a lazy memory allocation strategy, only assigning memory pages to a process when it first hits it. If you're copying a big chunk of memory, the MMU will generate page faults to ask the kernel to allocate a page while inside your loop. On a low-end processor, all these context switches cause cache flushes and introduce a noticeable slowdown.
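
One way to rule that out (a minimal sketch, assuming a standard glibc; mlockall() may need root or CAP_IPC_LOCK): lock every page into RAM before the benchmark so any faults are taken up front.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
  /* Lock all current and future pages into physical RAM so page
     faults are taken here, before any timed code runs. */
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    perror("mlockall");   /* may need root or CAP_IPC_LOCK */
    return 1;
  }

  /* ... run the memcpy benchmark here ... */

  munlockall();
  return 0;
}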

If your system is small enough, try an MMU-less version of Linux (like uClinux). Maybe it would let you use a cheaper chip with similar performance. On embedded systems, every penny counts.

Update: some extra details:

Every Linux process gets its own memory mappings. At first these include only the kernel and (maybe) the executable code. The rest of the linear 4GB address space (on 32-bit) appears available, but no RAM pages are assigned to it. As soon as you read or write an unallocated address, the MMU signals a page fault and switches to the kernel. The kernel sees that it still has plenty of free RAM pages, picks one, assigns it to the faulting address, and returns to your code, which finishes the interrupted instruction. The very next access won't fault because the whole page (typically 4KB) is already assigned; but a few iterations later the code hits another unassigned address, and the MMU invokes the kernel again.
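
If this is what's happening, pre-faulting the buffers should make the difference disappear: touch one byte per page before the timed loop so all the faults are taken first. A sketch (prefault is a hypothetical helper; call it on both buffers before benchmarking):

#include <stddef.h>
#include <unistd.h>

/* Touch one byte in every page of buf so the kernel assigns RAM
   pages now rather than faulting inside the timed loop. */
static void prefault(volatile char *buf, size_t len)
{
  long page = sysconf(_SC_PAGESIZE);   /* typically 4096 */
  for (size_t i = 0; i < len; i += (size_t)page)
    buf[i] = 0;
}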

Javier
Hi Javier, I am doing the memcopy from RAM to RAM only, so how can a page fault happen? I am doing the memcopy with 153KB of memory allocated on the stack, running it in a loop 10,000 times.
Sunny
All RAM is memory-managed, so a fault can happen at any time. See the update.
Javier
Hmm... 300KB is only a few dozen pages, and after the first pass all that space should be mapped, so you shouldn't get faults anymore. As mentioned above, some simplistic MMUs introduce another step in the processing pipeline and might affect performance just because they're active, even when no longer generating faults.
Javier
Another issue: even though you say "no other process is taking CPU time", you always have the timer ticks and a handful of kernel threads. Check /proc/<pid>/stat for some insight into where the time is going.
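
For example (a sketch: getrusage() exposes the same accounting as /proc/<pid>/stat, including page-fault counts; print_rusage is a hypothetical helper to call after the loop):

#include <stdio.h>
#include <sys/resource.h>

/* Print the CPU-time split and page-fault counts for this process;
   a large system-time share or fault count points at kernel work
   rather than the copy loop itself. */
static void print_rusage(void)
{
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) == 0) {
    printf("user %ld.%06lds  sys %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("minor faults %ld  major faults %ld\n",
           ru.ru_minflt, ru.ru_majflt);
  }
}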
Javier
+3  A: 

How are you performing the timing? There is no timing code in your example (see the sketch after these questions).

Are you sure that you are not measuring process load/unload time?

Is the processor clock speed the same in both cases?

If using external SDRAM, are the RAM timings the same in both cases?

Is the data cache enabled in both cases?
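
On the timing question, a minimal harness that measures only the loop might look like this (a sketch, assuming POSIX clock_gettime(); link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

extern void asmcpy(void *a, void *b, int iSize);

int main(void)
{
  static int a[320 * 120], b[320 * 120];   /* static keeps ~300KB off the stack */
  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i != 10000; i++)
    asmcpy(a, b, (320 * 120) / 8);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  printf("%.3f s  (%.1f MB/s)\n", s, 10000.0 * sizeof a / (s * 1e6));
  return 0;
}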

Clifford
Is the "time" syscommand returning the right numbers? It might be misconfigured. When you get weird results like this, a good option is to have the program print out a couple of things a minute apart according to its timer, and time them with a physical clock (or stopwatch).
Brooks Moses
The data cache is disabled in Getting Started mode. I will use a stopwatch and let you know. Thanks.
Sunny
+2  A: 

Getting Started is not "just an executable"; there must be some code to set up the DDR controller registers.

If the cache is enabled, then so must be the MMU; I think on the ARM926EJ-S you can't have the data cache without the MMU.

I believe every context switch results in a cache flush, because the cache is virtually indexed, virtually tagged (VIVT) and kernel and userspace don't share the same address space, so you probably get a lot more unwanted cache flushes than without an OS.
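
You can check how often this actually happens during the test (a sketch; the ctxt_switches lines in /proc/self/status are present on reasonably recent kernels, and print_ctxt_switches is a hypothetical helper to call before and after the loop):

#include <stdio.h>
#include <string.h>

/* Print this process's context-switch counters; on a VIVT cache,
   each switch can imply a full cache flush. */
static void print_ctxt_switches(void)
{
  char line[128];
  FILE *f = fopen("/proc/self/status", "r");
  if (!f)
    return;
  while (fgets(line, sizeof line, f))
    if (strstr(line, "ctxt_switches"))
      fputs(line, stdout);
  fclose(f);
}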

Here is a paper with some data on the cost of VIVT cache flushes when running Linux.

shodanex
+1  A: 

What microcontroller (not just what ARM CPU) are you using?

Is it possible that in the non-Linux run the array you're testing is in RAM on the microcontroller itself, while in the Linux test it is in external RAM? Internal RAM is usually accessed much faster than external RAM; this might account for the Linux test being slower, even if data caching is enabled only for the Linux run.

Michael Burr
Hi Michael, my microcontroller is the LPC3250. Both tests were run on the same external DDR RAM.
Sunny