I've slightly modified the iPhone SDK's GLSprite example while learning OpenGL ES, and it turns out to be quite slow, even in the simulator (and worse on the hardware). I must be doing something wrong, since it's only 400 textured triangles.

const GLfloat spriteVertices[] = {
  0.0f, 0.0f, 
  100.0f, 0.0f,  
  0.0f, 100.0f,
  100.0f, 100.0f
};

const GLshort spriteTexcoords[] = {
  0,0,
  1,0,
  0,1,
  1,1
};

- (void)setupView {
 glViewport(0, 0, backingWidth, backingHeight);
 glMatrixMode(GL_PROJECTION);
 glLoadIdentity();
 glOrthof(0.0f, backingWidth, backingHeight,0.0f, -10.0f, 10.0f);
 glMatrixMode(GL_MODELVIEW);

 glClearColor(0.3f, 0.0f, 0.0f, 1.0f);

 glVertexPointer(2, GL_FLOAT, 0, spriteVertices);
 glEnableClientState(GL_VERTEX_ARRAY);
 glTexCoordPointer(2, GL_SHORT, 0, spriteTexcoords);
 glEnableClientState(GL_TEXTURE_COORD_ARRAY);

 // sprite data is preloaded. 512x512 rgba8888 
 glGenTextures(1, &spriteTexture);
 glBindTexture(GL_TEXTURE_2D, spriteTexture);
 glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, spriteData);
 free(spriteData);

 glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

 glEnable(GL_TEXTURE_2D);
 glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
 glEnable(GL_BLEND);
} 

- (void)drawView {
    ..
    glClear(GL_COLOR_BUFFER_BIT);
    glLoadIdentity();
    glTranslatef(tx - 100, ty - 100, 10);
    for (int i = 0; i < 200; i++) {
        glTranslatef(1, 1, 0);
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    }
    ..
}

drawView is called every time the screen is touched or a finger is moved on the screen; tx and ty are set to the x,y coordinates of that touch.

I've also tried using a GL buffer object, with the translations pre-generated so that only one glDrawArrays call was needed, but it gave the same performance (~4 FPS).
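For reference, here is a rough sketch of what that single-draw-call variant could look like (my reconstruction under assumptions, not the original code): the per-quad glTranslatef is baked into one big vertex array, the quads are expressed as GL_TRIANGLES (independent strips can't be concatenated without degenerate vertices), and glDrawArrays is called once. The names batchVerts, batchTexcoords and buildBatch are invented for this sketch.

#define NUM_QUADS 200

static GLfloat batchVerts[NUM_QUADS * 6 * 2];     // x,y for 6 vertices per quad
static GLfloat batchTexcoords[NUM_QUADS * 6 * 2]; // u,v for 6 vertices per quad

// Bake the per-quad translation into the vertex data once, instead of
// issuing glTranslatef + glDrawArrays 200 times per frame.
static void buildBatch(GLfloat tx, GLfloat ty)
{
    // two triangles covering the same 100x100 quad as spriteVertices
    static const GLfloat xs[6] = { 0, 100, 0, 100, 100, 0 };
    static const GLfloat ys[6] = { 0, 0, 100, 0, 100, 100 };
    static const GLfloat us[6] = { 0, 1, 0, 1, 1, 0 };
    static const GLfloat vs[6] = { 0, 0, 1, 0, 1, 1 };

    int v = 0;
    for (int i = 0; i < NUM_QUADS; i++) {
        GLfloat ox = tx - 100 + i;  // mirrors the 1px-per-quad offset in drawView
        GLfloat oy = ty - 100 + i;
        for (int k = 0; k < 6; k++) {
            batchVerts[v]         = ox + xs[k];
            batchVerts[v + 1]     = oy + ys[k];
            batchTexcoords[v]     = us[k];
            batchTexcoords[v + 1] = vs[k];
            v += 2;
        }
    }
}

static void drawBatch(void)
{
    glVertexPointer(2, GL_FLOAT, 0, batchVerts);
    glTexCoordPointer(2, GL_FLOAT, 0, batchTexcoords);
    glDrawArrays(GL_TRIANGLES, 0, NUM_QUADS * 6);
}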

===EDIT===

Meanwhile I've modified this so that much smaller quads are used (34x20) with much less overlap. There are ~400 quads (800 triangles) spread over the whole screen. The texture is a 512x512 RGBA8888 atlas and the texture coordinates are floats. The code is very ugly in terms of API efficiency: per quad there are two glMatrixMode changes along with two matrix loads and two translations, then a glDrawArrays for a triangle strip (quad). This now produces ~45 FPS.

+12  A: 

Your texture is 512*512 pixels at 4 bytes per pixel. That's a megabyte of data. If you render it 200 times per frame, you generate a bandwidth load of 200 megabytes per frame.

At roughly 4 fps you consume 800 MB/second for texture reads alone. Frame- and Z-buffer writes need bandwidth as well. Then there is the CPU, and don't underestimate the bandwidth requirements of the display either.

RAM on embedded systems (e.g. your iPhone) is not as fast as on a desktop PC. What you see here is a bandwidth starvation effect: the RAM simply can't deliver the data any faster.

How to cure this problem:

  • Pick a sane texture size. On average you should have one texel per pixel; this gives crisp-looking textures. I know it's not always possible. Use common sense.

  • Use mipmaps. They take up 33% of extra space but allow the graphics chip to pick a lower-resolution mipmap level where possible.

  • Try smaller texture formats. Maybe you can use the ARGB4444 format; this would double the rendering speed. Also take a look at the compressed texture formats. Decompression does not cause a performance drop because it is done in hardware. In fact the opposite is true: due to the smaller size in memory, the graphics chip can read the texture data faster. (A sketch of the mipmap and 16-bit-format suggestions follows this list.)
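For illustration, a minimal sketch of those two suggestions in OpenGL ES 1.1 terms: requesting automatic mipmap generation before upload, and repacking the 8888 pixels into a 16-bit 4444 texture (GL_UNSIGNED_SHORT_4_4_4_4 is the ES name for that layout). This is an assumption-laden reconstruction, not code from the answer; it reuses spriteData, width and height from the question's setupView, and pixels4444 is a name I made up.

// Ask the driver to build the mipmap chain when the texture is uploaded
// (GL_GENERATE_MIPMAP is available in OpenGL ES 1.1).
glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);

// Repack the 32-bit RGBA8888 pixels into 16-bit RGBA4444 before upload.
GLushort *pixels4444 = malloc(width * height * sizeof(GLushort));
const GLubyte *src = spriteData;
for (int i = 0; i < width * height; i++, src += 4) {
    pixels4444[i] = ((src[0] >> 4) << 12) |  // R
                    ((src[1] >> 4) << 8)  |  // G
                    ((src[2] >> 4) << 4)  |  // B
                     (src[3] >> 4);          // A
}
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, pixels4444);
free(pixels4444);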

Nils Pipenbrinck
Why is the texture being loaded for every triangle? Isn't that already in the video memory?
f3r3nc
I don't know about the iPhone, but on handheld and embedded devices it's common that there is no physical difference between video and system memory. This is called unified memory.
Nils Pipenbrinck
Also, from a bandwidth point of view it does not make a difference whether the texture is in video memory or system memory. Just because a texture is in video memory does not mean that video memory has unlimited bandwidth; video memory is not much faster or smarter than system RAM.
Nils Pipenbrinck
Using a smaller texture format didn't speed it up, although using smaller vertices did (obviously?). When the squares are 20x20 the rendering is fast enough (~50 FPS), even with the large texture. I think the slow performance is because of the setup: 200 overlapping squares.
f3r3nc
I meant smaller squares, not vertices. Imagination (maker of the PowerVR MBX Lite) also suggests using a texture atlas, and because this chip uses tile-based rendering I should try a setup where smaller triangles are used and spread around the whole screen with a reasonable amount of overlap.
f3r3nc
A: 

I'm not familiar with the iPhone, but if it doesn't have dedicated hardware for handling floating point numbers (I suspect it doesn't) then it'd be faster to use integers whenever possible.

I'm currently developing for Android (which uses OpenGL ES as well) and for instance my vertex array is int instead of float. I can't say how much of a difference it makes, but I guess it's worth a try.
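If it helps, OpenGL ES 1.1 already accepts non-float vertex types through glVertexPointer (GL_BYTE, GL_SHORT and the 16.16 fixed-point GL_FIXED), so the question's quad could be fed as integers without any other change. A small hedged sketch (the array name is mine):

// The 100x100 quad from the question with 16-bit integer coordinates
// instead of floats; GL_FIXED (16.16) is the other non-float option.
const GLshort spriteVerticesShort[] = {
      0,   0,
    100,   0,
      0, 100,
    100, 100,
};

glVertexPointer(2, GL_SHORT, 0, spriteVerticesShort);
glEnableClientState(GL_VERTEX_ARRAY);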

maciej.gryka
A: 

Apple is very tight-lipped about the specific hardware specs of the iPhone, which seems very strange to those of us coming from a console background. But people have been able to determine that the CPU is a 32-bit RISC ARM1176JZF. The good news is that it has a full floating-point unit, so we can continue writing math and physics code the way we do on most other platforms.

http://gamesfromwithin.com/?p=239

+1  A: 

I guess my first try was just a bad (or very good) test. The iPhone has a PowerVR MBX Lite, which is a tile-based graphics processor: it subdivides the screen into smaller tiles and renders them in parallel. In the first case above, that subdivision probably got exhausted because of the very heavy overlap. Moreover, the quads couldn't be depth-rejected against each other because they were all at the same depth, so all the texture coordinates had to be calculated (this could easily be tested by changing the translation in the loop). Also, because of the overlap the parallelism couldn't be exploited: some tiles were sitting idle while the rest (about a third) were doing all the work.

So I think that while memory bandwidth could be a bottleneck, it wasn't the case in this example. The problem lies more in how the graphics hardware works and in the setup of the test.

f3r3nc
If alpha blending or alpha testing is enabled, no overdraw optimization is applicable, because every pixel can be transparent and therefore nothing can be rejected. Tiles are not rendered in parallel, by the way; they are rendered sequentially, but the Z-buffer is limited to the tile size and is implemented with on-chip memory.
noop
+2  A: 

(I know this is very late, but I couldn't resist. I'll post anyway, in case other people come here looking for advice.)

This has nothing to do with the texture size. I don't know why people voted Nils up. He seems to have a fundamental misunderstanding of the OpenGL pipeline: he seems to think that for a given triangle, the entire texture is loaded and mapped onto that triangle. The opposite is true.

Once the triangle has been mapped into the viewport, it is rasterized. For every on-screen pixel your triangle covers, the fragment shader is called. The default fragment shader (OpenGL ES 1.1, which you are using) will look up the texel that most closely maps (GL_NEAREST) to the pixel you are drawing. It might look up 4 texels, since you are using the higher-quality GL_LINEAR method that averages the nearest texels. Still, if the pixel count in your triangle is, say, 100, then the most texture bytes you will have to read is 4 (lookups) * 100 (pixels) * 4 (bytes per color) = 1,600 bytes. Far, far less than what Nils was saying. It's amazing that he can make it sound like he actually knows what he's talking about.

WRT the tiled architecture, this is common in embedded OpenGL devices to preserve locality of reference. I believe each tile gets exposed to each drawing operation, quickly culling most of them; then the tile decides what to draw on itself. This is going to be much slower when you have blending turned on, as you do. Because you are using large triangles that may overlap and blend with other tiles, the GPU has to do a lot of extra work. If, instead of rendering the example square with alpha edges, you were to render the actual shape (instead of a square picture of the shape), then you could turn off blending for that part of the scene and I bet that would speed things up tremendously.

If you want to try it, just turn off blending and see how much things speed up, even if they don't look right: glDisable(GL_BLEND);
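As a quick experiment along those lines, a hedged sketch of splitting the frame into an opaque pass (blending off) and a blended pass for only the sprites that really need alpha; drawOpaqueSprites and drawTransparentSprites are placeholders, not real functions:

// Opaque geometry first, with blending disabled, so the tile-based GPU
// can reject hidden fragments instead of blending every layer.
glDisable(GL_BLEND);
drawOpaqueSprites();        // placeholder: quads without alpha edges

// Then only the genuinely transparent sprites, with blending restored.
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
drawTransparentSprites();   // placeholder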

Bruce Miller
Thanks for your answer. Actually, the 200x2 GL calls should also be removed; instead the whole scene should be drawn by one glDraw call. I'll give both a try and let you know the outcome.
f3r3nc
Disabling blending increases FPS 10x. Tiler utilization was ~5% with blending, ~45% without, and glDraw is still called 200 times. Wow.
f3r3nc
As I understand it, each tile has to keep a queue of every triangle that intersects it. Without blending, a triangle that covers a tile can flush the tile's queue of everything drawn before it. With blending, the tiles have to keep a complete queue of everything drawn until the very end. I used to work on GPUs for a company that was using custom GPU ASICs on their video chips, and I still don't understand tiling completely. Good luck.
Bruce Miller
If we assume that the absolute tiler time is actually constant, then increasing total speed by 10x should also grow the tiler's relative share of frame time by (almost) 10x. The pixel fill rate got much faster without alpha blending for two reasons: first, alpha blending is an expensive operation by itself, and second, the tiling engine can only do its job when alpha blending/alpha testing is off.
noop