This has probably been asked over and over but I couldn't find anything useful so here it goes again...

In my application I need to render a fairly large mesh (a couple of million triangles or more) and I'm having some problems getting decent frame rates out of it. The CPU is pretty much idling so I'm definitely GPU-bound. Changing the resolution doesn't affect performance, so it's not fragment- or raster-bound.

The mesh is dynamic (but locally static), so I cannot store the whole thing on the video card and render it with one call. For application-specific reasons the data is stored as an octree with voxels in the leaves, which means I get frustum culling basically for free. The vertex data consists of coordinates, normals and colors; no textures or shaders are used.
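
Roughly, the traversal looks like this (a minimal sketch; the node layout and the `frustumIntersectsBox`/`drawVoxels` helpers are illustrative, not my actual code):

```c
#include <stddef.h>

typedef struct Frustum Frustum;      /* opaque; computed elsewhere */
typedef struct VoxelData VoxelData;  /* leaf payload */

typedef struct OctreeNode {
    float min[3], max[3];            /* axis-aligned bounding box */
    struct OctreeNode *child[8];     /* all NULL in leaf nodes */
    VoxelData *voxels;               /* non-NULL only in leaves */
} OctreeNode;

/* Hypothetical helpers. */
int  frustumIntersectsBox(const Frustum *f, const float min[3], const float max[3]);
void drawVoxels(const VoxelData *v);

/* Culling falls out of the tree structure: one rejected node
   culls its entire subtree with a single box test. */
static void drawVisible(const OctreeNode *n, const Frustum *f)
{
    if (n == NULL || !frustumIntersectsBox(f, n->min, n->max))
        return;
    if (n->voxels != NULL) {
        drawVoxels(n->voxels);       /* leaf: issue the draw call */
        return;
    }
    for (int i = 0; i < 8; ++i)
        drawVisible(n->child[i], f);
}
```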

My first approach was to just render everything from memory using one big STREAM_DRAW VBO, which turned out to be too slow. My initial thought was that I was perhaps overtaxing the bus (pushing ~150 MiB per frame), so I implemented a caching scheme that stores recently used geometry in static VBOs on the graphics card, with each VBO holding a few hundred KiB to a couple of MiB worth of data (storing more per VBO causes more cache thrashing, so there's a trade-off here). The picture below is an example of what the data looks like, where everything colored red is drawn from cached VBOs.

Example of the rendered data
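
For reference, uploading one cache entry looks roughly like this (a simplified sketch assuming headers that expose OpenGL 1.5 core entry points; error handling omitted):

```c
#include <GL/gl.h>

/* Upload one cache entry's interleaved geometry into a static VBO.
   The entry is built once from the octree and reused across frames. */
GLuint uploadCacheEntry(const void *vertexData, GLsizei vertexCount)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 (GLsizeiptr)vertexCount * 28,  /* 28 bytes per vertex */
                 vertexData,
                 GL_STATIC_DRAW);               /* upload once, draw often */
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    return vbo;
}
```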

As the numbers below show, I don't see a spectacular increase in performance when using the cache. For a fully static mesh of about 1 million triangles I get the following frame rates:

  • Without caching: 1.95 Hz
  • Caching using vertex arrays: 2.0 Hz (>75% of the mesh is cached)
  • Caching using STATIC_DRAW VBOs: 2.4 Hz

So my question is: how do I speed this up? Specifically:

  • What's the recommended vertex format to get decent performance? I use interleaved storage with positions and normals as GL_FLOAT and colors as GL_UNSIGNED_BYTE, with one padding byte to get 4-byte alignment (28 bytes/vertex total); see the sketch after this list.
  • Would using the same buffer for normals for all my boxes help? (All boxes are axis-aligned, so I can allocate a normal buffer the size of the largest cache entry and use it for them all.)
  • How do I know which part of the pipeline is the bottleneck? I don't have a spectacular video card (Intel GM965 with open source Linux drivers) so it's possible that I hit its limit. How much throughput can I expect from typical hardware (2-3 year old integrated graphics, modern integrated graphics, modern discrete graphics)?
  • Any other tips on how you would tackle this, pitfalls, etc.
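
For the first point, this is roughly my layout and client-state setup (a simplified sketch; the function name is illustrative):

```c
#include <stddef.h>   /* offsetof */
#include <GL/gl.h>

/* The 28-byte interleaved vertex described above. */
typedef struct Vertex {
    GLfloat position[3];  /* 12 bytes */
    GLfloat normal[3];    /* 12 bytes */
    GLubyte color[3];     /*  3 bytes */
    GLubyte pad;          /*  1 padding byte -> 28-byte stride */
} Vertex;

void drawCachedVbo(GLuint vbo, GLsizei vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    /* With a bound VBO, the "pointer" arguments are byte offsets. */
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex),
                    (const GLvoid *)offsetof(Vertex, position));
    glNormalPointer(GL_FLOAT, sizeof(Vertex),
                    (const GLvoid *)offsetof(Vertex, normal));
    glColorPointer(3, GL_UNSIGNED_BYTE, sizeof(Vertex),
                   (const GLvoid *)offsetof(Vertex, color));
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```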

I'm not interested in answers suggesting LOD (I already tested this), vendor-specific tips or using OpenGL features from anything later than 1.5.

A: 

I would use a performance profiler first (like gDEBugger), so you can figure out if you are vertex, fragment or bus limited, etc. It's hard to guess what optimizations to perform in such a particular case (Intel + open source drivers).

Did you also try VA mode? Are you using glDrawElements? glDrawArrays? Is the data vertex-cache friendly (pre and post transform)?

Stringer Bell
@Stringer Bell: I would use an OpenGL profiler if there were an open source (or free) one for Linux (see [my other question](http://stackoverflow.com/questions/3235864/open-source-opengl-profiler-for-linux)). The open source drivers are the Intel-developed official drivers, but I guess you already know that. I'm using glDrawArrays since I cannot share data between vertices (all vertices have different normals or positions). What's VA mode? The data is cache-friendly AFAICT, i.e. interleaved storage (not sure how transformations would affect that).
Staffan
@Stringer Bell: I tried gDEBugger despite being closed source, and it doesn't work for me (gives me an indirect context and then causes a SIGSEGV).
Staffan
VA mode is plain old 1.1 vertex arrays. Post-transform caches are used only if you have indices (see http://www.opengl.org/wiki/Post_Transform_Cache). Do you use GL_QUADS or GL_TRIANGLES for rendering your boxes?
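A minimal sketch of the difference (illustrative names; assumes the vertex arrays are already set up):

```c
/* glDrawArrays transforms every vertex it touches; glDrawElements can
   reuse post-transform results when an index repeats (GL 1.5 core). */
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexVbo);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
               (const GLvoid *)0);   /* byte offset into the index VBO */
```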
Stringer Bell
You'd better try the Intel tools: http://software.intel.com/en-us/articles/intel-graphics-media-accelerator-profiler-v21/
Stringer Bell
@Stringer Bell: Ah, yeah I tested with plain ol' vertex arrays as well (I posted some benchmarks above). I use triangles. The Intel profiler is Windows-only unfortunately.
Staffan
+2  A: 

You're probably not going to like this response....

I've found your problem: Intel GM965 with open source Linux drivers

While my current job doesn't hit your volume of data, we've rendered several million vertices from VBOs, and Intel graphics hardware/drivers have proven useless. Get yourself an NVIDIA card (and get over having to use the binary driver; it just works) and you'll be all set. It doesn't even have to be current generation, though a top-end Quadro (if work is paying) or top-end GTX 400 series (if you're paying, or just trying to save some bucks at work) should do just fine with the latest drivers. You could also try to find a machine with this hardware to test on if upgrading your machine is not an option.

basszero
@basszero: Looks like you're right. I tested on a machine with better graphics, not a Quadro but better nonetheless, and got 15 Hz with caching and half of that without. It's more an inconvenience than a problem since I'm only the developer and not (currently) the primary user of it.
Staffan
@Staffan: that doesn't mean to me that you've maxed out your GMA 965. Maybe you're just doing something bad for performance. Frankly, I would consider giving the Intel Media Accelerator Profiler a try (if your application is portable, of course). Don't forget that GMAs are tile-based renderers.
Stringer Bell
@Stringer Bell: It looks like it really is the video card/drivers setting the limit. After some tuning of my caching mechanism I get ~28 million triangles/s on a GeForce 3 Ti200; there's no official spec on how many triangles it can push, but this seems like it might be reasonably close to the limit.
Staffan
@Staffan: Of course, you have no official spec. What you can do is check benchmarks like 3DMark: http://techreport.com/articles.x/12195/9 What is the throughput of your application in MVertices/second?
Stringer Bell
All the existing benchmarks are too broad for this case; I care only about triangles/s, not shaders and stuff. Well, 28 MTri/s × 3 vertices/triangle = 84 MVertices/s.
Staffan
A: 

I don't know about your "mesh", but it seems like the elements are all cubes. If it is possible for you, render a single unit cube into a display list and render scaled versions of that display list. That often gives a 10x speedup, since the bus is not pumped with vertex data and video memory isn't exhausted.

Of course, that depends on your ability to change the data. It might not be possible if the data really isn't like in the picture.
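
A minimal sketch of what I mean (assuming a hypothetical drawUnitCube() that emits the 12 triangles with normals):

```c
#include <GL/gl.h>

typedef struct Box { float x, y, z, sx, sy, sz; } Box;  /* illustrative */

void drawUnitCube(void);  /* hypothetical: emits 12 triangles + normals */

void drawBoxes(const Box *box, int boxCount)
{
    /* Compile the cube geometry once (cache the list id in real code)... */
    GLuint cubeList = glGenLists(1);
    glNewList(cubeList, GL_COMPILE);
    drawUnitCube();
    glEndList();

    /* ...then draw every box as a translated/scaled instance.
       Non-uniform scaling denormalizes normals, hence GL_NORMALIZE. */
    glEnable(GL_NORMALIZE);
    for (int i = 0; i < boxCount; ++i) {
        glPushMatrix();
        glTranslatef(box[i].x, box[i].y, box[i].z);
        glScalef(box[i].sx, box[i].sy, box[i].sz);
        glCallList(cubeList);
        glPopMatrix();
    }
}
```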

Sean Farrell
They're all cuboids, yes. As I described in the OP, I use a cache with VBOs to avoid overtaxing the bus; no glVertex3() calls were involved in generating the above image.
Staffan
But a VBO != a display list... The difference is that a VBO still uses a LARGE ARRAY, even if it is in video memory. In most cases, for setups like this, I get more bang for the buck calling one display list 10,000 times than rendering 10,000 cubes from a vertex array.
Sean Farrell