This has probably been asked over and over but I couldn't find anything useful so here it goes again...
In my application I need to render a fairly large mesh (a couple of million triangles or more) and I'm having some problems getting decent frame rates out of it. The CPU is pretty much idling so I'm definitely GPU-bound. Changing the resolution doesn't affect performance, so it's not fragment- or raster-bound.
The mesh is dynamic (but locally static) so I cannot store the whole thing on the video card and render it with one call. For application-specific reasons the data is stored as an octree with voxels in the leaves, which means I get frustum culling basically for free. The vertex data consists of coordinates, normals and colors; no textures or shaders are used.
My first approach was to just render everything from memory using one big STREAM_DRAW VBO, which turned out to be too slow. My initial thought was that I was perhaps overtaxing the bus (pushing ~150 MiB per frame), so I implemented a caching scheme that stores geometry recently used to render the object in static VBOs on the graphics card, with each VBO storing from a few hundred KiB to a couple of MiB worth of data (storing more per VBO gives more cache thrashing, so there's a trade-off here). The picture below is an example of what the data looks like, where everything colored red is drawn from cached VBOs.
As the numbers below show, I don't see a spectacular increase in performance when using the cache. For a fully static mesh of about 1 million triangles I get the following frame rates:
- Without caching: 1.95 Hz
- Caching using vertex arrays: 2.0 Hz (>75% of the mesh is cached)
- Caching using STATIC_DRAW VBOs: 2.4 Hz
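To put these frame rates in perspective, the figures can be turned into a rough throughput estimate. This is only a back-of-the-envelope sketch: it assumes non-indexed triangles (three unique vertices each) and the 28 bytes/vertex layout described further down, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Figures from the question; the names are illustrative only. */
enum { BYTES_PER_VERTEX = 28, VERTICES_PER_TRIANGLE = 3 };

/* Vertex bytes submitted per frame.
 * Assumption: non-indexed triangles, i.e. no vertex sharing. */
unsigned long long bytes_per_frame(unsigned long long triangles)
{
    return triangles * VERTICES_PER_TRIANGLE * BYTES_PER_VERTEX;
}

/* Triangles drawn per second at a given frame rate. */
double triangles_per_second(unsigned long long triangles, double frame_rate_hz)
{
    return (double)triangles * frame_rate_hz;
}

/* Vertex data throughput in MiB/s at a given frame rate. */
double mib_per_second(unsigned long long triangles, double frame_rate_hz)
{
    return (double)bytes_per_frame(triangles) * frame_rate_hz
           / (1024.0 * 1024.0);
}
```

Under those assumptions, 1 million triangles is 84,000,000 bytes (roughly 80 MiB) per frame, so 2.4 Hz corresponds to about 2.4 million triangles/s and roughly 192 MiB/s of vertex data. With indexed geometry the byte counts would be lower.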
So my question is: how do I speed this up? I.e.:
- What's the recommended vertex format to get decent performance? I use interleaved storage with positions and normals as GL_FLOAT and GL_UNSIGNED_BYTE for colors, with one padding byte to get 4-byte alignment (28 bytes/vertex total).
- Would using the same buffer for normals for all my boxes help? (All boxes are axis-aligned, so I can allocate a normal buffer the size of the largest cache entry and use it for them all.)
- How do I know which part of the pipeline is the bottleneck? I don't have a spectacular video card (Intel GM965 with open source Linux drivers) so it's possible that I hit its limit. How much throughput can I expect from typical hardware (2-3 year old integrated graphics, modern integrated graphics, modern discrete graphics)?
- Any other tips on how you would tackle this, pitfalls, etc.
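For reference, the interleaved vertex format described in the first bullet can be sketched in C roughly as follows. The struct and constant names are hypothetical, and the sketch assumes a platform where float is 4 bytes; the OpenGL calls are shown only in a comment so the snippet compiles without GL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex; field names are illustrative. */
typedef struct {
    float         position[3]; /* GL_FLOAT x 3, 12 bytes */
    float         normal[3];   /* GL_FLOAT x 3, 12 bytes */
    unsigned char color[3];    /* GL_UNSIGNED_BYTE x 3, 3 bytes */
    unsigned char pad;         /* 1 padding byte for 4-byte alignment */
} Vertex;

/* Compile-time check that the layout really is 28 bytes
 * (the array size becomes negative, a compile error, otherwise). */
typedef char vertex_size_check[(sizeof(Vertex) == 28) ? 1 : -1];

/* Stride and byte offsets of each attribute within one vertex. */
enum {
    VERTEX_STRIDE   = sizeof(Vertex),
    POSITION_OFFSET = offsetof(Vertex, position),
    NORMAL_OFFSET   = offsetof(Vertex, normal),
    COLOR_OFFSET    = offsetof(Vertex, color)
};

/* With a VBO bound to GL_ARRAY_BUFFER, the fixed-function array
 * pointers would typically be set from these values, e.g.:
 *
 *   glVertexPointer(3, GL_FLOAT, VERTEX_STRIDE,
 *                   (const GLvoid *)POSITION_OFFSET);
 *   glNormalPointer(GL_FLOAT, VERTEX_STRIDE,
 *                   (const GLvoid *)NORMAL_OFFSET);
 *   glColorPointer(3, GL_UNSIGNED_BYTE, VERTEX_STRIDE,
 *                  (const GLvoid *)COLOR_OFFSET);
 */
```

If the normals were moved out into a separate shared buffer, as the second bullet suggests, the remaining interleaved data would shrink to 16 bytes per vertex (12 for the position plus 4 for color and padding).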
I'm not interested in answers suggesting LOD (I already tested this), vendor-specific tips or using OpenGL features from anything later than 1.5.