views: 145

answers: 5

Hi guys,

I've been working on a point cloud player that should ideally be able to visualize terrain points from a lidar capture and display them sequentially at around 30 fps. However, I seem to have hit a wall caused by PCI-e I/O.

What I need to do for every frame is load a large point cloud stored in memory, compute a color map based on height (I'm using something akin to MATLAB's jet map), then transfer the data to the GPU. This works fine on captures with fewer than one million points. At about 2 million points, however, it starts dropping below 30 frames per second. I realize this is a lot of data (2 million points per frame * [3 position floats + 3 color floats] * 4 bytes per float * 30 frames per second = around 1.34 GiB per second).

My rendering code looks something like this right now:

glPointSize(ptSize);
glEnableClientState(GL_VERTEX_ARRAY);
if(colorflag) {
    glEnableClientState(GL_COLOR_ARRAY);
} else {
    glDisableClientState(GL_COLOR_ARRAY);
    glColor3f(1, 1, 1);
}

// Upload positions (re-specified every frame)
glBindBuffer(GL_ARRAY_BUFFER, vbobj[VERT_OBJ]);
glBufferData(GL_ARRAY_BUFFER, cloudSize, vertData, GL_STREAM_DRAW);
glVertexPointer(3, GL_FLOAT, 0, 0);

// Upload colors (re-specified every frame)
glBindBuffer(GL_ARRAY_BUFFER, vbobj[COLOR_OBJ]);
glBufferData(GL_ARRAY_BUFFER, cloudSize, colorData, GL_STREAM_DRAW);
glColorPointer(3, GL_FLOAT, 0, 0);

glDrawArrays(GL_POINTS, 0, numPoints);

glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_COLOR_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, 0);

The vertData and colorData pointers change every frame.

What I would like is to play back at a minimum of 30 frames per second, even when later using larger point clouds that might reach 7 million points per frame. Is this even possible? Or would it be easier to grid the points, construct a heightmap, and somehow display that instead? I'm still pretty new to 3D programming, so any advice would be appreciated.

Thanks in advance

+4  A: 

I know nothing about OpenGL, but wouldn't data compression be a natural workaround here? Isn't there support for integer types or 16-bit floats? And what about color representations other than 3 floats per point?
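
For what it's worth, here is a minimal sketch of what that might look like with the same client-state API the question already uses. The quantized arrays (shortVerts, byteColors) and the wrapper function are my own placeholders, not code from the post:

#include <GL/gl.h>

// Hypothetical sketch: positions quantized to GLshort (6 bytes instead of 12)
// and colors stored as GLubyte (3 bytes instead of 12), cutting the per-point
// upload from 24 bytes to 9.
void drawQuantizedCloud(GLuint vertVbo, GLuint colorVbo,
                        const GLshort *shortVerts,   // 3 shorts per point
                        const GLubyte *byteColors,   // 3 bytes per point
                        GLsizei numPoints)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    glBindBuffer(GL_ARRAY_BUFFER, vertVbo);
    glBufferData(GL_ARRAY_BUFFER, numPoints * 3 * sizeof(GLshort), shortVerts, GL_STREAM_DRAW);
    glVertexPointer(3, GL_SHORT, 0, 0);

    glBindBuffer(GL_ARRAY_BUFFER, colorVbo);
    glBufferData(GL_ARRAY_BUFFER, numPoints * 3 * sizeof(GLubyte), byteColors, GL_STREAM_DRAW);
    glColorPointer(3, GL_UNSIGNED_BYTE, 0, 0);

    glDrawArrays(GL_POINTS, 0, numPoints);

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

Note that GL_SHORT positions are not normalized, so the quantization scale/offset has to be undone on the modelview matrix (e.g. with glTranslatef/glScalef), whereas GL_UNSIGNED_BYTE colors (used here rather than GL_BYTE, since color data is normally unsigned) are mapped to [0, 1] automatically.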

Shelwien
Data compression sounds good, but I don't know if I can decompress on the GPU side with OpenGL. I'll try indexing the colors, thanks.
Xzhsh
Well, even real compression might be possible with GPGPU, but I just meant that glVertexPointer apparently supports GL_SHORT and glColorPointer supports GL_BYTE. Do you really need float precision there? Although of course indexing the colors is even better.
Shelwien
+5  A: 

If you can, implement the color map with a 1D texture. You'll only need 1 texture coordinate instead of 3 color components, and it will make the vertices 128-bit aligned too.

EDIT: You just need to create a texture from your colormap and use glTexCoordPointer instead of glColorPointer (and swap the per-vertex color values for texture coordinates in the [0, 1] range, of course). Here's a linearly interpolated 6-texel colormap:

// Create texture
GLuint texture;
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_1D, texture);
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE); // keep the ends of the map from blending into each other
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

// Load textureData
GLubyte colorData[] = {
    0xff, 0x00, 0x00,
    0xff, 0xff, 0x00,
    0x00, 0xff, 0x00,
    0x00, 0xff, 0xff,
    0x00, 0x00, 0xff,
    0xff, 0x00, 0xff
};
glTexImage1D(GL_TEXTURE_1D, 0, GL_RGB, 6, 0, GL_RGB, GL_UNSIGNED_BYTE, colorData);
glEnable(GL_TEXTURE_1D);
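
To connect this with the rendering loop from the question, here is a rough sketch of the draw side (not from the original answer); texCoordData is assumed to hold one float per point in the [0, 1] range, e.g. (height - minHeight) / (maxHeight - minHeight):

// Draw using the 1D colormap: one texture coordinate per point
// replaces the three color floats.
glEnable(GL_TEXTURE_1D);
glBindTexture(GL_TEXTURE_1D, texture);
glColor3f(1, 1, 1);   // with the default GL_MODULATE env, white shows the texel color unchanged

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);

glBindBuffer(GL_ARRAY_BUFFER, vbobj[VERT_OBJ]);
glBufferData(GL_ARRAY_BUFFER, numPoints * 3 * sizeof(GLfloat), vertData, GL_STREAM_DRAW);
glVertexPointer(3, GL_FLOAT, 0, 0);

glBindBuffer(GL_ARRAY_BUFFER, vbobj[COLOR_OBJ]);   // now holds 1 float per point instead of 3
glBufferData(GL_ARRAY_BUFFER, numPoints * sizeof(GLfloat), texCoordData, GL_STREAM_DRAW);
glTexCoordPointer(1, GL_FLOAT, 0, 0);

glDrawArrays(GL_POINTS, 0, numPoints);

glDisableClientState(GL_TEXTURE_COORD_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, 0);

This also brings each point down to 16 bytes (3 position floats + 1 texture coordinate), which is the 128-bit alignment mentioned above.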
Ivan Baldin
Thanks for the answer. I'm not really sure how to use 1D textures as a colormap; is there something I can read to learn? Thanks in advance (sorry, I'm a bit of an OpenGL noob :D)
Xzhsh
Thanks again for the help
Xzhsh
+1  A: 

If you're willing to deal with the latency, you can double- (or more!) buffer your VBOs, transferring geometry into one buffer while rendering from another:

while(true)
    {
    draw_vbo( cur_vbo_id );
    generate_new_geometry();
    load_vbo( nxt_vbo_id );
    swap( cur_vbo_id, nxt_vbo_id );
    }
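
In concrete GL terms, that loop might look something like the sketch below. The two buffer objects in vboIds and the fillNextFrame() helper are placeholders of mine, not from the answer, and GL_VERTEX_ARRAY is assumed to be enabled as in the question:

static GLuint vboIds[2];   // created with glGenBuffers(2, vboIds) at startup
static int cur = 0;

void playOneFrame(GLsizei numPoints)
{
    // Draw from the buffer that was filled on the previous frame
    glBindBuffer(GL_ARRAY_BUFFER, vboIds[cur]);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glDrawArrays(GL_POINTS, 0, numPoints);

    // Meanwhile, stream the next frame's points into the other buffer
    int nxt = 1 - cur;
    const GLfloat *nextVerts = fillNextFrame();   // placeholder for "generate_new_geometry"
    glBindBuffer(GL_ARRAY_BUFFER, vboIds[nxt]);
    glBufferData(GL_ARRAY_BUFFER, numPoints * 3 * sizeof(GLfloat), nextVerts, GL_STREAM_DRAW);

    cur = nxt;   // swap for the next frame
}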

EDIT: You might also try interleaving your vertex attributes instead of using one VBO per component.
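
For the interleaving suggestion, a rough sketch of what the question's two uploads might collapse into; the Vertex struct and interleavedData are my own names, not from the post, and the client states are assumed to be enabled as before:

#include <stddef.h>   // offsetof

typedef struct {
    GLfloat x, y, z;   // position
    GLfloat r, g, b;   // color
} Vertex;              // 24 bytes, the same 6 floats per point as the question

// One buffer and one glBufferData call per frame instead of two;
// stride-based pointers pick the position and color out of each Vertex.
glBindBuffer(GL_ARRAY_BUFFER, vbobj[VERT_OBJ]);
glBufferData(GL_ARRAY_BUFFER, numPoints * sizeof(Vertex), interleavedData, GL_STREAM_DRAW);
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, x));
glColorPointer(3, GL_FLOAT, sizeof(Vertex), (const GLvoid *)offsetof(Vertex, r));
glDrawArrays(GL_POINTS, 0, numPoints);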

genpfault
Latency isn't an issue, but I'm not sure I see how double buffering VBOs will help speed up transfer times. Is there some overhead per VBO? And thanks, I'll try interleaving tomorrow
Xzhsh
A: 

You say it's I/O bound. That implies you've profiled it and seen it spending 50% or more of its time waiting for I/O.

If so, that's what you've got to concentrate on, not the graphics.

If not, then some of the other answers sound good to me. Regardless, profile, don't guess. This is the method I use.

Mike Dunlavey
By I/O bound he means I/O on the PCI-Express bus of the GPU, not HDD or network. You're still right, though.
Calvin1602
I'm fairly sure it's I/O bound because I'm just playing the same frame over and over for testing. It speeds up to 60 fps if I just comment out the memcpy :\.
Xzhsh
@Calvin1602: Thanks for the correction, so edited.
Mike Dunlavey
@Xzhsh wtf? I don't understand anymore. Which memcpy are you talking about?
Calvin1602
Err, I'm sorry, I meant the glBufferData call. What I did to test this: I loaded the same frame data from memory in an initialization function, then continually re-uploaded that one frame every draw call. If I upload the frame data only once, it can go up to 100 fps, but if I upload it every frame (which I'll need to do once I have more data), the fps goes down to something like 10.
Xzhsh
@Xzhsh: So frame time goes from 10 ms to 100 ms. That says that in the slow case 90% of the time is going into that glBufferData call. What I would do is pause the program (with Ctrl-C) and, with 90% probability, you will catch it in the act of spending that time, and you will see exactly why. Once you know exactly why, you may very well get an idea of how to make it faster.
Mike Dunlavey
@Xzhsh: If you haven't heard of that technique, it doesn't surprise me, you being in the home of gprof. Anyway, here's more on the subject: http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
A: 

Some pointers:

  • store as much data as possible on the graphics card and load only what is really needed (pretty obvious; a small sketch of this follows the list)
  • use LOD levels in trees (kd-trees or octrees) and precompute as much as possible up front
  • compression on disk is useful too, to help overcome I/O bottlenecks
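
To make the first point concrete, here is a small sketch (the function names are placeholders, not from the answer): data uploaded once with GL_STATIC_DRAW stays resident on the card, so drawing it again later moves nothing over the bus.

// Upload a cloud once; GL_STATIC_DRAW hints that it will be drawn many
// times without being re-specified, so the driver can keep it in VRAM.
void uploadCloudOnce(GLuint vbo, const GLfloat *vertData, GLsizei numPoints)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, numPoints * 3 * sizeof(GLfloat), vertData, GL_STATIC_DRAW);
}

// Draw the resident cloud: no glBufferData here, so nothing crosses the PCI-e bus.
void drawResidentCloud(GLuint vbo, GLsizei numPoints)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glDrawArrays(GL_POINTS, 0, numPoints);
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}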
Florian