I'm writing my own graphics library (yep, it's homework :) and I use CUDA to do all the rendering and calculations fast.

I have a problem with drawing filled triangles. I wrote it in such a way that one process draws one triangle. It works pretty well when there are a lot of small triangles in the scene, but it completely kills performance when the triangles are big.

My idea is to do two passes. In the first, compute only a table with information about the scanlines (draw from here to there). That would be a per-triangle calculation, like in the current algorithm. In the second pass, actually draw the scanlines with more than one process per triangle.
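Roughly what I mean for the first pass (just a sketch; the Span layout and names are only an example):

    #include <cuda_runtime.h>

    // One entry per (triangle, scanline); xEnd < xStart means "no pixels here".
    // The dense numTriangles * screenHeight layout is only for the sketch.
    struct Span { int xStart, xEnd; };

    // If scanline y crosses edge (p, q), fold the intersection x into [xMin, xMax].
    __device__ void edgeHit(float2 p, float2 q, float y, float &xMin, float &xMax)
    {
        if ((y < p.y) == (y < q.y)) return;   // no crossing (also skips horizontal edges)
        float x = p.x + (y - p.y) * (q.x - p.x) / (q.y - p.y);
        xMin = fminf(xMin, x);
        xMax = fmaxf(xMax, x);
    }

    // Pass 1: one process (thread) per triangle, like in the current algorithm.
    __global__ void buildSpans(const float2 *v0, const float2 *v1, const float2 *v2,
                               int numTriangles, int screenHeight, Span *spans)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= numTriangles) return;

        float2 a = v0[t], b = v1[t], c = v2[t];
        int yMin = max(0, (int)ceilf(fminf(a.y, fminf(b.y, c.y))));
        int yMax = min(screenHeight - 1, (int)floorf(fmaxf(a.y, fmaxf(b.y, c.y))));

        for (int y = 0; y < screenHeight; ++y) {
            Span s = { 1, 0 };                // empty by convention
            if (y >= yMin && y <= yMax) {
                float xMin = 1e30f, xMax = -1e30f;
                edgeHit(a, b, (float)y, xMin, xMax);
                edgeHit(b, c, (float)y, xMin, xMax);
                edgeHit(c, a, (float)y, xMin, xMax);
                if (xMax >= xMin)
                    s = { (int)ceilf(xMin), (int)floorf(xMax) };
            }
            spans[t * screenHeight + y] = s;
        }
    }

The second pass would then launch many processes per triangle and fill the spans in parallel.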

But will it be fast enough? Or is there some better solution?

+1  A: 

I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.

For most CUDA applications, there are two keys to getting good performance: optimizing memory access, and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as the other active threads in the warp. Both of these sound important for your application.

To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means that adjacent threads in a half-warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
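For example, here is a minimal sketch of a coalesced write (assuming the framebuffer is stored as one 4-byte uchar4 per pixel; the names are illustrative):

    // Thread i writes pixel i: adjacent threads in a warp touch adjacent
    // 4-byte locations, so the writes coalesce into few memory transactions.
    __global__ void clearFramebuffer(uchar4 *framebuffer, int numPixels, uchar4 color)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numPixels)
            framebuffer[i] = color;
    }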

If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
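A common shared-memory pattern, sketched here with illustrative names: each thread in a block stages one element, the block synchronizes, and then every thread can cheaply read elements fetched by its neighbors:

    // Assumes the kernel is launched with blockDim.x == 256.
    __global__ void smoothRow(const float *input, float *output, int n)
    {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (i < n) ? input[i] : 0.0f;  // coalesced load
        __syncthreads();  // wait until the whole tile is in shared memory

        // Reuse values loaded by neighboring threads without touching
        // global memory again.
        float left  = tile[max((int)threadIdx.x - 1, 0)];
        float right = tile[min((int)threadIdx.x + 1, 255)];
        if (i < n) output[i] = 0.5f * (left + right);
    }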

In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest a non-trivial amount of effort to get your reads to work efficiently.

For the second part, I would recommend that each CUDA thread process one pixel of your output image. With this strategy, you should watch out for loops in your kernels that run for more or fewer iterations depending on per-thread data. Each thread in a warp should perform the same number of steps in the same order. The only exception is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.

Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
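As a rough sketch of that strategy (the edge-function test is one standard way to do the inside check; the names are illustrative):

    // Signed area test: which side of edge (a, b) is point (px, py) on?
    __device__ float edgeFn(float2 a, float2 b, float px, float py)
    {
        return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
    }

    // One thread per pixel. Threads outside the triangle do nothing, which
    // costs little; threads inside all take the same path.
    __global__ void shadeTriangle(uchar4 *framebuffer, int width, int height,
                                  float2 v0, float2 v1, float2 v2, uchar4 color)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float px = x + 0.5f, py = y + 0.5f;   // sample at the pixel center
        float e0 = edgeFn(v0, v1, px, py);
        float e1 = edgeFn(v1, v2, px, py);
        float e2 = edgeFn(v2, v0, px, py);

        // Inside if the pixel is on the same side of all three edges,
        // regardless of the triangle's winding order.
        bool inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                      (e0 <= 0 && e1 <= 0 && e2 <= 0);
        if (inside)
            framebuffer[y * width + x] = color;   // coalesced along x
    }

Launched with a 2D grid (e.g. dim3 block(16, 16)), the writes along x stay coalesced.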

Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.

Eric
Sorry about my language, English is not my native tongue. So what is the proper terminology for processing on graphics cards? Well, I think I understand CUDA pretty well, but yes, I lack knowledge of parallel algorithms. My input is a set of vertices in clip space, and I have to draw the triangles. I think an algorithm where every pixel checks every triangle wouldn't be optimal.
qba
+2  A: 

First, check out this blog: A Software Rendering Pipeline in CUDA. I don't think it's the optimal way to do it, but at least the author shares some useful sources.

Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent papers, and it's also CUDA-based.

If I had to do this, I would go with a data-parallel rasterization pipeline like Larrabee's (which is tile-based) or even REYES, and adapt it to CUDA:

http://www.ddj.com/architect/217200602

http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)

http://graphics.stanford.edu/papers/mprast/

Stringer Bell
A: 

Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.

Why not use the APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you wish to do additional operations that are not supported, you can use CUDA to apply them on top. Or maybe implement them as shaders.

BobMcGee
Yes, yes indeed. But his goal here is to build a graphics rasterization pipeline WITHOUT the traditional APIs. Think of it as a proof-of-concept or educational project.
Stringer Bell
Yes, it's a project for my studies. We had to do all the rasterization ourselves. Most people use the CPU, but I decided to use CUDA.
qba
Hmm, in that case it sounds like an interesting project. Kind of a back-asswards approach, but interesting nonetheless.
BobMcGee
Yep, until now I tried to avoid all matrices while programming in OpenGL, but now I'm not afraid of them :>
qba