I have written CUDA code to solve an NP-complete problem, but the performance was not what I expected.
I know about "some" optimization techniques (using shared memory, textures, zero-copy, ...).
What are the most important optimization techniques CUDA programmers should know about?