Summary
This chapter explained how a kernel function is launched with multiple blocks, each containing multiple threads, and showed how to choose these two launch parameters when the total number of threads is large. It also described the hierarchical memory architecture available to CUDA programs: memory closer to the executing threads is fast, and memory becomes slower the farther it is from them. When threads need to communicate with one another, CUDA provides shared memory, through which threads of the same block can exchange data. When multiple threads access the same memory location, those accesses must be synchronized, otherwise the final result will not be as expected; we saw how atomic operations accomplish this synchronization. Parameters that remain constant throughout kernel execution can be stored in constant memory for a speed-up. When CUDA programs exhibit a certain communication pattern...
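The block/thread launch configuration for a large number of elements can be sketched as follows. This is a minimal illustration, not code from the chapter; the kernel name `addOne` and the problem size are assumptions:

```cuda
#include <cstdio>

// Hypothetical element-wise kernel: each thread handles one index.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may have spare threads
        data[i] += 1;
}

int main() {
    const int N = 1 << 20;            // one million elements (assumed size)
    const int threadsPerBlock = 256;  // a common, device-friendly choice
    // Round up so every element gets a thread, even when N is not a
    // multiple of threadsPerBlock.
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMemset(d_data, 0, N * sizeof(int));
    addOne<<<blocks, threadsPerBlock>>>(d_data, N);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The rounded-up grid plus the `i < n` bounds check is the standard pattern for covering an arbitrary number of elements with a fixed block size.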
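Shared-memory communication within a block and atomic synchronization on global memory can both be seen in a sum reduction. This sketch is an assumed example (kernel name `blockSum`, block size 256), not the chapter's own listing:

```cuda
#include <cstdio>

// Threads of one block cooperate through shared memory; then thread 0
// atomically adds the block's partial sum to the global result.
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int cache[256];           // visible only to this block
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    cache[tid] = (i < n) ? in[i] : 0;
    __syncthreads();                     // all writes done before any reads

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(out, cache[0]);        // serializes the conflicting update
}

int main() {
    const int N = 1000;
    int h_in[N], h_out = 0;
    for (int i = 0; i < N; ++i) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    blockSum<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);         // prints "sum = 1000"

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Without `atomicAdd`, the blocks' concurrent updates to `*out` would race and the total would be unpredictable, which is exactly the synchronization problem described above.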
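Storing unchanging parameters in constant memory might look like the following sketch, assuming a polynomial whose coefficients are fixed for the whole kernel launch (the names `coeff` and `polyEval` are illustrative):

```cuda
#include <cstdio>

// Coefficients never change during the kernel, so they live in constant
// memory, which is cached and broadcast when all threads read the same address.
__constant__ float coeff[4];

__global__ void polyEval(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // Horner's rule: coeff[0] + v*coeff[1] + v^2*coeff[2] + v^3*coeff[3]
        y[i] = coeff[0] + v * (coeff[1] + v * (coeff[2] + v * coeff[3]));
    }
}

int main() {
    float h_coeff[4] = {1.0f, 0.0f, 1.0f, 0.0f};      // polynomial 1 + x^2
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));  // host -> constant

    const int N = 8;
    float h_x[N], h_y[N];
    for (int i = 0; i < N; ++i) h_x[i] = (float)i;

    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);

    polyEval<<<1, N>>>(d_x, d_y, N);
    cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("%.0f ", h_y[i]);    // prints "1 2 5 10 17 26 37 50"

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Note that `__constant__` variables are written from the host with `cudaMemcpyToSymbol` and are read-only inside the kernel.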