Summary
In this chapter, we have seen some advanced concepts in CUDA that help in developing complex applications. We have seen how to measure the performance of device code and how to view a detailed profile of a kernel function using the NVIDIA Visual Profiler, which helps identify the operations that slow down a program. We have seen methods to handle errors in hardware operations from within the CUDA code itself, as well as methods to debug the code using tools. The CPU provides efficient task parallelism, in which two completely different functions execute in parallel. We have seen that the GPU also provides this functionality through CUDA streams, and we achieved a 2x speedup on the same vector addition program by using them. Then we have seen how to accelerate a sorting algorithm using CUDA, which is an important concept to understand when building complex computing applications. Image processing is a computationally intensive task that needs to be performed...