





















































(For more resources related to this topic, see here.)
Let's start the journey by looking back into the history of computing and why OpenCL is important from the respect that it aims to unify the software programming model for heterogeneous devices. The goal of OpenCL is to develop a royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices. This effort is taken by "The Khronos Group" along with the participation of companies such as Intel, ARM, AMD, NVIDIA, QUALCOMM, Apple, and many others. OpenCL allows the software to be written once and then executed on the devices that support it. In this way it is akin to Java, this has benefits because software development on these devices now has a uniform approach, and OpenCL does this by exposing the hardware via various data structures, and these structures interact with the hardware via Application Programmable Interfaces ( APIs ). Today, OpenCL supports CPUs that includes x86s, ARM and PowerPC and GPUs by AMD, Intel, and NVIDIA.
Developers can definitely appreciate the fact that we need to develop software that is cross-platform compatible, since it allows the developers to develop an application on whatever platform they are comfortable with, without mentioning that it provides a coherent model in which we can express our thoughts into a program that can be executed on any device that supports this standard. However, what cross-platform compatibility also means is the fact that heterogeneous environments exists, and for quite some time, developers have to learn and grapple with the issues that arise when writing software for those devices ranging from execution model to memory systems. Another task that commonly arose from developing software on those heterogeneous devices is that developers were expected to express and extract parallelism from them as well. Before OpenCL, we know that various programming languages and their philosophies were invented to handle the aspect of expressing parallelism (for example, Fortran, OpenMP, MPI, VHDL, Verilog, Cilk, Intel TBB, Unified parallel C, Java among others) on the device they executed on. But these tools were designed for the homogeneous environments, even though a developer may think that it's to his/her advantage, since it adds considerable expertise to their resume. Taking a step back and looking at it again reveals that is there is no unified approach to express parallelism in heterogeneous environments. We need not mention the amount of time developers need to be productive in these technologies, since parallel decomposition is normally an involved process as it's largely hardware dependent. To add salt to the wound, many developers only have to deal with homogeneous computing environments, but in the past few years the demand for heterogeneous computing environments grew.
The demand for heterogeneous devices grew partially due to the need for high performance and highly reactive systems, and with the "power wall" at play, one possible way to improve more performance was to add specialized processing units in the hope of extracting every ounce of parallelism from them, since that's the only way to reach power efficiency. The primary motivation for this shift to hybrid computing could be traced to the research headed entitled Optimizing power using Transformations by Anantha P. Chandrakasan. It brought out a conclusion that basically says that many-core chips (which run at a slightly lower frequency than a contemporary CPU) are actually more power-efficient. The problem with heterogeneous computing without a unified development methodology, for example, OpenCL, is that developers need to grasp several types of ISA and with that the various levels of parallelism and their memory systems are possible. CUDA, the GPGPU computing toolkit, developed by NVIDIA deserves a mention not only because of the remarkable similarity it has with OpenCL, but also because the toolkit has a wide adoption in academia as well as industry. Unfortunately CUDA can only drive NVIDIA's GPUs.
The ability to extract parallelism from an environment that's heterogeneous is an important one simply because the computation should be parallel, otherwise it would defeat the entire purpose of OpenCL. Fortunately, major processor companies are part of the consortium led by The Khronos Group and actively realizing the standard through those organizations. Unfortunately the story doesn't end there, but the good thing is that we, developers, realized that a need to understand parallelism and how it works in both homogeneous and heterogeneous environments. OpenCL was designed with the intention to express parallelism in a heterogeneous environment.
For a long time, developers have largely ignored the fact that their software needs to take advantage of the multi-core machines available to them and continued to develop their software in a single-threaded environment, but that is changing (as discussed previously). In the many-core world, developers need to grapple with the concept of concurrency, and the advantage of concurrency is that when used effectively, it maximizes the utilization of resources by providing progress to others while some are stalled.
When software is executed concurrently with multiple processing elements so that threads can run simultaneously, we have parallel computation. The challenge that the developer has is to discover that concurrency and realize it. And in OpenCL, we focus on two parallel programming models: task parallelism and data parallelism.
Task parallelism means that developers can create and manipulate concurrent tasks. When developers are developing a solution for OpenCL, they would need to decompose a problem into different tasks and some of those tasks can be run concurrently, and it is these tasks that get mapped to processing elements ( PEs ) of a parallel environment for execution. On the other side of the story, there are tasks that cannot be run concurrently and even possibly interdependent. An additional complexity is also the fact that data can be shared between tasks.
When attempting to realize data parallelism, the developer needs to readjust the way they think about data and how they can be read and updated concurrently. A common problem found in parallel computation would be to compute the sum of all the elements given in an arbitrary array of values, while storing the intermediary summed value and one possible way to do this is illustrated in the following diagram and the operator being applied there, that is, is any binary associative operator. Conceptually, the developer could use a task to perform the addition of two elements of that input to derive the summed value.
Whether the developer chooses to embody task/data parallelism is dependent on the problem, and an example where task parallelism would make sense will be by traversing a graph. And regardless of which model the developer is more inclined with, they come with their own sets of problems when you start to map the program to the hardware via OpenCL. And before the advent of OpenCL, the developer needs to develop a module that will execute on the desired device and communication, and I/O with the driver program. An example example of this would be a graphics rendering program where the CPU initializes the data and sets everything up, before offloading the rendering to the GPU. OpenCL was designed to take advantage of all devices detected so that resource utilization is maximized, and hence in this respect it differs from the "traditional" way of software development.
Now that we have established a good understanding of OpenCL, we should spend some time understanding how a developer can learn it. And not to fret, because every project you embark with, OpenCL will need you to understand the following:
Next, we need to solidify the preceding points by taking a deeper look into the various components of OpenCL. The following components collectively make up the OpenCL architecture:
When a kernel is submitted for execution, an index space is defined such that a work item is instantiated to execute each point in that space. A work item would be identified by its global ID and it executes the same code as expressed in the kernel. Work items are grouped into work groups and each work group is given an ID commonly known as its work group ID, and it is the work group's work items that get executed concurrently on the PEs of a single CU.
That index space we mentioned earlier is known as NDRange describing an N-dimensional space, where N can range from one to three. Each work item has a global ID and a local ID when grouped into work groups, that is distinct from the other and is derived from NDRange. The same can be said about work group IDs. Let's use a simple example to illustrate how they work.
Given two arrays, A and B, of 1024 elements each, we would like to perform the computation of vector multiplication also known as dot product, where each element of A would be multiplied by the corresponding element in B. The kernel code would look something as follows:
__kernel void vector_multiplication(__global int* a,
__global int* b,
__global int* c) {
int threadId = get_global_id(0); // OpenCL function
c[i] = a[i] * b[i];
}
In this scenario, let's assume we have 1024 processing elements and we would assign one work item to perform exactly one multiplication, and in this case our work group ID would be zero (since there's only one group) and work items IDs would range from {0 … 1023}. Recall what we discussed earlier, that it is the work group's work items that can executed on the PEs. Hence reflecting back, this would not be a good way of utilizing the device.
In this same scenario, let's ditch the former assumption and go with this: we still have 1024 elements but we group four work items into a group, hence we would have 256 work groups with each work group having an ID ranging from {0 … 255}, but it is noticed that the work item's global ID still would range from {0 … 1023} simply because we have not increased the number of elements to be processed. This manner of grouping work items into their work groups is to achieve scalability in these devices, since it increases execution efficiency by ensuring all PEs have something to work on.
The NDRange can be conceptually mapped into an N-dimensional grid and the following diagram illustrates how a 2DRange works, where WG-X denotes the length in rows for a particular work group and WG-Y denotes the length in columns for a work group, and how work items are grouped including their respective IDs in a work group.
Before the execution of the kernels on the device(s), the host program plays an important role and that is to establish context with the underlying devices and laying down the order of execution of the tasks. The host program does the context creation by establishing the existence (creating if necessary) of the following:
The application running on the host uses the OpenCL API to create memory objects in global memory and will enqueue memory commands to the command queue to operate on them. The host's responsibility is to ensure that data is available to the device when the kernel starts execution, and it does so by copying data or by mapping/unmapping regions of memory objects. During a typical data transfer from the host memory to the device memory, OpenCL commands are issued to queues which may be blocking or non-blocking. The primary difference between a blocking and non-blocking memory transfer is that in the former, the function calls return only once (after being queued) it is deemed safe, and in the latter the call returns as soon as the command is enqueued.
Memory mapping in OpenCL allows a region of memory space to be available for computation and this region can be blocking or non-blocking and the developer can treat this space as readable or writeable or both.
Hence forth, we are going to focus on getting the basics of OpenCL by letting our hands get dirty in developing small OpenCL programs to understand a bit more, programmatically, how to use the platform and execution model of OpenCL.
The OpenCL specification Version 1.2 is an open, royalty-free standard for general purpose programming across various devices ranging from mobile to conventional CPUs, and lately GPUs through an API and the standard at the time of writing supports:
Throughout this article, we are going to show you how you can become proficient in programming OpenCL.
As you go through the article, you'll discover not only how to use the API to perform all kinds of operations on your OpenCL devices, but you'll also learn how to model a problem and transform it from a serial program to a parallel program. More often than not, the techniques you'll learn can be transferred to other programming toolsets.
In the toolsets, I have worked with OpenCLTM, CUDATM, OpenMPTM, MPITM, Intel thread building blocksTM, CilkTM, CilkPlusTM, which allows the developer to express parallelism in a homogeneous environment and find the entire process of learning the tools to application of knowledge to be classified into four parts. These four phases are rather common and I find it extremely helpful to remember them as I go along. I hope you will be benefited from them as well.
Don't worry about these concepts, they'll be explained as we move through the article.
The next few recipes we are going to examine have to do with understanding the usage of OpenCL APIs, by focusing our efforts in understanding the platform model of the architecture.