Using OpenCL

(For more resources related to this topic, see here.)

Let's start the journey by looking back into the history of computing and why OpenCL is important from the respect that it aims to unify the software programming model for heterogeneous devices. The goal of OpenCL is to develop a royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices. This effort is taken by "The Khronos Group" along with the participation of companies such as Intel, ARM, AMD, NVIDIA, QUALCOMM, Apple, and many others. OpenCL allows the software to be written once and then executed on the devices that support it. In this way it is akin to Java, this has benefits because software development on these devices now has a uniform approach, and OpenCL does this by exposing the hardware via various data structures, and these structures interact with the hardware via Application Programmable Interfaces ( APIs ). Today, OpenCL supports CPUs that includes x86s, ARM and PowerPC and GPUs by AMD, Intel, and NVIDIA.

Developers can definitely appreciate the fact that we need to develop software that is cross-platform compatible, since it allows the developers to develop an application on whatever platform they are comfortable with, without mentioning that it provides a coherent model in which we can express our thoughts into a program that can be executed on any device that supports this standard. However, what cross-platform compatibility also means is the fact that heterogeneous environments exists, and for quite some time, developers have to learn and grapple with the issues that arise when writing software for those devices ranging from execution model to memory systems. Another task that commonly arose from developing software on those heterogeneous devices is that developers were expected to express and extract parallelism from them as well. Before OpenCL, we know that various programming languages and their philosophies were invented to handle the aspect of expressing parallelism (for example, Fortran, OpenMP, MPI, VHDL, Verilog, Cilk, Intel TBB, Unified parallel C, Java among others) on the device they executed on. But these tools were designed for the homogeneous environments, even though a developer may think that it's to his/her advantage, since it adds considerable expertise to their resume. Taking a step back and looking at it again reveals that is there is no unified approach to express parallelism in heterogeneous environments. We need not mention the amount of time developers need to be productive in these technologies, since parallel decomposition is normally an involved process as it's largely hardware dependent. To add salt to the wound, many developers only have to deal with homogeneous computing environments, but in the past few years the demand for heterogeneous computing environments grew.

The demand for heterogeneous devices grew partially due to the need for high performance and highly reactive systems, and with the "power wall" at play, one possible way to improve more performance was to add specialized processing units in the hope of extracting every ounce of parallelism from them, since that's the only way to reach power efficiency. The primary motivation for this shift to hybrid computing could be traced to the research headed entitled Optimizing power using Transformations by Anantha P. Chandrakasan. It brought out a conclusion that basically says that many-core chips (which run at a slightly lower frequency than a contemporary CPU) are actually more power-efficient. The problem with heterogeneous computing without a unified development methodology, for example, OpenCL, is that developers need to grasp several types of ISA and with that the various levels of parallelism and their memory systems are possible. CUDA, the GPGPU computing toolkit, developed by NVIDIA deserves a mention not only because of the remarkable similarity it has with OpenCL, but also because the toolkit has a wide adoption in academia as well as industry. Unfortunately CUDA can only drive NVIDIA's GPUs.

The ability to extract parallelism from an environment that's heterogeneous is an important one simply because the computation should be parallel, otherwise it would defeat the entire purpose of OpenCL. Fortunately, major processor companies are part of the consortium led by The Khronos Group and actively realizing the standard through those organizations. Unfortunately the story doesn't end there, but the good thing is that we, developers, realized that a need to understand parallelism and how it works in both homogeneous and heterogeneous environments. OpenCL was designed with the intention to express parallelism in a heterogeneous environment.

For a long time, developers have largely ignored the fact that their software needs to take advantage of the multi-core machines available to them and continued to develop their software in a single-threaded environment, but that is changing (as discussed previously). In the many-core world, developers need to grapple with the concept of concurrency, and the advantage of concurrency is that when used effectively, it maximizes the utilization of resources by providing progress to others while some are stalled.

When software is executed concurrently with multiple processing elements so that threads can run simultaneously, we have parallel computation. The challenge that the developer has is to discover that concurrency and realize it. And in OpenCL, we focus on two parallel programming models: task parallelism and data parallelism.

Task parallelism means that developers can create and manipulate concurrent tasks. When developers are developing a solution for OpenCL, they would need to decompose a problem into different tasks and some of those tasks can be run concurrently, and it is these tasks that get mapped to processing elements ( PEs ) of a parallel environment for execution. On the other side of the story, there are tasks that cannot be run concurrently and even possibly interdependent. An additional complexity is also the fact that data can be shared between tasks.

When attempting to realize data parallelism, the developer needs to readjust the way they think about data and how they can be read and updated concurrently. A common problem found in parallel computation would be to compute the sum of all the elements given in an arbitrary array of values, while storing the intermediary summed value and one possible way to do this is illustrated in the following diagram and the operator being applied there, that is, using-opencl-img-0 is any binary associative operator. Conceptually, the developer could use a task to perform the addition of two elements of that input to derive the summed value.

using-opencl-img-1

Whether the developer chooses to embody task/data parallelism is dependent on the problem, and an example where task parallelism would make sense will be by traversing a graph. And regardless of which model the developer is more inclined with, they come with their own sets of problems when you start to map the program to the hardware via OpenCL. And before the advent of OpenCL, the developer needs to develop a module that will execute on the desired device and communication, and I/O with the driver program. An example example of this would be a graphics rendering program where the CPU initializes the data and sets everything up, before offloading the rendering to the GPU. OpenCL was designed to take advantage of all devices detected so that resource utilization is maximized, and hence in this respect it differs from the "traditional" way of software development.

Now that we have established a good understanding of OpenCL, we should spend some time understanding how a developer can learn it. And not to fret, because every project you embark with, OpenCL will need you to understand the following:

Discover the makeup of the heterogeneous system you are developing for
Understand the properties of those devices by probing it
Start the parallel program decomposition using either or all of task parallelism or data parallelism, by expressing them into instructions also known as kernels that will run on the platform
Set up data structures for the computation
Manipulate memory objects for the computation
Execute the kernels in the order that's desired on the proper device
Collate the results and verify for correctness

Next, we need to solidify the preceding points by taking a deeper look into the various components of OpenCL. The following components collectively make up the OpenCL architecture:

Platform Model: A platform is actually a host that is connected to one or more OpenCL devices. Each device comprises possibly multiple compute units ( CUs ) which can be decomposed into one or possibly multiple processing elements, and it is on the processing elements where computation will run.
Execution Model: Execution of an OpenCL program is such that the host program would execute on the host, and it is the host program which sends kernels to execute on one or more OpenCL devices on that platform.
When a kernel is submitted for execution, an index space is defined such that a work item is instantiated to execute each point in that space. A work item would be identified by its global ID and it executes the same code as expressed in the kernel. Work items are grouped into work groups and each work group is given an ID commonly known as its work group ID, and it is the work group's work items that get executed concurrently on the PEs of a single CU.

That index space we mentioned earlier is known as NDRange describing an N-dimensional space, where N can range from one to three. Each work item has a global ID and a local ID when grouped into work groups, that is distinct from the other and is derived from NDRange. The same can be said about work group IDs. Let's use a simple example to illustrate how they work.

Given two arrays, A and B, of 1024 elements each, we would like to perform the computation of vector multiplication also known as dot product, where each element of A would be multiplied by the corresponding element in B. The kernel code would look something as follows:
```
__kernel void vector_multiplication(__global int* a,
                                    __global int* b,
                                    __global int* c) {
int threadId = get_global_id(0); // OpenCL function
c[i] = a[i] * b[i];
}

    
        Unlock access to the largest independent learning library in Tech for FREE!
        
            
                Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
                Renews at ₹800/month. Cancel anytime
            
            
                                    
                        
                    
                            
        
    
```
In this scenario, let's assume we have 1024 processing elements and we would assign one work item to perform exactly one multiplication, and in this case our work group ID would be zero (since there's only one group) and work items IDs would range from {0 … 1023}. Recall what we discussed earlier, that it is the work group's work items that can executed on the PEs. Hence reflecting back, this would not be a good way of utilizing the device.

In this same scenario, let's ditch the former assumption and go with this: we still have 1024 elements but we group four work items into a group, hence we would have 256 work groups with each work group having an ID ranging from {0 … 255}, but it is noticed that the work item's global ID still would range from {0 … 1023} simply because we have not increased the number of elements to be processed. This manner of grouping work items into their work groups is to achieve scalability in these devices, since it increases execution efficiency by ensuring all PEs have something to work on.

The NDRange can be conceptually mapped into an N-dimensional grid and the following diagram illustrates how a 2DRange works, where WG-X denotes the length in rows for a particular work group and WG-Y denotes the length in columns for a work group, and how work items are grouped including their respective IDs in a work group.

Before the execution of the kernels on the device(s), the host program plays an important role and that is to establish context with the underlying devices and laying down the order of execution of the tasks. The host program does the context creation by establishing the existence (creating if necessary) of the following:
- All devices to be used by the host program
- The OpenCL kernels, that is, functions and their abstractions that will run on those devices
- The memory objects that encapsulated the data to be used / shared by the OpenCL kernels.
- Once that is achieved, the host needs to create a data structure called a command queue that will be used by the host to coordinate the execution of the kernels on the devices and commands are issued to this queue and scheduled onto the devices. A command queue can accept: kernel execution commands, memory transfer commands, and synchronization commands. Additionally, the command queues can execute the commands in-order, that is, in the order they've been given, or out-of-order. If the problem is decomposed into independent tasks, it is possible to create multiple command queues targeting different devices and scheduling those tasks onto them, and then OpenCL will run them concurrently.
Memory Model: So far, we have understood the execution model and it's time to introduce the memory model that OpenCL has stipulated. Recall that when the kernel executes, it is actually the work item that is executing its instance of the kernel code. Hence the work item needs to read and write the data from memory and each work item has access to four types of memories: global, constant, local, and private. These memories vary from size as well as accessibilities, where global memory has the largest size and is most accessible to work items, whereas private memory is possibly the most restrictive in the sense that it's private to the work item. The constant memory is a read-only memory where immutable objects are stored and can be shared with all work items. The local memory is only available to all work items executing in the work group and is held by each compute unit, that is, CU-specific.
The application running on the host uses the OpenCL API to create memory objects in global memory and will enqueue memory commands to the command queue to operate on them. The host's responsibility is to ensure that data is available to the device when the kernel starts execution, and it does so by copying data or by mapping/unmapping regions of memory objects. During a typical data transfer from the host memory to the device memory, OpenCL commands are issued to queues which may be blocking or non-blocking. The primary difference between a blocking and non-blocking memory transfer is that in the former, the function calls return only once (after being queued) it is deemed safe, and in the latter the call returns as soon as the command is enqueued.

Memory mapping in OpenCL allows a region of memory space to be available for computation and this region can be blocking or non-blocking and the developer can treat this space as readable or writeable or both.

Hence forth, we are going to focus on getting the basics of OpenCL by letting our hands get dirty in developing small OpenCL programs to understand a bit more, programmatically, how to use the platform and execution model of OpenCL.

The OpenCL specification Version 1.2 is an open, royalty-free standard for general purpose programming across various devices ranging from mobile to conventional CPUs, and lately GPUs through an API and the standard at the time of writing supports:

Data and task based parallel programming models
Implements a subset of ISO C99 with extensions for parallelism with some restrictions such as recursion, variadic functions, and macros which are not supported
Mathematical operations comply to the IEEE 754 specification
Porting to handheld and embedded devices can be accomplished by establishing configuration profiles
Interoperability with OpenGL, OpenGL ES, and other graphics APIs

Throughout this article, we are going to show you how you can become proficient in programming OpenCL.

As you go through the article, you'll discover not only how to use the API to perform all kinds of operations on your OpenCL devices, but you'll also learn how to model a problem and transform it from a serial program to a parallel program. More often than not, the techniques you'll learn can be transferred to other programming toolsets.

In the toolsets, I have worked with OpenCL^TM, CUDA^TM, OpenMP^TM, MPI^TM, Intel thread building blocks^TM, Cilk^TM, CilkPlus^TM, which allows the developer to express parallelism in a homogeneous environment and find the entire process of learning the tools to application of knowledge to be classified into four parts. These four phases are rather common and I find it extremely helpful to remember them as I go along. I hope you will be benefited from them as well.

Finding concurrency: The programmer works in the problem domain to identify the available concurrency and expose it to use in the algorithm design
Algorithm structure: The programmer works with high-level structures for organizing a parallel algorithm
Supporting Structures: This refers to how the parallel program will be organized and the techniques used to manage shared data
Implementation mechanisms: The final step is to look at specific software constructs for implementing a parallel program.

Don't worry about these concepts, they'll be explained as we move through the article.

The next few recipes we are going to examine have to do with understanding the usage of OpenCL APIs, by focusing our efforts in understanding the platform model of the architecture.