Neptun Wave Fall 2018

3 September - 1 October

Programming on GPUs
6. 6. 2018

Killian Keller


Processing large amounts of data costs time. A lot of time. What if I told you that you can accelerate operations on large datasets by using your GPU? GPUs are optimized to handle repetitive operations on large datasets, and by running your data processing on the GPU you can save a lot of time. This text introduces the reader to GPU programming with some easy examples and will hopefully lay the foundation for accelerated data processing.

There are a couple of different frameworks to choose from, most notably CUDA from NVidia and OpenCL, an open standard that is also supported by AMD.

Programming on a GPU – Python

In Python, we can work with either pycuda or pyopencl. This text will focus on pyopencl, as it works with both NVidia and AMD graphics cards.

First, you will need the pyopencl package itself. Instructions on how to install Python packages can be found here. As promised, we will present an easy example to demonstrate GPU programming: a simple vector addition on the GPU, following the official OpenCL example.

We first import the packages into our environment (import ...) and declare the input variables (A_local = ...): two random vectors of length 50000. Setting os.environ['PYOPENCL_COMPILER_OUTPUT'] makes pyopencl show the output of the GPU compiler, which is important for seeing whether compilation succeeded or failed. Setting os.environ['PYOPENCL_CTX'] preselects the device that the OpenCL context will use, so you are not prompted interactively.

import pyopencl as gpu
import numpy as np

import os

os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
os.environ['PYOPENCL_CTX'] = '1'

A_local = np.random.randn(50000).astype(np.float32)  # float32 matches the kernel's "float" type
B_local = np.random.randn(50000).astype(np.float32)


Next, you have to set up a context for your program (create_some_context) and a command queue (CommandQueue). This is necessary because the GPU only accepts work through a command queue, processing one command after another.

ctx = gpu.create_some_context()
queue = gpu.CommandQueue(ctx)


The next thing we need to do is move the input arrays into GPU buffers. The variable mf is a shorthand for the mem_flags class, which exposes the memory flags in human-readable form. Combining two flags with the bitwise OR operator passes both to the GPU at the same time: READ_ONLY tells the GPU that the buffers are read-only for the kernel, and COPY_HOST_PTR tells it to copy the contents of the host buffers into device memory.

mf = gpu.mem_flags
A_gpu = gpu.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A_local)
B_gpu = gpu.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=B_local)


Now we want to declare the actual program, usually a simple function. This part is written in OpenCL C, a dialect of C. It defines the array addition as a kernel function taking three pointers to arrays: two summands and one result array. At the beginning of the function, it requests the global iteration index, given by get_global_id(0). If our arrays had n dimensions, we would use the indices get_global_id(0) through get_global_id(n-1) to process them correctly. When executing the program, we need to pass the length of the arrays, since otherwise the kernel would read from forbidden memory sectors (out-of-bounds access) and crash the program.

prg = gpu.Program(ctx, """
__kernel void sum(__global const float *a_g, __global const float *b_g, __global float *res_g){
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

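Conceptually, the GPU launches one instance of this kernel per array element, and get_global_id(0) tells each instance which element it owns. The following is a plain-Python sketch of that execution model, purely illustrative and using small stand-in arrays, with no GPU involved:

```python
import numpy as np

# Small stand-in arrays (the real example uses length 50000)
a_g = np.random.randn(8).astype(np.float32)
b_g = np.random.randn(8).astype(np.float32)
res_g = np.empty_like(a_g)

def sum_kernel(gid):
    # Body of the OpenCL kernel above, executed once per work item
    res_g[gid] = a_g[gid] + b_g[gid]

# The GPU runs these work items in parallel; here we loop sequentially
for gid in range(a_g.shape[0]):
    sum_kernel(gid)
```

The speedup on a GPU comes precisely from the fact that this loop does not run sequentially there: each work item is handled by its own thread.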
The next thing we do is create the result buffer on the GPU, execute the calculation and retrieve the result. The command gpu.Buffer creates a write-only buffer of the same size as A_local in the context ctx. We then execute the program with prg.sum: the first argument is the command queue we created earlier, the second is the shape of the array (so the global id does not overflow), the third (None) lets OpenCL choose the work-group size itself, and the last three arguments are the buffers. Finally, we create an array with the desired shape (like A_local) on the host and copy the result from the GPU into it (gpu.enqueue_copy).

res_gpu = gpu.Buffer(ctx, mf.WRITE_ONLY, A_local.nbytes)
prg.sum(queue, A_local.shape, None, A_gpu, B_gpu, res_gpu)

res_local = np.empty_like(A_local)
gpu.enqueue_copy(queue, res_local, res_gpu)
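It is good practice to verify the copied-back result against a CPU reference. Below is a NumPy-only sketch of that check; it uses stand-in arrays so it runs without a GPU, whereas in the real script you would compare the res_local returned by enqueue_copy:

```python
import numpy as np

# Stand-in arrays; in the real script A_local, B_local and res_local
# come from the code above
A_local = np.random.randn(50000).astype(np.float32)
B_local = np.random.randn(50000).astype(np.float32)
res_local = (A_local + B_local).copy()  # pretend this came back via enqueue_copy

# allclose tolerates small float32 rounding differences
assert np.allclose(res_local, A_local + B_local)
print("GPU and CPU results agree")
```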


For additional information and concepts like parallel computing, shared memories and multi-dimensional arrays, I invite you to visit PyOpenCL’s documentation, found here.

Programming on a GPU – MATLAB

Performing calculations on a GPU in MATLAB is much easier than in Python. However, it does not offer as much freedom as the Python version, and MATLAB does not support OpenCL. All you have to do is select your gpuDevice and create your arrays; if you only need to run built-in functions, you're already done.

To select a graphics card, use the command gpuDevice. If you have multiple graphics cards in your setup, call the command with an index. To deselect the device currently in use, call gpuDevice with an empty array ([]). Next, you need to create an array and move it onto the GPU: create the array as usual and transfer it with the command gpuArray. You can then process the data on the GPU using certain built-in functions; the complete list is given in the MATLAB reference. To collect your processed data, move your array back with the command gather.

d = gpuDevice(1)

N = 300
M = magic(N)
G = gpuArray(M)

R = inv(G)

result = gather(R)

It is also very simple to run your own functions on the GPU. Basically, you run the built-in MATLAB function arrayfun with your function handle as the first argument. If any of the subsequent array arguments is a gpuArray, MATLAB will automatically execute the function on the GPU.

function Aout = Amplifier(Ain, gain, Noise)
         Aout = (Ain.*gain) + Noise;
end

meas = ...  % Read some real values
gain = 100

amplified = arrayfun(@Amplifier, meas, gain, rand(1000,'gpuArray'))
result = gather(amplified)

plot(result)
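For comparison, here is the same elementwise amplification expressed in NumPy (CPU-only; the measurement values are made up for illustration):

```python
import numpy as np

def amplifier(ain, gain, noise):
    # Elementwise amplification plus noise, mirroring the MATLAB function
    return (ain * gain) + noise

meas = np.random.rand(1000)   # stand-in for real measurements
gain = 100
noise = np.random.rand(1000)

amplified = amplifier(meas, gain, noise)
```

NumPy applies the arithmetic elementwise just like arrayfun, but on the CPU; moving this onto the GPU is exactly what the MATLAB version above achieves with a gpuArray argument.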


Programming on a GPU – Other

Most programming languages support GPUs, especially low-level languages like C, C++ and Fortran. Unfortunately, covering all of them would exceed the limits of this text. For C++ you can use the document published by the Khronos OpenCL Working Group, for C you can also use a document from the Khronos Group, and for Fortran you can start here.

Simulation programs that use linear solvers can benefit heavily from GPU acceleration. However, you need to check with the software provider how to enable it.
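As an illustration of the kind of operation such solvers perform, here is a small linear system solved on the CPU with NumPy; GPU-accelerated libraries expose analogous routines. The matrix and right-hand side are made up for illustration:

```python
import numpy as np

# A small, well-conditioned system A x = b (values chosen for illustration)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)

# Check the solution by substituting it back into the system
assert np.allclose(A @ x, b)
print(x)
```

Real simulations solve systems with millions of unknowns, which is where the GPU's parallelism pays off.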


If you made it to this point – congratulations. You have laid the foundation for your GPU programming. As with every programming technique, all that matters now is practice. So go ahead, accelerate your existing code and compare it with purely CPU-run code. This is possible on any device with a suitable graphics card – e.g. a workstation with a dedicated graphics card like the Dell Precision 5520, HP EliteBook 850 G5 or ZBook 15 G4, the Thinkpad T470p and P51, or the MacBooks equipped with a GPU. If you don't have such a workstation but still want to benefit from the speedup of general-purpose GPU programming, don't worry: using an HP Omen Accelerator and a Thunderbolt 3 interface, you can connect virtually any graphics card externally to your computer.

Further information on GPU programming can be found here for MATLAB, and for Python on the Python website, the Nvidia website and the PyOpenCL website.

