Visucore Blog

Python, Parallel Processing, Graphics, Open Source, Embedded Systems, Game Development

OpenCL on i.MX6

written by wladimir, on Mar 12, 2013 6:41:00 PM.

As a GPGPU aficionado, I am pretty excited to be able to run OpenCL on a small, low-power device. While playing around with it I wrote down some notes and performance tips.

I'll begin by mentioning the software versions used and a workaround that had to be in place to build against OpenCL at all. After that I'll list some performance tips, and the article ends with some limitations encountered along the way.

Version

Hardware: GC2000, revision 0x5108, 4 shader cores.

Software: I wanted to put the CL compiler version here, but alas the driver refused to cough up any specific version information using the OpenCL API. Anyhow, it is included with

  • Freescale rootfs "L3.0.35_1.1.0_oneiric.tgz"
  • libHAL.so reports Vivante version "4.6.9:1478"

Newer versions of the driver may have solved (some of) the limitations mentioned below.

=== 1 OpenCL platform(s) found: ===
  -- 0 --
  PROFILE = EMBEDDED_PROFILE
  VERSION = OpenCL 1.1 
  NAME = Vivante OpenCL Platform
  VENDOR = Vivante Corporation
  EXTENSIONS = 
=== 1 OpenCL device(s) found on platform:
  -- 0 --
  DEVICE_NAME = Vivante OpenCL Device
  DEVICE_VENDOR = Vivante Corporation
  DEVICE_VERSION = OpenCL 1.1 
  DEVICE_EXTENSIONS = cl_khr_byte_addressable_store 
  DRIVER_VERSION = OpenCL 1.1 
  DEVICE_MAX_COMPUTE_UNITS = 4
  DEVICE_MAX_CLOCK_FREQUENCY = 500
  DEVICE_GLOBAL_MEM_SIZE = 100663296

Workaround: link against libGAL
At least on my Freescale rootfs, libOpenCL needs functions from libGAL but is not linked against it. This means that -lGAL has to be added to the linker command line along with -lOpenCL.
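
For reference, the build invocation then ends up looking something like this (the source file name here is just a placeholder):

    gcc -o cltest cltest.c -lOpenCL -lGAL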

Performance

Floating point performance can reach roughly 16 GFLOPS according to the shoc OpenCL maxflops benchmark (but beware, the board this was executed on was likely not tuned for full performance). If you want to repeat the benchmark yourself you can use my shoc version adapted to run on this hardware.

Mind your compiler optimizations

A quick test revealed that passing -cl-mad-enable -cl-unsafe-math-optimizations -cl-fast-relaxed-math to clBuildProgram halved the number of generated instructions for a simple A*B kernel. Without those options the generated code explicitly compared against NaN and Inf, likely to enforce IEEE 754 edge-case compliance. In some cases that may be required, but keep in mind that the full performance of the device can only be reached with these options enabled.
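
As a minimal sketch of passing these options, assuming program and device were created earlier with the usual clCreateProgramWithSource / clGetDeviceIDs calls:

    /* Sketch: build with the fast-math options; program and device are
       assumed to come from the usual setup code. */
    const char *options =
        "-cl-mad-enable -cl-unsafe-math-optimizations -cl-fast-relaxed-math";
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "clBuildProgram failed (%d):\n%s\n", err, log);
    }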

Use float4

Like on ATI hardware, float2/3/4 operations directly translate to vectorized instructions. Floating point instructions always perform operations on four values, so using smaller vectors under-utilizes the hardware. For integer operations it does not matter as the underlying instructions are scalar.

See for example the Madd test results from shoc:

Add1-SP      Size: 2097152   GFLOPS: 0.891504
Add4-SP      Size: 2097152   GFLOPS: 8.00293
Add8-SP      Size: 2097152   GFLOPS: 1.59215
Add16-SP     (result missing)

MAdd1-SP     Size: 2097152   GFLOPS: 4.58433
MAdd4-SP     Size: 2097152   GFLOPS: 16.0061
MAdd8-SP     Size: 2097152   GFLOPS: 3.1843
MAdd16-SP    Size: 2097152   GFLOPS: 14.8157

float4 and float16 are clearly the sweet spots. I am not sure why float8 suffers; it could be due to a particular instruction interleaving pattern or a compiler limitation.
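
For illustration, a float4 multiply-add kernel along these lines (a sketch, not the actual shoc kernel) keeps all four lanes busy:

    /* Illustrative only: one float4 multiply-add per work-item */
    __kernel void madd4(__global const float4 *a,
                        __global const float4 *b,
                        __global float4 *c)
    {
        int gid = get_global_id(0);
        /* mad() maps to the hardware multiply-add on all four lanes */
        c[gid] = mad(a[gid], b[gid], c[gid]);
    }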

LLVM

"strings" on libCLC.so shows that the CL compiler is based on clang+LLVM, which is an advanced optimizing compiler combo. This means there is likely some room for further performance tuning.

Limitations

Here I've listed some limitations compared to OpenCL on giant, power-hungry PC graphics cards.

Global work size

The GLOBAL_WORK_SIZE per dimension cannot exceed 1<<16 with the Vivante driver. This limitation is not documented anywhere and I am not sure whether it is a violation of the spec; however, the documentation for clEnqueueNDRangeKernel mentions:

The values specified in global_work_size + corresponding values specified in global_work_offset cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in the table of OpenCL Device Queries for clGetDeviceInfo. If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error.

CL_DEVICE_ADDRESS_BITS is 32, so if we were to believe the above snippet it should allow for very large global work sizes. Maybe it is different for the embedded profile.

In the shoc maxflops benchmark I worked around this by emulating a larger global work size with a 2D global work size, i.e.

    /* (in invocation; plain C has no min()/max(), so use conditionals.
       Assumes N is a multiple of 65536 once it exceeds that.) */
    size_t globalWorkSize[2] = { (N < 65536) ? N : 65536,
                                 (N < 65536) ? 1 : N / 65536 };

    /* (and in kernel: reconstruct a flat 1D id from the two dimensions) */
    int gid = get_global_id(0) | (get_global_id(1) << 16);
    int globalSize = get_global_size(0) * get_global_size(1);
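
The matching host-side call then simply enqueues a 2D range; a sketch, with queue, kernel and globalWorkSize assumed to be set up as above:

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,              /* work_dim */
                                        NULL,           /* no offset */
                                        globalWorkSize,
                                        NULL,           /* let the driver pick the local size */
                                        0, NULL, NULL);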

Another possibility proposed by @pixelio is to invoke clEnqueueNDRangeKernel in a loop with increasing offset. But this turned out to be problematic, as the limit holds for the work offset as well.

Workgroup size

The allowed workgroup size depends very strongly on the kernel, and can be anywhere between 16 and 256 (256 is the number of threads per shader core). Do not assume that the reported maximum work group size (1024) can be used; query it for every kernel separately. shoc fails to do this, as do many OpenCL examples found on the web.

    clGetKernelWorkGroupInfo (ker, devid, CL_KERNEL_WORK_GROUP_SIZE, ...)

Another potential pitfall: the returned maximum workgroup size is not always a power of two (and there is nothing in the standard that says it has to be).
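
A minimal sketch of that per-kernel query, with a round-down to a power of two for algorithms that need one (kernel and device assumed to exist):

    /* Query the per-kernel limit instead of assuming the device maximum */
    size_t maxWg = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(maxWg), &maxWg, NULL);
    /* The result need not be a power of two; round down if the algorithm needs one */
    size_t localSize = 1;
    while (localSize * 2 <= maxWg)
        localSize *= 2;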

Shader size

Keep your kernels small.

As could be expected on an embedded device, it is quite easy to run out of shader memory. For example, the bitcoin miner kernel is too large, as are a few of the kernels used in shoc. This results in CL_OUT_OF_RESOURCES errors during clCreateKernel.

The "query chip identity" ioctl call on GC2000 reports an instructionCount of 512. Looking at command stream dumps from etna_viv the device appears to have 0x0E000 - 0x0C000 = 8192 bytes of instruction memory, with Vivante's 128 bit instructions this confirms that there is only room for 512 instructions.

Unfortunately, I am not currently aware of a way to query the shader size in native instructions from the OpenCL API, as CL_PROGRAM_BINARY_SIZES shows the size of the IR.

Atomics

The OpenCL 1.1 doc lists atomic functions such as atomic_add, but any attempt to use them (on either global or local memory) resulted in an "atomic function XXX not supported" compiler error.

I've tried various ways to work around this, such as specifying #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable (shouldn't be needed for 1.1 though), but that's greeted with "warning: pragma OPENCL EXTENSION is not supported, ignoring pragma".
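
For completeness, a trivial kernel like this (a made-up example, not from any of the samples) is enough to trigger the error:

    /* atomic_add is standard OpenCL 1.1, yet this fails to compile with
       "atomic function not supported" on this driver */
    __kernel void count_hits(volatile __global int *counter)
    {
        atomic_add(counter, 1);
    }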

That's all for now...

Comments

  • Too bad there are no atomics. That's a show-stopper for a lot of GPU work. Maybe it will be added in a future driver release.

    Comment by Allan MacKinnon — Mar 12, 2013 9:00:01 PM | #

  • @Allan I'm a bit surprised about it, as I know for a fact that the low-level ISA does support atomics (but I haven't confirmed whether they are actually implemented there). They even mention atomics in their press release: www.vivantecorp.com/vtopcl.html . Let's hope it's just a bug.

    Comment by wladimir — Mar 12, 2013 9:12:00 PM | #

  • I almost missed what you wrote about 512 (128-bit) instructions. That's really surprising!

    Comment by Allan MacKinnon — Mar 12, 2013 9:12:52 PM | #

  • The huge instruction format is a leftover from their DirectX days. But with the new OpenCL instructions being mostly scalar I wouldn't be surprised if they, like Adreno recently (and many before), switch to a more compact ISA at some stage.

    Comment by wladimir — Mar 13, 2013 8:05:00 AM | #

  • I think/hope the missing atomics support is temporary, until the next driver release.

    /opt/viv_samples/cl11/UnitTest$ sudo ./math 20
    Unit Test Math Starting...
    Initializing OpenCL...
    Running Math testcase 20...
    Extension "cl_khr_global_int32_extended_atomics" is not supported.

    /opt/viv_samples/cl11/UnitTest$ strings ./math | grep atom
    cl_khr_global_int32_extended_atomics
    __kernel void math( volatile __global int *pdst, int offset ) { int oldVal = atomic_add( pdst[1] = oldVal; }

    Comment by alexander — Mar 13, 2013 3:58:31 PM | #

  • See last page of paper: 2KB of registers and further confirmation that there are 512 instructions.

    hgpu.org/?p=9052

    Comment by Allan MacKinnon — Mar 20, 2013 4:17:42 PM | #

  • So aside from running benchmarks, what else of practical use can we run on the GPU using OpenCL if atomics are not supported?

    Comment by Andrew — Mar 28, 2013 5:50:45 PM | #

  • @Allan interesting paper, hadn't seen that one, thanks

    @Andrew You can still do all kinds of convolutions, matrix multiplications, image processing algorithms. The more "embarrassingly parallel" stuff.

    Comment by wladimir — Mar 29, 2013 4:34:50 AM | #

  • Vladimir I experienced some strange errors using the Vivante GPU, most notably a completely out of place CL_OUT_OF_HOST_MEMORY from a clCreateKernel call. It turned out it was supposed to be a CL_INVALID_KERNEL_NAME, which they failed to return correctly. I wonder if your errors were also incorrect errors...

    Comment by Constantin — Apr 2, 2013 7:22:31 PM | #

  • @Constantin which one? There may be some bugs related to errors, but I think all of the errors I got during my experiment were correct and explainable. When I get a new driver version I will re-try.

    Comment by wladimir — Apr 3, 2013 7:04:57 AM | #

  • It seems very interesting, @Andrew. I'd like to know which evaluation board you used with the i.MX6. Thanks.

    Comment by Alex — May 30, 2013 5:24:53 PM | #