Arm Guide to OpenCL Programming
Arm Guide to OpenCL Programming
Version 3.3
Developer Guide
Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved.
ARM 100614_0303_00_en
ARM® Mali™ GPU OpenCL
Document History
Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use
the information for the purposes of determining whether implementations infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE
WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, ARM makes no representation with respect to, and has
undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other
rights.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING
OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of
this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is
not exported, directly or indirectly, in violation of such export laws. Use of the word “partner” in reference to ARM’s customers is
not intended to create or refer to any partnership relationship with any other company. ARM may make changes to this document at
any time and without notice.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2
Non-Confidential
ARM® Mali™ GPU OpenCL
If any of the provisions contained in these terms conflict with any of the provisions of any signed written agreement covering this
document with ARM, then the signed written agreement prevails over and supersedes the conflicting provisions of these terms.
This document may be translated into other languages for convenience, and you agree that if there is any conflict between the
English version of this document and any translation, the terms of the English version of the Agreement shall prevail.
Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM Limited or its affiliates in the EU and/or
elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their respective
owners. Please follow ARM’s trademark usage guidelines at http://www.arm.com/about/trademark-usage-guidelines.php
Copyright © 2012, 2013, 2015–2017, ARM Limited or its affiliates. All rights reserved.
LES-PRE-20349
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in
accordance with the terms of the agreement entered into by ARM and the party that ARM delivered this document to.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3
Non-Confidential
Contents
ARM® Mali™ GPU OpenCL Developer Guide
Preface
About this book ...................................................... ...................................................... 8
Feedback .................................................................................................................... 10
Chapter 1 Introduction
1.1 About ARM® Mali™ GPUs ............................................ ............................................ 1-12
1.2 About OpenCL .................................................... .................................................... 1-13
1.3 About the Mali GPU OpenCL driver and support .......................... .......................... 1-14
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 4
Non-Confidential
3.7 The OpenCL memory model ......................................... ......................................... 3-31
3.8 The Mali™ GPU OpenCL memory model ................................ ................................ 3-33
3.9 OpenCL concepts summary .................................................................................... 3-34
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5
Non-Confidential
Appendix A OpenCL Data Types
A.1 About OpenCL data types ...................................... ...................................... Appx-A-92
A.2 OpenCL data type lists ......................................... ......................................... Appx-A-93
Appendix F Revisions
F.1 Revisions .................................................. .................................................. Appx-F-124
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6
Non-Confidential
Preface
This preface introduces the ARM® Mali™ GPU OpenCL Developer Guide.
It contains the following:
• About this book on page 8.
• Feedback on page 10.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7
Non-Confidential
Preface
About this book
Intended audience
This guide is written for software developers with experience in C or C-like languages who want to
develop OpenCL on Mali™ Midgard GPUs or Mali Bifrost GPUs.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8
Non-Confidential
Preface
About this book
Glossary
The ARM Glossary is a list of terms used in ARM documentation, together with definitions for those
terms. The ARM Glossary does not contain terms that are industry standard unless the ARM meaning
differs from the generally accepted meaning.
See the ARM Glossary for more information.
Typographic conventions
italic
Introduces special terminology, denotes cross-references, and citations.
bold
Highlights interface elements, such as menu names. Denotes signal names. Also used for terms
in descriptive lists, where appropriate.
monospace
Denotes text that you can enter at the keyboard, such as commands, file and program names,
and source code.
monospace
Denotes a permitted abbreviation for a command or option. You can enter the underlined text
instead of the full command or option name.
monospace italic
Denotes arguments to monospace text where the argument is to be replaced by a specific value.
monospace bold
Denotes language keywords when used outside example code.
<and>
Encloses replaceable terms for assembler syntax where they appear in code or code fragments.
For example:
MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>
SMALL CAPITALS
Used in body text for a few terms that have specific technical meanings, that are defined in the
ARM glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION SPECIFIC, UNKNOWN, and
UNPREDICTABLE.
Additional reading
This book contains information that is specific to this product. See the following documents for other
relevant information.
ARM publications
See Infocenter, http://infocenter.arm.com, for access to ARM documentation.
Other publications
OpenCL 1.2 Specification, www.khronos.org
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9
Non-Confidential
Preface
Feedback
Feedback
Feedback on content
If you have comments on content then send an e-mail to errata@arm.com. Give:
• The title ARM Mali GPU OpenCL Developer Guide.
• The number ARM 100614_0303_00_en.
• If applicable, the page number(s) to which your comments refer.
• A concise explanation of your comments.
ARM also welcomes general suggestions for additions and improvements.
Note
ARM tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality of the
represented document when used with any other PDF reader.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 10
Non-Confidential
Chapter 1
Introduction
This chapter introduces Mali GPUs, OpenCL, and the Mali GPU OpenCL driver.
It contains the following sections:
• 1.1 About ARM® Mali™ GPUs on page 1-12.
• 1.2 About OpenCL on page 1-13.
• 1.3 About the Mali GPU OpenCL driver and support on page 1-14.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 1-11
Non-Confidential
1 Introduction
1.1 About ARM® Mali™ GPUs
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 1-12
Non-Confidential
1 Introduction
1.2 About OpenCL
Related information
http://www.khronos.org.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 1-13
Non-Confidential
1 Introduction
1.3 About the Mali GPU OpenCL driver and support
Note
The Mali GPU OpenCL driver does not support Mali Utgard GPUs.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 1-14
Non-Confidential
Chapter 2
Parallel Processing Concepts
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-15
Non-Confidential
2 Parallel Processing Concepts
2.1 About parallel processing
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-16
Non-Confidential
2 Parallel Processing Concepts
2.2 Types of parallelism
Decode Decode
sound video
Parse data
Operating
Network stack
system
2.2.3 Pipelines
Pipelines process data in a series of stages. In a pipeline, the stages can operate simultaneously but they
do not process the same data. A pipeline typically has a relatively small number of stages.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-17
Non-Confidential
2 Parallel Processing Concepts
2.2 Types of parallelism
An example of a pipeline is a video recorder application that must execute these stages:
1. Capture image data from an image sensor and measure light levels.
2. Modify the image data to correct for lens effects.
3. Modify the contrast, color balance, and exposure of the image data.
4. Compress the image.
5. Add the data to the video file.
6. Write the video file to storage.
These stages must be executed in order, but they can all execute on data from different video frames at
the same time.
The figure shows parts of a video capture application that can operate simultaneously as a pipeline.
Capture Modify:
Correct Write video
data Contrast Compress Add data to
image for file to
from image Color balance image video file
lens effects storage
sensor Exposure
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-18
Non-Confidential
2 Parallel Processing Concepts
2.3 Mixing different types of parallelism
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-19
Non-Confidential
2 Parallel Processing Concepts
2.4 Embarrassingly parallel applications
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-20
Non-Confidential
2 Parallel Processing Concepts
2.5 Limitations of parallel processing and Amdahl's law
1
Speedup =
P
S +
N
8X
5% serial
6X
10% serial
4X
20% serial
2X
1X
0
1 2 4 6 8 10
Processors
Related concepts
2.4 Embarrassingly parallel applications on page 2-20.
9.5 Reducing the effect of serial computations on page 9-84.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-21
Non-Confidential
2 Parallel Processing Concepts
2.6 Concurrency
2.6 Concurrency
Concurrent applications have multiple operations in progress at the same time. These can operate in
parallel or in serial through the use of a time sharing system.
In a concurrent application, multiple tasks attempt to share the same data. Access to this data must be
managed to prevent complex problems such as race conditions, deadlocks, and livelocks.
Race conditions
A race condition occurs when two or more threads try to modify the value of one variable at the
same time. In general, the final value of the computation will always produce the same value,
but when a race condition occurs, the variable can get a different value that depends on the order
of the writes.
Deadlocks
A deadlock occurs when two threads become blocked by each other and neither thread can make
progress. This can happen when each thread obtains a lock that the other thread requires.
Livelocks
A livelock is similar to deadlock, but the threads keep running. Because of the lock, the threads
can never complete their tasks.
Concurrent applications require concurrent data structures. A concurrent data structure is a data structure
that enables multiple tasks to gain access to the data with no concurrency problems.
Data parallel applications use concurrent data structures. These are the sorts of data structures that you
typically use in OpenCL.
OpenCL includes atomic operations to help manage interactions between threads. Atomic operations
provide one thread exclusive access to a data item while it modifies it. The atomic operation enables one
thread to read, modify, and write the data item with the guarantee that no other thread can modify the
data item at the same time.
Note
OpenCL does not guarantee the order of operation of threads. Threads can start and finish in any order.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 2-22
Non-Confidential
Chapter 3
OpenCL Concepts
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-23
Non-Confidential
3 OpenCL Concepts
3.1 Using OpenCL
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-24
Non-Confidential
3 OpenCL Concepts
3.2 OpenCL applications
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-25
Non-Confidential
3 OpenCL Concepts
3.3 OpenCL execution model
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-26
Non-Confidential
3 OpenCL Concepts
3.4 OpenCL data processing
Work-items
One dimensional
NDRange
N=0
Two dimensional
NDRange
N=1
Work-items
Three dimensional
NDRange
N=2
Work-items
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-27
Non-Confidential
3 OpenCL Concepts
3.4 OpenCL data processing
Work-groups
Work-items
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-28
Non-Confidential
3 OpenCL Concepts
3.5 OpenCL work-groups
After the synchronization is complete, all writes to shared buffers are guaranteed to have
been completed. It is then safe for work-items to read data written by different work-items
within the same work-group.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-29
Non-Confidential
3 OpenCL Concepts
3.6 OpenCL identifiers
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-30
Non-Confidential
3 OpenCL Concepts
3.7 The OpenCL memory model
Work-group Work-group
Constant memory
• Constant memory is a memory region used for objects allocated and initialized by the host.
• It is accessible as read-only by all work-items.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-31
Non-Confidential
3 OpenCL Concepts
3.7 The OpenCL memory model
Global memory
• Global memory is accessible to all work-items executing in a context.
• It is accessible to the host using read, write, and map commands.
• It is consistent across work-items in a single work-group.
Note
— Work-items execute in an undefined order. This means you cannot guarantee the order
that work-items write data in.
— If you want a work-item to read data that are written by another work-item, you must use
a barrier to ensure that they execute in the correct order.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-32
Non-Confidential
3 OpenCL Concepts
3.8 The Mali™ GPU OpenCL memory model
Related concepts
8.3 Memory allocation on page 8-71.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-33
Non-Confidential
3 OpenCL Concepts
3.9 OpenCL concepts summary
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 3-34
Non-Confidential
Chapter 4
Developing an OpenCL Application
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 4-35
Non-Confidential
4 Developing an OpenCL Application
4.1 Software and hardware requirements for Mali GPU OpenCL development
4.1 Software and hardware requirements for Mali GPU OpenCL development
Implementations of OpenCL are available for several operating systems. You can develop on other
hardware platforms with implementations of OpenCL.
To develop OpenCL applications for Mali GPUs, you require:
• A compatible OS.
• The Mali GPU OpenCL driver.
• A platform with a Mali GPU.
Note
The Mali GPU must be a Mali Midgard or Bifrost GPU.
Estimating Mali GPU performance with results from a different system will produce inaccurate data.
Related concepts
1.1 About ARM® Mali™ GPUs on page 1-12.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 4-36
Non-Confidential
4 Developing an OpenCL Application
4.2 Development stages for OpenCL
Write kernels
OpenCL applications consist of a set of kernel functions. You must write the kernels that
perform the computations.
If possible, partition your kernels so that the least amount of data is transferred between them.
Loading large amounts of data is often the most expensive part of an operation.
Write infrastructure for kernels
OpenCL applications require infrastructure code that sets up the data and prepares the kernels
for execution,
Execute the kernels
Enqueue the kernels for execution and read back the results.
Related concepts
6.2 Analyzing code for parallelization on page 6-53.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 4-37
Non-Confidential
Chapter 5
Execution Stages of an OpenCL Application
Note
This chapter is not intended as a comprehensive lesson in OpenCL.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-38
Non-Confidential
5 Execution Stages of an OpenCL Application
5.1 About the execution stages
Related concepts
5.2 Finding the available compute devices on page 5-41.
5.3 Initializing and creating OpenCL contexts on page 5-42.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-39
Non-Confidential
5 Execution Stages of an OpenCL Application
5.1 About the execution stages
Related concepts
5.4 Creating a command queue on page 5-43.
5.5 Creating OpenCL program objects on page 5-44.
5.7 Creating kernel and memory objects on page 5-46.
5.8 Executing the kernel on page 5-47.
5.9 Reading the results on page 5-49.
5.10 Cleaning up unused objects on page 5-50.
Related references
5.6 Building a program executable on page 5-45.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-40
Non-Confidential
5 Execution Stages of an OpenCL Application
5.2 Finding the available compute devices
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-41
Non-Confidential
5 Execution Stages of an OpenCL Application
5.3 Initializing and creating OpenCL contexts
You can optionally specify an error notification callback function when you create an OpenCL context.
When you leave this parameter as a NULL value the system does not register an error notification.
To receive runtime errors for the particular OpenCL context, provide the callback function. For example:
// Optionally user_data can contain contextual information
// Implementation specific data of size cb, can be returned in private_info
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-42
Non-Confidential
5 Execution Stages of an OpenCL Application
5.4 Creating a command queue
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-43
Non-Confidential
5 Execution Stages of an OpenCL Application
5.5 Creating OpenCL program objects
Creating a program object from a binary is a similar process to creating a program object from source
code, except that you must supply the binary for each device that you want to execute the kernel on. Use
the clCreateProgramWithBinary() function to do this.
Use the clGetProgramInfo() function to obtain the binary after you have generated it.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-44
Non-Confidential
5 Execution Stages of an OpenCL Application
5.6 Building a program executable
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-45
Non-Confidential
5 Execution Stages of an OpenCL Application
5.7 Creating kernel and memory objects
Procedure
1. Package the data in a memory object.
2. Associate the memory object with the kernel.
These are the types of memory objects:
Buffer objects
Simple blocks of memory.
Image objects
These are structures specifically for representing 2D or 3D images. These are opaque
structures. This means that you cannot see the implementation details of these structures.
To create buffer objects, use the clCreateBuffer() function.
To create image objects, use the clCreateImage() function.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-46
Non-Confidential
5 Execution Stages of an OpenCL Application
5.8 Executing the kernel
If your application is not required to share data among work-items, set the local_work_size parameter
to NULL when enqueuing your kernel. This enables the OpenCL driver to determine an efficient work-
group size for your kernel, but this might not be the optimal work-group size.
To get the maximum work-group size in each dimension, call clGetDeviceInfo() with
CL_DEVICE_MAX_WORK_ITEM_SIZES. This is for the simplest kernel and dimensions might be lower for
more complex kernels. The product of the dimensions of your work-group might limit the size of the
work-group.
Note
To get the total work-group size, call clGetKernelWorkGroupInfo() with
CL_KERNEL_WORK_GROUP_SIZE. If the maximum work-group size for a kernel is lower than 128,
performance is reduced. If this is the case, try simplifying the kernel.
The work-group size for each dimension must divide evenly into the total data-size for that dimension.
This means that the x size of the work-group must divide evenly into the x size of the total data. If this
requirement means padding the work-group with extra work-items, ensure the additional work-items
return immediately and do no work.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-47
Non-Confidential
5 Execution Stages of an OpenCL Application
5.8 Executing the kernel
For example:
size_t globalWorkSize[1] = { ARRAY_SIZE };
size_t localWorkSize[1] = { 4 };
if (errNum != CL_SUCCESS)
{
printf("Error queuing kernel for execution.\n");
Cleanup();
return 1;
}
Kernels that are enqueued to an in-order queue automatically wait for kernels that were previously
enqueued on the same queue. You are not required to write any code to synchronize them.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-48
Non-Confidential
5 Execution Stages of an OpenCL Application
5.9 Reading the results
ASSERT(CL_SUCCESS == err);
Note
• clFinish() must be called to make the buffer available.
• The third parameter of clEnqueueMapBuffer() is CL_NON_BLOCKING in the previous example. If you
change this parameter in clEnqueueMapBuffer() or clFinish() to CL_BLOCKING, the call becomes a
blocking call and the read must be completed before clEnqueueMapBuffer() returns.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-49
Non-Confidential
5 Execution Stages of an OpenCL Application
5.10 Cleaning up unused objects
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 5-50
Non-Confidential
Chapter 6
Converting Existing Code to OpenCL
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-51
Non-Confidential
6 Converting Existing Code to OpenCL
6.1 Profiling your application
Related information
http://malideveloper.arm.com.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-52
Non-Confidential
6 Converting Existing Code to OpenCL
6.2 Analyzing code for parallelization
Related concepts
6.3.1 Use the global ID instead of the loop counter on page 6-55.
6.3.2 Compute values in a loop with a formula instead of using counters on page 6-55.
6.3.3 Compute values per frame on page 6-56.
6.3.4 Perform computations with dependencies in multiple-passes on page 6-57.
6.3.5 Pre-compute values to remove dependencies on page 6-57.
6.4 Using parallel processing with non-parallelizable code on page 6-59.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-53
Non-Confidential
6 Converting Existing Code to OpenCL
6.2 Analyzing code for parallelization
Related concepts
6.3 Parallel processing techniques in OpenCL on page 6-55.
6.3.1 Use the global ID instead of the loop counter on page 6-55.
6.4 Using parallel processing with non-parallelizable code on page 6-59.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-54
Non-Confidential
6 Converting Existing Code to OpenCL
6.3 Parallel processing techniques in OpenCL
Note
You can include loops in OpenCL kernels, but they can only iterate over the data for that work-item, not
the entire NDRange.
This loop is parallelizable because the loop elements are all independent. There is no main loop counter
loop_count in the OpenCL kernel, so it is replaced by the global ID.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-55
Non-Confidential
6 Converting Existing Code to OpenCL
6.3 Parallel processing techniques in OpenCL
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-56
Non-Confidential
6 Converting Existing Code to OpenCL
6.3 Parallel processing techniques in OpenCL
In this case, splitting the computations into iterations also splits the dependencies. The data required for
one frame is computed in the previous frame.
Some types of simulation require many iterations for relatively small movements. If this is the case, try
computing multiple iterations before drawing frames.
Related concepts
6.3.3 Compute values per frame on page 6-56.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-57
Non-Confidential
6 Converting Existing Code to OpenCL
6.3 Parallel processing techniques in OpenCL
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-58
Non-Confidential
6 Converting Existing Code to OpenCL
6.4 Using parallel processing with non-parallelizable code
Related concepts
6.5.2 Use concurrent data structures on page 6-60.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-59
Non-Confidential
6 Converting Existing Code to OpenCL
6.5 Dividing data for OpenCL
Related references
Chapter 3 OpenCL Concepts on page 3-23.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-60
Non-Confidential
6 Converting Existing Code to OpenCL
6.5 Dividing data for OpenCL
Note
The examples map the problems into the NDRanges that have the same number of dimensions. OpenCL
does not require that you do this. You can for example, map a one-dimensional problem onto a two-, or
three-dimensional NDRange.
One-dimensional data
An example of one-dimensional data is audio. Audio is represented as a series of samples. Changing the
volume of the audio is a parallel task, because the operation is performed independently per sample.
In this case, the NDRange is the total number of samples in the audio. Each work-item can be one
sample and a work-group is a collection of samples.
Audio can also be processed with vectors. If your audio samples are 16-bit, you can make a work-item
represent 8 samples and process 8 of them at a time with vector instructions.
Two-dimensional data
An image is a natural fit for OpenCL, because you can process a 1 600 by 1 200 pixel image by mapping
it onto a two-dimensional NDRange of 1 600 by 1 200.The total number of work-items is the total
number of pixels in the image, that is, 1 920 000.
The NDRange is divided into work-groups where each work-group is also a two-dimensional array. The
number of work-groups must divide into the NDRange exactly.
If each work-item processes a single pixel, a work-group size of 8 by 16 has the size of 128. This work-
group size fits exactly into the NDRange on both the x and y axis. To process the image, you require
15 000 work-groups of 128 work-items each.
You can vectorize this example by processing all the color channels in a single vector. If the channels are
8-bit values, you can process multiple pixels in a single vector. If each vector processes four pixels, this
means each work-item processes four pixels and you require four times fewer work-items to process the
entire image. This means that your NDRange can be reduced to 400 by 1 200 and you only require 3 750
work-groups to process the image.
Three-dimensional data
You can use three-dimensional data to model the behavior of materials in the real world. For example,
you can model the behavior of concrete for building by simulating the stresses in a three-dimensional
data set.
You can use the data produced to determine the size and design of the structure you require to hold a
specific load.
You can use this technique in games to model the physics of objects. When an object is broken, the
physics simulation makes the process of breaking more realistic.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 6-61
Non-Confidential
Chapter 7
Retuning Existing OpenCL Code
This chapter describes how to retune existing OpenCL code so you can run it on Mali GPUs.
It contains the following sections:
• 7.1 About retuning existing OpenCL code for Mali GPUs on page 7-63.
• 7.2 Differences between desktop-based architectures and Mali GPUs on page 7-64.
• 7.3 Procedure for retuning existing OpenCL code for Mali GPUs on page 7-66.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7-62
Non-Confidential
7 Retuning Existing OpenCL Code
7.1 About retuning existing OpenCL code for Mali GPUs
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7-63
Non-Confidential
7 Retuning Existing OpenCL Code
7.2 Differences between desktop-based architectures and Mali GPUs
Note
OpenCL typically only uses the arithmetic or load-store execution pipelines. The texture pipeline is only
used for reading image data types.
The Mali GPUs use a VLIW (Very Long Instruction Word) architecture. Each instruction word contains
multiple operations. The Mali GPUs also use SIMD, so that most arithmetic instructions operate on
multiple data elements simultaneously.
Each thread uses only one of the arithmetic or load-store execution pipes at any point in time. Two
instructions from the same thread execute in sequence.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7-64
Non-Confidential
7 Retuning Existing OpenCL Code
7.2 Differences between desktop-based architectures and Mali GPUs
Related references
Chapter 10 The kernel auto-vectorizer and unroller on page 10-86.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7-65
Non-Confidential
7 Retuning Existing OpenCL Code
7.3 Procedure for retuning existing OpenCL code for Mali GPUs
7.3 Procedure for retuning existing OpenCL code for Mali GPUs
You can optimize existing OpenCL code for Mali GPUs if you analyze existing code and remove the
device-specific optimizations.
This section contains the following subsections:
• 7.3.1 Analyze code on page 7-66.
• 7.3.2 Locate and remove device optimizations on page 7-66.
• 7.3.3 Optimize your OpenCL code for Mali GPUs on page 7-67.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7-66
Non-Confidential
7 Retuning Existing OpenCL Code
7.3 Procedure for retuning existing OpenCL code for Mali GPUs
Use of scalars
Mali Bifrost GPUs use scalars.
Mali Midgard GPUs use scalars and 128-bit vectors.
Modifications for memory bank conflicts
Some GPUs include per-warp memory banks. If the code includes optimizations to avoid
conflicts in these memory banks, remove them.
Optimizations for divergent threads, warps, or wavefronts
Some GPU architectures group work-items together into what are called warps or wavefronts.
All the work-items in a warp must proceed in lock-step together in these architectures and this
means branches can perform badly.
Threads on Mali Midgard GPUs are independent and can diverge without any performance
impact. If your code contains optimizations or workarounds for divergent threads in warps or
wavefronts, remove them.
Note
Mali Midgard GPUs do not use warps or wavefronts.
Note
This optimization only applies to Mali Midgard GPUs.
Related references
Chapter 8 Optimizing OpenCL for Mali GPUs on page 8-68.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 7-67
Non-Confidential
Chapter 8
Optimizing OpenCL for Mali GPUs
This chapter describes the procedure to optimize OpenCL applications for Mali GPUs.
It contains the following sections:
• 8.1 The optimization process for OpenCL applications on page 8-69.
• 8.2 Load balancing between control threads and OpenCL threads on page 8-70.
• 8.3 Memory allocation on page 8-71.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8-68
Non-Confidential
8 Optimizing OpenCL for Mali GPUs
8.1 The optimization process for OpenCL applications
Do a dummy run of the kernel the first time to ensure that the memory is allocated. Ensure this
is outside of your timing loop.
The allocation of some buffers in certain cases is delayed until the first time they are used. This
can cause the first kernel run to be slower than subsequent runs.
Select the kernels that take the most time
Select the kernels that have the longest run-time and optimize these. Optimizing any other
kernels has little impact on overall performance.
Analyze the kernels
Analyze the kernels to see if they contain computationally expensive operations:
• Measure how many reads and writes there are in the kernel. For high performance, do as
many computations per memory access as possible.
• For Mali GPUs, you can use the Offline Shader Compiler to check the balancing between the
different pipelines.
Note
Compiler output using -v is not available for Bifrost GPUs.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8-69
Non-Confidential
8 Optimizing OpenCL for Mali GPUs
8.2 Load balancing between control threads and OpenCL threads
8.2.2 Do not use any of the clEnqueueMap() operations with a blocking call
Use clWaitForEvents() or callbacks to ensure that the control thread and OpenCL can work in parallel.
Procedure
1. Split work into many parts.
2. For each part:
a. Prepare the work for part X on the application processor.
b. Submit part X OpenCL work-items to the OpenCL device.
3. For each part:
a. Wait for part X OpenCL work-items to complete on the OpenCL device using clWaitForEvents.
b. Process the results from the OpenCL device on the application processor.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8-70
Non-Confidential
8 Optimizing OpenCL for Mali GPUs
8.3 Memory allocation
Parameter Description
CL_MEM_ALLOC_HOST_PTR This is a hint to the driver indicating that the buffer is accessed on the host side. To use the buffer on the
application processor side, you must map this buffer and write the data into it. This is the only method
that does not involve copying data. If you must fill in an image that is processed by the GPU, this is the
best way to avoid a copy.
CL_MEM_COPY_HOST_PTR Copies the contents of the host_ptr argument into memory allocated by the driver.
CL_MEM_USE_HOST_PTR Copies the content of the host memory pointer into the buffer when the first kernel using this buffer
starts running. This flag enforces memory restrictions that can reduce performance. Avoid using this if
possible.
When a map is executed, the memory must be copied back to the provided host pointer. This
significantly increases the cost of map operations.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8-71
Non-Confidential
8 Optimizing OpenCL for Mali GPUs
8.3 Memory allocation
CL_MEM_ALLOC_HOST_PTR
Note
• You must make the initial memory allocation through the OpenCL API.
• Always use the latest pointer returned.
If a buffer is repeatedly mapped and unmapped, the address the buffer maps into is not guaranteed to
be the same.
Application
Mali GPU
processor
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8-72
Non-Confidential
8 Optimizing OpenCL for Mali GPUs
8.3 Memory allocation
8.3.4 Do not allocate memory buffers created with malloc() for OpenCL applications
The Mali GPU cannot access the memory buffers created by malloc() because they are not mapped into
the address space of the Mali GPU.
The inaccessible memory buffer is shown in the following figure.
Application
Mali GPU
processor
Mali GPU
cannot access
Buffer created by Global memory buffer
malloc() memory
You must allocate the memory in OpenCL with CL_MEM_ALLOC_HOST_PTR because it ensures that the
memory pages are always mapped into physical memory.
If you allocate the memory on the application processor, the OS might not allocate physical memory to
the pages until they are used for the first time. Errors occur if an I/O device attempts to use unmapped
pages.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 8-73
Non-Confidential
Chapter 9
OpenCL Optimizations List
This chapter lists several optimizations to use when writing OpenCL code for Mali GPUs.
It contains the following sections:
• 9.1 General optimizations on page 9-75.
• 9.2 Kernel optimizations on page 9-77.
• 9.3 Code optimizations on page 9-80.
• 9.4 Execution optimizations on page 9-83.
• 9.5 Reducing the effect of serial computations on page 9-84.
• 9.6 Mali™ Bifrost GPU specific optimizations on page 9-85.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-74
Non-Confidential
9 OpenCL Optimizations List
9.1 General optimizations
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-75
Non-Confidential
9 OpenCL Optimizations List
9.1 General optimizations
Related references
Chapter 6 Converting Existing Code to OpenCL on page 6-51.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-76
Non-Confidential
9 OpenCL Optimizations List
9.2 Kernel optimizations
For best performance, use a workgroup size that is between 4 and 64 inclusive, and a multiple of 4
If you are using a barrier, a smaller workgroup size is better.
When you are selecting a workgroup size, consider the memory access pattern of the data.
Finding the best workgroup size can be counter-intuitive, so test different options to see what
one is fastest.
If the global work size is not divisible by 4, use padding at the edges or use a non-uniform
workgroup size
To ensure the global work size is divisible by 4, add a few more dummy threads.
Alternatively you can let the application processor compute the edges.
You can use a non-uniform workgroup size, but this does not guarantee better performance than
the other options.
If you are not sure what workgroup size is best, define local_work_size as NULL
The driver picks the workgroup size it thinks as best. The driver usually selects the work group
size as 64.
Note
The performance might not be optimal
If you want to set the local work size, set the reqd_work_group_size qualifier to kernel functions
This provides the driver with information at compile time for register use and sizing jobs to fit
properly on shader cores.
Experiment with work-group size
If you can, experiment with different sizes to see if any give a performance advantage. Sizes that
are a multiple of two are more likely to perform better.
If your kernel has no preference for the work-group size, you can pass NULL to the local work
size argument of the clEnqueueNDRangeKernel().
Use a work-group size of 128 or 256 if possible
The maximum work-group size is typically 256, but this is not possible for all kernels and the
driver suggests another size. A work-group size of 64 is the smallest size guaranteed to be
available for all kernels.
If possible, use a work-group size of 128 or 256. These make optimal use of the Mali GPU
hardware. If the maximum work-group size is below 128, your kernel might be too complex.
Experiment with work-group shape
The shape of the work-group can affect the performance of your application. For example, a 32
by 4 work-group might be the optimal size and shape.
Experiment with different shapes and sizes to find the best combination for your application.
Check for synchronization requirements
Some kernels require work-groups for synchronization of the work-items within the work-group
with barriers. These typically require a specific work-group size.
In cases where synchronization between work-items is not required, the choice of the size of the
work-groups depends on the most efficient size for the device.
You can pass in NULL to enable OpenCL to pick an efficient size.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-77
Non-Confidential
9 OpenCL Optimizations List
9.2 Kernel optimizations
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-78
Non-Confidential
9 OpenCL Optimizations List
9.2 Kernel optimizations
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-79
Non-Confidential
9 OpenCL Optimizations List
9.3 Code optimizations
Vectorize incrementally
Vectorize in incremental steps. For example, start processing one pixel at a time, then two, then
four.
Use vector loads and saves
To load as much data as possible in a single operation, use vector loads. These enable you to
load 128 bits at a time. Do the same for saving data.
For example, if you are loading char values, use the built-in function vload16() to load 16
bytes at a time.
Do not try to load more than 128 bits in a single load. This can reduce performance.
Avoid processing single values
Avoid writing kernels that operate on single bytes or other small values. Write kernels that work
on vectors.
Perform as many operations per load as possible
Operations that perform multiple computations per element of data loaded are typically good for
programming in OpenCL:
• Try to reuse data already loaded.
• Use as many arithmetic instructions as possible per load.
Avoid conversions to or from float and int
Conversions to or from float and int are relatively expensive so avoid them if possible.
Experiment to see how fast you can get your algorithm to execute
There are many variables that determine how well an application performs. Some of the
interactions between variables can be very complex and it is difficult to predict how they impact
performance.
Experiment with your OpenCL kernels to see how fast they can run:
Data types
Use the smallest data types for your calculation as possible.
For example, if your data does not exceed 16 bits do not use 32-bit types.
Load store types
Try changing the amount of data processed per work-item.
Data arrangement
Change the data arrangement to make maximum use of the processor caches.
Maximize data loaded
Always load as much data in a single operation as possible. Use 128-bit wide vector
loads to load as many data items as possible per load.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-80
Non-Confidential
9 OpenCL Optimizations List
9.3 Code optimizations
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-81
Non-Confidential
9 OpenCL Optimizations List
9.3 Code optimizations
Related references
Chapter 10 The kernel auto-vectorizer and unroller on page 10-86.
Appendix B OpenCL Built-in Functions on page Appx-B-96.
B.3 half_ and native_ math functions on page Appx-B-100.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-82
Non-Confidential
9 OpenCL Optimizations List
9.4 Execution optimizations
Doing this ensures that when you use kernels in the future, they start faster because the existing
finalized binary is used.
• If you use callbacks to prompt the processor to continue processing data resulting from the execution
of a kernel, ensure that the callbacks are set before you flush the queue.
If you do not do this, the callbacks might occur at the end of a larger batch of work, later than they
might have based on actual completion of work.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-83
Non-Confidential
9 OpenCL Optimizations List
9.5 Reducing the effect of serial computations
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-84
Non-Confidential
9 OpenCL Optimizations List
9.6 Mali™ Bifrost GPU specific optimizations
Note
Only use these optimizations if you are specifically targeting a Mali Bifrost GPU.
Ensure that the threads in quads all take the same branch direction in if-statements and loops
In Mali Bifrost GPUs, groups of four adjacent threads are arranged together as quads. If your
shaders contain branches, such as if statements or loops, the branches in quads can go different
ways. This reduces performance because the arithmetic unit cannot execute both sides of the
branch at the same time.
Try to ensure that the threads in quads all branch the same way.
Avoid excessive register usage
Every thread has 64 32-bit working registers. A 64-bit variable uses two adjacent 32-bit
registers for its 64-bit data.
If a thread requires more than 64 registers, the compiler might start storing register data in
memory. This reduces performance and the available bandwidth. This is especially bad if your
shader is already load-store bound.
Vectorize 8-bit and 16-bit operations
For 16-bit operations use 2-component vectors to get full performance. For basic arithmetic
operations, fp16 version is twice as fast as fp32 version.
For 8-bit types, such as char, use four-component vectors for best performance.
Do not vectorize 32-bit operations
Mali Bifrost GPUs use scalars so you are not required to vectorize 32-bit operations. 32-bit
scalar and vector arithmetic operations have same performance.
Use 128-bit load or store operations
128-bit load or store operations make the more efficient use of the internal buses.
Load and store operations are faster if all threads in a quad load from the same cache-line
If all threads in a quad load from the same cache-line, the arithmetic pipeline only sends one
request to the load-store unit to load the 512-bit data.
For example, this example is fast because consecutive threads load consecutive 128-bit vectors
from memory:
global float4 * input_array;
float4 v = input_array[get_global_id(0)];
This second version is slower, because the four threads with adjacent global ids load data from
different cache lines.
global float4 * input_array;
float4 v = input_array[4*get_global_id(0)];
Note
One cache line is 512-bits.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 9-85
Non-Confidential
Chapter 10
The kernel auto-vectorizer and unroller
Note
• The kernel auto-vectorizer and unroller are enabled by default for Midgard GPUs.
• This feature is not available for Bifrost GPUs.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 10-86
Non-Confidential
10 The kernel auto-vectorizer and unroller
10.1 About the kernel auto-vectorizer and unroller
Option Description
no option Kernel unroller and vectorizer enabled, with conservative heuristics
-fno-kernel-vectorizer Disable the kernel vectorizer
-fno-kernel-unroller Disable the kernel unroller
-fkernel-vectorizer Enable aggressive heuristics for the kernel vectorizer
-fkernel-unroller Enable aggressive heuristics for the kernel unroller
Note
The kernel auto-vectorizer performs a code transformation. For the transformation to be possible, several
conditions must be met:
• The enqueued NDRange must be a multiple of the vectorization factor.
• Barriers are not permitted in the kernel.
• Thread-divergent code is not permitted in the kernel.
• Global offsets are not permitted in the enqueued NDRange.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 10-87
Non-Confidential
10 The kernel auto-vectorizer and unroller
10.2 Kernel auto-vectorizer options
The vectorizer works by merging consecutive work-items. The number of work-items enqueued is
reduced by the vectorization factor.
For example, in a one-dimensional NDRange, work-items have the local-IDs 0, 1, 2, 3, 4, 5...
Vectorizing by a factor of four merges work-items in groups of four. First work-items 0, 1, 2, and 3, then
work-items 4, 5, 6, and 7 going upwards in groups of four until the end of the NDRange.
In a two-dimensional NDRange, the work-items have local-IDs such as (0,0), (0,1), (0,2)..., (1,0), (1,1),
(1,2)... where (x,y) is showing (global_id(0), global_id(1)).
The vectorizer can vectorize along dimension 0 and merge work-items (0,0), (1,0)...
Alternatively it can vectorize along dimension 1 and merge work-items (0,0), (0,1)...
Example Description
-fkernel-vectorizer Enable the vectorizer, use heuristics for both dimension and factor
-fkernel-vectorizer=x4 Enable the vectorizer, use dimension 0, use factor 4
-fkernel-vectorizer=x Enable the vectorizer, use heuristics for the factor, use dimension 0
-fkernel-vectorizer=2 Enable the vectorizer, use heuristics for the dimension, use factor 2
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 10-88
Non-Confidential
10 The kernel auto-vectorizer and unroller
10.3 Kernel unroller options
Example Description
-fkernel-unroller Enable the unroller, use heuristics for both dimension and factor
-fkernel-unroller=x4 Enable the unroller, use dimension 0, use factor 4
-fkernel-unroller=x Enable the unroller, use heuristics for the factor, use dimension 0
-fkernel-unroller=2 Enable the unroller, use heuristics for the dimension, use factor 2
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 10-89
Non-Confidential
10 The kernel auto-vectorizer and unroller
10.4 The dimension interchange transformation
This interchanges dimensions dim0 and dim1, where <dim0> and <dim1> can be 0, 1 or 2.
You can disable dimension interchange with the following option:
-fno-dim-interchange
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. 10-90
Non-Confidential
Appendix A
OpenCL Data Types
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-A-91
Non-Confidential
A OpenCL Data Types
A.1 About OpenCL data types
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-A-92
Non-Confidential
A OpenCL Data Types
A.2 OpenCL data type lists
Note
You can query CL_DEVICE_ADDRESS_BITS with clGetDeviceInfo(). The value returned might be
different for 32-bit and 64-bit host applications, even on the same Mali GPU.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-A-93
Non-Confidential
A OpenCL Data Types
A.2 OpenCL data type lists
complex half, complex halfn Complex 16-bit float, scalar, and vector
imaginary half, imaginary halfn Imaginary 16-bit complex, scalar, and vector
complex float, complex floatn, Complex 32-bit float, scalar, and vector
imaginary float, imaginary floatn Imaginary 32-bit float, scalar, and vector
complex double, complex doublen Complex 64-bit float, scalar, and vector
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-A-94
Non-Confidential
A OpenCL Data Types
A.2 OpenCL data type lists
imaginary double, imaginary doublen Imaginary 64-bit float, scalar, and vector
complex quad, complex quadn Complex 128-bit float, scalar, and vector
imaginary quad, imaginary quadn Imaginary 128-bit float, scalar, and vector
long double, long doublen 64-bit - 128-bit float, scalar, and vector
long long, long longnb 128-bit signed int, scalar, and vector
unsigned long long, ulong long, ulonglongn 128-bit unsigned int, scalar, and vector
Note
• The half and half vector data types can be used with the cl_khr_fp16 extension.
• The double and double vector data types can be used with the cl_khr_fp64 extension on Mali
Midgard GPUs. This extension is not available on Bifrost GPUs.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-A-95
Non-Confidential
Appendix B
OpenCL Built-in Functions
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-96
Non-Confidential
B OpenCL Built-in Functions
B.1 Work-item functions
Function
get_work_dim()
get_global_size()
get_global_id()
get_local_size()
get_local_id()
get_num_groups()
get_group_id()
get_global_offset()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-97
Non-Confidential
B OpenCL Built-in Functions
B.2 Math functions
- cospi() ldexp()
- exp() lgamma()
- exp2() lgamma_r()
- exp10() log()
- expml() log10()
- floor() log1p()
- fma() logb()
- log2() modf()
- pow() nan()
- pown() nextafter()
- powr() remainder()
- rsqrt() remquo()
- sin() rootn()
- sincos() sinh()
- sinpi() tan()
- sqrt() tanh()
- - tanpi()
- - tgamma()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-98
Non-Confidential
B OpenCL Built-in Functions
B.2 Math functions
Note
The ulp error of lgamma() is 16ulp unless the correctly rounded result is less than one. If the correctly
rounded result is less than one, lgamma() is also less than one. The error of lgamma_r() is same as
lgamma().
lgamma() is logarithmic, so if the correctly rounded result is small, the precision of the result is not
important.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-99
Non-Confidential
B OpenCL Built-in Functions
B.3 half_ and native_ math functions
half_divide() native_divide()
half_exp() native_exp()
half_exp2() native_exp2()
half_exp10() native_exp10()
half_log() native_log()
half_log2() native_log2()
half_log10() native_log10()
half_powr() native_powr()
half_recip() native_recip()
half_rsqrt() native_rsqrt()
half_sin() native_sin()
half_sqrt() native_sqrt()
half_tan() native_tan()
Mali GPUs implement most of the full precision variants of the half_ and native_ math functions at
full speed so you are not required to use the half_ and native_ functions.
Note
On Mali GPUs, the following functions are faster than the full precision versions:
• native_sin().
• native_cos().
• native_tan().
• native_divide().
• native_exp().
• native_sqrt().
• half_sqrt().
Related references
B.2 Math functions on page Appx-B-98.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-100
Non-Confidential
B OpenCL Built-in Functions
B.4 Integer functions
Function Notes
abs()
abs_diff()
add_sat()
hadd()
rhadd()
clz()
max()
min()
sub_sat()
mad_hi()
mul_hi()
mad_sat()
rotate()
upsample()
popcount()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-101
Non-Confidential
B OpenCL Built-in Functions
B.5 Common functions
Function
max()
min()
step()
clamp()
degrees()
mix()
radians()
smoothstep()
sign()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-102
Non-Confidential
B OpenCL Built-in Functions
B.6 Geometric functions
Function
dot()
normalize()
fast_distance()
fast_length()
fast_normalize()
cross()
distance()
length()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-103
Non-Confidential
B OpenCL Built-in Functions
B.7 Relational functions
Function
any()
all()
bitselect()
select()
isequal()
isnotequal()
isgreater()
isgreaterequal()
isless()
islessequal()
islessgreater()
isfinite()
isinf()
isnan()
isnormal()
isordered()
isunordered()
signbit()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-104
Non-Confidential
B OpenCL Built-in Functions
B.8 Vector data load and store functions
Function
vload()
vstore()
vload_half()
vstore_half()
vloada_half()
vstorea_half()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-105
Non-Confidential
B OpenCL Built-in Functions
B.9 Synchronization
B.9 Synchronization
List of synchronization functions.
The barrier() function has no speed rating because it must wait for multiple work-items to complete.
The time this takes determines the length of time the function takes in your application. This also
depends on several factors such as:
• The number of work-items in the work-groups being synchronized.
• How much the work-items diverge.
Function
barrier()
mem_fence()
read_mem_fence()
write_mem_fence()
Note
ARM recommends that you avoid using barriers, especially in small kernels.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-106
Non-Confidential
B OpenCL Built-in Functions
B.10 Asynchronous copy functions
Function
async_work_group_copy()
async_work_group_strided_copy()
wait_group_events()
prefetch()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-107
Non-Confidential
B OpenCL Built-in Functions
B.11 Atomic functions
Function
atomic_add()
atomic_sub()
atomic_xchg()
atomic_inc()
atomic_dec()
atomic_cmpxchg()
atomic_min()
atomic_max()
atomic_and()
atomic_or()
atomic_xor()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-108
Non-Confidential
B OpenCL Built-in Functions
B.12 Miscellaneous vector functions
Function
vec_step()
shuffle()
shuffle2()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-109
Non-Confidential
B OpenCL Built-in Functions
B.13 Image read and write functions
Function
read_imagef()
read_imagei()
read_imageui()
write_imagef()
write_imagei()
write_imageui()
get_image_width()
get_image_height()
get_image_depth()
get_image_channel_data_type()
get_image_channel_order()
get_image_dim()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-B-110
Non-Confidential
Appendix C
OpenCL Extensions
This appendix describes the OpenCL extensions that the Mali GPU OpenCL driver supports.
It contains the following section:
• C.1 OpenCL extensions supported by the Mali™ GPU OpenCL driver on page Appx-C-112.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-C-111
Non-Confidential
C OpenCL Extensions
C.1 OpenCL extensions supported by the Mali™ GPU OpenCL driver
Note
The cl_khr_fp64 extension only works on Mali Midgard GPUs.
The Mali GPU OpenCL driver on Midgard GPUs also supports the following optional ARM extensions:
• cl_arm_core_id.
• cl_arm_printf.
• cl_arm_thread_limit_hint.
• cl_arm_import_memory.
• cl_arm_import_memory_dma_buf.
• cl_arm_non_uniform_work_group_size.
The Mali GPU OpenCL driver on Bifrost GPUs also supports the optional ARM extension,
cl_arm_shared_virtual_memory.
Related information
The Khronos Group.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-C-112
Non-Confidential
Appendix D
Using OpenCL Extensions
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-D-113
Non-Confidential
D Using OpenCL Extensions
D.1 Inter-operation with EGL
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-D-114
Non-Confidential
D Using OpenCL Extensions
D.2 The cl_arm_printf extension
Note
the printf() function is included in OpenCL 1.2.
Related information
http://www.khronos.org.
void callback(const char *buffer, size_t length, size_t final, void *user_data)
{
fwrite(buffer, 1, length, stdout);
}
int main()
{
cl_platform_id platform;
cl_device_id device;
cl_context context;
cl_context_properties context_properties[] =
{
CL_CONTEXT_PLATFORM, 0,
CL_PRINTF_CALLBACK_ARM, (cl_context_properties)callback,
CL_PRINTF_BUFFERSIZE_ARM, 0x1000,
0
};
cl_command_queue queue;
cl_program program;
cl_kernel kernel;
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-D-115
Non-Confidential
D Using OpenCL Extensions
D.2 The cl_arm_printf extension
clFinish(queue);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
return 0;
}
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-D-116
Non-Confidential
Appendix E
OpenCL 1.2
This appendix describes some of the important changes to the Mali OpenCL driver in OpenCL 1.2.
It contains the following sections:
• E.1 OpenCL 1.2 compiler options on page Appx-E-118.
• E.2 OpenCL 1.2 compiler parameters on page Appx-E-119.
• E.3 OpenCL 1.2 functions on page Appx-E-120.
• E.4 Functions deprecated in OpenCL 1.2 on page Appx-E-121.
• E.5 OpenCL 1.2 extensions on page Appx-E-122.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-E-117
Non-Confidential
E OpenCL 1.2
E.1 OpenCL 1.2 compiler options
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-E-118
Non-Confidential
E OpenCL 1.2
E.2 OpenCL 1.2 compiler parameters
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-E-119
Non-Confidential
E OpenCL 1.2
E.3 OpenCL 1.2 functions
clLinkProgram()
Using this typically does not provide much performance benefit in the Mali OpenCL driver.
clCompileProgram()
Using this typically does not provide much performance benefit in the Mali OpenCL driver.
clEnqueueMarkerWithWaitList()
clEnqueueBarrierWithWaitList()
clEnqueueMigrateMemObjects()
The Mali OpenCL driver supports the memory object migration API
clEnqueueMigrateMemObjects(), but this does not provide any benefit because Mali GPUs
use a unified memory architecture.
OpenCL 1.2 includes the following built-in function:
printf()
Note
The flag CL_MAP_WRITE_INVALIDATE_REGION has no effect in the Mali OpenCL driver.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-E-120
Non-Confidential
E OpenCL 1.2
E.4 Functions deprecated in OpenCL 1.2
Function
clEnqueueMarker()
clEnqueueBarrier()
clEnqueueWaitForEvents()
clCreateImage2D()
clCreateImage3D()
clUnloadCompiler()
clGetExtensionFunctionAddress()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-E-121
Non-Confidential
E OpenCL 1.2
E.5 OpenCL 1.2 extensions
Related information
http://www.khronos.org.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-E-122
Non-Confidential
Appendix F
Revisions
This appendix contains a list of technical changes made between releases and where they are documented
in this guide.
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-F-123
Non-Confidential
F Revisions
F.1 Revisions
F.1 Revisions
This describes the technical changes between released issues of this guide.
Removed section on OpenCL inter-operation D.1 Inter-operation with EGL on page Appx-D-114 All Mali GPUs
with OpenGL ES
Removed E.4 Functions deprecated in OpenCL 1.2 All Mali GPUs
functions:clCreateFromGLTexture2D() on page Appx-E-121
clCreateFromGLTexture3D()
ARM 100614_0303_00_en Copyright © 2012, 2013, 2015–2017 ARM Limited or its affiliates. All rights reserved. Appx-F-124
Non-Confidential