Halide RL
Abstract. Writing efficient image processing code is a demanding task, and
much programming effort goes into porting existing code to new generations of
hardware. Moreover, what counts as efficient code varies according to the
desired optimization target, such as runtime, energy consumption or memory
usage. We present a semi-automatic schedule generation system for the Halide
DSL that uses a Reinforcement Learning agent to choose a set of scheduling
options that optimizes the runtime of the resulting application. We compare our
results to state-of-the-art implementations of three Halide pipelines and show
that our agent is able to surpass hand-tuned code and Halide’s auto-scheduler
in most scenarios on CPU and GPU architectures.
1. Introduction
Image processing libraries devote a great amount of their development effort to optimizing
algorithms and code in order to achieve good performance. This optimization, for each
algorithm (and sometimes even for each image size), has to be custom tailored to specific
hardware architectures if the code is to run close to the theoretical optimum. Moreover,
“good performance” can be defined in different ways (compute time, energy consumption,
memory usage), and each definition may require specific optimization strategies.
Optimized code development is a very challenging task, and skilled programmers are
scarce. Once the code is optimized, it is usually very intricate and specific. Small
modifications in an image processing pipeline or data structure can render the optimization
ineffective or even useless, demanding a re-implementation.
Given the great profusion of imaging hardware architectures, especially considering
mobile and embedded systems, it is almost impossible to develop and maintain optimized
versions of image processing tools that stay updated with respect to the above-mentioned
variables. Halide [Ragan-Kelley et al. 2012] was therefore created as a Domain-Specific
Language (DSL) that aims to decouple algorithms from schedules, allowing
for an easier optimization of image processing pipelines.
We propose an approach based on the Proximal Policy Optimization (PPO)
[Schulman et al. 2017] reinforcement learning agent to semi-automatically generate and
optimize schedules for Halide algorithms. We developed an interface between the Halide
compiler and the agent so that it can “learn” to choose scheduling directives, guided
by feedback based on the runtime of the generated code. Experiments with three algorithms
were performed on CPU and GPU architectures, and the results were compared to versions
generated by Halide’s auto-scheduler and to hand-optimized code.
2. Background
2.1. Halide Language
The Halide DSL is designed to facilitate the portability of high-performance image
processing code across modern heterogeneous architectures, generating code capable of
exploiting memory locality, vectorization (SIMD instructions) and parallelism on multicore
CPUs and GPGPUs [Ragan-Kelley and Adams 2012]. Halide is implemented as a C++ embedded
DSL and distributed as a library. It can be compiled for a broad set of platforms, such
as x86, ARM, CUDA, OpenCL and OpenGL, on OS X, Linux, Android and Windows.
The innovation of Halide is the decoupling between the definition of the algorithm
(logic) and the organization of how it should be computed (execution schedule), i.e.,
in Halide the logic of the image processing algorithm is separated from the nesting of
processing loops, parallelism statements, vectorization, etc. [Ragan-Kelley et al. 2012].
Changing the schedule affects neither the definition nor the results of the algorithm’s
logic, making it easier for programmers to experiment with different schedules in order
to achieve better performance on specific hardware architectures.
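For instance, the same trivial stage can be given different schedules while its definition and output stay unchanged; the sketch below is only an illustration with arbitrary directive choices, not a schedule taken from this work.

// Minimal illustration (not from the paper): the algorithm definition is fixed,
// and only the schedule changes between the two variants.
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
    f(x, y) = x + y;  // algorithm (logic): never touched below

    // Schedule variant 1: parallel rows, vectorized columns.
    f.parallel(y).vectorize(x, 8);
    // Schedule variant 2 (would replace the line above): tiled execution.
    // f.tile(x, y, xo, yo, xi, yi, 64, 64).parallel(yo).vectorize(xi, 8);

    Buffer<int32_t> out = f.realize({256, 256});  // the result is identical for both schedules
    return 0;
}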
Image processing algorithms are defined in Halide as a multistage processing
pipeline that performs operations on data consumed from previous stages and produces
results for subsequent stages, until the end of the algorithm. A pipeline can be
represented as a graph connecting the different stages, coded using a functional syntax
in order to facilitate the understanding and readability of the code.
The execution schedule consists of a series of scheduling directives that are applied
to the pipeline stages. These directives are implemented as methods of C++ classes
defined by the Halide DSL and drive the Halide compiler’s code generation for the
corresponding pipeline. Listing 1 presents an image smoothing algorithm (3 × 3 blur) written
in Halide, including the definition of a pipeline (Lines 5 and 6) and a possible execution
schedule with directives (Lines 8 and 9). Line 11 compiles and runs the pipeline
through the realize operation.
1 Image<float> in = load<float>("img.png");
2 Func blur_x, blur_y; Var x, y, xi, yi;
3
4 // Algorithm / Processing pipeline
5 blur_x(x, y)=(in(x,y) + in(x+1,y) + in(x+2,y)) / 3;
6 blur_y(x, y)=(blur_x(x,y) + blur_x(x,y+1) + blur_x(x,y+2))/3;
7 // Execution schedule
8 blur_y.tile(x, y, xi, yi, 8, 4).parallel(y).vectorize(xi, 8);
9 blur_x.compute_at(blur_y,x).vectorize(x,8);
10
11 Image<float> out = blur_y.realize(in.width()-2, in.height()-2);
Listing 1. Halide program for blurring an image using a 3×3 mean kernel.
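A convenient way to see the loop structure that a given schedule produces, without reading the generated code, is Halide's built-in loop-nest printer; the single call below is standard Func API and can be added before realize (the output format depends on the Halide version):

// Prints the loop nest implied by the current schedule of blur_y.
blur_y.print_loop_nest();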
3. Related Work
Since the conception of the Halide language, with its decoupling between the definition of
the pipeline and the scheduling, efforts have been made to automate the generation
of optimized schedules. The Halide authors show that analyzing and generating individual
schedules for each pipeline stage is simple, but complexity grows exponentially when
composing these stages. Thus, finding an optimized schedule through exhaustive search
over all possible scheduling alternatives is impracticable due to the huge search space. For
example, the pipeline of an algorithm with approximately 100 stages has an estimated
lower bound of 10^720 possible schedules [Ragan-Kelley 2014, Ragan-Kelley et al. 2013].
In order to support the creation of schedules, the Halide authors proposed an
autotuner that uses a genetic algorithm to find an efficient schedule. However, this
approach needs specific rules to be defined for each schedule being optimized, has
difficulty escaping local minima, and converges very slowly
[Mullapudi et al. 2015]. There is also a Halide schedule optimizer based on OpenTuner
that has more general rules [Ansel et al. 2014]. OpenTuner was able to find efficient
schedules for simple pipelines (8 stages), but failed to converge on more complex
pipelines (32, 44 and 49 stages), with the generated schedules being five to ten times
slower than manually optimized versions.
Another approach [Mullapudi et al. 2016] extended Halide’s function domain boundary
analysis mechanism to split the pipeline stages into groups and find an efficient
tile size for each group by estimating the arithmetic cost of each stage. The cost is used
to define a tile size that minimizes the number of data loads, and the processing loops
are ordered to maximize data locality and the parallelism of the outermost loops. This
approach is able to generate optimized schedules in a short time, but does not currently
target GPU architectures, and by inspecting the generated code the authors noted that it
could still be significantly improved.
Figure 2. The image processing pipeline and the scheduling options are the input
for the reinforcement learning agent, which generates execution schedules
by choosing and executing one scheduling option at a time, refining
estimates for upcoming choices based on information gathered from previous
choices and on a reward based on runtime.
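As a rough sketch of the interaction loop in Figure 2, the fragment below stubs out both the Halide side and the PPO policy; all identifiers (SchedulingOption, measure_runtime_ms, choose_option, run_episode) are our own illustration and not the names used in the actual implementation.

// Hypothetical sketch of one environment episode: the agent picks one scheduling
// option at a time, the partial schedule is compiled and run, and the measured
// runtime is converted into a reward (Equation (1), given later).
#include <cstddef>
#include <string>
#include <vector>

struct SchedulingOption { std::string directive; std::vector<std::string> args; };

// Stub: the real environment applies the chosen directives to the Halide
// pipeline, JIT-compiles it and measures its runtime in milliseconds.
double measure_runtime_ms(const std::vector<SchedulingOption>& schedule) {
    return 100.0 - 5.0 * static_cast<double>(schedule.size());
}

// Stub: the real agent queries the PPO policy for the next action index.
std::size_t choose_option(std::size_t /*num_options*/) { return 0; }

void run_episode(const std::vector<SchedulingOption>& options, int max_steps) {
    std::vector<SchedulingOption> chosen;               // directives applied so far
    const double rt_init = measure_runtime_ms(chosen);  // runtime of the unscheduled pipeline
    double rt_prev = rt_init;
    for (int step = 0; step < max_steps && !options.empty(); ++step) {
        chosen.push_back(options[choose_option(options.size())]);
        const double rt_curr = measure_runtime_ms(chosen);
        // Reward as in Equation (1): runtime gain scaled by 100 / rt_init, else 0.
        const double reward =
            rt_curr < rt_prev ? (rt_prev - rt_curr) * (100.0 / rt_init) : 0.0;
        (void)reward;  // handed to the PPO learner together with state and action
        rt_prev = rt_curr;
    }
}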
[Figure 3 diagram: a three-level tree of Directives (e.g., vectorize, tile, parallel), Directive Parameters (e.g., var, factor) and Parameter Arguments (e.g., x, xi, 4, 8, 16).]
Figure 3. Illustration of the space elicitation of scheduling options for the Blur
pipeline. The blue line indicates one possible scheduling option.
Listing 2 exemplifies scheduling options deemed important for the schedule generation
of the Blur pipeline presented in Listing 1, with their respective sets of possible
parameters and arguments. These options and a Halide algorithm must be provided by
the programmer as input to the PPO agent, as depicted in Figure 2.
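As an illustration of the directive / parameter / argument levels of Figure 3, the elicited scheduling options for the Blur pipeline could be held in a structure such as the one below; the container layout and the admissible values shown are assumptions that merely mirror the figure, not the format actually consumed by the system.

#include <map>
#include <string>
#include <vector>

using Arguments  = std::vector<std::string>;           // admissible values, e.g. {"x", "xi"} or {"4", "8", "16"}
using Parameters = std::map<std::string, Arguments>;   // parameter name -> admissible arguments
using Options    = std::map<std::string, Parameters>;  // directive name  -> its parameters

// Hypothetical set of elicited options for the Blur pipeline, mirroring Figure 3.
Options blur_options() {
    return {
        {"vectorize", {{"var", {"x", "xi"}}, {"factor", {"4", "8", "16"}}}},
        {"tile",      {{"var", {"x", "y"}},  {"factor", {"4", "8", "16"}}}},
        {"parallel",  {{"var", {"y"}}}},
    };
}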
r = (rt_prev − rt_curr) × rs    if rt_curr < rt_prev
    0                           if rt_curr ≥ rt_prev        (1)
    −1                          if an error occurs
where rs = 100/rt_init is a normalizing scale factor based on the initial execution time
rt_init of the input pipeline. In this way, the reward has a similar amplitude for different
input pipelines, proportional to the obtained gain, even if the pipelines have quite
different sizes and execution times.
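A direct transcription of Equation (1) into C++ might look as follows; the function signature and the explicit error flag are our own framing of the computation.

// Reward of Equation (1): the gain (rt_prev - rt_curr) scaled by rs = 100 / rt_init
// when the new schedule is faster, zero when it is not, and -1 when the generated
// schedule fails to compile or run.
double compute_reward(double rt_prev, double rt_curr, double rt_init, bool error) {
    if (error)              return -1.0;
    if (rt_curr >= rt_prev) return 0.0;
    const double rs = 100.0 / rt_init;  // normalization by the initial runtime
    return (rt_prev - rt_curr) * rs;
}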
5. Experimental Results
The proposed PPO reinforcement learning agent was tested on three image processing
pipelines, using different image sizes and both CPU and GPU computing architectures.
For each pipeline ∈ {Blur, Harris, Interpolation}, runtime() denotes the execution time,
reported in milliseconds, without considering the time spent compiling the Halide program.
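One way to obtain such a measurement with Halide's JIT path is to compile the pipeline first and time only the realize() calls, as sketched below; the warm-up and averaging strategy is an assumption and not necessarily the exact measurement code used in the experiments. It assumes a 2D, float-valued output stage such as blur_y in Listing 1.

#include <chrono>
#include "Halide.h"
using namespace Halide;

// Times only the execution of an already-compiled pipeline, in milliseconds.
double time_pipeline_ms(Func output, int width, int height, int iterations = 10) {
    output.compile_jit();              // JIT compilation happens outside the timed region
    Buffer<float> out(width, height);
    output.realize(out);               // warm-up run
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        output.realize(out);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
}

The measured value therefore reflects only pipeline execution, matching the metric reported in the results.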
[Figure 4 plots: (a) Average Reward and (b) Average Runtime (ms) versus PPO Iterations, for Trials 1 to 4.]
Figure 4. Evolution of the Reward (a) and Runtime (b) of schedules generated
by the PPO agent for the Interpolation pipeline during four independent
executions (trials). Negative rewards indicate invalid schedules.
// Blur pipeline, CPU schedules:
// Hand-tuned:
blur_y.split(y, y, yi, 8).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);
// PPO:
blur_y.unroll(x, 4).parallel(y).bound(x, 0, input.width());

// Blur pipeline, GPU schedules:
// Hand-tuned:
blur_y.gpu_tile(x, y, xi, yi, 16, 16);
blur_x.compute_at(blur_y, x).gpu_threads(x, y);
// PPO:
blur_y.gpu_tile(x, y, xi, yi, 128, 8).unroll(yi, 4).unroll(xi, 2);
[Figure: relative performance (0 to 1) of the Hand-tuned, Auto-schedule and PPO schedulers, grouped by CPU and GPU, for each test image of the evaluated pipelines (panels include Harris-Image1 to Harris-Image3 and Interpolation-Image1 to Interpolation-Image3).]
For each scenario, the method with the best result, i.e., the shortest execution time, is
marked with “-” in its slowdown column of Table 2, while the other methods show their
slowdown relative to that best result.
6. Conclusion
In this work we presented an approach for the generation and optimization of Halide
schedules using a reinforcement learning technique. The approach uses an implementation
of a PPO agent that interacts with an environment built in this work. One important
aspect of the definition of the environment is the calculation of the reward,
which is the most important information guiding the learning of the agent.
Another relevant aspect is the composition of the space of actions to be explored by
the agent. This space is defined from the set of scheduling options and provided as
input during the initialization of the environment.
In the current implementation the scheduling options are elicited by the Halide
programmer for each pipeline to be optimized. An advantage of this choice is that it allows
the programmer to use her/his professional experience and knowledge about the pipeline
to include only plausible options, thus reducing the search space to be explored by the
PPO agent. On the other hand, as it requires a programmer’s intervention, the developed
mechanism is not fully automated, positioning itself as an intermediate model between
the manual development of the schedule and the auto-schedule mechanism available in
the Halide language. However, the proposed approach is independent of the hardware
architecture used. In the present work the approach was evaluated on two architectures, CPU
and GPU, but it can also be used on other architectures supported by the Halide language.

Table 2. Absolute Execution Time (ms) and Relative Slowdown (×) by Architecture
and Scheduling Method.

                        CPU                                     GPU
                Hand-tuned   Auto-sched.     PPO        Hand-tuned      PPO
                 ms     ×      ms     ×     ms     ×      ms     ×     ms     ×
Blur    Img1     0.10   -      0.13   1.3   0.12   1.2    0.07   1.4   0.05   -
        Img2     0.49   -      0.51   1.0   0.52   1.1    0.21   1.9   0.11   -
        Img3     2.94   1.1    3.15   1.2   2.66   -      0.76   1.9   0.39   -
Harris  Img1     3.21   6.4    0.68   1.4   0.50   -      0.22   1.8   0.12   -
        Img2    13.07   6.2    2.82   1.3   2.10   -      0.72   1.8   0.39   -
        Img3    36.89   4.7   10.71   1.4   7.78   -      2.79   1.9   1.46   -
Interp. Img1     4.51   1.2    4.10   1.0   3.91   -      3.27   1.1   2.94   -
        Img2    17.43   1.5   11.49   -    16.88   1.5    6.22   -     6.26   1.0
        Img3    77.23   1.2   85.89   1.4  63.57   -     16.23   -    18.52   1.1
Results show that the reinforcement learning agent was able to converge to good
Halide execution schedules for the evaluated pipelines, on both architectures, although
it did not reach the best result in some of the considered scenarios when compared to
the Hand-tuned and Auto-schedule methods. These results also suggest that the proposed
environment, the method of calculating the reward, and the representation of states and
actions were effective in casting the problem in a way compatible with the reinforcement
learning technique. More test scenarios and a broader set of pipelines still need to be
evaluated in order to have a comprehensive and reliable indicator.
A point that requires attention is the considerable duration of the experiments,
which may eventually render the developed solution impracticable for large
pipelines. Since most of the time is spent compiling each proposed schedule, using
an agent already trained on one pipeline to optimize the schedule of another pipeline
through Transfer Learning techniques [Pan et al. 2010] should be considered.
Another enhancement would be to automate the elicitation of scheduling options. A
possible implementation could adapt Halide’s existing auto-schedule mechanism to use
information extracted from the pipeline through the language compiler and generate the
set of scheduling options that would then be explored by the reinforcement learning agent.
References
Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bosboom, J., O’Reilly, U.-M.,
and Amarasinghe, S. (2014). OpenTuner: An extensible framework for program
autotuning. http://opentuner.org. Accessed 2018-01.
Dhariwal, P., Hesse, C., Klimov, O., et al. (2017). OpenAI Baselines.
https://github.com/openai/baselines. Accessed 2018-09.
Huang, S. (2018). Introduction to various reinforcement learning algorithms, part I:
Q-learning, SARSA, DQN, DDPG. https://towardsdatascience.com/72a5e0cb6287.
Accessed 2018-09.
Júnior, E. P. F. D. (2012). Aprendizado por reforço sobre o problema de revisitação de
páginas web. Master’s thesis, Pós-Graduação em Informática - Pontifícia Universidade
Católica do Rio de Janeiro, Rio de Janeiro - RJ.
Mullapudi, R. T., Adams, A., Sharlet, D., Ragan-Kelley, J., and Fatahalian, K. (2016).
Automatically scheduling halide image processing pipelines. ACM Trans. Graph.,
35(4):83:1–83:11.
Mullapudi, R. T., Vasista, V., and Bondhugula, U. (2015). Polymage: Automatic opti-
mization for image processing pipelines. ACM SIGPLAN Notices, 50(4):429–443.
Ottoni, A. L. C., Oliveira, M. S., Nepomuceno, E. G., and Lamperti, R. D. (2015). Análise
do aprendizado por reforço via modelos de regressão logística: Um estudo de caso
no futebol de robôs. Revista Junior de Iniciação Científica em Ciências Exatas e
Engenharia, 1(10):44–49.
Pan, S. J., Yang, Q., et al. (2010). A survey on transfer learning. IEEE Trans. on Knowl-
edge and Data Engineering, 22(10):1345–1359.
Ragan-Kelley, J. and Adams, A. (2012). Halide: A language for image processing.
http://halide-lang.org. Accessed 2018-08.
Ragan-Kelley, J., Adams, A., Paris, S., Levoy, M., Amarasinghe, S., and Durand, F.
(2012). Decoupling algorithms from schedules for easy optimization of image pro-
cessing pipelines. ACM Trans. Graph., 31(4):32:1–32:12.
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S.
(2013). Halide: A language and compiler for optimizing parallelism, locality, and
recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530.
Ragan-Kelley, J. M. (2014). Decoupling algorithms from the organization of computation
for high performance image processing. PhD thesis, MIT.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silva, L. M. D. D. (2016). Proposta de arquitetura em hardware para fpga da técnica q-
learning de aprendizagem por reforço. Master’s thesis, Pós-Graduação em Engenharia
Elétrica e de Computação - Universidade Federal do Rio Grande do Norte, Natal - RN.
Silver, D., Lever, G., Heess, N., Degris, T., et al. (2014). Deterministic policy gradient
algorithms. In International Conference on Machine Learning.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT
press Cambridge.