C++ AMP: Language and Programming Model
ABSTRACT
C++ AMP (Accelerated Massive Parallelism) is a native programming model that contains elements that span the C++
programming language and its runtime library. It provides an easy way to write programs that compile and execute on
data-parallel hardware, such as graphics cards (GPUs).
The syntactic changes introduced by C++ AMP are minimal, but additional restrictions are enforced to reflect the limitations
of data parallel hardware.
Data parallel algorithms are supported by the introduction of multi-dimensional array types, array operations on those types,
indexing, asynchronous memory transfer, shared memory, synchronization and tiling/partitioning techniques.
Overview
1.1 Conformance
1.2 Definitions
1.3 Error Model
1.4 Programming Model
2.2 Meaning of Restriction Specifiers
2.2.1 Function Definitions
2.3 Expressions Involving Restricted Functions
2.3.1 Function pointer conversions
2.3.3 Casting
2.4 amp Restriction Modifier
2.4.1 Restrictions on Types
2.4.3.1 Literals
Synopsis
3.2.4 Constructors
3.2.5 Members
3.2.6 Properties
3.3 accelerator_view
3.3.1 Synopsis
3.3.3 Constructors
3.3.4 Members
3.4 Device enumeration and selection API
3.4.1 Synopsis
Constructors
4.1.3 Members
4.1.4 Operators
4.2 extent<N>
4.2.1 Synopsis
4.2.2 Constructors
4.2.3 Members
4.2.4 Operators
4.3 tiled_extent<D0,D1,D2>
4.3.1 Synopsis
4.3.2 Constructors
4.3.3 Members
4.3.4 Operators
4.4 tiled_index<D0,D1,D2>
4.4.1 Synopsis
4.4.2 Constructors
4.4.3 Members
4.5 tile_barrier
4.5.1 Synopsis
4.5.2 Constructors
4.5.3 Members
4.6 completion_future
4.6.1 Synopsis
4.6.2 Constructors
4.6.3 Members
Constructors
5.1.3 Members
5.1.4 Indexing
5.2 array_view<T,N>
5.2.1 Synopsis
5.2.1.1 array_view<T,N>
5.2.2 Constructors
5.2.3 Members
5.2.4 Indexing
5.3 Copying Data
5.3.1 Synopsis
Synopsis
Atomically Exchanging Values
Atomically Applying an Integer Numerical Operation
fast_math
precise_math
Miscellaneous Math Functions (Optional)
Operators
12.2.2 out_of_memory
Overview
C++ AMP is a compiler and programming model extension to C++ that enables the acceleration of C++ code on data-parallel
hardware.
One example of data-parallel hardware today is the discrete graphics card (GPU), which is becoming increasingly relevant for
general purpose parallel computations, in addition to its main function as a graphics accelerator. While GPUs may be tightly
integrated with the CPU and can share its memory space, C++ AMP programmers must remain aware that the GPU can also be
physically separate from the CPU, with a discrete memory address space and a high cost for transferring data
between CPU and GPU memory. The programmer must carefully balance the cost of this potential data transfer overhead
against the computational acceleration achievable by parallel execution on the device. The programmer must also follow
some basic conventions to avoid unnecessary copies on systems that have separate memory (see, for example, the
discard_data() method of array_view<T,N>).
Another example of data-parallel hardware is the SIMD vector instruction set, and associated registers, found in all modern
processors.
For the remainder of this specification, we shall refer to the data-parallel hardware as the accelerator. In the few places
where the distinction matters, we shall refer to a GPU or a VectorCPU.
The C++ AMP programming model gives the developer explicit control over all of the above aspects of interaction with the
accelerator. The developer may explicitly manage all communication between the CPU and the accelerator, and this
communication can be either synchronous or asynchronous. The data parallel computations performed on the accelerator
are expressed using high-level abstractions, such as multi-dimensional arrays, high level array manipulation functions, and
multi-dimensional indexing operations, all based on a large subset of the C++ programming language.
The programming model contains multiple layers, allowing developers to trade off ease-of-use with maximum performance.
C++ AMP is composed of three broad categories of functionality:
1.
2.
3.
1.1 Conformance
All text in this specification falls into one of the following categories:
C++ AMP : Language and Programming Model : Version 0.9 : January 2012
Normative: all text, unless otherwise marked (see previous categories) is normative. Normative text falls into the
following two sub-categories:
o Optional: each section of the specification that falls into this sub-category includes the suffix (Optional)
in its title. A conforming implementation of C++ AMP may choose to support such features, or not.
(Microsoft-specific portions of the text are also Optional.)
o Required: unless otherwise stated, all Normative text falls into the sub-category of Required. A conforming
implementation of C++ AMP must support all Required features.
Conforming implementations shall provide all normative features and any number of optional features. Implementations may
provide additional features so long as these features are exposed in namespaces other than those listed in this specification.
Implementations may provide additional language support for amp-restricted functions (section 2.1) by following the rules set
forth in section 13.
The programming model utilizes Microsoft's Visual C++ syntax for properties. Any such property shall be considered optional.
An implementation is free to use equivalent mechanisms for introducing such properties as long as they provide the same
functionality of indirection to a member function as Microsoft's Visual C++ properties do.
1.2 Definitions
This section introduces terms used within the body of this specification.
Accelerator
A hardware device or capability that enables accelerated computation on data-parallel workloads. Examples
include:
o Graphics Processing Unit, or GPU, or other coprocessor, accessible through the PCIe bus.
o Graphics Processing Unit, or GPU, or other coprocessor that is integrated with a CPU on the same die.
o SIMD units of the host node exposed through software emulation of a hardware accelerator.
Array
A dense N-dimensional data container.
Array View
A view into a contiguous piece of memory that adds array-like dimensionality.
Extent
A vector of integers that describes lengths of N-dimensional array-like objects.
Global memory
On a GPU, global memory is the main off-chip memory store.
Informative: Typically, on current-generation GPUs, global memory is implemented in DRAM, with access times of
400-1000 cycles at a GPU clock speed of around 1 GHz, and may or may not be cached. Global memory is accessed
in a coalesced pattern with a granularity of 128 bytes, so when accessing 4 bytes of global memory, 32 successive
threads need to read the 32 successive 4-byte addresses to be fully coalesced.
Informative: The memory space of current GPUs is typically disjoint from its host system.
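The coalescing arithmetic in the note above (128 bytes / 4 bytes = 32 threads) can be checked with a small host-side sketch. The 128-byte granularity is taken from the informative text; the helper name segments_touched and the fixed-stride access model are assumptions made for illustration and are not part of C++ AMP.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper, not a C++ AMP API.
// Counts how many aligned memory segments a group of threads touches when
// thread t reads elem_bytes bytes starting at byte offset t * stride_bytes.
std::size_t segments_touched(std::size_t threads, std::size_t elem_bytes,
                             std::size_t stride_bytes,
                             std::size_t segment_bytes = 128) {
    assert(elem_bytes > 0 && segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t first = t * stride_bytes;      // first byte read by thread t
        const std::size_t last = first + elem_bytes - 1;  // last byte read by thread t
        for (std::size_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under this model, 32 threads reading successive 4-byte values (stride 4) fall into a single 128-byte segment, whereas the same threads with a stride of 128 bytes touch 32 separate segments.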
GPGPU: General Purpose computation on Graphics Processing Units, which is a GPU capable of running non-graphics computations.
GPU: A specialized (co)processor that offloads graphics computation and rendering from the host. As GPUs have
evolved, they have become increasingly able to offload non-graphics computations as well (see GPGPU).
Heterogeneous programming
A workload that combines kernels executing on data-parallel compute nodes with algorithms running on CPUs.
Host
The operating system process and the CPU(s) that it is running on.
Host thread
The operating system thread and the CPU(s) that it is running on. A host thread may initiate a copy operation or
parallel loop operation that may run on an accelerator.
Index
A vector of integers that describes an N-dimensional point in iteration space or index space.
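The relationship between an extent and an index can be sketched in plain C++. The aliases Extent and Index below are stand-ins built on std::array, not the specification's extent<N> and index<N> types, and the row-major layout with dimension 0 as the most significant is an assumption made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-ins for extent<N> and index<N>; assumptions for this sketch only.
template <std::size_t N> using Extent = std::array<int, N>;
template <std::size_t N> using Index = std::array<int, N>;

// Total number of elements described by an extent (product of its lengths).
template <std::size_t N>
long long element_count(const Extent<N>& e) {
    long long n = 1;
    for (int len : e) n *= len;
    return n;
}

// True when every component of idx lies in [0, e[d]).
template <std::size_t N>
bool contains(const Extent<N>& e, const Index<N>& idx) {
    for (std::size_t d = 0; d < N; ++d) {
        if (idx[d] < 0 || idx[d] >= e[d]) return false;
    }
    return true;
}

// Row-major linear offset of idx inside e (dimension 0 most significant).
template <std::size_t N>
long long flatten(const Extent<N>& e, const Index<N>& idx) {
    assert(contains(e, idx));
    long long offset = 0;
    for (std::size_t d = 0; d < N; ++d) offset = offset * e[d] + idx[d];
    return offset;
}
```

For a 3 x 4 extent, the index (1, 2) maps to offset 1 * 4 + 2 = 6 under this layout.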
Pixel
A pixel, or picture element, represents a single element in a digital image. Typically pixels are composed of multiple
color components, such as red, green and blue values. Other color representations exist, including single-channel
images that just represent intensity or black and white values.
Reference counting
Reference counting is a memory management technique to manage an object's lifetime. References to an object
are counted and the object is kept alive as long as there is at least one reference to it. A reference counted object
is destroyed when the last reference disappears.
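As a minimal illustration of the technique, std::shared_ptr from the C++ standard library is a familiar reference-counted handle. It is used here only as an example; the specification does not state that any C++ AMP type is implemented with it. The Buffer type and its live counter are hypothetical.

```cpp
#include <cassert>
#include <memory>

// Hypothetical resource; the static counter makes destruction observable.
struct Buffer {
    static int live;
    Buffer() { ++live; }
    ~Buffer() { --live; }
};
int Buffer::live = 0;

// Creates one Buffer, shares it between two handles, and returns the
// reference count observed while both handles are alive.
long count_while_shared() {
    std::shared_ptr<Buffer> first = std::make_shared<Buffer>();
    assert(first.use_count() == 1);
    std::shared_ptr<Buffer> second = first;  // second reference, same object
    return second.use_count();
}
```

When both handles go out of scope at the end of count_while_shared, the last reference disappears and the Buffer is destroyed.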
SIMD unit
Single Instruction Multiple Data. A machine programming model where a single instruction operates over multiple
pieces of data. Translating a program to use SIMD is known as vectorization. GPUs have multiple SIMD units,
which are the streaming multiprocessors.
Informative: An SSE (Nehalem, Phenom) or AVX (Sandy Bridge) or LRBni (Larrabee) vector unit is a SIMD unit or
vector processor.
SMP
Symmetric Multi-Processor: the standard PC multiprocessor architecture.
Texel
A texel or texture element represents a single element of a texture space. Texel elements are mapped to 1D, 2D or
3D surfaces during sampling, rendering and/or rasterization and end up as pixel elements on a display.
Texture
A texture is a 1, 2 or 3 dimensional logical array of texels which is optimized in hardware for spatial access using
texture caches. Textures typically are used to represent image, volumetric or other visual information, although
they are efficient for many data arrays which need to be optimized for spatial access or need to interpolate
between adjacent elements. Textures provide virtualization of storage, whereby shader code can sample a texture
object as if it contained logical elements of one type (e.g., float4) whereas the concrete physical storage of the
texture is represented in terms of a second type (e.g., four 8-bit channels). This allows the application of the same
shader algorithms on different types of concrete data.
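The storage virtualization described above can be sketched on the host: a texel stored as four 8-bit channels is presented to the reader as four floats. The type names, the channel order (first channel in the low byte) and the unsigned-normalized mapping of 0..255 to 0.0..1.0 are assumptions made for illustration; real texture formats define their own layouts.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Physical storage: four 8-bit channels packed into 32 bits (assumed layout).
using PackedTexel = std::uint32_t;
// Logical element type seen by the sampling code; a stand-in for float4.
using Float4 = std::array<float, 4>;

// Reads the packed texel as if it contained four floats in [0, 1].
Float4 sample_unorm8(PackedTexel texel) {
    Float4 out{};
    for (int c = 0; c < 4; ++c) {
        const std::uint32_t channel = (texel >> (8 * c)) & 0xFFu;
        out[c] = static_cast<float>(channel) / 255.0f;  // map 0..255 to 0.0..1.0
    }
    return out;
}

// Inverse mapping: stores four floats in [0, 1] as four 8-bit channels.
PackedTexel pack_unorm8(const Float4& v) {
    PackedTexel texel = 0;
    for (int c = 0; c < 4; ++c) {
        assert(v[c] >= 0.0f && v[c] <= 1.0f);
        const auto channel = static_cast<std::uint32_t>(v[c] * 255.0f + 0.5f);  // round to nearest
        texel |= channel << (8 * c);
    }
    return texel;
}
```

Code written against Float4 never sees the packed representation, which is the sense in which the same algorithm can run over differently stored data.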
Texture Format
Texture formats define the type and arrangement of the underlying bytes representing a texel value.
Informative: Direct3D supports many types of formats, which are described under the DXGI_FORMAT enumeration.
Texture memory
Texture memory space resides in GPU memory and is cached in texture cache. A texture fetch costs one memory
read from GPU memory only on a cache miss, otherwise it just costs one read from texture cache. The texture
cache is optimized for 2D spatial locality, so threads of the same scheduling unit that read texture addresses that
are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant
latency; a cache hit reduces global memory bandwidth demand but not fetch latency.
Tile_static memory
User-managed programmable cache on streaming multiprocessors on GPUs, also known as shared memory. Shared memory is
local to a multiprocessor and shared across threads executing on the same multiprocessor. Shared memory allocations per
thread group will affect the total number of thread groups that are in flight per multiprocessor.
Tiling
Tiling is the partitioning of an N-dimensional dense index space (compute domain) into same-sized tiles, which are
N-dimensional rectangles with sides parallel to the coordinate axes. Tiling is essentially the process of recognizing
the current thread group as being a cooperative gang of threads, with the decomposition of a global index into a
local index plus a tile offset. In C++ AMP it is viewing a global index as a local index and a tile ID described by the
canonical correspondence:
compute grid ~ dispatch grid x thread group
In particular, tiling provides the local geometry with which to take advantage of shared memory and barriers
whose usage patterns enable reducing global memory accesses and coalescing of global memory access. The
former is the most common use of tile_static memory.
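The decomposition of a global index into a tile ID and a local index can be sketched for two dimensions as follows. The struct and function names are hypothetical and only mirror the idea behind tiled_index; they are not the C++ AMP types.

```cpp
#include <cassert>

struct Index2 { int row; int col; };
struct TiledIndex2 { Index2 tile; Index2 local; };

// Splits a global index into the ID of the tile that contains it and the
// position inside that tile, for a compile-time tile shape.
template <int TileRows, int TileCols>
TiledIndex2 decompose(Index2 global) {
    static_assert(TileRows > 0 && TileCols > 0, "tile dimensions must be positive");
    assert(global.row >= 0 && global.col >= 0);
    return TiledIndex2{ Index2{ global.row / TileRows, global.col / TileCols },
                        Index2{ global.row % TileRows, global.col % TileCols } };
}

// Inverse mapping: global = tile * tile_size + local, per dimension.
template <int TileRows, int TileCols>
Index2 compose(const TiledIndex2& t) {
    return Index2{ t.tile.row * TileRows + t.local.row,
                   t.tile.col * TileCols + t.local.col };
}
```

With 16 x 16 tiles, the global index (35, 70) lies in tile (2, 4) at local position (3, 6), and composing the two parts recovers the global index.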
Restricted function
A function that is declared to obey the restrictions of a particular C++ AMP subset. A function can be CPU-restricted,
in which case it can run on a host CPU. A function can be amp-restricted, in which case it can run on an
amp-capable accelerator, such as a GPU or VectorCPU. A function can carry more than one restriction.
1.3 Error Model
Host-side runtime library code for C++ AMP has a different error model than device-side code. For more details, examples
and exception categorization see Error Handling.
Host-Side Error Model: On a host, C++ exceptions and assertions will be used to present semantic errors and hence will be
categorized and listed as error states in API descriptions.
Device-Side Error Model: Microsoft-specific: The debug_printf intrinsic is additionally supported for logging messages
from within the accelerator code to the debugger output window.
Compile-time asserts: The C++ static_assert declaration is often used to handle error states that are detectable at compile time.
In this way static_assert is a technique for conveying static semantic errors, and as such these errors will be categorized
similarly to exception types.
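A minimal sketch of the contrast between a compile-time check and its run-time counterpart follows. The divisibility rule and both function names are hypothetical examples chosen for illustration; they are not requirements stated by this specification.

```cpp
#include <cassert>

// Compile-time variant: an invalid combination fails to compile, so the
// error is reported before the program ever runs.
template <int Extent, int Tile>
constexpr int tile_count() {
    static_assert(Tile > 0, "tile size must be positive");
    static_assert(Extent % Tile == 0, "extent must be divisible by the tile size");
    return Extent / Tile;
}

// Run-time variant of the same (hypothetical) rule, checked with assert.
int tiles_needed(int extent, int tile) {
    assert(tile > 0 && extent % tile == 0);
    return extent / tile;
}
```

Instantiating tile_count<64, 16>() yields 4, while tile_count<60, 16>() would be rejected by the compiler with the second message.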
1.4 Programming Model
The C++ AMP programming model is factored into the following header files:
<amp.h>
<amprt.h>
<amp_math.h>
<amp_graphics.h>
<amp_short_vectors.h>
Here are the types and patterns that comprise C++ AMP.
Indexing level (<amp.h>)
o index<N>
o extent<N>
o tiled_extent<D0,D1,D2>
o tiled_index<D0,D1,D2>
Data level (<amp.h>)
o array<T,N>
o array_view<T,N>, array_view<const T,N>
o copy
o copy_async
Runtime level (<amprt.h>)
o accelerator
o accelerator_view
o completion_future
Call-site level (<amp.h>)
o parallel_for_each
o copy: various commands to move data between compute nodes
Kernel level (<amp.h>)
o tile_barrier
o restrict() clause
o tile_static
o Atomic functions
Math functions (<amp_math.h>)
o Precise math functions
o Fast math functions
Textures (optional, <amp_graphics.h>)
o texture<T,N>
o writeonly_texture_view<T,N>
Short vector types (optional, <amp_short_vectors.h>)
o Short vector types
Direct3D interop (optional and Microsoft-specific)
o Data interoperation on arrays and textures
o Scheduling interoperation on accelerators and accelerator views
o Direct3D intrinsic functions for clamping, bit counting, and other special arithmetic operations.
C++ AMP adds a closed set of restriction specifiers to the C++ type system, with new syntax, as well as rules for how they
behave with respect to conversion rules and overloading. (The set is closed: there is no mechanism proposed here to allow
developers to extend the set of restrictions.)
Restriction specifiers apply to function declarators only. The restriction specifiers perform the following functions:
1. They become part of the signature of the function.
2. They enforce restrictions on the content and/or behaviour of that function.
3. They may designate a particular subset of the C++ language.
For example, an amp restriction would imply that a function must conform to the defined subset of C++ such that it is
amenable for use on a typical GPU device.
2.1 Syntax
The cpu restriction specifies that this function will be able to run on the host CPU.
If a declarator elides the restriction specifier, it behaves as if it were specified with restrict(cpu), except when a restriction
specifier is determined by the surrounding context as specified in section 2.2.1. If a declarator contains a restriction
specifier, then it specifies the entire set of restrictions (in other words, restrict(amp) means the function will be able to
run on the amp target but need not be able to run on the CPU).
Restriction specifiers shall not be applied to other declarators (e.g.: arrays, pointers, references). They can be applied to all
kinds of functions including free functions, static and non-static member functions, special member functions, and overloaded
operators.
Examples:
auto grod() restrict(amp);
auto freedle() restrict(amp) -> double;
class Fred {
public:
    Fred() restrict(amp)
        : member-initializer
    { }
    Fred& operator=(const Fred&) restrict(amp);
    int kreeble(int x, int y) const restrict(amp);
    static void zot() restrict(amp);
};
The restriction-specifier-seq applies to all expressions between the restriction-specifier-seq and the end of the
function-definition, lambda-expression, member-declarator, lambda-declarator or declarator.
lambda-declarator:
    ( parameter-declaration-clause ) attribute-specifier_opt mutable_opt restriction-specifier-seq_opt
        exception-specification_opt trailing-return-type_opt
When a restriction modifier is applied to a lambda expression, the behavior is as if all member functions of the generated
functor are restriction-modified.
or simply
float (*pf)(int) restrict(cpu);
2.2 Meaning of Restriction Specifiers
Informative: not for this release: It is possible to imagine two restriction specifiers that are intrinsically incompatible with
each other (for example, pure and elemental). When this occurs, the compiler will produce an error.
The restriction specifiers on a function become part of its signature, and thus can be used to overload.
The restriction specifiers on the declaration of a given function F must agree with those specified on the definition of function
F.
Multiple restriction specifiers may be specified for a given function: the effect is that the function enforces the union of the
restrictions defined by each restriction modifier.
Every expression (or sub-expression) that is evaluated in code that has multiple restriction specifiers must have the same
type in the context of each restriction. It is a compile-time error if an expression can evaluate to different types under the
different restriction specifiers. Function overloads should be defined with care to avoid a situation where an expression can
evaluate to different types with different restrictions.
// int_void_amp is amp-restricted
Since destructors cannot be overloaded, the destructor must contain a restriction specifier that covers the union of
restrictions on all the constructors. (A destructor can achieve the same effect of overloading by calling auxiliary cleanup
functions that have different restriction specifiers.)
For example:
class Foo {
public:
    Foo() { }
    Foo() restrict(amp) { }
    ~Foo() restrict(cpu,amp);
};
void UnrestrictedFunction() {
    Foo a; // calls Foo::Foo()
}
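The destructor rule can be modeled as a set computation. The sketch below represents a restriction set as a bitmask, which is purely an illustrative encoding (the names Restriction, kCpu, kAmp and both functions are hypothetical); it shows that the destructor's set must contain the union of the constructors' sets.

```cpp
#include <cassert>
#include <initializer_list>

// Illustrative encoding of restriction sets as bit flags; not part of C++ AMP.
enum Restriction : unsigned { kCpu = 1u << 0, kAmp = 1u << 1 };
using RestrictionSet = unsigned;

// Union of the restriction sets of all constructors: the minimum set that
// the single, non-overloadable destructor has to carry.
RestrictionSet required_destructor_restrictions(
        std::initializer_list<RestrictionSet> constructors) {
    assert(constructors.size() > 0);
    RestrictionSet required = 0;
    for (RestrictionSet r : constructors) required |= r;
    return required;
}

// True when the destructor's set covers every constructor's restrictions.
bool destructor_covers(RestrictionSet destructor,
                       std::initializer_list<RestrictionSet> constructors) {
    const RestrictionSet required = required_destructor_restrictions(constructors);
    return (destructor & required) == required;
}
```

For a class with one cpu-restricted and one amp-restricted constructor, as in Foo, a destructor declared restrict(cpu,amp) satisfies the check, while one declared with cpu alone does not.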
A virtual function declaration in a derived class will override a virtual function declaration in a base class only if the derived
class function has the same restriction specifiers as the base. E.g.:
class Base {
public:
    virtual void foo() restrict(R1);
};
class Derived : public Base {
public:
    virtual void foo() restrict(R2); // Does not override Base::foo
};
(Note that C++ AMP does not support virtual functions in the current restrict(amp) subset.)
Foo ambientVar;
auto functor = [ambientVar] (int y) restrict(amp) -> int { return y + ambientVar.z; };
is equivalent to:
Foo ambientVar;
class <lambdaName> {
public:
    <lambdaName>(const Foo& foo)
        : capturedFoo(foo)
    { }
    ~<lambdaName>() { }
    int operator()(int y) restrict(amp) { return y + capturedFoo.z; }
    const Foo& capturedFoo;
};
<lambdaName> functor(ambientVar);
2.3 Expressions Involving Restricted Functions
(Note that C++ AMP does not support function pointers in the current restrict(amp) subset.)
The restriction specifiers of a function shall not overlap with any restriction specifiers in another function within the same
overload set.
int func(int x) restrict(cpu,amp);
int func(int x) restrict(cpu); // error, overlaps with previous declaration
The target of the function call operator must resolve to an overloaded set of functions that is at least as restricted as the body
of the calling function (see Overload Resolution). E.g.:
void grod();
void glorp() restrict(amp);
void foo() restrict(amp) {
glorp(); // okay: glorp has amp restriction
grod(); // error: grod lacks amp restriction
}
The compiler must behave this way since the local usage of Grod in this case should not affect other potential uses of it in
other restricted or unrestricted scopes.
More specifically, the compiler follows the standard C++ rules, ignoring restrictions, to determine which special member
functions to generate and how to generate them. Then the restrictions are set according to the following steps:
The compiler sets the restrictions of compiler-generated destructors to the intersection of the restrictions on all of the
destructors of the data members [able to destroy all data members] and all of the base classes' destructors [able to call all
base classes' destructors]. If there are no such destructors, then all possible restrictions are used [able to destroy in any
context]. However, any restriction that would result in an error is not set.
The compiler sets the restrictions of compiler-generated default constructors to the intersection of the restrictions on all of
the default constructors of the member fields [able to construct all member fields], all of the base classes' default
constructors [able to call all base classes' default constructors], and the destructor of the class [able to destroy in any
context constructed]. However, any restriction that would result in an error is not set.
The compiler sets the restrictions of compiler-generated copy constructors to the intersection of the restrictions on all of
the copy constructors of the member fields [able to construct all member fields], all of the base classes' copy constructors
[able to call all base classes' copy constructors], and the destructor of the class [able to destroy in any context constructed].
However, any restriction that would result in an error is not set.
The compiler sets the restrictions of compiler-generated assignment operators to the intersection of the restrictions on all of
the assignment operators of the member fields [able to assign all member fields] and all of the base classes' assignment
operators [able to call all base classes' assignment operators]. However, any restriction that would result in an error is not
set.
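The intersection rules for compiler-generated special member functions can be modelled as plain set arithmetic. The following is a minimal sketch in standard host C++ (it does not use restrict() itself). The RestrictionSet bitmask, the kCpu/kAmp/kAll constants and the intersect_all helper are hypothetical names introduced only for illustration, and the sketch deliberately ignores the clause that any restriction which would result in an error is not set.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical model: one bit per restriction specifier.
using RestrictionSet = std::uint8_t;
constexpr RestrictionSet kCpu = 1u << 0;
constexpr RestrictionSet kAmp = 1u << 1;
constexpr RestrictionSet kAll = kCpu | kAmp;

// Intersect the restriction sets of every contributing member function
// (member destructors, base class destructors, and so on).
// With no contributors, all possible restrictions are used.
inline RestrictionSet intersect_all(std::initializer_list<RestrictionSet> sets) {
    RestrictionSet result = kAll;
    for (RestrictionSet s : sets) {
        result = static_cast<RestrictionSet>(result & s);
    }
    return result;
}
```

For instance, a class with one data member whose destructor is restrict(cpu,amp) and another whose destructor is restrict(amp) would, under this model, get a generated destructor with intersect_all({kAll, kAmp}), that is, amp only. A generated default constructor additionally intersects with the restrictions of the class destructor.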
2.3.2.1 Overload Resolution
Overload resolution depends on the set of restrictions (function modifiers) in force at the call site.
A call to function F is valid if and only if the overload set of F covers all the restrictions in force in the calling function. This
rule can be satisfied by a single function F that contains all the required restrictions, or by a set of overloaded functions F that
each specify a subset of the restrictions in force at the call site. For example:
void Z() restrict(amp,sse,cpu) { } // sse is illustrative only; see footnote 2
void Z_caller() restrict(amp,sse,cpu) {
    Z(); // okay; all restrictions available in a single function
}
void X() restrict(amp) { }
void X() restrict(sse) { }
void X() restrict(cpu) { }
void X_caller() restrict(amp,sse,cpu) {
    X(); // okay; all restrictions available in separate functions
}
void Y() restrict(amp) { }
void Y_caller() restrict(cpu,amp) {
    Y(); // error; no available Y() that satisfies CPU restriction
}
When a call to a restricted function is satisfied by more than one function, then the compiler must generate an
as-if-runtime-dispatch (see footnote 3) to the correctly restricted version.
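The coverage rule can be checked mechanically once restrictions are treated as sets. The sketch below is ordinary host C++, not C++ AMP syntax. The bit constants and the call_is_valid helper are hypothetical names used only to restate the Z, X and Y examples as set operations.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: one bit per restriction specifier used in the examples.
constexpr unsigned kCpu = 1u << 0;
constexpr unsigned kAmp = 1u << 1;
constexpr unsigned kSse = 1u << 2;  // illustrative only, as in the examples

// A call is valid if and only if the union of the restriction sets of all
// overloads of the callee covers every restriction in force in the caller.
inline bool call_is_valid(unsigned caller, const std::vector<unsigned>& overloads) {
    unsigned covered = 0;
    for (unsigned o : overloads) {
        covered |= o;
    }
    return (caller & ~covered) == 0;
}
```

Under this model the Z and X calls are accepted because the union of the overloads equals the caller's set, while the Y call is rejected because the cpu bit is left uncovered.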
2.3.2.2 Name Hiding
Overloading via restriction specifiers does not affect the name hiding rules. For example:
The name hiding rules in C++11 Section 3.3.10 state that within namespace N1, the global name Foo is hidden by the local
name Foo, and is not overloaded by it.
2.3.3 Casting
A restricted function type can be cast to a more restricted function type using a normal C-style cast or reinterpret_cast. (A
cast is not needed when losing restrictions, only when gaining.) For example:
void unrestricted_func(int,int);
void restricted_caller() restrict(R) {
    ((void (*)(int,int) restrict(R))unrestricted_func)(6, 7);
A program which attempts to invoke a function expression after such unsafe casting can exhibit undefined behavior.
2 Note that sse is used here for illustration only, and does not imply further meaning to it in this specification.
3 Compilers are always free to optimize this if they can determine the target statically.
2.4
The amp restriction modifier applies a relatively small set of restrictions that reflect the current limitations of GPU hardware
and the underlying programming model.
We refer to the set of supported types as being amp-compatible. Any type referenced within an amp-restricted function
shall be amp-compatible. Some uses require further restrictions.
2.4.1.1 Type Qualifiers
The volatile type qualifier is not supported within an amp-restricted function. A variable or member qualified with volatile
may not be declared or accessed in amp-restricted code.
2.4.1.2 Fundamental Types
Of the set of C++ fundamental types, only the following are supported within an amp-restricted function as amp-compatible
types:
- bool
- int, unsigned int
- long, unsigned long
- float, double
- void
The representation of these types on a device running an amp function is identical to that of its host.
2.4.1.2.1 Floating Point Types
Floating point types behave the same in amp-restricted code as they do in CPU code. C++ AMP imposes the additional
behavioural restriction that an intermediate representation of a floating point expression shall not use higher precision
than the operands demand. For example,
In the above example, the expression f1 + f2 shall not be performed using double (or higher) precision and then converted
back to float.
Microsoft-specific: This is equivalent to the Visual C++ /fp:precise mode. C++ AMP does not use higher precision for
intermediate representations of floating point expressions even when /fp:fast is specified.
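The effect of the floating point precision rule can be observed on the host. The following sketch is ordinary CPU code, not amp-restricted code, and is not the specification's own example. The two helper functions are hypothetical; volatile is used here only to force each intermediate result to be rounded to float, and it would not be permitted inside an amp-restricted function.

```cpp
#include <cassert>

// Sum three floats, rounding every intermediate result to float.
// This mirrors the behaviour the rule requires.
inline float sum_in_float(float a, float b, float c) {
    volatile float t = a + b;  // intermediate stored (and rounded) as float
    volatile float r = t + c;
    return r;
}

// Sum three floats through a double intermediate and round once at the end.
// This is the higher-precision evaluation that the rule rules out.
inline float sum_via_double(float a, float b, float c) {
    double t = static_cast<double>(a) + b + c;
    return static_cast<float>(t);
}
```

With a = 16777216.0f (2^24) and b = c = 1.0f, the float-only evaluation stays at 16777216.0f because each added 1 is lost to rounding, whereas the double intermediate yields 16777218.0f. The two strategies are therefore observably different, which is why the specification pins the behaviour down.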
2.4.1.3 Compound Types
References (lvalue and rvalue) shall refer only to amp-compatible types and/or concurrency::array and/or
concurrency::graphics::texture. Additionally, references to pointers are supported as long as the pointer type is itself
supported. Reference to std::nullptr_t is not allowed. No reference type is considered amp-compatible. References are only
supported as local variables and/or function parameters and/or function return types.
concurrency::array_view and concurrency::graphics::writeonly_texture_view are amp-compatible types.
A class type (class, struct, union) is amp-compatible if
- it contains only data members whose types are amp-compatible, except for references to instances of classes
  array and texture, and
- the offsets of its data members and base classes are at least four-byte aligned, and
- its data members shall not be bitfields, and
- it shall not have virtual base classes or virtual member functions, and
- all of its base classes are amp-compatible.
The element type of an array shall be amp-compatible and four byte aligned.
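Some of the class-type rules can be approximated with host-side compile-time checks. The sketch below is standard C++ and only a partial approximation: std::is_polymorphic detects virtual member functions but not virtual base classes, and bitfields and member types are not inspected at all. The types Pixel, Packed and WithVirtual and both helper functions are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical candidate types.
struct Pixel { int r; int g; int b; int a; };      // every member at a 4-byte offset
struct Packed { int a; char c; char d; };          // d sits at offset 5
struct WithVirtual { virtual void f() {} int x; }; // has a virtual member function

// True if T has (or inherits) a virtual member function.
template <typename T>
constexpr bool has_virtual_functions() { return std::is_polymorphic<T>::value; }

// True if a data member offset is a multiple of four bytes.
constexpr bool is_four_byte_aligned(std::size_t offset) { return offset % 4 == 0; }

static_assert(!has_virtual_functions<Pixel>(), "Pixel has no virtual member functions");
static_assert(is_four_byte_aligned(offsetof(Pixel, g)), "Pixel::g is four-byte aligned");
```

A type such as Packed fails the offset check for its member d, and WithVirtual fails the virtual function check, so neither would pass this approximation of the rules.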
Pointers to members (C++11 8.3.3) shall only refer to non-static data members.
Enumeration types shall have underlying types consisting of int, unsigned int, long, or unsigned long.
The representation of an amp-compatible compound type (with the exception of pointer & reference) on a device is identical
to that of its host.
2.4.3.1 Literals
A C++ AMP program is ill-formed if the value of an integer constant or floating point constant exceeds the allowable range of
any of the above types.
2.4.3.2
An identifier or qualified identifier that refers to an object shall refer only to:
- a parameter to the function, or
- a local variable declared at a block scope within the function, or
- a non-static member of the class of which this function is a member, or
- a static const type that can be reduced to an integer literal and is only used as an rvalue, or
- a global const type that can be reduced to an integer literal and is only used as an rvalue, or
- a captured variable in a lambda expression.
2.4.3.3 Lambda Expressions
If a lambda expression appears within the body of an amp-restricted function, the amp modifier may be elided and the lambda
is still considered an amp lambda.
A lambda expression shall not capture any context variable by reference, except for context variables of type
concurrency::array and concurrency::graphics::texture.
The effective closure type must be amp-compatible.
2.4.3.4 Function Calls (C++11 5.2.2)
2.4.3.5
Local declarations shall not specify any storage class other than register or tile_static. Variables that are not tile_static shall
have types that are amp-compatible, pointers to amp-compatible types, or references to amp-compatible types.
2.4.3.5.1 tile_static Variables
A variable declared with the tile_static storage class can be accessed by all threads within a tile (group of threads). (The
tile_static storage class is valid only within a restrict(amp) context.) The storage lifetime of a tile_static variable begins when
the execution of a thread in a tile reaches the point of declaration, and ends when the kernel function is exited by the last
thread in the tile. Each thread tile accessing the variable shall perceive a separate, per-tile instance of the variable.
A tile_static variable declaration does not constitute a barrier (see 8.1.1). tile_static variables are not initialized by the
compiler and assume no default initial values.
The tile_static storage class shall only be used to declare local (function or block scope) variables.
The type of a tile_static variable or array must be amp-compatible and shall not directly or recursively contain any
concurrency containers (e.g. concurrency::array_view) or references to concurrency containers.
A tile_static variable shall not have an initializer and no constructors or destructors will be called for it; its initial contents are
undefined.
Microsoft-specific: The Microsoft implementation of C++ AMP restricts the total size of tile_static memory to 32K.
2.4.3.6 Type-Casting Restrictions
A type-cast shall not be used to convert a pointer to an integral type, nor an integral type to a pointer. This restriction applies
to reinterpret_cast (C++11 5.2.10) as well as to C-style casts (C++11 5.4).
Casting away const-ness may result in a compiler warning and/or undefined behavior.
2.4.3.7 Miscellaneous Restrictions
The pointer-to-member operators .* and ->* shall only be used to access pointer-to-data member objects.
Device Modeling
3.1 The concept of a compute accelerator
A compute accelerator is a hardware capability that is optimized for data-parallel computing. An accelerator may be a device
attached to a PCIe bus (such as a GPU), a device integrated on the same die as the CPU, or it might be an extended instruction
set on the main CPU (such as SSE or AVX).
Informative: Some architectures might bridge these two extremes, such as AMD's Fusion or Intel's Knights Ferry.
In the C++ AMP model, an accelerator may have private memory which is not generally accessible by the host. C++ AMP
allows data to be allocated in the accelerator memory and references to this data may be manipulated on the host. It is
assumed that all data accessed within a kernel must be stored in accelerator memory, although some C++ AMP scenarios will
implicitly make copies of data logically stored on the host.
C++ AMP has functionality for copying data between host and accelerator memories. A copy from accelerator to host is
always a synchronization point, unless an explicit asynchronous copy is specified. In general, for optimal performance,
memory content should stay on an accelerator as long as possible.
In some cases, accelerator memory and CPU memory are one and the same. Depending upon the architecture, there
may never be any need to copy between the two physical locations of memory. C++ AMP provides for coding patterns that
allow the C++ AMP runtime to avoid or perform copies as required.
3.2 accelerator
An accelerator is an abstraction of a physical data-parallel-optimized compute node. An accelerator is often a GPU, but can
also be a virtual host-side entity such as the Microsoft DirectX REF device, or WARP (a CPU-side device accelerated using SSE
instructions), or can refer to the CPU itself.
A user may explicitly create a default accelerator object in one of two ways:
1.
accelerator def;
2.
The user may also influence which accelerator is chosen as the default by calling accelerator::set_default prior to invoking
any operation which would otherwise choose the default. Such operations include invoking parallel_for_each without an
explicit accelerator_view argument, or creating an array not bound to an explicit accelerator_view, etc. Note that obtaining
the default accelerator does not fix the default; this allows users to determine what the runtime's choice would be before
attempting to override it.
If the user does not call accelerator::set_default, the default is chosen in an implementation-specific manner.
Microsoft-specific:
The Microsoft implementation of C++ AMP uses the following heuristic to select a default accelerator when one is not
specified by a call to accelerator::set_default:
1. If using the debug runtime, prefer an accelerator that supports debugging.
2. If the process environment variable CPPAMP_DEFAULT_ACCELERATOR is set, interpret its value as a device path
and prefer the device that corresponds to it.
3. Otherwise, the following criteria are used to determine the best accelerator:
a. Prefer non-emulated devices. Among multiple non-emulated devices:
i. Prefer the device with the most available memory.
ii. Prefer the device which is not attached to the display.
b. Among emulated devices, prefer accelerated devices such as WARP over the REF device.
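Step 3 of the heuristic above can be sketched as a ranking over candidate devices. The following is a host-side model in standard C++, not the runtime's actual implementation. The Device struct, its is_warp flag, and the rank and pick_default helpers are hypothetical. The sketch also assumes that "most available memory" takes priority over "not attached to the display", which the text does not state explicitly, and it leaves out steps 1 and 2 (debug runtime and the CPPAMP_DEFAULT_ACCELERATOR environment variable).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the properties the heuristic consults.
struct Device {
    std::wstring path;
    bool is_emulated;
    bool is_warp;                  // assumption: tells WARP apart from REF
    bool has_display;
    std::size_t dedicated_memory;  // in KB
};

// Ranking key for step 3; a larger tuple means a more preferred device.
inline std::tuple<bool, std::size_t, bool, bool> rank(const Device& d) {
    if (!d.is_emulated) {
        // 3a: non-emulated first, then most memory, then no display attached.
        return std::make_tuple(true, d.dedicated_memory, !d.has_display, false);
    }
    // 3b: among emulated devices, prefer WARP over REF.
    return std::make_tuple(false, std::size_t{0}, false, d.is_warp);
}

// Returns the preferred device, or nullptr if there are no candidates.
inline const Device* pick_default(const std::vector<Device>& devices) {
    auto it = std::max_element(devices.begin(), devices.end(),
        [](const Device& a, const Device& b) { return rank(a) < rank(b); });
    return it == devices.end() ? nullptr : &*it;
}
```

The candidate list passed to pick_default is expected to exclude the CPU accelerator, since it is never considered by the heuristic.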
Note that the cpu_accelerator is never considered among the candidates in the above heuristic.
3.2.2 Synopsis
class accelerator
{
public:
    static const wchar_t default_accelerator[]; // = L"default"
    // Microsoft-specific:
    static const wchar_t direct3d_warp[];       // = L"direct3d\\warp"
    static const wchar_t direct3d_ref[];        // = L"direct3d\\ref"
    static const wchar_t cpu_accelerator[];     // = L"cpu"
    accelerator();
    explicit accelerator(const wstring& path);
    accelerator(const accelerator& other);
    static vector<accelerator> get_all();
    static bool set_default(const wstring& path);
    accelerator& operator=(const accelerator& other);
    __declspec(property(get)) wstring device_path;
    __declspec(property(get)) unsigned int version; // hiword=major, loword=minor
    __declspec(property(get)) wstring description;
    __declspec(property(get)) bool is_debug;
    __declspec(property(get)) bool is_emulated;
    __declspec(property(get)) bool has_display;
    __declspec(property(get)) bool supports_double_precision;
    __declspec(property(get)) bool supports_limited_double_precision;
    __declspec(property(get)) size_t dedicated_memory;
    __declspec(property(get)) accelerator_view default_view;
    accelerator_view create_view();
    accelerator_view create_view(queuing_mode qmode);
    bool operator==(const accelerator& other) const;
    bool operator!=(const accelerator& other) const;
};
class accelerator
Represents a physical accelerated computing device. An object of this type can be created by enumerating the available
devices, or getting the default device, the reference device, or the WARP device.
Microsoft-specific:
The WARP device may not be available on all platforms, not even all Microsoft platforms.
3.2.3 Static Members
static bool set_default(const wstring& path);
Sets the default accelerator to the device path identified by the path argument. See the constructor accelerator(const
wstring& path) for a description of the allowable path strings.
This establishes a process-wide default accelerator and influences all subsequent operations that might use a default
accelerator.
Parameters
path
The device path of the default accelerator.
Return Value:
A Boolean flag indicating whether the default was set. If the default has already been set for this process, this value will
be false, and the function will have no effect.
3.2.4 Constructors
accelerator()
Constructs a new accelerator object that represents the default accelerator. This is equivalent to calling the constructor
accelerator(accelerator::default_accelerator).
The actual accelerator chosen as the default can be affected by calling accelerator::set_default.
Parameters:
None.
accelerator(const wstring& path)
Constructs a new accelerator object that represents the physical device named by the path argument. If the path
represents an unknown or unsupported device, an exception will be thrown.
The path can be one of the following:
1. accelerator::default_accelerator (or L"default"), which represents the path of the fastest accelerator available,
as chosen by the runtime.
2. accelerator::cpu_accelerator (or L"cpu"), which represents the CPU. Note that parallel_for_each shall not be
invoked over this accelerator.
3. A valid device path that uniquely identifies a hardware accelerator available on the host system.
Microsoft-specific:
4. accelerator::direct3d_warp (or L"direct3d\\warp"), which represents the WARP accelerator.
5. accelerator::direct3d_ref (or L"direct3d\\ref"), which represents the REF accelerator.
Parameters:
path
accelerator(const accelerator& other);
Copy constructs an accelerator object. This function does a shallow copy with the newly created accelerator object
pointing to the same underlying device as the passed accelerator parameter.
Parameters:
other
3.2.5 Members
static const wchar_t default_accelerator[]
static const wchar_t direct3d_warp[]
static const wchar_t direct3d_ref[]
static const wchar_t cpu_accelerator[]
These are static constant string literals that represent device paths for known accelerators, or in the case of
default_accelerator, direct the runtime to choose an accelerator automatically.
default_accelerator: The string L"default" represents the default accelerator, which directs the runtime to choose the
fastest accelerator available. The selection criteria are discussed in section 3.2.1 Default Accelerator.
cpu_accelerator: The string L"cpu" represents the host system. This accelerator is used to provide a location for
system-allocated memory such as host arrays and staging arrays. It is not a valid target for accelerated computations.
Microsoft-specific:
direct3d_warp: The string L"direct3d\\warp" represents the device path of the CPU-accelerated WARP device. On other
non-direct3d platforms, this member may not exist.
direct3d_ref: The string L"direct3d\\ref" represents the software rasterizer, or Reference, device. This particular device is
useful for debugging. On other non-direct3d platforms, this member may not exist.
accelerator& operator=(const accelerator& other)
Assigns an accelerator object to this accelerator object and returns a reference to this object. This function does a
shallow assignment, after which this accelerator object points to the same underlying device as the passed
accelerator parameter.
Parameters:
other
Return Value:
A reference to this accelerator object.
__declspec(property(get)) accelerator_view default_view
Returns the default accelerator view associated with the accelerator. The queuing_mode of the default accelerator_view
is queuing_mode_automatic.
Return Value:
The default accelerator_view object associated with the accelerator.
accelerator_view create_view(queuing_mode qmode)
Creates and returns a new accelerator view on the accelerator with the supplied queuing mode.
Return Value:
The new accelerator_view object created on the compute device.
Parameters:
qmode
accelerator_view create_view()
Creates and returns a new resource view on the accelerator. Equivalent to create_view(queuing_mode_automatic).
Return Value:
The new accelerator_view object created on the compute device.
bool operator==(const accelerator& other) const
Compares this accelerator with the passed accelerator object to determine if they represent the same underlying
device.
Parameters:
other
Return Value:
A boolean value indicating whether the passed accelerator object is the same as this accelerator.
bool operator!=(const accelerator& other) const
Compares this accelerator with the passed accelerator object to determine if they represent different devices.
Parameters:
other
Return Value:
A boolean value indicating whether the passed accelerator object is different from this accelerator.
3.2.6 Properties
The following read-only properties are part of the public interface of the class accelerator, to enable querying the
accelerator characteristics:
__declspec(property(get)) wstring description
Returns a short textual description of the accelerator device.
__declspec(property(get)) unsigned int version
Returns a 32-bit unsigned integer representing the version number of this accelerator. The format of the integer is
major.minor, where the major version number is in the high-order 16 bits, and the minor version number is in the
low-order 16 bits.
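The packed major.minor layout of the version value can be unpacked with a shift and a mask. The helpers below are a small sketch in standard C++; the function names are hypothetical and are not part of the accelerator interface.

```cpp
#include <cassert>
#include <cstdint>

// Major version: the high-order 16 bits of the packed value.
inline unsigned major_version(std::uint32_t version) {
    return version >> 16;
}

// Minor version: the low-order 16 bits of the packed value.
inline unsigned minor_version(std::uint32_t version) {
    return version & 0xFFFFu;
}

// Packs a major/minor pair into the same layout (each truncated to 16 bits).
inline std::uint32_t make_version(unsigned major, unsigned minor) {
    return (static_cast<std::uint32_t>(major & 0xFFFFu) << 16) |
           static_cast<std::uint32_t>(minor & 0xFFFFu);
}
```

For example, a version value of 0x000B0001 decodes to major 11 and minor 1.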
__declspec(property(get)) bool has_display
This property indicates that the accelerator may be shared by (and thus have interference from) the operating system or
other system software components for rendering purposes. A C++ AMP implementation may set this property to false
should such interference not be applicable for a particular accelerator.
__declspec(property(get)) size_t dedicated_memory
Returns the amount of dedicated memory (in KB) on an accelerator device. There is no guarantee that this amount of
memory is actually available to use.
__declspec(property(get)) bool supports_double_precision
Returns a Boolean value indicating whether this accelerator supports double-precision (double) computations. When this
returns true, supports_limited_double_precision also returns true.
__declspec(property(get)) bool supports_limited_double_precision
Returns a boolean value indicating whether the accelerator has limited double precision support (excludes double
division, precise_math functions, int to double, double to int conversions) for a parallel_for_each kernel.
__declspec(property(get)) bool is_debug
Returns a boolean value indicating whether the accelerator supports debugging.
__declspec(property(get)) bool is_emulated
Returns a boolean value indicating whether the accelerator is emulated. This is true, for example, with the reference,
WARP, and CPU accelerators.
3.3 accelerator_view
An accelerator_view represents a logical view of an accelerator. A single physical compute device may have many logical
(isolated) accelerator views. Each accelerator has a default accelerator view and additional accelerator views may be
optionally created by the user. Physical devices must potentially be shared amongst many client threads. Client threads may
choose to use the same accelerator_view of an accelerator or each client may communicate with a compute device via an
independent accelerator_view object for isolation from other client threads. Work submitted to an accelerator_view is
guaranteed to be executed in the order that it was submitted; there are no such ordering guarantees for work submitted on
different accelerator_views.
An accelerator_view can be created with a queuing mode of immediate or automatic. (See Queuing Mode).
3.3.1 Synopsis
class accelerator_view
{
public:
    accelerator_view() = delete;
    accelerator_view(const accelerator_view& other);
    accelerator_view& operator=(const accelerator_view& other);
    __declspec(property(get)) Concurrency::accelerator accelerator;
    __declspec(property(get)) bool is_debug;
    __declspec(property(get)) unsigned int version;
    __declspec(property(get)) queuing_mode queuing_mode;
    void flush();
    void wait();
    completion_future create_marker();
    bool operator==(const accelerator_view& other) const;
    bool operator!=(const accelerator_view& other) const;
};
class accelerator_view
Represents a logical (isolated) accelerator view of a compute accelerator. An object of this type can be obtained by
calling the default_view property or create_view member functions on an accelerator object.
3.3.2 Queuing Mode
If the queuing mode is queuing_mode_immediate, then any commands (such as copy or parallel_for_each) are sent to the
corresponding accelerator before control is returned to the caller.
If the queuing mode is queuing_mode_automatic, then such commands are queued up on a command queue corresponding
to this accelerator_view. There are three events that can cause queued commands to be submitted:
- Copying the contents of an array to the host or another accelerator_view results in all previous commands
  referencing that array resource (including the copy command itself) being submitted for execution on the hardware.
- Calling the accelerator_view::flush or accelerator_view::wait methods.
- The IHV device driver may internally use a heuristic to determine when commands are submitted to the hardware
  for execution, for example when resource limits would be exceeded without otherwise flushing the queue.
3.3.3 Constructors
An accelerator_view object may only be constructed using a copy or move constructor. There is no default constructor.
accelerator_view(const accelerator_view& other)
Copy-constructs an accelerator_view object. This function does a shallow copy with the newly created accelerator_view
object pointing to the same underlying view as the other parameter.
Parameters:
other
3.3.4 Members
Return Value:
A reference to this accelerator_view object.
__declspec(property(get)) queuing_mode queuing_mode
Returns the queuing mode that this accelerator_view was created with. See Queuing Mode.
Return Value:
The queuing mode.
__declspec(property(get)) unsigned int version
Returns a 32-bit unsigned integer representing the version number of this accelerator view. The format of the integer is
major.minor, where the major version number is in the high-order 16 bits, and the minor version number is in the
low-order 16 bits.
The version of the accelerator view is usually the same as that of the parent accelerator.
Microsoft-specific: The version may differ from the accelerator only when the accelerator_view is created from a direct3d
device using the interop API.
__declspec(property(get)) Concurrency::accelerator accelerator
Returns the accelerator that this accelerator_view has been created on.
__declspec(property(get)) bool is_debug
Returns a boolean value indicating whether the accelerator_view supports debugging through extensive error reporting.
The is_debug property of the accelerator view is usually the same as that of the parent accelerator.
Microsoft-specific: The is_debug value may differ from the accelerator only when the accelerator_view is created from a
direct3d device using the interop API.
void wait()
Performs a blocking wait for completion of all commands submitted to the accelerator view prior to calling wait.
Return Value:
None
void flush()
Sends the queued up commands in the accelerator_view to the device for execution.
An accelerator_view internally maintains a buffer of commands such as data transfers between the host memory and
device buffers, and kernel invocations (parallel_for_each calls). This member function sends the commands to the
device for processing. Normally, these commands are sent to the GPU automatically whenever the runtime determines
that they need to be, such as when the command buffer is full or when waiting for transfer of data from the device
buffers to host memory. The flush member function will send the commands manually to the device.
Calling this member function incurs an overhead and must be used with discretion. A typical use of this member function
would be when the CPU waits for an arbitrary amount of time and would like to force the execution of queued device
commands in the meantime. It can also be used to ensure that resources on the accelerator are reclaimed after all
references to them have been removed.
Because flush operates asynchronously, it can return either before or after the device finishes executing the buffered
commands. However, the commands will eventually always complete.
If the queuing_mode is queuing_mode_immediate, this function does nothing.
Return Value:
None
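The difference between the two queuing modes, and the role of flush, can be modelled with a toy command queue. The sketch below is a single-threaded host-side model in standard C++ and makes simplifying assumptions: ToyView, ToyQueuingMode, submit and pending are hypothetical names, commands run synchronously inside flush (the real flush is asynchronous), and the other submission triggers (copies to the host, driver heuristics) are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

enum class ToyQueuingMode { immediate, automatic };

// Hypothetical model of an accelerator_view command queue.
class ToyView {
public:
    explicit ToyView(ToyQueuingMode mode) : mode_(mode) {}

    // Immediate mode: run the command before returning to the caller.
    // Automatic mode: append the command to the pending queue.
    void submit(std::function<void()> command) {
        if (mode_ == ToyQueuingMode::immediate) {
            command();
        } else {
            pending_.push_back(std::move(command));
        }
    }

    // Send all queued commands to the "device" in submission order.
    // In immediate mode the queue is always empty, so this does nothing.
    void flush() {
        std::vector<std::function<void()>> batch;
        batch.swap(pending_);
        for (auto& command : batch) {
            command();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    ToyQueuingMode mode_;
    std::vector<std::function<void()>> pending_;
};
```

In this model a command submitted to an automatic view has no visible effect until flush is called, while the same command submitted to an immediate view takes effect before submit returns.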
completion_future create_marker()
This command inserts a marker event into the accelerator_view's command queue. This marker is returned as a
completion_future object. When all commands that were submitted prior to the marker event creation have
completed, the future is ready.
Return Value:
A future which can be waited on, and will block until the current batch of commands has completed.
bool operator==(const accelerator_view& other) const
Compares this accelerator_view with the passed accelerator_view object to determine if they represent the same
underlying object.
Parameters:
other
Return Value:
A boolean value indicating whether the passed accelerator_view object is the same as this accelerator_view.
bool operator!=(const accelerator_view& other) const
Compares this accelerator_view with the passed accelerator_view object to determine if they represent different
underlying objects.
Parameters:
other
Return Value:
A boolean value indicating whether the passed accelerator_view object is different from this accelerator_view.
3.4
The physical compute devices can be enumerated or selected by calling the following static member function of the class
accelerator.
3.4.1 Synopsis
vector<accelerator> accelerator::get_all();
As an example, if one wants to find an accelerator that is not emulated and is not attached to a display, one could do the
following:
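A minimal sketch of such a search, written in standard C++ so that it compiles without amp.h. The struct device_info is a hypothetical stand-in for concurrency::accelerator, and its is_emulated and has_display fields are assumed to mirror the accelerator properties of the same names; with the real API the candidate list would come from accelerator::get_all().

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for concurrency::accelerator (not the real class).
struct device_info {
    std::string description;
    bool is_emulated;   // true for software-emulated devices
    bool has_display;   // true if a display is attached to the device
};

// Returns an iterator to the first device that is neither emulated nor
// attached to a display, or devices.end() if no such device exists.
inline std::vector<device_info>::const_iterator
find_headless(const std::vector<device_info>& devices) {
    return std::find_if(devices.begin(), devices.end(),
        [](const device_info& d) { return !d.is_emulated && !d.has_display; });
}
```

The caller has to compare the result against end() before dereferencing it, since a system may have no device that satisfies both conditions.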
C++ AMP enables programmers to express solutions to data-parallel problems in terms of N-dimensional data aggregates and
operations over them.
Fundamental to C++ AMP is the concept of an array. An array associates values in an index space with an element type. For
example, an array could be the set of pixels on a screen where each pixel is represented by four 32-bit values: Red, Green,
Blue and Alpha. The index space would then be the screen resolution, for example all points:
{ {y, x} | 0 <= y < 1200, 0 <= x < 1600, x and y are integers }.
4.1 index<N>
Defines an N-dimensional index point, which may also be viewed as a vector based at the origin in N-space.
The index<N> type represents an N-dimensional vector of int which specifies a unique position in an N-dimensional space.
The dimensions in the coordinate vector are ordered from most-significant to least-significant. Thus, in Cartesian 3-dimensional space, where a common convention exists that the Z dimension (plane) is most significant, the Y dimension (row)
is second in significance and the X dimension (column) is the least significant, the index vector (2,0,4) represents the position
at (Z=2, Y=0, X=4).
The position is relative to the origin in the N-dimensional space, and can contain negative component values.
Informative: As a scoping decision, it was decided to limit specializations of index, extent, etc. to 1, 2, and 3 dimensions. This
also applies to arrays and array_views. General N-dimensional support is still provided with slightly reduced convenience.
4.1.1 Synopsis
template <int N>
friend bool operator==(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend bool operator!=(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator+(const index<N>& lhs,
                          const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator-(const index<N>& lhs,
                          const index<N>& rhs) restrict(amp,cpu);
index& operator+=(const index& rhs) restrict(amp,cpu);
index& operator-=(const index& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator+(const index<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator+(int lhs, const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator-(const index<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator-(int lhs, const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator*(const index<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator*(int lhs, const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator/(const index<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator/(int lhs, const index<N>& rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator%(const index<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend index<N> operator%(int lhs, const index<N>& rhs) restrict(amp,cpu);
index& operator+=(int rhs) restrict(amp,cpu);
index& operator-=(int rhs) restrict(amp,cpu);
index& operator*=(int rhs) restrict(amp,cpu);
index& operator/=(int rhs) restrict(amp,cpu);
index& operator%=(int rhs) restrict(amp,cpu);
static const int rank = N
A static member of index<N> that contains the rank of this index.
typedef int value_type;
The element type of index<N>.
4.1.2 Constructors
index() restrict(amp,cpu)
Default constructor. The value at each dimension is initialized to zero. Thus, index<3> ix; initializes the variable to
the position (0,0,0).
index(const index& other) restrict(amp,cpu)
Copy constructor. Constructs a new index<N> from the supplied argument other.
Parameters:
other
explicit index(int i0) restrict(amp,cpu) // N==1
index(int i0, int i1) restrict(amp,cpu) // N==2
index(int i0, int i1, int i2) restrict(amp,cpu) // N==3
Constructs an index<N> with the coordinate values provided by i0 through i2. These are specialized constructors that are only
valid when the rank N of the index is 1, 2, or 3. Invoking a specialized constructor whose argument count differs from N will result
in a compilation error.
Parameters:
i0 [, i1 [, i2 ] ]
The component values of the index vector.
explicit index(const int components[]) restrict(amp,cpu)
Constructs an index<N> with the coordinate values provided by the array of int component values. If the length of the coordinate
array is not N, the behavior is undefined. If the array value is NULL or not a valid pointer, the behavior is undefined.
Parameters:
components
4.1.3 Members
index& operator=(const index& other) restrict(amp,cpu)
Assigns the component values of other to this index<N> object.
Parameters:
other
An object of type index<N> from which to copy into this index.
Return Value:
Returns *this.
int operator[](unsigned int c) const restrict(amp,cpu)
int& operator[](unsigned int c) restrict(amp,cpu)
Returns the index component value at position c.
Parameters:
c
The dimension axis whose coordinate is to be accessed.
Return Value:
The component value at position c.
4.1.4 Operators
template <int N>
friend bool operator==(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu)
template <int N>
friend bool operator!=(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu)
template <int N>
friend index<N> operator+(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu)
template <int N>
friend index<N> operator-(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu)
Binary arithmetic operations that produce a new index<N> that is the result of performing the corresponding pair-wise
binary arithmetic operation on the elements of the operands. The result index<N> is such that for a given operator ⊕,
result[i] = lhs[i] ⊕ rhs[i]
for every i from 0 to N-1.
Parameters:
lhs
The left-hand index<N> of the arithmetic operation.
rhs
The right-hand index<N> of the arithmetic operation.
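The pair-wise rule above can be illustrated with a small model in standard C++. The type index_model is a hypothetical stand-in, not the C++ AMP index<N> class; it omits the restrict(amp,cpu) specifiers and only shows how each result component is computed from the matching operand components.

```cpp
#include <array>
#include <cassert>

// Illustrative model of index<N>: N int components, most-significant first.
// This is a sketch, not the C++ AMP class.
template <int N>
struct index_model {
    std::array<int, N> c;

    int  operator[](unsigned int i) const { return c[i]; }
    int& operator[](unsigned int i)       { return c[i]; }
};

// result[i] = lhs[i] + rhs[i] for every i from 0 to N-1
template <int N>
index_model<N> operator+(const index_model<N>& lhs, const index_model<N>& rhs) {
    index_model<N> r{};
    for (int i = 0; i < N; ++i) r[i] = lhs[i] + rhs[i];
    return r;
}

// result[i] = lhs[i] - rhs[i] for every i from 0 to N-1
template <int N>
index_model<N> operator-(const index_model<N>& lhs, const index_model<N>& rhs) {
    index_model<N> r{};
    for (int i = 0; i < N; ++i) r[i] = lhs[i] - rhs[i];
    return r;
}

// Two indices are equal when all components are equal.
template <int N>
bool operator==(const index_model<N>& lhs, const index_model<N>& rhs) {
    return lhs.c == rhs.c;
}
```

Components may be negative, so subtracting a larger index from a smaller one is well defined in this model, in line with the statement that an index can contain negative component values.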
index& operator+=(const index& rhs) restrict(amp,cpu)
index& operator-=(const index& rhs) restrict(amp,cpu)
For a given operator ⊕, produces the same effect as
(*this) = (*this) ⊕ rhs;
The return value is *this.
Parameters:
rhs
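template <int N>
friend index<N> operator+(const index<N>& idx, int value) restrict(amp,cpu)
template <int N>
friend index<N> operator+(int value, const index<N>& idx) restrict(amp,cpu)
template <int N>
friend index<N> operator-(const index<N>& idx, int value) restrict(amp,cpu)
template <int N>
friend index<N> operator-(int value, const index<N>& idx) restrict(amp,cpu)
template <int N>
friend index<N> operator*(const index<N>& idx, int value) restrict(amp,cpu)
template <int N>
friend index<N> operator*(int value, const index<N>& idx) restrict(amp,cpu)
template <int N>
friend index<N> operator/(const index<N>& idx, int value) restrict(amp,cpu)
template <int N>
friend index<N> operator/(int value, const index<N>& idx) restrict(amp,cpu)
template <int N>
friend index<N> operator%(const index<N>& idx, int value) restrict(amp,cpu)
template <int N>
friend index<N> operator%(int value, const index<N>& idx) restrict(amp,cpu)
index& operator+=(int value) restrict(amp,cpu)
index& operator-=(int value) restrict(amp,cpu)
index& operator*=(int value) restrict(amp,cpu)
index& operator/=(int value) restrict(amp,cpu)
index& operator%=(int value) restrict(amp,cpu)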
index& operator++() restrict(amp,cpu)
index operator++(int) restrict(amp,cpu)
index& operator--() restrict(amp,cpu)
index operator--(int) restrict(amp,cpu)
4.2 extent<N>
The extent<N> type represents an N-dimensional vector of int which specifies the bounds of an N-dimensional space with an
origin of 0. The values in the coordinate vector are ordered from most-significant to least-significant. Thus, in Cartesian 3-dimensional space, where a common convention exists that the Z dimension (plane) is most significant, the Y dimension (row)
is second in significance and the X dimension (column) is the least significant, the extent vector (7,5,3) represents a space
where the Z coordinate ranges from 0 to 6, the Y coordinate ranges from 0 to 4, and the X coordinate ranges from 0 to 2.
4.2.1 Synopsis
extent& operator+=(const
extent& operator-=(const
extent& operator+=(const
extent& operator-=(const
template <int N>
friend bool operator==(const extent<N>& lhs, const extent<N>& rhs) restrict(amp,cpu);
template <int N>
friend bool operator!=(const extent<N>& lhs, const extent<N>& rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator+(const extent<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator+(int lhs, const extent<N>& rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator-(const extent<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator-(int lhs, const extent<N>& rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator*(const extent<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator*(int lhs, const extent<N>& rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator/(const extent<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator/(int lhs, const extent<N>& rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator%(const extent<N>& lhs, int rhs) restrict(amp,cpu);
template <int N>
friend extent<N> operator%(int lhs, const extent<N>& rhs) restrict(amp,cpu);
static const int rank = N
A static member of extent<N> that contains the rank of this extent.
typedef int value_type;
The element type of extent<N>.
4.2.2 Constructors
extent() restrict(amp,cpu);
Default constructor. The value at each dimension is initialized to zero. Thus, extent<3> ext; initializes the variable to
the extent (0,0,0).
Parameters:
None.
extent(const extent& other) restrict(amp,cpu)
Copy constructor. Constructs a new extent<N> from the supplied argument other.
Parameters:
other
explicit extent(int e0) restrict(amp,cpu) // N==1
extent(int e0, int e1) restrict(amp,cpu) // N==2
extent(int e0, int e1, int e2) restrict(amp,cpu) // N==3
Constructs an extent<N> with the coordinate values provided by e0 through e2. These are specialized constructors that are only
valid when the rank N of the extent is 1, 2, or 3. Invoking a specialized constructor whose argument count differs from N will result
in a compilation error.
Parameters:
e0 [, e1 [, e2 ] ]
The component values of the extent vector.
explicit extent(const int components[]) restrict(amp,cpu);
Constructs an extent<N> with the coordinate values provided by the array of int component values. If the length of the coordinate
array is not N, the behavior is undefined. If the array value is NULL or not a valid pointer, the behavior is undefined.
Parameters:
components
An array of N int values.
4.2.3 Members
int operator[](unsigned int c) const restrict(amp,cpu)
int& operator[](unsigned int c) restrict(amp,cpu)
Returns the extent component value at position c.
Parameters:
c
The dimension axis whose coordinate is to be accessed.
Return Value:
The component value at position c.
bool contains(const index<N>& idx) const restrict(amp,cpu)
Tests whether the index idx is properly contained within this extent (with an assumed origin of zero).
Parameters:
idx
An object of type index<N>
Return Value:
Returns true if the idx is contained within the space defined by this extent (with an assumed origin of zero).
unsigned int size() const restrict(amp,cpu)
This member function returns the total linear size of this extent<N> (in units of elements), which is computed as:
extent[0] * extent[1] * ... * extent[N-1]
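As a sketch of the containment test and the size computation described above, the following standard C++ helpers operate on plain std::array values. The names extent_size and extent_contains are hypothetical, and the containment rule 0 <= idx[i] < ext[i] is assumed from the statement that the extent has an origin of zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Total linear size: the product of all N extent components.
template <std::size_t N>
unsigned int extent_size(const std::array<int, N>& ext) {
    unsigned int total = 1;
    for (std::size_t i = 0; i < N; ++i)
        total *= static_cast<unsigned int>(ext[i]);
    return total;
}

// True when 0 <= idx[i] < ext[i] holds for every dimension i.
template <std::size_t N>
bool extent_contains(const std::array<int, N>& ext,
                     const std::array<int, N>& idx) {
    for (std::size_t i = 0; i < N; ++i)
        if (idx[i] < 0 || idx[i] >= ext[i]) return false;
    return true;
}
```

For the extent (7,5,3) used earlier in this section, the size is 7 * 5 * 3 = 105 elements, and the last contained index is (6,4,2).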
template <int D0>
tiled_extent<D0> tile() const restrict(amp,cpu)
template <int D0, int D1>
tiled_extent<D0,D1> tile() const restrict(amp,cpu)
template <int D0, int D1, int D2>
tiled_extent<D0,D1,D2> tile() const restrict(amp,cpu)
Produces a tiled_extent object with the tile extents given by D0, D1, and D2.
tile<D0,D1,D2>() is only supported on extent<3>. It will produce a compile-time error if used on an extent where N is not 3.
tile<D0,D1>() is only supported on extent<2>. It will produce a compile-time error if used on an extent where N is not 2.
tile<D0>() is only supported on extent<1>. It will produce a compile-time error if used on an extent where N is not 1.
4.2.4 Operators
template <int N>
friend bool operator==(const extent<N>& lhs, const extent<N>& rhs) restrict(amp,cpu)
template <int N>
friend bool operator!=(const extent<N>& lhs, const extent<N>& rhs) restrict(amp,cpu)
extent<N> operator+(const index<N>& idx) restrict(amp,cpu)
extent<N> operator-(const index<N>& idx) restrict(amp,cpu)
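Adds (or subtracts) an object of type index<N> from this extent to form a new extent. The result extent<N> is such that
for a given operator ⊕,
result[i] = (*this)[i] ⊕ idx[i]
for every i from 0 to N-1.
template <int N>
friend extent<N> operator+(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N>
friend extent<N> operator+(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N>
friend extent<N> operator-(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N>
friend extent<N> operator-(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N>
friend extent<N> operator*(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N>
friend extent<N> operator*(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N>
friend extent<N> operator/(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N>
friend extent<N> operator/(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N>
friend extent<N> operator%(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N>
friend extent<N> operator%(int value, const extent<N>& ext) restrict(amp,cpu)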
Binary arithmetic operations that produce a new extent<N> that is the result of performing the corresponding binary
arithmetic operation on the elements of the extent operands. The result extent<N> is such that for a given operator ⊕,
result[i] = ext[i] ⊕ value
or
result[i] = value ⊕ ext[i]
for every i from 0 to N-1.
Parameters:
ext
value
extent& operator+=(int value) restrict(amp,cpu)
extent& operator-=(int value) restrict(amp,cpu)
extent& operator*=(int value) restrict(amp,cpu)
extent& operator/=(int value) restrict(amp,cpu)
extent& operator%=(int value) restrict(amp,cpu)
extent& operator++() restrict(amp,cpu)
extent operator++(int) restrict(amp,cpu)
extent& operator--() restrict(amp,cpu)
extent operator--(int) restrict(amp,cpu)
4.3 tiled_extent<D0,D1,D2>
A tiled_extent is an extent of 1 to 3 dimensions which also subdivides the index space into 1-, 2-, or 3-dimensional tiles. It
has three specialized forms: tiled_extent<D0>, tiled_extent<D0,D1>, and tiled_extent<D0,D1,D2>, where D0, D1, and D2 specify the
positive length of the tile along each dimension, with D0 being the most-significant dimension and D2 being the least-significant. Partial template specializations are provided to represent 2-D and 1-D tiled extents.
A tiled_extent can be formed from an extent by calling extent<N>::tile<D0,D1,D2>() or one of the other two specializations of
extent<N>::tile().
A tiled_extent inherits from extent, thus all public members of extent are available on tiled_extent.
4.3.1 Synopsis
friend bool operator==(const tiled_extent& lhs,
                       const tiled_extent& rhs) restrict(amp,cpu);
friend bool operator!=(const tiled_extent& lhs,
                       const tiled_extent& rhs) restrict(amp,cpu);
};
friend bool operator==(const tiled_extent& lhs,
                       const tiled_extent& rhs) restrict(amp,cpu);
friend bool operator!=(const tiled_extent& lhs,
                       const tiled_extent& rhs) restrict(amp,cpu);
};
template <int D0>
class tiled_extent<D0,0,0> : public extent<1>
{
public:
static const int rank = 1;
tiled_extent() restrict(amp,cpu);
tiled_extent(const tiled_extent& other) restrict(amp,cpu);
tiled_extent(const extent<1>& extent) restrict(amp,cpu);
tiled_extent& operator=(const tiled_extent& other) restrict(amp,cpu);
tiled_extent pad() const restrict(amp,cpu);
tiled_extent truncate() const restrict(amp,cpu);
__declspec(property(get)) extent<1> tile_extent;
static const int tile_dim0 = D0;
friend bool operator==(const tiled_extent& lhs,
                       const tiled_extent& rhs) restrict(amp,cpu);
friend bool operator!=(const tiled_extent& lhs,
                       const tiled_extent& rhs) restrict(amp,cpu);
};
static const int rank = N
A static member of tiled_extent that contains the rank of this tiled extent, and is either 1, 2, or 3 depending on the
specialization used.
4.3.2 Constructors
tiled_extent() restrict(amp,cpu)
Default constructor.
Parameters:
None.
tiled_extent(const tiled_extent& other) restrict(amp,cpu)
Copy constructor. Constructs a new tiled_extent from the supplied argument other.
Parameters:
other
An object of type tiled_extent from which to initialize this new extent.
tiled_extent(const extent<N>& extent) restrict(amp,cpu)
Constructs a tiled_extent<N> whose extent is given by the argument extent. The origin is default-constructed and thus zero.
Notice that this constructor allows implicit conversions from extent<N> to tiled_extent<N>.
Parameters:
extent
The extent of this tiled_extent
4.3.3 Members
tiled_extent pad() const restrict(amp,cpu)
Returns a new tiled_extent with the extents adjusted up to be evenly divisible by the tile dimensions. The origin of the
new tiled_extent is the same as the origin of this one.
tiled_extent truncate() const restrict(amp,cpu)
Returns a new tiled_extent with the extents adjusted down to be evenly divisible by the tile dimensions. The origin of
the new tiled_extent is the same as the origin of this one.
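The rounding performed by pad and truncate can be sketched per dimension in standard C++. The helper names pad_dim and truncate_dim are hypothetical, and the sketch assumes a non-negative extent component and a positive tile dimension; it models the arithmetic only, not the tiled_extent class.

```cpp
#include <cassert>

// Rounds one extent component up to the nearest multiple of the tile size.
// Assumes extent >= 0 and tile > 0.
constexpr int pad_dim(int extent, int tile) {
    return ((extent + tile - 1) / tile) * tile;
}

// Rounds one extent component down to the nearest multiple of the tile size.
// Assumes extent >= 0 and tile > 0.
constexpr int truncate_dim(int extent, int tile) {
    return (extent / tile) * tile;
}
```

For an extent component of 21 with a tile dimension of 5, padding yields 25 and truncation yields 20. A component that is already evenly divisible is left unchanged by both helpers.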
__declspec(property(get)) extent<N> tile_extent
Returns an instance of an extent<N> that captures the values of the tiled_extent template arguments D0, D1, and D2.
For example:
tiled_extent<64,16,4> tg;
extent<3> myTileExtent = tg.tile_extent;
assert(myTileExtent[0] == 64);
assert(myTileExtent[1] == 16);
assert(myTileExtent[2] == 4);
static const int tile_dim0
static const int tile_dim1
static const int tile_dim2
These constants allow access to the template arguments of tiled_extent.
4.3.4 Operators
4.4 tiled_index<D0,D1,D2>
A tiled_index is a set of indices of 1 to 3 dimensions which have been subdivided into 1-, 2-, or 3-dimensional tiles in a
tiled_extent. It has three specialized forms: tiled_index<D0>, tiled_index<D0,D1>, and tiled_index<D0,D1,D2>, where D0, D1, and D2
specify the length of the tile along each dimension, with D0 being the most-significant dimension and D2 being the least-significant. Partial template specializations are provided to represent 2-D and 1-D tiled indices.
A tiled_index is implicitly convertible to an index<N>, where the implicit index represents the global index.
A tiled_index contains 4 member indices which are related to each other mathematically and help the user to pinpoint a
global index to an index within a tiled space.
A tiled_index contains a global index into an extent space. The other indices obey the following relations:
.local = .global % (D0,D1,D2)
.tile = .global / (D0,D1,D2)
.tile_origin = .global - .local
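The three relations above can be checked with a small rank-2 model in standard C++. The struct tiled_coords and the function decompose are hypothetical names introduced for this sketch; the code assumes non-negative global coordinates and applies the relations component by component.

```cpp
#include <array>
#include <cassert>

// Illustrative rank-2 model of the four indices carried by a tiled_index.
struct tiled_coords {
    std::array<int, 2> global;
    std::array<int, 2> local;
    std::array<int, 2> tile;
    std::array<int, 2> tile_origin;
};

// Derives local, tile and tile_origin from a global index for D0 x D1 tiles.
template <int D0, int D1>
tiled_coords decompose(const std::array<int, 2>& global) {
    const int dims[2] = {D0, D1};
    tiled_coords t{};
    t.global = global;
    for (int i = 0; i < 2; ++i) {
        t.local[i]       = global[i] % dims[i];      // .local = .global % D
        t.tile[i]        = global[i] / dims[i];      // .tile  = .global / D
        t.tile_origin[i] = global[i] - t.local[i];   // .tile_origin = .global - .local
    }
    return t;
}
```

With 5 x 4 tiles, the global index (6,9) decomposes into local (1,1), tile (1,2), and tile origin (5,8).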
This is shown visually in the following example:
parallel_for_each(extent<2>(20,24).tile<5,4>(),
[&](tiled_index<5,4> ti) { /* ... */ });
[Diagram: the 20 x 24 index space from the example above, with rows labeled 0 to 19 and columns labeled 0 to 23, partitioned into 5 x 4 tiles. Individual tiles and threads are highlighted in red, yellow, blue, and green.]
1. Each cell in the diagram represents one thread which is scheduled by the parallel_for_each call. We see that, as with
the non-tiled parallel_for_each, the number of threads scheduled is given by the extent parameter to the
parallel_for_each call.
2. Using vector notation, we see that the total number of tiles scheduled is <20,24> / <5,4> = <4,6>, which we see in
the above diagram as 4 tiles along the vertical axis, and 6 tiles along the horizontal axis.
3. The tile in red is tile number <0,0>. The tile in yellow is tile number <1,2>.
4. The thread in blue:
a. has a global id of <5,8>
b. has a local id of <0,0> within its tile, i.e., it lies on the origin of the tile.
5. The thread in green:
a. has a global id of <6,9>
b. has a local id of <1,1> within its tile
c. The blue thread (number <5,8>) is the green thread's tile origin.
4.4.1 Synopsis
const index<3> global;
const index<3> local;
const index<3> tile;
const index<3> tile_origin;
const tile_barrier barrier;
tiled_index(const index<3>& global,
            const index<3> local,
            const index<3> tile,
            const index<3> tile_origin,
            const tile_barrier& barrier) restrict(amp,cpu);
tiled_index(const tiled_index& other) restrict(amp,cpu);
public:
static const int rank = 2;
const index<2> global;
const index<2> local;
const index<2> tile;
const index<2> tile_origin;
const tile_barrier barrier;
tiled_index(const index<2>& global,
            const index<2> local,
            const index<2> tile,
            const index<2> tile_origin,
            const tile_barrier& barrier) restrict(amp,cpu);
tiled_index(const tiled_index& other) restrict(amp,cpu);
const index<1> global;
const index<1> local;
const index<1> tile;
const index<1> tile_origin;
const tile_barrier barrier;
tiled_index(const index<1>& global,
            const index<1> local,
            const index<1> tile,
            const index<1> tile_origin,
            const tile_barrier& barrier) restrict(amp,cpu);
tiled_index(const tiled_index& other) restrict(amp,cpu);
Template Arguments
D0, D1, D2
static const int rank = N
A static member of tiled_index that contains the rank of this tiled index, and is either 1, 2, or 3 depending on the
specialization used.
4.4.2 Constructors
tiled_index(const index<N>& global,
            const index<N>& local,
            const index<N>& tile,
            const index<N>& tile_origin,
            const tile_barrier& barrier) restrict(amp,cpu)
tiled_index(const tiled_index& other) restrict(amp,cpu)
Copy constructor. Constructs a new tiled_index from the supplied argument other.
Parameters:
other
An object of type tiled_index from which to initialize this.
4.4.3 Members
const index<N> local
An index of rank 1, 2, or 3 that represents the relative index within the current tile of a tiled extent.
const index<N> tile
An index of rank 1, 2, or 3 that represents the coordinates of the current tile of a tiled extent.
const index<N> tile_origin
An index of rank 1, 2, or 3 that represents the global coordinates of the origin of the current tile within a tiled extent.
const tile_barrier barrier
An object which represents a barrier within the current tile of threads.
operator const index<N>() const restrict(amp,cpu)
Implicit conversion operator that converts a tiled_index<D0,D1,D2> into an index<N>. The implicit conversion converts
to the .global index member.
__declspec(property(get)) extent<N> tile_extent
Returns an instance of an extent<N> that captures the values of the tiled_index template arguments D0, D1, and D2.
For example:
index<3> zero;
tiled_index<64,16,4> ti(index<3>(256,256,256), zero, zero, zero, mybarrier);
extent<3> myTileExtent = ti.tile_extent;
assert(myTileExtent[0] == 64);
assert(myTileExtent[1] == 16);
assert(myTileExtent[2] == 4);
static const int tile_dim0
static const int tile_dim1
static const int tile_dim2
These constants allow access to the template arguments of tiled_index.
4.5 tile_barrier
The tile_barrier class is a capability class that is only creatable by the system, and passed to a tiled parallel_for_each function
object as part of the tiled_index parameter. It provides member functions, such as wait, whose purpose is to synchronize
execution of threads running within the thread tile.
A call to wait shall not occur in non-uniform code within a thread tile. Section 8 defines uniformity and lack thereof formally.
4.5.1 Synopsis
class tile_barrier
{
public:
tile_barrier(const tile_barrier& other) restrict(amp,cpu);
void wait() const restrict(amp);
void wait_with_all_memory_fence() const restrict(amp);
void wait_with_global_memory_fence() const restrict(amp);
void wait_with_tile_static_memory_fence() const restrict(amp);
};
4.5.2 Constructors
The tile_barrier class does not have a public default constructor, only a copy-constructor.
tile_barrier(const tile_barrier& other) restrict(amp,cpu)
Copy constructor. Constructs a new tile_barrier from the supplied argument other.
Parameters:
other
An object of type tile_barrier from which to initialize this.
4.5.3 Members
The tile_barrier class does not have an assignment operator. Section 8 provides a complete description of the C++ AMP
memory model, of which class tile_barrier is an important part.
void wait() const restrict(amp)
Blocks execution of all threads in the thread tile until all threads in the tile have reached this call. Establishes a memory
fence on all tile_static and global memory operations executed by the threads in the tile such that all memory operations
issued prior to hitting the barrier are visible to all other threads after the barrier has completed and none of the memory
operations occurring after the barrier are executed before hitting the barrier. This is identical to
wait_with_all_memory_fence.
void wait_with_all_memory_fence() const restrict(amp)
Blocks execution of all threads in the thread tile until all threads in the tile have reached this call. Establishes a memory
fence on all tile_static and global memory operations executed by the threads in the tile such that all memory operations
issued prior to hitting the barrier are visible to all other threads after the barrier has completed and none of the memory
operations occurring after the barrier are executed before hitting the barrier. This is identical to wait.
void wait_with_global_memory_fence() const restrict(amp)
Blocks execution of all threads in the thread tile until all threads in the tile have reached this call. Establishes a memory
fence on global memory operations (but not tile-static memory operations) executed by the threads in the tile such that
all global memory operations issued prior to hitting the barrier are visible to all other threads after the barrier has
completed and none of the global memory operations occurring after the barrier are executed before hitting the barrier.
void wait_with_tile_static_memory_fence() const restrict(amp)
Blocks execution of all threads in the thread tile until all threads in the tile have reached this call. Establishes a memory
fence on tile-static memory operations (but not global memory operations) executed by the threads in the tile such that
all tile_static memory operations issued prior to hitting the barrier are visible to all other threads after the barrier has
completed and none of the tile-static memory operations occurring after the barrier are executed before hitting the
barrier.
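To illustrate why a barrier with a tile-static memory fence matters, the following is a sequential CPU simulation in standard C++ of one tile summing its elements. It is only a sketch: simulate_tile_sum is a hypothetical name, the scratch vector stands in for tile_static storage, and each inner loop models "every thread of the tile executes this phase". The end of each phase is the point where a real kernel would call one of the wait functions, so that the writes of one phase are visible to all threads before the next phase reads them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequentially simulates a tree reduction performed by the threads of one tile.
// The tile size (tile_input.size()) is assumed to be a power of two.
inline int simulate_tile_sum(const std::vector<int>& tile_input) {
    const std::size_t tile_size = tile_input.size();
    std::vector<int> scratch(tile_size);   // stands in for tile_static storage

    // Phase 0: every thread copies its own element into the shared scratch area.
    for (std::size_t local = 0; local < tile_size; ++local)
        scratch[local] = tile_input[local];
    // A real kernel would wait on the tile barrier here.

    // Reduction phases: the active threads halve in number each round.
    for (std::size_t stride = tile_size / 2; stride > 0; stride /= 2) {
        for (std::size_t local = 0; local < stride; ++local)
            scratch[local] += scratch[local + stride];
        // A real kernel would wait on the tile barrier here.
    }
    return tile_size ? scratch[0] : 0;
}
```

Within one phase no thread reads a location that another thread writes, so the sequential loop order gives the same result as a parallel execution separated by barriers. Without the barriers, a thread could read a partial sum that its neighbor has not yet written.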
4.5.4
C++ AMP provides functions that serve as memory fences, which establish a happens-before relationship between memory
operations performed by threads within the same thread tile. These functions are available in the concurrency namespace.
Section 8 provides a complete description of the C++ AMP memory model.
void all_memory_fence(const tile_barrier&) restrict(amp)
Establishes a thread-tile scoped memory fence for both global and tile-static memory operations. This function does not
imply a barrier and is therefore permitted in divergent code.
void global_memory_fence(const tile_barrier&) restrict(amp)
Establishes a thread-tile scoped memory fence for global (but not tile-static) memory operations. This function does not
imply a barrier and is therefore permitted in divergent code.
void tile_static_memory_fence(const tile_barrier&) restrict(amp)
Establishes a thread-tile scoped memory fence for tile-static (but not global) memory operations. This function does not
imply a barrier and is therefore permitted in divergent code.
4.6 completion_future
This class is the return type of all C++ AMP asynchronous APIs and has an interface analogous to std::shared_future<void>.
Similar to std::shared_future, this type provides member methods such as wait and get to wait for C++ AMP asynchronous
operations to finish, and the type additionally provides a member method then, to specify a completion callback functor to
be executed upon completion of a C++ AMP asynchronous operation. Further, this type also contains a member method
to_task (Microsoft-specific extension) which returns a concurrency::task object that can be used to combine the capabilities of
PPL tasks with C++ AMP asynchronous operations, such as chaining continuations and cancellation. This essentially enables wait-free
composition of C++ AMP asynchronous tasks on accelerators with CPU tasks.
4.6.1 Synopsis
class completion_future
{
public:
completion_future();
completion_future(const completion_future& _Other);
completion_future(completion_future&& _Other);
~completion_future();
completion_future& operator=(const completion_future& _Other);
completion_future& operator=(completion_future&& _Other);
void get() const;
bool valid() const;
void wait() const;
template <class _Rep, class _Period>
std::future_status::future_status wait_for(const std::chrono::duration<_Rep, _Period>& _Rel_time) const;
template <class _Clock, class _Duration>
std::future_status::future_status wait_until(const std::chrono::time_point<_Clock, _Duration>& _Abs_time) const;
operator std::shared_future<void>() const;
template <typename _Functor>
void then(const _Functor &_Func) const;
concurrency::task<void> to_task() const;
};
4.6.2 Constructors
completion_future()
Default constructor. Constructs an empty uninitialized completion_future object which does not refer to any asynchronous
operation. Default constructed completion_future objects have valid() == false.
completion_future (const completion_future& other)
Copy constructor. Constructs a new completion_future object that refers to the same asynchronous operation as the
other completion_future object.
Parameters:
other
An object of type completion_future from which to initialize this.
completion_future (completion_future&& other)
Move constructor. Move constructs a new completion_future object that refers to the same asynchronous operation
as originally referred to by the other completion_future object. After this constructor returns, other.valid() == false.
Parameters:
other
An object of type completion_future which the new completion_future object is to be move constructed from.
completion_future& operator=(const completion_future& other)
Copy assignment. Copy assigns the contents of other to this. This method causes this to stop referring to its current
asynchronous operation and start referring to the same asynchronous operation as other.
Parameters:
other
An object of type completion_future which is copy assigned to this.
completion_future& operator=(completion_future&& other)
Move assignment. Move assigns the contents of other to this. This method causes this to stop referring to its current
asynchronous operation and start referring to the same asynchronous operation as other. After this method returns,
other.valid() == false.
Parameters:
other
An object of type completion_future which is move assigned to this.
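The copy and move rules above mirror those of std::shared_future<void>, to which completion_future converts. The following is a minimal sketch in standard C++ only, assuming a std::promise<void> as a stand-in for the asynchronous operation; the names HandleStates, is_ready and demo_handle_semantics are hypothetical and are not part of C++ AMP.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <utility>

// Reports, without blocking, whether the operation behind f has completed.
inline bool is_ready(const std::shared_future<void>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}

// Observations collected by the demo (hypothetical helper type).
struct HandleStates {
    bool default_valid;      // valid() of a default-constructed handle
    bool copy_valid;         // valid() of a copy
    bool moved_from_valid;   // valid() of the source after a move
    bool moved_to_valid;     // valid() of the target after a move
    bool ready_before;       // readiness seen through the copy before completion
    bool ready_after;        // readiness seen through copy and moved-to handle afterwards
};

inline HandleStates demo_handle_semantics() {
    HandleStates s{};

    std::shared_future<void> empty;                 // refers to no operation
    s.default_valid = empty.valid();

    std::promise<void> op;                          // stand-in for the asynchronous operation
    std::shared_future<void> a = op.get_future().share();
    std::shared_future<void> b = a;                 // copy: refers to the same operation
    s.copy_valid = b.valid();
    s.ready_before = is_ready(b);

    std::shared_future<void> c = std::move(a);      // move: a no longer refers to it
    s.moved_from_valid = a.valid();
    s.moved_to_valid = c.valid();

    op.set_value();                                 // the operation completes
    s.ready_after = is_ready(b) && is_ready(c);
    return s;
}
```

Every handle that still refers to the operation observes its completion, while the default-constructed and moved-from handles report valid() == false.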
4.6.3 Members
bool valid() const
This method is functionally identical to std::shared_future<void>::valid. This returns true if this completion_future
is associated with an asynchronous operation.
void wait() const
template <class Rep, class Period>
std::future_status::future_status wait_for(const std::chrono::duration<Rep, Period>& rel_time) const
template <class Clock, class Duration>
std::future_status::future_status wait_until(const std::chrono::time_point<Clock, Duration>& abs_time) const
These methods are functionally identical to the corresponding std::shared_future<void> methods.
The wait method waits for the associated asynchronous operation to finish and returns only upon completion of the
associated asynchronous operation or if an exception was encountered when executing the asynchronous operation.
The other variants are functionally identical to the std::shared_future<void> member methods with the same names.
operator shared_future<void>() const
Conversion operator to std::shared_future<void>. This method returns a shared_future<void> object
corresponding to this completion_future object and refers to the same asynchronous operation.
template <typename Functor>
void then(const Functor &func) const
This method enables specification of a completion callback func which is executed upon completion of the asynchronous
operation associated with this completion_future object. The completion callback func should have an operator() that
is valid when invoked with no arguments, i.e., func().
Parameters:
func
A function object or lambda whose operator() is invoked upon
completion of the asynchronous operation associated with this.
concurrency::task<void> to_task() const
This method returns a concurrency::task<void> object corresponding to this completion_future object and refers to
the same asynchronous operation. This method is a Microsoft specific extension.
Data Containers
5.1 array<T,N>
The type array<T,N> represents a dense and regular (not jagged) N-dimensional array which resides on a specific location
such as an accelerator or the CPU. The element type of the array is T, which is necessarily of a type compatible with the target
accelerator. While the rank of the array is determined statically and is part of the type, the extent of the array is runtime-determined, and is expressed using class extent<N>. A specific element of an array is selected using an instance of index<N>.
If idx is a valid index for an array with extent e, then 0 <= idx[k] < e[k] for 0 <= k < N. Here each k is referred to as a
dimension and higher-numbered dimensions are referred to as less significant.
The array element type T shall be an amp-compatible type whose size is a multiple of 4 bytes and shall not directly or recursively
contain any concurrency containers or references to concurrency containers.
Array data is laid out contiguously in memory. Elements which differ by one in the least significant dimension are adjacent
in memory. This storage layout is typically referred to as row major and is motivated by achieving efficient memory access
given the standard mapping rules that GPUs use for assigning compute domain values to warps.
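The layout rule above can be sketched in standard C++. This is an illustration only, assuming std::array<int, N> as a stand-in for both index<N> and extent<N>; linear_offset is a hypothetical helper, not a C++ AMP function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Row-major offset of idx inside an N-dimensional extent e.
// Dimension 0 is the most significant; dimension N-1 is the least significant,
// so incrementing idx[N-1] moves to the adjacent element in memory.
template <std::size_t N>
std::size_t linear_offset(const std::array<int, N>& idx, const std::array<int, N>& e) {
    std::size_t offset = 0;
    for (std::size_t k = 0; k < N; ++k) {
        assert(0 <= idx[k] && idx[k] < e[k]);  // valid index: 0 <= idx[k] < e[k]
        offset = offset * static_cast<std::size_t>(e[k]) + static_cast<std::size_t>(idx[k]);
    }
    return offset;
}
```

For an extent of (3, 4), stepping the least significant component changes the offset by 1, while stepping the most significant component changes it by 4.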
Arrays are logically considered to be value types in that when an array is copied to another array, a deep copy is performed.
Two arrays never point to the same data.
The array<T,N> type is used in several distinct scenarios:
An array can have any number of dimensions, although some functionality is specialized for array<T,1>, array<T,2>, and
array<T,3>. The dimension defaults to 1 if the template argument is elided.
5.1.1 Synopsis
template<typename T>
class array<T,2>
{
public:
    static const int rank = 2;
    typedef T value_type;
    array() = delete;
    explicit array(const extent<2>& extent);
    array(int e0, int e1);
    array(const extent<2>& extent,
          accelerator_view av, accelerator_view associated_av); // staging
    array(int e0, int e1, accelerator_view av, accelerator_view associated_av); // staging
    array(const extent<2>& extent, accelerator_view av);
    array(int e0, int e1, accelerator_view av);
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin);
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, InputIterator srcEnd);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, InputIterator srcEnd);
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);
    array(const array_view<const T,2>& src);
    array(const array_view<const T,2>& src,
          accelerator_view av, accelerator_view associated_av); // staging
    array(const array_view<const T,2>& src, accelerator_view av);
    array(const array& other);
    array(array&& other);
    array& operator=(const array& other);
    array& operator=(array&& other);
template<typename T>
class array<T,3>
{
public:
    static const int rank = 3;
    typedef T value_type;
    array() = delete;
    explicit array(const extent<3>& extent);
static const int rank = N
The rank of this array.
typedef T value_type;
The element type of this array.
5.1.2 Constructors
There is no default constructor for array<T,N>. All constructors are restricted to run on the CPU only (they cannot be executed on
an amp target).
array(const array& other)
Copy constructor. Constructs a new array<T,N> from the supplied argument other. The new array is located on the
same accelerator_view as the source array. A deep copy is performed.
Parameters:
other
An object of type array<T,N> from which to initialize this new array.
array(array&& other)
Move constructor. Constructs a new array<T,N> by moving from the supplied argument other.
Parameters:
other
An object of type array<T,N> from which to initialize this new array.
explicit array(const extent<N>& extent)
Constructs a new array with the supplied extent, located on the default view of the default accelerator. If any
components of the extent are non-positive, an exception will be thrown.
Parameters:
extent
The extent in each dimension of this array.
explicit array<T,1>::array(int e0)
array<T,2>::array(int e0, int e1)
array<T,3>::array(int e0, int e1, int e2)
Equivalent to construction using array(extent<N>(e0 [, e1 [, e2 ]])).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array.
template <typename InputIterator>
array(const extent<N>& extent, InputIterator srcBegin [, InputIterator srcEnd])
Constructs a new array with the supplied extent, located on the default accelerator, initialized with the contents of a
source container specified by a beginning and optional ending iterator. The source data is copied by value into this array
as if by calling copy().
If the number of available container elements is less than this->extent.size(), undefined behavior results.
Parameters:
extent
The extent in each dimension of this array.
srcBegin
srcEnd
template <typename InputIterator>
array<T,1>::array(int e0, InputIterator srcBegin [, InputIterator srcEnd])
template <typename InputIterator>
array<T,2>::array(int e0, int e1, InputIterator srcBegin [, InputIterator srcEnd])
template <typename InputIterator>
array<T,3>::array(int e0, int e1, int e2, InputIterator srcBegin [, InputIterator srcEnd])
Equivalent to construction using array(extent<N>(e0 [, e1 [, e2 ]]), srcBegin [, srcEnd]).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array.
srcBegin
srcEnd
explicit array(const array_view<const T,N>& src)
Constructs a new array, located on the default view of the default accelerator, initialized with the contents of the
array_view src. The extent of this array is taken from the extent of the source array_view. The src is copied by
value into this array as if by calling copy(src, *this) (see 5.3.2).
Parameters:
src
An array_view object from which to copy the data into this array (and
also to determine the extent of this array).
explicit array(const extent<N>& extent, accelerator_view av)
Constructs a new array with the supplied extent, located on the accelerator bound to the accelerator_view av.
Parameters:
extent
The extent in each dimension of this array.
av
array<T,1>::array(int e0, accelerator_view av)
array<T,2>::array(int e0, int e1, accelerator_view av)
array<T,3>::array(int e0, int e1, int e2, accelerator_view av)
Equivalent to construction using array(extent<N>(e0 [, e1 [, e2 ]]), av).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array.
av
template <typename InputIterator>
array(const extent<N>& extent, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av)
Constructs a new array with the supplied extent, located on the accelerator bound to the accelerator_view av,
initialized with the contents of the source container specified by a beginning and optional ending iterator. The data is
copied by value into this array as if by calling copy().
Parameters:
extent
The extent in each dimension of this array.
srcBegin
srcEnd
av
array(const array_view<const T,N>& src, accelerator_view av)
Constructs a new array initialized with the contents of the array_view src. The extent of this array is taken from the
extent of the source array_view. The src is copied by value into this array as if by calling copy(src, *this) (see
5.3.2). The new array is located on the accelerator bound to the accelerator_view av.
Parameters:
src
An array_view object from which to copy the data into this array (and
also to determine the extent of this array).
av
template <typename InputIterator>
array<T,1>::array(int e0, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av)
template <typename InputIterator>
array<T,2>::array(int e0, int e1, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av)
template <typename InputIterator>
array<T,3>::array(int e0, int e1, int e2, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av)
Equivalent to construction using array(extent<N>(e0 [, e1 [, e2 ]]), srcBegin [, srcEnd], av).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array.
srcBegin
srcEnd
av
5.1.2.1
Staging arrays are used as a hint to optimize repeated copies between two accelerators (in V1 practically this is between the
CPU and an accelerator). Staging arrays are optimized for data transfers, and do not have stable user-space memory.
Microsoft-specific: On Windows, staging arrays are backed by DirectX staging buffers which have the correct hardware
alignment to ensure efficient DMA transfer between the CPU and a device.
Staging arrays are differentiated from normal arrays by their construction with a second accelerator. Note that the
accelerator_view property of a staging array returns the value of the first accelerator argument it was constructed with (av,
below).
It is illegal to change or examine the contents of a staging array while it is involved in a transfer operation (i.e., between lines
17 and 22 in the following example):
 1. class SimulationServer
 2. {
 3.     array<float,2> acceleratorArray;
 4.     array<float,2> stagingArray;
 5. public:
 6.     SimulationServer(const accelerator_view& av)
 7.         :acceleratorArray(extent<2>(1000,1000), av),
 8.          stagingArray(extent<2>(1000,1000), accelerator(cpu).default_view,
 9.                       accelerator(gpu).default_view)
10.     {
11.     }
12.
13.     void OnCompute()
14.     {
15.         array<float,2> &a = acceleratorArray;
16.         ApplyNetworkChanges(stagingArray.data());
17.         a = stagingArray;
18.         parallel_for_each(a.extent, [&](index<2> idx) restrict(amp)
19.         {
20.             // Update a[idx] according to simulation
21.         });
22.         stagingArray = a;
23.         SendToClient(stagingArray.data());
24.     }
25. };
associated_av
array<T,1>::array(int e0, accelerator_view av, accelerator_view associated_av)
array<T,2>::array(int e0, int e1, accelerator_view av, accelerator_view associated_av)
array<T,3>::array(int e0, int e1, int e2, accelerator_view av, accelerator_view associated_av)
Equivalent to construction using array(extent<N>(e0 [, e1 [, e2 ]]), av, associated_av).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array.
av
associated_av
template <typename InputIterator>
array(const extent<N>& extent, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av, accelerator_view associated_av)
Constructs a staging array with the given extent, which acts as a staging area between accelerators av (which must be
the CPU accelerator) and associated_av. The staging array will be initialized with the data specified by srcBegin and the
optional srcEnd, as if by calling copy() (see 5.3.2).
Parameters:
extent
The extent in each dimension of this array.
srcBegin
srcEnd
av
associated_av
array(const array_view<const T,N>& src, accelerator_view av, accelerator_view associated_av)
Constructs a staging array initialized with the array_view given by src, which acts as a staging area between
accelerators av (which must be the CPU accelerator) and associated_av. The extent of this array is taken from the
extent of the source array_view. The staging array will be initialized from src as if by calling copy(src, *this) (see
5.3.2).
Parameters:
src
An array_view object from which to copy the data into this array (and
also to determine the extent of this array).
av
associated_av
template <typename InputIterator>
array<T,1>::array(int e0, InputIterator srcBegin [, InputIterator srcEnd], accelerator_view
av, accelerator_view associated_av)
template <typename InputIterator>
array<T,2>::array(int e0, int e1, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av, accelerator_view associated_av)
template <typename InputIterator>
array<T,3>::array(int e0, int e1, int e2, InputIterator srcBegin [, InputIterator srcEnd],
accelerator_view av, accelerator_view associated_av)
Equivalent to construction using array(extent<N>(e0 [, e1 [, e2 ]]), srcBegin [, srcEnd], av, associated_av).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array.
srcBegin
srcEnd
av
associated_av
5.1.3 Members
__declspec(property(get)) accelerator_view accelerator_view
This property returns the accelerator_view representing the location where this array has been allocated. This property
is only accessible on the CPU.
__declspec(property(get)) accelerator_view associated_accelerator_view
This property returns the accelerator_view representing the preferred target where this array can be copied.
array& operator=(const array& other)
Assigns the contents of the array other to this array, using a deep copy. This function can only be called on the CPU.
Parameters:
other
An object of type array<T,N> from which to copy into this array.
Return Value:
Returns *this.
array& operator=(array&& other)
Moves the contents of the array other to this array. This function can only be called on the CPU.
Parameters:
other
Return Value:
Returns *this.
array& operator=(const array_view<const T,N>& src)
Assigns the contents of the array_view src, as if by calling copy(src, *this) (see 5.3.2).
Parameters:
src
An object of type array_view<const T,N> from which to copy into this array.
Return Value:
Returns *this.
void copy_to(array<T,N>& dest)
Copies the contents of this array to the array given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
dest
An object of type array<T,N> to which to copy data from this array.
void copy_to(const array_view<T,N>& dest)
Copies the contents of this array to the array_view given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
dest
An object of type array_view<T,N> to which to copy data from this array.
T* data() restrict(amp,cpu)
const T* data() const restrict(amp,cpu)
Returns a pointer to the raw data underlying this array.
Return Value:
A (const) pointer to the first element in the linearized array.
operator std::vector<T>() const
Implicitly converts an array to a std::vector, as if by copy(*this, vector) (see 5.3.2).
Return Value:
An object of type vector<T> which contains a copy of the data contained on the array.
5.1.4 Indexing
const T& operator[](const index<N>& idx) const restrict(amp,cpu)
const T& operator()(const index<N>& idx) const restrict(amp,cpu)
Returns a const reference to the element of this array that is at the location in N-dimensional space specified by idx.
Accessing array data from a location where it is not resident (e.g. from the CPU when it is resident on a GPU) results
in an exception or undefined behavior.
Parameters:
idx
An object of type index<N> that specifies the location of the element.
T& array<T,1>::operator()(int i0) restrict(amp,cpu)
T& array<T,1>::operator[](int i0) restrict(amp,cpu)
const T& array<T,1>::operator()(int i0) const restrict(amp,cpu)
const T& array<T,1>::operator[](int i0) const restrict(amp,cpu)
const T& array<T,2>::operator()(int i0, int i1) const restrict(amp,cpu)
const T& array<T,3>::operator()(int i0, int i1, int i2) const restrict(amp,cpu)
array_view<T,N-1> operator[](int i0) restrict(amp,cpu)
array_view<const T,N-1> operator[](int i0) const restrict(amp,cpu)
This overload is defined for array<T,N> where N >= 2.
This mode of indexing is equivalent to projecting on the most-significant dimension. It allows C-style indexing. For
example:
array<float,4> myArray(myExtents, ...);
myArray[index<4>(5,4,3,2)] = 7;
assert(myArray[5][4][3][2] == 7);
Parameters:
i0
Return Value:
Returns an array_view whose dimension is one lower than that of this array.
5.1.5 View Operations
array_view<T,N> section(const index<N>& idx) restrict(amp,cpu)
array_view<const T,N> section(const index<N>& idx) const restrict(amp,cpu)
Equivalent to section(idx, this->extent - idx).
array_view<T,N> section(const extent<N>& ext) restrict(amp,cpu)
array_view<const T,N> section(const extent<N>& ext) const restrict(amp,cpu)
Equivalent to section(index<N>(), ext).
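The two equivalences above can be sketched in standard C++. This is a model of the origin and extent arithmetic only, assuming std::array<int, N> in place of index<N> and extent<N>; SectionBounds, make_section, section_from and section_of_extent are hypothetical names, not C++ AMP functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Origin and extent of a section within a parent array (hypothetical model type).
template <std::size_t N>
struct SectionBounds {
    std::array<int, N> origin;  // index of the section's first element
    std::array<int, N> ext;     // extent of the section in each dimension
};

// Models section(idx, ext): the section must lie inside the parent extent.
template <std::size_t N>
SectionBounds<N> make_section(const std::array<int, N>& parent,
                              const std::array<int, N>& idx,
                              const std::array<int, N>& ext) {
    for (std::size_t k = 0; k < N; ++k) {
        assert(idx[k] >= 0 && ext[k] >= 0 && idx[k] + ext[k] <= parent[k]);
    }
    SectionBounds<N> s;
    s.origin = idx;
    s.ext = ext;
    return s;
}

// Models section(idx): everything from idx to the end, i.e. ext = parent - idx.
template <std::size_t N>
SectionBounds<N> section_from(const std::array<int, N>& parent,
                              const std::array<int, N>& idx) {
    std::array<int, N> ext{};
    for (std::size_t k = 0; k < N; ++k) ext[k] = parent[k] - idx[k];
    return make_section<N>(parent, idx, ext);
}

// Models section(ext): a section of the given extent starting at the zero index.
template <std::size_t N>
SectionBounds<N> section_of_extent(const std::array<int, N>& parent,
                                   const std::array<int, N>& ext) {
    return make_section<N>(parent, std::array<int, N>{}, ext);
}
```

For a parent extent of (8, 10), section_from at index (2, 3) yields an extent of (6, 7), and section_of_extent with (4, 5) yields a section whose origin is (0, 0).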
array_view<T,1> array<T,1>::section(int i0, int e0) restrict(amp,cpu)
array_view<const T,1> array<T,1>::section(int i0, int e0) const restrict(amp,cpu)
array_view<T,2> array<T,2>::section(int i0, int i1, int e0, int e1) restrict(amp,cpu)
array_view<const T,2> array<T,2>::section(int i0, int i1, int e0, int e1) const restrict(amp,cpu)
array_view<T,3> array<T,3>::section(int i0, int i1, int i2, int e0, int e1, int e2) restrict(amp,cpu)
The component values that will form the extent of the section
template<typename ElementType>
array_view<ElementType,1> reinterpret_as() restrict(amp,cpu)
template<typename ElementType>
array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu)
Sometimes it is desirable to view the data of an N-dimensional array as a linear array, possibly with an (unsafe)
reinterpretation of the element type. This can be achieved through the reinterpret_as member function. Example:
struct RGB { float r; float g; float b; };
array<RGB,3> a = ...;
array_view<float,1> v = a.reinterpret_as<float>();
assert(v.extent.size() == 3*a.extent.size());
The size of the reinterpreted ElementType must evenly divide into the total size of this array.
Return Value:
Returns an array_view from this array<T,N> with the element type reinterpreted from T to ElementType, and the rank
reduced from N to 1.
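The size rule for reinterpret_as can be illustrated with plain byte arithmetic. This is a sketch in standard C++ under the assumption that the resulting one-dimensional extent is the total byte size divided by sizeof(ElementType); reinterpreted_extent is a hypothetical helper, and throwing an exception on a size mismatch is a choice made for this sketch, not behavior stated by the specification.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

struct RGB { float r; float g; float b; };
static_assert(sizeof(RGB) == 3 * sizeof(float), "sketch assumes RGB has no padding");

// Number of elements of type To that cover element_count elements of type From.
// Throws when sizeof(To) does not evenly divide the total byte size.
template <typename From, typename To>
std::size_t reinterpreted_extent(std::size_t element_count) {
    const std::size_t total_bytes = element_count * sizeof(From);
    if (total_bytes % sizeof(To) != 0) {
        throw std::invalid_argument("element size does not evenly divide the array size");
    }
    return total_bytes / sizeof(To);
}
```

Ten RGB elements reinterpret as thirty floats, whereas four floats (16 bytes) cannot be reinterpreted as RGB elements of 12 bytes each.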
template <int K>
array_view<T,K> view_as(extent<K> viewExtent) restrict(amp,cpu)
template <int K>
array_view<const T,K> view_as(extent<K> viewExtent) const restrict(amp,cpu)
An array of higher rank can be reshaped into an array of lower rank, or vice versa, using the view_as member function.
Example:
array<float,1> a(100);
array_view<float,2> av = a.view_as(extent<2>(2,50));
Return Value:
Returns an array_view from this array<T,N> with the rank changed to K from N.
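The reshaping performed by view_as can be modeled in standard C++ as a change of index arithmetic over the same storage. This sketch assumes a std::vector<T> as the rank-1 source and requires the view extent to cover the source exactly; both View2D and view_as_2d are hypothetical names, and the exact-size requirement is an assumption of the sketch rather than a rule taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// A non-owning rank-2 view over contiguous storage (hypothetical model type).
template <typename T>
class View2D {
public:
    View2D(T* data, int e0, int e1) : data_(data), e0_(e0), e1_(e1) {}
    // Row-major addressing: element (i, j) lives at offset i * e1 + j.
    T& operator()(int i, int j) const {
        assert(0 <= i && i < e0_ && 0 <= j && j < e1_);
        return data_[static_cast<std::size_t>(i) * static_cast<std::size_t>(e1_) + static_cast<std::size_t>(j)];
    }
    int rows() const { return e0_; }
    int cols() const { return e1_; }
private:
    T* data_;
    int e0_;
    int e1_;
};

// Reshapes a rank-1 container into an e0 x e1 view without copying any data.
template <typename T>
View2D<T> view_as_2d(std::vector<T>& flat, int e0, int e1) {
    if (e0 <= 0 || e1 <= 0 ||
        static_cast<std::size_t>(e0) * static_cast<std::size_t>(e1) != flat.size()) {
        throw std::invalid_argument("view extent must match the source size");
    }
    return View2D<T>(flat.data(), e0, e1);
}
```

With 100 source elements viewed as 2 x 50, element (1, 0) of the view aliases source element 50, and a write through the view is visible in the source.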
5.2 array_view<T,N>
The array_view<T,N> type represents a possibly cached view into the data held in an array<T,N>, or a section thereof. It also
provides such views over native CPU data. It exposes an indexing interface congruent to that of array<T,N>.
Like an array, an array_view is an N-dimensional object, where N defaults to 1 if it is omitted.
The array element type T shall be an amp-compatible type whose size is a multiple of 4 bytes and shall not directly or recursively
contain any concurrency containers or references to concurrency containers.
array_views may be accessed locally, where their source data lives, or remotely on a different accelerator_view or coherence
domain. When they are accessed remotely, views are copied and cached as necessary. Except for the effects of automatic
caching, array_views have a performance profile similar to that of arrays (small to negligible access penalty when accessing
the data through views).
A view to a system memory pointer is passed through a parallel_for_each call to an accelerator and accessed on the accelerator.
A view to an accelerator-residing array is passed using a parallel_for_each to another accelerator_view and is accessed there.
A view to an accelerator-residing array is accessed on the CPU.
When any of these scenarios occur, the referenced views are implicitly copied by the system to the remote location and, if
modified through the array_view, copied back to the home location. The implementation is free to optimize how changes are
copied back: it may copy only the changed elements, or it may copy unchanged portions as well. Overlapping array_views to the
same data source are not guaranteed to maintain aliasing between arrays/array_views on a remote location.
Multi-threaded access to the same data source, either directly or through views, must be synchronized by the user.
The runtime makes the following guarantees regarding caching of data inside array views.
1. Let A be an array and V a view to the array. Then, all well-synchronized accesses to A and V in program order obey
   a serial happens-before relationship.
2. Let A be an array and V1 and V2 be overlapping views to the array.
   a. When executing on the accelerator where A has been allocated, all well-synchronized accesses through A,
      V1 and V2 are aliased through A and induce a total happens-before relationship which obeys program
      order. (No caching.)
   b. Otherwise, if they are executing on different accelerators, then the behaviour of writes to V1 and V2 is
      undefined (a race).
When an array_view is created over a pointer in system memory, the user commits to:
1. only changing the data accessible through the view directly through the view class, or
2. adhering to the following rules when accessing the data directly (not through the view):
   a. Calling synchronize() before the data is accessed directly, and
   b. If the underlying data is modified, calling refresh() prior to further accessing it through the view.
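The synchronize() and refresh() rules can be illustrated with a toy model in standard C++. This is a sketch only: CachedView is a hypothetical class that keeps a private copy of the host data to stand in for an accelerator-resident copy, and it is not how the C++ AMP runtime is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a view over host memory that works on a cached copy.
class CachedView {
public:
    CachedView(int* host, std::size_t n)
        : host_(host), cache_(host, host + n), dirty_(false) {}

    // Reads and writes through the view touch only the cached copy.
    int read(std::size_t i) const { return cache_[i]; }
    void write(std::size_t i, int value) { cache_[i] = value; dirty_ = true; }

    // Rule (a): push pending writes back before the host data is accessed directly.
    void synchronize() {
        if (dirty_) {
            std::copy(cache_.begin(), cache_.end(), host_);
            dirty_ = false;
        }
    }

    // Rule (b): after the host data was modified directly, discard the stale copy.
    void refresh() {
        cache_.assign(host_, host_ + cache_.size());
        dirty_ = false;
    }

private:
    int* host_;               // underlying system memory (not owned)
    std::vector<int> cache_;  // stand-in for the remote copy
    bool dirty_;              // true while writes have not been copied back
};
```

In this model a write through the view is not visible in host memory until synchronize() is called, and a direct host write is not visible through the view until refresh() is called, which is why both rules are needed.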
(Note: The underlying data of an array_view is updated when the last copy of an array_view having pending writes goes out
of scope or is otherwise destructed.)
Either action will notify the array_view that the underlying native memory has changed and that any accelerator-residing
copies are now stale. If the user abides by these rules then the guarantees provided by the system for pointer-based views
are identical to those provided to views of data-parallel arrays.
The memory allocation underlying a concurrency::array is reference counted for automatic lifetime management. The array
and all array_views created from it hold references to the allocation, and the allocation lives as long as at least one array
or array_view object references it. Thus it is legal to access the array_view(s) even after the source
concurrency::array object has been destructed.
When an array_view is created over native CPU data (such as raw CPU memory, std::vector, etc.), it is the user's responsibility
to ensure that the source data outlives all array_views created over that source. Any attempt to access the array_view
contents after the native CPU data has been deallocated has undefined behavior.
5.2.1 Synopsis
The array_view<T,N> has the following specializations:
array_view<T,1>
array_view<T,2>
array_view<T,3>
array_view<const T,N>
array_view<const T,1>
array_view<const T,2>
array_view<const T,3>
5.2.1.1 array_view<T,N>
The generic array_view<T,N> represents a view over elements of type T with rank N. The elements are both readable and
writeable.
typedef T value_type;
array_view() = delete;
array_view(array<T,1>& src) restrict(amp,cpu);
template <typename Container>
array_view(const extent<1>& extent, Container& src);
template <typename Container>
array_view(int e0, Container& src);
array_view(const extent<1>& extent, value_type* src) restrict(amp,cpu);
array_view(int e0, value_type* src) restrict(amp,cpu);
array_view(const array_view& other) restrict(amp,cpu);
array_view& operator=(const array_view& other) restrict(amp,cpu);
void copy_to(array<T,1>& dest) const;
void copy_to(const array_view& dest) const;
__declspec(property(get)) extent<1> extent;
T& operator[](const index<1>& idx) const restrict(amp,cpu);
T& operator[](int i) const restrict(amp,cpu);
T& operator()(const index<1>& idx) const restrict(amp,cpu);
T& operator()(int i) const restrict(amp,cpu);
array_view<T,1>
array_view<T,1>
array_view<T,1>
array_view<T,1>
5.2.1.2
The partial specialization array_view<const T,N> represents a view over elements of type const T with rank N. The elements
are read-only. At the boundary of a call site (such as parallel_for_each), this form of array_view need only be copied to the
target accelerator if it isn't already there. It will not be copied out.
array_view<const T,N>
5.2.2 Constructors
The array_view type cannot be default-constructed. It must be bound at construction time to a contiguous data source.
No bounds-checking is performed when constructing array_views.
template <typename Container>
array_view<T,N>::array_view(const extent<N>& extent, Container& src)
template <typename Container>
array_view<const T,N>::array_view(const extent<N>& extent, const Container& src)
Constructs an array_view which is bound to the data contained in the src container. The extent of the array_view is
that given by the extent argument, and the origin of the array view is at zero.
Parameters:
src
A template argument that must resolve to a linear container that supports .data() and .size() members (such as std::vector or std::array).
extent
The extent of this array_view.
array_view<T,N>::array_view(const extent<N>& extent, value_type* src) restrict(amp,cpu)
array_view<const T,N>::array_view(const extent<N>& extent,
const value_type* src) restrict(amp,cpu)
Constructs an array_view which is bound to the data pointed to by src. The extent of the array_view is
that given by the extent argument, and the origin of the array view is at zero.
Parameters:
src
A pointer to the source data to which this array_view will be bound.
extent
The extent of this array_view.
template <typename Container>
array_view<T,1>::array_view(int e0, Container& src)
template <typename Container>
array_view<T,2>::array_view(int e0, int e1, Container& src)
template <typename Container>
array_view<T,3>::array_view(int e0, int e1, int e2, Container& src)
template <typename Container>
array_view<const T,1>::array_view(int e0, const Container& src)
template <typename Container>
array_view<const T,2>::array_view(int e0, int e1, const Container& src)
template <typename Container>
array_view<const T,3>::array_view(int e0, int e1, int e2, const Container& src)
array_view<T,1>::array_view(int e0, value_type* src) restrict(amp,cpu)
array_view<T,2>::array_view(int e0, int e1, value_type* src) restrict(amp,cpu)
array_view<T,3>::array_view(int e0, int e1, int e2, value_type* src) restrict(amp,cpu)
array_view<const T,1>::array_view(int e0, const value_type* src) restrict(amp,cpu)
array_view<const T,2>::array_view(int e0, int e1, const value_type* src) restrict(amp,cpu)
array_view<const T,3>::array_view(int e0, int e1, int e2,
const value_type* src) restrict(amp,cpu)
Equivalent to construction using array_view(extent<N>(e0 [, e1 [, e2 ]]), src).
Parameters:
e0 [, e1 [, e2 ] ]
The component values that will form the extent of this array_view.
src
A container or pointer supplying the source data to which this array_view will be bound.
array_view(const array_view<T,N>& other) restrict(amp,cpu)
array_view(const array_view<const T,N>& other) restrict(amp,cpu);
Copy constructor. Constructs a new array_view<T,N> from the supplied argument other. A shallow copy is performed.
Parameters:
other
An object of type array_view<T,N> or array_view<const T,N> from which to initialize this new array_view.
5.2.3 Members
void copy_to(array<T,N>& dest)
Copies the data referred to by this array_view to the array given by dest, as if by calling copy(*this, dest) (see
5.3.2).
Parameters:
dest
An object of type array<T,N> to which to copy data from this array_view.
void copy_to(const array_view& dest)
Copies the contents of this array_view to the array_view given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
dest
An object of type array_view<T,N> to which to copy data from this array_view.
T* array_view<T,1>::data() const restrict(amp,cpu)
const T* array_view<const T,1>::data() const restrict(amp,cpu)
Returns a pointer to the first data element underlying this array_view. This is only available on array_views of rank 1.
When the data source of the array_view is native CPU memory, the pointer returned by data() is valid for the lifetime of
the data source.
When the data source underlying the array_view is an array, the pointer returned by data() in CPU context is ephemeral
and is invalidated when the original data source or any of its views are accessed on an accelerator_view through a
parallel_for_each or a copy operation.
Return Value:
A (const) pointer to the first element in the linearized array.
void array_view<T, N>::refresh() const
void array_view<const T, N>::refresh() const
Calling this member function informs the array_view that its bound memory has been modified outside the array_view
interface. This will render all cached information stale.
void array_view<T, N>::synchronize() const
Calling this member function synchronizes any modifications made to this array_view to its underlying data container.
For example, for an array_view on system memory, if the contents of the view are modified on a remote
accelerator_view through a parallel_for_each invocation, calling synchronize ensures that the modifications are
synchronized to the source data and will be visible through the system memory pointer which the array_view was
created over.
completion_future array_view<T, N>::synchronize_async() const
An asynchronous version of synchronize, which returns a completion_future object. When the future is ready, the
synchronization operation is complete.
void array_view<T, N>::discard_data() const
Indicates to the runtime that it may discard the current logical contents of this array_view. This is an optimization hint to
the runtime used to avoid copying the current contents of the view to a target accelerator_view, and its use is
recommended if the existing content is not needed.
5.2.4 Indexing
T& array_view<T,N>::operator[](const index<N>& idx) const restrict(amp,cpu)
T& array_view<T,N>::operator()(const index<N>& idx) const restrict(amp,cpu)
Returns a reference to the element of this array_view that is at the location in N-dimensional space specified by idx.
Parameters:
idx
An object of type index<N> that specifies the location of the element.
const T& array_view<const T,N>::operator[](const index<N>& idx) const restrict(amp,cpu)
const T& array_view<const T,N>::operator()(const index<N>& idx) const restrict(amp,cpu)
Returns a const reference to the element of this array_view that is at the location in N-dimensional space specified by
idx.
Parameters:
idx
An object of type index<N> that specifies the location of the element.
T& array_view<T,1>::operator()(int i0) const restrict(amp,cpu)
T& array_view<T,1>::operator[](int i0) const restrict(amp,cpu)
T& array_view<T,2>::operator()(int i0, int i1) const restrict(amp,cpu)
T& array_view<T,3>::operator()(int i0, int i1, int i2) const restrict(amp,cpu)
const T& array_view<const T,1>::operator()(int i0) const restrict(amp,cpu)
const T& array_view<const T,2>::operator()(int i0, int i1) const restrict(amp,cpu)
const T& array_view<const T,3>::operator()(int i0, int i1, int i2) const restrict(amp,cpu)
Equivalent to array_view<T,N>::operator()(index<N>(i0 [, i1 [, i2 ]])) const.
Parameters:
i0 [, i1 [, i2 ] ]
The component values that will form the index into this array.
array_view<T,N-1> array_view<T,N>::operator[](int i0) const restrict(amp,cpu)
array_view<const T,N-1> array_view<const T,N>::operator[](int i0) const restrict(amp,cpu)
This overload is defined for array_view<T,N> where N ≥ 2.
This mode of indexing is equivalent to projecting on the most-significant dimension. It allows C-style indexing. For
example:
array<float,4> myArray(myExtents, ...);
myArray[index<4>(5,4,3,2)] = 7;
assert(myArray[5][4][3][2] == 7);
Parameters:
i0
An integer that is the index into the most-significant dimension of this array_view.
Return Value:
Returns an array_view whose dimension is one lower than that of this array_view.
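The equivalence between indexing with a full index<4> and chaining four projections can be checked with ordinary arithmetic. The following is a minimal sketch in standard C++, not C++ AMP code: it assumes a row-major layout in which dimension 0 is the most significant, and the helpers linear_offset and projected_offset are hypothetical names introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of C++ AMP): row-major offset of a rank-4
// index within the given extents, computed in one pass.
std::size_t linear_offset(const std::array<int, 4>& ext,
                          const std::array<int, 4>& idx) {
    std::size_t off = 0;
    for (int d = 0; d < 4; ++d) {
        assert(idx[d] >= 0 && idx[d] < ext[d]);
        off = off * ext[d] + idx[d];
    }
    return off;
}

// Hypothetical helper: offset reached by peeling off one dimension at a
// time, the way chained operator[] calls project on the most-significant
// dimension of the remaining view.
std::size_t projected_offset(const std::array<int, 4>& ext,
                             const std::array<int, 4>& idx) {
    std::size_t stride = 1;
    for (int d = 0; d < 4; ++d) stride *= ext[d];  // elements in the full view
    std::size_t base = 0;                          // start of the current sub-view
    for (int d = 0; d < 4; ++d) {
        stride /= ext[d];         // elements in one slice along dimension d
        base += idx[d] * stride;  // projection skips idx[d] whole slices
    }
    return base;
}
```

Under these assumptions both helpers return the same offset for any in-range index, which is why myArray[5][4][3][2] and myArray[index<4>(5,4,3,2)] name the same element.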
5.2.5 View Operations
Example:
array<float,2> a(extent<2>(200,100));
array_view<float,2> v1(a); // v1.extent = <200,100>
array_view<float,2> v2 = v1.section(index<2>(15,25), extent<2>(40,50));
assert(v2(0,0) == v1(15,25));
Parameters:
idx
Provides the origin of the resulting section.
ext
Provides the extent of the resulting section.
Return Value:
Returns a subsection of the source array at specified origin, and with the specified extent.
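The example above relies on a section sharing storage with its source and translating indices by its origin. A minimal sketch of that idea in standard C++ follows. View2 is a hypothetical stand-in written for this illustration, not the C++ AMP array_view type, and it assumes row-major storage with a fixed row pitch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a rank-2 view: a base pointer, the row pitch of
// the underlying storage, an origin, and an extent.
struct View2 {
    float* base;
    int pitch;             // elements per row in the underlying storage
    int origin0, origin1;  // where this view starts inside the storage
    int extent0, extent1;  // size of this view

    // Index relative to this view's origin.
    float& operator()(int i0, int i1) const {
        assert(i0 >= 0 && i0 < extent0 && i1 >= 0 && i1 < extent1);
        return base[(origin0 + i0) * pitch + (origin1 + i1)];
    }

    // A section is another view over the same storage with a shifted origin.
    View2 section(int i0, int i1, int e0, int e1) const {
        return View2{base, pitch, origin0 + i0, origin1 + i1, e0, e1};
    }
};
```

With this model, element (0,0) of a section taken at origin (15,25) is the same storage location as element (15,25) of the source view, and a write through either view is visible through the other.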
array_view<T,N> array_view<T,N>::section(const index<N>& idx) const restrict(amp,cpu)
array_view<const T,N> array_view<const T,N>::section(const index<N>& idx) const
restrict(amp,cpu)
Equivalent to section(idx, this->extent - idx).
array_view<T,N> array_view<T,N>::section(const extent<N>& ext) const restrict(amp,cpu)
array_view<const T,N> array_view<const T,N>::section(const extent<N>& ext) const
restrict(amp,cpu)
Equivalent to section(index<N>(), ext).
array_view<T,1> array_view<T,1>::section(int i0, int e0) const restrict(amp,cpu)
array_view<const T,1> array_view<const T,1>::section(int i0, int e0) const restrict(amp,cpu)
array_view<T,2> array_view<T,2>::section(int i0, int i1, int e0, int e1) const
restrict(amp,cpu)
array_view<const T,2> array_view<const T,2>::section(int i0, int i1,
int e0, int e1) const restrict(amp,cpu)
array_view<T,3> array_view<T,3>::section(int i0, int i1, int i2,
int e0, int e1, int e2) const restrict(amp,cpu)
array_view<const T,3> array_view<const T,3>::section(int i0, int i1, int i2,
int e0, int e1, int e2) const restrict(amp,cpu)
Equivalent to section(index<N>(i0 [, i1 [, i2 ]]), extent<N>(e0 [, e1 [, e2 ]])).
Parameters:
i0 [, i1 [, i2 ] ]
The component values that will form the origin of the section
e0 [, e1 [, e2 ] ]
The component values that will form the extent of the section
template<typename ElementType>
array_view<ElementType,1> array_view<T,1>::reinterpret_as() const restrict(amp,cpu)
template<typename ElementType>
array_view<const ElementType,1> array_view<const T,1>::reinterpret_as() const
restrict(amp,cpu)
This member function is similar to array<T,N>::reinterpret_as (see 5.1.5), although it only supports array_views of
rank 1 (only those guarantee that all elements are laid out contiguously).
The size of the reinterpreted ElementType must evenly divide into the total size of this array_view.
Return Value:
Returns an array_view from this array_view<T,1> with the element type reinterpreted from T to ElementType.
template <int K>
array_view<T,K> array_view<T,1>::view_as(extent<K> viewExtent) const restrict(amp,cpu)
template <int K>
array_view<const T,K> array_view<const T,1>::view_as(extent<K> viewExtent) const
restrict(amp,cpu)
This member function is similar to array<T,N>::view_as (see 5.1.5), although it only supports array_views of rank 1
(only those guarantee that all elements are laid out contiguously).
Return Value:
Returns an array_view from this array_view<T,1> with the rank changed to K from 1.
5.3 Copying Data
C++ AMP offers a universal copy function which covers all synchronous data transfer requirements. In all cases, copying data
is not supported while executing on an accelerator (in other words, the copy functions do not have a restrict(amp) clause).
The general form of copy is:
copy(src, dest);
Informative: Note that this more closely follows the STL convention (destination is the last argument, as in std::copy) and is
opposite of the C-style convention (destination is the first argument, as in memcpy).
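The two argument orders mentioned in the note can be seen side by side in standard C++. This is only an illustration of the conventions, assuming two vectors of equal size; copy_both_ways is a hypothetical helper and does not involve the C++ AMP copy function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Copies src into dest twice to contrast the two argument orders.
void copy_both_ways(const std::vector<int>& src, std::vector<int>& dest) {
    assert(src.size() == dest.size());
    // STL convention: source range first, destination last.
    std::copy(src.begin(), src.end(), dest.begin());
    // C convention: destination first, source second.
    std::memcpy(dest.data(), src.data(), src.size() * sizeof(int));
}
```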
Copying to array and array_view types is supported from the following sources:
An array or array_view with the same rank and element type as the destination array or array_view.
A standard container whose element type is the same as the destination array or array_view.
Informative: Containers that expose .size() and .data() members (e.g., std::vector, and std::array) can be handled more
efficiently.
5.3.1 Synopsis
5.3.2
int N>
srcEnd, array<T,N>& dest);
int N>
srcEnd, const array_view<T,N>& dest);
The contents of src are copied into dest. The source and destination may reside on different accelerators. If the
extents of src and dest don't match, a runtime exception is thrown.
Parameters:
src
An object of type array<T,N> to be copied from.
dest
An object of type array<T,N> to be copied to.
template <typename T, int N>
void copy(const array<T,N>& src, const array_view<T,N>& dest)
template <typename T, int N>
completion_future copy_async(const array<T,N>& src, const array_view<T,N>& dest)
The contents of src are copied into dest. If the extents of src and dest don't match, a runtime exception is
thrown.
Parameters:
src
An object of type array<T,N> to be copied from.
dest
An object of type array_view<T,N> to be copied to.
template <typename T, int N>
void copy(const array_view<const T,N>& src, array<T,N>& dest)
template <typename T, int N>
void copy(const array_view<T,N>& src, array<T,N>& dest)
template <typename T, int N>
completion_future copy_async(const array_view<const T,N>& src, array<T,N>& dest)
template <typename T, int N>
completion_future copy_async(const array_view<T,N>& src, array<T,N>& dest)
The contents of src are copied into dest. If the extents of src and dest don't match, a runtime exception is
thrown.
Parameters:
src
An object of type array_view<T,N> (or array_view<const T,N>) to be copied from.
dest
An object of type array<T,N> to be copied to.
template <typename T, int N>
void copy(const array_view<const T,N>& src, const array_view<T,N>& dest)
template <typename T, int N>
completion_future copy_async(const array_view<const T,N>& src, const array_view<T,N>& dest)
The contents of src are copied into dest. If the extents of src and dest don't match, a runtime exception is
thrown.
Parameters:
src
An object of type array_view<T,N> (or array_view<const T,N>) to be copied from.
dest
An object of type array_view<T,N> to be copied to.
5.3.3
A standard container can be copied into an array or array_view by specifying an iterator range.
Informative: Standard containers that provide .size() and .data() members (such as std::vector and std::array) can be
handled very efficiently.
template <typename InputIter, typename T, int N>
void copy(InputIter srcBegin, InputIter srcEnd, array<T,N>& dest)
template <typename InputIter, typename T, int N>
void copy(InputIter srcBegin, array<T,N>& dest)
template <typename InputIter, typename T, int N>
completion_future copy_async(InputIter srcBegin, InputIter srcEnd, array<T,N>& dest)
template <typename InputIter, typename T, int N>
completion_future copy_async(InputIter srcBegin, array<T,N>& dest)
The contents of a source container from the iterator range [srcBegin,srcEnd) are copied into dest. If the number of
elements in the iterator range is not equal to dest.extent.size(), an exception is thrown.
In the overloads which don't take an end-iterator it is assumed that the source iterator is able to provide at least
dest.extent.size() elements, but no checking is performed (nor possible).
Parameters:
srcBegin
An iterator to the first element of a source container.
srcEnd
An iterator to the end of a source container.
dest
An object of type array<T,N> to be copied to.
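The size check described for the two-iterator overloads can be pictured with a small standard C++ sketch. It is not the C++ AMP implementation: checked_copy is a hypothetical helper, a std::vector stands in for the destination array, std::length_error stands in for whatever exception type the runtime throws, and the iterators are assumed to be at least forward iterators so that the range can be measured before it is copied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the checked overload: copies [first, last) into
// dest only if the element count equals the destination size.
template <typename Iter, typename T>
void checked_copy(Iter first, Iter last, std::vector<T>& dest) {
    const auto count = std::distance(first, last);
    assert(count >= 0);  // a valid range never has negative length
    if (static_cast<std::size_t>(count) != dest.size()) {
        throw std::length_error("source range size does not match destination");
    }
    std::copy(first, last, dest.begin());
}
```

The overloads without an end iterator have nothing to measure, which is why the text states that no checking is possible for them.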
template <typename InputIter, typename T, int N>
void copy(InputIter srcBegin, InputIter srcEnd, const array_view<T,N>& dest)
template <typename InputIter, typename T, int N>
void copy(InputIter srcBegin, const array_view<T,N>& dest)
Dest
5.3.4
An array or array_view can be copied into a standard container by specifying the begin iterator. Standard containers that
provide .size() and .data() members (such as std::vector and std::array) can be handled very
efficiently.
template <typename OutputIter, typename T, int N>
void copy(const array<T,N>& src, OutputIter destBegin)
template <typename OutputIter, typename T, int N>
completion_future copy_async(const array<T,N>& src, OutputIter destBegin)
The contents of a source array are copied into the destination container starting at iterator destBegin. If the number of elements in the
range starting at destBegin in the destination container is smaller than src.extent.size(), an exception is thrown.
Parameters:
src
An object of type array<T,N> to be copied from.
destBegin
An output iterator addressing the position of the first element in the destination container.
template <typename OutputIter, typename T, int N>
void copy(const array_view<T,N>& src, OutputIter destBegin)
template <typename OutputIter, typename T, int N>
completion_future copy_async(const array_view<T,N>& src, OutputIter destBegin)
The contents of a source array_view are copied into the destination container starting at iterator destBegin. If the number of elements in the
range starting at destBegin in the destination container is smaller than src.extent.size(), an exception is thrown.
Parameters:
src
An object of type array_view<T,N> to be copied from.
destBegin
An output iterator addressing the position of the first element in the destination container.
6.1
Atomic Operations
C++ AMP provides a set of atomic operations in the concurrency namespace. These operations are applicable in
restrict(amp) contexts and may be applied to memory locations within concurrency::array instances and to memory
locations within tile_static variables. Section 8 provides a full description of the C++ AMP memory model and how atomic
operations fit into it.
Synopsis
6.2
bool atomic_compare_exchange(int * dest, int * expected_val, int val) restrict(amp)
bool atomic_compare_exchange(unsigned int * dest, unsigned int * expected_val, unsigned int
val) restrict(amp)
These functions attempt to perform the following three steps atomically:
1. Read the value stored in the location pointed to by dest.
2. Compare the value read in the previous step with the value contained in the location pointed to by expected_val.
3. Carry out one of the following operations, depending on the result of the comparison in the previous step:
a. If the values are identical, then the function tries to atomically change the value pointed to by dest to the
value in val. The function indicates by its return value whether this transformation has been successful
or not.
b. If the values are not identical, then the function stores the value read in step (1) into the location
pointed to by expected_val, and returns false.
In terms of sequential semantics, the function is equivalent to the following pseudo-code:
auto t = *dest;
bool eq = t == *expected_val;
if (eq)
*dest = val;
*expected_val = t;
return eq;
The function may fail spuriously. It is guaranteed that the system as a whole will make progress when threads are
contending to atomically modify a variable, but there is no upper bound on the number of failed attempts that any
particular thread may experience.
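Because the operation may fail spuriously, it is normally wrapped in a retry loop. The sketch below shows the pattern on the host with std::atomic, whose compare_exchange_weak has comparable semantics (it may fail spuriously and writes the observed value back into the expected variable on failure). It is an analogue under those assumptions, not restrict(amp) code, and atomic_max is a hypothetical helper.

```cpp
#include <atomic>

// Host-side analogue of a compare-exchange retry loop.
// Raises target to at least value and returns the value observed just
// before the final decision.
int atomic_max(std::atomic<int>& target, int value) {
    int expected = target.load();
    // On failure, compare_exchange_weak refreshes expected with the current
    // contents of target, so the condition is re-tested with fresh data.
    while (expected < value &&
           !target.compare_exchange_weak(expected, value)) {
        // Retry: either another thread changed target or the failure was spurious.
    }
    return expected;
}
```

The loop terminates either when the exchange succeeds or when the stored value is already large enough, so a spurious failure only costs one more iteration.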
Parameters:
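dest
A pointer to the location which needs to be atomically modified. The
location may reside within a concurrency::array or within a tile_static
variable.
expected_val
A pointer to the value that the caller expects to find at dest. If the comparison fails, the value read from dest is stored here.
val
The value to store at the location pointed to by dest if the comparison succeeds.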
Return value:
The return value indicates whether the function has been successful in atomically reading, comparing and modifying the
contents of the memory location.
6.3
Parameters:
Dst
val
Return value:
These functions return the old value which was previously stored at dst, and that was atomically replaced. These
functions always succeed.
int atomic_fetch_inc(int * dest) restrict(amp)
unsigned int atomic_fetch_inc(unsigned int * dest) restrict(amp)
int atomic_fetch_dec(int * dest) restrict(amp)
unsigned int atomic_fetch_dec(unsigned int * dest) restrict(amp)
Atomically increment or decrement the value stored at the location pointed to by dest.
Parameters:
dest
A pointer to the location which needs to be atomically modified. The
location may reside within a concurrency::array or within a tile_static
variable.
Return value:
These functions return the old value which was previously stored at dest, and that was atomically replaced. These
functions always succeed.
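Returning the old value is what makes a fetch-and-increment useful for reserving slots: each caller receives a distinct index. The following host-side sketch uses std::atomic::fetch_add as an analogue of atomic_fetch_inc. The helper compact_positive is hypothetical, and the loop runs serially here, whereas on an accelerator each iteration would be a separate thread and the output order would not be deterministic.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical helper: writes the positive inputs into out and returns how
// many were written. Each kept element reserves its output slot by
// incrementing a shared counter and using the value returned.
std::size_t compact_positive(const std::vector<int>& in, std::vector<int>& out) {
    std::atomic<int> next{0};
    out.assign(in.size(), 0);
    for (int v : in) {
        if (v > 0) {
            const int slot = next.fetch_add(1);  // old value is the reserved slot
            out[static_cast<std::size_t>(slot)] = v;
        }
    }
    out.resize(static_cast<std::size_t>(next.load()));
    return out.size();
}
```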
Developers using C++ AMP will use a form of parallel_for_each() to launch data-parallel computations on accelerators. The
behavior of parallel_for_each is similar to that of std::for_each: execute a function for each element in a range. The C++
AMP specializations over ranges of type extent and tiled_extent allow execution of functions on accelerators.
The parallel_for_each function takes the following general forms:
1. Non-tiled:
template <int N, typename Kernel>
void parallel_for_each(extent<N> compute_domain, const Kernel& f);
2. Tiled:
template <int D0, int D1, int D2, typename Kernel>
void parallel_for_each(tiled_extent<D0,D1,D2> compute_domain, const Kernel& f);
template <int D0, int D1, typename Kernel>
void parallel_for_each(tiled_extent<D0,D1> compute_domain, const Kernel& f);
template <int D0, typename Kernel>
void parallel_for_each(tiled_extent<D0> compute_domain, const Kernel& f);
1. Non-tiled:
template <int N, typename Kernel>
void parallel_for_each(const accelerator_view& accl_view,
extent<N> compute_domain, const Kernel& f);
2. Tiled:
template <int D0, int D1, int D2, typename Kernel>
void parallel_for_each(const accelerator_view& accl_view,
tiled_extent<D0,D1,D2> compute_domain, const Kernel& f);
template <int D0, int D1, typename Kernel>
void parallel_for_each(const accelerator_view& accl_view,
tiled_extent<D0,D1> compute_domain, const Kernel& f);
template <int D0, typename Kernel>
void parallel_for_each(const accelerator_view& accl_view,
tiled_extent<D0> compute_domain, const Kernel& f);
A parallel_for_each over an extent represents a dense loop nest of independent serial loops.
When parallel_for_each executes, a parallel activity is spawned for each index in the compute domain. Each parallel activity
is associated with an index value. (This index is an index<N> in the case of a non-tiled parallel_for_each, or a
tiled_index<D0,D1,D2> in the case of a tiled parallel_for_each.) A parallel activity typically uses its index to access the
appropriate locations in the input/output arrays.
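The "dense loop nest" reading of a non-tiled parallel_for_each can be sketched with a serial stand-in in standard C++. This is an illustration only: serial_for_each and add_grids are hypothetical names, std::array<int, 2> stands in for extent<2> and index<2>, and the fixed row-major visiting order is an artifact of the sketch, since a real implementation gives no ordering guarantee.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for a non-tiled rank-2 parallel_for_each:
// invokes f exactly once for every index in the compute domain.
template <typename Kernel>
void serial_for_each(std::array<int, 2> extent, const Kernel& f) {
    for (int i0 = 0; i0 < extent[0]; ++i0) {
        for (int i1 = 0; i1 < extent[1]; ++i1) {
            f(std::array<int, 2>{i0, i1});
        }
    }
}

// Example kernel use: element-wise sum over a rows x cols grid stored in
// row-major order. Each activity touches only the location named by its index.
std::vector<int> add_grids(const std::vector<int>& a, const std::vector<int>& b,
                           int rows, int cols) {
    std::vector<int> c(a.size());
    serial_for_each({rows, cols}, [&](std::array<int, 2> idx) {
        const std::size_t k = static_cast<std::size_t>(idx[0] * cols + idx[1]);
        c[k] = a[k] + b[k];
    });
    return c;
}
```

Because every activity writes a different element, the result does not depend on the order in which the indices are visited, which is the property a real kernel must preserve.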
A call to parallel_for_each behaves as if it were synchronous. In practice, the call may be asynchronous because it executes
on a separate device, but since data copy-out is a synchronizing event, the developer cannot tell the difference.
There are no guarantees on the order and concurrency of the parallel activities spawned by the non-tiled parallel_for_each.
Thus it is not valid to assume that one activity can wait for another sibling activity to complete for itself to make progress.
This is discussed in further detail in section 8.
The tiled version of parallel_for_each organizes the parallel activities into fixed-size tiles of 1, 2, or 3 dimensions, as given by
the tiled_extent<> argument. The tiled_extent provided as the first parameter to parallel_for_each must be divisible, along
each of its dimensions, by the respective tile extent. Tiling beyond 3 dimensions is not supported. Threads (parallel
activities) in the same tile have access to shared tile_static memory, and can use tiled_index::barrier.wait (4.5.3) to
synchronize access to it.
When launching an amp-restricted kernel, the implementation of tiled parallel_for_each will provide the following
minimum capabilities:
The maximum number of tiles per dimension will be no less than 65535.
The maximum number of threads in a tile will be no less than 1024.
o In 3D tiling, the maximal value of D0 will be no less than 64.
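The divisibility rule and the minimum capabilities above can be combined into a conservative pre-launch check. The sketch below is standard C++ with a hypothetical helper name, is_portable_tiled_launch. It treats the guaranteed minimums as the only limits a portable program may rely on, so an implementation that supports more would accept launches that this check rejects.

```cpp
#include <array>

// Hypothetical validator for a 3D tiled launch. extent holds the compute
// domain and tile holds D0, D1, D2. Only the minimum guarantees quoted in
// the text are used: 65535 tiles per dimension, 1024 threads per tile, and
// D0 no larger than 64 in 3D tiling.
bool is_portable_tiled_launch(const std::array<int, 3>& extent,
                              const std::array<int, 3>& tile) {
    long long threads_per_tile = 1;
    for (int d = 0; d < 3; ++d) {
        if (extent[d] <= 0 || tile[d] <= 0) return false;
        if (extent[d] % tile[d] != 0) return false;     // must divide evenly
        if (extent[d] / tile[d] > 65535) return false;  // tiles per dimension
        threads_per_tile *= tile[d];
    }
    return threads_per_tile <= 1024 && tile[0] <= 64;
}
```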
Microsoft-specific:
When launching an amp-restricted kernel, the tiled parallel_for_each provides the above portable guarantees and no more,
i.e.:
The maximum number of tiles per dimension is 65535.
The maximum number of threads in a tile is 1024.
o In 3D tiling, the maximum value supported for D0 is 64.
The execution behind the parallel_for_each occurs on a certain accelerator, in the context of a certain accelerator view. This
accelerator view may be passed explicitly to parallel_for_each (as an optional first argument). Otherwise, the target
accelerator, and the view through which work is submitted to that accelerator, are chosen from the objects of type array<T,N> and
texture<T> that were captured in the kernel lambda. An implementation may require that all arrays and textures captured
in the lambda be on the same accelerator view; if they are not, an implementation is free to throw an exception. An
implementation may also arrange for the specified data to be accessible on the selected accelerator view, rather than reject
the call.
Microsoft-specific: the Microsoft implementation of C++ AMP requires that all array and texture objects are co-located on the same accelerator view which is used, implicitly or explicitly, in a parallel_for_each call.
If the parallel_for_each kernel functor does not capture an array/texture object and the target accelerator_view
for the kernel's execution is not explicitly specified, the runtime is allowed to execute the kernel on any accelerator_view on the
default accelerator.
Microsoft-specific: In such a scenario, the Microsoft implementation of C++ AMP selects the target
accelerator_view for executing the parallel_for_each kernel as follows:
a. Determine the set of accelerator_views where ALL array_views referenced in the p_f_e kernel have cached copies.
b. From the above set, filter out any accelerator_views that are not on the default accelerator. Additionally filter out accelerator_views that do not have the capabilities required by the p_f_e kernel (debug intrinsics, number of UAVs).
c. The default accelerator_view of the default accelerator is selected as the target if the resultant set from b. is empty or contains that accelerator_view. Otherwise, any accelerator_view from the resultant set from b. is arbitrarily selected as the target.
The tiled_index<> argument passed to the kernel contains a collection of indices including those that are relative to the
current tile.
The argument f of template-argument type Kernel to the parallel_for_each function must be a lambda or functor offering an
appropriate function call operator which the implementation of parallel_for_each invokes with the instantiated index type.
To execute on an accelerator, the function call operator must be marked restrict(amp) (but may have additional restrictions),
and it must be callable from a caller passing in the instantiated index type. Overload resolution is handled as if the caller
contained this code:
template <typename IndexType, typename Kernel>
void parallel_for_each_stub(IndexType i, const Kernel& f) restrict(amp)
{
f(i);
}
Where the Kernel f argument is the same one passed into parallel_for_each by the caller, and the index instance i is the thread
identifier, where IndexType is the following type:
Non-Tiled parallel_for_each: index<N>, where N must be the same rank as the extent<N> used in the
parallel_for_each.
Tiled parallel_for_each: tiled_index<D0 [, D1 [, D2]]>, where the tile extents must match those of the tiled_extent
used in the parallel_for_each.
An implementation can employ whole-program compilation (such as link-time code-gen) to achieve this.
7.1
7.2
1.
2.
3.
4.
Correctly synchronized C++ AMP programs are correctly synchronized C++ programs which also adhere to a few additional
C++ AMP rules, as follows:
Exception Behaviour
If an error occurs while trying to launch the parallel_for_each, an exception will be thrown. Exceptions can be thrown for the
following reasons:
Since the kernel function object does not take any other arguments, all other data operated on by the kernel, other than
the thread index, must be captured in the lambda or function object passed to parallel_for_each. The function object shall
be any amp-compatible class, struct or union type, including those introduced by lambda expressions.
1.
2.
8.1
Accelerator-side execution
a. Concurrency rules for arbitrary sibling threads launched by a parallel_for_each call.
b. Semantics and correctness of tile barriers.
c. Semantics of atomic and memory fence operations.
Host-side execution
a. Concurrency of accesses to C++ AMP containers between host-side operations: copy, synchronize,
parallel_for_each and the application of the various subscript operators of arrays and array views on the
host.
b. Accessing arrays or array_view data on the host.
In this section we will consider the relationship between sibling threads in a single parallel_for_each call. Interaction between
separate parallel_for_each calls, copy operations and other host-side operations will be considered in the following subsections.
A parallel_for_each call logically initiates the operation of multiple sibling threads, one for each coordinate in the extent or
tiled_extent passed to it.
All the threads launched by a parallel_for_each are potentially concurrent. Unless barriers are used, an implementation is
free to schedule these threads in any order. In addition, the memory model for normal memory accesses is weak; that is,
operations may be arbitrarily reordered as long as each thread perceives its own operations to execute in program order. Thus any
two memory operations from any two threads in a parallel_for_each are by default concurrent, unless the application has
explicitly enforced an order between these two operations using atomic operations, fences or barriers.
Conversely, an implementation may also schedule only a single logical thread at a time, in a non-cooperative manner, i.e.,
without letting any other threads make any progress, with the exception of hitting a tile barrier or terminating. When a
thread encounters a tile barrier, an implementation must wrest control from that thread and provide progress to some other
thread in the tile until they all have reached the barrier. Similarly, when a thread finishes execution, the system is obligated
to execute steps from some other thread. Thus an implementation is obligated to switch context between threads only when
a thread has hit a barrier (barriers pertain just to the tiled parallel_for_each), or is finished. An implementation doesn't have
to admit any concurrency at a finer level than that which is dictated by barriers and thread termination. All implementations,
however, are obligated to ensure progress is continually made, until all threads launched by a parallel_for_each are
completed.
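Informative: The minimal scheduling obligation described above can be sketched on the host in standard C++. The sketch below runs one logical thread at a time and switches threads only where a tile barrier would be, so every thread finishes the code before a barrier before any thread runs the code after it. The names phase, run_tile and tile_sum are hypothetical, and a std::vector plays the role of tile_static memory; this is an illustration of the rule, not how an implementation is built.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A phase is the code a thread runs between two consecutive tile barriers.
// The callable receives the thread's local id within the tile.
using phase = std::function<void(int)>;

// Runs one logical thread at a time, switching threads only at barriers:
// every thread completes phase k before any thread starts phase k + 1.
inline void run_tile(int tile_size, const std::vector<phase>& phases) {
    assert(tile_size > 0);
    for (const phase& p : phases) {
        for (int t = 0; t < tile_size; ++t) {
            p(t);  // thread t runs until it reaches the (implicit) barrier
        }
    }
}

// Usage: each thread publishes a value, then thread 0 sums them after the barrier.
inline int tile_sum(int tile_size) {
    std::vector<int> shared(tile_size, 0);  // stands in for tile_static memory
    int total = 0;
    std::vector<phase> phases;
    phases.push_back([&](int t) { shared[t] = t + 1; });
    phases.push_back([&](int t) {
        if (t == 0) {
            for (int v : shared) total += v;
        }
    });
    run_tile(tile_size, phases);
    return total;
}
```

Because thread 0 only reads shared after every thread has passed the barrier, the result does not depend on the order in which threads are run inside a phase.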
An immediate corollary is that C++ AMP doesn't provide a mechanism by which a thread could, without using tile barriers,
poll for a change which needs to be effected by another thread. In particular, C++ AMP doesn't support locks which are
implemented using atomic operations and fences, since a thread could end up polling forever, waiting for a lock to become
available. The usage of tile barriers allows for creating a limited form of locking scoped to a thread tile. For example:
void tile_lock_example()
{
    parallel_for_each(
        extent<1>(TILE_SIZE).tile<TILE_SIZE>(),
        [] (tiled_index<TILE_SIZE> tidx) restrict(amp)
        {
            tile_static int lock;
            // Initialize lock:
            if (tidx.local[0] == 0) lock = 0;
            tidx.barrier.wait();
            bool performed_my_exclusive_work = false;
            for (;;) {
                // try to acquire the lock
                if (!performed_my_exclusive_work && atomic_compare_exchange(&lock, 0, 1)) {
                    // The lock has been acquired - mutual exclusion from the rest of the threads in the tile
                    // is provided here....
                    some_synchronized_op();
                    // Release the lock
                    atomic_exchange(&lock, 0);
                    performed_my_exclusive_work = true;
                }
                else {
                    // The lock wasn't acquired, or we are already finished. Perhaps we can do something
                    // else in the meanwhile.
                    some_non_exclusive_op();
                }
                // The tile barrier ensures progress, so threads can spin in the for loop until they
                // are successful in acquiring the lock.
                tidx.barrier.wait();
            }
        });
}
Informative: More often than not, such non-deterministic locking within a tile is not really necessary, since a static schedule
of the threads based on integer thread IDs is possible and results in more efficient and more maintainable code, but we
bring this example here for completeness and to illustrate a valid form of polling.
Informative: This requirement, however, is typically not sufficient in order to allow for efficient implementations. For example,
it allows for the call stack of threads to differ, when they hit a barrier. In order to be able to generate good quality code for
vector targets, much stronger constraints should be placed on the usage of barriers, as explained below.
C++ AMP requires all active control flow expressions leading to a tile barrier to be tile-uniform. Active control flow expressions
are those guarding the scopes of all control flow constructs and logical expressions, which are actively being executed at a
time a barrier is called. For example, the condition of an if statement is an active control flow expression as long as either
the true or false branches of the if statement are still executing. If either of those branches contains a tile barrier, or leads to one
through an arbitrary nesting of scopes and function calls, then the control flow expression controlling the if statement must
be tile-uniform. What follows is an exhaustive list of control flow constructs which may lead to a barrier and their
corresponding control expressions:
All active control flow constructs are strictly nested in accordance with the program's text, starting from the scope of the lambda
at the parallel_for_each all the way to the scope containing the barrier.
C++ AMP requires that, when a barrier is encountered by one thread:
1. The same barrier will be encountered by all other threads in the tile.
2. The sequence of active control flow statements and/or expressions is identical for all threads when they reach
   the barrier.
3. Each of the corresponding control expressions is tile-uniform (which is defined below).
4. No active control flow statement or expression has been departed (necessarily in a non-uniform fashion) by
   a break, continue or return statement. That is, any breaking statement which instructs the program to leave an
   active scope must in itself behave as if it was a barrier, i.e., adhere to these preceding rules.
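Informative: The intent of the tile-uniformity requirement can be illustrated with a small host-side check in standard C++. The sketch below evaluates a control expression for every local thread id of one tile and reports whether all threads obtained the same value. The name is_tile_uniform is hypothetical. This is a dynamic check of a single tile execution, written under the assumption that the expression depends only on the local id and on captured values; it is not the formal definition.

```cpp
#include <cassert>
#include <functional>

// Returns true if the control expression evaluates to the same value for
// every thread of the tile (local ids 0 .. tile_size - 1).
inline bool is_tile_uniform(int tile_size,
                            const std::function<bool(int)>& control_expr) {
    assert(tile_size > 0);
    const bool first = control_expr(0);
    for (int t = 1; t < tile_size; ++t) {
        if (control_expr(t) != first) {
            return false;  // at least two threads disagree: not tile-uniform
        }
    }
    return true;
}
```

A condition that depends only on a per-tile quantity passes the check, while a condition such as "local id less than 4" does not, so a barrier guarded by the latter would violate the rules above.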
Informally, a tile-uniform expression is an expression only involving variables, literals and function calls which have a uniform
value throughout the tile. Formally, C++ AMP specifies that:

1. The types of memory which are potentially accessed concurrently by different threads. The memory type can be:
   a. Global memory
   b. Tile-static memory
2. The relationship between the threads which could potentially access the same piece of memory. They could be:
   a. Within the same thread tile
   b. Within separate thread tiles or sibling threads in the basic (non-tiled) parallel_for_each model.
3. Memory operations which the program contains:
   a. Normal memory reads and writes.
   b. Atomic read-modify-write operations.
   c. Memory fences and barriers
Informally, the C++ AMP memory model is a weak memory model consistent with the C++ memory model, with the following
exceptions:
1. Atomic operations do not necessarily create a sequentially consistent subset of execution. Atomic operations are
   only coherent, not sequentially consistent. That is, there doesn't necessarily exist a global linear order containing all
   atomic operations affecting all memory locations which were subjects of such operations. Rather, a separate global
   order exists for each memory location, and these per-location memory orders are not necessarily combinable into a
   single global order. (Note: this means an atomic operation does not constitute a memory fence.)
2. Memory fence operations are limited in their effects to the thread tile they are performed within. When a thread
   from tile A executes a fence, the fence operation doesn't necessarily affect any other thread from any tile other than
   A.
3. As a result of (1) and (2), the only mechanism available for cross-tile communication is atomic operations, and even
   when atomic operations are concerned, a linear order is only guaranteed to exist on a per-location basis, but not
   necessarily globally.
4. Fences are bi-directional, meaning they have both acquire and release semantics.
5. Fences can also be further scoped to a particular memory type (global vs. tile-static).
6. Applying normal stores and atomic operations concurrently to the same memory location results in undefined
   behavior.
7. Applying a normal load and an atomic operation concurrently to the same memory location is allowed (i.e., results
   in defined behavior).
We will now provide a more formal characterization of the different categories of programs based on their adherence to
synchronization rules. The three classes of adherence are:
1. barrier-incorrect programs,
2. racy programs, and
3. correctly-synchronized programs.
8.1.2.1 Barrier-incorrect programs
A barrier-incorrect program is a program which doesn't adhere to the correct barrier usage rules specified in the previous
section. Such programs always have undefined behavior. The remainder of this section discusses barrier-correct programs
only.
8.1.2.2
Two memory operations applied to the same (or overlapping) memory location are compatible if they are both aligned and
have the same data width, and either both operations are reads, or both operations are atomic, or one operation is a read
and the other is atomic.
This is summarized by the following table in which T1 is a thread executing Op1 and T2 is a thread executing operation Op2.
Op1       Op2       Compatible?
Atomic    Atomic    Yes
Read      Read      Yes
Read      Atomic    Yes
Write     Any       No
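Informative: The compatibility table can be encoded as a small predicate. The following standard C++ sketch assumes both operations are aligned and have the same data width, and classifies each operation as a read, a normal write or an atomic operation. The names mem_op, compatible and check_table are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Kinds of memory operation distinguished by the table.
enum class mem_op { read, write, atomic };

// Two aligned, same-width operations on the same location are compatible
// exactly when neither of them is a normal (non-atomic) write.
constexpr bool compatible(mem_op a, mem_op b) {
    return a != mem_op::write && b != mem_op::write;
}

// Spot checks mirroring the rows of the table.
inline void check_table() {
    assert(compatible(mem_op::atomic, mem_op::atomic));
    assert(compatible(mem_op::read, mem_op::read));
    assert(compatible(mem_op::read, mem_op::atomic));
    assert(!compatible(mem_op::write, mem_op::read));
    assert(!compatible(mem_op::write, mem_op::write));
    assert(!compatible(mem_op::write, mem_op::atomic));
}
```

The predicate is symmetric, which reflects that the table does not depend on which thread is called T1 and which T2.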
8.1.2.3
1. Adherence to program order: For each Ti, S respects the fences performed by Ti. That is, any operation performed
   by Ti before Ti performed fence Fj appears strictly before Fj in S, and similarly any operation performed by Ti after Fj
   appears strictly after Fj in S.
2. Self-consistency: For i<j, let Mi be a subset containing at least one store (atomic or non-atomic) into location L and
   let Mj be a subset containing at least a single load of L, and no stores into L. Further assume that no subset in
   between Mi and Mj stores into L. Then S provides that all loads in Mj shall:
   a. Return values stored into L by operations in Mi, and
   b. For each thread Ti, the subset of Ti operations in Mj reading L shall all return the same value (which is
      necessarily one stored by an operation in Mi, as specified by condition (a) above).
3. Respecting initial values: Let Mj be a subset containing a load of L, and no stores into L. Further assume that there
   is no Mi where i<j such that Mi contains a store into L. Then all loads of L in Mj will return the initial value of L.
In such a conforming sequence S, two operations are concurrent if they have been executed by different threads and they
belong to some common subset Mi. Two operations are concurrent in an execution history of a tile, if there exists a conforming
interleaving S as described herein in which the operations are concurrent. Two operations of a program are concurrent if
there possibly exists an execution of the program in which they are concurrent.
A barrier behaves like a fence to establish order between operations, except it provides additional guarantees on the order
of execution. Based on the above definition, a barrier is like a fence that only permits a certain kind of interleaving. Specifically,
one in which the sequence of fences (F in the above formalization) has the fences, corresponding to the barrier execution by
individual threads, appearing uninterrupted in S, without any memory operations interleaved between them. For example,
consider the following program:
    C1
    Barrier
    C2
Assume that C1 and C2 are arbitrary sequences of code. Assume this program is executed by two threads T1 and T2; then the
only possible conforming interleavings are given by the following pattern:
    T1(C1) || T2(C1)
    T1(Barrier) || T2(Barrier)
    T1(C2) || T2(C2)
Where the || operator implies arbitrary interleaving of the two operand sequences.
8.1.2.4 Racy programs
Racy programs are programs which have possible executions where at least two operations performed by two separate
threads are both (a) incompatible AND (b) concurrent.
Racy programs do not have semantics assigned to them. They have undefined behavior.
8.1.2.5 Race-free programs
Race-free programs are, simply, programs that are not racy. Race-free programs have the following semantics assigned to
them:
1. If two memory operations are ordered (i.e., not concurrent) by fences and/or barriers, then the values
   loaded/stored will respect such an ordering.
2. If two memory operations are concurrent then they must be atomic and/or reads performed by threads within the
   same tile. For each memory location X there exists an eventual total order including all such concurrent
   operations applied to X and obeying the semantics of loads and atomic read-modify-write transactions.
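Informative: The definition of a race (two operations by different threads that are both incompatible and concurrent) can be sketched as a check over one recorded execution history of a tile. The standard C++ sketch below makes a simplifying assumption that is not part of the specification: each recorded operation carries a phase number counting the tile barriers its thread has already passed, and two operations are treated as concurrent exactly when their phases are equal, so fences are ignored. The names op_kind, mem_access, compatible_kinds and is_racy are hypothetical. A program is racy if any possible execution is racy, so a single history can only demonstrate a race, never prove its absence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class op_kind { read, write, atomic };

// One recorded memory operation of a tile execution.
struct mem_access {
    int thread;    // local thread id
    int location;  // abstract memory location
    op_kind kind;  // read, normal write, or atomic read-modify-write
    int phase;     // number of tile barriers the thread has passed so far
};

// Compatible operations: neither one is a normal write.
inline bool compatible_kinds(op_kind a, op_kind b) {
    return a != op_kind::write && b != op_kind::write;
}

// A history is racy if two operations from different threads touch the same
// location, are concurrent (same phase), and are incompatible.
inline bool is_racy(const std::vector<mem_access>& history) {
    for (std::size_t i = 0; i < history.size(); ++i) {
        for (std::size_t j = i + 1; j < history.size(); ++j) {
            const mem_access& a = history[i];
            const mem_access& b = history[j];
            assert(a.phase >= 0 && b.phase >= 0);
            if (a.thread != b.thread && a.location == b.location &&
                a.phase == b.phase && !compatible_kinds(a.kind, b.kind)) {
                return true;
            }
        }
    }
    return false;
}
```

In this model a write and a read of the same location by different threads are racy when they occur in the same phase, and stop being racy once a barrier separates them.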
8.2
An invocation of parallel_for_each receives a function object, the contents of which are made available on the device. The
function object may contain: concurrency::array reference data members, concurrency::array_view value data members,
concurrency::graphics::texture reference data members, and concurrency::graphics::writeonly_texture_view value data
members. (In addition, the function object may also contain additional, user defined data members.) Each of these members
of the types array, array_view, texture and writeonly_texture_view could be constrained in the type of access it provides to
kernel code. For example an array<int,2>& member provides both read and write access to the array, while a const
array<int,2>& member provides just read access to the array. Similarly, an array_view<int,2> member provides read and
write access, while an array_view<const int,2> member provides read access only.
The C++ AMP specification permits implementations in which the memory backing an array, array_view or texture could be
shared between different accelerators, and possibly also the host, while also permitting implementations where data has to
be copied, by the implementation, between different memory regions in order to support access by some hardware.
Simulating coherence at a very granular level is too expensive in the case disjoint memory regions are required by the
hardware. Therefore, in order to support both styles of implementation, this specification stipulates that parallel_for_each
has the freedom to implement coherence over array, array_view, and texture using coarse copying. Specifically, while a
parallel_for_each call is being evaluated, implementations may:
1. Load and/or store any location, in any order, any number of times, of each container which is passed into
   parallel_for_each in read/write mode.
2. Load from any location, in any order, any number of times, of each container which is passed into parallel_for_each
   in read-only mode.
A parallel_for_each always behaves synchronously. That is, any observable side effects caused by any thread executing within
a parallel_for_each call, or any side effects further affected by the implementation, due to the freedom it has in moving
memory around, as stipulated above, shall be visible by the time parallel_for_each returns.
However, since the effects of parallel_for_each are constrained to changing values within arrays, array_views and textures
and each of these objects can synchronize its contents lazily upon access, an asynchronous implementation of
parallel_for_each is possible, and encouraged. Nonetheless, implementations should still honor calls to
accelerator_view::wait by blocking until all lazily queued side-effects have been fully performed. Similarly, an implementation
should ensure that all lazily queued side-effects preceding an accelerator_view::create_marker call have been fully performed
before the completion_future object which is returned by create_marker is made ready.
Informative: Future versions of parallel_for_each may be less constrained in the changes they may make to shared memory,
and at that point an asynchronous implementation will no longer be valid. At that point, an explicitly asynchronous
parallel_for_each_async will be added to the specification.
Even though an implementation could be coarse in the way it implements coherence, it still must provide true aliasing for
array_views which refer to the same home location. For example, assume that a1 and a2 are both array_views constructed
on top of a 100-wide one-dimensional array, with a1 referring to elements [0...10] of the array and a2 referring to elements
[10...20] of the same array. If both a1 and a2 are accessible in a parallel_for_each call, then accessing a1 at position 10 is
identical to accessing the view a2 at position 0, since they both refer to the same location of the array they are providing a
view over, namely position 10 in the original array. This rule holds whenever and wherever a1 and a2 are accessible
simultaneously, i.e., on the host and in parallel_for_each calls.
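Informative: The aliasing guarantee in the a1/a2 example can be pictured with a small standard C++ model in which every view is a window onto one shared home location. The class view_sketch and the function aliasing_example are hypothetical stand-ins, not the array_view class, and the model deliberately ignores caching and accelerators; it only shows which elements two overlapping views must agree on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a one-dimensional array_view<int>: a window
// [offset, offset + size) onto a shared home location.
class view_sketch {
public:
    view_sketch(std::vector<int>& home, std::size_t offset, std::size_t size)
        : home_(&home), offset_(offset), size_(size) {
        assert(offset + size <= home.size());
    }
    // Element i of the view is element offset + i of the home location.
    int& operator[](std::size_t i) const {
        assert(i < size_);
        return (*home_)[offset_ + i];
    }
private:
    std::vector<int>* home_;
    std::size_t offset_;
    std::size_t size_;
};

// Usage: a1 covers elements [0...10] and a2 covers elements [10...20] of a
// 100-wide array, so a1[10] and a2[0] name the same element.
inline bool aliasing_example() {
    std::vector<int> home(100, 0);
    view_sketch a1(home, 0, 11);
    view_sketch a2(home, 10, 11);
    a1[10] = 42;
    return a2[0] == 42 && &a1[10] == &a2[0];
}
```

An implementation that copies data to another memory region has to preserve the same observable behaviour, for instance by cloning a single section that covers both views.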
Thus, for example, an implementation could clone an array_view passed into a parallel_for_each in read-only mode, and pass
the cloned data to the device. It can create the clone using any order of reads from the original. The implementation may
read the original multiple times, perhaps in order to implement load-balancing or reliability features.
Similarly, an implementation could copy back results from an internally cloned array, array_view or texture, onto the original
data. It may overwrite any data in the original container, and it can do so multiple times in the realization of a single
parallel_for_each call.
When two or more overlapping array views are passed to a parallel_for_each, an implementation could create a temporary
array corresponding to a section of the original container which contains at a minimum the union of the views necessary for
the call. This temporary array will hold the clones of the overlapping array_views while maintaining their aliasing
requirements.
The guarantee regarding aliasing of array_views is provided for views which share the same home location. The home
location of an array_view is defined thus:
1. In the case of an array_view that is ultimately derived from an array, the home location is the array.
2. In the case of an array_view that is ultimately derived from a host pointer, the home location is the original array
   view created using the pointer.
This means that two different array_views which have both been created, independently, on top of the same memory
region are not guaranteed to appear coherent. In fact, creating and using top-level array_views on the same host storage is
not supported. In order for such array_views to appear coherent, they must have a common top-level array_view ancestor
which they both ultimately were derived from, and that top-level array_view must be the only one which is constructed on
top of the memory it refers to.
This is illustrated in the next example:
#include <assert.h>
#include <amp.h>
using namespace concurrency;
void coherence_buggy()
{
    int storage[10];
    array_view<int> av1(10, &storage[0]);
    array_view<int> av2(10, &storage[0]); // error: av2 is top-level and aliases av1
    array_view<int> av3(5, &storage[5]); // error: av3 is top-level and aliases av1, av2
    parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { av3[2] = 15; });
    parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { av2[7] = 16; });
    parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { av1[7] = 17; });
    assert(av1[7] == av2[7]); // undefined results
    assert(av1[7] == av3[2]); // undefined results
}
void coherence_ok()
{
    int storage[10];
    array_view<int> av1(10, &storage[0]);
    array_view<int> av2(av1); // OK
    array_view<int> av3(av1.section(5,5)); // OK
}
8.3
[&] { copy(a,b); },
[&] { copy(b,c); });
8.4
An array_view may be constructed to wrap over a host side pointer. For such array_views, it is generally forbidden to access
the underlying array_view storage directly, as long as the array_view exists. Access to the storage area is generally
accomplished indirectly through the array_view. However, array_view offers mechanisms to synchronize and refresh its
contents, which do allow accessing the underlying memory directly. These mechanisms are described below.
Reading of the underlying storage is possible under the condition that the view has been first synchronized back to its home
storage. This is performed using the synchronize or synchronize_async member functions of array_view.
When a top-level view is initially created on top of a raw buffer, it is synchronized with it. After it has been constructed, a
top-level view, as well as derived views, may lose coherence with the underlying host-side raw memory buffer if the
array_view is passed to parallel_for_each as a mutable view, or if the view is a target of a copy operation. In order to restore
coherence with host-side underlying memory synchronize or synchronize_async must be called. Synchronization is restored
when synchronize returns, or when the completion_future returned by synchronize_async is ready.
For the sake of composition with parallel_for_each, copy, and all other host-side operations involving a view, synchronize
should be considered a read of the entire data section referred to by the view, as if it was the source of a copy operation, and
thus it must not be executed concurrently with any other operation involving writing the view. Note that even though
synchronize does potentially modify the underlying host memory, it is logically a no-op as it doesn't affect the logical contents
of the array. As such, it is allowed to execute concurrently with other operations which read the array view. As with copy,
synchronize works at the granularity of the view it is applied to, e.g., synchronizing a view representing a sub-section of a
parent view doesn't necessarily synchronize the entire parent view. It is just guaranteed to synchronize the overlapping
portions of such related views.
array_views are also required to synchronize their home storage:
1. Before they are destructed, if and only if it is the last view of the underlying data container.
2. When they are accessed using the subscript operator or the .data() method (on said home location).
As a result of (1), any errors in synchronization which may be encountered during destruction of array views will not be
propagated through the destructor. Users are therefore encouraged to ensure that array_views which may contain
unsynchronized data are explicitly synchronized before they are destructed.
As a result of (2), the implementation of the subscript operator may need to contain a coherence enforcing check, especially
on platforms where the accelerator hardware and host memory are not shared, and therefore coherence is managed
explicitly by the C++ AMP runtime. Such a check may be detrimental for code desiring to achieve high performance through
vectorization of the array view accesses. Therefore it is recommended for such performance-sensitive code to obtain a
pointer to the beginning of a run and perform the low-level accesses needed based off of the raw pointer into the
array_view. array_views are guaranteed to be contiguous in the unit-stride dimension, which enables this style of coding.
Furthermore, the code may explicitly synchronize the array_view and at that point read the home storage directly, without
the mediation of the view.
Sometimes it is desirable to also allow refreshing of a view directly from its underlying memory. The refresh member
function is provided for this task. This function revokes any caches associated with the view and resynchronizes the view's
contents with the underlying memory. As such it may not be invoked concurrently with any other operation that accesses
the view's data. However, it is safe to assume that refresh doesn't modify the view's underlying data and therefore
concurrent read access to the underlying data is allowed during refresh's operation and after refresh has returned, until the
point when coherence may have been lost again, as has been described above in the discussion on the synchronize member
function.
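Informative: The interplay of synchronize and refresh can be pictured with a small standard C++ model of a top-level view whose data is cached away from its home storage. The class cached_view_sketch and its members device_write and device_read are hypothetical; a real array_view manages its caches internally and does not expose them this way. The sketch only illustrates when the host buffer and the view's contents agree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a top-level view over host memory whose contents may
// be cached elsewhere (for example in accelerator memory).
class cached_view_sketch {
public:
    cached_view_sketch(int* host, std::size_t size)
        : host_(host), cache_(host, host + size), dirty_(false) {
        assert(host != nullptr);
    }
    // Models a write performed by a kernel: only the cached copy changes,
    // so the view loses coherence with the host buffer.
    void device_write(std::size_t i, int value) {
        assert(i < cache_.size());
        cache_[i] = value;
        dirty_ = true;
    }
    int device_read(std::size_t i) const {
        assert(i < cache_.size());
        return cache_[i];
    }
    // Copies pending modifications back to the home storage.
    void synchronize() {
        if (dirty_) {
            for (std::size_t i = 0; i < cache_.size(); ++i) host_[i] = cache_[i];
            dirty_ = false;
        }
    }
    // Discards the cache and reloads it from the underlying memory.
    void refresh() {
        cache_.assign(host_, host_ + cache_.size());
        dirty_ = false;
    }
private:
    int* host_;
    std::vector<int> cache_;
    bool dirty_;
};
```

In this model, reading the host buffer directly is only meaningful after synchronize, and writing the host buffer directly only becomes visible through the view after refresh, which mirrors the rules stated above.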
9.1
Math Functions
C++ AMP contains a rich library of floating point math functions that can be used in an accelerated computation. The C++
AMP library comes in two flavors, each contained in a separate namespace. The functions contained in the
concurrency::fast_math namespace support only single-precision (float) operands and are optimized for performance at the
expense of accuracy. The functions contained in the concurrency::precise_math namespace support both single and double
precision (double) operands and are optimized for accuracy at the expense of performance. The two namespaces cannot be
used together without introducing ambiguities. The accuracy of the functions in the concurrency::precise_math namespace
shall be at least as high as those in the concurrency::fast_math namespace.
All functions are available in the <amp_math.h> header file, and all are decorated restrict(amp).
fast_math
Functions in the fast_math namespace are designed for computations where accuracy is not a prime requirement, and
therefore the minimum precision is implementation-defined.
Not all functions available in precise_math are available in fast_math.
C++ API function
Description
float acosf(float x)
float acos(float x)
float asinf(float x)
float asin(float x)
float atanf(float x)
float atan(float x)
float ceilf(float x)
float ceil(float x)
float cosf(float x)
float cos(float x)
float coshf(float x)
float cosh(float x)
float expf(float x)
float exp(float x)
float exp2f(float x)
float exp2(float x)
float fabsf(float x)
float fabs(float x)
float floorf(float x)
float floor(float x)
Computes the remainder of dividing x by y. The return value is x - n * y, where n is the quotient of x / y, rounded towards zero to
an integer.
int isfinite(float x)
Determines if x is finite.
int isinf(float x)
Determines if x is infinite.
int isnan(float x)
Determines if x is NAN.
float logf(float x)
float log(float x)
float log10f(float x)
float log10(float x)
float log2f(float x)
float log2(float x)
float roundf(float x)
float round(float x)
float rsqrtf(float x)
float rsqrt(float x)
int signbit(float x)
int signbit(double x)
Returns a non-zero value if the value of X has its sign bit set.
float sinf(float x)
float sin(float x)
float sinhf(float x)
float sinh(float x)
float sqrtf(float x)
float sqrt(float x)
float tanf(float x)
float tan(float x)
float tanhf(float x)
float tanh(float x)
float truncf(float x)
float trunc(float x)
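Informative: The description of the remainder function in the table above (x - n * y, with n being x / y rounded towards zero) can be written out directly in standard C++. The function name fmod_sketch is hypothetical. The direct formula is only meant to make the definition concrete: it can lose accuracy when the quotient is large, which a library implementation such as std::fmod avoids.

```cpp
#include <cassert>
#include <cmath>

// Computes x - n * y, where n is the quotient x / y rounded towards zero.
inline float fmod_sketch(float x, float y) {
    assert(y != 0.0f);
    const float n = std::trunc(x / y);  // rounding towards zero
    return x - n * y;
}
```

Because n is rounded towards zero, the result has the same sign as x, so dividing -7.5 by 2 leaves a remainder of -1.5 rather than 0.5.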
The following list of standard math functions from the std:: namespace shall be imported into the concurrency::fast_math
namespace:
using std::acosf;
using std::asinf;
using std::atanf;
using std::atan2f;
using std::ceilf;
using std::cosf;
using std::coshf;
using std::expf;
using std::fabsf;
using std::floorf;
using std::fmodf;
using std::frexpf;
using std::ldexpf;
using std::logf;
using std::log10f;
using std::modff;
using std::powf;
using std::sinf;
using std::sinhf;
using std::sqrtf;
using std::tanf;
using std::tanhf;
using std::acos;
using std::asin;
using std::atan;
using std::atan2;
using std::ceil;
using std::cos;
using std::cosh;
using std::exp;
using std::fabs;
using std::floor;
using std::fmod;
using std::frexp;
using std::ldexp;
using std::log;
using std::log10;
using std::modf;
using std::pow;
using std::sin;
using std::sinh;
using std::sqrt;
using std::tan;
using std::tanh;
Importing these names into the fast_math namespace enables each of them to be called in unqualified syntax from a
function that has both restrict(cpu,amp) restrictions. E.g.,
void compute() restrict(cpu,amp) {
    float x = cos(y); // resolves to std::cos in cpu context; else fast_math::cos in amp context
}
9.2
precise_math
Functions in the precise_math namespace are designed for computations where accuracy is required. In the table below, the
precision of each function is stated in ulps (units in the last place).
Functions in the precise_math namespace also support both single and double precision, and are therefore dependent upon
double-precision support in the underlying hardware, even for single-precision variants.
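Informative: To make the ulp unit concrete, the following standard C++ sketch measures the distance between two finite single-precision values as the number of representable float steps separating them. It assumes the usual 32-bit IEEE 754 layout for float. The names ordered_bits and ulp_distance are hypothetical. Comparing a computed result against a correctly rounded reference in this way is one common method of estimating an error in ulps; it is not a procedure prescribed by this specification.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Maps a finite float to an integer so that adjacent floats map to adjacent
// integers. Assumes a 32-bit IEEE 754 float.
inline std::int64_t ordered_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    assert(std::isfinite(x));
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    // Negative floats are stored as sign and magnitude; mirror them below zero.
    return (u & 0x80000000u) ? -static_cast<std::int64_t>(u & 0x7fffffffu)
                             : static_cast<std::int64_t>(u);
}

// Distance between a and b counted in representable float steps.
inline std::int64_t ulp_distance(float a, float b) {
    const std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

For example, a value and its immediate neighbour in the float format are one step apart, and positive and negative zero are treated as the same point.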
C++ API function
Description
Precision
(float)
Precision
(double)
float acosf(float x)
N/A
N/A
float acos(float x)
double acos(double x)
float acoshf(float x)
float acosh(float x)
double acosh(double x)
float asinf(float x)
float asin(float x)
double asin(double x)
float asinhf(float x)
float asinh(float x)
double asinh(double x)
float atanf(float x)
float atan(float x)
double atan(double x)
float atanhf(float x)
float atanh(float x)
double atanh(double x)
float atan2f(float y, float x)
float atan2(float y, float x)
double atan2(double y, double x)
float cbrtf(float x)
float cbrt(float x)
double cbrt(double x)
float ceilf(float x)
float ceil(float x)
double ceil(double x)
float copysignf(float x, float y)
float copysign(float x, float y)
double copysign(double x, double y)
float cosf(float x)
float cos(float x)
double cos(double x)
float coshf(float x)
float cosh(float x)
double cosh(double x)
float cospif(float x)
float cospi(float x)
double cospi(double x)
float erff(float x)
float erf(float x)
double erf(double x)
float erfcf(float x)
N/A
N/A
06
N/A
N/A
N/A
N/A
float erfc(float x)
double erfc(double x)
float erfinvf(float x)
float erfinv(float x)
double erfinv(double x)
float erfcinvf(float x)
float erfcinv(float x)
double erfcinv(double x)
float expf(float x)
float exp(float x)
double exp(double x)
float exp2f(float x)
float exp2(float x)
double exp2(double x)
float exp10f(float x)
float exp10(float x)
double exp10(double x)
float expm1f(float x)
float expm1(float x)
double expm1(double x)
float fabsf(float x)
float fabs(float x)
double fabs(double x)
float fdimf(float x, float y)
float fdim(float x, float y)
double fdim(double x, double y)
float floorf(float x)
float floor(float x)
double floor(double x)
float fmaf(float x, float y, float z)
Computes the remainder of dividing x by y. The return value is x - n * y, where n is the quotient of x / y, rounded towards zero to
an integer.
N/A
N/A
int ilogb(float x)
int ilogb(double x)
int isfinite(float x)
Determines if x is finite.
N/A
N/A
Determines if x is infinite.
N/A
N/A
Determines if x is NAN.
N/A
N/A
Determines if x is normal.
N/A
N/A
67
48
int isfinite(double x)
int isinf(float x)
int isinf(double x)
int isnan(float x)
int isnan(double x)
int isnormal(float x)
int isnormal(double x)
float ldexpf(float x, int exp)
float ldexp(float x, int exp)
double ldexp(double x, int exp)
float lgammaf(float x)
float lgamma(float x)
double lgamma(double x)
float logf(float x)
float log(float x)
double log(double x)
float log10f(float x)
7
8
float log10(float x)
double log10(double x)
float log2f(float x)
N/A
N/A
float log2(float x)
double log2(double x)
float log1pf(float x)
float log1p(float x)
double log1p(double x)
float logbf(float x)
float logb(float x)
double logb(double x)
N/A
N/A
Computes the remainder of dividing x by y. The return value is x - n * y, where n is the value x / y, rounded to the nearest integer. If
this quotient is 1/2 (mod 1), it is rounded to the nearest even
number (independent of the current rounding mode). If the
return value is 0, it has the sign of x.
float nearbyint(float x)
double nearbyint(double x)
float nextafterf(float x, float y)
float rsqrt(float x)
double rsqrt(double x)
float sinpif(float x)
N/A
N/A
float sinpi(float x)
double sinpi(double x)
float scalbf(float x, float exp)
float scalb(float x, float exp)
double scalb(double x, double exp)
float scalbnf(float x, int exp)
float scalbn(float x, int exp)
double scalbn(double x, int exp)
int signbit(float x)
int signbit(double x)
Returns a non-zero value if the value of x has its sign bit set.
float sinf(float x)
09
This function returns the value of the Gamma function for the
argument x.
11
float sin(float x)
double sin(double x)
void sincosf(float x, float * s, float * c)
void sincos(float x, float * s, float * c)
void sincos(double x, double * s, double * c)
float sinhf(float x)
float sinh(float x)
double sinh(double x)
float sqrtf(float x)
float sqrt(float x)
double sqrt(double x)
float tgammaf(float x)
float tgamma(float x)
double tgamma(double x)
float tanf(float x)
float tan(float x)
double tan(double x)
float tanhf(float x)
float tanh(float x)
double tanh(double x)
float tanpif(float x)
float tanpi(float x)
double tanpi(double x)
float truncf(float x)
float trunc(float x)
double trunc(double x)
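The two remainder rules described in the table above differ only in how the quotient n is rounded. The sketch below writes both rules out directly on the host. The function names are hypothetical. The direct formula x - n * y is used only for illustration and can lose accuracy for large quotients, so it is not a substitute for the library functions.

```cpp
#include <cassert>
#include <cmath>

// x - n * y with n = x / y truncated towards zero (the fmod rule).
inline double fmod_by_definition(double x, double y) {
    assert(y != 0.0);
    double n = std::trunc(x / y);
    return x - n * y;
}

// x - n * y with n = x / y rounded to nearest, ties to even (the remainder rule).
// std::nearbyint follows the current rounding mode, which is assumed here to be
// the default round-to-nearest-even.
inline double remainder_by_definition(double x, double y) {
    assert(y != 0.0);
    double n = std::nearbyint(x / y);
    return x - n * y;
}
```

For x = 7 and y = 2 the quotient 3.5 truncates to 3 under the first rule and rounds to the even value 4 under the second, so the two functions return 1 and -1 respectively.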
The following list of standard math functions from the std:: namespace shall be imported into the
concurrency::precise_math namespace:
9
using std::acosf;
using std::asinf;
using std::atanf;
using std::atan2f;
using std::ceilf;
using std::cosf;
using std::coshf;
using std::expf;
using std::fabsf;
using std::floorf;
using std::fmodf;
using std::frexpf;
using std::ldexpf;
using std::logf;
using std::log10f;
using std::modff;
using std::powf;
using std::sinf;
using std::sinhf;
using std::sqrtf;
using std::tanf;
using std::tanhf;
using std::acos;
using std::asin;
using std::atan;
using std::atan2;
using std::ceil;
using std::cos;
using std::cosh;
using std::exp;
using std::fabs;
using std::floor;
using std::fmod;
using std::frexp;
using std::ldexp;
using std::log;
using std::log10;
using std::modf;
using std::pow;
using std::sin;
using std::sinh;
using std::sqrt;
using std::tan;
using std::tanh;
Importing these names into the precise_math namespace enables each of them to be called with unqualified syntax from a
function that carries both the cpu and amp restrictions, i.e. restrict(cpu,amp). For example:
void compute(float y) restrict(cpu,amp) {
    float x = cos(y); // resolves to std::cos in cpu context; else precise_math::cos in amp context
}
9.3
The following functions allow access to Direct3D intrinsic functions. These are included in <amp.h> in the
concurrency::direct3d namespace, and are only callable from a restrict(amp) function.
int abs(int val) restrict(amp)
Returns the absolute value of the integer argument.
Parameters:
val
int clamp(int x, int min, int max) restrict(amp)
float clamp(float x, float min, float max) restrict(amp)
Clamps the input argument x so it is always within the range [min,max]. If x < min, then this function returns the
value of min. If x > max, then this function returns the value of max. Otherwise, x is returned.
Parameters:
x
The input value.
min
max
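The clamping rule stated above can be written out as a host-side reference. The name clamp_ref is hypothetical, and the code below is plain C++ for illustration rather than the Direct3D intrinsic itself. It assumes min <= max, which the text does not state explicitly.

```cpp
#include <cassert>

// Reference behaviour for the int overload: results stay within [min, max].
inline int clamp_ref(int x, int min, int max) {
    assert(min <= max);          // assumption of this sketch
    if (x < min) return min;
    if (x > max) return max;
    return x;
}

// Reference behaviour for the float overload.
inline float clamp_ref(float x, float min, float max) {
    assert(min <= max);          // assumption of this sketch
    if (x < min) return min;
    if (x > max) return max;
    return x;
}
```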
unsigned int countbits(unsigned int val) restrict(amp)
Counts the number of bits in the input argument that are set (1).
Parameters:
val
int firstbithigh(int val) restrict(amp)
Returns the bit position of the first set (1) bit in the input val, starting from highest-order and working down.
Parameters:
val
int firstbitlow(int val) restrict(amp)
Returns the bit position of the first set (1) bit in the input val, starting from lowest-order and working up.
Parameters:
val
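For reference, the bit-counting and lowest-set-bit operations can be sketched in portable C++ as follows. The names countbits_ref and firstbitlow_ref are hypothetical. The sketch assumes that bit positions are numbered from 0 at the lowest-order bit and that -1 is returned when no bit is set. Neither convention is stated in the text above.

```cpp
#include <cassert>
#include <climits>

// Counts the bits set to 1 by repeatedly clearing the lowest set bit.
inline unsigned int countbits_ref(unsigned int val) {
    unsigned int count = 0;
    while (val != 0) {
        val &= val - 1;   // clears the lowest set bit
        ++count;
    }
    assert(count <= sizeof(unsigned int) * CHAR_BIT);
    return count;
}

// Position of the lowest set bit, scanning upwards from bit 0.
// Returns -1 for an input of 0 (an assumption of this sketch).
inline int firstbitlow_ref(unsigned int val) {
    if (val == 0) return -1;
    int pos = 0;
    while ((val & 1u) == 0) {
        val >>= 1;
        ++pos;
    }
    return pos;
}
```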
int imax(int x, int y) restrict(amp)
Returns the maximum of x and y.
Parameters:
int imin(int x, int y) restrict(amp)
Returns the minimum of x and y.
Parameters:
x
float mad(float x, float y, float z) restrict(amp)
double mad(double x, double y, double z) restrict(amp)
int mad(int x, int y, int z) restrict(amp)
unsigned int mad(unsigned int x, unsigned int y, unsigned int z) restrict(amp)
Performs a multiply-add on the three arguments: x*y + z.
Parameters:
x
Returns x*y + z.
float noise(float x) restrict(amp)
Generates a random value using the Perlin noise algorithm. The returned value will be within the range [-1,+1].
Parameters:
x
The first input value.
Returns the random noise value.
float radians(float x) restrict(amp)
Converts from x degrees into radians.
Parameters:
x
float rcp(float x) restrict(amp)
Calculates a fast approximate reciprocal of x.
Parameters:
x
float saturate(float x) restrict(amp)
Clamps the input value into the range [0,+1].
Parameters:
x
int sign(int x) restrict(amp)
Returns the sign of x; that is, it returns -1 if x is negative, 0 if x is 0, or +1 if x is positive.
Parameters:
x
float smoothstep(float min, float max, float x) restrict(amp)
Returns a smooth Hermite interpolation between 0 and 1, if x is in the range [min, max].
Parameters:
min
The minimum value of the range.
max
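A common formulation of smooth Hermite interpolation, given here as a host-side sketch, first maps x to a parameter t in [0, 1] and then evaluates t * t * (3 - 2 * t). The name smoothstep_ref is hypothetical. The behaviour outside [min, max] (0 below, 1 above) and the requirement min < max are assumptions of this sketch and are not taken from the text above.

```cpp
#include <cassert>

// Smooth Hermite interpolation between 0 and 1 as x moves from min to max.
inline float smoothstep_ref(float min, float max, float x) {
    assert(min < max);                      // assumption of this sketch
    float t = (x - min) / (max - min);      // position of x within the range
    if (t < 0.0f) t = 0.0f;                 // below the range: result is 0
    if (t > 1.0f) t = 1.0f;                 // above the range: result is 1
    return t * t * (3.0f - 2.0f * t);       // cubic Hermite polynomial
}
```

At the midpoint of the range the result is 0.5, and the slope of the curve is zero at both ends of the range.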
float step(float x, float y) restrict(amp)
Compares two values, returning 0 or 1 based on which value is greater.
Parameters:
x
The first input value.
y
10 Graphics (Optional)
Programming model elements defined in <amp_graphics.h> and <amp_short_vectors.h> are designed for graphics
programming in conjunction with accelerated compute on an accelerator device, and are therefore appropriate only for
proper GPU accelerators. Accelerator devices that do not support native graphics functionality need not implement these
features.
10.1 texture<T,N>
The texture class provides the means to create textures from raw memory or from file. Textures are similar to arrays in that
they are containers of data, and they behave like STL containers with respect to assignment and copy construction.
Textures are templated on T, the element type, and on N, the rank of the texture. N can be 1, 2, or 3.
The element type of the texture, also referred to as the texture's logical element type, is one of a closed set of short vector
types defined in the concurrency::graphics namespace and covered elsewhere in this specification. The table below briefly
enumerates all supported element types.
Rank of element type (also referred to as number of scalar elements) | Signed integer | Unsigned integer | Single-precision floating point number | Single-precision signed normalized number | Single-precision unsigned normalized number | Double-precision floating point number
1 | int | unsigned int | float | norm | unorm | double
2 | int_2 | uint_2 | float_2 | norm_2 | unorm_2 | double_2
3 | int_3 | uint_3 | float_3 | norm_3 | unorm_3 | double_3
4 | int_4 | uint_4 | float_4 | norm_4 | unorm_4 | double_4
Remarks:
1. norm and unorm vector types are vectors of floats which are normalized to the ranges [-1, 1] and [0, 1], respectively.
2. Grayed-out cells represent vector types which are defined by C++ AMP but which are not necessarily supported as
texture value types. Implementations can optionally support the types in the grayed-out cells in the above table.
Microsoft-specific: grayed-out cells in the above table are not supported.
10.1.1 Synopsis
template <typename T, int N>
class texture
{
public:
static const int rank = N;
typedef T value_type;
typedef typename short_vectors_traits<T>::scalar_type scalar_type;
texture(const extent<N>& _Ext);
texture(int _E0);
texture(int _E0, int _E1);
texture(int _E0, int _E1, int _E2);
texture(const extent<N>& _Ext, const accelerator_view& _Acc_view);
texture(int _E0, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, int _E2, const accelerator_view& _Acc_view);
texture(const extent<N>& _Ext, unsigned int _Bits_per_scalar_element);
texture(texture&& _Other);
texture& operator=(texture&& _Other);
void copy_to(texture& _Dest) const;
void copy_to(const writeonly_texture_view<T,N>& _Dest) const;
unsigned int get_bits_per_scalar_element() const;
__declspec(property(get=get_bits_per_scalar_element)) unsigned int bits_per_scalar_element;
unsigned int get_data_length() const;
__declspec(property(get=get_data_length)) unsigned int data_length;
extent<N> get_extent() const restrict(cpu,amp);
__declspec(property(get=get_extent)) extent<N> extent;
accelerator_view get_accelerator_view() const;
__declspec(property(get=get_accelerator_view)) accelerator_view accelerator_view;
const
const
const
const
const
const
const
value_type
value_type
value_type
value_type
value_type
value_type
value_type
typedef ... scalar_type;
The scalar type that serves as the component of the texture's value type. For example, for texture<int2, 3>, the scalar type would be int.
_Ext
_E0
Extent of dimension 0
_E1
Extent of dimension 1
_E2
Extent of dimension 2
_Bits_per_scalar_element
Number of bits per each scalar element in the underlying scalar type of the texture.
_Acc_view
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
 | concurrency::runtime_exception
Invalid combination of value_type and bits per scalar element | concurrency::unsupported_feature
accelerator_view doesn't support textures | concurrency::unsupported_feature
The table below summarizes all valid combinations of underlying scalar types (columns), ranks (rows), supported values for
bits-per-scalar-element (inside the table cells), and the default value of bits-per-scalar-element for each given combination
(highlighted in green). Note that unorm and norm have no default value for bits-per-scalar-element. Implementations can
optionally support textures of double4, with implementation-specific values of bits-per-scalar-element.
Rank | int | uint | float | norm | unorm | double
1 | 8, 16, 32 | 8, 16, 32 | 16, 32 | 8, 16 | 8, 16 | 64
2 | 8, 16, 32 | 8, 16, 32 | 16, 32 | 8, 16 | 8, 16 | 64
4 | 8, 16, 32 | 8, 16, 32 | 16, 32 | 8, 16 | 8, 16 |
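The per-scalar-type values in the table above can be expressed as a small validity check. This is a host-side sketch only: the enum scalar_kind and the function is_supported_bits are hypothetical names introduced for this example, and the check deliberately ignores the rank dimension of the table.

```cpp
#include <cassert>

// Hypothetical tag for the underlying scalar type of a texture.
enum class scalar_kind { int_kind, uint_kind, float_kind, norm_kind, unorm_kind, double_kind };

// Returns true if the bits-per-scalar-element value is listed in the table
// for the given scalar type. Rank is not considered in this sketch.
inline bool is_supported_bits(scalar_kind kind, unsigned int bits) {
    switch (kind) {
    case scalar_kind::int_kind:
    case scalar_kind::uint_kind:
        return bits == 8 || bits == 16 || bits == 32;
    case scalar_kind::float_kind:
        return bits == 16 || bits == 32;
    case scalar_kind::norm_kind:
    case scalar_kind::unorm_kind:
        return bits == 8 || bits == 16;
    case scalar_kind::double_kind:
        return bits == 64;
    }
    return false;
}
```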
_E0
Extent of dimension 0
_E1
Extent of dimension 1
_E2
Extent of dimension 2
_Src_first
_Src_last
Iterator pointing immediately past the last element to be copied into the texture
_Acc_view
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
Inadequate amount of data supplied through the iterators | concurrency::runtime_exception
Accelerator_view doesn't support textures | concurrency::unsupported_feature
Creates a texture from a host-side buffer. The format of the data source must be compatible with the texture's vector type, and the amount
of data in the data source must be exactly the amount necessary to initialize a texture in the specified format, with the given number of bits per scalar
element.
For example, a 2D texture of uint2 initialized with an extent of 100x200 and with _Bits_per_scalar_element equal to 8 requires a total of 100 * 200
* 2 * 8 = 320,000 bits to be available to copy from _Source, which is equal to 40,000 bytes (in other words, one byte per scalar element, for each
scalar element of each pixel in the texture).
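The size calculation in the example above can be sketched as a small host-side helper. The function name texture_data_bytes and its parameters are hypothetical. The sketch assumes tightly packed data with no padding between rows or slices, and a bits-per-scalar-element value that is a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for a tightly packed texture of e0 x e1 x e2 texels, where each
// texel holds scalars_per_texel scalars of bits_per_scalar bits each.
// Pass 1 for unused dimensions.
inline std::size_t texture_data_bytes(std::size_t e0, std::size_t e1, std::size_t e2,
                                      unsigned int scalars_per_texel,
                                      unsigned int bits_per_scalar) {
    assert(bits_per_scalar % 8 == 0);   // assumption of this sketch
    std::size_t texels = e0 * e1 * e2;
    std::size_t bits = texels * scalars_per_texel * bits_per_scalar;
    return bits / 8;
}
```

For the 100x200 texture of uint2 with 8 bits per scalar element, texture_data_bytes(100, 200, 1, 2, 8) gives the 40,000 bytes worked out in the text.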
Parameters:
_Ext
_E0
Extent of dimension 0
_E1
Extent of dimension 1
_E2
Extent of dimension 2
_Source
_Src_byte_size
_Bits_per_scalar_element
Number of bits per each scalar element in the underlying scalar type of the texture.
_Acc_view
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
Inadequate amount of data supplied through the host buffer (_Src_byte_size < texture.data_length) | concurrency::runtime_exception
 | concurrency::runtime_exception
Invalid combination of value_type and bits per scalar element | concurrency::unsupported_feature
Accelerator_view doesn't support textures | concurrency::unsupported_feature
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
texture(const texture& _Src, const accelerator_view& _Acc_view);
Initializes one texture from another.
Parameters:
_Src
_Acc_view
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
Accelerator_view doesn't support textures | concurrency::unsupported_feature
10.1.7 Assignment operator
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
Error condition | Exception thrown
Out of memory | concurrency::runtime_exception
Incompatible texture formats | concurrency::runtime_exception
 | concurrency::runtime_exception
Error condition
Exception thrown
None
unsigned int get_data_length() const;
__declspec(property(get=get_data_length)) unsigned int data_length;
Gets the physical data length (in bytes) that is required in order to represent the texture on the host side with its native format.
Error conditions: none
Microsoft-specific: the Microsoft implementation always raises a runtime exception in such a situation.
Calling set on a texture& whose element type is not int, uint, or float results in a static assert.
To write into textures of other value types, the developer must go through a writeonly_texture_view<T,N>.
This is the core function of class texture on the accelerator. Unlike arrays, the entire value type has to be read or written
as a whole. Textures do not support returning a reference to their internal data representation.
Due to platform restrictions, only a limited number of texture types support simultaneous reading and writing. Reading is
supported on all texture types, but writing through a texture& is supported only for textures of int, uint, and float, and even
in those cases the number of bits used in the physical format must be 32. If a lower number of bits is used (8 or 16)
and a kernel is invoked that contains code which could both write into and read from one of these rank-1 texture
types, then an implementation is permitted to raise a runtime exception.
_Index
_Value
Error conditions: if set is called on texture types which are not supported, a static_assert ensues.
_Dst
_Dst_byte_size
Error condition
Exception thrown
(*) Out of memory errors may occur due to the need to allocate temporary buffers in some memory transfer scenarios.
template <typename T, int N>
void copy(const void * _Src, unsigned int _Src_byte_size, texture<T,N>& _Texture);
Copies raw texture data to a device-side texture. The buffer must be laid out in accordance with the texture format and dimensions.
Parameters
_Texture
Destination texture
_Src
_Src_byte_size
Error condition
Exception thrown
Out of memory
Buffer too small
For each copy function specified above, a copy_async function will also be provided, returning a completion_future.
pTexture
Return value
Created texture
Error condition
Exception thrown
Out of memory
Invalid D3D texture argument
template <typename T, int N>
IUnknown * get_texture(const texture<T, N>& _Texture);
Retrieves a DX interface pointer from a C++ AMP texture object. Class texture allows retrieving a texture interface pointer (the exact interface depends
on the rank of the class). On success, it increments the reference count of the D3D texture interface by calling AddRef on the interface. Users must
call Release on the returned interface after they are finished using it, for proper reclamation of the resources associated with the object.
Parameters
_Texture
Source texture
Return value
Error conditions: none
10.2 writeonly_texture_view<T,N>
10.2.1 Synopsis
C++ AMP write-only texture views, coded as writeonly_texture_view<T, N>, provide write-only access into any texture.
The logical value type of the writeonly_texture_view. e.g., for writeonly_texture_view<float2,3>, value_type would be float2.
typedef ... scalar_type;
The scalar type that serves as the component of the texture's value type. For example, for writeonly_texture_view<int2,3>, the scalar type would be
int.
Source texture
Error condition
Exception thrown
10.2.5 Destructor
~writeonly_texture_view() restrict(cpu,amp);
A writeonly_texture_view can be destructed on the accelerator.
Error conditions: none
unsigned int get_data_length() const;
__declspec(property(get=get_data_length)) unsigned int data_length;
Gets the physical data length (in bytes) that is required in order to represent the texture on the host side with its native format.
Error conditions: none
extent<N> get_extent() const restrict(cpu,amp);
__declspec(property(get=get_extent)) extent<N> extent;
These members have the same meaning as the equivalent ones on the array class.
Error conditions: none
This is the main purpose of this type. All texture types can be written through a write-only view.
void set(const index<N>& _Index, const value_type& _Val) const restrict(amp);
Stores one texel in the texture.
If the texture is indexed, at runtime, outside of its logical bounds, behavior is undefined.
Parameters
_Index
Index components
_Val
_Src
_Src_byte_size
Error condition
Exception thrown
Out of memory
Buffer too small
For each copy function specified above, a copy_async function will also be provided, returning a completion_future.
Retrieves a DX interface pointer from a C++ AMP writeonly_texture_view object. On success, it increments the reference count of the D3D texture
interface by calling AddRef on the interface. Users must call Release on the returned interface after they are finished using it, for proper
reclamation of the resources associated with the object.
Parameters
_TextureView
Return value
Error conditions: none
10.3.1 Synopsis
The norm type is a single-precision floating point value that is normalized to the range [-1.0f, 1.0f]. The unorm type is a single-precision floating point value that is normalized to the range [0.0f, 1.0f].
class norm
{
public:
norm() restrict(cpu, amp);
explicit norm(float _V) restrict(cpu, amp);
explicit norm(unsigned int _V) restrict(cpu, amp);
explicit norm(int _V) restrict(cpu, amp);
explicit norm(double _V) restrict(cpu, amp);
norm(const norm& _Other) restrict(cpu, amp);
norm(const unorm& _Other) restrict(cpu, amp);
norm& operator=(const norm& _Other) restrict(cpu, amp);
operator float(void) const restrict(cpu, amp);
norm& operator+=(const norm& _Other) restrict(cpu, amp);
norm& operator-=(const norm& _Other) restrict(cpu, amp);
norm& operator*=(const norm& _Other) restrict(cpu, amp);
norm& operator/=(const norm& _Other) restrict(cpu, amp);
norm& operator++() restrict(cpu, amp);
norm operator++(int) restrict(cpu, amp);
norm& operator--() restrict(cpu, amp);
norm operator--(int) restrict(cpu, amp);
norm operator-() restrict(cpu, amp);
};
class unorm
{
public:
unorm() restrict(cpu, amp);
explicit unorm(float _V) restrict(cpu, amp);
explicit unorm(unsigned int _V) restrict(cpu, amp);
explicit unorm(int _V) restrict(cpu, amp);
explicit unorm(double _V) restrict(cpu, amp);
unorm(const unorm& _Other) restrict(cpu, amp);
explicit unorm(const norm& _Other) restrict(cpu, amp);
unorm& operator=(const unorm& _Other) restrict(cpu, amp);
operator float() const restrict(cpu,amp);
amp);
amp);
amp);
amp);
};
unorm operator+(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator+(const norm& lhs, const norm& rhs) restrict(cpu, amp);
unorm operator-(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator-(const norm& lhs, const norm& rhs) restrict(cpu, amp);
unorm operator*(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator*(const norm& lhs, const norm& rhs) restrict(cpu, amp);
unorm operator/(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator/(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator==(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator==(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator!=(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator!=(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator>(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator>(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator<(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator<(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator>=(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator>=(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator<=(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator<=(const norm& lhs, const norm& rhs) restrict(cpu, amp);
#define UNORM_MIN ((unorm)0.0f)
#define UNORM_MAX ((unorm)1.0f)
#define UNORM_ZERO ((unorm)0.0f)
#define NORM_ZERO ((norm)0.0f)
#define NORM_MIN ((norm)-1.0f)
#define NORM_MAX ((norm)1.0f)
unorm
In all these constructors, the object is initialized by first converting the argument to the float data type, and then clamping
the value into the range defined by the type.
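The convert-then-clamp rule described above can be sketched on the host as two free functions. The names to_norm_value and to_unorm_value are hypothetical, and the functions return a plain float rather than the norm and unorm classes. They illustrate only the order of operations: conversion to float first, clamping second.

```cpp
#include <cassert>
#include <type_traits>

// Converts the argument to float, then clamps into [-1.0f, 1.0f] (the norm range).
template <typename T>
inline float to_norm_value(T v) {
    static_assert(std::is_arithmetic<T>::value, "argument must be an arithmetic type");
    float f = static_cast<float>(v);
    if (f < -1.0f) return -1.0f;
    if (f > 1.0f) return 1.0f;
    return f;
}

// Converts the argument to float, then clamps into [0.0f, 1.0f] (the unorm range).
template <typename T>
inline float to_unorm_value(T v) {
    static_assert(std::is_arithmetic<T>::value, "argument must be an arithmetic type");
    float f = static_cast<float>(v);
    if (f < 0.0f) return 0.0f;
    if (f > 1.0f) return 1.0f;
    return f;
}
```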
10.3.3 Operators
All arithmetic operators that are defined for the float type are defined for norm and unorm as well. For each supported
operator, the result is computed in single-precision floating point arithmetic and, if required, is then clamped back to the
appropriate range.
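A minimal host-side sketch of this compute-then-clamp behaviour is shown below for the unorm range. The type unorm_sketch is a hypothetical stand-in with a public value member and is not the C++ AMP unorm class. Only addition and subtraction are shown.

```cpp
#include <cassert>

// Hypothetical stand-in for unorm: stores a float clamped to [0.0f, 1.0f].
struct unorm_sketch {
    float value;
    explicit unorm_sketch(float v)
        : value(v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v)) {}
};

// The sum is computed in float arithmetic; the constructor clamps the result.
inline unorm_sketch operator+(const unorm_sketch& lhs, const unorm_sketch& rhs) {
    return unorm_sketch(lhs.value + rhs.value);
}

// The difference is computed in float arithmetic; the constructor clamps the result.
inline unorm_sketch operator-(const unorm_sketch& lhs, const unorm_sketch& rhs) {
    return unorm_sketch(lhs.value - rhs.value);
}
```

With this sketch, adding 0.75 and 0.5 yields 1.0 rather than 1.25, and subtracting 0.5 from 0.25 yields 0.0 rather than -0.25.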
Assignment from norm to norm is defined, as is assignment from unorm to unorm. Assignment from other types requires an
explicit conversion.
C++ AMP defines a set of short vector types (of length 2, 3, and 4) which are based on one of the following scalar types: {int,
unsigned int, float, double, norm, unorm}, and are named as summarized in the following table:
Scalar Type | Length 2 | Length 3 | Length 4
int | int_2, int2 | int_3, int3 | int_4, int4
unsigned int | uint_2, uint2 | uint_3, uint3 | uint_4, uint4
float | float_2, float2 | float_3, float3 | float_4, float4
double | double_2, double2 | double_3, double3 | double_4, double4
norm | norm_2, norm2 | norm_3, norm3 | norm_4, norm4
unorm | unorm_2, unorm2 | unorm_3, unorm3 | unorm_4, unorm4
There is no functional difference between the type scalar_N and scalarN. scalarN type is available in the graphics::direct3d
namespace.
Unlike index<N> and extent<N>, short vector types have no notion of significance or endian-ness, as they are not assumed to
be describing the shape of data or compute (even though a user might choose to use them this way). Also unlike extents and
indices, short vector types cannot be indexed using the subscript operator.
Components of short vector types can be accessed by name. By convention, short vector type components can use either
Cartesian coordinate names (x, y, z, and w), or color scalar element names (r, g, b, and a).
Note that the names derived from the color channel space (rgba) are available only as properties, not as getter and setter
functions.
10.4.1 Synopsis
Because the full synopsis of all the short vector types is quite large, this section will summarize the basic structure of all the
short vector types.
In the summary class definition below the word "scalartype" is one of { int, uint, float, double, norm, unorm }. The value N is
2, 3 or 4.
class scalartype_N
{
public:
typedef scalartype value_type;
static const int size = N;
scalartype_N() restrict(cpu, amp);
scalartype_N(scalartype value) restrict(cpu, amp);
scalartype_N(const scalartype_N& other) restrict(cpu, amp);
// Component-wise constructor see 10.4.2.1 Constructors from components
// Constructors that explicitly convert from other short vector types
// See 10.4.2.2 Explicit conversion constructors.
scalartype_N& operator=(const scalartype_N& other) restrict(cpu, amp);
// Operators
scalartype_N& operator++() restrict(cpu, amp);
scalartype_N operator++(int) restrict(cpu, amp);
scalartype_N& operator--() restrict(cpu, amp);
scalartype_N operator--(int) restrict(cpu, amp);
scalartype_N& operator+=(const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N& operator-=(const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N& operator*=(const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N& operator/=(const scalartype_N& rhs) restrict(cpu, amp);
amp);
amp);
amp);
amp);
or uint)
scalartype_N& rhs) restrict(cpu, amp);
scalartype_N& rhs) restrict(cpu, amp);
scalartype_N& rhs) restrict(cpu, amp);
10.4.2 Constructors
scalartype_N()restrict(cpu,amp)
Default constructor. Initializes all components to zero.
scalartype_N(scalartype value) restrict(cpu,amp)
Initializes all components of the short vector to value.
Parameters:
value
The value with which to initialize each component of this vector.
scalartype_N(const scalartype_N& other) restrict(cpu,amp)
Copy constructor. Copies the contents of other to this.
Parameters:
other
The source vector to copy from.
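The three constructors above can be illustrated with a host-side stand-in for a length-3 vector. The type float_3_sketch is hypothetical and uses plain public members. It omits the restriction specifiers and the property-based accessors of the real short vector types.

```cpp
#include <cassert>

// Hypothetical stand-in for a short vector type of length 3; not the C++ AMP class.
struct float_3_sketch {
    float x, y, z;

    // Default constructor: all components are zero.
    float_3_sketch() : x(0.0f), y(0.0f), z(0.0f) {}

    // Broadcast constructor: every component receives the same value.
    float_3_sketch(float value) : x(value), y(value), z(value) {}

    // Copy constructor: copies the contents of other to this.
    float_3_sketch(const float_3_sketch& other) = default;
};
```

For instance, float_3_sketch v(2.0f) leaves x, y, and z all equal to 2.0f, and a copy of v compares equal component by component.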
A short vector type can also be constructed with values for each of its components.
scalartype_2(scalartype v1, scalartype v2) restrict(cpu,amp)
scalartype_3(scalartype v1, scalartype v2, scalartype v3) restrict(cpu,amp)
scalartype_4(scalartype v1, scalartype v2, scalartype v3, scalartype v4) restrict(cpu,amp)
Creates a short vector with the provided initial values for each component.
Parameters:
v1
The value with which to initialize the x (or r) component.
v2
v3
v4
A short vector of type scalartype1_N can be constructed from an object of type scalartype2_N, as long as N is the same in
both types. For example, a uint_4 can be constructed from a float_4.
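explicit scalartype_N(const int_N& other) restrict(cpu,amp)
explicit scalartype_N(const uint_N& other) restrict(cpu,amp)
explicit scalartype_N(const float_N& other) restrict(cpu,amp)
explicit scalartype_N(const double_N& other) restrict(cpu,amp)
explicit scalartype_N(const norm_N& other) restrict(cpu,amp)
explicit scalartype_N(const unorm_N& other) restrict(cpu,amp)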
Construct a short vector from a differently-typed short vector, performing an explicit conversion. Note that in the above
list of 6 constructors, each short vector type will have 5 of these.
Parameters:
other
Because the number of permutations of such component accessors is so large, they are described here using symmetric group
notation. In such notation, Sxy represents all permutations of the letters x and y, namely xy and yx. Similarly, Sxyz represents
all 3! = 6 permutations of the letters x, y, and z, namely xyz, xzy, yxz, yzx, zxy, and zyx.
Recall that the z (or b) component of a short vector is only available for vector lengths 3 and 4. The w (or a) component of a
short vector is only available for vector length 4.
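As a quick check of the notation, the following portable C++ sketch enumerates the members of S for a given set of component letters. The helper name swizzle_names is hypothetical and is not part of C++ AMP; it only illustrates how many accessor names a given S stands for.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of C++ AMP): lists every permutation of the
// given component letters, i.e. the members of S<letters>.
std::vector<std::string> swizzle_names(std::string letters)
{
    assert(!letters.empty());
    // std::next_permutation visits all orderings only from sorted input.
    std::sort(letters.begin(), letters.end());
    std::vector<std::string> names;
    do {
        names.push_back(letters);
    } while (std::next_permutation(letters.begin(), letters.end()));
    return names;
}
```

Called with "xy" this yields two names, with "xyz" six, and with "xyzw" twenty-four, which is why the accessors are specified by notation rather than listed one by one.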
10.4.3.1 Single-component access
scalartype get_x() const restrict(cpu,amp)
scalartype get_y() const restrict(cpu,amp)
scalartype get_z() const restrict(cpu,amp)
scalartype get_w() const restrict(cpu,amp)
void set_x(scalartype v) restrict(cpu,amp)
void set_y(scalartype v) restrict(cpu,amp)
void set_z(scalartype v) restrict(cpu,amp)
void set_w(scalartype v) restrict(cpu,amp)
__declspec(property(get=get_x, put=set_x)) scalartype x
__declspec(property(get=get_y, put=set_y)) scalartype y
__declspec(property(get=get_z, put=set_z)) scalartype z
__declspec(property(get=get_w, put=set_w)) scalartype w
__declspec(property(get=get_x, put=set_x)) scalartype r
__declspec(property(get=get_y, put=set_y)) scalartype g
__declspec(property(get=get_z, put=set_z)) scalartype b
__declspec(property(get=get_w, put=set_w)) scalartype a
These functions (and properties) allow access to individual components of a short vector type. Note that the properties
in the rgba space map to functions in the xyzw space.
float_3 f3(1,2,3);
float_2 yz = f3.yz; // yz = (2,3)
float_4 f4(1,2,3,4);
float_3 wzy = f4.wzy; // wzy = (4,3,2)
float_4 f4(1,2,3,4);
float_4 wzyx = f4.wzyx; // wzyx = (4,3,2,1)
10.5.1 Synopsis
The template class short_vector provides metaprogramming definitions of the above short vector types. These are useful
for programming short vectors generically. In general, the type scalartype_N is equivalent to
short_vector<scalartype,N>::type.
template<>
struct short_vector<int, 2>
{
typedef int_2 type;
};
template<>
struct short_vector<int, 3>
{
typedef int_3 type;
};
template<>
struct short_vector<int, 4>
{
typedef int_4 type;
};
template<>
struct short_vector<float, 1>
{
typedef float type;
};
template<>
struct short_vector<float, 2>
{
typedef float_2 type;
};
template<>
struct short_vector<float, 3>
{
typedef float_3 type;
};
template<>
struct short_vector<float, 4>
{
typedef float_4 type;
};
template<>
struct short_vector<unorm, 1>
{
typedef unorm type;
};
template<>
struct short_vector<unorm, 2>
{
typedef unorm_2 type;
};
template<>
struct short_vector<unorm, 3>
{
typedef unorm_3 type;
};
template<>
struct short_vector<unorm, 4>
{
typedef unorm_4 type;
};
template<>
struct short_vector<norm, 1>
{
typedef norm type;
};
template<>
struct short_vector<norm, 2>
{
typedef norm_2 type;
};
template<>
struct short_vector<norm, 3>
{
typedef norm_3 type;
};
template<>
struct short_vector<norm, 4>
{
typedef norm_4 type;
};
template<>
struct short_vector<double, 1>
{
typedef double type;
};
template<>
struct short_vector<double, 2>
{
typedef double_2 type;
};
template<>
struct short_vector<double, 3>
{
typedef double_3 type;
};
template<>
struct short_vector<double, 4>
{
typedef double_4 type;
};
short_vector template                  Equivalent type
short_vector<unsigned int, 1>::type    unsigned int
short_vector<unsigned int, 2>::type    uint_2
short_vector<unsigned int, 3>::type    uint_3
short_vector<unsigned int, 4>::type    uint_4
short_vector<int, 1>::type             int
short_vector<int, 2>::type             int_2
short_vector<int, 3>::type             int_3
short_vector<int, 4>::type             int_4
short_vector<float, 1>::type           float
short_vector<float, 2>::type           float_2
short_vector<float, 3>::type           float_3
short_vector<float, 4>::type           float_4
short_vector<unorm, 1>::type           unorm
short_vector<unorm, 2>::type           unorm_2
short_vector<unorm, 3>::type           unorm_3
short_vector<unorm, 4>::type           unorm_4
short_vector<norm, 1>::type            norm
short_vector<norm, 2>::type            norm_2
short_vector<norm, 3>::type            norm_3
short_vector<norm, 4>::type            norm_4
short_vector<double, 1>::type          double
short_vector<double, 2>::type          double_2
short_vector<double, 3>::type          double_3
short_vector<double, 4>::type          double_4
10.6.1 Synopsis
The template class short_vector_traits provides the ability to reflect on the supported short vector types and obtain the
length of the vector and the underlying scalar type.
template<>
struct short_vector_traits<float_2>
{
typedef float value_type;
static int const size = 2;
};
template<>
struct short_vector_traits<float_3>
{
typedef float value_type;
static int const size = 3;
};
template<>
struct short_vector_traits<float_4>
{
typedef float value_type;
static int const size = 4;
};
template<>
struct short_vector_traits<unorm>
{
typedef unorm value_type;
static int const size = 1;
};
template<>
struct short_vector_traits<unorm_2>
{
typedef unorm value_type;
static int const size = 2;
};
template<>
struct short_vector_traits<unorm_3>
{
typedef unorm value_type;
static int const size = 3;
};
template<>
struct short_vector_traits<unorm_4>
{
typedef unorm value_type;
static int const size = 4;
};
template<>
struct short_vector_traits<norm>
{
typedef norm value_type;
static int const size = 1;
};
template<>
struct short_vector_traits<norm_2>
{
typedef norm value_type;
static int const size = 2;
};
template<>
struct short_vector_traits<norm_3>
{
typedef norm value_type;
static int const size = 3;
};
template<>
struct short_vector_traits<norm_4>
{
typedef norm value_type;
static int const size = 4;
};
template<>
struct short_vector_traits<double>
{
typedef double value_type;
static int const size = 1;
};
template<>
struct short_vector_traits<double_2>
{
typedef double value_type;
static int const size = 2;
};
template<>
struct short_vector_traits<double_3>
{
typedef double value_type;
static int const size = 3;
};
template<>
struct short_vector_traits<double_4>
{
typedef double value_type;
static int const size = 4;
};
10.6.2 Typedefs
Instantiated Type                    Scalar Type
short_vector_traits<unsigned int>    unsigned int
short_vector_traits<uint_2>          unsigned int
short_vector_traits<uint_3>          unsigned int
short_vector_traits<uint_4>          unsigned int
short_vector_traits<int>             int
short_vector_traits<int_2>           int
short_vector_traits<int_3>           int
short_vector_traits<int_4>           int
short_vector_traits<float>           float
short_vector_traits<float_2>         float
short_vector_traits<float_3>         float
short_vector_traits<float_4>         float
short_vector_traits<unorm>           unorm
short_vector_traits<unorm_2>         unorm
short_vector_traits<unorm_3>         unorm
short_vector_traits<unorm_4>         unorm
short_vector_traits<norm>            norm
short_vector_traits<norm_2>          norm
short_vector_traits<norm_3>          norm
short_vector_traits<norm_4>          norm
short_vector_traits<double>          double
short_vector_traits<double_2>        double
short_vector_traits<double_3>        double
short_vector_traits<double_4>        double
10.6.3 Members
static int const size;
Introduces a static constant integer specifying the number of elements in the short vector type, based on the table
below:
Instantiated Type                    Size
short_vector_traits<unsigned int>    1
short_vector_traits<uint_2>          2
short_vector_traits<uint_3>          3
short_vector_traits<uint_4>          4
short_vector_traits<int>             1
short_vector_traits<int_2>           2
short_vector_traits<int_3>           3
short_vector_traits<int_4>           4
short_vector_traits<float>           1
short_vector_traits<float_2>         2
short_vector_traits<float_3>         3
short_vector_traits<float_4>         4
short_vector_traits<unorm>           1
short_vector_traits<unorm_2>         2
short_vector_traits<unorm_3>         3
short_vector_traits<unorm_4>         4
short_vector_traits<norm>            1
short_vector_traits<norm_2>          2
short_vector_traits<norm_3>          3
short_vector_traits<norm_4>          4
short_vector_traits<double>          1
short_vector_traits<double_2>        2
short_vector_traits<double_3>        3
short_vector_traits<double_4>        4
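The value_type and size members let generic code reason about a short vector type without knowing it in advance. The sketch below is portable C++ and not the C++ AMP header: float_3 is a simplified stand-in, and packed_bytes is a hypothetical helper that assumes tightly packed components with no padding.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

struct float_3 { float x, y, z; };   // stand-in for the C++ AMP float_3

// Only the specializations needed for the illustration are shown.
template <typename T> struct short_vector_traits;
template <> struct short_vector_traits<float> {
    typedef float value_type;
    static int const size = 1;
};
template <> struct short_vector_traits<float_3> {
    typedef float value_type;
    static int const size = 3;
};

// Hypothetical helper: bytes needed for `count` elements of a short vector
// type, assuming (for this sketch only) no padding between components.
template <typename V>
std::size_t packed_bytes(std::size_t count)
{
    typedef short_vector_traits<V> traits;
    return count * traits::size * sizeof(typename traits::value_type);
}

} // namespace sketch
```

For example, sketch::packed_bytes<sketch::float_3>(4) evaluates to 4 * 3 * sizeof(float).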
2) The device must have an AMP supported feature level. For this release this means D3D_FEATURE_LEVEL_11_0 or
D3D_FEATURE_LEVEL_11_1.
3) The D3D device should not have been created with the D3D11_CREATE_DEVICE_SINGLETHREADED flag.
Return Value:
The newly created accelerator_view object.
Exceptions:
runtime_exception
1)
2)
IUnknown * get_device(const accelerator_view &_Rv)
Returns a D3D device interface pointer underlying the passed accelerator_view. Fails with a runtime_exception exception if the passed
accelerator_view is not a D3D device accelerator view. On success, it increments the reference count of the D3D device interface by calling AddRef on
the interface. Users must call Release on the returned interface after they are finished using it, for proper reclamation of the resources associated with
the object.
Concurrent use of the accelerator_view and the raw ID3D11Device interface from multiple host threads must be properly synchronized by users to
ensure mutual exclusion. Unsynchronized concurrent usage of the accelerator_view and the raw ID3D11Device interface will result in undefined
behavior.
Parameters:
_Rv
Return Value:
An IUnknown interface pointer corresponding to the D3D device underlying the passed accelerator_view. Users must use
the QueryInterface member function on the returned interface to obtain the correct D3D device interface pointer.
Exceptions:
runtime_exception
template <typename T, int N>
array<T,N> make_array(const extent<N> &_Extent,
const accelerator_view &_Rv,
IUnknown *_D3d_buffer_interface)
Creates an array with the specified extents on the specified accelerator_view from an existing Direct3D buffer interface
pointer. On failure the member function throws a runtime_exception exception. On success, the reference count of the
Direct3D buffer object is incremented by making an AddRef call on the interface to record the C++ AMP reference to the
interface, and users can safely Release the object when no longer required in their DirectX code.
Parameters:
_Extent
_Rv
_D3d_buffer_interface
2) The D3D device on which the buffer was created must be the same as that underlying the accelerator_view parameter _Rv.
3)
c.
d.
Return Value:
The newly created array object.
Exceptions:
runtime_exception
1)
2)
3)
4)
template <size_t RANK, typename _Elem_type>
IUnknown * get_buffer(const array<_Elem_type, RANK> &_F)
Returns a D3D buffer interface pointer underlying the passed array. Fails with a runtime_exception exception if the passed array is not on a D3D device
resource view. On success, it increments the reference count of the D3D buffer interface by calling AddRef on the interface. Users must call Release
on the returned interface after they are finished using it, for proper reclamation of the resources associated with the object.
Parameters:
_F
The array for which the underlying D3D buffer interface is needed.
Return Value:
An IUnknown interface pointer corresponding to the D3D buffer underlying the passed array. Users must use the
QueryInterface member function on the returned interface to obtain the correct D3D buffer interface pointer.
Exceptions:
runtime_exception
12 Error Handling
12.1 static_assert
The C++ intrinsic static_assert is often used to handle error states that are detectable at compile time. In this way
static_assert is a technique for conveying static semantic errors, and as such these errors are categorized similarly to
exception types.
On encountering an irrecoverable error, the C++ AMP runtime throws a C++ exception to communicate/propagate the error to
client code. (Note: exceptions are not thrown from restrict(amp) code.) The actual exceptions thrown by each API are listed
in the API descriptions. The following are the exception types thrown by the C++ AMP runtime:
12.2.1 runtime_exception
A runtime_exception instance comprises a textual description of the error and an HRESULT error code to indicate the cause of
the error.
class runtime_exception
The exception type that all AMP runtime exceptions derive from. A runtime_exception instance comprises a textual description of the error and an
HRESULT error code to indicate the cause of the error.
runtime_exception(const char * _Message, HRESULT _Hresult) throw()
Construct a runtime_exception exception with the specified message and HRESULT error code.
Parameters:
_Message
_Hresult
runtime_exception (HRESULT _Hresult) throw()
Construct a runtime_exception exception with the specified HRESULT error code.
Parameters:
_Hresult
HRESULT get_error_code() const throw()
Returns the error code that caused this exception.
Return Value:
Returns the HRESULT error code that caused this exception.
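The following portable sketch models how host code might catch such an exception and read its error code. It is not the C++ AMP header: the class below is a simplified stand-in, HRESULT_t is a placeholder for the Windows HRESULT type, deriving from std::exception is an assumption of this sketch, and failing_call and error_code_of are hypothetical helpers.

```cpp
#include <cassert>
#include <exception>
#include <string>

namespace sketch {

typedef long HRESULT_t;   // placeholder for the Windows HRESULT type

// Simplified model of the runtime_exception interface described above.
class runtime_exception : public std::exception {
public:
    runtime_exception(const char* message, HRESULT_t hresult)
        : message_(message), hresult_(hresult) {}
    const char* what() const noexcept override { return message_.c_str(); }
    HRESULT_t get_error_code() const noexcept { return hresult_; }
private:
    std::string message_;
    HRESULT_t hresult_;
};

// Hypothetical host-side call that fails with the given error code.
inline void failing_call(HRESULT_t code)
{
    throw runtime_exception("operation failed", code);
}

// Runs the call and returns the reported error code, or 0 if nothing failed.
inline HRESULT_t error_code_of(HRESULT_t code_to_raise)
{
    try {
        if (code_to_raise != 0) failing_call(code_to_raise);
        return 0;
    } catch (const runtime_exception& e) {
        return e.get_error_code();   // inspect the cause on the host
    }
}

} // namespace sketch
```

Catching by const reference to the base type also catches the derived exception types described in the following subsections.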
Source
Explanation
Array constructor
12.2.2 out_of_memory
An instance of this exception type is thrown when an underlying OS/DirectX API call fails due to failure to allocate system or
device memory (E_OUTOFMEMORY HRESULT error code). Note that if the runtime fails to allocate memory from the heap
using the C++ new operator, a std::bad_alloc exception is thrown and not the C++ AMP out_of_memory exception.
explicit out_of_memory(const char * _Message) throw()
Construct an out_of_memory exception with the specified message.
Parameters:
_Message
out_of_memory() throw()
Construct an out_of_memory exception.
Parameters:
None.
12.2.3 invalid_compute_domain
An instance of this exception type is thrown when the runtime fails to devise a dispatch for the compute domain specified at
a parallel_for_each call site.
explicit invalid_compute_domain(const char * _Message) throw()
Construct an invalid_compute_domain exception with the specified message.
Parameters:
_Message
invalid_compute_domain() throw()
Construct an invalid_compute_domain exception.
Parameters:
None.
12.2.4 unsupported_feature
An instance of this exception type is thrown on executing a restrict(amp) function on the host which uses an intrinsic
unsupported on the host (such as tiled_index<>::barrier.wait()), or when invoking a parallel_for_each or allocating an object
on an accelerator which doesn't support certain features which are required for the execution to proceed, such as, but not
limited to:
1. The accelerator is not capable of executing code, but serves as a memory allocation arena only.
2. The accelerator doesn't support the allocation of textures.
3. A texture object is created with an invalid combination of bits_per_scalar_element and short-vector type.
4. Read and write operations are both requested on a texture object with bits_per_scalar_element != 32.
class unsupported_feature : public runtime_exception
Exception thrown when an unsupported feature is used.
explicit unsupported_feature (const char * _Message) throw()
Construct an unsupported_feature exception with the specified message.
Parameters:
_Message
unsupported_feature () throw()
Construct an unsupported_feature exception.
Parameters:
None.
12.2.5 accelerator_view_removed
An instance of this exception type is thrown when the C++ AMP runtime detects that a connection with a particular
accelerator, represented by an instance of class accelerator_view, has been lost. When such an incident happens, all data
allocated through the accelerator view and all in-progress computations on the accelerator view may be lost. This exception
may be thrown by parallel_for_each, as well as any other copying and/or synchronization method.
class accelerator_view_removed : public runtime_exception
HRESULT error code indicating the cause of removal of the accelerator_view
explicit accelerator_view_removed(const char * _Message, HRESULT _View_removed_reason) throw();
explicit accelerator_view_removed(HRESULT _View_removed_reason) throw();
Construct an accelerator_view_removed exception with the specified message and HRESULT.
Parameters:
_Message
_View_removed_reason
HRESULT get_view_removed_reason() const throw();
Provides the HRESULT error code indicating the cause of removal of the accelerator_view
Return Value:
The HRESULT error code indicating the cause of removal of the accelerator_view
Microsoft-specific: the Microsoft implementation of C++ AMP provides the methods specified in this section, provided all of
the following conditions are met.
1. The debug version of the runtime is being used (i.e. the code is compiled with the _DEBUG preprocessor definition).
2. The debug layer is available on the system. This, in turn, requires the DirectX SDK to be installed on the system on
Windows 7. On Windows 8 no SDK installation is necessary.
3. The accelerator_view on which the kernel is invoked must be on a device which supports the printf and abort
intrinsics. As of the date of writing this document, only the REF device supports these intrinsics.
When the debug version of the runtime is not used or the debug layer is unavailable, executing a kernel that uses these
intrinsics through a parallel_for_each call will result in a runtime exception. On devices that do not support these intrinsics,
these intrinsics will behave as no-ops.
void direct3d_printf(const char *_Format_string, ...) restrict(amp)
Prints formatted output from a kernel to the debug output. The formatting semantics are the same as those of the C Library
printf function. Also, this function is executed as any other device-side function: per-thread, and in the context of the calling
thread. Due to the asynchronous nature of kernel execution, the output from this call may appear anytime between the
launch of the kernel containing the printf call and completion of the kernel's execution.
Parameters:
_Format_string
Return Value:
None.
void direct3d_errorf(char *_Format_string, ...) restrict(amp)
This intrinsic prints formatted error messages from a kernel to the debug output. This function is executed as any other
device-side function: per-thread, and in the context of the calling thread. Note that due to the asynchronous nature of
kernel execution, the actual error messages may appear in the debug output asynchronously, any time between the
dispatch of the kernel and the completion of the kernel's execution. When these error messages are detected by the
runtime, it raises a runtime_exception exception on the host with the formatted error message output as the exception
message.
Parameters:
_Format_string
void direct3d_abort() restrict(amp)
This intrinsic aborts the execution of those threads in the compute domain of a kernel invocation that execute this instruction.
This function is executed as any other device-side function: per-thread, and in the context of the calling thread. Also, the
thread is terminated without executing any destructors for local variables. When the abort is detected by the runtime, it
raises a runtime_exception exception on the host with the abort output as the exception message. Note that due to the
asynchronous nature of kernel execution, the actual abort may be detected any time between the dispatch of the kernel
and the completion of the kernel's execution.
Due to the asynchronous nature of kernel execution, the direct3d_printf, direct3d_errorf and direct3d_abort messages from
kernels executing on a device appear asynchronously during the execution of the shader or after its completion and not
immediately after the async launch of the kernel. Thus these messages from a kernel may be interleaved with messages from
other kernels executing concurrently or error messages from other runtime calls in the debug output. It is the programmer's
responsibility to include appropriate information in the messages originating from kernels to indicate the origin of the
messages.
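One way to make interleaved output attributable is to prefix every message with the kernel name and a thread index. The sketch below shows the idea on the host with std::snprintf; tagged_message and the "[kernel:thread]" layout are hypothetical choices, not part of C++ AMP. Inside a kernel, the same prefix would presumably be written into the format string passed to direct3d_printf, since its formatting semantics follow printf.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds a message prefixed with the kernel name and
// thread index so that interleaved output can be traced back to its origin.
std::string tagged_message(const char* kernel, int thread, const char* text)
{
    assert(kernel != nullptr && text != nullptr);
    char buffer[256];
    // snprintf truncates rather than overflows if the message is too long.
    std::snprintf(buffer, sizeof(buffer), "[%s:%d] %s", kernel, thread, text);
    return std::string(buffer);
}
```

For example, tagged_message("blur", 7, "index out of range") produces the string "[blur:7] index out of range".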
It is likely that C++ AMP will evolve over time. The set of features allowed inside amp-restricted functions will grow. However,
compilers will have to continue to support older hardware targets which only support the previous, smaller feature set. This
section outlines possible such evolution of the language syntax and associated feature set.
This section contains an informative description of additional language syntax and rules to allow the versioning of C++ AMP
code. If an implementation desires to extend C++ AMP in a manner not covered by this version of the specification, it is
recommended that it follow the syntax and rules specified here.
restriction:
    amp-restriction
    cpu
    auto
A function or lambda which is annotated with restrict(auto) directs the compiler to check all known restrictions and
automatically deduce the set of restrictions that a function complies with. restrict(auto) is only allowed for functions where
the function declaration is also a function definition, and no other declaration of the same function occurs.
A function may be simultaneously explicitly and auto restricted, e.g., restrict(cpu,auto). In such case, it will be explicitly
checked for compulsory conformance with the set of explicitly specified (non-auto) restrictions, and implicitly checked for
possible conformance with all other restrictions that the compiler supports.
Consider the following example:
int f1() restrict(amp);
int f2() restrict(cpu,auto)
{
    return f1();
}
In this example, f2 is verified for compulsory adherence to the restrict(cpu) restriction. This results in an error, since f2 calls
f1, which is not cpu-restricted. Had we changed f1's restriction to restrict(cpu), f2 would pass the adherence test for the
explicitly specified restrict(cpu). With respect to the auto restriction, the compiler then has to check whether f2 conforms to
restrict(amp), which is the only other restriction not explicitly specified. In verifying the plausibility of inferring an
amp-restriction for f2, the compiler notices that f2 calls f1, which is, in our modified example, not amp-restricted, and
therefore f2 is also inferred to be not amp-restricted. Thus the total inferred restriction for f2 is restrict(cpu).
If we now change the restriction for f1 into restrict(cpu,amp), then the inference for f2 would reach the conclusion that f2
is restrict(cpu,amp) too.
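The three cases discussed above can be modeled as simple set arithmetic. The following sketch is an illustration only, not how any compiler implements the rule: restriction sets are encoded as bit flags, the function body is assumed to consist solely of calls to other functions, and infer_restrictions is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

namespace sketch {

// Restriction sets modeled as bit flags (a simplification for illustration).
enum : unsigned { kCpu = 1u, kAmp = 2u };

// Hypothetical model of restrict(<explicit_set>, auto) for a function whose
// body only calls other functions. Returns 0 when a compulsory (explicitly
// specified) restriction is violated, otherwise the total inferred set.
inline unsigned infer_restrictions(unsigned explicit_set,
                                   const std::vector<unsigned>& callees)
{
    unsigned callable = kCpu | kAmp;   // restrictions every callee supports
    for (unsigned c : callees) callable &= c;
    if ((explicit_set & callable) != explicit_set) return 0;  // error case
    return callable;   // explicit restrictions plus all others that conform
}

} // namespace sketch
```

With f2 explicitly cpu-restricted, a callee restricted to amp gives 0 (the error case), a callee restricted to cpu gives kCpu, and a callee restricted to both gives kCpu | kAmp.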
When two overloads are available to call from a given restriction context, and they differ only by the fact that one is
explicitly restricted while the other is implicitly inferred to be restricted, the explicitly restricted overload shall be chosen.
In such a mode, when the compiler encounters a function declaration which is also a definition, and a previous declaration
for the function hasn't been encountered before, then the compiler analyses the function as if it was restricted with
restrict(cpu,auto). This allows easy reuse of existing code in amp-restricted code, at the cost of prolonged compilation
times.
amp-restriction:
    amp amp-version(opt)
amp-version:
    : integer-constant
    : integer-constant . integer-constant
An amp version specifies the lowest version of amp that this function supports. In other words, if a function is decorated
with restrict(amp:1), then that function also supports any version greater than or equal to 1. When the amp version is elided,
the implied version is implementation-defined. Implementations are encouraged to support a compiler flag controlling the
default version assumed. When versioning is used in conjunction with restrict(auto) and/or automatic restriction deduction,
the compiler shall infer the maximal version of the amp restriction that the function adheres to.
Section 2.3.2 specifies that restriction specifiers of a function shall not overlap with any restriction specifiers in another
function within the same overload set.
int func(int x) restrict(cpu,amp);
int func(int x) restrict(cpu); // error, overlaps with previous declaration
This rule is relaxed in the case of versioning: functions overloaded with amp versions are not considered to overlap:
int func(int x) restrict(cpu);
int func(int x) restrict(amp:1);
int func(int x) restrict(amp:2);
When an overload set contains multiple versions of the amp specifier, the function with the highest version number that is
not higher than the caller's version is chosen:
void glorp() restrict(amp:1) { }
void glorp() restrict(amp:2) { }
void glorp_caller() restrict(amp:2) {
glorp(); // okay; resolves to call glorp() restrict(amp:2)
}
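The selection rule can be modeled in a few lines of ordinary C++. This is a sketch under stated assumptions rather than a description of any compiler: versions are encoded as major * 10 + minor (so amp:1.2 becomes 12), and choose_overload is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of versioned overload selection. Versions are encoded
// as major * 10 + minor. Returns the version of the chosen overload, or -1
// when no overload is callable from the given caller version.
int choose_overload(const std::vector<int>& overload_versions, int caller_version)
{
    int best = -1;
    for (int v : overload_versions) {
        // Candidates must not exceed the caller's version; keep the highest.
        if (v <= caller_version && v > best) best = v;
    }
    return best;
}
```

In the glorp example, the overload set {10, 20} seen from an amp:2 caller resolves to 20, while the same set seen from an amp:1 caller resolves to 10.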
Based on the nascent availability of features in advanced GPUs and corresponding hardware-vendor-specific programming
models, it is apparent that the limitations associated with restrict(amp) will be gradually lifted. The table below captures
one possible path for future amp versions to follow. If implementers need to (non-normatively) extend the amp-restricted
language subset, it is recommended that they consult the table below and try to conform to its style.
Implementations may not define an amp version greater than or equal to 2.0. All non-normative extensions shall be restricted to
the patterns 1.x (where x > 0). Version number 1.0 is reserved to implementations strictly adhering to this version of the
specification, while version number 2.0 is reserved for the next major version of this specification.
Area                         Feature                           amp:1    amp:1.1  amp:1.2  amp:2    cpu
Local/Param/Function Return  char (8 - signed/unsigned/plain)  No       Yes      Yes      Yes      Yes
Local/Param/Function Return                                    No       Yes      Yes      Yes      Yes
Local/Param/Function Return                                    Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return                                    Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return                                    No       No       Yes      Yes      Yes
Local/Param/Function Return                                    No       No       No       No       No
Local/Param/Function Return  float (32)                        Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return  double (64)                       Yes [10] Yes      Yes      Yes      Yes
Local/Param/Function Return                                    No       No       No       No       Yes
Local/Param/Function Return  bool (8)                          Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return  wchar_t (16)                      No       Yes      Yes      Yes      Yes
Local/Param/Function Return  Pointer (single-indirection)      Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return  Pointer (multiple-indirection)    No       No       Yes      Yes      Yes
Local/Param/Function Return  Reference                         Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return  Reference to pointer              Yes      Yes      Yes      Yes      Yes
Local/Param/Function Return  Reference/pointer to function     No       No       Yes      Yes      Yes
Local/Param/Function Return  static local                      No       No       Yes      Yes      Yes
Struct/class/union members   char (8 - signed/unsigned/plain)  No       Yes      Yes      Yes      Yes
Struct/class/union members                                     No       Yes      Yes      Yes      Yes
Struct/class/union members                                     Yes      Yes      Yes      Yes      Yes
Struct/class/union members                                     Yes      Yes      Yes      Yes      Yes
Struct/class/union members                                     No       No       Yes      Yes      Yes
Struct/class/union members                                     No       No       No       No       No
Struct/class/union members   float (32)                        Yes      Yes      Yes      Yes      Yes
Struct/class/union members   double (64)                       Yes      Yes      Yes      Yes      Yes
Struct/class/union members                                     No       No       No       No       Yes
Struct/class/union members   bool (8)                          No       Yes      Yes      Yes      Yes
Struct/class/union members   wchar_t (16)                      No       Yes      Yes      Yes      Yes
Struct/class/union members   Pointer                           No       No       Yes      Yes      Yes
Struct/class/union members   Reference                         No       No       Yes      Yes      Yes
Struct/class/union members   Reference/pointer to function     No       No       No       Yes      Yes
Struct/class/union members   bitfields                         No       No       No       Yes      Yes
Struct/class/union members   unaligned members                 No       No       No       No       Yes
Struct/class/union members   pointer-to-member (data)          No       No       Yes      Yes      Yes
Struct/class/union members   pointer-to-member (function)      No       No       Yes      Yes      Yes
Struct/class/union members                                     No       No       No       Yes      Yes
Struct/class/union members                                     Yes      Yes      Yes      Yes      Yes
Struct/class/union members                                     Yes      Yes      Yes      Yes      Yes
Struct/class/union members                                     No       No       Yes      Yes      Yes
Struct/class/union members   Constructors                      Yes      Yes      Yes      Yes      Yes
Struct/class/union members   Destructors                       Yes      Yes      Yes      Yes      Yes
Enums                        char (8 - signed/unsigned/plain)  No       Yes      Yes      Yes      Yes
Enums                                                          No       Yes      Yes      Yes      Yes
Enums                                                          Yes      Yes      Yes      Yes      Yes
Enums                                                          Yes      Yes      Yes      Yes      Yes
Enums                                                          No       No       No       No       Yes
Structs/Classes                                                Yes      Yes      Yes      Yes      Yes
Structs/Classes                                                No       Yes      Yes      Yes      Yes
Arrays                       of pointers                       No       No       Yes      Yes      Yes
Arrays                       of arrays                         Yes      Yes      Yes      Yes      Yes
Declarations                 tile_static                       Yes      Yes      Yes      Yes      No
Function Declarators         Varargs (...)                     No       No       No       No       Yes
Function Declarators         throw() specification             No       No       No       No       Yes
Statements                   global variables                  No       No       No       Yes      Yes
Statements                                                     No       No       No       Yes      Yes
Statements                                                     No       No       Yes      Yes      Yes
Statements                                                     No       No       No       Yes      Yes
Statements                                                     No       No       Yes      Yes      Yes
Statements                                                     No       Yes      Yes      Yes      Yes
Statements                   new                               No       No       Yes      Yes      Yes
Statements                   delete                            No       No       Yes      Yes      Yes
Statements                   dynamic_cast                      No       No       No       No       Yes
Statements                   typeid                            No       No       No       No       Yes
Statements                   goto                              No       No       No       No       Yes
Statements                   labels                            No       No       No       No       Yes
Statements                   asm                               No       No       No       No       Yes
Statements                   throw                             No       No       No       No       Yes
Statements                   try/catch                         No       No       No       No       Yes
Statements                   __try/__except                    No       No       No       No       Yes
Statements                   __leave                           No       No       No       No       Yes