02 Principles of Parallel Execution and Partitioning
Having completed this module you will be able:
Parallel Execution
Parallel execution is best understood by contemplating what happens
when parallel execution is not occurring.
In the worst case, one row is processed right through the ETL process, and
disposed of in accordance with the design. Then the next row is processed
right through the ETL process, and so on. In short, processing occurs one
row at a time.
There are two ways that parallelism (that is, processing more than one
row at the same time) can occur. They are called pipeline parallelism
and partition parallelism. DataStage
parallel jobs employ both, largely automatically. It is possible to design
both into DataStage server jobs too, in a non-automatic implementation,
but that discussion is outside the scope of these notes.
Pipeline Parallelism
If a DataStage job consists of stages connected by links, then the data flow
through that job is indicated by the arrowheads on the links.
By associating some kind of memory buffer with each of the links, it
becomes possible to transmit more than one row at a time, and for
upstream stages to begin processing rows while other, earlier rows are
still being processed by downstream stages.
producing rows into one and the downstream stage consuming rows from
the other. These roles switch when certain tuneable thresholds [1] are reached.
This mechanism tends to smooth out speed variations in stage execution
and allows processing to continue as fast as possible.
Another way to think of pipeline parallelism is as a conveyor belt moving
buffers (boxes?) of rows from one stage to the next.
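The conveyor belt idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how the parallel engine is implemented: each stage runs in its own thread, and each link is a bounded queue acting as the buffer. The names run_stage, run_pipeline and buffer_size are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data on a link


def run_stage(func, inbox, outbox):
    """Consume rows from the upstream link, apply func, pass results on."""
    while True:
        row = inbox.get()
        if row is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage there is no more data
            return
        outbox.put(func(row))


def run_pipeline(rows, funcs, buffer_size=4):
    """Run funcs as concurrent stages joined by bounded buffers (links)."""
    links = [queue.Queue(maxsize=buffer_size) for _ in range(len(funcs) + 1)]
    stages = [
        threading.Thread(target=run_stage, args=(f, links[i], links[i + 1]))
        for i, f in enumerate(funcs)
    ]

    def feed():
        # The source blocks whenever the first buffer is full.
        for row in rows:
            links[0].put(row)
        links[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    for t in stages + [feeder]:
        t.start()

    results = []
    while True:
        row = links[-1].get()
        if row is SENTINEL:
            break
        results.append(row)

    for t in stages + [feeder]:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(10), [lambda r: r + 1, lambda r: r * 2]))
```

Because every buffer is bounded, a fast upstream stage is held back when the buffer fills, and a slow one never starves the rest of the pipeline for long; this is the smoothing effect described above.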
Conceptually the operating system idea of a pipeline may be used to aid
understanding of what is happening. Operating systems allow pipelines of
commands to be created, with operators such as the pipe (|), re-direction
operators such as > and conditional pipe operators such as && joining
them and controlling what happens.
If you imagine for a moment that each stage in a DataStage parallel job
design generates an operator when the job is compiled then you might
contemplate the execution path of a parallel job as being able to be
described as something like:
op0 | op1 | op2 | op3
To provide for the buffering (the virtual Data Sets), this model needs to
be slightly refined.
op0 > ds0 | op1 < ds0 > ds1 | op2 < ds1 > ds2 | op3 < ds2
In this model, op0 through op3 refer generically to operators and ds0
through ds2 refer generically to Data Sets [2].
Indeed, when you examine the Orchestrate shell script that is generated by
compiling a job, or when you examine the score that is actually executed,
you see precisely this relationship between operators and Data Sets.
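To make the pattern in the refined model explicit, the following small Python sketch builds the same notation for any number of operators. It is purely illustrative: the function name refined_pipeline is hypothetical, and the output is notation only, not a real osh script.

```python
def refined_pipeline(operator_count):
    """Describe a chain of operators linked by virtual Data Sets.

    Operator i reads from Data Set i-1 (except the first operator) and
    writes to Data Set i (except the last operator).
    """
    parts = []
    for i in range(operator_count):
        text = f"op{i}"
        if i > 0:
            text += f" < ds{i - 1}"
        if i < operator_count - 1:
            text += f" > ds{i}"
        parts.append(text)
    return " | ".join(parts)


if __name__ == "__main__":
    print(refined_pipeline(4))
```

Note that N operators in a simple chain need only N-1 Data Sets, which is why four operators are linked by ds0 through ds2.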
[2] In the Orchestrate parallel execution engine, the only structure with which
an operator can deal is an Orchestrate Data Set. This has implications for
dealing with real world data, which will be discussed in a later module.
[1] Details on how these mechanisms are tuned are beyond the scope of these notes. They
properly belong in an advanced syllabus.
Partition Parallelism
Partition parallelism involves dividing the data into subsets of rows and
processing each subset identically and in parallel. Provided that there are
sufficient resources to handle this parallel processing, overall execution
time should be reduced linearly.
That is, if you divide your data into N equal subsets of rows, then it should
take very close to 1/N the amount of time to execute as it would to process
the entire set of rows in a single stream.
In practice there will be factors, such as the need to keep common key
values together and the overheads of starting up and shutting down the
additional parallel processes, that will militate against achieving this
optimal result, but as a general rule partition parallelism will yield a
reduced elapsed time.
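The arithmetic can be sketched as follows. This is a deliberately simple model under stated assumptions, not a measurement of DataStage: it assumes perfectly equal subsets and a fixed, hypothetical start-up and shut-down cost for each additional partition. The function and parameter names are invented for illustration.

```python
def estimated_elapsed(single_stream_seconds, partitions,
                      overhead_seconds_per_partition=0.0):
    """Ideal 1/N elapsed time plus an assumed per-partition overhead."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    ideal = single_stream_seconds / partitions
    overhead = overhead_seconds_per_partition * partitions
    return ideal + overhead


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, estimated_elapsed(120, n, overhead_seconds_per_partition=0.5))
```

With no overhead, 120 seconds of single-stream work on four partitions comes out at 30 seconds. With the assumed overhead the gain from each extra partition shrinks, which is the effect described in the paragraph above.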
DataStage parallel jobs achieve partition parallelism automatically, using
the Orchestrate parallel execution mechanism described below.
This means that you only have to design as if there will be a single stream
of processing; the DataStage parallel engine looks after the rest (though
you do need to remain alert to implications for partitioning data and
generating sequences of numbers in a parallel execution environment).
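As an example of why sequence numbers need care, consider each partition numbering its own rows independently: every partition would start from the same value and duplicates would result. One common remedy, sketched here in Python under the assumption that each partition knows its own number and the total number of partitions, is to interleave the sequences. The names partition_sequence, partition_number and partition_count are hypothetical.

```python
def partition_sequence(partition_number, partition_count, row_count, start=0):
    """Sequence numbers for one partition that cannot collide with others.

    Partition p of N generates start+p, start+p+N, start+p+2N, and so on.
    """
    return [start + partition_number + partition_count * i
            for i in range(row_count)]


if __name__ == "__main__":
    for p in range(4):
        print(p, partition_sequence(p, 4, 3))
```

The values are unique across partitions, but they are contiguous overall only when every partition processes the same number of rows.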
Consider the following job design (don't worry too much yet about what it
actually does).
Execution Environment
It would be well here to take a short sidestep to discuss some terminology.
The term symmetric multi-processing (SMP) is often used to describe a
single machine with more than one CPU.
Multi-core CPU chips count as multiple processors for this discussion.
Although technically it is not quite the same thing, non-uniform memory
architecture (NUMA) is treated as SMP for DataStage purposes.
In an SMP environment, every processor has access to all of the memory
and all of the disk resources available to that machine. For that reason,
it is sometimes usefully referred to as a "shared everything" environment.
When processing occurs on multiple machines, which may be connected
together in a cluster or grid configuration, then each machine's CPUs have
access only to that machine's memory and disk resources. For that reason,
such a configuration is sometimes referred to as a "shared nothing"
environment. The term massively parallel processing (MPP) is also
encountered when describing this kind of environment.
Execution on a grid configuration, where the number of available
machines may vary over time, requires a more dynamic configuration file.
DataStage has the capability to manage (or, more accurately, to be
managed by) a grid environment. This requires installation of the Grid
Enablement Toolkit.
Configuration File
A DataStage parallel execution configuration file contains a list of one or
more node names, the location of each node and the resources available on
each node. A node, in this context, is a purely logical construct called a
processing node; there is no fixed relationship between the number of
nodes and, for example, the number of CPUs available.
On a four-CPU machine (or cluster or grid), for example, it is likely that
one-node, two-node, four-node and even eight-node configurations might
be defined. And where the nodes are on separate machines, the same
considerations apply (consider the total number of CPUs available), but
there are some small overheads in communication between the machines
that need to be taken into consideration.
In an SMP environment, everything is on the same machine in the
network. Therefore the machine location of every processing node in the
parallel execution file will be the same machine name. In an MPP
environment, however, there will be different machine names associated
with different processing nodes. It still may be the case that there is more
than one node on each of these machines, but if there is ever more than
one machine named in a configuration file, then it describes an MPP
configuration of some kind.
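The rule of thumb above (more than one machine name means MPP) can be checked mechanically. The following Python sketch scans the text of a configuration file for fastname entries. It is only an illustration: the two abbreviated configuration fragments and the machine names hostA and hostB are hypothetical, and the function names are invented.

```python
import re


def fastnames(config_text):
    """Return every fastname value found in a configuration file's text."""
    return re.findall(r'fastname\s+"([^"]*)"', config_text)


def describes_mpp(config_text):
    """True when more than one distinct machine name is mentioned."""
    return len(set(fastnames(config_text))) > 1


# Abbreviated, hypothetical fragments (resource entries omitted).
SMP_EXAMPLE = '''
{
    node "node0" { fastname "dsdevsvr" pools "" }
    node "node1" { fastname "dsdevsvr" pools "" }
}
'''

MPP_EXAMPLE = '''
{
    node "node0" { fastname "hostA" pools "" }
    node "node1" { fastname "hostB" pools "" }
}
'''

if __name__ == "__main__":
    print(describes_mpp(SMP_EXAMPLE), describes_mpp(MPP_EXAMPLE))
```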
Let's examine the default configuration file that ships with DataStage
server for installation on a Windows operating system.
{
    node "node1"
    {
        fastname "PCXX"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}
The outer set of curly braces contains the entire configuration definition.
The only things that can appear outside of these curly braces are C-style
comments (that is, text introduced by /* and terminated by */).
In the default configuration file shown there is only one node, and it is
named node1. Node names are simply names; they are used only in
error, warning and informational messages to be read by humans.
DataStage internally uses the node number, that is, the ordinal number,
beginning from zero, of the node within the current configuration file. Therefore, in
the default configuration file shown here, you would expect to see node #0
referred to in areas like monitoring.
After each node name there is another set of curly braces, which contains
the definition of that particular node. A node definition contains:
- the fastname of that node, that is, the network name of the machine on
  which this node's processing is to occur (it is called fastname because
  you always specify the name by which that machine is known on the
  fastest possible network connection)
- a list of the node pools in which this node will participate; this is a
  space-separated list introduced by the word pools, with one or more
  node pool names each surrounded by double quote characters
A node pool or disk pool name that consists of a zero length string, as
shown in the default configuration file displayed above, is the default
pool. At least one node must belong to the default node pool unless every
stage in the job design has an explicit node pool specified.
From DataStage Designer client (under the Tools menu) you can open a
tool called the Configurations editor, which is illustrated in Figure 2-1. As
well as being able to view the configuration files available on the system,
this editor allows new configuration files to be created from old, existing
configuration files to be edited [3], and the validity of a configuration file to
be checked.
Let's examine a configuration file from a UNIX system. (The only
difference on a Windows system would be Windows pathnames.) [4]
/* Four-node configuration file */
/* Created 2008-02-22, Ray Wurlod */
{
    node "node0"
    {
        fastname "dsdevsvr"
        pools "" "node0" "low2"
        resource disk "/usr/data1/DataStage/disk00" {pools ""}
        resource disk "/usr/data2/DataStage/disk00" {pools ""}
        resource scratchdisk "/usr/tmp/DataStage/disk00" {pools ""}
        resource scratchdisk "/var/tmp/DataStage/disk00" {pools "buffer"}
    }
    node "node1"
    {
        fastname "dsdevsvr"
        pools "" "node1" "low2"
        resource disk "/usr/data1/DataStage/disk01" {pools ""}
        resource disk "/usr/data2/DataStage/disk01" {pools ""}
        resource scratchdisk "/usr/tmp/DataStage/disk01" {pools ""}
        resource scratchdisk "/var/tmp/DataStage/disk01" {pools "buffer"}
    }
    node "node2"
    {
        fastname "dsdevsvr"
        pools "" "node2" "high2"
        resource disk "/usr/data1/DataStage/disk02" {pools ""}
        resource disk "/usr/data2/DataStage/disk02" {pools ""}
        resource scratchdisk "/usr/tmp/DataStage/disk02" {pools ""}
        resource scratchdisk "/var/tmp/DataStage/disk02" {pools "buffer"}
    }
    node "node3"
    {
        fastname "dsdevsvr"
        pools "" "node3" "high2"
        resource disk "/usr/data1/DataStage/disk03" {pools ""}
        resource disk "/usr/data2/DataStage/disk03" {pools ""}
        resource scratchdisk "/usr/tmp/DataStage/disk03" {pools ""}
        resource scratchdisk "/var/tmp/DataStage/disk03" {pools "buffer"}
    }
}
[3] Most sites protect the Configurations directory so that only administrative staff
can edit the files in it. If this is the case at your site, you will receive an access denied
message when you attempt to save changes.
[4] Configuration files for grid environments have other requirements due to the dynamic
nature of nodes resulting from grid management software. These are beyond the scope of these
notes.
This is clearly an SMP system (the fastname is the same for all nodes).
There are four single-node node pools, and two two-node node pools. The
two-node node pools are named low2 and high2. The four one-node
node pools are eponymously named for the node to which each relates.
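The pool membership just described can be derived mechanically from the pools entry of each node. The sketch below, in Python, starts from the pool lists of the four-node example (typed in by hand rather than parsed from the file) and inverts them into a mapping from pool name to member nodes. The names NODE_POOLS and pool_membership are hypothetical.

```python
from collections import defaultdict

# Node pool lists for each node of the four-node example above.
NODE_POOLS = {
    "node0": ["", "node0", "low2"],
    "node1": ["", "node1", "low2"],
    "node2": ["", "node2", "high2"],
    "node3": ["", "node3", "high2"],
}


def pool_membership(node_pools):
    """Invert node -> pools into pool -> list of member nodes."""
    members = defaultdict(list)
    for node, pools in node_pools.items():
        for pool in pools:
            members[pool].append(node)
    return dict(members)


if __name__ == "__main__":
    for pool, nodes in pool_membership(NODE_POOLS).items():
        label = pool if pool else "(default)"
        print(label, nodes)
```

The default pool (the zero-length name) contains all four nodes, which is why a stage with no explicit node pool runs on every node of this configuration.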
Node pool names can be dedicated. There are several reserved node pool
names in DataStage, among them DB2, INFORMIX, ORACLE, SAS and
sort. The first four tie in with resources available on that node. The sort
node pool can be used to specify a separate scratch disk resource that will
be used by sort rather than using the default scratch disk disk pool.
Directories listed under resource disk are used to store data in Data Sets,
File Sets and Lookup File Sets. We shall see later that files stored in these
directories are recorded in a control file for each of these particular
Directories listed under resource scratchdisk are, as the name suggests,
used for scratch space by various kinds of operator. For example, the
Sort stage allows a certain maximum amount of physical memory per
node to be allocated to the sorting process. Once all this memory is in use,
the sorting operation overflows into files in the scratch space. These files
should automatically be deleted once the job completes successfully.
Scratch space may also be used by other facilities. For example, if the
Oracle bulk loader (sqlldr) experiences a problem and generates a bad
file and a log file, and the designer has not specified locations where
these should be written, then DataStage will direct them to the scratch space.
The parallel engine uses the default scratch disk for temporary storage
other than buffering. If you define a buffer scratch disk pool for a node in
the configuration file, the parallel engine uses that scratch disk pool rather
than the default scratch disk for buffering, and all other scratch disk pools
defined are used for temporary storage other than buffering. Each
processing node in the above example has a single scratch disk resource in
the buffer pool, so buffering will use /var/tmp/DataStage/disknn but will
not use /usr/tmp/DataStage/disknn (other operators will use this scratch space).
However, if /var/tmp/DataStage/disknn were not in the buffer pool,
both /var/tmp/DataStage/disknn and /usr/tmp/DataStage/disknn would be
used because both would then be in the default pool.
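The selection rule described above can be expressed as a short Python sketch. It is an illustration of the rule as stated in these notes, not of the engine's actual code, and it assumes that /usr/tmp/DataStage/disknn belongs to the default disk pool, as the text implies. The function names are hypothetical.

```python
def scratch_for_buffering(scratch_disks):
    """Scratch disks used for buffering on one node.

    scratch_disks is a list of (path, pools) pairs. Disks in the "buffer"
    pool are preferred; if there are none, the default pool ("") is used.
    """
    in_buffer_pool = [path for path, pools in scratch_disks if "buffer" in pools]
    if in_buffer_pool:
        return in_buffer_pool
    return [path for path, pools in scratch_disks if "" in pools]


def scratch_for_other_use(scratch_disks):
    """Scratch disks in the default pool, used for other temporary storage."""
    return [path for path, pools in scratch_disks if "" in pools]


# Scratch disk resources for node0 in the example (default pool assumed
# for /usr/tmp, as described in the text).
NODE0_SCRATCH = [
    ("/usr/tmp/DataStage/disk00", [""]),
    ("/var/tmp/DataStage/disk00", ["buffer"]),
]

if __name__ == "__main__":
    print(scratch_for_buffering(NODE0_SCRATCH))
    print(scratch_for_other_use(NODE0_SCRATCH))
```

If the second entry were changed to the default pool, scratch_for_buffering would return both directories, matching the "however" case in the paragraph above.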
Other resource types may (indeed may need to) be defined, for example
resource DB2, resource INFORMIX, resource ORACLE and resource
For more information on configuration files and the configuration file
editor tool refer to the chapter on the parallel engine configuration file in
the DataStage Parallel Job Developer Guide.
The dynamic configuration file is dumped into the job log. Its "nodes" are
named according to the convention node_m_n where m is the machine
number and n is the partition number on that machine. For example, on a
2:2 configuration (two compute nodes and two partitions per compute
node), the configuration file would refer to node_1_1, node_1_2,
node_2_1 and node_2_2. (This numbering is an exception to the zero-based
numbering encountered in most aspects of parallel job execution.)
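The naming convention is easy to reproduce. This Python sketch simply generates the names for a given number of compute nodes and partitions per compute node; the function name grid_node_names is hypothetical and the sketch does not produce a usable configuration file.

```python
def grid_node_names(machines, partitions_per_machine):
    """Names in the node_m_n convention, numbering both parts from one."""
    return [
        f"node_{m}_{n}"
        for m in range(1, machines + 1)
        for n in range(1, partitions_per_machine + 1)
    ]


if __name__ == "__main__":
    print(grid_node_names(2, 2))
```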
If APT_GRID_ENABLE is not set to YES then DataStage will use the
static configuration file whose pathname is given in APT_CONFIG_FILE.
Sequences use a different mechanism. Sequences always execute on a
single machine. In a sequence you begin with an Execute Command
activity that runs sequencer.sh. This script takes the name of the new
configuration file and the numbers of compute nodes and partitions as
arguments, and returns a space-separated list consisting of two sequential
file host names (discussed in a later module) and the APT_CONFIG_FILE pathname.
Grid management software uses the term "node" to mean a machine in the grid.
DataStage uses the term "node" to mean a logical subset of available processing resources, as
described in the earlier section on configuration files.
A partitioning algorithm describes to DataStage how to distribute rows of
data among the processing nodes. The partitioning algorithm to be used is
specified on the input link to a stage.
If you do not explicitly specify a partitioning algorithm, then (Auto) is
filled in, which instructs DataStage to choose the appropriate partitioning
algorithm when the job design is converted for parallel execution [6].
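To give a feel for what a partitioning algorithm does, here is a Python sketch of three of the algorithms named in the summary of this module: round robin, modulus and hash. It is a simplified illustration only. In particular, zlib.crc32 is a stand-in chosen for repeatability and is not the hash function that DataStage uses; the function names and sample rows are hypothetical.

```python
import zlib


def round_robin(rows, node_count):
    """Deal rows to the nodes in turn; ignores the data values."""
    parts = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        parts[i % node_count].append(row)
    return parts


def modulus(rows, node_count, key):
    """Key-based: integer key value modulo the number of nodes."""
    parts = [[] for _ in range(node_count)]
    for row in rows:
        parts[row[key] % node_count].append(row)
    return parts


def hash_partition(rows, node_count, key):
    """Key-based: a hash of the key value chooses the node."""
    parts = [[] for _ in range(node_count)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % node_count].append(row)
    return parts


# Twenty sample rows sharing five distinct key values.
ROWS = [{"id": i % 5, "seq": i} for i in range(20)]

if __name__ == "__main__":
    print([len(p) for p in round_robin(ROWS, 3)])
    print([sorted({r["id"] for r in p}) for p in modulus(ROWS, 3, "id")])
    print([sorted({r["id"] for r in p}) for p in hash_partition(ROWS, 3, "id")])
```

Round robin gives the most even spread but scatters rows with the same key across nodes. The key-based algorithms keep every row with a given key value on the same node, at the cost of a possibly uneven spread.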
Table 2-1 DataStage Partitioning Algorithms
[6] For reasons that will become clear in a later module, this process is called composing
the score.
Configuring Partitioning
Every stage type that has an input link and which executes in parallel
mode can have the Partitioning properties of that input link configured.
On the input link properties page there is a tab captioned Partitioning.
Configuring Collecting
Every stage type that has an input link and which executes in sequential
mode can have the Collecting properties of that input link configured. On
the input link properties page there is a tab captioned Partitioning.
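Collecting is the reverse of partitioning: rows from all partitions are gathered into the single stream that a sequential stage reads. The Python sketch below illustrates two possible strategies, reading each partition in turn and merging partitions that are already sorted. It is a conceptual illustration with hypothetical function names, and it assumes for the merge case that every partition is already sorted on the key.

```python
import heapq
from itertools import chain


def collect_ordered(partitions):
    """All rows of the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def collect_sort_merge(partitions, key=None):
    """Merge already-sorted partitions into one sorted stream."""
    return list(heapq.merge(*partitions, key=key))


PARTITIONS = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

if __name__ == "__main__":
    print(collect_ordered(PARTITIONS))
    print(collect_sort_merge(PARTITIONS))
```

Only the merge preserves an overall sorted order, which matters when a sequential stage downstream relies on that order.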
For a virtual Data Set, this file has a name created from the stage
and link to which it relates, and a suffix of .v. Virtual Data Set
control files are deleted automatically when the job finishes. The
name of a virtual Data Set control file can be seen in the osh script
generated by compiling a DataStage parallel job.
For a persistent Data Set, the control file has a name given by the
developer, which conventionally has a suffix of .ds. Persistent
Data Set control files contain not only the record schema but also
contain the location of the physical files on each processing node
in which the data are stored. These files always reside in
directories mentioned in the resource disk portion of the
configuration file.
Let us contemplate, therefore, what a persistent (or virtual) Data Set looks like.
There are rows of data on each of the nodes, and a control file that outlines
the structure of each record.
Within that structure the final four fields are used to control such things as
preservation of sorted order in the Data Set. These are used only within
DataStage; they are not accessible to developers. However, the record
schema can be viewed.
Note that there is no mention in the control file, when we are considering a
virtual Data Set, of where the data files actually are. This is because they
are only in memory. For a persistent Data Set the control file will also
contain the pathnames of the data files for each segment of the Data Set.
These may be viewed in the Data Set Management tool.
Data are moved to and from persistent Data Sets using a Data Set stage,
which in turn compiles to a copy operator. That is, any sorting and
partitioning in the Data Set is preserved. This makes the Data Set the ideal
structure for intermediate storage of data between jobs. That is, if you
need one job to store data on disk for another job to pick up later, then the
persistent Data Set should be the structure in which to store them.
In DataStage, persistent Data Sets are read from or written to using the
Data Set stage. They can be manipulated (inspected, copied or deleted)
using the Data Set Management tool (from the Tools menu in Designer) or
using the orchadmin command.
File Set
A File Set is a similar structure to a Data Set.
A File Set has a control file (this time with a .fs suffix) and one or more
data files on each processing node. Data in a File Set preserve both sorted
order and partitioning.
The difference is that the data stored in a File Set are in human-readable,
or external format, and therefore must go through conversion from or to
the internal format used within the DataStage job. On the other hand, data
in a File Set are readily accessible to other applications.
The Lfile entries in the control file (on the right) are the physical locations
where data are stored on each node. The directory part of each of these
pathnames is the directory recorded as resource disk in the configuration
file for parallel execution (the one referred to by the environment variable
Data in each of these files are human readable; in this case the format is
comma-separated values with strings double-quoted, which can be seen in
the format portion of the record schema. These fields are not visible.
As with Data Sets, the final four fields are used to control such things as
preservation of sorted order in the File Set. The control file for a File Set
is clear text, so it is straightforward for other applications to read it (the
control file) to determine the locations of data.
In DataStage, File Sets are read from and written to using the File Set stage.
Pipeline Parallelism is like a conveyor belt in which buffers are used
between operators so that more than one row at a time may be passed.
Partition Parallelism involves splitting the data into row-based subsets and
treating each subset identically to the original design.
There are seven Partitioning Algorithms available in DataStage parallel
jobs: round robin, random, entire, hash, modulus, range and DB2. Hash
and modulus are key-based algorithms. Auto and same cause an
algorithm to be selected when the job score is composed.
Situations in which key-based partitioning is necessary to achieve correct
results include any situation in which keys must be compared (such as in
the Lookup, Join, Merge and Remove Duplicates stage types) or combined
(such as in the Sort and Aggregator stage types).
DataStage structures in which partitioned data can be stored include Data
Sets (in which data are stored in internal format), File Sets (in which data
are stored in external, or human-readable, format) and Lookup File Sets
(in which data are stored in an indexed internal format). We examine
these structures in more detail in a later module.
Further Reading
Types of Parallelism: Parallel Job Developer Guide, Chapter 2
Configuration Files
Data Sets
File Sets