DMExpress Help 9
Overview
DMExpress is a high performance data transformation product. With DMExpress you can
design, schedule, and control all your data transformation applications from a simple graphical
interface on your Windows desktop.
Data records can be input from many types of sources such as database tables, SAP systems,
Salesforce.com objects, flat files, XML files, pipes, etc. The records can be aggregated, joined,
sorted, merged, or just copied to the appropriate target(s). Before output, records can be filtered,
reformatted or otherwise transformed.
Metadata, including record layouts, business rules, transformation definitions, run history and
data statistics can be maintained either within a specific task or in a central repository. The
effects of making a change to your application can be analyzed through impact and lineage
analysis.
You can run your data transformations directly from your desktop, on any UNIX or Windows
server, or schedule them for later execution, embed them in batch scripts, or invoke them from
your own programs. With DMX-h, you can run DMExpress ETL jobs on a Hadoop/Spark
cluster, and you can accelerate the sort phases of existing MapReduce jobs in Hadoop.
The primary components of DMExpress installed on your desktop are: the Task Editor, used to define, specify, and test individual tasks; the Job Editor, used to group tasks together for immediate or later execution and to perform impact analysis; and the DMExpress Server window, used to access information about jobs and environment variables on a DMExpress server. Global Find is used to search through all of your jobs and tasks for specific values.
Task Editor
The DMExpress Task Editor is used to define, specify, and test a single task, the smallest independent unit of any data transformation. Once you have defined one or more tasks, use the DMExpress Job Editor to group the tasks together into jobs and submit them for immediate or later execution. Jobs and tasks that are running, completed, or scheduled to run can be monitored in the DMExpress Server window.
Your task definition is displayed as a tree structure within the main task editor window. The
initial state of the tree shows empty sources and targets and no operations - as you define your
task with specialized dialogs, the tree will change to reflect the state of your definition.
Branches of the tree can be expanded and collapsed in the conventional way, by clicking the plus
and minus icons within the tree.
To define your sources, you simply tell DMExpress where they are. For example, if they're in a flat file, you can browse to it via the usual file browser dialog; if they're in a database table, specify the connection information for the database (username and password, or ODBC source), select from the list of tables presented, and then identify the columns that you want to extract. Other types of sources can be specified in a similar manner.
Copy
The default type of task is a Copy, which copies the records, in the same order, from source to target.
Sort or Merge
If you want to change the order of the records in the target, change the task type to Sort, and then double-click the Sort Fields branch in the tree to specify what ordering you want. If you have files that are already pre-sorted and you want to merge them into a single sorted target, change the task type to Merge.
Aggregate
If you want to group your records and produce a single record per set (with totals, counts, averages, etc.), then change the task type to Aggregate, and double-click the Aggregate branch in the tree to specify your Group By and Summary fields.
Join
If you want to create new target records by joining together records from different sources based on a common field (for example, joining transaction records to customer records to produce customer-transaction records), then change the task type to Join, and double-click the Join Fields branch in the tree to specify the common field(s).
Transforming records
To combine your task with other tasks and run them together as a Job, select the Create Job item
from the Run menu - this will switch you over to the Job Editor.
Job Editor
The DMExpress Job Editor is used to group tasks (and any sub-jobs) together into a job,
define the run sequence between the included tasks/sub-jobs, and submit the job for immediate
or scheduled execution.
You can add the following types of nodes to the job graph:
DMExpress Task - to perform DMExpress data transformation processing
o Press the Add DMExpress Task button. Select a DMExpress task via the
browse dialog, then click on the position in the job graph where you want the task
to appear.
Extended Task - to perform non-DMExpress processing using pre-defined command-based tasks
o Select a task from the Add Extended Task... menu, if any. Specify the command
arguments in the Extended Task dialog, then click on the position in the job graph
where you want the task to appear.
Custom Task - to perform non-DMExpress processing
o Press the Add Custom button. You are then prompted for the path of the
custom task executable and any arguments that you may want to pass to it when it
is run.
DMExpress Job - to execute another DMExpress job as part of this job
o Use the Add DMExpress Job command under the Edit menu.
To view the details within a task, you can expand it (right click, Expand) in place to see a more
detailed view as shown below, or you can double-click the task to open it in the Task Editor.
Sequence links that indicate a purely sequential flow of execution can be created
between tasks and/or sub-jobs without a data flow between them. Such links are useful,
for example, to determine the run sequence for custom tasks.
A MapReduce (MR) link is a data-flow link that designates the job as a full user-defined
MapReduce job and splits the job graph between the map side and the reduce side. An MR
link can also be specified as a direct data flow link to enable direct data flow when the job
is run outside of Hadoop.
To create a data-flow or sequence link, press the Add Run Sequence button, click on the
earlier task, and drag the link to the later task.
To convert data-flow links to direct data flow links, use one of the following commands:
o Use Direct Data Flow
o Use Direct Data Flows Everywhere
To create a MapReduce link, press the Add MapReduce Sequence button, click on what
will be the last task on the map side, and drag the link to what will be the first task on the
reduce side.
To convert a data-flow link to a MapReduce link, right click on it and select Use
MapReduce Sequence.
To delete an existing link, select the link and press the Delete button.
You can add comments directly to the job graph, where they display as movable and resizable
comment nodes. The comment text is included in the job reports and difference reports that you
generate.
Tooltip
The tooltip for the comment node displays all of the comment text.
For existing DMExpress task nodes and subjob nodes, the tooltip displays the task path and all
task properties and the subjob path and all job properties, respectively.
You can edit comments directly on the job graph and delete them from the job graph.
To modify an existing comment, click inside the comment node. An edit box surrounds the
comment node and all of the comment text is selected.
To delete the comment, right-click on the comment node and select Delete.
You will be prompted to identify the server on which you want the job to run, which can be
either this Windows system or a network-accessible UNIX or Windows DMExpress server.
You can also run your job from the command line using the dmxjob command.
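For example, a saved job might be run with a command like the following (a minimal sketch; the /RUN option is the basic dmxjob run option, the job file path is illustrative, and additional options such as scheduling and server credentials are described in the dmxjob reference):
dmxjob /RUN C:\DMExpressJobs\ProcessOrders.dxj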
DMExpress Server
The DMExpress Server program is used to access information about a DMExpress server:
the status of jobs run or scheduled on the server via the DMExpress Task Editor or Job
Editor,
environment variables defined for a user on the server,
SMTP server configuration for a user on the server,
Secure FTP key sets defined for a user on the server, and
calendars defined on the server to skip runs of scheduled jobs.
The DMExpress Server window can be accessed directly from the DMExpress program group on
your desktop, or from within the Task Editor or the Job Editor by selecting the corresponding menu command.
The DMExpress Server program, along with the help, can be installed and run separately on a
Windows system, without a full DMExpress installation. No license key is required. To install,
run the executable DMExpressServerConsoleSetup.exe on the system where you want the
DMExpress Server program to run. This executable is located in the directory where DMExpress
is installed, C:\Program Files\DMExpress by default.
DMX-h
DMX-h is the Hadoop-enabled edition of DMExpress, providing the following Hadoop
functionality:
Cloudera CDH 5.2, 5.3, 5.4, 5.5, 5.6, 5.7 - YARN (MRv2)
Hortonworks Data Platform (HDP) 2.1, 2.2, 2.3, 2.4 - YARN
Apache Hadoop 2.x (2.2 and later) - YARN
MapR, M5 and M7 editions only (now respectively termed Community Edition and
Enterprise Edition)
o 5.0, 5.1 - YARN
o 4.0.1, 4.0.2, 4.1 - YARN
o 3.1.1 - MRv1
Pivotal HD 3.0 - YARN
BigInsights 4 - YARN
Spark
Both DMX-h sort acceleration and ETL processing use a proprietary Java implementation of the
sort plugin interface made available through Syncsort's contribution to the Hadoop open source
community. The plugin enables the Hadoop framework to invoke the DMExpress engine during
the map-sort and reduce-merge phases of MapReduce processing. For DMX-h ETL map-only
jobs, the DMExpress engine is invoked during the mapper phase.
DMX-h sort acceleration improves the performance of existing Hadoop jobs by speeding up the
sort phases of the processing on both the map and reduce sides. Based on the MapReduce key
type, two DMExpress sort tasks are auto-generated, one each for map-sort and reduce-merge,
and they replace the native framework sort. As an additional performance boost, the map-sort is
typically replaced with a "null sort" except in specific cases, thereby eliminating an unnecessary
sort step.
DMX-h ETL processing in MapReduce allows DMExpress ETL jobs to run as the map-sort and
reduce-merge phases of a MapReduce job. These jobs can do more than just a simple sort,
including copy, merge, join, aggregation, reformat, filtering, and partitioning, effectively
combining several Hadoop processing phases into one job. DMX-h ETL can also be used for
map-only applications. An instance of DMX-h is run on each mapper (and reducer for full
MapReduce jobs) to execute the respective portion of the ETL job on the respective subset of
data.
When running on Spark, the main DMX-h job and any subjobs are broken up into Spark
applications, each of which is further broken up into Spark jobs, where one or more DMX-h
execution units corresponds to a Spark job. Each Spark job is broken up into a directed acyclic
graph (DAG) of Spark stages, each of which uses Resilient Distributed Datasets (RDDs) to
process the blocks of data. Some of those RDDs are executed by Spark, and some by
DMExpress.
Within a Spark job, data can get shuffled more than once. The processing before a shuffle is
similar to the map-side in MapReduce, where the number of partitions in Spark is like the
number of mappers in MapReduce, and the processing after a shuffle is similar to the reduce-side
in MapReduce, where the number of partitions in Spark is like the number of reducers in
MapReduce.
DMX-h ETL
DMX-h allows you to develop an ETL application entirely in the DMExpress GUI and run it
seamlessly in a cluster framework, either Hadoop MapReduce or Spark.
If you are using Cloudera's Hadoop distribution, data governance of DMX-h jobs run in the CDH
cluster is available via Cloudera Navigator, Cloudera’s proprietary governance solution for
Apache Hadoop. Part of Cloudera Enterprise, Cloudera Navigator provides metadata search,
policies, and auditing to secure, govern, and explore the data that lands in Hadoop.
DMX-h jobs run in a Hadoop cluster are tagged with "DMX-h" and "Syncsort" for filtering in
Cloudera Navigator. Simply enter the DMX-h and/or Syncsort tags in the Filters pane (or check
one or both of those checkboxes in the Tags list if previously added) to filter on DMX-h jobs.
User Properties
The following user properties are set by DMX-h to identify the jobs:
DMXhMainJob - this is the name of the top level DMX-h job, which may be divided into
more than one Hadoop job depending on the DMX-h subjob(s)/task(s) included in the
main DMX-h job
DMXhJobName - this is the name of the Hadoop job, which includes the DMX-h top
level job name followed by the names of the DMX-h subjob(s) and task(s) included in
this Hadoop job
For example, a DMX-h job named inventory.dxj that includes aggregate tasks
summarizeStores.dxt and summarizeStates.dxt will be broken into two MapReduce jobs, with the
above user properties set as follows for each Hadoop job:
Job 1
o DMXhMainJob=inventory.dxj
o DMXhJobName=DMXh-inventory.summarizeStores
Job 2
o DMXhMainJob=inventory.dxj
o DMXhJobName=DMXh-inventory.summarizeStates
Requirements
As part of DMX-h, the Spark Mainframe Connector requires one of the supported Hadoop
distributions. Additionally, the following configuration/connectivity settings are required on the
ETL node:
The DMX-h Spark Mainframe Connector can be launched in both client and cluster deploy
modes in a Spark cluster of any type, standalone or YARN. See the Spark User Guide for details
on the common spark-submit script options.
For client deploy mode, the spark-submit option --driver-class-path may be required to specify
the location of dmxspark.jar. For example:
--driver-class-path <dmx_install_dir>/lib/dmxspark.jar
The DMX-h Spark Mainframe Connector can transfer mainframe data in one of the following
modes: binary mode, in which datasets are transferred in their original mainframe form (see
--binary-transfer), or text mode, in which records are converted to delimited text using a COBOL
copybook. The connector accepts the following arguments:
Specify the machine name of the server from which the mainframe datasets are to be transferred.
--username <user name>
Specify the user name with which to log into the server machine.
--P
Prompt the user to enter the password securely through the console. This only works in the client
deploy mode, in which the connector is run locally as an external client.
--password <user password>
Specify the password for the given user name on the command line.
--dmx-home <dmx_install_dir>
Specify the full path of the DMX-h installation directory in the Spark cluster. If not specified, the
same path as on the ETL node from which the spark-submit command was invoked will be used.
--dataset <dataset name>
Specify the dataset(s) with a fully-qualified name. The hierarchy of directories and the
representation of dataset names depends on the mainframe's FTP configuration options.
Typically, Partitioned Data Set (PDS) and Partitioned Data Set Extended (PDSE) are presented
as folders, and datasets outside or inside of PDS/PDSE are presented as files. High-level
qualifiers, which are the first (leftmost) qualifier in a dataset name, are represented as folders too.
Following are variations on fully qualified names, all of which are supported by the DMX-h
Spark Mainframe Connector:
To specify all files in a folder on the mainframe system, <dataset name> can be a
partitioned dataset or high-level qualifier. For example:
o Specify all files in the partitioned dataset SOMEUSER.SOMEPDS as:
--dataset SOMEUSER.SOMEPDS
o Specify all files with the high-level qualifier SOMEUSER as:
--dataset SOMEUSER
To specify all files in a folder on a Unix/Linux system, <dataset name> needs to be a full
path to a folder. For example, specify all files in the folder /somedir as:
--dataset /somedir
To specify some file(s) in a folder on the mainframe system, <dataset name> should
consist of the folder name (a partitioned dataset or high-level qualifier) followed by a file
name enclosed in parentheses, all enclosed in quotes. Depending on the mainframe's FTP
configuration, a wildcard may be used in the file name to specify a set of files in the
folder. For example:
o Specify the files with the high-level qualifier SOMEUSER that start with ORDER
as:
--dataset 'SOMEUSER(ORDER*)'
o Specify the files in the partitioned dataset SOMEUSER.SOMEPDS that end with
.DAT as:
--dataset 'SOMEUSER.SOMEPDS(*.DAT)'
o Specify the files in the folder /somedir on a Unix/Linux system that end with .dat as:
--dataset /somedir/*.dat
--binary-transfer
The datasets are imported to HDFS in exactly the same form as on the mainframe system,
creating a golden (master) copy of your mainframe data on HDFS.
The --cobol-copybook, --jobcard, and --source-compressed arguments are not applicable.
VSAM files are not supported in binary mode.
For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEDATA in binary mode and stores them in HDFS:
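As a rough sketch only (the connector's main class and the server connection arguments are placeholders, and the credentials are illustrative; see the DMX-h documentation for the exact synopsis):
spark-submit --master yarn --deploy-mode client \
  --driver-class-path <dmx_install_dir>/lib/dmxspark.jar \
  --class <connector main class> <dmx_install_dir>/lib/dmxspark.jar \
  <server connection arguments> --username someuser --P \
  --dataset SOMEUSER.SOMEDATA --binary-transfer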
--cobol-copybook <copybook file>
Specify the COBOL copybook file which contains the record layout of the mainframe data,
where <copybook file> can be either a mainframe file specified with a fully-qualified name, or a
local file with a full UNIX-like pathname.
The COBOL copybook defines the layout of mainframe data records, and all fields in the record
will be converted according to the data type:
All number fields will be converted to DMX-h edited numeric data type.
All character fields will be converted from EBCDIC to UTF-8 encoding.
The fields in the converted record will be delimited by a null byte (\0), the default field delimiter,
unless specified otherwise using the --field-delimiter argument. Presence of the delimiter
character within the fields will result in incorrect parsing.
For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEDATA on somehost to HDFS as text files using the COBOL copybook file
SOMEUSER.SOMEDATA.CPY as the metadata file:
Specify the record layout of the mainframe data, which is defined in the COBOL copybook
specified by the --cobol-copybook argument, where <recordlayout name> is the name of the
specified record layout. If not specified, the data-name in the first level 01 data description entry
of the COBOL copybook will be used.
--jobcard <jobcard>
Specify the JCL job card when transferring VSAM files, where <jobcard> is a double-quoted
string containing the information required to run jobs on the mainframe system.
For example, the following invocation transfers a list of VSAM files of the form
SOME*.VSAM.DATA in the folder SOMEUSER to HDFS as text files, using the COBOL
copybook SOMEUSER.SOMEVSAM.CPY as the metadata file, and running the indicated JCL
job card:
Specify the target path in HDFS, where <target path> can be either the full path or a relative path
to the user’s home directory. If not specified, the target directory is the directory with the name
of the source dataset (which must be a valid directory name) in the user’s home directory.
--source-compressed
When transferred in text mode, the data will be uncompressed and converted by DMX-h before
writing to the target. DMX-h supports bzip2 and gzip compression formats. Since the record
format of compressed data is most likely unknown, you can specify the record format with the --
source-recordformat and --source-recordlength arguments.
This argument is not applicable in binary mode; the data will remain compressed in HDFS.
--source-recordformat <recordformat>
Specify the record format of the mainframe data, where <recordformat> is one of the following
mainframe record formats supported by DMX-h:
This option is required when importing files from Unix/Linux systems, where, unlike the
mainframe system, the record format cannot be automatically retrieved from the system.
--source-recordlength <recordlength>
Specify the record length of the mainframe data when the record format is specified with the --
source-recordformat argument, where <recordlength> is:
the record length for fixed record format, in which case the argument is required
the maximum record length for variable record format, in which case the argument is
optional, with a default value of 32,756 bytes
For example, the following invocation transfers all files from the folder /somedir on a Linux host
somehost to HDFS as text files using the COBOL copybook file
/SOMEUSER/SOMEDATA.CPY as the metadata file, with a record format of
MAINFRAMEFIXED and a length of 80:
--field-delimiter <delimiter>
Specify the character with which fields are separated in the transferred records, where
<delimiter> is a single-byte ASCII character supported by DMX-h.
This option is only applicable in text mode transfer. If not specified, the fields in the records are
separated with a null character, '\0'.
Requirements
As part of DMX-h, the Sqoop Mainframe Import Connector requires one of the supported
Hadoop distributions, as well as the following software:
Apache Sqoop version 1.4.6 or later (1.4.5 on Cloudera CDH 5.2 or later), which includes
the Sqoop mainframe import tool
Hive version 0.13.0 or later may be needed when importing data into HCatalog
In addition, the following configuration/connectivity settings are required on the ETL node:
FTP or FTPS connectivity to the mainframe (FTP/FTPS are included with Hadoop, but
must be running on the mainframe)
the DMX-h bin directory, <dmx_install_dir>/bin, must be included in the path
the appropriate dmxhadoop jar file, <dmx_install_dir>/lib/dmxhadoop_<type>.jar, must
be specified in the Java classpath in one of the following ways, where <type> is either
mrv1 or mrv2 (for MapR, the jar file is dmxhadoop_<type>_mapr.jar):
o Edit the hadoop-env.sh file in the Hadoop conf directory (such as
/etc/hadoop/conf) and add the dmxhadoop jar file to the HADOOP_CLASSPATH
as follows:
export HADOOP_CLASSPATH=<dmx_install_dir>/lib/dmxhadoop_<type>.jar:$HADOOP_CLASSPATH
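The connector is then invoked through the Sqoop mainframe import tool; as a sketch of the general command shape (not the exact synopsis):
sqoop import-mainframe <tool-arguments> -- <connector-arguments>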
where:
<tool-arguments> are the required mainframe import tool arguments
<connector-arguments> are optional arguments accepted by the DMX-h Sqoop
Mainframe Import Connector, specified after an extra -- at the end of the command line
Specify the password for the given username in one of these ways:
The --password-file argument is the most secure and recommended way, where
<filename> is the full path to a file with 400 permissions in your home directory in which
the password is saved.
The -P argument specifies to read the password from a console prompt.
The --password argument specifies the password on the command line, which is not
secure.
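For example, a password file with the required permissions might be created as follows (the file name is illustrative):
echo -n "MyMainframePassword" > $HOME/.mainframe.password
chmod 400 $HOME/.mainframe.password
The file can then be referenced on the command line with --password-file $HOME/.mainframe.password.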
--dataset <datasetname>
Specify the dataset(s) with a fully-qualified name. The hierarchy of directories and the
representation of dataset names depends on the mainframe's FTP configuration options.
Typically, Partitioned Data Set (PDS) and Partitioned Data Set Extended (PDSE) are presented
as folders, and datasets outside or inside of PDS/PDSE are presented as files. High-level
qualifiers, which are the first (leftmost) qualifier in a dataset name, are represented as folders too.
Following are variations on fully qualified names, all of which are supported with the DMX-h
Sqoop Mainframe Import Connector, but only the first of which – all files in a folder on the
mainframe – is supported in the open-source mainframe import tool:
o Specify all files with the high-level qualifier SOMEUSER as:
--dataset SOMEUSER
o Specify all files in the folder /somedir on a Unix/Linux system as:
--dataset /somedir
o Specify the files with the high-level qualifier SOMEUSER that start with ORDER as:
--dataset 'SOMEUSER(ORDER*)'
o Specify the files in the partitioned dataset SOMEUSER.SOMEPDS that end with
.DAT as:
--dataset 'SOMEUSER.SOMEPDS(*.DAT)'
o Specify the files in the folder /somedir on a Unix/Linux system that end with .dat as:
--dataset /somedir/*.dat
--fields-terminated-by <char>
Specify the field separator character for input (unless overridden by --input-fields-terminated-by)
and output fields; default is comma (,). This argument only applies for text mode transfer.
--input-fields-terminated-by <char>
Specify the field separator character for input fields if it needs to be different from the one
specified by --fields-terminated-by. This argument only applies for text mode transfer.
See the Sqoop User Guide (version 1.4.6 or later) on the Sqoop site for details on the Sqoop
mainframe import tool arguments.
Sqoop Mainframe Import Connector Arguments
The DMX-h Sqoop Mainframe Import Connector can transfer mainframe data in one of the
following modes: binary mode, in which datasets are transferred in their original mainframe form
(see --binary-transfer), or text mode, in which records are converted to delimited text using a
COBOL copybook.
Following is a list of all arguments supported by the DMX-h Sqoop Mainframe Import
Connector:
--binary-transfer
The datasets are imported to HDFS in exactly the same form as on the mainframe system,
creating a golden (master) copy of your mainframe data on HDFS.
The --cobol-copybook, --jobcard, --source-compressed, and --target-fileformat arguments
are not applicable.
VSAM files are not supported in binary mode.
For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEPDS in binary mode and stores them in HDFS:
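As a rough sketch only (the host name and credentials are illustrative, and the connector argument follows the extra -- as described above):
sqoop import-mainframe --connect somehost \
  --username someuser --password-file $HOME/.mainframe.password \
  --dataset SOMEUSER.SOMEPDS \
  -- --binary-transfer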
--cobol-copybook <copybook file>
Specify the COBOL copybook file which contains the record layout of the mainframe data,
where <copybook file> can be either a mainframe file specified with a fully-qualified name, or a
local file with a full UNIX-like pathname.
The COBOL copybook defines the layout of mainframe data records, and all fields in the record
will be converted according to the data type:
All number fields will be converted to DMX-h edited numeric data type.
All character fields will be converted from EBCDIC to UTF-8 encoding.
The fields in the converted record will be delimited by a comma (,), the default Sqoop field
delimiter, unless specified otherwise by the Sqoop tool arguments --fields-terminated-by or --
input-fields-terminated-by. Presence of the delimiter character within the fields will result in
incorrect parsing.
For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEPDS on somehost to HDFS as text files using the COBOL copybook file
SOMEUSER.SOMEPDS.CPY as the metadata file:
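As a rough sketch only (the host name, credentials, and target directory are illustrative):
sqoop import-mainframe --connect somehost \
  --username someuser --password-file $HOME/.mainframe.password \
  --dataset SOMEUSER.SOMEPDS --target-dir /data/somepds \
  -- --cobol-copybook SOMEUSER.SOMEPDS.CPY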
Specify which mainframe data record layout in the specified COBOL copybook to use, where
<recordlayout name> is the name of the record layout to use. If not specified, the data-name in
the first level 01 data description entry of the COBOL copybook will be used.
--dmx-home <dmx_install_dir>
Specify the full path of the DMX-h installation directory in the Hadoop cluster. If not specified,
the same path as on the ETL node from which the sqoop command was invoked will be used.
--jobcard <jobcard>
Specify the JCL job card when transferring VSAM files, where <jobcard> is a double-quoted
string containing the information required to run jobs on the mainframe system.
See Accessing files on mainframe servers for details on how DMX-h accesses VSAM files and
how to specify a required job card.
For example, the following invocation transfers a list of VSAM files of the form
SOME*.VSAM.DATA in the folder SOMEUSER to HDFS as text files, using the COBOL
copybook SOMEUSER.SOMEVSAM.CPY as the metadata file, and running the indicated JCL
job card:
--source-compressed
When transferred in text mode, the data will be uncompressed and converted by DMX-h before
writing to the target. DMX-h supports bzip2 and gzip compression formats. Since the record
format of compressed data is most likely unknown, you can specify the record format with the
--source-recordformat and --source-recordlength arguments.
This argument is not applicable in binary mode; the data will remain compressed in HDFS.
--source-recordformat <recordformat>
Specify the record format of the mainframe data, where <recordformat> is one of the following
mainframe record formats supported by DMX-h:
If not specified, the record format is retrieved from the mainframe system. If the record format is
unknown, mainframevariable is assumed.
--source-recordlength <recordlength>
Specify the record length of the mainframe data when the record format is specified with the --
source-recordformat argument, where <recordlength> is:
the record length for fixed record format, in which case the argument is required
the maximum record length for variable record format, in which case the argument is
optional, with a default value of 32,756 bytes
--target-fileformat <fileformat>
Specify the file format for storing data in HDFS, where <fileformat> can be one of the following
values:
avrofile - to store data as Apache Avro files, a binary format with serialized Avro objects
and a header containing metadata
parquetfile - to store data as Apache Parquet files, a columnar format for very efficient
compression and encoding schemes
sequencefile - to store data as Hadoop sequence files, a binary key/value pair format
If not specified, data transferred in text mode will be stored as delimited text files.
This argument is not applicable in binary mode, as the data will remain in the mainframe format.
All mainframe numeric data types, which are converted to displayable decimal numbers,
are mapped to Java BigDecimal, Hive double, or HCatalog decimal types.
o The HCatalog decimal type supports specification of precision and scale as of
Hive version 0.13.0; the DMX-h Sqoop Mainframe Import Connector specifies
the maximum precision as 38 and the maximum scale as 30.
Character data type is mapped to string types in Java and Hive, and varchar in HCatalog.
o For HCatalog varchar type, the DMX-h Sqoop Mainframe Import Connector
specifies the maximum length as 65535.
The default mapping can be overridden using the following Sqoop arguments:
o --map-column-java (for changing Java mapping)
o --map-column-hive (for changing Hive or HCatalog mapping)
All non-alphanumeric characters in the field names of a COBOL record layout are
converted to an underscore ('_').
The structure of composite fields, including array fields, is not kept.
o Each subfield of a composite field is treated as a simple field itself.
o The subfield name is constructed in the form: <composite field name>_<subfield
name>
o The elements of an array field are referenced individually.
o The array element name is constructed in the form: <array field name>_<subscript1>_..._<subscriptN>
o The subscripts for multi-dimensional arrays are appended in sequence.
o If an array field is a composite field, the subfield name is added at the end.
01 RECORD.
The record consists of a simple field FIELD1 and an array field FIELD2. The array field
FIELD2 has two subfields, SUBFIELD1 and SUBFIELD2. The subfield SUBFIELD1 of
the first element of FIELD2 is represented as FIELD2_1_SUBFIELD1.
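For illustration, a copybook fragment matching this description might look like the following (the PIC clauses, field sizes, and OCCURS count are assumptions):
01 RECORD.
   05 FIELD1        PIC X(10).
   05 FIELD2        OCCURS 3 TIMES.
      10 SUBFIELD1  PIC X(5).
      10 SUBFIELD2  PIC S9(7)V99 COMP-3.
With this layout, SUBFIELD1 of the first element of FIELD2 is referenced as FIELD2_1_SUBFIELD1, and SUBFIELD2 of the third element as FIELD2_3_SUBFIELD2.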
The plain text format of DTL can be advantageous over the DMExpress interactive development
environment (IDE) in certain scenarios. Since syntax can be created manually via a text editor or
can be created programmatically, DTL offers great flexibility. Some situations in which DTL can
be leveraged include when:
The DMExpress Task and Job Editors, however, are ideal for situations not requiring dynamic
applications and for beginner DMExpress users as they offer an easy-to-use and intuitive
environment. The IDE provides helper dialogs, data and metadata visualization, context-sensitive
help, and a host of other features for easy application development.
Additionally, you can leverage both DTL and the DMExpress Task and Job Editors to meet your
requirements. You can create tasks and jobs through the IDE, for example, and execute them as
tasks and subjobs from within DTL jobs. This approach provides you with the flexibility to
develop the flows between tasks and subjobs programmatically. Conversely, you can create tasks
and jobs using DTL and either import them as DMExpress tasks and jobs or execute them as
tasks and subjobs from within a custom task defined in the IDE.
For a complete reference on DTL syntax and command options, see the DMExpress Data
Transformation Reference Guide:
From the Job Editor and Task Editor, Import DTL allows you to import a specified DTL
command options file into a new job or task.
From a Windows command line shell, the DTL import syntax allows you to import any
number of DTL command options files into new jobs and tasks.
Differences exist in the way jobs and tasks run when using DTL and the IDE. At runtime and
design time, the dynamism with which data flows are created through DTL or the IDE provides a
point of comparison:
At runtime, a DTL job exhibits dynamism on the task and subjob level.
Through DTL, you establish data flows between tasks and subjobs; no requirement exists
for identifying data flows at the source and target level. At runtime, DTL dynamically
determines the data flow from the target of one task to the source of another and
automatically creates the data flows between the tasks.
At design time, the IDE exhibits dynamism at the source and target level.
In the Job Editor, you draw and dynamically establish data flows from the target of one
task to the source of another task. The target and source names for the data flows are
saved to the DMExpress job file. At runtime, DMExpress accesses the DMExpress job
file to reference the static target and source names for the data flows.
Upon DTL import, the Job Editor dynamically creates data flows between tasks based on source
and target names and saves the source and target names to the DMExpress job file. The runtime
dynamism of creating data flows is not available through the Job Editor.
IDE import
Job Editor
Upon selecting Import DTL from the File menu of the Job Editor, the Browse dialog displays.
After selecting a DTL job command options file through the Browse dialog, the imported DTL
syntax displays on the Job Editor as a graph in a new unsaved job.
Each DTL subtask/subjob with a valid path is converted into a DMExpress task/job file with a
file extension of .dxt/.dxj: <task_file1...n>.dxt, <job_file1...n>.dxj. These DMExpress task/job
files are stored in the same location as the original DTL task/job command options file.
If you attempt to close the newly-created, top-level job without saving, you are prompted to save
the job.
Task Editor
Upon selecting Import DTL from the File menu of the Task Editor, the Browse dialog displays.
After selecting a DTL task command options file through the Browse dialog, the imported DTL
syntax displays as expanded branches on the Task tree in a new unsaved task.
If you attempt to close the newly-created task without saving, you are prompted to save the task.
From a Windows command line shell, you can import a single DTL job or task command options
file or multiple DTL job or task command options files in batch. Expansion of wildcard
characters in the specified DTL command options file names is enabled by default. To prevent
wildcard expansion, specify the optional noWildcardExpansion parameter.
Job
Task
Depending on the success of the DTL import, the following types of messages display on the
command line: success, warning, or error.
Related concepts
Data Transformation Language
DTL pre-import considerations
DTL import command options support
Import DTL command
Job Editor
Task Editor
Browse dialog
Environment variables
Environment variables in your DTL command options files may be ones that can be set both in
the interactive development environment (IDE) and through DTL, or ones that can be set through
DTL only. Depending on which kind you use, you may want to update your environment prior to
DTL import.
Certain task and job elements can be specified as environment variables in dialogs within the
Task and Job Editors. Once defined, these environment variables are available to the DMExpress
environment. These same environment variables can also be specified in DTL command options
files.
Upon DTL import, if an environment variable that is specified in a DTL command options file is
also set in the environment, the environment variable value from the environment is inlined and
replaces the corresponding environment variable in the DTL command options file.
To prevent this replacement and retain the environment variables in imported DTL command
options files, clear the environment of these environment variables from either the Windows user
interface (UI) or from the command line prior to DTL import:
Windows UI: From the Windows Control Panel, navigate to the environment variables:
1. Select System > Advanced System Settings.
2. At the Advanced tab of the System Properties window, select Environment
Variables.
a. In the System variables section of the Environment Variables window,
select an environment variable, which is also included in the DTL
command options file, and select Delete.
b. Repeat this selection and deletion process for each of the applicable
environment variables.
Command line: Export the environment variables with blank values to the DMExpress
environment. Launch DMExpress from the same command line session.
After successful DTL import, the resulting DMExpress task or job file contains the environment
variables from the DTL command options file. These environment variables allow for dynamic
substitution of values at design time and runtime.
DTL provides significant flexibility in the definition of environment variables, whose values can
include, but are not limited to, full or partial syntax of DTL command options and keywords.
These environment variables cannot be set in dialogs within the Task and Job Editors and are not
supported in the IDE.
To retain the values of these environment variables in the imported DTL command options files,
export the environment variables and their values from the command line to the DMExpress
environment prior to DTL import.
Upon DTL import, the exported DTL environment variable values from the environment are
inlined and replace the corresponding environment variables in the DTL command options file.
The resulting DMExpress task or job file contains the static values of the DTL environment
variables, which can be used within the IDE; it does not contain the original DTL environment
variables, which could not be parsed for use within the IDE.
Relative paths
When referencing a relative path within a DTL command options file, the relative path is based
on the operating system’s current working directory. If a relative path to a file within a task or
job is used with the /DATADICTIONARY, /INFILE with XML or header layouts, or /TASK
options, for example, and the referenced file must be processed as part of the import, the current
working directory must be set to the base path from which all relative paths can be resolved.
Prior to DTL import, set the current working directory to the base path either through the IDE or
at the command line:
IDE: From either the Open or Save dialog on the Task or Job Editor, navigate to the
directory that will serve as the base path and either open or save any task or job file.
Upon file open or save, the current working directory is set to the specified base path.
Command line: Navigate to the directory that will serve as the base path. Launch
DMExpress. The current working directory is set to the base path from which you
launched DMExpress.
Supported DTL job command options
/DEFAULTFLOW
Note: The Job Editor does not support the concept of a global flow setting and, therefore, does
not support the global /DEFAULTFLOW option.
/FLOW
Upon DTL import, the Job Editor imports each data flow as a non-direct data flow by default.
Only when the DIRECT keyword is specified in the /FLOW option does the Job Editor import
the specified, individual data flow as a direct data flow.
After the DTL import, any additional data flow must be specified as DIRECT to avoid writing an
intermediate file to disk when connecting the single file target of one task to the single file source
of another task.
/SUBJOB, /TASK
Upon DTL import, the Job Editor converts each of the subjobs and tasks outlined in the DTL
command options file and adds them to the new DMExpress job.
Supported DTL task command options
Unless otherwise specified, all of the arguments for each of the following DTL task command
options are supported:
For a comprehensive list of DTL command options, see the DMExpress Data Transformation
Reference Guide:
While the DMExpress Job and Task Editors enable the effective execution of a single data
transfer, DMXDFNL enables data transfers en masse. The transfer of data from a database
schema to HDFS, for example, would traditionally require the creation of individual tasks
through the Task Editor for each and every table in the database schema. With DMXDFNL, you
can circumvent the manual creation of hundreds, if not thousands, of individual data transfer
tasks by running a single command at the command line. By facilitating the execution of
automatically generated DTL tasks through the DMExpress engine, DMXDFNL seamlessly
optimizes the mass transfer of data from your enterprise data environment to HDFS, providing
you with the capability to populate your Hadoop data lake using a single command.
Things to know
Requirements
Packaging considerations
Credentials for all source database connections and for all target database or target file
connections.
Order of priority in which generated DTL tasks are run
Source databases and the data that is eligible for extraction
Target databases or target files, to which the extracted volume of data is transferred
While you can choose to extract and transfer all of the data within a database schema or schemas,
you can fine-tune the data transfer definitions specified within the DMXDFNL configuration file
to meet your needs. You can specify the data types to exclude from an extraction, and the tables
to either exclude from or include in an extraction. In addition, you can specify whether a job runs
on a Hadoop-determined node in the Hadoop cluster.
DMXDFNL also allows for the execution of existing DMExpress task files, or DMExpress job
files, or both. Through strings defined in JSON objects and lists, you specify the following
requirements for each execute existing task job definition:
To ensure that an accurately populated DMXDFNL configuration file with the correct syntax is
available for DMXDFNL, familiarization with the members and elements that comprise the
JSON objects and lists is required. For an outline of the file components and details relevant to
each member and element, see DMXDFNL configuration file.
Process overview
Upon accessing the DMXDFNL configuration file, DMXDFNL does the following:
In parsing the DMXDFNL configuration file, DMXDFNL identifies the connections to the
source databases and to the target databases or target files for each specified DBMS. DMXDFNL
generates DTL tasks and identifies the priority in which the tasks are run and the number of tasks
that can be run in parallel. DMXDFNL identifies whether generated DTL task files are saved,
and if saved, the directory to which they are written. Unless otherwise specified, DMXDFNL
executes not only the generated DTL tasks through the DMExpress engine, but also executes all
existing DMExpress task and job files that are specified for execution. The progress of task and
job execution, warnings or errors, and run statistics are logged to standard output and display on
the system console.
Enhancement requests
Syncsort is committed to continued DMXDFNL development. If you have a requirement that
could be met in extending DMXDFNL capabilities, contact Syncsort technical support,
dmxsupport@syncsort.com.
connections
Through the required connections object, you can specify the following two types of
connections:
o database
o hdfs
job
The required job object specifies job level settings and subunits, within which the transfer
of data is specified. The job object members include the following:
o dtl_export_dir
o no_execute
o default_parallel
o subunit_order
o subunits
DMXDFNL configuration file example
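The following is a rough, illustrative sketch assembled from the objects and members described in this section; the alias names, credentials, paths, and the exact nesting shown are assumptions and will differ in a real configuration file:
{
  "connections" : [
    { "database" : {
        "alias" : "src_oracle",
        "dbms" : "oracle",
        "database" : "ORCL",
        "username" : "dmxuser",
        "password" : "$DB_PASSWD"
    } },
    { "hdfs" : {
        "alias" : "target_hdfs",
        "namenode" : "namenode.example.com"
    } }
  ],
  "job" : {
    "dtl_export_dir" : "/home/dmxuser/generated_dtl",
    "no_execute" : false,
    "default_parallel" : 2,
    "subunit_order" : [
      { "alias" : "sales_transfer", "parallel" : 4 }
    ],
    "subunits" : {
      "sales_transfer" : [
        {
          "source" : { "connection" : "src_oracle", "schema" : "SALES" },
          "target" : { "connection" : "target_hdfs", "dir" : "/data/sales" }
        }
      ]
    }
  }
}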
alias
The required alias member is a string that specifies the name of the database connection
and how it is referenced in the data transfer. This alias is specified in either the source
database or target database connection strings.
"alias" : "<name_of_database_connection>"
dbms
The required dbms member is a string that specifies the database management system to
which DMXDFNL connects through the database connection.
"dbms" : "<db2|db2zos|hcatalog|hive|oracle|sqlserver|teradata>"
database
The required database member is a string that specifies the name of the data source to
which DMXDFNL connects. The value component of the name/value pair must be a
valid identifier, which is based on the DBMS type:
authtype
The optional authtype member is a string that specifies the method used to authenticate
the database. Specify an authentication method supported by the specified database. The
default value is autodetect.
"authtype" : "<autodetect|dbms|os|kerberos>"
username, password
The optional username and password members are strings, which can be used for all
authentication methods except OS. If you specify username and password values, these
values display in the generated DTL tasks. If you choose to specify the password as an
environment variable, the environment variable, for example, $DB_PASSWD, displays
in the generated DTL tasks.
"username" : "<user_name>",
"password" : "<password>"
hive_site
For HCatalog connections, the required hive_site member is a string that specifies the
directory path to a Hive XML configuration file. Typically, the directory path to hive-
site.xml is specified.
"hive_site" : "<directory_path_to_Hive_XML_configuration_file>"
alias
The required alias member is a string that specifies the name of the HDFS connection and
how it is referenced in the data transfer. This alias is specified in the target file
connection string.
"alias" : "<name_of_HDFS_connection>"
namenode
The required namenode member is a string that specifies either the hostname or, in high
availability clusters, the name service of the namenode. DMXDFNL connects to the
namenode through the HDFS connection.
"namenode" : "<hostname_or_name_service_of_namenode>"
"dtl_export_directory" : "<full_directory_path>"
When a directory value is specified for dtl_export_dir, the generated DTL task files, which are
written to the specified directory, are named according to the following convention:
tsk_<schema_name>.<table_name>.tdtl
Note: If your schema name or table name has special characters, for example, slash, ‘/’,
DMXDFNL may not be able to write the task to file or to execute the task.
If a directory is not specified, the generated DTL tasks are not written to file.
If the directory is set to the same directory that was set for a prior run, same-named DTL
task files are overwritten.
If the directory is set to a different directory from that set for a prior run, the DTL task
files from the prior run remain in the directory to which they were written.
no_execute
The optional no_execute member is a boolean that specifies whether DMXDFNL skips execution
of the generated DTL tasks and of any existing DMExpress task and job files specified for
execution.
"no_execute" : <true|false>
default_parallel
The optional default_parallel member is an integral data type that specifies the default number of
generated DTL tasks that can run simultaneously; it can be overridden for a given subunit by the
parallel value in the subunit_order list.
"default_parallel" : [1..]
alias
The required alias member is a string that must match a subunit_alias name, which you
define in the subunits object to identify a subunit list.
"alias" : "<name_of_subunit_list_in_subunit_object>"
parallel
The optional parallel member is an integral data type that specifies the number of DTL
tasks that can run simultaneously for the specified subunit. When set, this subunit_order
parallel value overrides the job-level default_parallel value. The default value is 1. If
specified, the value must be greater than 1.
"parallel" : [1..]
Note: All subunits that are scheduled in the subunit_order list must be defined in the subunits
object also.
subunit_order example
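As an illustrative sketch only (the subunit names are placeholders defined in the subunits object):
"subunit_order" : [
  { "alias" : "sales_transfer", "parallel" : 4 },
  { "alias" : "archive_jobs" }
]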
Each data transfer definition subunit list element is an object that specifies the source
from which data is extracted and the target into which the extracted data is loaded.
Each execute existing task job definition subunit list element is an object that specifies
the directory paths to existing DMExpress task files, or to existing DMExpress job files,
or to both.
Each subunit_alias defines at least one of the following, which occurs as a single schedulable
unit:
You assign a unique value to each subunit_alias that is defined within the DMXDFNL
configuration file.
Note: More subunits can be defined in the subunits object than are scheduled in the
subunit_order list; however, all subunits scheduled in the subunit_order list must be defined in
the subunits object.
subunits example
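As an illustrative sketch only, showing one data transfer definition subunit and one execute existing task job definition subunit (the names, paths, and nesting are assumptions):
"subunits" : {
  "sales_transfer" : [
    {
      "source" : { "connection" : "src_oracle", "schema" : "SALES" },
      "target" : { "connection" : "target_hdfs", "dir" : "/data/sales" }
    }
  ],
  "archive_jobs" : [
    {
      "tasks" : [ "/home/dmxuser/tasks/cleanup.dxt" ],
      "jobs" : [ "/home/dmxuser/jobs/archive.dxj" ]
    }
  ]
}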
node_agnostic
The optional node_agnostic member is a boolean that specifies whether the job runs on a
Hadoop-determined node in the Hadoop cluster.
"node_agnostic" : <true|false>
connection
The required connection member is a string that specifies the alias of the connection to
the database source. The alias is defined in the database connection object.
"connection" : "<alias_of_connection_to_database_source>"
schema
The required schema member is a string that specifies the name of the source database
schema.
"schema" : "<source_database_schema>"
exclude_types
The optional exclude_types member is a list that specifies the RDBMS data type or data
types, for example, large object data types, to exclude from the extraction.
"<RDBMS_data_type>"
exclude
The optional exclude member is a list that specifies the database table or tables to exclude
from the extraction.
Note: An exclude list can be specified only when the optional include list is not specified.
"<table_name>"
include
The optional include member is a list that specifies the database table or tables to include
in the extraction. No tables in the database schema other than those specified in the
include list are included in the extraction. For each element in the include list, you can
specify either a table name or a table name and filter, which allows you to filter records in
the table based on a condition.
Note: An include list can be specified only when the optional exclude list is not specified.
An include list element can include a string, in which case the string is the table name, or
can include the following include object members:
o table
The required table member is a string that specifies the name of the database table
to be extracted.
"table" : "<table_name>"
o filter
The optional filter member is a string that specifies the condition upon which
records are extracted from the database table. The filter string is equivalent to the
text that follows a WHERE clause in a SQL query.
"filter" : "<filter_condition>"
extraction_formats
The optional extraction_formats member is an object that enables you to override the default
extraction of date/time source columns. As DMExpress does not extract fractional
seconds in date/time data by default, you could, for example, use extraction_formats to
extract the fractional seconds.
"<DBMS_data_type>" : "<DMExpress_date/time_mask>"
source example
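As an illustrative sketch only (the connection alias, schema, table names, filter condition, and the date/time mask are assumptions):
"source" : {
  "connection" : "src_oracle",
  "schema" : "SALES",
  "exclude_types" : [ "BLOB", "CLOB" ],
  "include" : [
    "CUSTOMERS",
    { "table" : "ORDERS", "filter" : "ORDER_DATE >= DATE '2016-01-01'" }
  ],
  "extraction_formats" : { "TIMESTAMP" : "<DMExpress_date/time_mask>" }
}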
Database targets - The name of the database table to which the data is transferred is the
same as the name of the database source table from which the data is extracted.
File targets - The name of the target file is derived from the source database schema and
table names.
Character encoding
UTF-8 is the character encoding for all of the extracted data written to either target database
tables or to target files.
Database targets
The target object database members include the following:
connection
The required connection member is a string that specifies the alias of the connection to
the database target. The alias is defined in the database connection object.
"connection" : "<alias_of_connection_to_database_target>"
schema
The required schema member is a string that specifies the name of the target database
schema.
"schema" : "<target_database_schema>"
disposition
The optional disposition member is a string that specifies whether to overwrite, that is, to
truncate all existing rows from the target table before inserting the extracted records, or to
append the extracted records to the end of the target table. The default is overwrite.
"disposition" : "<overwrite|append>"
File targets
The target object file members include the following:
connection
The optional connection member is a string that must be specified to define a remote file
connection to HDFS. The alias is defined in the hdfs connection object. When a
connection string is not specified, the target defaults to the local file system.
"connection" : "<alias_of_remote_file_connection_to_HDFS>"
dir
The required dir member is a string that specifies the local file directory or HDFS to
which the target file is written.
"dir" : "<target_file_directory>"
The target file written to the local file directory or to HDFS is named according to the
following convention:
<schema_name>.<table_name>.dat
delimiter
The optional delimiter member is a string that specifies the single-byte delimiter that is
used to separate records in the target file. The default is a comma, ",".
"delimiter" : "<single_byte_delimiter>"
enclosing_char
The optional enclosing_char member is a string that specifies the single-byte character
that is used to enclose fields in the target file records. The default is no enclosing
characters.
"enclosing_char" : "<single_byte_field_enclosing_character>"
disposition
The optional disposition member is a string that specifies whether to overwrite the same-
named target file or to append the extracted records to the end of the target file. The
default is overwrite.
"disposition" : "<overwrite|append>"
"<directory_path_to_and_name_of_existing_DMExpress_task_files>"
tasks example
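As an illustrative sketch only (the paths are placeholders):
"tasks" : [
  "/home/dmxuser/tasks/load_customers.dxt",
  "/home/dmxuser/tasks/load_orders.dxt"
]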
"<directory_path_to_and_name_of_existing_DMExpress_job_files>"
jobs example
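As an illustrative sketch only (the path is a placeholder):
"jobs" : [
  "/home/dmxuser/jobs/weekly_rollup.dxj"
]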