
DMExpress Help 9.

Overview
DMExpress is a high-performance data transformation product. With DMExpress you can
design, schedule, and control all your data transformation applications from a simple graphical
interface on your Windows desktop.

Data records can be input from many types of sources such as database tables, SAP systems,
Salesforce.com objects, flat files, XML files, pipes, etc. The records can be aggregated, joined,
sorted, merged, or just copied to the appropriate target(s). Before output, records can be filtered,
reformatted or otherwise transformed.

Metadata, including record layouts, business rules, transformation definitions, run history, and
data statistics, can be maintained either within a specific task or in a central repository. The
effects of making a change to your application can be analyzed through impact and lineage
analysis.

You can run your data transformations directly from your desktop or on any UNIX or Windows
server, schedule them for later execution, embed them in batch scripts, or invoke them from
your own programs. With DMX-h, you can run DMExpress ETL jobs on a Hadoop/Spark
cluster, and you can accelerate the sort phases of existing MapReduce jobs in Hadoop.

The primary components of DMExpress installed on your desktop are: the Task Editor, used to
define, specify, and test individual tasks; the Job Editor, used to group tasks together for
immediate or later execution and to perform impact analysis; and the DMExpress Server
window, used to access information about jobs and environment variables on a DMExpress
server. Global Find is used to search through all of your jobs and tasks for specific values.

Task Editor

The DMExpress Task Editor is used to define, specify, and test a single task, the smallest
independent unit of any data transformation. Once you have defined one or more tasks, use the
DMExpress Job Editor to group the tasks together into jobs and submit them for immediate or
later execution. Jobs and tasks that are running, completed, or scheduled to run in the future can
be viewed via the DMExpress Status window.

Designing your task


You will need to specify the data source (input), the transformations to be performed on the
records from that source, and the target (output) where you want the resulting records to go.

Your task definition is displayed as a tree structure within the main task editor window. The
initial state of the tree shows empty sources and targets and no operations - as you define your
task with specialized dialogs, the tree will change to reflect the state of your definition.

Branches of the tree can be expanded and collapsed in the conventional way, by clicking the plus
and minus icons within the tree.

Specifying your data source(s)


Your data records can come from any source; right-click the Source branch in the tree to specify
where they are. For example, if they're in a flat file, you can browse to it via the usual file
browser dialog; if they're in a database table, specify the connection information for the
database (username and password, or ODBC source), select from the list of tables presented,
and then identify the columns that you want to extract. Other types of sources can be specified
in a similar manner.

Specifying your task type

Copy The default type of task is a Copy, which copies the records, in the same order, from
source to target.

Sort or Merge If you want to change the order of the records in the target, change the task type
to Sort, and then double-click the Sort Fields branch in the tree to specify the ordering you want.
If you have files that are already pre-sorted and you want to merge them into a single file, use
the Merge task type in the equivalent way.

Aggregate If you'd prefer to group the records into sets and summarize them (produce one
record per set with totals, counts, averages, etc.), change the task type to Aggregate, and
double-click the Aggregate branch in the tree to specify your Group By and Summary fields.

Join If you want to create new target records by joining together records from different sources
based on a common field (for example, joining transaction records to customer records to
produce customer-transaction records), change the task type to Join, and double-click the Join
Fields branch in the tree to specify the common field(s).

Transforming records

To change the number of records output, use a Filter.

To change the fields in the records, use a Reformat.

Specifying your data target(s)


Your data records can be output to any target; right-click the Target branch in the tree to specify
where they should go, in the same way as you specified the source.

Running your task

To test your task immediately, press the Run button.

To combine your task with other tasks and run them together as a Job, select the Create Job item
from the Run menu; this will switch you over to the Job Editor.

Job Editor

The DMExpress Job Editor is used to group tasks (and any sub-jobs) together into a job, define
the run sequence between the included tasks/sub-jobs, and submit the job for immediate or
scheduled execution.

To define your task(s), use the DMExpress Task Editor.

You can view the progress of your submitted job via the DMExpress Server dialog.

Collecting your job components


You can add the following component types to your job:

 DMExpress Task - to perform standard DMExpress processing
o press the Add DMExpress Task button. Select a DMExpress task via the browse dialog, then
click on the position in the job graph where you want the task to appear.
 Extended Task - to perform non-DMExpress processing using pre-defined command-based
tasks
o select a task from the Add Extended Task... menu, if any. Specify the command arguments in
the Extended Task dialog, then click on the position in the job graph where you want the task to
appear.
 Custom Task - to perform non-DMExpress processing
o press the Add Custom button. You are then prompted for the path of the custom task
executable and any arguments that you may want to pass to it when it is run.
 DMExpress Job - to execute another DMExpress job as part of this job
o use the Add DMExpress Job command under the Edit menu.

Viewing your job


Your job is displayed as a graph structure in the main Job Editor window, with each task/sub-job
as a node. A job with three DMExpress tasks (a join, a copy and an aggregate) and a custom task
might look like the following:

To view the details within a task, you can expand it (right click, Expand) in place to see a more
detailed view as shown below, or you can double-click the task to open it in the Task Editor.

Deciding the run sequence


Tasks in a job are linked to indicate data flow and run sequence; any tasks that are not linked will
be run simultaneously. The following types of links are available:
 Data-flow links appear between pairs of DMExpress tasks which have common data
detectable by the Job Editor. When a DMExpress task is added to a job, the Job Editor
determines the appropriate data-flow for the new task by matching the names of source and
target files or database tables, and adding a data-flow link from the target of one task to the
matching source of another task; for SAP sources, the name of the import parameter file is used
for matching.

 Sequence links that indicate a purely sequential flow of execution can be created between
tasks and/or sub-jobs without a data-flow between them. Such links are useful, for example, to
determine the run sequence for custom tasks.

 A MapReduce (MR) link is a data-flow link that designates the job as a full user-defined
MapReduce job and splits the job graph between the map side and the reduce side. An MR link
can also be specified as a direct data flow link to enable direct data flow when the job is run
outside of Hadoop.

Adding, Modifying, and Deleting Links

Links can be added, modified, and deleted as follows:

 To create a data-flow or sequence link, press the Add Run Sequence button, click your
mouse on the earlier task, and drag the link to the later task.
 To convert data-flow links to direct data flow links, use one of the following commands:
o Use Direct Data Flow
o Use Direct Data Flows Everywhere
 To create a MapReduce link, press the Add MapReduce Sequence button, click on what will
be the last task on the map side, and drag the link to what will be the first task on the reduce
side.
 To convert a data-flow link to a MapReduce link, right-click on it and select Use
MapReduce Sequence.
 To delete an existing link, select the link and press the Delete button.

Incorporating comments into your job


Through comments, you can outline processes, annotate jobs, and share information among
developers.

You add comments directly to the job graph, where they display as moveable and resizeable
comment nodes. This comment text is included in the job reports and difference reports that you
generate.

Adding comments to your job


To add comments to the job graph, do any one of the following:

 select Comment from the Edit toolbar.


 select Comment from the Edit menu.
 right-click the job graph and select Add Comment.

To save the comment, click outside of the comment node.

Tooltip

The tooltip for the comment node displays all of the comment text.

For existing DMExpress task nodes and subjob nodes, the tooltip displays the task path and all
task properties and the subjob path and all job properties, respectively.

Modifying and deleting comments

You can edit comments directly on the job graph and delete them from the job graph.

To modify an existing comment, click inside the comment node. An edit box surrounds the
comment node and all of the commented text is selected.

To save the comment, click outside of the comment node.

To cancel any pending edits, press ESC.

To delete the comment, right-click on the comment node and select Delete.

Running your job


You can run your job from the Job Editor in one of the following ways:

 To run your job immediately, press the Run button.
 To run your job at a scheduled time, press the Schedule button.

You will be prompted to identify the server on which you want the job to run, which can be
either this Windows system or a network-accessible UNIX or Windows DMExpress server.

You can also run your job from the command line using the dmxjob command.
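
For example, a job saved as myjob.dxj might be run with an invocation along the following lines
(the job file name is illustrative, and the exact options should be confirmed against the dmxjob
reference in this help):

dmxjob /RUN myjob.dxj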

DMExpress Server
The DMExpress Server program is used to access information about a DMExpress server:
 the status of jobs run or scheduled on the server via the DMExpress Task Editor or Job
Editor,
 environment variables defined for a user on the server,
 SMTP server configuration for a user on the server,
 Secure FTP key sets defined for a user on the server, and
 calendars defined on the server to skip runs of scheduled jobs.

The DMExpress Server window can be accessed directly from the DMExpress program group on
your desktop, or from within the Task Editor or the Job Editor by selecting:

 the Run->Status menu item or the toolbar button,
 the Run->Define Environment Variable Values... menu item,
 the Configure button for the SMTP server control of the Run Job dialog, or
 the Define button for the Key set control of the Remote File Connection dialog.

The DMExpress Server program, along with the help, can be installed and run separately on a
Windows system, without a full DMExpress installation. No license key is required. To install,
run the executable DMExpressServerConsoleSetup.exe on the system where you want the
DMExpress Server program to run. This executable is located in the directory where DMExpress
is installed, C:\Program Files\DMExpress by default.

DMX-h
DMX-h is the Hadoop-enabled edition of DMExpress, providing the following Hadoop
functionality:

 ETL Processing in Hadoop - Develop a DMX-h ETL application entirely in the DMExpress
GUI to run seamlessly in the Hadoop framework, with no Pig, Hive, or Java programming
required. Currently, jobs can be run in either MapReduce or Spark.
 Hadoop Sort Acceleration - Seamlessly replace the native sort within Hadoop MapReduce
processing with the high-speed DMExpress engine sort, providing performance benefits without
programming changes to existing MapReduce jobs. See the DMX-h Sort User Guide included in
the Documentation folder under your DMExpress software installation directory.
 Apache Spark Integration - Use the Spark Mainframe Connector to transfer mainframe data
to HDFS.
 Apache Sqoop Integration - Use the Sqoop Mainframe Import Connector to import
mainframe data into HDFS.

Supported Hadoop Distributions


MapReduce

DMX-h supports MapReduce on the following Hadoop distributions:

 Cloudera CDH 5.2, 5.3, 5.4, 5.5, 5.6, 5.7 - YARN (MRv2)
 Hortonworks Data Platform (HDP) 2.1, 2.2, 2.3, 2.4 - YARN
 Apache Hadoop 2.x (2.2 and later) - YARN
 MapR, M5 and M7 editions only (now respectively termed Community Edition and
Enterprise Edition)
o 5.0, 5.1 - YARN
o 4.0.1, 4.0.2, 4.1 - YARN
o 3.1.1 - MRv1
 Pivotal HD 3.0 - YARN
 BigInsights 4 - YARN

Spark

DMX-h supports Spark as follows:

 Spark on YARN on the following Hadoop distributions:


o Cloudera CDH 5.5, 5.6, 5.7
o Hortonworks Data Platform (HDP) 2.3.4, 2.4
o MapR 5.1, M5 and M7 editions only (now respectively termed Community
Edition and Enterprise Edition)
 Spark on Mesos 0.21.0
 Spark Standalone 1.5.2 and higher

DMX-h MapReduce Architecture


A good introduction to Hadoop MapReduce and the major processing phases is available at
http://developer.yahoo.com/hadoop/tutorial/index.html.

Both DMX-h sort acceleration and ETL processing use a proprietary Java implementation of the
sort plugin interface made available through Syncsort's contribution to the Hadoop open source
community. The plugin enables the Hadoop framework to invoke the DMExpress engine during
the map-sort and reduce-merge steps of the MapReduce processing phases. For DMX-h ETL
map-only jobs, the DMExpress engine is invoked during the mapper phase.

DMX-h sort acceleration improves the performance of existing Hadoop jobs by speeding up the
sort phases of the processing on both the map and reduce sides. Based on the MapReduce key
type, two DMExpress sort tasks are auto-generated, one each for map-sort and reduce-merge,
and they replace the native framework sort. As an additional performance boost, the map-sort is
typically replaced with a "null sort" except in specific cases, thereby eliminating an unnecessary
sort step.

DMX-h ETL processing in MapReduce allows DMExpress ETL jobs to run as the map-sort and
reduce-merge phases of a MapReduce job. These jobs can do more than just a simple sort,
including copy, merge, join, aggregation, reformat, filtering, and partitioning, effectively
combining several Hadoop processing phases into one job. DMX-h ETL can also be used for
map-only applications. An instance of DMX-h is run on each mapper (and reducer for full
MapReduce jobs) to execute the respective portion of the ETL job on the respective subset of
data.

DMX-h Spark Architecture


DMX-h jobs can be run on Apache Spark - a fast, general-purpose parallel processing framework
- against a variety of Spark cluster managers: either YARN, Mesos, or Spark standalone.

When running on Spark, the main DMX-h job and any subjobs are broken up into Spark
applications, each of which is further broken up into Spark jobs, where one or more DMX-h
execution units correspond to a Spark job. Each Spark job is broken up into a directed acyclic
graph (DAG) of Spark stages, each of which uses Resilient Distributed Datasets (RDDs) to
process the blocks of data. Some of those RDDs are executed by Spark, and some by
DMExpress.

Within a Spark job, data can get shuffled more than once. The processing before a shuffle is
similar to the map-side in MapReduce, where the number of partitions in Spark is like the
number of mappers in MapReduce, and the processing after a shuffle is similar to the reduce-side
in MapReduce, where the number of partitions in Spark is like the number of reducers in
MapReduce.

DMX-h ETL
DMX-h allows you to develop an ETL application entirely in the DMExpress GUI and run it
seamlessly in a cluster framework, either Hadoop MapReduce or Spark.

DMX-h Installation Components


See DMX-h Installation Components for details on installing and setting up the necessary
components of DMX-h to run jobs in a cluster.

Developing and Running DMX-h ETL Jobs


By following the rules and recommendations for developing DMX-h ETL jobs, you can run
DMX-h ETL jobs locally or in the cluster by simply changing environment variable settings and
the method of invocation.
DMX-h MapReduce Execution Architecture

DMX-h Spark Execution Architecture

Use Case Accelerators


A set of use case accelerators, or examples, to help understand how to implement DMX-h ETL
and Hadoop file system Load/Extract solutions is provided in the DMExpress Knowledge Base.
See A Guide to DMExpress Hadoop Use Case Accelerators for details.

Integration with Cloudera Navigator


Data governance refers to the overall management of the availability, usability, integrity, and
security of the data employed in an enterprise, and is a necessary component for meeting
compliance and regulatory requirements.

If you are using Cloudera's Hadoop distribution, data governance of DMX-h jobs run in the CDH
cluster is available via Cloudera Navigator, Cloudera’s proprietary governance solution for
Apache Hadoop. Part of Cloudera Enterprise, Cloudera Navigator provides metadata search,
policies, and auditing to secure, govern, and explore the data that lands in Hadoop.

Finding DMX-h Jobs in Cloudera Navigator


Tags

DMX-h jobs run in a Hadoop cluster are tagged with "DMX-h" and "Syncsort" for filtering in
Cloudera Navigator. Simply enter the DMX-h and/or Syncsort tags in the Filters pane (or check
one or both of those checkboxes in the Tags list if previously added) to filter on DMX-h jobs.

User Properties

The following user properties are set by DMX-h to identify the jobs:
 DMXhMainJob - this is the name of the top level DMX-h job, which may be divided into
more than one Hadoop job depending on the DMX-h subjob(s)/task(s) included in the
main DMX-h job
 DMXhJobName - this is the name of the Hadoop job, which includes the DMX-h top
level job name followed by the names of the DMX-h subjob(s) and task(s) included in
this Hadoop job

For example, a DMX-h job named inventory.dxj that includes aggregate tasks
summarizeStores.dxt and summarizeStates.dxt will be broken into two MapReduce jobs, with the
above user properties set as follows for each Hadoop job:

 Job 1
o DMXhMainJob=inventory.dxj
o DMXhJobName=DMXh-inventory.summarizeStores
 Job 2
o DMXhMainJob=inventory.dxj
o DMXhJobName=DMXh-inventory.summarizeStates

Viewing DMX-h Job Lineage


Upon clicking on a job in the search results, Cloudera Navigator will display the job's lineage as
follows:

 HDFS file sources/targets will appear as inputs/outputs to the given job.
 HCatalog table sources/targets and their underlying storage directories will appear as
inputs/outputs to the given job.
 For jobs with multiple targets (either HDFS file or HCatalog), Cloudera Navigator shows
Empty_<...> because DMX-h writes those targets directly and Cloudera Navigator cannot
currently detect them.
 For jobs with sub-jobs that fall back to running on an edge or single cluster node, there
will be a gap in the lineage.

Spark Mainframe Connector


The DMX-h Spark Mainframe Connector is a standalone Spark application for transferring
mainframe datasets to the Hadoop Distributed File System (HDFS). The connector supports the
following functionality:

 transfer sequential datasets as well as VSAM files
 use COBOL copybooks as metadata
 transfer data in both text and binary modes

Requirements
As part of DMX-h, the Spark Mainframe Connector requires one of the supported Hadoop
distributions. Additionally, the following configuration/connectivity settings are required on the
ETL node:

 The DMX-h bin directory, <dmx_install_dir>/bin, must be in the path.
 You must be able to establish FTP or FTPS connectivity to the host system (FTP/FTPS are
included with Hadoop, but must be running on the host system).
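
For example, on a Linux ETL node the environment might be prepared along these lines before
launching the connector (the installation path and host name are hypothetical):

export PATH=/usr/dmexpress/bin:$PATH
ftp somehost     # verify FTP connectivity to the host system, or use FTPS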

Running the Spark Mainframe Connector


The DMX-h Spark Mainframe Connector can be launched using the spark-submit script in
Spark’s bin directory as follows:

spark-submit <spark-options> --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar <connector-arguments>

where:

 <spark-options> are spark-submit options
 DMXSparkConnector is the main class of the Spark application to launch
 <dmx_install_dir> is the directory where DMX-h is installed
 <connector-arguments> are arguments passed to the DMX-h Spark Mainframe Connector

The DMX-h Spark Mainframe Connector can be launched in both client and cluster deploy
modes in a Spark cluster of any type, standalone or YARN. See the Spark User Guide for details
on the common spark-submit script options.

For client deploy mode, the spark-submit option --driver-class-path may be required to specify
the location of dmxspark.jar. For example:

--driver-class-path <dmx_install_dir>/lib/dmxspark.jar
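
As a hedged sketch, a cluster deploy mode invocation on YARN might look like the following,
using --password rather than --P because console prompting is only available in client deploy
mode (host, user, dataset, and password values are hypothetical):

spark-submit --master yarn --deploy-mode cluster --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --hostname somehost --username SOMEUSER --password somepassword --dataset SOMEUSER.SOMEDATA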

DMX-h Spark Mainframe Connector Arguments

The DMX-h Spark Mainframe Connector can transfer mainframe data in one of the following
modes:

 Text Mode (default)


o If --cobol-copybook is specified, all mainframe number fields will be converted to
DMX-h edited numeric data type and all character fields will be converted to
UTF-8 encoding.
o Otherwise, the whole data record will be treated as a single character field and the
data will be converted from EBCDIC to UTF-8 encoding.
 Binary Mode (specified via --binary-transfer)
o Record data is transferred unchanged from the mainframe to HDFS as a "golden
copy".

Following is a list of arguments supported by the DMX-h Spark Mainframe Connector:


--hostname <server name>

Specify the machine name of the server from which the mainframe datasets are to be transferred.
--username <user name>

Specify the user name with which to log into the server machine.
--P

Prompt the user to enter the password securely through the console. This only works in the client
deploy mode, in which the connector is run locally as an external client.
--password <user password>

Specify the password of the login user.


--secure-connection

Specify to transfer mainframe data over FTPS secure connection.


--dmx-home <dmx_install_dir>

Specify the full path of the DMX-h installation directory in the Spark cluster. If not specified, the
same path as on the ETL node from which the spark-submit command was invoked will be used.
--dataset <dataset name>

Specify the dataset(s) with a fully-qualified name. The hierarchy of directories and the
representation of dataset names depends on the mainframe's FTP configuration options.
Typically, Partitioned Data Set (PDS) and Partitioned Data Set Extended (PDSE) are presented
as folders, and datasets outside or inside of PDS/PDSE are presented as files. High-level
qualifiers, which are the first (leftmost) qualifier in a dataset name, are represented as folders too.

Following are variations on fully qualified names, all of which are supported by the DMX-h
Spark Mainframe Connector:

 To specify all files in a folder on the mainframe system, <dataset name> can be a
partitioned dataset or high-level qualifier. For example:
o Specify all files in the partitioned dataset SOMEUSER.SOMEPDS as:

--dataset SOMEUSER.SOMEPDS
o Specify all files with the high-level qualifier SOMEUSER as:

--dataset SOMEUSER

 To specify all files in a folder on a Unix/Linux system, <dataset name> needs to be a full
path to a folder. For example, specify all files in the folder /somedir as:

--dataset /somedir

 To specify some file(s) in a folder on the mainframe system, <dataset name> should
consist of the folder name (a partitioned dataset or high-level qualifier) followed by a file
name enclosed in parentheses, all enclosed in quotes. Depending on the mainframe's FTP
configuration, a wildcard may be used in the file name to specify a set of files in the
folder. For example:
o Specify the files with the high-level qualifier SOMEUSER that start with ORDER
as:

--dataset 'SOMEUSER(ORDER*)'

o Specify the files in the partitioned dataset SOMEUSER.SOMEPDS that end with
.DAT as:

--dataset 'SOMEUSER.SOMEPDS(*.DAT)'

 To specify some file(s) in a folder on a Unix/Linux system, <dataset name> needs to be a full
path to a file name. A wildcard can be used in the file name to specify a set of files in the folder.
For example, specify the files in the folder /somedir that end with .dat as:

--dataset /somedir/*.dat
--binary-transfer

Specify to transfer mainframe data in binary mode.

 The datasets are imported to HDFS in their exact same form as on the mainframe system,
creating a golden (master) copy of your mainframe data on HDFS.
 The --cobol-copybook, --jobcard, and --source-compressed arguments are not applicable.
 VSAM files are not supported in binary mode.

For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEDATA in binary mode and stores them in HDFS:

spark-submit --master yarn-client --driver-class-path <dmx_install_dir>/lib/dmxspark.jar --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --hostname somehost --username SOMEUSER --P --dataset SOMEUSER.SOMEDATA --binary-transfer
--cobol-copybook <copybook file>

Specify the COBOL copybook file which contains the record layout of the mainframe data,
where <copybook file> can be either a mainframe file specified with a fully-qualified name, or a
local file with a full UNIX-like pathname.

The COBOL copybook defines the layout of mainframe data records, and all fields in the record
will be converted according to the data type:

 All number fields will be converted to DMX-h edited numeric data type.
 All character fields will be converted from EBCDIC to UTF-8 encoding.

The fields in the converted record will be delimited by a null byte (\0), the default field delimiter,
unless specified otherwise using the --field-delimiter argument. Presence of the delimiter
character within the fields will result in incorrect parsing.

This argument is not applicable in binary transfer mode.

See External metadata in a COBOL copybook for supported COBOL copybooks.

For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEDATA on somehost to HDFS as text files using the COBOL copybook file
SOMEUSER.SOMEDATA.CPY as the metadata file:

spark-submit --master yarn-client --driver-class-path <dmx_install_dir>/lib/dmxspark.jar --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --hostname somehost --username SOMEUSER --P --dataset SOMEUSER.SOMEDATA --cobol-copybook SOMEUSER.SOMEDATA.CPY
--cobol-recordlayout <recordlayout name>

Specify the record layout of the mainframe data, which is defined in the COBOL copybook
specified by the --cobol-copybook argument, where <recordlayout name> is the name of the
specified record layout. If not specified, the data-name in the first level 01 data description entry
of the COBOL copybook will be used.
--jobcard <jobcard>

Specify the JCL job card when transferring VSAM files, where <jobcard> is a double-quoted
string containing the information required to run jobs on the mainframe system.

This argument is not applicable in binary transfer mode.


See Accessing files on mainframe servers for details on how DMX-h accesses VSAM files and
how to specify a required job card.

For example, the following invocation transfers a list of VSAM files of the form
SOME*.VSAM.DATA in the folder SOMEUSER to HDFS as text files, using the COBOL
copybook SOMEUSER.SOMEVSAM.CPY as the metadata file, and running the indicated JCL
job card:

spark-submit --master yarn-client --driver-class-path <dmx_install_dir>/lib/dmxspark.jar --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --hostname somehost --username SOMEUSER --P --dataset 'SOMEUSER(SOME*.VSAM.DATA)' --cobol-copybook SOMEUSER.SOMEVSAM.CPY --jobcard="(SOMEUSER,WWCDMX900),'DMXGRP',MSGCLASS=Q"
--target-dir <target path>

Specify the target path in HDFS, where <target path> can be either the full path or a relative path
to the user’s home directory. If not specified, the target directory is the directory with the name
of the source dataset (which must be a valid directory name) in the user’s home directory.
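
For example, the following variation of the earlier binary transfer invocation stores the imported
files under a hypothetical HDFS path /data/mainframe/golden:

spark-submit --master yarn-client --driver-class-path <dmx_install_dir>/lib/dmxspark.jar --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --hostname somehost --username SOMEUSER --P --dataset SOMEUSER.SOMEDATA --binary-transfer --target-dir /data/mainframe/golden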
--source-compressed

Specify that mainframe data is compressed.

When transferred in text mode, the data will be uncompressed and converted by DMX-h before
writing to the target. DMX-h supports bzip2 and gzip compression formats. Since the record
format of compressed data is most likely unknown, you can specify the record format with the
--source-recordformat and --source-recordlength arguments.

This argument is not applicable in binary mode; the data will remain compressed in HDFS.
--source-recordformat <recordformat>

Specify the record format of the mainframe data, where <recordformat> is one of the following
mainframe record formats supported by DMX-h:

 MAINFRAMEFIXED - mainframe fixed record format


 MAINFRAMEUNBLOCKEDVARIABLE - mainframe variable record format
 MAINFRAMEVARIABLE - mainframe variable with block descriptor record format
 MAINFRAMEVSAM - mainframe VSAM record format

This option is required when importing files from Unix/Linux systems, where, unlike the
mainframe system, the record format cannot be automatically retrieved from the system.
--source-recordlength <recordlength>

Specify the record length of the mainframe data when the record format is specified with the
--source-recordformat argument, where <recordlength> is:
 the record length for fixed record format, in which case the argument is required
 the maximum record length for variable record format, in which case the argument is
optional, with a default value of 32,756 bytes

For example, the following invocation transfers all files from the folder /somedir on a Linux host
somehost to HDFS as text files using the COBOL copybook file
/SOMEUSER/SOMEDATA.CPY as the metadata file, with a record format of
MAINFRAMEFIXED and a length of 80:

spark-submit --master yarn-client --driver-class-path <dmx_install_dir>/lib/dmxspark.jar --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --source-recordformat MAINFRAMEFIXED --source-recordlength 80 --hostname somehost --username SOMEUSER --P --dataset /somedir --cobol-copybook /SOMEUSER/SOMEDATA.CPY
--field-delimiter <delimiter>

Specify the character with which fields are separated in the transferred records, where
<delimiter> is a single byte ASCII character supported by DMX-h.

This option is only applicable in text mode transfer. If not specified, the fields in the records are
separated with a null character, '\0'.
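
For example, the following variation of the earlier copybook-based transfer separates the
converted fields with a pipe character instead of the default null byte (host, user, and dataset
names are hypothetical):

spark-submit --master yarn-client --driver-class-path <dmx_install_dir>/lib/dmxspark.jar --class com.syncsort.dmexpress.hadoop.DMXSparkConnector <dmx_install_dir>/lib/dmxspark.jar --hostname somehost --username SOMEUSER --P --dataset SOMEUSER.SOMEDATA --cobol-copybook SOMEUSER.SOMEDATA.CPY --field-delimiter '|'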

Sqoop Mainframe Import Connector


Apache Sqoop provides tools for transferring data between the Hadoop Distributed File System
(HDFS) and relational databases in a parallel fashion using Hadoop's MapReduce (MR)
framework.

Syncsort's open-source contribution to the Apache Sqoop project, SQOOP-1272, provides a
mainframe import tool that enables basic mainframe to HDFS import functionality. When used
in conjunction with the DMX-h Sqoop Mainframe Import Connector, the mainframe import tool
is much more full-featured, as shown in the following table:

Open-source Mainframe Import Tool | With the DMX-h Sqoop Mainframe Import Connector
Data can only be transferred in text mode. Binary data must be converted to text prior to transferring with Sqoop. | Data can be transferred in binary mode, allowing for a golden (master) copy.
No metadata can be provided, and the record must be transferred as a single field. | A COBOL copybook can be specified, allowing for field processing during record transfer.
Only uncompressed mainframe data can be transferred. | Compressed mainframe data can be transferred.
Data is transferred over FTP, which is not secure. | Transfer over FTPS secure connection can be specified.
Only a whole folder in a mainframe system can be transferred. | All mainframe datasets located in either a mainframe system or a Unix/Linux system can be transferred.
VSAM files cannot be transferred. | VSAM files can be transferred.

Requirements
As part of DMX-h, the Sqoop Mainframe Import Connector requires one of the supported
Hadoop distributions, as well as the following software:

 Apache Sqoop version 1.4.6 or later (1.4.5 on Cloudera CDH 5.2 or later), which includes
the Sqoop mainframe import tool
 Hive version 0.13.0 or later may be needed when importing data into HCatalog

And the following configuration/connectivity settings are required on the ETL node:

 FTP or FTPS connectivity to the mainframe (FTP/FTPS are included with Hadoop, but
must be running on the mainframe)
 the DMX-h bin directory, <dmx_install_dir>/bin, must be included in the path
 the appropriate dmxhadoop jar file, <dmx_install_dir>/lib/dmxhadoop_<type>.jar, must
be specified in the Java classpath in one of the following ways, where <type> is either
mrv1 or mrv2 (for MapR, the jar file is dmxhadoop_<type>_mapr.jar):
o Edit the hadoop-env.sh file in the Hadoop conf directory (such as
/etc/hadoop/conf) and add the dmxhadoop jar file to the HADOOP_CLASSPATH
as follows:

export HADOOP_CLASSPATH=<dmx_install_dir>/lib/dmxhadoop_<type>.jar:$HADOOP_CLASSPATH

o Copy the dmxhadoop jar file, <dmx_install_dir>/lib/dmxhadoop_<type>.jar, to the Sqoop lib
directory, <sqoop_install_dir>/lib.
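
For example, assuming a hypothetical DMX-h installation under /usr/dmexpress, a YARN
(MRv2) cluster, and Sqoop installed under /usr/lib/sqoop, the copy might look like:

cp /usr/dmexpress/lib/dmxhadoop_mrv2.jar /usr/lib/sqoop/lib/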

Running the Sqoop Mainframe Import Connector


To run the Sqoop mainframe import tool with the DMX-h Sqoop Mainframe Import Connector,
you need to specify the DMX-h connection manager class name as follows:

sqoop import-mainframe <tool-arguments> --connection-manager com.syncsort.dmexpress.hadoop.DMXMainframeManager [-- <connector-arguments>]

where:
 <tool-arguments> are the required mainframe import tool arguments
 <connector-arguments> are optional arguments accepted by the DMX-h Sqoop
Mainframe Import Connector, specified after an extra -- at the end of the command line

Mainframe Import Tool Arguments

Following is a list of arguments required by the mainframe import tool:


--connect <hostname>

Specify the mainframe hostname.


--username <username>

Specify the username for connecting to the mainframe.


--password-file <filename>
-P
--password <password>

Specify the password for the given username in one of these ways:

 The --password-file argument is the most secure and recommended way, where
<filename> is the full path to a file with 400 permissions in your home directory in which
the password is saved.
 The -P argument specifies to read the password from a console prompt.
 The --password argument specifies the password on the command line, which is not secure.
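
For example, a password file might be prepared along these lines before being referenced with
--password-file (the path is hypothetical, and whether the file must be local or in HDFS depends
on your Sqoop configuration):

echo -n 'mypassword' > /home/someuser/mainframe.password
chmod 400 /home/someuser/mainframe.password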

--dataset <datasetname>

Specify the dataset(s) with a fully-qualified name. The hierarchy of directories and the
representation of dataset names depends on the mainframe's FTP configuration options.
Typically, Partitioned Data Set (PDS) and Partitioned Data Set Extended (PDSE) are presented
as folders, and datasets outside or inside of PDS/PDSE are presented as files. High-level
qualifiers, which are the first (leftmost) qualifier in a dataset name, are represented as folders too.

Following are variations on fully qualified names, all of which are supported with the DMX-h
Sqoop Mainframe Import Connector, but only the first of which – all files in a folder on the
mainframe – is supported in the open-source mainframe import tool:

 To specify all files in a folder on the mainframe system, <datasetname> can be a partitioned
dataset or high-level qualifier. For example:
o Specify all files in the partitioned dataset SOMEUSER.SOMEPDS as:

--dataset SOMEUSER.SOMEPDS

o Specify all files with the high-level qualifier SOMEUSER as:

--dataset SOMEUSER

 To specify all files in a folder on a Unix/Linux system, <datasetname> needs to be a full path
to a folder. For example, specify all files in the folder /somedir as:

--dataset /somedir

 To specify some file(s) in a folder on the mainframe system, <datasetname> should consist of
the folder name (a partitioned dataset or high-level qualifier) followed by a file name enclosed
in parentheses, all enclosed in quotes. Depending on the mainframe's FTP configuration, a
wildcard may be used in the file name to specify a set of files in the folder. For example:
o Specify the files with the high-level qualifier SOMEUSER that start with ORDER
as:

--dataset 'SOMEUSER(ORDER*)'

o Specify the files in the partitioned dataset SOMEUSER.SOMEPDS that end with
.DAT as:

--dataset 'SOMEUSER.SOMEPDS(*.DAT)'

 To specify some file(s) in a folder on a Unix/Linux system, <datasetname> needs to be a full
path to a file name. A wildcard can be used in the file name to specify a set of files in the folder.
For example, specify the files in the folder /somedir that end with .dat as:

--dataset /somedir/*.dat
--fields-terminated-by <char>

Specify the field separator character for input (unless overridden by --input-fields-terminated-by)
and output fields; default is comma (,). This argument only applies for text mode transfer.
--input-fields-terminated-by <char>

Specify the field separator character for input fields if it needs to be different from the one
specified by --fields-terminated-by. This argument only applies for text mode transfer.
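
For example, the following sketch (host, user, dataset, and copybook names are hypothetical)
writes the converted fields separated by a pipe character rather than the default comma:

sqoop import-mainframe --connect somehost --username SOMEUSER -P --dataset SOMEUSER.SOMEPDS --fields-terminated-by '|' --connection-manager com.syncsort.dmexpress.hadoop.DMXMainframeManager -- --cobol-copybook SOMEUSER.SOMEPDS.CPY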

See the Sqoop User Guide (version 1.4.6 or later) on the Sqoop site for details on the Sqoop
mainframe import tool arguments.
Sqoop Mainframe Import Connector Arguments

The DMX-h Sqoop Mainframe Import Connector can transfer mainframe data in one of the
following modes:

 Text Mode (default)


o If --cobol-copybook is specified, all mainframe number fields will be converted to
DMX-h edited numeric data type and all character fields will be converted to
UTF-8 encoding. VSAM files can be transferred by also specifying the --jobcard
argument.
o Otherwise, the whole data record will be treated as a single character field and the
data will be converted from EBCDIC to UTF-8 encoding.
 Binary Mode (specified via --binary-transfer)
o Data is transferred unchanged from the mainframe to HDFS as a "golden copy".

Following is a list of all arguments supported by the DMX-h Sqoop Mainframe Import
Connector:
--binary-transfer

Specify to transfer mainframe data in binary mode.

 The datasets are imported to HDFS in their exact same form as on the mainframe system,
creating a golden (master) copy of your mainframe data on HDFS.
 The --cobol-copybook, --jobcard, --source-compressed, and --target-fileformat arguments
are not applicable.
 VSAM files are not supported in binary mode.

For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEPDS in binary mode and stores them in HDFS:

sqoop import-mainframe --connect somehost --username SOMEUSER -P --dataset SOMEUSER.SOMEPDS --connection-manager com.syncsort.dmexpress.hadoop.DMXMainframeManager -- --binary-transfer
--cobol-copybook <copybook file>

Specify the COBOL copybook file which contains the record layout of the mainframe data,
where <copybook file> can be either a mainframe file specified with a fully-qualified name, or a
local file with a full UNIX-like pathname.
The COBOL copybook defines the layout of mainframe data records, and all fields in the record
will be converted according to the data type:

 All number fields will be converted to DMX-h edited numeric data type.
 All character fields will be converted from EBCDIC to UTF-8 encoding.

The fields in the converted record will be delimited by a comma (,), the default Sqoop field
delimiter, unless specified otherwise by the Sqoop tool arguments --fields-terminated-by or
--input-fields-terminated-by. Presence of the delimiter character within the fields will result in
incorrect parsing.

This argument is not applicable in binary transfer mode.

See External metadata in a COBOL copybook for supported COBOL copybooks.

For example, the following invocation transfers all files from the partitioned dataset
SOMEUSER.SOMEPDS on somehost to HDFS as text files using the COBOL copybook file
SOMEUSER.SOMEPDS.CPY as the metadata file:

sqoop import-mainframe --connect somehost --username SOMEUSER -P --dataset SOMEUSER.SOMEPDS --connection-manager com.syncsort.dmexpress.hadoop.DMXMainframeManager -- --cobol-copybook SOMEUSER.SOMEPDS.CPY
--cobol-recordlayout <recordlayout name>

Specify which mainframe data record layout in the specified COBOL copybook to use, where
<recordlayout name> is the name of the record layout to use. If not specified, the data-name in
the first level 01 data description entry of the COBOL copybook will be used.
--dmx-home <dmx_install_dir>

Specify the full path of the DMX-h installation directory in the Hadoop cluster. If not specified,
the same path as on the ETL node from which the sqoop command was invoked will be used.
--jobcard <jobcard>

Specify the JCL job card when transferring VSAM files, where <jobcard> is a double-quoted
string containing the information required to run jobs on the mainframe system.

This argument is not applicable in binary transfer mode.

See Accessing files on mainframe servers for details on how DMX-h accesses VSAM files and
how to specify a required job card.

For example, the following invocation transfers a list of VSAM files of the form
SOME*.VSAM.DATA in the folder SOMEUSER to HDFS as text files, using the COBOL
copybook SOMEUSER.SOMEVSAM.CPY as the metadata file, and running the indicated JCL
job card:

sqoop import-mainframe --connect somehost --username SOMEUSER -P --dataset 'SOMEUSER(SOME*.VSAM.DATA)' --connection-manager com.syncsort.dmexpress.hadoop.DMXMainframeManager -- --cobol-copybook SOMEUSER.SOMEVSAM.CPY --jobcard="(SOMEUSER,WWCDMX900),'DMXGRP',MSGCLASS=Q"
--secure-connection

Specify to transfer mainframe data over FTPS secure connection.


--source-compressed

Specify that mainframe data is compressed.

When transferred in text mode, the data will be uncompressed and converted by DMX-h before
writing to the target. DMX-h supports bzip2 and gzip compression formats. Since the record
format of compressed data is most likely unknown, you can specify the record format with the
--source-recordformat and --source-recordlength arguments.

This argument is not applicable in binary mode; the data will remain compressed in HDFS.
--source-recordformat <recordformat>

Specify the record format of the mainframe data, where <recordformat> is one of the following
mainframe record formats supported by DMX-h:

 mainframefixed - mainframe fixed record format


 mainframeunblockedvariable - mainframe variable record format
 mainframevariable - mainframe variable with block descriptor record format
 mainframevsam - mainframe VSAM record format

If not specified, the record format is retrieved from the mainframe system. If the record format is
unknown, mainframevariable is assumed.
--source-recordlength <recordlength>

Specify the record length of the mainframe data when the record format is specified with the
--source-recordformat argument, where <recordlength> is:

 the record length for fixed record format, in which case the argument is required
 the maximum record length for variable record format, in which case the argument is
optional, with a default value of 32,756 bytes
--target-fileformat <fileformat>

Specify the file format for storing data in HDFS, where <fileformat> can be one of the following
values:

 avrofile - to store data as Apache Avro files, a binary format with serialized Avro objects
and a header containing metadata
 parquetfile - to store data as Apache Parquet files, a columnar format for very efficient
compression and encoding schemes
 sequencefile - to store data as Hadoop sequence files, a binary key/value pair format

If not specified, data transferred in text mode will be stored as delimited text files.

This argument is not applicable in binary mode, as the data will remain in the mainframe format.
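
For example, the following variation of the earlier copybook-based import (host, user, and
dataset names are hypothetical) stores the imported data as Parquet files:

sqoop import-mainframe --connect somehost --username SOMEUSER -P --dataset SOMEUSER.SOMEPDS --connection-manager com.syncsort.dmexpress.hadoop.DMXMainframeManager -- --cobol-copybook SOMEUSER.SOMEPDS.CPY --target-fileformat parquetfile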

COBOL Copybook Metadata Import


The DMX-h Sqoop Mainframe Import Connector can import metadata along with data into Hive,
HCatalog, HBase, and other targets supported by the Sqoop import tool. With the specification
of a COBOL copybook, the DMX-h Sqoop Mainframe Import Connector provides the metadata
to access individual fields in the record. It maps all supported mainframe data types to
appropriate Java, Hive, or HCatalog representations as follows:

 All mainframe numeric data types, which are converted to displayable decimal numbers,
are mapped to Java BigDecimal, Hive double, or HCatalog decimal types.
o The HCatalog decimal type supports specification of precision and scale as of
Hive version 0.13.0; the DMX-h Sqoop Mainframe Import Connector specifies
the maximum precision as 38 and the maximum scale as 30.
 Character data type is mapped to string types in Java and Hive, and varchar in HCatalog.
o For HCatalog varchar type, the DMX-h Sqoop Mainframe Import Connector
specifies the maximum length as 65535.
 The default mapping can be overridden using the following Sqoop arguments:
o --map-column-java (for changing Java mapping)
o --map-column-hive (for changing Hive or HCatalog mapping)
 All non-alphanumeric characters in the field names of a COBOL record layout are
converted to an underscore ('_').
 The structure of composite fields, including array fields, is not kept.
o Each subfield of a composite field is treated as a simple field itself.
o The subfield name is constructed in the form: <composite field name>_<subfield name>
o The elements of an array field are referenced individually.
o The array element name is constructed in the form: <array field name>_<subscript1>_..._<subscriptN>
o The subscripts for multi-dimensional arrays are appended in sequence.
o If an array field is a composite field, the subfield name is added at the end.

For example, the following is a COBOL record layout description:

01 RECORD.

05 FIELD1 PIC 99.

05 FIELD2 OCCURS 2 TO 3 TIMES DEPENDING ON FIELD1.

10 SUBFIELD1 PIC X(2).

10 SUBFIELD2 PIC X(3).

The record consists of a simple field FIELD1 and an array field FIELD2. The array field
FIELD2 has two subfields, SUBFIELD1 and SUBFIELD2. The subfield SUBFIELD1 of
the first element of FIELD2 is represented as FIELD2_1_SUBFIELD1.

Data Transformation Language


DMExpress Data Transformation Language (DTL) is a text-based command language designed
for creating DMExpress jobs and tasks. DTL aligns the conceptual capabilities of the language
syntax with those provided by the DMExpress Job and Task Editors.

The plain text format of DTL can be advantageous over the DMExpress interactive development
environment (IDE) in certain scenarios. Since syntax can be created manually via a text editor or
can be created programmatically, DTL offers great flexibility. Some situations in which DTL can
be leveraged include when:

 DMExpress applications must be generated dynamically based on metadata repositories
residing in database tables or files or based on user input that is provided by a program such as
a Web user interface (UI). Any script, program, or application can be used to read data from any
source and to generate DTL syntax as appropriate.
 similar DMExpress applications must be created based on a template. Scripting or
programs can leverage templates as necessary.
 large numbers of DMExpress applications must be modified. The text format of DTL
allows for easy global editing and standardization.

The DMExpress Task and Job Editors, however, are ideal for situations not requiring dynamic
applications and for beginner DMExpress users as they offer an easy-to-use and intuitive
environment. The IDE provides helper dialogs, data and metadata visualization, context-sensitive
help, and a host of other features for easy application development.
Additionally, you can leverage both DTL and the DMExpress Task and Job Editors to meet your
requirements. You can create tasks and jobs through the IDE, for example, and execute them as
tasks and subjobs from within DTL jobs. This approach provides you with the flexibility to
develop the flows between tasks and subjobs programmatically. Conversely, you can create tasks
and jobs using DTL and either import them as DMExpress tasks and jobs or execute them as
tasks and subjobs from within a custom task defined in the IDE.

For a complete reference on DTL syntax and command options, see the DMExpress Data
Transformation Reference Guide.

Related concepts

Import DTL command


Import DTL command options files
DTL import command options support
External metadata in a DMExpress command options file
External metadata in a DMExpress DTL command options file

Import DTL command options files


You can invoke a Data Transformation Language (DTL) import from the DMExpress interactive
development environment (IDE) and from the command line:

 From the Job Editor and Task Editor, Import DTL allows you to import a specified DTL
command options file into a new job or task.
 From a Windows command line shell, import DTL syntax allows you to import an
unlimited number of DTL command options files into new jobs and tasks.

Before you begin


Prior to initiating a DTL import,

 determine whether to take action as per the DTL pre-import considerations.


 reference the list of supported job and task command options.
Processing distinctions between DTL and the IDE

Differences exist in the way jobs and tasks run when using DTL and the IDE. At runtime and
design time, the dynamism with which data flows are created through DTL or the IDE provides a
point of comparison:

 At runtime, a DTL job exhibits dynamism on the task and subjob level.

Through DTL, you establish data flows between tasks and subjobs; no requirement exists
for identifying data flows at the source and target level. At runtime, DTL dynamically
determines the data flow from the target of one task to the source of another and
automatically creates the data flows between the tasks.

 At design time, the IDE exhibits dynamism at the source and target level.

At the Job Editor, you draw and dynamically establish data flows from the target of one
task to the source of another task. The target and source names for the data flows are
saved to the DMExpress job file. At runtime, DMExpress accesses the DMExpress job
file to reference the static target and source names for the data flows.

Upon DTL import, the Job Editor dynamically creates data flows between tasks based on source
and target names and saves the source and target names to the DMExpress job file. The runtime
dynamism of creating data flows is not available through the Job Editor.

Initiating the DTL import


Initiate a DTL import from either the IDE or from the command line.

IDE import

Job Editor

Upon selecting Import DTL from the File menu of the Job Editor, the Browse dialog displays.
After selecting a DTL job command options file through the Browse dialog, the imported DTL
syntax displays on the Job Editor as a graph in a new unsaved job.

Each DTL subtask/subjob with a valid path is converted into a DMExpress task/job file with a
file extension of .dxt/.dxj: <task_file1...n>.dxt, <job_file1...n>.dxj. These DMExpress task/job
files are stored in the same location as the original DTL task/job command options file.

If you attempt to close the newly-created, top-level job without saving, you are prompted to save
the job.
Task Editor

Upon selecting Import DTL from the File menu of the Task Editor, the Browse dialog displays.
After selecting a DTL task command options file through the Browse dialog, the imported DTL
syntax displays as expanded branches on the Task tree in a new unsaved task.

If you attempt to close the newly-created task without saving, you are prompted to save the task.

DTL import messages

After DTL import, messages display in a DMExpress multi-message dialog:

 For a successful DTL import, generated warning or informational messages display in a


DMExpress warning dialog.
 For a failed DTL import, the fatal error message displays in a DMExpress error dialog.

Command line import

From a Windows command line shell, you can import a single DTL job or task command options
file or multiple DTL job or task command options files in batch. Expansion of wildcard
characters in the specified DTL command options file names is enabled by default. To prevent
wildcard expansion, specify the optional noWildcardExpansion parameter.

Job

dmexpressjob.exe /importDtl <DTL_job_command_options_file_1> [<DTL_job_command_options_file_2> ... <DTL_job_command_options_file_n>] [/noWildcardExpansion]

Task

dmexpresstask.exe /importDtl <DTL_task_command_options_file_1> [<DTL_task_command_options_file_2> ... <DTL_task_command_options_file_n>] [/noWildcardExpansion]

DTL import messages

Depending on the success of the DTL import, the following types of messages display on the
command line: success, warning, or error.

Related concepts
Data Transformation Language
DTL pre-import considerations
DTL import command options support
Import DTL command
Job Editor
Task Editor
Browse dialog

DTL pre-import considerations


When working with environment variables or with relative paths, determine whether to take
action based on the outlined DTL pre-import considerations.

Environment variables
Depending on whether the environment variables in your DTL command options files can be set
in both the interactive development environment (IDE) and through DTL or can be set through
DTL only, you may want to update your environment prior to DTL import.

Environment variables set in the IDE

Certain task and job elements can be specified as environment variables in dialogs within the
Task and Job Editors. Once defined, these environment variables are available to the DMExpress
environment. These same environment variables can also be specified in DTL command options
files.

Upon DTL import, if an environment variable that is specified in a DTL command options file is
also set in the environment, the environment variable value from the environment is inlined and
replaces the corresponding environment variable in the DTL command options file.

To prevent this replacement and retain the environment variables in imported DTL command
options files, clear the environment of these environment variables from either the Windows user
interface (UI) or from the command line prior to DTL import:

 Windows UI: From the Windows Control Panel, navigate to the environment variables:
1. Select System > Advanced System Settings.
2. At the Advanced tab of the System Properties window, select Environment
Variables.
a. In the System variables section of the Environment Variables window,
select an environment variable, which is also included in the DTL
command options file, and select Delete.
b. Repeat this selection and deletion process for each of the applicable
environment variables.
 Command line: Export the environment variables with blank values to the DMExpress
environment. Launch DMExpress from the same command line session.
After successful DTL import, the resulting DMExpress task or job file contains the environment
variables from the DTL command options file. These environment variables allow for dynamic
substitution of values at design time and runtime.
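
The command-line approach might look like the following sketch, which assumes a DTL command options file that references a hypothetical environment variable named SOURCE_DIR and assumes that the DMExpress executables are on the PATH:

rem Blank out the variable in this session so its value is not inlined during import
set SOURCE_DIR=

rem Launch the Task Editor from the same session and perform the DTL import
rem (assumes dmexpresstask.exe opens the Task Editor when run without arguments)
dmexpresstask.exe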

Environment variables set in DTL

DTL offers significant flexibility in the definition of environment variables, whose values can include, but are not limited to, full or partial syntax of DTL command options and keywords. These environment variables cannot be set in dialogs within the Task and Job Editors and are not supported in the IDE.

To retain the values of these environment variables in the imported DTL command options files,
export the environment variables and their values from the command line to the DMExpress
environment prior to DTL import.

Upon DTL import, the exported DTL environment variable values from the environment are
inlined and replace the corresponding environment variables in the DTL command options file.

The resulting DMExpress task or job file contains the static values of the DTL environment
variables, which can be used within the IDE; it does not contain the original DTL environment
variables, which could not be parsed for use within the IDE.

Relative paths
When referencing a relative path within a DTL command options file, the relative path is based
on the operating system’s current working directory. If a relative path to a file within a task or
job is used with the /DATADICTIONARY, /INFILE with XML or header layouts, or /TASK
options, for example, and the referenced file must be processed as part of the import, the current
working directory must be set to the base path from which all relative paths can be resolved.

Prior to DTL import, set the current working directory to the base path either through the IDE or
at the command line:

 IDE: From either the Open or Save dialog on the Task or Job Editor, navigate to the
directory that will serve as the base path and either open or save any task or job file.
Upon file open or save, the current working directory is set to the specified base path.
 Command line: Navigate to the directory that will serve as the base path. Launch
DMExpress. The current working directory is set to the base path from which you
launched DMExpress.
Related concepts

Data Transformation Language


Import DTL command options files
DTL import command options support
Environment variables as parameters
Using environment variables in a DMExpress task or job
Import DTL command
Job Editor
Task Editor
Browse dialog

DTL import command options support


DTL Import currently includes support for a subset of DTL job and task command options.

Supported DTL job command options


Unless otherwise specified, all of the arguments for each of the following DTL job command
options are supported:

DTL job command options (notes and processing distinctions, where applicable, follow each option):

/CUSTOMTASK
/DTL
/END
/FLOW
    Run sequences, non-direct, and direct data flows are supported.

    Note: The Job Editor does not support the concept of a global flow setting and, therefore, does not support the global /DEFAULTFLOW option.

    Upon DTL import, the Job Editor imports each data flow as a non-direct data flow by default. Only when the DIRECT keyword is specified in the /FLOW option does the Job Editor import the specified, individual data flow as a direct data flow.

    After the DTL import, any additional data flow must be specified as DIRECT to avoid writing an intermediate file to disk when connecting the single file target of one task to the single file source of another task.
/SUBJOB, /TASK
    Upon DTL import, the Task Editor converts each of the subjobs and tasks outlined in the DMExpress command options file and adds them to the new DMExpress job.
Supported DTL task command options
Unless otherwise specified, all of the arguments for each of the following DTL task command
options are supported:

DTL task command options (notes and processing distinctions, where applicable, follow each option):

/CONDITION
/DATADICTIONARY
    In alignment with the support provided for linking to DTL command options files as external metadata, DTL import honors the /DATADICTIONARY considerations for linking to files from within DTL command options files.
/DATASIZE
    When the specified /DATASIZE unit is KILOBYTES, MEGABYTES, GIGABYTES, or TERABYTES, the Task Editor imports the data size in the equivalent number of bytes.
/DBCOLUMNSEXTRACT
/DBCONNECTION
/DBDATA
/DBINPUT
/DBOUTPUT
/DELIMITEDRECORDLAYOUT
/DTL
/END
/FILE, /INFILE
    The Task Editor does not support aliases and, therefore, does not support the layout_alias argument. Upon DTL import, the Task Editor ignores layout aliases, which results in the generation of warnings or errors that identify broken expressions.
/FILTER
/GROUPBY
/HCATCONNECTION
/INMAXRECORDLENGTH
/INPIPE
/JOINKEYS
/KEYS
/LOGOPTIONS
/MEMORY
    When the specified /MEMORY unit is KILOBYTES, MEGABYTES, GIGABYTES, or TERABYTES, the Task Editor imports the memory size rounded to the closest number of megabytes.
/MQCONNECTION
/MQFIELDSEXTRACT
/MQINPUT
/MQOUTPUT
/NOPROMPT
/NULLIF
/OUTFILE
/OUTPIPE
/PADBYTE
/PARTITION
    When the specified size unit for any of the maximum_size arguments is KILOBYTES, MEGABYTES, GIGABYTES, or TERABYTES, the Task Editor imports the maximum size in the equivalent number of bytes.

    When the keyword UNSELECTEDRECORDS is used, the value of the specified partition_suffix argument is replaced with the value "remaining." The Task Editor always appends the suffix "remaining" to the target file name that corresponds to the partition of unselected records.
/RECORDLAYOUT
/REFORMAT
/SERVERCONNECTION
/SFDCCONNECTION
/STABLE
/SUMMARY
    The Task Editor does not support summary names and, therefore, does not support the sum_name argument. Upon DTL import, the summary name is ignored, but the summarize function is retained and always displays when referencing the summary field/value.
/TASKDOCUMENTATION
/TASKTYPE
    While all of the /TASKTYPE arguments are supported, the Task Editor automatically performs a sequence check on the input records to all merge and aggregate tasks and, therefore, does not support the keyword VERIFYORDER. Upon DTL import, no message is generated that notifies whether VERIFYORDER is or is not specified in the /TASKTYPE option.

    After DTL import, the Task Editor automatically performs a runtime sequence check on the input file records to all merge and aggregate tasks. This sequence check may result in performance degradation for those merge and aggregate tasks that did not include VERIFYORDER in the original DTL command options file.
/VALUE
    In alignment with the support provided for linking to DTL command options files as external metadata, DTL import honors the /VALUE considerations for text constants and for functions.
/WORKSPACE
/XMLEXTRACT

For a comprehensive list of DTL command options, see the DMExpress Data Transformation Reference Guide.

Related concepts

Data Transformation Language


DTL pre-import considerations
Import DTL command options files
Import DTL command
External metadata in a DMExpress DTL command options file
Job Editor
Task Editor
Browse dialog

DMExpress Data Funnel


DMExpress Data Funnel (DMXDFNL) is a command-line tool that leverages the DMExpress
high-performance engine to transfer large volumes of data from source databases within an
unlimited number of database management systems (DBMSs) to target databases within an
unlimited number of DBMSs or to target files, which can be stored on a local file system or on
the Hadoop Distributed File System (HDFS). High-level data extraction and transfer
requirements such as "do not transfer any columns of this data type to the target," "move all
tables in this schema except for these," and "execute this data movement on any node in the
Hadoop cluster" are implemented through the specification of name/value pairs and strings in a
DMXDFNL configuration file.

Data enterprise mass data transfer solution


In generating DMExpress Data Transformation Language (DTL) tasks, which are executed
through the DMExpress engine, DMXDFNL harnesses DMExpress' processing power to
facilitate mass data transfers both within your data enterprise environment and from your data
enterprise environment to HDFS.

While the DMExpress Job and Task Editors enable the effective execution of a single data
transfer, DMXDFNL enables data transfers en masse. The transfer of data from a database
schema to HDFS, for example, would traditionally require the creation of individual tasks
through the Task Editor for each and every table in the database schema. Today, you can avoid the manual creation of hundreds, if not thousands, of individual data transfer tasks by running DMXDFNL at the command line. In facilitating the execution of automatically-
generated DTL tasks through the DMExpress engine, DMXDFNL seamlessly optimizes the mass
transfer of data from your data enterprise environment to HDFS. DMXDFNL provides you with
the capability to populate your Hadoop data lake using a single command.

Things to know
Requirements

Supported sources and targets

Packaging considerations

Syntax and command execution

How DMXDFNL works


The transfer of data is specified through a DMXDFNL configuration file, which is a JSON file
that you create, name, and populate. Through name/value pairs and strings defined through
JSON objects and lists, you specify the following requirements for each data transfer definition:

 Credentials for all source database connections and for all target database or target file
connections.
 Order of priority in which generated DTL tasks are run
 Source databases and the data that is eligible for extraction
 Target databases or target files, to which the extracted volume of data is transferred

While you can choose to extract and transfer all of the data within a database schema or schemas, you can fine-tune the data transfer definitions specified within the DMXDFNL configuration file to meet your needs. You can specify the data types to exclude from an extraction, and you can specify either the tables to exclude from an extraction or the tables to include in it. In addition, you can specify whether a job runs on a Hadoop-determined node in the Hadoop cluster.

DMXDFNL also allows for the execution of existing DMExpress task files, or DMExpress job
files, or both. Through strings defined through JSON objects and lists, you specify the following
requirements for each execute existing task job definition:

 Directory paths to existing DMExpress task files


 Directory paths to existing DMExpress job files

Hierarchy of the DMXDFNL configuration file

To ensure that DMXDFNL has an accurately populated configuration file with the correct syntax, familiarize yourself with the members and elements that make up the JSON objects and lists. For an outline of the file components and details relevant to each member and element, see DMXDFNL configuration file.

Process overview

Upon accessing the DMXDFNL configuration file, DMXDFNL does the following:

 Parses the configuration file


 Generates the DTL tasks that define the transfer of data
 Executes the generated DTL task files and any existing DMExpress task and job files that
are specified for execution

In parsing the DMXDFNL configuration file, DMXDFNL identifies the connections to the
source databases and to the target databases or target files for each specified DBMS. DMXDFNL
generates DTL tasks and identifies the priority in which the tasks are run and the number of tasks
that can be run in parallel. DMXDFNL identifies whether generated DTL task files are saved,
and if saved, the directory to which they are written. Unless otherwise specified, DMXDFNL
executes not only the generated DTL tasks through the DMExpress engine, but also executes all
existing DMExpress task and job files that are specified for execution. The progress of task and
job execution, warnings or errors, and run statistics are logged to standard output and display on
the system console.
Enhancement requests
Syncsort is committed to continued DMXDFNL development. If you have a requirement that could be met by extending DMXDFNL capabilities, contact Syncsort technical support at dmxsupport@syncsort.com.

Related concepts

DMXDFNL configuration file


Connections object: database member
Connections object: HDFS member
Job object: dtl_export_dir member
Job object: no_execute member
Job object: default_parallel member
Job object: subunit_order member
Job object: subunits member

DMXDFNL configuration file


The DMXDFNL configuration file must be a valid JSON file. As with all JSON files, the
configuration file must contain a root object, which includes the required connections and job
objects:

 connections

Through the required connections object, you can specify the following two types of
connections:

o database
o hdfs
 job

The required job object specifies job level settings and subunits, within which the transfer
of data is specified. The job object members include the following:

o dtl_export_dir
o no_execute
o default_parallel
o subunit_order
o subunits
DMXDFNL configuration file example
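
The following minimal sketch illustrates the overall structure of a configuration file; every alias, connection value, schema, and path shown is hypothetical and would need to be replaced with values from your own environment:

{
    "connections" : {
        "database" : [
            { "alias" : "src_oracle", "dbms" : "oracle", "database" : "ORCL" }
        ],
        "hdfs" : [
            { "alias" : "lake", "namenode" : "namenode.example.com" }
        ]
    },
    "job" : {
        "dtl_export_dir" : "/tmp/dmxdfnl_dtl",
        "default_parallel" : 4,
        "subunit_order" : [ "sales_transfer" ],
        "subunits" : {
            "sales_transfer" : [
                {
                    "source" : { "connection" : "src_oracle", "schema" : "SALES" },
                    "target" : { "connection" : "lake", "dir" : "/data/sales" }
                }
            ]
        }
    }
}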

Related concepts

DMExpress Data Funnel

Connections object: database member


The optional database member is a list that specifies at least one anonymous database connection
object, which defines the connection through which DMXDFNL connects to a database source or
target. The members of an anonymous database connection object include the following:

 alias

The required alias member is a string that specifies the name of the database connection
and how it is referenced in the data transfer. This alias is specified in either the source
database or target database connection strings.

"alias" : "<name_of_database_connection>"

 dbms

The required dbms member is a string that specifies the database management system to
which DMXDFNL connects through the database connection.

"dbms" : "<db2|db2zos|hcatalog|hive|oracle|sqlserver|teradata>"

 database

The required database member is a string that specifies the name of the data source to
which DMXDFNL connects. The value component of the name/value pair must be a
valid identifier, which is based on the DBMS type:

o Apache HCatalog - WebHCat service URL that DMExpress uses to communicate


with WebHCat. As this URL references one schema only, you must define a
connection for each HCatalog schema with which DMExpress communicates.
o Apache Hive - ODBC DSN.
o IBM DB2 for Linux, UNIX, and Windows - DB2 connection alias.
o IBM DB2 for z/OS - DB2 connection alias.
o Oracle - Connect identifier in tnsnames.ora.
o Teradata - Hostname or IP address of the database.
o Microsoft SQL Server - ODBC DSN.
"database" : "<valid_identifer_for_DBMS_type>"

 authtype

The optional authtype member is a string that specifies the method used to authenticate
the database. Specify an authentication method supported by the specified database. The
default value is autodetect.

"authtype" : "<autodetect|dbms|os|kerberos>"

 username and password

The optional username and password members are strings, which can be used for all
authentication methods except OS. If you specify username and password values, these
values display in the generated DTL tasks. If you choose to specify the password as an
environment variable, the environment variable, for example, $DB_PASSWD, displays
in the generated DTL tasks.

"username" : "<user_name>",

"password" : "<password>"

 hive_site

For HCatalog connections, the required hive_site member is a string that specifies the
directory path to a Hive XML configuration file. Typically, the directory path to hive-
site.xml is specified.

"hive_site" : "<directory_path_to_Hive_XML_configuration_file>"

database connection example
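
For instance, a database connection entry might look like the following sketch; the alias, data source name, and credentials are hypothetical:

"database" : [
    {
        "alias" : "dw_oracle",
        "dbms" : "oracle",
        "database" : "DWPROD",
        "authtype" : "dbms",
        "username" : "etl_user",
        "password" : "$DB_PASSWD"
    }
]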

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Connections object: HDFS member


The optional HDFS member is a list that specifies at least one anonymous HDFS connection
object, which defines the remote file connection through which DMXDFNL connects to an
HDFS target. The members of an anonymous HDFS connection object include the following:

 alias

The required alias member is a string that specifies the name of the HDFS connection and
how it is referenced in the data transfer. This alias is specified in the target file
connection string.

"alias" : "<name_of_HDFS_connection>"

 namenode

The required namenode member is a string that specifies either the hostname or, in high
availability clusters, the name service of the namenode. DMXDFNL connects to the
namenode through the HDFS connection.

"namenode" : "<hostname_or_name_service_of_namenode>"

hdfs connection example
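
For instance, an HDFS connection entry might look like the following sketch; the alias and name service are hypothetical:

"hdfs" : [
    {
        "alias" : "lake",
        "namenode" : "nameservice1"
    }
]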

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Job object: dtl_export_dir member


The optional dtl_export_dir member is a string that specifies the directory to which DMXDFNL
writes the generated DTL tasks to file. The generated DTL task files can be used for auditing
purposes, for import into the Task Editor to enable metadata access for downstream tasks, and to
allow for you to customize and execute the data transfer. If you are customizing the data transfer,
you can execute individual DTL tasks, or you can create a job that includes some or all of the
DTL tasks.

"dtl_export_directory" : "<full_directory_path>"
When a directory value is specified for dtl_export_dir, the generated DTL task files, which are
written to the specified directory, are named according to the following convention:

tsk_<schema_name>.<table_name>.tdtl

Note: If your schema name or table name has special characters, for example, slash, ‘/’,
DMXDFNL may not be able to write the task to file or to execute the task.

Consider the following:

 If a directory is not specified, the generated DTL tasks are not written to file.
 If the directory is set to the same directory that was set for a prior run, same-named DTL
task files are overwritten.
 If the directory is set to a different directory from that set for a prior run, the DTL task
files from the prior run remain in the directory to which they were written.

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Job object: no_execute member


The optional no_execute member is a Boolean data type that allows you to prevent the execution
of the generated DTL task files, which are written to the directory specified in dtl_export_dir.
The default value is false.

"no_execute" : <true|false>

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Job object: default_parallel member


The optional default_parallel member is an integral data type that specifies the number of tasks
that can run simultaneously for all subunits except those for which parallelism is defined at the
subunit order level. The default value is 1. If specified, the value must be greater than 1.

"default_parallel" : [1..]

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Job object: subunit_order member


The required subunit_order member is a list that specifies the order in which the subunits are
executed and the subunit level of parallelism. A subunit_order list element can include a string,
in which case the string is an alias, or can include the following subunit_order object members:

 alias

The required alias member is a string that must match a subunit_alias name, which you
define in the subunits object to identify a subunit list.

"alias" : "<name_of_subunit_list_in_subunit_object>"

 parallel

The optional parallel member is an integral data type that specifies the number of DTL
tasks that can run simultaneously for the specified subunit. When set, this subunit_order
parallel value overrides the job-level default_parallel value. The default value is 1. If
specified, the value must be greater than 1.

"parallel" : [1..]

Note: All subunits that are scheduled in the subunit_order list must be defined in the subunits
object also.

subunit_order example
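
For instance, a subunit_order list might look like the following sketch, in which the hypothetical subunit dimension_tables runs first with the default level of parallelism and the hypothetical subunit fact_tables runs next with up to 8 tasks in parallel:

"subunit_order" : [
    "dimension_tables",
    { "alias" : "fact_tables", "parallel" : 8 }
]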
Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Job object: subunits member


The required subunits member is an object containing one or more subunit aliases. Each subunit
alias is a list of one or more anonymous data transfer definitions, or one or more execute existing
task job definitions, or a combination of data transfer definitions and execute existing task job
definitions:

 Each data transfer definition subunit list element is an object that specifies the source
from which data is extracted and the target into which the extracted data is loaded.
 Each execute existing task job definition subunit list element is an object that specifies
the directory paths to existing DMExpress task files, or to existing DMExpress job files,
or to both.

Each subunit_alias defines at least one of the following, which occurs as a single schedulable
unit:

 a set of source to target transfers


 a set of directory paths to existing DMExpress task files, or DMExpress job files, or both

You assign a unique value to each subunit_alias that is defined within the DMXDFNL
configuration file.

A subunit list element includes at least one of the following:

 Anonymous data transfer definition object members:


o node_agnostic
o source
o target
 Anonymous execute existing task job definition object members:
o tasks
o jobs

Note: More subunits can be defined in the subunits object than are scheduled in the
subunit_order list; however, all subunits scheduled in the subunit_order list must be defined in
the subunits object.

subunits example
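
For instance, a subunits object might look like the following sketch, which defines one hypothetical subunit of data transfer definitions and one hypothetical subunit that executes existing DMExpress task and job files; all aliases, connection names, and paths are illustrative:

"subunits" : {
    "dimension_tables" : [
        {
            "source" : { "connection" : "dw_oracle", "schema" : "DIM" },
            "target" : { "connection" : "lake", "dir" : "/data/dim" }
        }
    ],
    "existing_work" : [
        {
            "tasks" : [ "/home/etl/tasks/cleanup_dimensions.dxt" ],
            "jobs" : [ "/home/etl/jobs/nightly_refresh.dxj" ]
        }
    ]
}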
Related concepts

DMExpress Data Funnel


DMXDFNL configuration file

Anonymous data transfer definition object:


node_agnostic member
The optional node_agnostic member is a Boolean data type that specifies whether a DMXDFNL-
generated task can be run on a Hadoop-determined node in the Hadoop cluster. When
node_agnostic is true, the task is executed on a Hadoop-determined node in the Hadoop cluster;
when node_agnostic is false, the task is executed on the system running DMXDFNL. The default
value is false.

"node_agnostic" : <true|false>

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file
Job object: subunits member

Anonymous data transfer definition object:


source member
The required source member is an object, which defines the connection to the source, the
database source, and the data to be extracted from the database source. By default, the source
object defines an extraction that includes all of the tables in the schema; however, you can
specify the data types to exclude from the extraction, and you can specify either the tables to
exclude from the extraction or the tables to include in the extraction.

The source object members include the following:

 connection
The required connection member is a string that specifies the alias of the connection to
the database source. The alias is defined in the database connection object.

"connection" : "<alias_of_connection_to_database_source>"

 schema

The required schema member is a string that specifies the name of the source database
schema.

"schema" : "<source_database_schema>"

 exclude_types

The optional exclude_types member is a list that specifies the RDBMS data type or data
types, for example, large object data types, to exclude from the extraction.

"<RDBMS_data_type>"

 exclude

The optional exclude member is a list that specifies the database table or tables to exclude
from the extraction.

Note: An exclude list can be specified only when the optional include list is not specified.

"<table_name>"

 include

The optional include member is a list that specifies the database table or tables to include
in the extraction. No tables in the database schema other than those specified in the
include list are included in the extraction. For each element in the include list, you can
specify either a table name or a table name and filter, which allows you to filter records in
the table based on a condition.

Note: An include list can be specified only when the optional exclude list is not specified.

An include list element can include a string, in which case the string is the table name, or
can include the following include object members:

o table

The required table member is a string that specifies the name of the database table
to be extracted.
"table" : "<table_name>"

o filter

The optional filter member is a string that specifies the condition upon which
records are extracted from the database table. The filter string is equivalent to the
text that follows a WHERE clause in a SQL query.

"filter" : "<filter_condition>"

 extraction_formats

The optional extraction_formats member is an object that enables you to override the default
extraction of date/time source columns. As DMExpress does not extract fractional
seconds in date/time data by default, you could, for example, use extraction_formats to
extract the fractional seconds.

"<DBMS_data_type>" : "<DMExpress_date/time_mask>"

source example
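
For instance, a source object might look like the following sketch; the connection alias, schema, table, and column names are hypothetical, and the filter value is ordinary SQL WHERE-clause text:

"source" : {
    "connection" : "dw_oracle",
    "schema" : "SALES",
    "exclude_types" : [ "BLOB", "CLOB" ],
    "include" : [
        "CUSTOMERS",
        { "table" : "ORDERS", "filter" : "ORDER_STATUS = 'SHIPPED'" }
    ]
}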

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file
Job object: subunits member

Anonymous data transfer definition object:


target member
The required target member is an object, which defines the connection to the target and the
target, which can be either a database or a file. The target database destination or target file name
is inferred from the source database:

 Database targets - The name of the database table to which the data is transferred is the
same as the name of the database source table from which the data is extracted.
 File targets - The name of the target file is derived from the source database schema and
table names.
Character encoding
UTF-8 is the character encoding for all of the extracted data written to either target database
tables or to target files.

Database targets
The target object database members include the following:

 connection

The required connection member is a string that specifies the alias of the connection to
the database target. The alias is defined in the database connection object.

"connection" : "<alias_of_connection_to_database_target>"

 schema

The required schema member is a string that specifies the name of the target database
schema.

"schema" : "<target_database_schema>"

 disposition

The optional disposition member is a string that specifies whether to overwrite, that is, to
truncate all existing rows from the target table before inserting the extracted records, or to
append the extracted records to the end of the target table. The default is overwrite.

"disposition" : "<overwrite|append>"

target database example
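
For instance, a database target object might look like the following sketch; the connection alias and schema are hypothetical:

"target" : {
    "connection" : "dw_teradata",
    "schema" : "STAGE",
    "disposition" : "overwrite"
}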

File targets
The target object file members include the following:

 connection
The optional connection member is a string that specifies the alias of a remote file connection to HDFS; it must be specified when the target is HDFS. The alias is defined in the hdfs connection object. When a connection string is not specified, the target defaults to the local file system.

"connection" : "<alias_of_remote_file_connection_to_HDFS>"

 dir

The required dir member is a string that specifies the local file directory or HDFS to
which the target file is written.

"dir" : "<target_file_directory>"

The target file written to the local file directory or to HDFS is named according to the
following convention:

<schema_name>.<table_name>.dat

 delimiter

The optional delimiter member is a string that specifies the single-byte delimiter that is
used to separate records in the target file. The default is a comma, ",".

"delimiter" : "<single_byte_delimiter>"

 enclosing_char

The optional enclosing_char member is a string that specifies the single-byte character
that is used to enclose fields in the target file records. The default is no enclosing
characters.

"enclosing_char" : "<single_byte_field_enclosing_character>"

 disposition

The optional disposition member is a string that specifies whether to overwrite the same-
named target file or to append the extracted records to the end of the target file. The
default is overwrite.

"disposition" : "<overwrite|append>"

target file example
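
For instance, a file target object that writes pipe-delimited, quote-enclosed records to HDFS and appends to any existing target files might look like the following sketch; the connection alias and directory are hypothetical:

"target" : {
    "connection" : "lake",
    "dir" : "/data/sales",
    "delimiter" : "|",
    "enclosing_char" : "\"",
    "disposition" : "append"
}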


Related concepts

DMExpress Data Funnel


DMXDFNL configuration file
Job object: subunits member

Anonymous execute existing task job


definition object: tasks member
The optional tasks member is a list of strings. A tasks list element is a string that specifies the
directory path to the existing DMExpress task file or files that will be executed on the local file
system.

"<directory_path_to_and_name_of_existing_DMExpress_task_files>"

tasks example
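
For instance, a tasks list might look like the following sketch; the directory paths and file names are hypothetical:

"tasks" : [
    "/home/etl/tasks/validate_customers.dxt",
    "/home/etl/tasks/load_customers.dxt"
]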

Related concepts

DMExpress Data Funnel


DMXDFNL configuration file
Job object: subunits member

Anonymous execute existing task job


definition object: jobs member
The optional jobs member is a list of strings. A jobs list element is a string that specifies the
directory path to the existing DMExpress job file or files that will be executed on the local file
system.

"<directory_path_to_and_name_of_existing_DMExpress_job_files>"

jobs example
Related concepts

DMExpress Data Funnel


DMXDFNL configuration file
Job object: subunits member
