Ab Initio Basic Tutorial
Ab Initio is one of the important ETL (Extract, Transform, Load) tools for analyzing data
for business purposes.
ETL process
Extract: In this process the required data is extracted from sources such as text
files, databases, and other source systems.
Transformation: In this process the extracted data is converted into the format required
for analyzing the data.
Load: In this process the data is loaded into the data warehouse, data repository, or
other target application.
1. Manages and runs Ab Initio graphs and controls the ETL processes
2. Provides Ab Initio extensions to the operating system
3. ETL process monitoring and debugging
4. Metadata management and interaction with the EME
5. The Graphical Development Environment (GDE), which is installed on a PC (the GDE
computer) and configured to communicate with the host.
The GDE is a graphical application for developers, used for designing and running
Ab Initio graphs. It allows the user to create graphs and is very easy to use. When a
graph is run, the GDE converts the graph into a shell script, which is executed on the
Unix server.
A graph is a data flow diagram that defines the various processing stages of a task
and the streams of data as they move from one stage to another. In a graph, a
component represents a stage and a flow represents a data stream. In addition, there
are parameters for specifying various aspects of graph behavior.
You build a graph in the GDE by dragging and dropping components, connecting them
with flows, and then defining values for parameters. You run, debug, and tune your graph in
the GDE. In the process of building a graph, you are developing an Ab Initio application,
and thus graph development is referred to as graph programming. When you are ready
to deploy your application, you save the graph as a script that you can run from the
command line. A sample graph is shown above.
Graph Parameters
In the sample graph shown below, the input file and the output file are
database components. The details about each component are explained in the following
sections. The input file and the Partition by Round-robin component are connected by
flows. A flow is the path along which data passes in the graph.
Parts of a Graph
Metadata: Metadata is any information about data or how to process it. There are two
broad categories of metadata:
Dataset: A dataset is a component that contains the tables or files which hold the input
and output data.
Component: Components are used to build a graph. The Component Organizer contains
all the components.
Layout:
4. A parallel layout specifies multiple nodes and multiple directories. The same node may
be repeated.
5. The location of a dataset is one or more places on one or more disks.
Phase:
Phases are used to break up a graph into blocks for performance tuning. The primary
purpose of phasing is performance tuning by managing resources. Phasing limits the
number of simultaneous processes by breaking up the graph into different phases, only
one of which is running at any given time. One common use of phasing is to avoid
deadlocks. The temporary files created because of phase breaks are deleted at the
end of the phase, regardless of whether the run was successful or not.
Checkpoint:
2. The main aim of checkpoints is to provide a means to restart a failed graph from
some intermediate state.
3. In this case, the temporary files from the last successful checkpoint are retained so
that the graph can be restarted from this point in the event of a failure. Only as each
new checkpoint is completed successfully are the temporary files corresponding to the
previous checkpoint deleted.
Graph execution and deployment:
1. Graph execution can be done from the GDE itself or from the back end as well.
2. A graph can be deployed to the back-end server as a Unix shell script or Windows
NT batch file.
3. The deployed shell or batch file can be executed at the back end.
Sandbox:
Sandbox Structure:
db - database-related files
mp - graphs
xfr - transforms
Ab Initio Components
In this section we are going to discuss some basic and important components in
Ab Initio:
1. Dataset components
2. Departition
3. Partition
4. Sort
5. Transform
6. Miscellaneous
7. Validate
Dataset Components:
Dataset components represent data records or act upon data records as follows:
INPUT FILE represents data records read as input to a graph from one or more
serial files or from a multifile
INPUT TABLE unloads data records from a database into an Ab Initio graph,
allowing you to specify as the source either a database table, or an SQL
statement that selects data records from one or more tables.
LOOKUP FILE represents one or more serial files or a multifile of data records small
enough to be held in main memory, letting a transform function retrieve records
much more quickly than it could retrieve them if they were stored on disk.
OUTPUT FILE represents data records written as output from a graph into one or
more serial files or a multifile.
OUTPUT TABLE loads data records from a graph into a database, letting you
specify the records' destination either directly as a single database table, or
through an SQL statement that inserts data records into one or more tables.
Departition components combine multiple flow partitions of data records into a single
flow, as follows:
Concatenate:
1. Reads all the data records from the first flow connected to the in port (counting
from top to bottom on the graph) and copies them to the out port.
2. Then reads all the data records from the second flow connected to the in port
and appends them to those of the first flow, and so on.
GATHER:
1. Reads data records from the flows connected to the in port and combines the
records arbitrarily.
INTERLEAVE:
Interleave combines blocks of data records from multiple flow partitions in round-robin
fashion
Parameter:
blocksize
(integer, required)
Number of data records Interleave reads from each flow before reading the same
number of data records from the next flow.
Default is 1.
1. Reads the number of data records specified in the blocksize parameter from the
first flow connected to the in port.
2. Reads the number of data records specified in the blocksize parameter from the
next flow, and so on.
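As an illustration only (not Ab Initio code), the round-robin behavior described above can be sketched in Python, assuming each flow partition is a simple list and blocksize defaults to 1:

def interleave(partitions, blocksize=1):
    """Combine blocks of records from multiple partitions in round-robin order."""
    iters = [iter(p) for p in partitions]
    out = []
    active = True
    while active:
        active = False
        for it in iters:
            # Take up to `blocksize` records from this partition before moving on.
            for _ in range(blocksize):
                rec = next(it, None)
                if rec is None:
                    break
                out.append(rec)
                active = True
    return out

# Example: two partitions interleaved one record at a time.
print(interleave([[1, 3, 5], [2, 4, 6]]))  # [1, 2, 3, 4, 5, 6]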
MERGE:
Merge combines the data records from multiple flow partitions that have been sorted
based on the same key specifier, and maintains the sort order
Parameter:
key: Name of the key field(s) (primary key or unique key) used to combine the data;
there may be more than one key field.
1. Reads the records from the in port and combines them, maintaining the sort
order.
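To illustrate the idea, here is a small Python sketch using the standard library's heapq.merge, assuming each input partition is already sorted on the same key (the field name "id" is purely illustrative):

import heapq
from operator import itemgetter

# Each partition is already sorted on the key field "id".
part0 = [{"id": 1}, {"id": 4}, {"id": 7}]
part1 = [{"id": 2}, {"id": 5}]
part2 = [{"id": 3}, {"id": 6}]

# Merge the sorted partitions into one flow, preserving the sort order.
merged = list(heapq.merge(part0, part1, part2, key=itemgetter("id")))
print([r["id"] for r in merged])  # [1, 2, 3, 4, 5, 6, 7]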
Sort components
Checkpoint sort:
It sorts and merges data records, inserting a checkpoint between the sorting and
merging phases
FIND SPLITTERS:
Find Splitters sorts data records according to a key specifier, and then finds the ranges
of key values that divide the total number of input data records approximately evenly
into a specified number of partitions.
Parameter:
key
(key specifier, required)
Name(s) of the key field(s) and the sequence specifier(s) that you want Find Splitters to
use when it orders data records and sets splitter points.
Number of partitions
(Integer, required)
Number of partitions into which you want to divide the total number of data records
evenly.
Connect the output from the out port of Find Splitters to the split port of PARTITION BY
RANGE.
If n represents the value in the num_partitions parameter, Find Splitters generates n-1
splitter points. These points specify the key values that divide the total number of input
records approximately evenly into n partitions.
You do not have to provide sorted input data records for Find Splitters. Find Splitters
sorts internally.
Find Splitters:
1. Writes a set of splitter points to the out port in a format suitable for the split port
of Partition by Range
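The splitter-point idea can be sketched in Python: sort the records on the key and pick n-1 key values that divide the data roughly evenly (the field name and data are illustrative, not Ab Initio syntax):

def find_splitters(records, key, num_partitions):
    """Return num_partitions - 1 key values that split the sorted data evenly."""
    ordered = sorted(records, key=lambda r: r[key])
    n = len(ordered)
    return [ordered[(i * n) // num_partitions][key]
            for i in range(1, num_partitions)]

records = [{"amount": v} for v in [7, 1, 9, 3, 5, 8, 2, 6, 4, 10]]
print(find_splitters(records, "amount", 4))  # [3, 6, 8]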
Sort
The Sort component sorts the data in ascending or descending order according to the
specified key.
By default, sorting is done in ascending order. To sort in descending order, the
descending radio button has to be selected.
The max-core parameter value is required to be specified. Though there is a default
value, it is recommended to use the $ variables defined in the system ($MAX_CORE,
$MAX_CORE_HALF, etc.).
1. Reads the records from all the flows connected to the in port until it reaches the
number of bytes specified in the max-core parameter
2. Sorts the records and writes the results to a temporary file on disk
3. Repeats this procedure until it has read all records
4. Merges all the temporary files, maintaining the sort order
5. Writes the result to the out port. Sort stores temporary files in the working
directories specified by its layout
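Steps 1-5 describe an external sort. A minimal Python sketch of that procedure, approximating the max-core byte limit with a record-count limit and using in-memory lists in place of temporary files:

import heapq

def external_sort(records, key, max_records_per_run=3):
    """Sort in limited-size runs, spill each run, then merge the runs."""
    runs, run = [], []
    for rec in records:
        run.append(rec)
        if len(run) >= max_records_per_run:     # "max-core" reached
            runs.append(sorted(run, key=key))   # sort and "spill" the run
            run = []
    if run:
        runs.append(sorted(run, key=key))
    # Merge all runs, maintaining the sort order.
    return list(heapq.merge(*runs, key=key))

data = [5, 2, 9, 1, 7, 3, 8]
print(external_sort(data, key=lambda x: x))  # [1, 2, 3, 5, 7, 8, 9]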
Sort within Groups refines the sorting of data records already sorted according to one
key specifier: it sorts the records within the groups formed by the first sort according to a
second key specifier.
Sort within Groups assumes input records are sorted according to the major-key
parameter.
Sort within Groups reads data records from all the flows connected to the in port until it
either reaches the end of a group or reaches the number of bytes specified in the
max-core parameter.
When Sort within Groups reaches the end of a group, it sorts the records in that group
according to the minor-key parameter and writes the result to the out port.
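A small Python sketch of the idea, assuming the input is already sorted on the major key and each major-key group is re-sorted on the minor key (field names are illustrative):

from itertools import groupby
from operator import itemgetter

def sort_within_groups(records, major_key, minor_key):
    """Input must already be sorted on major_key; sort each group by minor_key."""
    out = []
    for _, group in groupby(records, key=itemgetter(major_key)):
        out.extend(sorted(group, key=itemgetter(minor_key)))
    return out

recs = [{"dept": "A", "emp": 3}, {"dept": "A", "emp": 1},
        {"dept": "B", "emp": 2}, {"dept": "B", "emp": 1}]
print(sort_within_groups(recs, "dept", "emp"))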
Transform components
Aggregate
Aggregate generates data records that summarize groups of data records (similar to
Rollup), but it provides less control over the data.
Parameters: the parameters for Aggregate are sorted-input, key, max-core,
transform, select, reject-threshold, ramp, error log, and reject log.
The input must be sorted or grouped before using Aggregate; you can use the max-core
parameter (in-memory mode) or a Sort component to sort the data.
If you do not supply an expression for the select parameter, Aggregate processes all the
records on the in port.
If you have defined the select parameter, Aggregate applies the select expression to the
records.
3. Aggregates the data records in each group, using the transform function as
follows:
a. For the first record of a group, Aggregate calls the transform function with two
arguments: NULL and the first record.
Aggregate saves the return value of the transform function in a temporary aggregate
record that has the record format of the out port.
b. For the rest of the data records in the group, Aggregate calls the transform
function with the temporary record for that group and the next record in the group as
arguments.
Again, Aggregate saves the return value of the transform function in a temporary
aggregate record that has the record format of the out port.
Aggregate stops execution of the graph when the number of reject events exceeds the
result of the following formula: limit + (ramp × number of records processed so far).
For more information, see "Setting limits and ramps for reject events".
5. Aggregate writes the temporary aggregate records to the out port in one of two
ways, depending on the setting of the sorted-input parameter:
a. When sorted-input is set to Input must be sorted or grouped, Aggregate writes the
temporary aggregate record to the out port after processing the last record of each
group, and repeats the preceding process with the next group.
b. When sorted-input is set to In memory: Input need not be sorted, Aggregate first
processes all the records, and then writes all the temporary aggregate records to the
out port.
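The per-group calling pattern in step 3 (NULL plus the first record, then the temporary record plus each subsequent record) can be sketched in Python for grouped input; the count-and-sum transform is only an illustrative example:

from itertools import groupby
from operator import itemgetter

def transform(temp, rec):
    """Illustrative aggregate transform: count records and sum an amount."""
    if temp is None:                       # first record of the group: temp is NULL
        return {"key": rec["key"], "count": 1, "total": rec["amount"]}
    return {"key": temp["key"],
            "count": temp["count"] + 1,
            "total": temp["total"] + rec["amount"]}

def aggregate(records, transform):
    out = []
    for _, group in groupby(records, key=itemgetter("key")):  # input is grouped
        temp = None
        for rec in group:
            temp = transform(temp, rec)    # temp carries the running aggregate
        out.append(temp)                   # emit after the last record of the group
    return out

recs = [{"key": "A", "amount": 10}, {"key": "A", "amount": 5},
        {"key": "B", "amount": 7}]
print(aggregate(recs, transform))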
Denormalize Sorted:
Denormalize Sorted consolidates groups of related data records into
a single output record with a vector field for each group, and optionally computes
summary fields in the output record for each group. Denormalize Sorted requires
grouped input.
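A Python sketch of the consolidation: each group of related records becomes one output record with a vector (list) field and an optional summary field (field names are illustrative):

from itertools import groupby
from operator import itemgetter

def denormalize_sorted(records, key, vector_field):
    """Input must be grouped on `key`; build one output record per group."""
    out = []
    for k, group in groupby(records, key=itemgetter(key)):
        values = [r[vector_field] for r in group]
        out.append({key: k,
                    vector_field + "_vec": values,   # vector field for the group
                    "count": len(values)})           # optional summary field
    return out

recs = [{"cust": 1, "order": 101}, {"cust": 1, "order": 102},
        {"cust": 2, "order": 201}]
print(denormalize_sorted(recs, "cust", "order"))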
Filter by Expression:
Filter by Expression filters data records according to a specified DML expression.
Basically, it can be compared with the WHERE clause of an SQL SELECT statement.
Different functions can be used in the select expression of the Filter by Expression
component; even a lookup can be used. Filter by Expression also has a reject-threshold
parameter.
The value of this parameter specifies the component's tolerance for reject events.
Choose one of the following:
Abort on first reject: the component stops the execution of the graph at the
first reject event it generates.
Never abort: the component does not stop the execution of the graph, no matter
how many reject events it generates.
Use ramp/limit: the component uses the settings in the ramp and limit
parameters to determine how many reject events to allow before it stops the execution
of the graph.
The default is Abort on first reject.
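As an illustration, a Python sketch of filtering with a reject tolerance; the predicate stands in for the DML expression, a record that raises an error counts as a reject, and the tolerance follows the limit + ramp × records-processed idea described above:

def filter_by_expression(records, expr, limit=0, ramp=0.0):
    """Keep records where expr(record) is truthy; tolerate a limited number of rejects."""
    out, rejects, processed = [], 0, 0
    for rec in records:
        processed += 1
        try:
            if expr(rec):
                out.append(rec)
        except Exception:
            rejects += 1
            # Stop once rejects exceed the tolerance (limit + ramp * records processed).
            if rejects > limit + ramp * processed:
                raise RuntimeError("too many reject events")
    return out

recs = [{"amount": 120}, {"amount": 40}, {"amount": 300}]
print(filter_by_expression(recs, lambda r: r["amount"] > 100))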
FUSE: Fuse combines multiple input flows into a single output flow by applying a
transform function to corresponding records of each flow
Fuse applies a transform function to corresponding records of each input flow. The
first time the transform function executes, it uses the first record of each flow. The
second time the transform function executes, it uses the second record of each flow,
and so on. Fuse sends the result of the transform function to the out port.
1. Fuse tries to read from each of its input flows. If all of its input flows are finished,
Fuse exits.
Otherwise, Fuse reads one record from each still-unfinished input port and a NULL from
each finished input port.
2. If Fuse reads a record from at least one flow, Fuse uses the records as arguments
to the select function if the select function is present.
If the select function is not present, Fuse uses the records as arguments to the
fuse function.
If the select function is present, fuse discards the records if select returns zero
and uses the records as arguments to the fuse function if select returns non-zero.
3. Fuse sends to the out port the record returned by the fuse function
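A Python sketch of Fuse's record-by-record pairing of flows, with an optional select function; None stands in for the NULL passed for finished flows:

def fuse(flows, fuse_fn, select_fn=None):
    """Apply fuse_fn to corresponding records of each flow until all flows finish."""
    iters = [iter(f) for f in flows]
    out = []
    while True:
        records = [next(it, None) for it in iters]   # None stands in for NULL
        if all(r is None for r in records):          # all input flows finished
            break
        if select_fn is None or select_fn(*records):
            out.append(fuse_fn(*records))            # result goes to the out port
    return out

a = [1, 2, 3]
b = [10, 20]
print(fuse([a, b], fuse_fn=lambda x, y: (x, y)))  # [(1, 10), (2, 20), (3, None)]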
Transform Components (contd.)
JOIN:
Join reads records from multiple in ports, operates on the records with matching
keys using a multi-input transform function, and writes the results to the output ports.
In Join, the key parameter has to be specified from the input flows (in either
ascending or descending order). If the input flows do not have a common
field, an override-key must be specified to map the specified key.
in1: the second input file is connected to this port; the number of in ports increases based
on the number of input files.
out: the output of the Join component goes to this out port.
unused0: this port contains the unused (unmatched) records from input file 0.
unused1: this port contains the unused (unmatched) records from input file 1.
reject port: it contains the records rejected due to some error in the data from the file.
error port: it contains a detailed description of why the data was rejected, i.e. what the
error in the record is.
log port: it writes the process status over time until the process ends.
1. Inner join (default) Sets the record-requiredn parameters for all ports to
True. The GDE does not display the record-requiredn parameters, because they all
have the same value.
2. Outer join Sets the record-requiredn parameters for all ports to False.
The GDE does not display the record-requiredn parameters, because they all have the
same value.
3. Explicit Allows you to set the record-requiredn parameter for each port
individually
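For two inputs, the key matching and the unused ports can be sketched in Python as a hash join on a single key (an inner join: only matched keys reach the out port, and unmatched records from each side go to unused0/unused1); the field names are illustrative:

from collections import defaultdict

def join(in0, in1, key):
    """Inner join on `key`; unmatched records go to the unused ports."""
    index1 = defaultdict(list)
    for r in in1:
        index1[r[key]].append(r)
    out, unused0, matched1 = [], [], set()
    for r in in0:
        if r[key] in index1:
            matched1.add(r[key])
            for s in index1[r[key]]:
                out.append({**r, **s})   # multi-input transform: here, a merge of fields
        else:
            unused0.append(r)
    unused1 = [r for r in in1 if r[key] not in matched1]
    return out, unused0, unused1

a = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
b = [{"id": 2, "y": "B"}, {"id": 3, "y": "C"}]
print(join(a, b, "id"))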
REFORMAT:
2. If you supply an expression for the select parameter, the expression filters the
records on the in port:
a) If the expression evaluates to 0 for a particular record, Reformat does not process
the record, which means that the record does not appear on any output port.
b) If the expression produces NULL for any record, Reformat writes a descriptive error
message and stops execution of the graph.
c) If the expression evaluates to anything other than 0 or NULL for a particular record,
Reformat processes the record.
3. If you do not supply an expression for the select parameter, Reformat processes all
the records on the in port.
4. Passes the records to the transform functions, calling the transform function on
each port, in order, for each record, beginning with out port 0 and progressing through
out port count - 1.
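A Python sketch of steps 2-4: the select test filters records (0 drops the record, NULL stops with an error), and each surviving record is passed to one transform function per out port, in order:

def reformat(records, transforms, select=None):
    """Apply each out-port transform to every record that passes the select test."""
    outputs = [[] for _ in transforms]          # one list per out port
    for rec in records:
        if select is not None:
            flag = select(rec)
            if flag is None:                    # NULL: stop with a descriptive error
                raise ValueError("select expression produced NULL for %r" % (rec,))
            if flag == 0:                       # 0: record goes to no output port
                continue
        for port, fn in enumerate(transforms):  # out port 0 .. count-1, in order
            outputs[port].append(fn(rec))
    return outputs

recs = [{"name": "ann", "age": 30}, {"name": "bob", "age": 17}]
out0, out1 = reformat(recs,
                      transforms=[lambda r: r["name"].upper(), lambda r: r["age"] + 1],
                      select=lambda r: 1 if r["age"] >= 18 else 0)
print(out0, out1)  # ['ANN'] [31]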
ROLLUP:
Rollup generates data records that summarize groups of data records on the basis
of the specified key.
Parts of Rollup
Input select (optional)
Initialize
Temporary variable declaration
Rollup (Computation)
Finalize
Output select (optional)
Initialize: Rollup passes the first record in each group to the initialize transform function.
Rollup (Computation): Rollup calls the rollup transform function for each record in a
group, using that record and the temporary record for the group as arguments. The
rollup transform function returns a new temporary record.
Finalize:
If you leave sorted-input set to its default, Input must be sorted or grouped:
Rollup calls the finalize transform function after it processes all the input records in a
group.
Rollup passes the temporary record for the group and the last input record in the group
to the finalize transform function.
The finalize transform function produces an output record for the group.
Rollup repeats this procedure with each group.
Output select: If you have defined the output_select transform function, it filters the
output records.
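A Python sketch of the initialize / rollup / finalize calling sequence for grouped input; the running-total computation is only an illustrative transform:

from itertools import groupby
from operator import itemgetter

def initialize(first_rec):
    return {"total": 0}                              # temporary record for the group

def rollup_fn(temp, rec):
    return {"total": temp["total"] + rec["amount"]}  # new temporary record

def finalize(temp, last_rec):
    return {"key": last_rec["key"], "total": temp["total"]}  # output record

def rollup(records, key):
    out = []
    for _, group in groupby(records, key=itemgetter(key)):   # input must be grouped
        group = list(group)
        temp = initialize(group[0])
        for rec in group:
            temp = rollup_fn(temp, rec)
        out.append(finalize(temp, group[-1]))
    return out

recs = [{"key": "A", "amount": 10}, {"key": "A", "amount": 5}, {"key": "B", "amount": 7}]
print(rollup(recs, "key"))  # [{'key': 'A', 'total': 15}, {'key': 'B', 'total': 7}]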
SCAN:
1. For every input record, Scan generates an output record that includes a running,
cumulative summary for the group of data records that the input record belongs to. For
example, the output records might include successive year-to-date totals for groups of
data records.
2. The input should be sorted before Scan; otherwise it produces an error.
3. The main difference between Scan and Rollup is that Scan generates intermediate
(cumulative) results, whereas Rollup summarizes.
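The Scan-versus-Rollup difference can be seen in a short Python sketch: Scan emits a running (cumulative) record for every input record, whereas Rollup (sketched earlier) emits one summary record per group:

from itertools import groupby
from operator import itemgetter

def scan(records, key):
    """One output record per input record, carrying the running total for its group."""
    out = []
    for _, group in groupby(records, key=itemgetter(key)):   # input sorted on key
        running = 0
        for rec in group:
            running += rec["amount"]
            out.append({**rec, "running_total": running})
    return out

recs = [{"key": "A", "amount": 10}, {"key": "A", "amount": 5}, {"key": "B", "amount": 7}]
print(scan(recs, "key"))
# Group A yields running totals 10 then 15; group B yields 7 - one output per input record.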
Partition Components
Broadcast:
Broadcast arbitrarily combines all the data records it receives into a single flow
and writes a copy of that flow to each of its output flow partitions.
1. Reads records from all flows on the in port.
2. Combines the records into a single flow.
3. Copies all the records to all the flow partitions connected to the out port.
Partition by Expression:
Partition by Key :
Partition by Key reads records from the in port and distributes them to the flows connected
to the out port, according to the key parameter, writing records with the same key value to
the same output flow.
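A Python sketch of key-based partitioning: hashing the key value and taking it modulo the number of partitions guarantees that records with the same key value land in the same output flow (the hashing scheme here is illustrative, not Ab Initio's):

def partition_by_key(records, key, num_partitions):
    """Records with the same key value always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        p = hash(rec[key]) % num_partitions
        partitions[p].append(rec)
    return partitions

recs = [{"cust": c, "amt": i} for i, c in enumerate("ABCABC")]
for i, part in enumerate(partition_by_key(recs, "cust", 3)):
    print(i, [r["cust"] for r in part])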
Partition by percentage:
Writes a specified percentage of the input records to each flow on the out port
Partition by Range:
Partition by Range partitions (divides) the records based on the ranges specified
in the file. For example, consider a file that contains 50 records and four output files: if
you want 20 records in output 1, 5 records each in outputs 2 and 3, and the remaining
records in output 4, you can specify the ranges in the parameter tab.
This component:
Reads splitter records from the split port, and assumes that these records are sorted
according to the key parameter.
Determines whether the number of flows connected to the out port is equal to n (where
n-1 represents the number of splitter records). If not, Partition by Range writes an error
message and stops the execution of the graph.
Reads data records from the flows connected to the in port in arbitrary order.
Distributes the data records to the flows connected to the out port according to the
values of the key field(s), as follows:
a) Assigns records with key values less than or equal to the first splitter record to the
first output flow.
b) Assigns records with key values greater than the first splitter record, but less than or
equal to the second splitter record to the second output flow, and so on.
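The splitter-based assignment in steps a) and b) can be sketched in Python with the standard bisect module: a record goes to the first partition whose splitter value is greater than or equal to its key, and records beyond the last splitter fall into the final partition:

from bisect import bisect_left

def partition_by_range(records, key, splitters):
    """n-1 sorted splitter values define n output partitions."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        # Index of the first splitter >= the key value.
        p = bisect_left(splitters, rec[key])
        partitions[p].append(rec)
    return partitions

recs = [{"amount": v} for v in [1, 4, 5, 7, 9]]
for i, part in enumerate(partition_by_range(recs, "amount", [3, 6])):
    print(i, [r["amount"] for r in part])
# partition 0: <= 3, partition 1: > 3 and <= 6, partition 2: > 6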
Partition with Load Balance:
1. Reads records in arbitrary order from the flows connected to its in port.
2. Distributes those records among the flows connected to its out port by sending
more records to the flows that consume records faster.
3. Partition with Load Balance writes data records until each flow's output buffer fills
up.
Miscellaneous Components:
GATHER LOG:
Gather Log is used to collect the output from the log ports of components, for analysis of a
graph after execution.
1. Writes a record containing the text from the StartText parameter to the file
specified in the LogFile parameter.
2. Writes any log records from its in port to the file specified in the LogFile
parameter.
3. Writes a record containing the text from the End Text parameter to the file
specified in the LogFile parameter.
LEADING RECORDS:
Leading Records is used to copy the specified number of records from the in port to
the out port, counting from the first record of the file.
REPLICATE:
Replicate arbitrarily combines all the data records it receives into a single flow and
writes a copy of that flow to each of its output flows
1. Arbitrarily combines the data records from all the flows on the in port into a single
flow
2. Copies that flow to all the flows connected to the out port
GENERATE RECORDS:
Generate Records generates random values of the specified length and type for
each field, or you can control various aspects of the generated values. Typically, you
would use the output of Generate Records to test a graph.
2. The values of the records depend on the record format of the out port and the
optional command_line parameter.