Ab Initio Means

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 19

Ab Initio means “ Starts From the Beginning”. Ab-Initio software works with the client-server model.

The client is called “Graphical Development Environment” (you can call it GDE).It
resides on user desktop.The server or back-end is called Co-Operating System”. The Co-Operating
System can reside in a mainframe or unix remote machine.

The Ab-Initio code is called graph ,which has got .mp extension. The graph from GDE is required to be
deployed in corresponding .ksh version. In Co-Operating system the
corresponding .ksh in run to do the required job.

How Ab-Initio Job Is Run What happens when you push the “Run” button?

•Your graph is translated into a script that can be executed in the Shell Development
•This script and any metadata files stored on the GDE client machine are shipped (via
FTP) to the server.
•The script is invoked (via REXEC or TELNET) on the server.
•The script creates and runs a job that may run across many hosts.
•Monitoring information is sent back to the GDE client.
Ab-Intio Environment The advantage of Ab-Initio code is that it can run in both the serial and multi-file
system environment. Serial Environment: The normal UNIX file system. Muti-File System: Multi-File
System (mfs) is meant for parallelism. In an mfs a particular file physically stored across different
partition of the machine or even different
machine but pointed by a logical file, which is stored in the co-operating system. The
logical file is the control file which holds the pointer to the physical locations.
About Ab-Initio Graphs: An Ab-Initio graph comprises number of components to serve different purpose.
Data is read or write by a component according to the dml ( do not
confuse with the database “data manipulating language” The most commonly used
components are described in the following sections.

Co>Operating System

Co>Operating System is a program provided by AbInitio which operates on the top of the operating
system and is a base for all AbInitio processes. It provdes additional features known as air commands
which can be installed on a variety of system environments such as Unix, HP-UX, Linux, IBM AIX,
Windows systems. The AbInitio CoOperating System provides the following features:
- Manage and run AbInitio graphs and control the ETL processes
- Provides AbInitio extensions to the operating system
- ETL processes monitoring and debugging
- Metadata management and interaction with the EME

AbInitio GDE (Graphical Development Enviroment)

GDE is a graphical application for developers which is used for designing and running AbInitio graphs. It
also provides:
- The ETL process in AbInitio is represented by AbInitio graphs. Graphs are formed by components (from
the standard components library or custom), flows (data streams) and parameters.
- A user-friendly frontend for designing Ab Initio ETL graphs
- Ability to run, debug Ab Initio jobs and trace execution logs
- GDE AbInitio graph compilation process results in generation of a UNIX shell script which may be
executed on a machine without the GDE installed

AbInitio EME

Enterprise Meta>Environment (EME) is an AbInitio repository and environment for storing and
managing metadata. It provides capability to store both business and technical metadata. EME
metadata can be accessed from the Ab Initio GDE, web browser or AbInitio CoOperating system
command line (air commands)

Conduct>It

Conduct It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to
create AbInitio Plans which is a special type of graph constructed of another graphs and scripts. AbInitio
provides both graphical and command-line interface to Conduct>IT.

Data Profiler

The Data Profiler is an analytical application that can specify data range, scope, distribution, variance,
and quality. It runs in a graphic environment on top of the Co>Operating system.

Component Library

The Ab Initio Component Library is a reusable software module for sorting, data transformation, and
high-speed database loading and unloading. This is a flexible and extensible tool which adapts at
runtime to the formats of records entered and allows creation and incorporation of new components
obtained from any program that permits integration and reuse of external legacy codes and storage
engines.

What is the relation between EME, GDE and Co-operating system?

ans. EME is said as enterprise metadata env, GDE as graphical development env and Co-operating
system can be said as asbinitio server
relation b/w this CO-OP, EME AND GDE is as fallows
Co operating system is the Abinitio Server. This co-op is installed on particular O.S platform that is called
NATIVE O.S .coming to the EME; it’s I just as repository in informatica, its hold the metadata,
trnsformations, db config files source and targets informations. coming to GDE its is end user
environment where we can develop the graphs(mapping just like in informatica)
designer uses the GDE and designs the graphs and save to the EME or Sand box it is at user side. Where
EME is ast server side.
What is the use of aggregation when we have rollup

as we know rollup component in abinitio is used to summarize group of data record. Then where we will
use aggregation?
Ans: Aggregation and Rollup both can summaries the data but rollup is much more convenient to use. In
order to understand how a particular summarization being rollup is much more explanatory compared
to aggregate. Rollup can do some other functionalities like input and output filtering of records.
Aggregate and rollup perform same action, rollup display intermediate
result in main memory, Aggregate does not support intermediate result
what are kinds of layouts does ab initio supports

Basically there are serial and parallel layouts supported by AbInitio. A graph can have both at the same
time. The parallel one depends on the degree of data parallelism. If the multi-file system is 4-way
parallel then a component in a graph can run 4 ways parallel if the layout is defined such as it's same as
the degree of parallelism.

How can you run a graph infinitely?

To run a graph infinitely, the end script in the graph should call the .ksh file of the graph. Thus if the
name of the graph is abc.mp then in the end script of the graph there should be a call to abc.ksh.
Like this the graph will run infinitely.

How do you add default rules in transformer?

Double click on the transform parameter of parameter tab page of component properties, it will open
transform editor. In the transform editor click on the Edit menu and then select Add Default Rules from
the dropdown. It will show two options - 1) Match Names 2) Wildcard.

Do you know what a local lookup is?

If your lookup file is a multifile and partioned/sorted on a particular key then local lookup function can
be used ahead of lookup function call. This is local to a particular partition depending on the key.

Lookup File consists of data records which can be held in main memory. This makes the transform
function to retrieve the records much faster than retiring from disk. It allows the transform component
to process the data records of multiple files fastly.

What is the difference between look-up file and look-up, with a relevant example?

Generally Lookup file represents one or more serial files (Flat files). The amount of data is small enough
to be held in the memory. This allows transform functions to retrieve records much more quickly than it
could retrieve from Disk.
A lookup is a component of abinitio graph where we can store data and retrieve it by using a key
parameter.
A lookup file is the physical file where the data for the lookup is stored.
How many components in your most complicated graph? It depends the type of components you us.

Usually avoid using much complicated transform function in a graph.


Explain what is lookup?

Lookup is basically a specific dataset which is keyed. This can be used to mapping values as per the data
present in a particular file (serial/multi file). The dataset can be static as well dynamic (in case the lookup
file is being generated in previous phase and used as lookup file in current phase). Sometimes, hash-
joins can be replaced by using reformat and lookup if one of the inputs to the join contains less number
of records with slim record length.
AbInitio has built-in functions to retrieve values using the key for the lookup
what is a ramp limit?
The limit parameter contains an integer that represents a number of reject events

the ramp parameter contains a real number that represents a rate of reject events in the number of
records processed.
No of bad records allowed = limit + no of records*ramp.
Ramp is basically the percentage value (from 0 to 1)
this two together provides the threshold value of bad records.

Have you worked with packages?

Multistage transform components by default use packages. However user can create his own set of
functions in a transfer function and can include this in other transfer functions.

Have you used rollup component? Describe how.

If the user wants to group the records on particular field values then rollup is best way to do that. Rollup
is a multi-stage transform function and it contains the following mandatory functions.
1. Initialize
2. Rollup
3. Finalize
Also need to declare one temporary variable if you want to get counts of a particular group.

For each of the group, first it does call the initialize function once, followed by rollup function calls for
each of the records in the group and finally calls the finalize function once at the end of last rollup call.

How do you add default rules in transformer?

Add Default Rules — Opens the Add Default Rules dialog. Select one of the following: Match Names —
Match names: generates a set of rules that copies input fields to output fields with the same name. Use
Wildcard (.*) Rule — Generates one rule that copies input fields to output fields with the same name.

) If it is not already displayed, display the Transform Editor Grid.


2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.

In case of reformat if the destination field names are same or subset of the source fields then no need to
write anything in the reformat xfr unless you don’t want to use any real transform other than reducing
the set of fields or split the flow into a number of flows to achieve the functionality.

What is the difference between partitioning with key and round robin?

Partition by Key or hash partition -> this is a partitioning technique which is used to partition data when
the keys are diverse. If the key is present in large volume then there can large data skew? But this
method is used more often for parallel data processing.

Round robin partition is another partitioning technique to uniformly distribute the data on each of the
destination data partitions. The skew is zero in this case when no of records is divisible by number of
partitions. A real life example is how a pack of 52 cards is distributed among 4 players in a round-robin
manner.

How do you improve the performance of a graph?

There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimize the number of sort components
4) Minimize sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large dataset don't use broadcast as partitioner
9) Minimize the use of regular expression functions like re_index in the transfer functions
10) Avoid repartitioning of data unnecessarily

Try to run the graph as long as possible in MFS. For these input files should be partitioned and if possible
output file should also be partitioned.
How do you truncate a table?

From Abinitio run sql component using the DDL "trucate table
By using the Truncate table component in Ab Initio

Have you eveer encountered an error called "depth not equal"?

When two components are linked together if their layout doesnot match then this problem can occur
during the compilation of the graph. A solution to this problem would be to use a partitioning
component in between if there was change in layout.

What is the function you would use to transfer a string into a decimal?

In this case no specific function is required if the size of the string and decimal is same. Just use decimal
cast with the size in the transform function and will suffice. For example, if the source field is defined as
string(8) and the destination as decimal(8) then (say the field name is field1).

out.field :: (decimal(8)) in.field

If the destination field size is lesser than the input then use of string_substring function can be used likie
the following.
say destination field is decimal(5).

out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5)) /* string_lrtrim used to trim leading


and trailing spaces */
What are primary keys and foreign keys?
You might also like:

Transform Components

Partition Components

Ab Initio Best Practices

Dataset Components

LinkWithin

Filter by expression

Filter by Expression filters data records according to a specified DML expression.


Basically it can be compared with the where clause of sql select statement.
Different functions can be used in the select expression of the filter by expression component even
lookup can also be used.
In this filter by expression there is reject-threshold parameter
The value of this parameter specifies the component's tolerance for reject events. Choose one of the
following:
• Abort on first reject — Write Multiple Files stops the execution of the graph at the first reject event it
generates.

• Never abort — the component does not stop the execution of the graph, no matter how many reject
events it generates.

• Use ramp/limit — the component uses the settings in the ramp and limit parameters to determine
how many reject events to allow before it stops the execution of the graph.
The default is Abort on first reject.

Join

Join reads the records from multiple ports, operates on the records with matching keys using a multi
input transform function and writes the result into output ports.

In join the key parameter has to be specified from input flow (either of the flow) ascending or
descending order (please refer to picture above).
If all the input flows do not have any common field, override-key must be specified to map the key
specified.

Reformat
Reformat changes the record format of data records by dropping fields, or by using DML expressions to
add fields, combine fields, or transform the data in the records
By default reformat has got one output port but incrementing value of count parameter number. But for
that two different transform functions has to be written for each output port.
If any selection from input ports is required the select parameter can be used instead of using ‘Filter by
expression’ component before reformat

Rollup

Rollup generates data records that summarize groups of data records on the basis of key specified.

Parts of Aggregate
• Input select (optional)
• Initialize
• Temporary variable declaration
• Rollup (Computation)
• Finalize
• Output select (optional)

Input_select : If it is defined , it filters the input records.

Initialize: rollup passes the first record in each group to the initialize transform function.

Temporary variable declaration:The initialize transform function creates a temporary record for the
group, with record type temporary_type.

Rollup (Computation): Rollup calls the rollup transform function for each record in a group, using that
record and the temporary record for the group as arguments. The rollup transform function returns a
new temporary record.

Finalize:
If you leave sorted-input set to its default, Input must be sorted or grouped:

• Rollup calls the finalize transform function after it processes all the input records in a group.
• Rollup passes the temporary record for the group and the last input record in the group to the finalize
transform function.
• The finalize transform function produces an output record for the group.
• Rollup repeats this procedure with each group.

Output select: If you have defined the output_select transform function, it filters the output records.

Aggregates

Aggregate generates data records that summarize groups of data records ( similar to rollup). But it has
lesser control over data.
Scan

Scan generates a series of cumulative summary records for groups of data records.
Consider above case input records scan transform functions generates record in output as (if
input_select and output_select parameters are not specified)

Scan also can be used for multiple functionality as same as rollup

The main difference between Scan and Rollup is Scan generates intermediate (cumulative) result and
Rollup summarizes.
Sort

Sort component sort the data in ascending or descending order according to the key specified.

By default sorting is done in ascending order. To make the flow in descending order the descending
radio button has to be clicked.
In the parameter max-core value is required to be specified. Though there is a default value, it
recommended to use $ variable which is defined in the system [$MAX_CORE, $MAX_CORE_HALF etc].

Sort within groups


Sort within Groups refines the sorting of data records already sorted according to one key specifier: it
sorts the records within the groups formed by the first sort according to a second key specifier.
In parameter part there are two sort keys
1) major key: it is the main key on which records are already sorted.
2) minor key : If the records are already sorted according to major key, according to minor key records
are resorted within major key group.
Partition by key and sort
Previously it was mentioned a partition by key component is generally followed by a sort component. If
the partitioning key and sorting key is the same instead to using those two components partition by key
and sort component should be used
In this component also key and max-core value has be mentioned as per same rule of sort component

Dedup Sorted

Dedup Sorted separates one specified data record in each group of data records from the rest of the
records in the group i.e. removes duplicate records from the flow according to key specified.

The duplicate records from a flow can be removed by three ways by this component by mentioning the
keep parameter.

1) first: This default the value. This implies the first record of the duplicates ( i.e. same key value) will be
kept
2) last: This implies the last record of the duplicates ( i.e. same key value) will be kept
3) unique-only: In this case all the duplicate records will be removed
The above picture suggest where to fix different parameters for dedup sorted component

Checkpoint Sort

Checkpointed Sort sorts and merges data records, inserting a checkpoint between the sorting and
merging phases depending on the key specified.

It is the sub-graph containing two components “partial sort” and “merge runs”. But neither of the
components works individually.
Dataset components are generally used to read from/ write to serial/multi files. The basic dataset
components are “input file”, “output file” ,”intermediate file” and “lookup file” “
There are number of dataset components but those two are mostly used dataset components are like
“read multiple file” write multiple file” are used read from and write to more than one serial files.

Input File

Input File represents data records read as input to a graph from one or multiple serial files or from a
multifile according to the dml specified.
If data for does not match with the dml error message ( metadata error) printed in the screen.
In the URL part of input file it is recommended to use variable ($ variable like $INPUT_FILEs)

We can use multiple files (of same type) as input


Click on partition radio and the click the edit button. In the edit box mention the variable name which
points the files

And the variable has to be defined in a fnx file like


export INPUT_FILES=`ls -1 $AI_TEMP/MML/CCE*`
or in sandbox where the left column should have the variable name (INPUT_FILES ) and right column
should have the definition ($AI_TEMP/MML/CCE*)
This INPUT_FILES points all the files under $AI_TEMP/MML directory which are stated with CCE.
In the read port of the input dml is required to be mentioned to read data from server and according to
that dml specified. This dml can be embedded or path of the same can be mentioned.

Output File

Output File stores data records from a graph to one or multiple serial files or to a multifile according to
the dml specified in the write port. The output file can be created in write or append mode or
permission for the other user can be controlled .
Output File stores data records from a graph to one or multiple serial files or to a multifile according to
the dml specified in the write port. The output file can be created in write or append mode or
permission for the other user can be controlled .

Lookup File

Lookup File represents one or multiple serial files or a multifile of data records small enough to be held
in main memory, letting a transform function retrieve records much more quickly than it could retrieve
them if they were stored on disk.
Lookup File associates key values with corresponding data values to index records and retrieve them
Intermediate File

Intermediate File represents one or multiple serial files or a multifile of intermediate results that a graph
writes during execution, and saves for your review after execution.
The upstream component writes to Intermediate File through Intermediate File’s write port. After the
flow of data records into the write port is complete, the downstream component reads from
Intermediate File’s read port. This guarantees that the writing and reading processes are in two separate
phases.

Dataset components are generally used to read from/ write to serial/multi files. The basic dataset
components are “input file”, “output file” ,”intermediate file” and “lookup file” “
There are number of dataset components but those two are mostly used dataset components are like
“read multiple file” write multiple file” are used read from and write to more than one serial files.

Input File

Input File represents data records read as input to a graph from one or multiple serial files or from a
multifile according to the dml specified.
If data for does not match with the dml error message ( metadata error) printed in the screen.
In the URL part of input file it is recommended to use variable ($ variable like $INPUT_FILEs)

We can use multiple files (of same type) as input


Click on partition radio and the click the edit button. In the edit box mention the variable name which
points the files

And the variable has to be defined in a fnx file like


export INPUT_FILES=`ls -1 $AI_TEMP/MML/CCE*`
or in sandbox where the left column should have the variable name (INPUT_FILES ) and right column
should have the definition ($AI_TEMP/MML/CCE*)
This INPUT_FILES points all the files under $AI_TEMP/MML directory which are stated with CCE.
In the read port of the input dml is required to be mentioned to read data from server and according to
that dml specified. This dml can be embedded or path of the same can be mentioned.

Output File

Output File stores data records from a graph to one or multiple serial files or to a multifile according to
the dml specified in the write port. The output file can be created in write or append mode or
permission for the other user can be controlled .

Output File stores data records from a graph to one or multiple serial files or to a multifile according to
the dml specified in the write port. The output file can be created in write or append mode or
permission for the other user can be controlled .
Lookup File

Lookup File represents one or multiple serial files or a multifile of data records small enough to be held
in main memory, letting a transform function retrieve records much more quickly than it could retrieve
them if they were stored on disk.
Lookup File associates key values with corresponding data values to index records and retrieve them

Intermediate File

Intermediate File represents one or multiple serial files or a multifile of intermediate results that a graph
writes during execution, and saves for your review after execution.
The upstream component writes to Intermediate File through Intermediate File’s write port. After the
flow of data records into the write port is complete, the downstream component reads from
Intermediate File’s read port. This guarantees that the writing and reading processes are in two separate
phases.

These are general guidelines that is ideal to implement in Ab Initio projects involving development,
maintenance, testing activities. These are tips collected from various other sources from the net as well
as from expert Ab Initio developers.
Project access control - Checking In and Checking out practices

* Before “Checking In” any graphs make sure that it has been deployed successfully.
* Also before “Checking In” inform the ETL Admin about the same.
* To obtain the latest version of the graph “Check Out” from EME Data store.
* Before running a graph “Check Out” from EME Data store to your individual sand box. In case the
graph is not present in the EME Data store “Check In” and then run it.
* The Abinitio Sand Box for all authorized users should be created only by the ETL Admin.
* Before creating graphs on the server ensure that the User-ID, Password in the EME Settings and the
Run Settings are the same.
* Before modifying a graph ensure that it is locked to prevent any sharing conflicts. When you lock a
graph you prevent other users modifying it at the same time. It is advisable that individual graphs are
handled by separate users.
* Do not create any table in the target database. In case it is needed, ask the DBA to do so.
* Any database related activities and problems should be reported to the concerned DBA immediately.
* Before you need to modify any table in the target database inform the concerned DBA and get his
approval.
* Do not change any of the environment variables. As these environment variables are global to all
graphs they should not be tampered with. Only the ETL Admin has rights to set or modify the
environment variables.
Good practices for project implementation

* While running a graph one may encounter errors. Hence maintain error logs for every error you come
across. A consolidated, detailed error sheet should be maintained containing error related and
resolution information of all users. This can be used for reference when facing similar errors later on. In
case you have a database error contact the DBA immediately.
* Ensure that you are using the relevant dbc file in all your graphs.
* Always validate a graph before executing it and ensure that it validates successfully. Deploy the graph
after successful validation.
* ab_project_setup.ksh should be executed on regular basis. Contact ETL Admin for further details.
* Before running a graph check whether the test parameters are valid.
* After implementing the desired modifications save and unlock the graph.

Handling run time related errors

* If you are testing a graph created by some one else contact the person who created the graph or the
person who made recent modifications to it. He will assist you or himself perform the needful.
* If the error encountered relates to an Admin settings problem contact the ETL Admin immediately.
* If you face a problem that you have not encountered and resolved before, look in to the consolidated
error sheet and check to see whether that problem has been previously faced and resolved by any other
user. You can also approach various online tech forums to get further input on the error.

Documentation practices

* Maintain documents regarding all the modifications performed on existing graphs or scripts.
* Maintain ETL design documents for all graphs created or modified. The documents should be modified
accordingly if any changes are performed on the existing graphs.
* While testing any graph follow the testing rules as per the testing template. Maintain documents for
all testing activities performed.

What is good about underlying tables

* Ensure that in all the graphs where we are using RDBMS tables as input, the join condition is on
indexed columns. If not then ensure that indexes are created on the columns that are used in the join
condition. This is very important because if indexes are absent then there would be full table scan
thereby resulting in very poor performance. Before execution of any graph use Oracle's Explain Plan
utility to find the execution path of query.
* Ensure that if there are indexes on target table, then they are dropped before running the graph and
recreated after the graph is run.
* If possible try to shift the sorting or aggregating of data to the source tables (provided you are using
RDBMS as a source and not a flat file). SQL order by or group by clause will be much faster than Ab Initio
because invariably the database server would be more powerful than Ab Initio server (even otherwise
SQL order by or group by is done efficiently (compared to any ETL tool) because Oracle runs the
statement in optimal mode.
* Bitmap indexes may not be created on tables that are updated frequently. Bitmap indexes tend to
occupy a lot of disk space. Instead a normal index (B-tree index) may be created.

DML & XFR Usage


* Do not embed the DML if it belongs to a landed file or if it is going to be reused in another graph.
Create DML files and specify as path.
* Do not embed the XFR if it is going to be re-used in another graph. Create XFR files and specify as path.

Efficient usage of components

* Skinny the file, if the source file contains more data elements than what you need for down stream
processing. Add a Reformat as your first component to eliminate any data elements that are not needed
for down stream processing.
* Apply any filter criteria as early in the flow as possible. This will reduce the number of records you will
need to process early in the flow.
* Apply any Rollup’s early in the flow as possible. This will reduce the number of records you will need to
process early in the flow.
* Separate out the functionality between components. If you need to perform a reformat and filter on
some data, use a reformat component and a filter component. Do not perform Reformat and filter in the
same component. If you have a justifiable reason to merge functionality then specify the same in
component description.

You might also like