ABInitio FAQ


Q. How can we count the number of records in a flat file using Ab Initio?

A. Use the aggregate function "count".

Or
Use a Rollup component to count the records in the flat file.
Use {} as the key in the key specifier. With an empty key, all records fall into a single group, so the rollup counts the total number of records.
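
As a rough illustration of why an empty key yields a total count, here is a minimal Python sketch (not Ab Initio code). The function name rollup_count and the sample records are assumptions made for this example only.

```python
from collections import Counter

def rollup_count(records, key_fields):
    """Count records per key group; an empty key puts every record in one group."""
    counts = Counter()
    for rec in records:
        group = tuple(rec[f] for f in key_fields)  # () when key_fields is empty
        counts[group] += 1
    return dict(counts)

# Hypothetical sample data
records = [{"id": 1, "city": "A"}, {"id": 2, "city": "B"}, {"id": 3, "city": "A"}]
total = rollup_count(records, [])           # a single group keyed by ()
per_city = rollup_count(records, ["city"])  # one group per city
```

With the empty key the result holds a single entry whose value is the total record count.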

Q. How do we use SCD Types in the Abinitio graphs?


A.

Q. What is the order of execution of a graph when it runs?

A. Order of Graph Execution


1. Initialisation of Parameters
2. Start script execution
3. Graph execution
4. End script execution

Q. How can we calculate the total number of records in a file using REFORMAT instead of ROLLUP?

A. Via its log port.

Or
Connect a Reformat to the log port, specify event_type "finish" in its select parameter, and use this code:
type reformat_final_msg = record
  decimal("records") read_count;
  string("readn") filler_read;
  decimal("records") written_count;
  string("writtenn") filler_written;
  decimal("records") rejected_count;
  string("rejected") filler_rejected;
end;
out :: reformat(in) =
begin
  out.rec_count :1: string_lrtrim(reinterpret_as(reformat_final_msg, in.event_text).read_count);
end;

Q. How do we append records to an already existing file using an Ab Initio graph?

A. Create a graph that uses the existing file as the output file, and set the mode of the output file to Append. Pass the new records from the input file to this output file through a Reformat. This appends the new records to the existing file.

Q. What is output index? How does it work in Reformat?

Does the function below show output index in use?

output:1:if(in.emp.sal<500)in.emp.sal
output:2:force_error("Employee salary is less than 500")

A. The output index function is used in a Reformat with multiple output ports to direct each record to a particular out port.
For example, for a Reformat with 3 out ports such a function could be:

if (in.value == 'A') 1 else if (in.value == 'B') 2 else 3

This means that any record whose field 'value' evaluates to A in the transform function comes out of port 1 only, and not from port 2 or 3.
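
The routing idea can be sketched in Python as follows. This is an illustration only, not Ab Initio DML; the names output_index and route, and the port numbering 1 to 3, simply mirror the example above and are otherwise assumptions.

```python
def output_index(rec):
    # Hypothetical rule mirroring the example: 'A' -> port 1, 'B' -> port 2, else port 3.
    if rec["value"] == "A":
        return 1
    if rec["value"] == "B":
        return 2
    return 3

def route(records):
    """Send each record to exactly one port, chosen by output_index."""
    ports = {1: [], 2: [], 3: []}
    for rec in records:
        ports[output_index(rec)].append(rec)
    return ports

ports = route([{"value": "A"}, {"value": "B"}, {"value": "C"}, {"value": "A"}])
```

Each record lands on one port only, which is the behaviour the answer describes.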

Q. How does component folding work?


A.
Q. What is the advantage of SORT within GROUP Clause?

A. Sort within Groups refines the sorting of data records already sorted according to one key
and it sorts the records within the groups formed by the first sort according to a second key.

Q. What are environment variables? Why are they required?

A. Environment variables, otherwise known as Ab Initio environment variables, are set in stdenv, under which the private and public projects sit.
Parameters such as $AB_HOME and $AB_AIR_PROT are held as environment variables, and each one points to its corresponding path.

Q. What is the need for config variables in Ab Initio (ab_job, ab_max_core), and where are they defined?
A.

Q. How to avoid duplicates without using the Dedup component?

A. To avoid duplicates, use the Rollup component.

Rollup removes the duplicates and produces the actual results.
or
We can avoid duplicates by using the "key_change" method of the Rollup component.
The code will be like the following:
out :: key_change(prev, curr) =
begin
  out :: curr != prev;
end;
out :: rollup(in) =
begin
  out :: in;
end;
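
To make the key-change idea concrete, here is a small Python sketch (not Ab Initio code). It assumes the input is already sorted or grouped on the key, and it keeps the first record of each run; the function name dedup_on_key_change and the field name "k" are hypothetical.

```python
def dedup_on_key_change(records, key):
    """Emit a record only when its key differs from the previous record's key."""
    out = []
    prev = None
    for curr in records:
        if prev is None or curr[key] != prev[key]:  # the key_change test
            out.append(curr)
        prev = curr
    return out

rows = [{"k": 1, "v": "a"}, {"k": 1, "v": "b"}, {"k": 2, "v": "c"}]
deduped = dedup_on_key_change(rows, "k")
```

Because only adjacent records are compared, unsorted input would let duplicates through, which is why the sort or grouping step matters.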

Q. What will happen when we pass a dot or invalid parameters in the inout component
layout URL?
A.

Q. What is the use of AB_JOB in Ab Initio?

A. The AB_JOB parameter is set when we want to run the same graph at the same time under different job names.
or
When you want to run several instances of the same graph, deployed in one place, use AB_JOB. It should be defined as a sandbox parameter. If you do not give it a value, it will take AB_JOB as default.

Q. How to use a normal batch graph as a sub graph in Continuous graph?


A.

Q. How to open Ab Initio in UNIX?

A. We cannot open Ab Initio in UNIX. We can only run graphs in UNIX, using the deployed .ksh script.

Q. How do you do production support for a graph?

A. If a graph fails in production, we usually get emergency access to look at the failure and analyse it. If it is a code bug, we go back to the development environment, fix the bug, test it, then deploy back to production and run.

Q. How do you check whether a graph completed successfully or not (is it $? of UNIX)?

A. Check $mpjret: 0 means success, 1 means failure.

Q. What are the different return values?

A. 0 and 1
0 is success
1 is failure
$? holds the return status of the last executed command.

Q. Why and when do we get the "Pipeline Broken Error" in Ab Initio?

A. A pipeline broken error indicates the failure of a downstream component.
It normally occurs when the database runs out of memory, which makes the database components in the graph unavailable.

Q. What are the two types of .dbc files?

A. Generally, .dbc files are classified into 2 types according to the value of their fixed_size_dml parameter, which can be either true or false.
If the value is false, the database generates delimited types whenever possible (it recognizes null as a zero-length string). If it is true, fixed-length DML is generated.


Other parameters for .dbc files are:


dbms
db_version
db_home
db_name
db_nodes
User & password
case
generate_dml_with_nulls
fixed_size_dml
treat_blanks_as_null
oldstyle_emptystring_as_null
fully_qualify_dml
delimited_dml_with_maximum_size
interface
environment
direct_parallel

Q. What is the usage of .mfctl and .mdir files in the mfs directory of Ab Initio?

A. .mfctl and .mdir both relate to the multifile system. .mfctl is the extension of the control file created when we use MFS; this file contains the URLs of all the data partitions. The file with the extension .mdir contains the URL of the control file used by MFS.

Q. How to separate duplicate records from the grouped input file without Dedup Sorted?

A. Rollup with in.*
or
With the help of Rollup component functions such as last and first.
Or
You can use Rollup to remove duplicate records in an input file (note that this is key-based duplication). It will keep the last record for each key.
Or
Rollup helps to avoid duplicates without using the Dedup component.
It takes the first record and rejects the rest.

Q. How to rerun a graph in UNIX?

A. We can rerun a graph by using the AB_JOB variable.

Or
You can run the graph by giving the following command in UNIX:
dtm run <recovery file name> -continue
or
Whenever a graph fails, it creates a .rec file in the working directory, which may be the directory where the graph's deployed script is stored. Remove that .rec file and then run the deployed script of the graph from UNIX. You may use m_rollback -d.

Q. what do you mean by rerun?

A. Your graph failed and you want to run it again or you want to run multiple instances of this
graph.

Q. How do you pass parameters to a graph in Ab Initio?

A. Using input parameters / graph parameters.

Or
If you want to pass a parameter to your graph, declare a formal parameter in the edit-parameters region.
Or
You can declare parameters with the Edit Parameters option in the GDE, and pass the values on the command line when running the .ksh.

Q. Which component does not work in pipeline parallelism?

A. The Sort component does not work in pipeline parallelism; it blocks the pipeline.
Or
Sort does not support pipeline parallelism because it must read all the data before writing any records.
or
Sort, Sort within Groups, and Rollup will break pipeline parallelism.
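
The blocking behaviour can be seen in a small Python generator sketch. This is only an analogy for record-at-a-time flows, not Ab Initio code; the stage names source, reformat and sort_all are hypothetical.

```python
def source(n, log):
    # Emits n, n-1, ..., 1 and logs each read.
    for i in range(n, 0, -1):
        log.append(f"read {i}")
        yield i

def reformat(stream, log):
    # A streaming stage: handles one record at a time.
    for x in stream:
        log.append(f"reformat {x}")
        yield x * 10

def sort_all(stream, log):
    # A blocking stage: sorted() consumes the whole input before yielding anything.
    data = sorted(stream)
    log.append("sort done")
    yield from data

def log_until_first_output(use_sort):
    """Return what happened before the first record left the pipeline."""
    log = []
    stream = source(3, log)
    if use_sort:
        stream = sort_all(stream, log)
    stream = reformat(stream, log)
    next(stream)  # pull exactly one output record
    return log
```

Without the sort, the first output appears after a single read. With the sort in the flow, all three reads happen before the downstream stage sees its first record.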

Q. How does one make use of the "Call Web Service" component in the
$AB_HOME/connectors/Internet directory of the component selectory window of the
Ab Initio Console? Explain with Sample Code?

A.

Q. What is patch database (IPD etc)?

A.

Q. How do you check root disk failed?

A.


Q. How do you restore whole OS backup and a selected single file?

A.
Q. How to create SCDs (slowly changing dimensions) in Ab Initio?

A. If you want to implement SCDs in Ab Initio, you should do delta processing.

Q. How do you join two files with different layouts?

A. If the two files have totally different layouts, you can use the Fuse component. Read about it in the Ab Initio Help.
Or
To join a serial file and a multifile, if that is the case, use a Broadcast component after the serial file and before the Join.

Q. What is a vector field? Explain.

A. A vector field is a field that holds a sequence of elements of the same type. It is used in the normalize and denormalize components.

Normalize generates multiple output data records from each of its input records.
We specify the field name and a length; this length determines the size of the vector field.
The number of output records generated depends on the vector field length.
We specify one element type, and the component uses an index to step through the elements. The output records are generated according to this vector field.
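
As a rough picture of how a vector field drives the number of output records, here is a Python sketch (not Ab Initio code). The function name normalize and the field names "amounts", "index" and "element" are assumptions chosen for this example.

```python
def normalize(records, vector_field):
    """Emit one output record per element of each input record's vector field."""
    out = []
    for rec in records:
        for index, element in enumerate(rec[vector_field]):
            # Copy the scalar fields, then add the element picked by the index.
            row = {k: v for k, v in rec.items() if k != vector_field}
            row["index"] = index
            row["element"] = element
            out.append(row)
    return out

rows = normalize([{"id": 1, "amounts": [10, 20, 30]}], "amounts")
```

An input record whose vector has three elements produces three output records, one per index.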

Q. Which file should we keep as a lookup file, the large file or the file with fewer records, and why?

A. We should always use the small file (i.e. the file with fewer records) as the lookup. The reason is that this file is kept in main memory (RAM) from the start to the end of the script/graph run. Hence the smaller the file, the better the performance from the server.
or
A lookup file should always be small. If the data grows every day, performance will become poor, and it is not wise to use a bigger file as a lookup. It defeats the lookup concept.

Q. How does metadata management take place in Ab Initio?

A. It is possible with the help of the EME. It follows the UNIX file structure.

Q. Is there a way of implementing a File Listener in Ab Initio? It should continuously scan a given directory; as soon as a file is placed in that directory, it should copy that file to a working directory and trigger a corresponding Ab Initio graph.

A. You can use the Continuous components to build this. It requires an environment setup, though. You can read through the Ab Initio help by searching on 'Continuous graphs'.


Q. How many sandboxes can there be for a project?

A. A project can have many sandboxes.

We can see many developers working in different sandboxes which are attached to a single project.
Or
We can have any number of sandboxes. A sandbox is nothing but a user's work area, where each user gets a copy of the project and makes modifications accordingly.
or
There can be numerous sandboxes for a project, but there should be only one sandbox associated with the EME for a project.

Q. How will you connect two servers?

A. Connecting two different servers in Ab Initio is done through a file called abinitio.rc. This file is used for remote connectivity. It contains information such as the server IP (or name), the user name, and the password required to connect.

Q. How can you extract and load without transforming?

A. Provided the DML is the same, you can directly connect the input and output datasets and perform an extract and load operation. For example, if the input dataset is a table and the output is a file, you can connect them directly, making sure the DML of the file is propagated from the table.

Q. If I want to run a graph in UNIX, what command do I need to use?

A. 1. First design the graph.
2. Save it.
3. Run it.
4. Go to the Run tab, then go to Deploy and press Deploy.
Ab Initio now automatically generates the .ksh of the graph in the run folder of your sandbox.
5. Go to the run folder of the sandbox; there you will find your graph.ksh.

Q. How can I implement insert, update, and delete in Ab Initio?

A. To find the records which should be inserted, updated, or deleted, use the following Ab Initio flow:
a. Unload the master table.
b. Read the delta file.
c. Use an inner join to join a and b. Unused a records will be your delete records (if required), unused b records will be your insert records, and joined a and b records will be your update records.
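
The classification in step c can be sketched in Python as follows. This is an illustration of the logic only, assuming each key appears at most once per input; the function name classify and the key field "id" are hypothetical.

```python
def classify(master, delta, key):
    """Split delta against master into insert, update, and delete candidates."""
    master_by_key = {r[key]: r for r in master}
    delta_by_key = {r[key]: r for r in delta}
    # Unused delta records: present in the delta but not in the master.
    inserts = [r for k, r in delta_by_key.items() if k not in master_by_key]
    # Joined records: present in both inputs.
    updates = [r for k, r in delta_by_key.items() if k in master_by_key]
    # Unused master records: present in the master but not in the delta.
    deletes = [r for k, r in master_by_key.items() if k not in delta_by_key]
    return inserts, updates, deletes

master = [{"id": 1, "v": "old"}, {"id": 2, "v": "old"}]
delta = [{"id": 2, "v": "new"}, {"id": 3, "v": "new"}]
inserts, updates, deletes = classify(master, delta, "id")
```

The three result lists correspond to the unused b, joined, and unused a outputs of the join.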

Q. How will you view an MFS in UNIX?

A. To view an MFS in UNIX, run the m_expand command.


Q. What is the difference between conditional DML and a conditional component?

A. Conditional DML can be passed as a program variable.

A conditional component is used only when the condition passed to the graph is true.

Q. What is the difference between In-Memory Sort and Inputs must be sorted?

A. The In-memory sort and Input must be sorted options are available in the Join, Rollup and Dedup components.
The main difference between the two: if you select the Input must be sorted option in these components, the downstream components get the records in sorted order. If you select the In-memory sort option, the downstream components do not get sorted records.

Q. A graph has failed. How is this handled?

A. There are several reasons why a graph may fail.
One specific answer is:

If the graph fails, Ab Initio creates a .rec file in the run directory of your sandbox. If you want to roll back the graph, use the m_rollback command in UNIX, or you can use the m_cleanup utilities.


Q. What is meant by header and trailer? Suppose the header and trailer contain some junk data: how will you delete the junk data, and which components are used?
A. 1. If you know the signature of the header and trailer records, use a Filter by Expression component to filter them out.

2. Use a Reformat component and, inside the transformation, use the next_in_sequence() function to assign a unique number to each record. Then use a Filter by Expression component to filter the records based on the sequence numbers.

3. Follow step 2, but instead of the Filter by Expression component use the Leading Records component to filter the header and trailer records.
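
Option 2 above can be pictured with a short Python sketch (not Ab Initio code). It assumes exactly one header line and one trailer line; the function name strip_header_trailer is hypothetical, and enumerate stands in for the sequence numbering.

```python
def strip_header_trailer(lines):
    """Number the records from 1, then keep everything except the first and last."""
    numbered = list(enumerate(lines, start=1))  # plays the role of a sequence number
    last = len(numbered)
    return [line for seq, line in numbered if 1 < seq < last]

body = strip_header_trailer(["HDR", "a", "b", "TRL"])
```

Filtering on the sequence number needs the total record count to find the trailer, which a signature-based filter (option 1) does not.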

Q. I have 10,000 records and loaded 4,000 of them today. I need to load records 4001 to 10,000 the next day. How is this done in Type 1, and how in Type 2?
A. Simply take a Reformat component and put next_in_sequence() > 4000 in the select parameter.

Q. What are the steps in actual Ab Initio graph processing, including general, pre- and post-process settings?
A. 1. Start script
2. Graph components
3. End script

Q. What are .air-project-parameters and .air-sandbox-overrides? What is the relation between them?
A. .air-project-parameters
Contains the parameter definitions of all the parameters within a sandbox. This file is maintained by the GDE and the Ab Initio environment scripts.

.air-sandbox-overrides
This file exists only if you are using version 1.11 or a later version of the GDE. It contains the user's private values for any parameters in .air-project-parameters that have the Private Value flag set. It has the same format as the .air-project-parameters file.

When you edit a value (in the GDE) for a parameter that has the Private Value flag checked, the value is stored in the .air-sandbox-overrides file rather than the .air-project-parameters file.

Q. In the Join component, which records go to the unused port and which go to the reject port?
A. In an inner join, all records not matching the specified key go to the respective unused ports; in a full outer join, no records go to the unused ports. Records which do not match the DML go to the reject port.

OR
All records for which the join transformation evaluates to NULL go to the reject port. The component aborts once the number of rejected records exceeds limit + ramp * number_of_input_records_so_far.
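
The limit and ramp expression above is usually read as a tolerance on the number of rejects. A minimal Python sketch of that reading follows; the function name should_abort and its argument names are assumptions for illustration.

```python
def should_abort(rejects, records_so_far, limit, ramp):
    """True once the reject count exceeds limit + ramp * records processed so far."""
    tolerance = limit + ramp * records_so_far
    return rejects > tolerance

# limit=0, ramp=0 behaves like "abort on first reject"
abort_on_first = should_abort(rejects=1, records_so_far=10, limit=0, ramp=0.0)
# limit=10, ramp=0.5 tolerates up to 60 rejects after 100 records
within_tolerance = should_abort(rejects=60, records_so_far=100, limit=10, ramp=0.5)
over_tolerance = should_abort(rejects=61, records_so_far=100, limit=10, ramp=0.5)
```

A fixed limit gives an absolute cap, while the ramp lets the tolerated number of rejects grow with the volume processed.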

Q. What is meant by repartitioning, and in how many ways can it be done?

A. Repartitioning means changing one or both of the following:
1) The degree of parallelism of partitioned data
2) The grouping of records within the partitions of partitioned data

Q. How to create a surrogate key using Ab Initio?

A. There are many ways to create a surrogate key, and the choice depends on your business logic. You can try these ways:

1. Use the next_in_sequence() function in your transform.

2. Use the Assign Key Values component (if your GDE is higher than 1.10).

3. Write a stored procedure for this and call it wherever you need.

Q. What is semi-join?

A. In Ab Initio, there are 3 types of join:

1. inner join
2. outer join
3. semi join

For an inner join, the 'record_requiredn' parameter is true for all in ports.

For an outer join, it is false for all the in ports.

For a semi join, set 'record_requiredn' to true for the required in port and false for the other in ports.
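
How the required flags select the join type can be sketched in Python. This is an illustration only, assuming two inputs with unique keys; the function name join and the arguments required0 and required1 are hypothetical stand-ins for the per-port setting.

```python
def join(in0, in1, key, required0, required1):
    """Join two inputs on key; a 'required' flag drops keys missing from that input."""
    by_key0 = {r[key]: r for r in in0}
    by_key1 = {r[key]: r for r in in1}
    out = []
    for k in sorted(set(by_key0) | set(by_key1)):
        r0, r1 = by_key0.get(k), by_key1.get(k)
        if required0 and r0 is None:
            continue
        if required1 and r1 is None:
            continue
        out.append(k)
    return out

left = [{"id": 1}, {"id": 2}]
right = [{"id": 2}, {"id": 3}]
inner = join(left, right, "id", True, True)
outer = join(left, right, "id", False, False)
semi = join(left, right, "id", True, False)
```

Setting the flag on one input only keeps every key of that input, matched or not, which is the semi join case.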

Q. How will you ensure that components created in one version do not malfunction or cease functioning in another version?

A. The runtime behaviour of components remains the same in all versions unless a version requires an additional parameter to be defined. New versions tend to come with some changes in component-level parameters (observation as of now).
or
Components should be compatible with previous versions of the GDE. Deprecated components will still run in new versions.

Q. What data modelling do you follow while loading of data to tables? Also the DB you are
inserting the data has Star schema or Snow flake schema?


A.

Q. How does the force_error function work? If we set "never abort" in a Reformat, will force_error stop the graph, or will it continue to process the next set of records?

A. You can set either of two conditions for the Reformat component:
1. If you want it to fail, set the reject threshold to abort on first reject.
2. If you do not want it to fail, set it to never abort.

force_error is used to abort a graph if the conditions are not met. You can write the error records to a file and then abort the graph; this can be done in different ways.

Or

With never abort set, the force_error() function will not stop the graph. It writes the error message to the error port for that record and processes the next record.

Q. Phase versus checkpoint?

A. A phase breaks the graph into different blocks. It creates some temp files while running and deletes them once the phase completes.

A checkpoint is used for recovery. When the graph is interrupted, instead of rerunning the graph from the start, execution resumes from the point where it stopped.

Q. What is the function of XFR in Ab Initio? Can someone explain in brief what xfr does, where it is stored, and how it affects a graph?

A. When you create a new sandbox in the Ab Initio environment, the following directories are created:
1. mp
2. dml
3. xfr
4. db
etc.

xfr is the directory where we can write our own functions and use them during transformation (rollup, reformat, etc.).

For example, to convert a string into a decimal or to get the maximum length of a string, you can write the function in a file called user_define_function.xfr in the xfr directory. Inside this file you can define a function called string_to_integer, or get_string_max_length, or both. In any transform component you can include the file like this:
include "<full path>/user_define_function.xfr"

You can then call the function like any other function in Ab Initio.


Q. What is the difference between the flows of the 3 parallelisms?

A. Parallelism is of 3 types:

1. Component parallelism: program components run simultaneously on different data sets.

2. Pipeline parallelism: program components run simultaneously on the same data set. Sort-based components break pipeline parallelism.

Ex: Sort, Sort within Groups, AGG, Rollup, Join, etc.

3. Data parallelism: data records are distributed into multiple locations using partition components.

Q. How can I calculate the total memory requirement of a graph?

A. You can roughly calculate memory requirement as:

1. Each partition of a component uses ~ 7 MB + max-core (if any)

2. Add size of lookup files used in phase (if multiple components use same lookup only count it
once)

3. Multiply by degree of parallelism. Add up all components in a phase; that is how much
memory is used in that phase.

4. (Total memory requirement of a graph) > (the largest-memory phase in the graph).
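
The estimate above can be turned into a rough calculator. This Python sketch rests on two assumptions that the text leaves open: lookup file sizes are added once per phase rather than multiplied by the degree of parallelism, and the largest phase is reported as a lower bound for the graph. The function names and the sample figures are hypothetical.

```python
BASE_MB = 7  # approximate per-partition overhead quoted above

def phase_memory_mb(components, lookup_mb=0):
    """components: list of (max_core_mb, degree_of_parallelism) pairs for one phase."""
    per_component = ((BASE_MB + max_core) * dop for max_core, dop in components)
    # Assumption: each lookup file is counted once per phase.
    return sum(per_component) + lookup_mb

def graph_memory_lower_bound_mb(phases):
    """phases: list of (components, lookup_mb); the largest phase bounds the graph."""
    return max(phase_memory_mb(components, lookup_mb) for components, lookup_mb in phases)

# Hypothetical graph: phase 0 has a 4-way sort with 100 MB max-core, a 4-way
# reformat, and a 50 MB lookup; phase 1 has one serial component.
phase0 = ([(100, 4), (0, 4)], 50)
phase1 = ([(0, 1)], 0)
estimate = graph_memory_lower_bound_mb([phase0, phase1])
```

Here phase 0 comes to 107 * 4 + 7 * 4 + 50 = 506 MB, so the graph needs at least that much.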

Q. How can I achieve a cumulative summary in Ab Initio other than by using the SCAN component? Is there any inbuilt function available for that?

A. Scan is really the simplest way to achieve this. Another way is to use a ROLLUP, since it is a multistage component. You need to put the ROLLUP component into multistage format and write the intermediate results to a temp array (I think they're called vectors in Ab Initio). The ROLLUP loops through each record in your defined group.

Let's say you want to get intermediate results by date. You sort your data by {ID; DATE} first. Then ROLLUP by {ID}. The ROLLUP will execute its transformation for each record per ID. So store your results in a temp vector, which will need to be initialized to the size of your largest group. Each time the ROLLUP enters the transformation, write to the [i] position in the array and increment i. As long as this is all done in the "rollup" transformation and not the "finalize" transformation, it will run the "initialize" portion before it moves to the next ID.


I have done it this way, but the Scan is easier. I was doing a simpler rollup before I found that I needed cumulative intermediate results, so I just modified my existing ROLLUP. Ab Initio documentation does not explain this technique in detail, but it can be done.

or

There are three ways:

1) You can use Scan with Rollup component
2) Use the Rollup component
3) You can also use Scan followed by Dedup Sorted and select the last record. That will serve the purpose.
or

Other than Scan, we can use Rollup to do the cumulative summary.

Or

Use the built-in component in Ab Initio, "SCANWITHROLLUP".
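
What a cumulative summary per group produces can be shown with a short Python sketch (not Ab Initio code). It follows the {ID; DATE} sort described earlier; the function name running_totals and the tuple layout (id, date, amount) are assumptions for this example.

```python
from itertools import accumulate, groupby
from operator import itemgetter

def running_totals(records):
    """records: (id, date, amount) tuples. Returns (id, date, cumulative_amount)."""
    ordered = sorted(records, key=itemgetter(0, 1))  # sort by {ID; DATE}
    out = []
    for _, group in groupby(ordered, key=itemgetter(0)):  # one group per ID
        rows = list(group)
        totals = accumulate(r[2] for r in rows)  # the total restarts for each ID
        out.extend((r[0], r[1], t) for r, t in zip(rows, totals))
    return out

sample = [("a", "2024-01-02", 5), ("a", "2024-01-01", 10), ("b", "2024-01-01", 7)]
totals = running_totals(sample)
```

Every input record yields one output record carrying the total so far, which is what distinguishes a scan from a plain rollup that emits one record per group.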

Q. I have a file containing 5 unique rows, and I am passing them through a SORT component using a null key and then passing the output of SORT to Dedup Sorted. What will happen, and what will the output be?

A. If no key is used in the Sort component, then with Dedup Sorted the output depends on the keep parameter.
If it is set to first, the output will have only the first record.
If it is set to last, the output will have only the last record.
If it is set to unique_only, there will be no records in the output file.
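
The three outcomes follow from the fact that a null key places all 5 rows in one group. A Python sketch of that reasoning, with a hypothetical function dedup_keep and made-up rows:

```python
from itertools import groupby

def dedup_keep(records, key_fields, keep):
    """Deduplicate key-grouped records; keep is 'first', 'last' or 'unique_only'."""
    out = []
    for _, group in groupby(records, key=lambda r: tuple(r[f] for f in key_fields)):
        rows = list(group)
        if keep == "first":
            out.append(rows[0])
        elif keep == "last":
            out.append(rows[-1])
        elif keep == "unique_only" and len(rows) == 1:
            out.append(rows[0])  # only groups of exactly one record survive
    return out

rows = [{"n": i} for i in range(1, 6)]  # 5 unique rows
first = dedup_keep(rows, [], "first")
last = dedup_keep(rows, [], "last")
unique_only = dedup_keep(rows, [], "unique_only")
```

With an empty key every row has the same key value, so the single group of five is never a group of one and unique_only emits nothing.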

Q. Can we process 1 GB of data (1 million records) by using Lookup? How?

A. I think it is not advisable to use a 1 GB lookup file. It will definitely affect the parallel processing of other applications and degrade performance.

I would prefer to use an MFS lookup file and not a serial lookup file in this case.

Q. If I have 2 files containing field file1(A,B,C) and file2(A,B,D), if we partition both the files on
key A using partition by key and pass the output to join component, if the join key is (A,B) will
it join or not and WHY?

A.


Q. In my sandbox i am having 10 graphs, i checked-in those graphs into EME. Again i checked-
out the graph and i do the modifications, i found out the modifications was wrong. what i have
to do if i want to get the original graph..?

A.

Q. How do I create subgraphs in Ab Initio?

Q. What is a sandbox?
A. A sandbox is a directory structure in which each directory level is assigned a variable name. It is used to manage check-in and checkout of repository-based objects such as graphs.

fin -------> top level directory ( $AI_PROJECT )
|
|---- dml -------> second level directory ( $AI_DML )
|
|---- xfr -------> second level directory ( $AI_XFR )
|
|---- run -------> second level directory ( $AI_RUN )
|

You'll require a sandbox when you use EME (repository s/w) to maintain release control.

Within EME for the same project an identical structure will exist.

The above-mentioned structure exists under the OS (e.g. UNIX). For instance, for the project called fin, fin is usually the name of the top-level directory.

In EME, a similar structure will exist for the project: fin.

When you checkout or check-in a whole project or an object belonging to a project, the
information is exchanged between these two structures.

For instance, if you checkout a dml called fin.dml for the project called fin, you need a sandbox
with the same structure as the EME project called fin. Once you've created that, as shown
above, fin.dml or a copy of it will come out from EME and be placed in the dml directory of your
sandbox.

Q. I have a job that will do the following: ftps files from remote server; reformat data in those
files and updates the database; deletes the temporary files. How do we trap errors generated by
Ab Initio when an ftp fails? If I have to re-run / re-start a graph again, what are the points to be
considered? does *.rec file have anything to do with it?

A. Ab Initio has very good restartability and recovery features built into it. In your situation you can do the tasks you mentioned in one graph with phase breaks.

FTP in the first phase, your transformation in the next phase, and then the DB update in another phase. (This is just an example; it may not be the best way of doing it, as the best design depends on various other factors.)

If the graph fails during FTP, your graph fails in Phase 0 and you can simply restart the graph. If your graph fails in Phase 1, the AB_JOB.rec file exists, and when you restart your graph you will see a message saying that a recovery file exists and asking whether you want to start your graph from the last successful checkpoint or restart from the beginning. The same applies if it fails in Phase 2.

Phases are expensive from a disk I/O perspective, so be careful not to do too much phasing.

Coming back to error trapping: each component has reject, error, and log ports. Reject captures rejected records, error captures the corresponding error, and log captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to "Never Abort", "Abort on first reject", or "ramp/limit".

Recovery files keep track of crucial information for recovering the graph from a failed status, such as which node each component is executing on. It is a bad idea to just remove the *.rec files. You always want to roll back the recovery files cleanly so that temporary files created during graph execution do not hang around, occupy disk space, and create issues.

Always use m_rollback -d.

Q. What is Ad hoc multifile? How is it used?


A. Here is a description of Ad hoc multifile:

Ad hoc multifiles treat several serial files having the same record format as a single graph
component.

Frequently, the input of a graph consists of a set of serial files, all of which have to be processed
as a unit. An Ad hoc multifile is a multifile created 'on the fly' out of a set of serial files, without
needing to define a multifile system to contain it. This enables you to represent the needed set of
serial files with a single input file component in the graph. Moreover, the set of files used by the
component can be determined at runtime. This lets the user customize which set of files the
graph uses as input without having to change the graph itself, even after it goes into production.

Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.

The simplest way to define an Ad hoc multifile is to list the files explicitly as follows:

1. Insert an input file component in your graph.
2. Open the properties dialog. Select the Description tab.
3. Select Partitions in the Data Location of the Description tab.
4. Click Edit to open the Define multifile Partitions dialog box.
5. Click New and enter the first file name. Click New again and enter the second file name, and so on.
6. Click OK.


If you have added 'n' files, then the input file now acts something like a file in a n-way multifile
system, whose data partitions are the n files you listed. It is possible for components to run in
the layout of the input file component. However, there is no way to run commands such as m_ls
or m_dump on the files, because they do not comprise a real multifile system.

There are other ways than listing the input files explicitly in an Ad hoc multifile.

1. Listing files using wildcards - If the input file names have a common pattern then you can use
a wild card for all the files. E.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files that are found at
the runtime matching the wild card pattern will be taken for the Ad hoc multifile.

2. Listing files in a variable. You can create a runtime parameter for the graph and inside the
parameter you can list all the files separated by spaces.

3. Listing files using a command - E.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the list of files to be used for the ad hoc multifile. This method gives maximum flexibility in choosing the input files, since you can also use complex commands that involve the owner of the file or the date-time stamp.
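
The wildcard approach amounts to resolving a file pattern at run time. A small Python sketch of that idea follows; it is not Ab Initio code, the function names list_partitions and demo are hypothetical, and the temporary directory stands in for $AI_SERIAL.

```python
import glob
import os
import tempfile

def list_partitions(pattern):
    """Resolve a wildcard pattern to the files that exist right now, in a stable order."""
    return sorted(glob.glob(pattern))

def demo():
    # Create a throwaway directory with two matching files and one that does not match.
    with tempfile.TemporaryDirectory() as serial_dir:
        for name in ("ad_hoc_input_1.dat", "ad_hoc_input_2.dat", "other.dat"):
            open(os.path.join(serial_dir, name), "w").close()
        found = list_partitions(os.path.join(serial_dir, "ad_hoc_input_*.dat"))
        return [os.path.basename(path) for path in found]
```

Because the pattern is expanded when the code runs, adding a new matching file changes the set of inputs without any change to the code, which mirrors the runtime flexibility described above.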

Q. What is the difference between Replicate and Broadcast?

A. Broadcast and Replicate are similar components, but generally Replicate is used to increase component parallelism, emitting multiple straight flows to separate pipelines. Broadcast is used to increase data parallelism by feeding records to fan-out or all-to-all flows.
Or
Replicate is an older component compared to Broadcast. You can use Broadcast as a join component, whereas you cannot use Replicate as a join. By default, Replicate uses a straight flow and Broadcast uses a fan-out or all-to-all flow.
Broadcast is used for data parallelism, whereas Replicate is used for component parallelism.
Or
Replicate

Supports component parallelism

Input File -------> Replicate --------> Reformat ----> Output File
                        |
                        |
                        --------> Rollup -------> Output File

Broadcast

Supports data parallelism

Input File1 (MF) -----------------> JOIN -----------> Output File
                                      ^
                                      |
                                      |
Input File 2 (Serial) ---> Broadcast -->

Input File2 is a serial file and it is being joined with a multifile, Input File1, without being partitioned. The Broadcast component writes data to all partitions of Input File1, creating an implicit fan-out flow.
Or
The short answer is that Replicate copies a flow while Broadcast multiplies it. Broadcast is a partitioner, whereas Replicate is a simple flow-copy mechanism.

Replicate appears in over 90% of all AI graphs (across the board of all implementations
worldwide) where Broadcast appears in less than 1% of all graphs.

You won't see any difference in the two until you start using data-parallel, then it will go south
rather quickly. Here's an experiment:

Use a simple serial input file, followed by a broadcast, then a 4-way multifile output file
component.

If you run the graph with say, 100 records from the input file, it will create 400 records in the
output file - 100 records for each flow partition encountered.

If you had used a Replicate, it would have read and written 100 records.
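
The record counts in this experiment can be reproduced with a tiny Python model. It is only an arithmetic illustration, not Ab Initio code; the function names broadcast and total_written are hypothetical, and the straight copy is modelled as a single flow carrying the input unchanged.

```python
def broadcast(records, n_partitions):
    """Every partition of the downstream layout receives every record."""
    return [list(records) for _ in range(n_partitions)]

def total_written(partitions):
    return sum(len(partition) for partition in partitions)

records = list(range(100))
broadcast_total = total_written(broadcast(records, 4))  # 4 partitions x 100 records
copied_total = total_written([list(records)])           # one straight flow
```

The broadcast total grows with the number of partitions, while the straight copy stays at the input size.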

Hi Just went through 8 ab initio interviews and some of the tough


questions were as follows.

1. What function would you use to convert a string into a decimal?

2. How many kinds of parallelism are there in Ab Initio, and what is the definition of each of
the three?

3. What is the difference between a db config file and a cfg file?

4. Have you ever encountered an error called "depth not equal"? (This apparently occurs when you
extensively create graphs; kind of a trick question.)

5. How do you truncate a table? (Each candidate would name only one of the several ways to do
this.)

6. How do you improve the performance of a graph?

7. What's the difference between partitioning by key and round robin?

8. Have you worked with packages?

9. How do you add default rules in a transformer?

10. What is a ramp limit?

11. Have you used the Rollup component? Describe it.

12. How many components were in your most complicated graph?

13. Do you know what a local lookup is?

Latest Features in Ab Initio - 2.14

Dynamic script generation is the latest buzz in the Ab Initio world and one of its finest
features. It comes with many advantages that were not available in earlier versions of the
Ab Initio Co>Operating System. It is available in Co>Operating System version 2.14.46 and above.

This feature typically enables the use of Ab Initio PDL (Parameter Definition Language) and
Component Folding.

If we enable this feature by changing the script generation method to Dynamic in Run Settings,
we can run a graph without deploying it through the GDE. From then on we execute only the mp
file; there is no need for the ksh script. On the production server, when we run the mp file
using the air sandbox run command, it generates a reduced script on the fly that contains the
commands to set up the host environment. It doesn't include the component details of the graph
at all.

You can check the mp file of dynamic script generation enabled graph. It is an editable text file.

Component Folding: This is a feature by which the Co>Operating System combines a group of
components and runs them as a single process. Does it improve performance? Yes, in most cases
it brings a significant performance boost over the traditional approach to execution.

Prerequisites of Component Folding:

• The components must be foldable.
• They must be in the same phase and layout.
• They must be connected via a straight flow.
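
The three prerequisites can be read as a pairwise test on connected components. The following Python sketch is only a toy model of that test, not the Co>Operating System's actual folding algorithm: the `Component` class, the `fold_groups` helper, the layout name, and the `foldable` flags assigned to each component are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    foldable: bool  # hypothetical flag; real foldability is decided by the product
    phase: int
    layout: str


def can_fold(a, b, flow_kind):
    """Apply the three prerequisites to a pair of connected components."""
    return (
        a.foldable and b.foldable
        and a.phase == b.phase and a.layout == b.layout
        and flow_kind == "straight"
    )


def fold_groups(components, flows):
    """Group components that could share one process (simple union-find)."""
    parent = {c.name: c.name for c in components}
    by_name = {c.name: c for c in components}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst, kind in flows:
        if can_fold(by_name[src], by_name[dst], kind):
            parent[find(src)] = find(dst)

    groups = {}
    for c in components:
        groups.setdefault(find(c.name), []).append(c.name)
    return sorted(sorted(g) for g in groups.values())


components = [
    Component("filter", True, 0, "mfs4"),
    Component("reformat", True, 0, "mfs4"),
    Component("sort", False, 0, "mfs4"),   # assumed non-foldable for the example
    Component("rollup", True, 1, "mfs4"),  # different phase
]
flows = [
    ("filter", "reformat", "straight"),
    ("reformat", "sort", "straight"),
    ("sort", "rollup", "straight"),
]
groups = fold_groups(components, flows)
print(groups)  # [['filter', 'reformat'], ['rollup'], ['sort']]
```

Only the filter and reformat pair satisfies all three conditions, so only they end up in the same group.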

How it works (Advantages):

1. When this is enabled by checking the folding option in Run Settings, the Co>Operating System
runtime folds all the foldable components into a single process. As a result, the number of
processes is reduced when a graph executes. Every process carries overheads such as process
creation, scheduling, and memory consumption. These overheads vary from OS to OS. On some
operating systems, such as MVS, creating and maintaining processes is very costly compared with
the different flavors of UNIX.

2. Another major benefit of component folding is the reduction of DML interpretation time
between processes, because we end up with folded multitool processes communicating with other
multitool or unitool processes.

3. Apart from that, an increase in the number of processes results in more interprocess
communication. Data movement between two or more processes consumes not only time but memory
too. In a CFG (Continuous Flow Graph), interprocess communication is always very high, so it is
worth enabling component folding in a CFG.

Disadvantages of Component Folding:

1. Pipeline Parallelism: Because component folding folds different components into a single
process, it hurts the pipeline parallelism of Ab Initio. Suppose the flow of our graph is
Input File -> Filter By Expression -> Reformat -> Output File. In the traditional method,
pipeline parallelism lets Filter By Expression (FBE) and Reformat execute concurrently. But once
these two components are folded together, there is no chance of parallel execution.

2. Address Space: In a 32-bit OS the maximum address space for a process is 4 GB. So if we
combine 4 different components into a single process by component folding, the OS allows only
4 GB of address space for all 4 together, instead of 4 x 4 = 16 GB in total. So we should avoid
folding components whose memory use is very high, such as in-memory Rollup, Join, and Reformat
with lookup. Some components, like Sort and in-memory Join, buffer data internally. Combining
them in a single process will result in writing to disk (higher I/O).

Set the AB_MULTITOOL_MAXCORE variable to limit the maximum allowable memory for the folded
component group.

Excluding any component from Component Folding:

Sometimes you may wish to prevent components from being folded, either to allow pipeline
parallelism or to give them access to more address space. In that case you need to exclude those
components from folding.

Set the AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES configuration variable to the space-separated
mpnames of the components, in your $HOME/.abinitiorc or the system-wide
$AB_HOME/config/abinitiorc file, e.g.

export AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES="hash-rollup reformat-transform"

Another way to prevent two components from being folded together is to right-click the flow
between them and uncheck the Allow Component Folding option.

Everything has its cost, so it is always worth benchmarking before making a decision. Try both
preventing and allowing component folding for the components of your graph, and tune it for the
highest performance.

CPU tracking report of folded components in a graph:

To report the execution detail of a folded graph on the console, we need to override the
AB_REPORT variable with the show-folding option, for example:
AB_REPORT="show-folding flows times interval=180 scroll=true spillage totals file-percentages"

The folded components are displayed as a multitool process in the CPU tracking information. The
CPU time for a folded component is shown twice: once for the component itself and once as part
of the multitool process.

Parameter Definition Language (PDL):

PDL is used to put logic for inline computation in a parameter value. It provides high
flexibility in terms of interpretation. It supports both $ and ${} substitution. To use it, set
the parameter's interpretation to PDL and write the DML expression within $[ ]. This approach is
much faster than traditional shell scripting. It is the way forward to a much more flexible and
robust design technique. With it we can retire the old shell scripting, as script-start and
script-end have been beaten to death over the last few years. You can also use PDL
interpretation for the condition of a component.

NOTE: The detail on PDL within the GDE is lacking any consistency. Basically, we can use the
majority of the Ab Initio DML functions. I would recommend looking at the metaprogramming
section for starters, then playing with the parameters editor.

e.g.

Suppose a graph has a conditional component that runs based on the existence of a file called
emp.dat.

Now the FILE_NAME parameter is defined as /home/xyz/emp.dat and a conditional parameter called
EXIST is defined as

$[if (file_information($'FILE_NAME').found) 1 else 0]
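
For readers less familiar with PDL, the logic of the EXIST parameter can be mirrored in Python. This is an analogue for illustration only, not PDL or DML: `exist_flag` is a hypothetical helper, and the example uses a temporary directory rather than /home/xyz so that it runs anywhere.

```python
import os
import tempfile


def exist_flag(file_name):
    """Return 1 if the file exists, else 0 (analogue of the EXIST parameter)."""
    return 1 if os.path.isfile(file_name) else 0


# Use a temporary directory so the example does not depend on /home/xyz.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "emp.dat")
    open(present, "w").close()  # create an empty emp.dat
    missing = os.path.join(tmp, "no_such_file.dat")
    flag_present = exist_flag(present)
    flag_missing = exist_flag(missing)

print(flag_present, flag_missing)  # 1 0
```

The conditional component would then run only when the flag resolves to 1.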

We can also define a type or a transform function for use in parameters with the help of the
AB_DML_DEFS parameter.

e.g. Suppose AB_DML_DEFS is defined as

out :: sqrt(in) = begin out :: math_sqrt(in); end;

Now a parameter called SQRT is defined as $[sqrt(16)]

The resolved value of this parameter will be 4.

Ensure your host run settings are checked for dynamic script generation, and read the 2.14
patchset notes for a description of any hint.
