ABInitio FAQ
Q. How to calculate the total number of records in the file using REFORMAT instead of
ROLLUP?
A. Create a graph that uses the existing file as the output file and keep the mode of the output
file in append mode. Pass the new records from the input file to this output file through a
Reformat. This will append the new records to the existing file.
output:1: if (in.emp.sal < 500) in.emp.sal
output:2: force_error("Employee salary is less than 500")
A. The output_index function is used in a Reformat with multiple output ports to direct which
record goes to which output port.
For example, for a Reformat with three output ports, such a function could be written so that if
the field 'value' of a record evaluates to A in the transform function, the record comes out of
port 1 only and not ports 2 or 3.
A. Sort within Groups refines the sorting of data records already sorted on one key: it sorts
the records within the groups formed by the first sort according to a second key.
A. Environment variables, otherwise known as Ab Initio environment variables, are set in
stdenv, under which the private and public projects sit.
Parameters like $AB_HOME and $AB_AIR_PROT are defined as environment variables, and
each links to its corresponding path.
Q. What will happen when we pass a dot or invalid parameters in the input component
layout URL?
A.
A. The AB_JOB parameter is set when we want to run the same graph at the same time under
different job names.
or
When you want to run many instances of the same graph, which is placed in one location,
we use AB_JOB. It should be defined as a sandbox parameter. If you don't give it a value,
it will take AB_JOB as the default.
A. We cannot open Ab Initio in UNIX. We can only run graphs in UNIX using the deployed .ksh script.
A. If a graph fails in production, we usually get emergency access to see the failure and then
analyse it. If it is a code bug, we go back to the development environment, fix the bug, test it,
then deploy back to production and run.
Q. How do you check whether the graph completed successfully or not (is it $? of UNIX?)
A. 0 and 1:
0 is success
1 (or any non-zero status) is failure
$? returns the exit status of the last executed command.
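A small shell sketch of that check; run_graph here is a stand-in function that simulates a successful run, so replace it with your deployed .ksh script:

```shell
#!/bin/sh
# Stand-in for the deployed graph script; replace with ./your_graph.ksh.
run_graph() {
    true    # simulate a graph run that succeeds (exit status 0)
}

run_graph
status=$?    # $? holds the exit status of the last executed command

if [ "$status" -eq 0 ]; then
    echo "graph completed successfully"
else
    echo "graph failed with status $status"
fi
```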
A. A 'pipeline broken' error actually indicates the failure of a downstream component.
It normally occurs when the database runs out of memory, which makes the database
components in the graph unavailable.
Q. What is the usage of .mfctl and .mdir files in the mfs directory of Ab Initio?
A. .mfctl and .mdir are both related to the multifile system. .mfctl is the extension of the
control file created when we are using an MFS; the .mfctl file contains the URLs of all the data
partitions. The file with the extension .mdir contains the URL of the control file used by the
MFS.
Q. How to separate duplicate records with out Dedup sorted from the grouped input file?
A. Use a Rollup whose transform simply emits in.*
or
with the help of Rollup component functions like first and last.
Or
You can use Rollup to remove duplicate records in an input file (note that the duplicates are
key based); it will keep the last record for each key.
Or
Rollup helps to avoid duplicates without using the Dedup component:
it takes the first record and rejects the rest.
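One hedged way to write such a deduplicating Rollup transform, keyed on the duplicate key (field names are illustrative):

```
// One output record per key group, so duplicates on the key collapse.
// Which duplicate survives depends on the rollup semantics; to pin it
// down, use the aggregation functions mentioned above, e.g.
//   out.sal :: first(in.sal);   or   out.sal :: last(in.sal);
out :: rollup(in) =
begin
  out.* :: in.*;
end;
```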
Whenever a graph fails it creates a .rec file in the working directory; that directory may be
where your deployed graph script is stored. Remove the .rec file and then run the deployed
script of the graph from UNIX, or you may use m_rollback -d.
A. Your graph failed and you want to run it again, or you want to run multiple instances of the
graph.
Q. How does one make use of the "Call Web Service" component in the
$AB_HOME/connectors/Internet directory of the component selector window of the
Ab Initio Console? Explain with sample code.
A.
Q. How to create SCDs (slowly changing dimensions) in Ab Initio?
A. If you want to implement SCDs in Ab Initio, you should do delta processing.
A. If the two files have totally different layouts, you can use the Fuse component. Read about
it in the Ab Initio help.
Or
If the layout is totally different, use the Fuse component.
Or
To join a serial file and a multifile, use a Broadcast component after the serial file
and before the Join.
Q. Which file should we keep it as a look up file, large file or less data records file &
why?
A. We should always use the small file (i.e. the file with fewer records) as the lookup. The
reason is that this file is kept in main memory (RAM) from the start to the end of the
script/graph run; hence the smaller the file, the better the performance of the server.
or
A lookup file should always be small. If the data grows every day, the performance will become
poor; it is not wise to use a bigger file as a lookup, as it spoils the lookup concept.
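As a hedged illustration of how such a lookup file is then used inside a transform (the lookup label 'emp_lkp', the key emp_id and the field emp_name are assumptions, not taken from this FAQ):

```
// 'emp_lkp' is the label given to the lookup file in the graph;
// in.emp_id is the lookup key, .emp_name the field fetched from it.
out :: reformat(in) =
begin
  out.* :: in.*;
  out.emp_name :: lookup("emp_lkp", in.emp_id).emp_name;
end;
```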
A. You can use the CONTINUOUS components to build this. It requires an environment setup,
though. You can read through the Ab Initio help by searching on 'Continuous graphs'.
A. Connecting two different servers in Ab Initio is done through a file called abinitio.rc. This is
used for remote connectivity. The file contains information like the server IP (or name), the
user name and the password required to connect.
A. Provided the DML is the same, you can directly connect both input and output datasets and
perform an extract-and-load operation. For example, if the input dataset is a table and the
output is a file, you can connect them directly, making sure the DML of the file is propagated
from the table.
Q. What is the difference between In-Memory Sort and Inputs must be sorted?
Q. How will I implement insert, update and delete in Ab Initio? How will you view an MFS in
UNIX? What is the difference between conditional DML and a conditional component?
A. To find records which should be inserted, updated or deleted, use an Ab Initio flow:
a. unload the master table
b. read the delta file
c. use an inner join to join a and b. Unused a will be your delete records (if required),
unused b will be your insert records, and the joined a and b will be your update records.
Q. What is meant by header and trailer? Suppose the header and trailer had some junk data;
how will you delete the junk data, and which components are used?
A. 1. If you know the signature of the header and trailer records, then use the Filter by
Expression component to filter out the header and trailer records.
Q. I had 10,000 records and loaded 4,000 records today; I need to load records 4,001 - 10,000
the next day. How is this done in Type 1 and in Type 2?
A. Simply take a Reformat component and put
next_in_sequence() > 4000 in the select parameter.
Q. What are the steps in actual Ab Initio graph processing, including general, pre- and post-
process settings?
A. 1. Start script
2. Graph components
3. End script
.air-sandbox-overrides
This file exists only if you are using version 1.11 or a later version of the GDE. It contains
the user's private values for any parameters in .air-project-parameters that have the Private
Value flag set. It has the same format as the .air-project-parameters file.
When you edit a value (in the GDE) for a parameter that has the Private Value flag checked,
the value is stored in the .air-sandbox-overrides file rather than the
.air-project-parameters file.
Q. In Join component which record will go to unused port and which will go to reject
port ?
A. In the case of an inner join, all the records not matching the specified key go to the
respective unused ports; in a full outer join, none of the records goes to the unused ports.
In the case of the reject port, records which do not match the DML come to the reject port.
OR
In the case of an inner join, all the records not matching the specified key go to the
respective unused ports; in a full outer join, none of the records goes to the unused ports.
All the records that evaluate to NULL during the join transformation go to the reject port,
and the component aborts once the number of rejects exceeds
limit + ramp * number_of_input_records_so_far.
A. There are many ways to create a surrogate key, but it depends on your business logic. Here
you can try these ways:
2. use the Assign key values component (if your GDE is higher than 1.10)
3. write a stored procedure for this and call the stored procedure wherever you need it
Q. What is a semi-join?
A. In Ab Initio, there are 3 types of join: 1. inner join, 2. outer join and 3. semi join.
For an inner join, the 'record_required' parameter is true for all in ports.
For an outer join, it is false for all the in ports.
If you want a semi join, you set 'record_required' to true for the required port and false for
the other ports.
Q. How will you ensure that the components created in one version do not malfunction/cease
functioning in other version?
A. The runtime behaviour of components will remain the same in all versions unless a version
requires an additional parameter to be defined. New versions of the ETL tool come with some
changes to component-level parameters (observation as of now).
Q. What data modelling do you follow while loading data into tables? Also, does the DB you
are inserting the data into have a star schema or a snowflake schema?
A.
Q. How does the force_error function work? If we set 'never abort' in a Reformat, will
force_error stop the graph or will it continue to process the next set of records?
A. Here you can set two conditions for the Reformat component:
1. If you want it to fail, set the reject threshold to 'Abort on first reject'.
2. If you don't want it to fail, set it to 'Never abort'.
force_error is used to abort a graph if conditions are not met; you can write the error
records to a file and then abort the graph, and this can be done in different ways.
Or
The force_error() function will not stop the graph; it will write the error message to the error
port for that record and will process the next record.
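A hedged sketch of such a Reformat transform (field names are illustrative): with the reject threshold set to 'Never abort', the offending record goes to the error port and the graph keeps processing.

```
// Records with sal < 500 are sent to the error port via force_error;
// all other records pass through unchanged.
out :: reformat(in) =
begin
  out.* :: in.*;
  out.sal :: if (in.sal < 500)
               force_error("Employee salary is less than 500")
             else in.sal;
end;
```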
A. Phasing breaks the graph into different blocks. It creates some temp files while running and
deletes them once the phase completes.
A checkpoint is used for recovery: when the graph is interrupted, instead of rerunning the
graph from the start, execution resumes from the point where it stopped.
Q. What is the function of XFR in Ab Initio? It would be great if one of you could explain
briefly what xfr does (what it is for, where it is stored, how it affects things).
A. As you know, when you create a new sandbox in the Ab Initio environment, the following
directories are created:
1. mp
2. dml
3. xfr
4. db
etc.
xfr is the directory in Ab Initio where we can write our own functions and use them during
transformations (Rollup, Reformat etc.).
For example, you can write a function to convert a string into a decimal, or to get a string's
maximum length. I can write such functions in a file called user_define_function.xfr in the xfr
directory; inside this file I can define a function called string_to_integer or
get_string_max_length, or both. In any transform component you can include the file like:
include "<full path>/user_define_function.xfr"
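A hedged sketch of what such a user_define_function.xfr file might contain (the function names and bodies are illustrative, not taken from this FAQ):

```
// user_define_function.xfr -- user-defined DML functions.

// Convert the digit characters of a string to an integer value.
out :: string_to_integer(str) =
begin
  out :: (integer(8)) (decimal("")) string_filter(str, "0123456789");
end;

// Return the length of a string.
out :: get_string_max_length(str) =
begin
  out :: string_length(str);
end;
```

Any transform component that includes this file can then call string_to_integer(...) or get_string_max_length(...) directly.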
2. Pipeline parallelism: all program components running simultaneously on the same data sets.
Pipeline parallelism is broken by all sort-based components.
3. Data parallelism: distributes data records across multiple locations using partition
components.
2. Add the size of the lookup files used in the phase (if multiple components use the same
lookup, count it only once).
3. Multiply by the degree of parallelism. Add up all components in a phase; that is how much
memory is used in that phase.
4. (Total memory requirement of a graph) = (the memory of the largest-memory phase in the graph).
Q. How can I achieve a cumulative summary in Ab Initio other than by using the SCAN component?
Is there any inbuilt function available for that?
A. Scan is really the simplest way to achieve this. Another way is to use a ROLLUP, since it
is a multistage component. You need to put the ROLLUP component into multistage format
and write the intermediate results to a temp array (they're called vectors in Ab Initio). The
ROLLUP loops through each record in your defined group.
Let's say you want to get intermediate results by date. You sort your data by {ID; DATE} first,
then ROLLUP by {ID}. The ROLLUP will execute its transformation for each record per ID.
So store your results in a temp vector, which will need to be initialized to the size of your
largest group. Each time the ROLLUP enters the transformation, write to the [i] position in the
array and increment i. As long as this is all done in the "rollup" transformation and not the
"finalize" transformation, it will run the "initialize" portion before it moves to the next ID.
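A hedged skeleton of a ROLLUP in that multistage format (field names are illustrative, and the vector bookkeeping described above is reduced here to a simple running sum per {ID} group):

```
// Multistage ROLLUP keyed on {ID}: initialize runs once per group,
// rollup runs once per record, finalize emits the group's result.
type temporary_type =
  record
    real(8) cum_sal;    // running total within the {ID} group
  end;

temp :: initialize(in) =
begin
  temp.cum_sal :: 0;
end;

temp :: rollup(temp, in) =
begin
  temp.cum_sal :: temp.cum_sal + in.sal;
end;

out :: finalize(temp, in) =
begin
  out.id :: in.id;
  out.cum_sal :: temp.cum_sal;
end;
```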
I have done it this way, but Scan is easier. I was doing a simpler rollup before I found that I
needed cumulative intermediate results, so I just modified my existing ROLLUP. The Ab Initio
documentation does not explain this technique in detail, but it can be done.
Q. I have a file containing 5 unique rows and I am passing them through a SORT component
using a null key, then passing the output of SORT to Dedup Sort. What will happen; what will
the output be?
A. If no key is used in the Sort component, then with Dedup Sort the output depends on the
keep parameter:
If it is set to first, the output will have only the first record.
If it is set to last, the output will have only the last record.
If it is set to unique_only, there will be no records in the output file.
A. I think it is not advisable to use a 1 GB lookup file; it will definitely affect the parallel
processing of other applications and hurt performance.
I would prefer to use an MFS lookup file rather than a serial lookup file in this case.
Q. I have 2 files containing fields file1(A,B,C) and file2(A,B,D). If we partition both files on
key A using Partition by Key and pass the output to a Join component, and the join key is (A,B),
will it join or not, and WHY?
A.
Q. In my sandbox I have 10 graphs, and I checked those graphs into the EME. I then checked
out a graph and made modifications, and found the modifications were wrong. What do I have
to do if I want to get the original graph back?
A.
Q. What is a sandbox?
A. A sandbox is a directory structure in which each directory level is assigned a variable name;
it is used to manage check-in and check-out of repository-based objects such as graphs.
You'll require a sandbox when you use the EME (repository software) to maintain release
control. Within the EME an identical structure will exist for the same project.
The above-mentioned structure exists under the OS (e.g. UNIX); for instance, for a project
called fin, fin is usually the name of the top-level directory.
When you check out or check in a whole project, or an object belonging to a project, the
information is exchanged between these two structures.
For instance, if you check out a dml called fin.dml from the project called fin, you need a
sandbox with the same structure as the EME project called fin. Once you've created that,
fin.dml (or a copy of it) will come out of the EME and be placed in the dml directory of your
sandbox.
Q. I have a job that does the following: FTPs files from a remote server, reformats the data in
those files and updates the database, then deletes the temporary files. How do we trap errors
generated by Ab Initio when an FTP fails? If I have to re-run / restart the graph, what are the
points to be considered? Does the *.rec file have anything to do with it?
A. Ab Initio has very good restartability and recovery features built into it. In your situation
you can do the tasks you mentioned in one graph with phase breaks:
FTP in phase 1, your transformation in the next phase, and then the DB update in another phase
(this is just an example; it may not be the best way of doing it, as the best design depends on
various other factors).
If the graph fails during the FTP, your graph fails in phase 0 and you can simply restart it. If
your graph fails in phase 1, then an AB_JOB.rec file exists, and when you restart your graph
you will see a message saying a recovery file exists: do you want to start your graph from the
last successful checkpoint or restart from the beginning? The same applies if it fails in phase 2.
Phases are expensive from a disk I/O perspective, so you have to be careful about doing too
much phasing.
Coming back to error trapping: each component has reject, error and log ports. Reject captures
rejected records, error captures the corresponding errors, and log captures the execution
statistics of the component. You can control the reject status of each component by setting the
reject threshold to "Never Abort" or "Abort on first reject", or by setting ramp/limit.
Recovery files keep track of crucial information for recovering the graph from a failed status,
such as which node each component is executing on. It is a bad idea to just remove the *.rec
files; you always want to roll back the recovery files cleanly so that temporary files created
during graph execution don't hang around, occupy disk space and create issues.
Ad hoc multifiles treat several serial files having the same record format as a single graph
component.
Frequently, the input of a graph consists of a set of serial files, all of which have to be processed
as a unit. An Ad hoc multifile is a multifile created 'on the fly' out of a set of serial files, without
needing to define a multifile system to contain it. This enables you to represent the needed set of
serial files with a single input file component in the graph. Moreover, the set of files used by the
component can be determined at runtime. This lets the user customize which set of files the
graph uses as input without having to change the graph itself, even after it goes into production.
Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.
The simplest way to define an Ad hoc multifile is to list the files explicitly as follows:
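For instance, the file list of the input file component might read as follows (the file names are illustrative):

```
$AI_SERIAL/ad_hoc_input_01.dat
$AI_SERIAL/ad_hoc_input_02.dat
$AI_SERIAL/ad_hoc_input_03.dat
```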
If you have listed 'n' files, the input file now acts something like a file in an n-way multifile
system whose data partitions are the n files you listed. It is possible for components to run in
the layout of the input file component. However, there is no way to run commands such as m_ls
or m_dump on the files, because they do not comprise a real multifile system.
There are other ways than listing the input files explicitly in an Ad hoc multifile.
1. Listing files using wildcards - If the input file names have a common pattern then you can use
a wild card for all the files. E.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files that are found at
the runtime matching the wild card pattern will be taken for the Ad hoc multifile.
2. Listing files in a variable. You can create a runtime parameter for the graph and inside the
parameter you can list all the files separated by spaces.
Broadcast
Input File 1 (multifile) -------------------> Join
Input File 2 (serial) ---> Broadcast -------> Join
Input File 2 is a serial file and it is being joined with a multifile, Input File 1, without being
partitioned. The Broadcast component writes data to all partitions of Input File 1's layout,
creating an implicit fan-out flow.
Or
The short answer is that Replicate copies a flow while Broadcast multiplies it: Broadcast is a
partitioner, whereas Replicate is a simple flow-copy mechanism.
Replicate appears in over 90% of all Ab Initio graphs (across the board of all implementations
worldwide), whereas Broadcast appears in less than 1% of all graphs.
You won't see any difference between the two until you start using data parallelism; then it
will go south rather quickly. Here's an experiment:
Use a simple serial input file, followed by a Broadcast, then a 4-way multifile output file
component.
If you run the graph with, say, 100 records from the input file, it will create 400 records in the
output file: 100 records for each flow partition encountered.
If you had used a Replicate, it would have read and written 100 records.
1. What is the function you would use to transfer a string into a decimal?
4. Have you ever encountered an error called 'depth not equal'? (This apparently occurs when
you extensively create graphs; kind of a trick question.)
7. What's the difference between partitioning with key and round robin?
Dynamic script generation is the latest buzz in the Ab Initio world and one of its finest
features. It comes with lots of other advantages which were not there in earlier versions of the
Ab Initio Co>Operating System. It is available in Co>Operating System version 2.14.46 and
above.
This feature typically enables the use of Ab Initio PDL (Parameter Definition Language) and
Component Folding.
If we enable this feature by changing the script generation method to Dynamic in the Run
Settings, we will be able to run a graph without deploying it through the GDE. From then on
we execute the mp file only; there is no need to have the ksh. On the production server, once we
run the mp file using the air sandbox run command, it generates a reduced script on the fly
which contains the commands to set up the host environment. It doesn't include the component
details of the graph at all.
You can check the mp file of dynamic script generation enabled graph. It is an editable text file.
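The run step then looks something like this (the sandbox path and graph name are illustrative):

```
# Run the graph's mp file directly; no deployed .ksh is needed.
air sandbox run /path/to/sandbox/mp/my_graph.mp
```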
• The components must be foldable
• They must be in the same phase and layout
• Components must be connected via a straight flow
1. When this is enabled by checking the folding option in the Run Settings, the Co>Operating
System runtime folds all the foldable components into a single process. As a result, the number of
processes is reduced when a graph executes. Every process has the overheads of process
creation, scheduling, memory consumption etc. These overheads vary from OS to OS; in some,
like MVS, the creation and maintenance of processes are very costly compared to the different
flavors of UNIX.
2. Another major benefit of component folding is the reduction of DML interpretation time
between processes, because you end up with folded multitool processes communicating with
other multitool or unitool processes.
3. Apart from that, an increase in the number of processes results in higher interprocess
communication. Data movement between two or more processes consumes not only time but
memory too. In a CFG (Continuous Flow Graph) interprocess communication is always very
high, so it is worth enabling component folding in a CFG.
2. Address space: in a 32-bit OS the maximum address space for a process is 4 GB. So if we
combine 4 different components into a single process by component folding, the OS will allow
only 4 GB of address space for all 4, instead of 4 x 4 = 16 GB in total. So we should avoid
folding components whose memory use is very high, such as in-memory Rollup, Join, and
Reformat with lookup. Some components, like Sort and in-memory Join, cause internal
buffering of data; combining them in a single process will result in writing to disk (higher I/O).
Set the AB_MULTITOOL_MAXCORE variable to limit the maximum allowable memory for a
folded component group.
Sometimes you may wish to prevent components from being folded, to allow pipeline
parallelism or to access more address space. Then you need to exclude some components from
being folded.
Another way to prevent two components from being folded together is to right-click on the
flow between them and uncheck the Allow Component Folding option.
Everything has its cost, so it is always worth benchmarking before taking a decision. Prevent
and allow component folding for the components of your graph and tune it for the highest
performance.
To report the execution detail of a folded graph on the console, we need to override the
AB_REPORT variable with the show-folding option, e.g. AB_REPORT="show-folding flows
times interval=180 scroll=true spillage totals file-percentages".
The folded components are displayed as a multitool process in the CPU tracking information.
The CPU time for a folded component is shown twice: once for the component itself and once
as part of the multitool process.
PDL is used to put logic for inline computation into a parameter value. It provides high
flexibility in terms of interpretation, and it supports both $ and ${} substitution. For this you
need to set the interpretation to PDL and write the DML expression within $[ ]. This approach
is much faster than traditional shell scripting, and it is the way forward to a more flexible and
robust design technique. With it we can abolish the old shell scripting, as script-start and
script-end have already been beaten to death over the last few years. You can use PDL
interpretation for the condition of a component.
NOTE: The documentation of PDL within the GDE lacks consistency. Basically, we can use the
majority of the Ab Initio DML functions. I would recommend looking at the metaprogramming
section for starters, then playing with the parameters editor.
e.g.
Suppose in a graph we have a conditional component which runs based on the existence of a
file called emp.dat.
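A hedged sketch of such a condition, with the interpretation of the condition parameter set to PDL (the path, and the use of file_information with a found field, are assumptions, not confirmed by this FAQ; check the DML documentation):

```
// Runs the component only when $AI_SERIAL/emp.dat exists.
// file_information and its 'found' field are assumed here.
$[ if (file_information("$AI_SERIAL/emp.dat").found) 1 else 0 ]
```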
We can define a parameter with a type and a transform function with the help of the
AB_DML_DEFS parameter.
Ensure your host run settings are checked for dynamic script generation, and read the 2.14
patchset notes for a description of any hints.