Paper: Informatica Optimization
By Pravin Ingale
pravin.ingale@in.ibm.com
AGENDA
Below are the goals of this document:
1. To provide useful guidelines to follow while designing or coding Informatica mappings, so that better-performing mappings are developed from the beginning of the project lifecycle itself. If developers follow these guidelines, then by the end of the project they will already have taken care of the general performance issues that can occur with Informatica mappings. These guidelines are drawn from the Informatica help files, previous project guidelines and my own project learnings.
2. To provide some useful tips that I have used in my projects to performance-tune ETL mappings. These are scenario-based tips and may not be applicable to every Informatica ETL project.
Note: The optimization tips were used on Informatica version 8.6 and may or may not be valid for other versions.
7. Use Normalizer
Use a Normalizer Transformation to pivot rows rather than
multiple instances of the same target.
8. Only connect what is used
8.1. Delete unnecessary ports and links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
8.2. In Lookup transformations, change unused ports to be neither input nor output. This keeps the generated lookup query as small as possible, which cuts down on the amount of cache required and thereby improves performance.
9. Watch the data types
Sometimes data conversion is excessive. Data types are converted automatically when the data types of connected ports differ. Minimize data type changes between transformations by planning the data flow before developing the mapping (see the sketch below).
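As a small illustration (the port names TXN_DATE_STR and o_TXN_DATE and the format string are assumptions, not from any particular project), converting a value explicitly once near the source and carrying the converted port downstream avoids repeated implicit conversions between transformations:

    -- Expression transformation near the source: output port o_TXN_DATE
    -- (date/time), derived once from the incoming string port, so that
    -- downstream transformations connect to o_TXN_DATE instead of
    -- re-converting the string
    TO_DATE(TXN_DATE_STR, 'YYYYMMDD')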
15. Date comparisons
Avoid date comparisons in lookups; replace them with string comparisons (a sketch follows below).
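For illustration only (the port names EFFECTIVE_DT and o_EFFECTIVE_DT_STR and the 'YYYYMMDD' format are assumptions), the date can be converted to a fixed-format string in an Expression transformation before the lookup and the string port used in the lookup condition; the lookup side must expose the matching column in the same string format:

    -- Expression transformation before the Lookup: output port
    -- o_EFFECTIVE_DT_STR, used in the lookup condition in place of
    -- the raw date/time port
    TO_CHAR(EFFECTIVE_DT, 'YYYYMMDD')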
18. Optimize Joiner
18.1. Normal joins are faster than outer joins and the resulting
set of data is also smaller.
18.2. Join sorted data when possible.
18.3. When using a Joiner Transformation, be sure to make the
source with the smallest amount of data the driving/Master
source.
18.4. For an unsorted Joiner transformation, designate as the
master source the source with fewer rows.
18.5. For a sorted Joiner transformation, designate as the
master source the source with fewer duplicate key values.
24. Optimize targets
You can identify target bottlenecks by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck. Consider performing the following tasks to increase performance:
24.1. Drop indexes and key constraints.
24.2. Increase checkpoint intervals.
24.3. Use bulk loading.
24.4. Use external loading.
24.5. Increase database network packet size.
24.6. Optimize target databases.
25. Optimize sources
If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks:
Filter Transformation - measure the time taken to process a given amount of data, then add an always-false Filter transformation in the mapping after each Source Qualifier so that no data is processed past the filter transformation. You have a source bottleneck if the modified session still runs in approximately the same amount of time, since eliminating the downstream processing did not reduce the run time. A sketch of such a test filter follows.
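A minimal sketch of such a test filter, assuming a temporary Filter transformation placed right after each Source Qualifier: its condition is set to a constant that always evaluates to false, so no rows pass and only the read portion of the pipeline does real work.

    -- Filter condition of the temporary test Filter transformation
    -- (FALSE is a predefined Informatica constant, equivalent to 0)
    FALSE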
This section of the session log gives statistics about the READER, TRANSFORMATION and WRITER threads. The thread with the highest busy percentage is the bottleneck. In the example above, the transformation thread is the bottleneck and needs to be optimized first.
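For reference, the busy percentage reported for each thread in the session log is derived from that thread's run time and idle time, approximately:

    Busy % = (Total Run Time - Total Idle Time) / Total Run Time * 100

so a thread that is rarely idle shows a high busy percentage and is the first candidate for tuning.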
The more mappings that ran in parallel, the more cache memory was required, and after a point mappings were failing due to insufficient memory on the Informatica nodes.
Solution: Used the MD5 function.
MD5 is a one-way hash function that returns the MD5 hash value of the input value as a character string. This function is available in the Expression transformation.
The Hub data model was modified and an additional column of char data type, named MD5_Value, was added to the identified tables.
All the identified ETL mappings were then modified. In the first Expression transformation, an output port was created that computed the MD5 value of all the non-key columns to be written to the target table, concatenated in a defined sequence. This MD5 value was inserted into the MD5_Value column of the target tables. Secondly, the lookups were modified to extract only the MD5_Value column and compare it with the MD5 value calculated from the source columns. Since only the MD5_Value column, instead of n lookup columns, was now cached during mapping execution and lookups, the memory requirement of the ETL mappings came down to a significant extent. There was no longer a limit on the number of mappings that could be executed in parallel in production, and the data comparison in the expression to identify updates was also faster.
This helped to bring down the ETL batch execution window by having all possible ETL mappings execute in parallel. A sketch of such an expression port is given below.
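For illustration only, a minimal sketch of such ports in the Expression transformation is given below. The column names (CUST_NAME, CUST_ADDRESS, CREDIT_LIMIT, LAST_UPD_DT), the port names and the '|' delimiter are assumptions, not the actual project definitions; the null handling and TO_CHAR formats would have to match the real data types and the agreed concatenation sequence.

    -- Variable port v_MD5_VALUE: MD5 hash of the non-key columns,
    -- concatenated in a fixed sequence with a delimiter so that adjacent
    -- values cannot run into each other
    MD5(
        IIF(ISNULL(CUST_NAME), '', CUST_NAME) || '|' ||
        IIF(ISNULL(CUST_ADDRESS), '', CUST_ADDRESS) || '|' ||
        IIF(ISNULL(CREDIT_LIMIT), '', TO_CHAR(CREDIT_LIMIT)) || '|' ||
        IIF(ISNULL(LAST_UPD_DT), '', TO_CHAR(LAST_UPD_DT, 'YYYYMMDD HH24:MI:SS'))
    )

    -- An output port (e.g. o_MD5_VALUE = v_MD5_VALUE) feeds the target
    -- MD5_Value column; a flag port compares it with the MD5_Value
    -- returned by the lookup on the target table to detect changes
    IIF(v_MD5_VALUE = LKP_MD5_VALUE, 'NO_CHANGE', 'UPDATE')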
Scenario 3:
In one of my projects, the source system was providing day-end
changes to our application Hub. There were 45 source tables from
which ETLs were loading data into 2 super-type and 4 sub-type tables.
The target tables had a huge volume of data, and the ETL had to look up the target tables in multiple passes to generate surrogate ids, identify inserts/updates, etc. The lookups on the target tables were caching this huge volume of data, and the lookup threads used to run for a long time to complete this caching.
Solution: Disabled the "Lookup caching enabled" property.
We saw that the number of changed records was around 1-2% of the total volume of records in the target tables. So we created 2 copies of each workflow. The original workflows were named <workflowname>_bulk and the new ones were named <workflowname>_incr (incremental).
We created 2 sets of scheduled jobs that would run all these 45-odd workflows: BULK and INCR. Needless to mention, all the _bulk workflows were added to the BULK job and all the _incr workflows were added to the INCR job. The only other difference was that we disabled the "Lookup caching enabled" property in the major lookups in the _incr workflows. This is depicted in the screenshot below.
Since the daily changes were very small, we scheduled the INCR job to run daily at night. The BULK job was kept on-demand and was hardly run; it was used just once during production deployment (for the history data load) and was planned for later execution whenever any bulk sync-up requirement came up in future. The target tables had no history maintenance requirement.
The INCR job now completed faster since the heavy lookup caching was omitted. Because the source records numbered only a few tens, or occasionally a hundred, there was no need to download all the target data into the lookup cache. For those few tens of source records, only that many lookup queries would hit the target tables, one per row, and the lookup was done.
Scenario 4:
In one project there was a set of account and customer tables that were loaded every month. These were month-end snapshot tables containing month-end data for each account or customer for every month. For example, there were around 35 million accounts, and the table data was expected to grow every month. With this data growth, the ETL performance was degrading every month.