Informatica Advanced Training
Course Objectives
PowerCenter/PowerMart Architecture
PowerCenter Advanced Transformations
Informatica Performance Tuning
Informatica Templates
PowerCenter Architecture
This section will cover:
Informatica Architecture & Connectivity - 6.x and 7.x
Informatica Architecture & Connectivity - 5.x
Architectural Differences
[Architecture diagrams]
Informatica 5.x: the Informatica Server connects to heterogeneous sources and targets and to the repository using native or ODBC connectivity.
Informatica 6.x/7.x: the client tools (Designer, Repository Manager, Workflow Manager, Workflow Monitor, Repository Server Administration Console) and the Informatica Server communicate with the Repository Server over TCP/IP; the Repository Server accesses the metadata repository through Repository Agent processes using native connectivity; sources 1-n and targets 1-n are reached via native or ODBC connections.
Architectural Differences
Sr. No. 1
Informatica 5.x: Only one server component - the Informatica Server, which is the ETL engine of the Informatica suite
Informatica 6.x and 7.x: Two server components - the Repository Server (manages communication of the Informatica client and server tools with the metadata repository) and the Informatica Server (the ETL engine of the Informatica suite)
Sr. No. 2
Informatica 5.x: Native or ODBC connectivity is used for communication between the metadata repository and the client tools and Informatica Server
Informatica 6.x and 7.x: All client tools and the Informatica Server communicate with the Repository Server over TCP/IP; the Repository Server connects to the metadata repository using Repository Agent processes, which use native connectivity to access the metadata
Sr. No. 3
Informatica 5.x: Sessions are the executable components; the Server Manager client tool is used for registering the Informatica Server with the repository, creating sessions and batches, scheduling them, and load-monitoring
Informatica 6.x and 7.x: Workflows are the executable components; the Workflow Manager tool is used for registering the Informatica Server with the repository, creating tasks, worklets and workflows, and scheduling workflows; Workflow Monitor is used for load-monitoring
Sr. No. 4
Informatica 5.x: Third-party tools are required for managing versions of Informatica components
Informatica 6.x and 7.x: Versioning is built into the suite; in addition, there are built-in configuration management tools to migrate components between environments such as Dev, Test and Prod, supporting a lights-out migration
Sr. No. 5
Informatica 5.x: Repository Manager is used for all maintenance tasks such as backup and restore, copy, creating repository contents, etc.
Informatica 6.x and 7.x: The Repository Server Administration Console is a new client tool for maintaining and managing the repository; the PMREPAGENT command-line utility is now available for repository maintenance
Sr. No. 6
Informatica 5.x: The PMREP command-line utility is used for all user management as well as repository maintenance
Informatica 6.x and 7.x: PMREP is used for user management
This section will cover:
Transformation Concepts Recap
Stored Procedure Transformation (Lab 1)
Filter
Aggregator
  Pivoting Data (Rows to Columns) (Lab 2)
  Incremental Aggregation (Labs 3, 4)
    Implementation using mapping variables and Aggregator
    Implementation using SP and Aggregator
    Implementation using Filter and Aggregator
Joiner
  Self Join (Lab 5)
  Joiner to emulate Lookup
Normalizer
  Converting Columns to Rows (Lab 6)
  Sourcing data from VSAM files
Lookup
  Dynamic Lookup
  Unconnected Lookup
What is a transformation?
A transformation is a repository object that generates, modifies, and/or passes data to other repository objects
Transformation Types
Active Transformation
The number of rows that pass through the transformation can change, i.e. the number of rows into the transformation may not equal the number of rows out of the transformation
Example: 10 rows may come into a Filter transformation, but depending on the filter condition, the number of rows out of the Filter can be anywhere between zero and 10
Passive Transformation
The number of rows into the transformation is always equal to the number of rows out of the transformation
Example: the number of rows into an Expression transformation is always equal to the number of rows out of it
Filter - Basics
The Filter transformation allows rows to be filtered in a mapping. It is an active transformation and is always connected
Any expression that returns a Boolean value can be used as a filter condition
The filter condition is case sensitive
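As an illustration (the port names here are hypothetical, not taken from the labs), a filter condition might look like:
SALARY > 50000 AND DEPT_NAME = 'SALES'
Rows for which the condition does not evaluate to TRUE are dropped from the pipeline.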
Aggregator - Basics
The Aggregator transformation performs aggregate calculations such as SUM, MAX, MIN and COUNT on groups of data
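For example (a sketch with hypothetical ports), an Aggregator grouped by DEPT_ID could define output ports such as:
o_TOTAL_SALARY = SUM(SALARY)
o_EMP_COUNT = COUNT(EMP_ID)
One output row is produced per group.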
Aggregator Cache
Example: the source contains two fields, field_name and field_value. These are to be converted into a target table whose column names are given by the list of field_name values in the source
The Aggregator transformation and the FIRST function can be used for converting rows into columns. Elaborating on the above example, if the target field names are First_Name, Last_Name, Middle_Name and Salutation, an output port can be defined as:
o_First_Name = FIRST(field_value, field_name = 'FIRSTNAME')
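Following the same pattern (a sketch; the field_name values below are assumed for illustration), the remaining target columns would each get one output port, with the Aggregator grouped by the key that identifies a single target row (for example, an employee ID):
o_Last_Name = FIRST(field_value, field_name = 'LASTNAME')
o_Middle_Name = FIRST(field_value, field_name = 'MIDDLENAME')
o_Salutation = FIRST(field_value, field_name = 'SALUTATION')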
Incremental Aggregation
When to use Incremental Aggregation?
New source data can be captured
Incremental changes do not significantly change the target
The aggregation does not contain percentile or median functions
The mapping does not include an Aggregator with Transaction transformation scope
How to capture incremental data?
Using a Filter and mapping variables (see the sketch below)
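A common sketch for capturing incremental data (the variable and port names here are hypothetical): define a mapping variable such as $$LAST_RUN_TIME, use it in the filter condition
LAST_UPDATE_TS > $$LAST_RUN_TIME
and advance it in an Expression transformation with
SETMAXVARIABLE($$LAST_RUN_TIME, LAST_UPDATE_TS)
When the session completes successfully, the PowerCenter Server saves the new variable value to the repository for use in the next run.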
Incremental Aggregation
Session failure scenarios when using Incremental Aggregation:
When the PowerCenter Server data movement mode is changed from ASCII to Unicode or vice versa
When the PowerCenter Server code page is changed to an incompatible code page
When the session sort order is changed for a PowerCenter Server running in Unicode mode
When the Enable High Precision option is changed in a session
To avoid session failures in the above cases, reinitialize the cache by deleting the incremental aggregation files
Incremental Aggregation
Configuring the session:
Set the location of the cache files; by default, index and data files are created in $PMCacheDir
Cache file names can be written to the session log by setting the tracing level to Verbose Initialization. In the session log, the path after TE_7034 gives the location of the index file and the path after TE_7035 gives the location of the data file
Incremental Aggregation session properties can be set in Session -> Properties -> Performance
Mapping Variable
Joiner - Basics
The Joiner transformation joins two sources with at least one matching port (see the example below)
The Joiner requires inputs from two separate pipelines, or from two branches of one pipeline
The Joiner is an active transformation and is always connected
The Joiner transformation cannot be used in the following cases:
Either input pipeline contains an Update Strategy transformation
A Sequence Generator transformation is connected directly before the Joiner transformation
Joiner conditions for string comparison are case sensitive
Supported join types are Normal, Master Outer, Detail Outer and Full Outer
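As an example (hypothetical ports): with a master pipeline carrying ORDER_ID and a detail pipeline carrying the same column (which the Designer typically renames to something like ORDER_ID1), the join condition matches the two ports, e.g. ORDER_ID = ORDER_ID1, and the chosen join type determines which unmatched rows are kept.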
Joiner - Cache
Self Join
A self-join scenario is encountered when aggregated information is required in the target table along with the transaction data
Using the Joiner transformation, this can be achieved in Informatica PowerCenter version 7.x
This feature was not available in earlier Informatica versions
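A minimal sketch (transformation and port names are hypothetical): branch the pipeline after the Source Qualifier; on one branch, an Aggregator grouped by DEPT_ID computes dept_total = SUM(SALARY); sort both branches on DEPT_ID and join them back with a Joiner (join condition DEPT_ID = DEPT_ID1, Sorted Input enabled), so that each transaction row carries its department total into the target.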
Normalizer - Basics
The Normalizer transformation is used to organize data in a mapping
COBOL sources: for COBOL sources, the Normalizer acts as the Source Qualifier. These sources are often stored in denormalized form, where an OCCURS statement nests multiple records within a single record
Relational sources: for relational sources, the Normalizer can be used to create multiple rows from a single row of data
The Normalizer is an active transformation and is always connected
Normalizer Ports
When using the Normalizer, two types of key ports are generated:
Generated Key (GK)
One port for each REDEFINES clause
Naming convention: GK_<redefined_field_name>
At the end of each session, the PowerCenter Server updates the GK value to the last value generated for the session plus one
Generated Column ID (GCID)
One port for each OCCURS clause
Naming convention: GCID_<occurring_field_name>
Note: These are default ports created by Normalizer and cannot be deleted
Rows that are not bound by transaction boundaries form an open transaction
Transaction boundaries originate from transaction control points; a transaction control point is a transformation that defines or redefines the transaction boundary
Transformation Scope
The PowerCenter Server can be configured on how to apply transformation logic to incoming data using the Transformation Scope transformation property. The following values can be set:
Row: Applies the transformation logic to one row of data at a time. Examples: Expression, External Procedure, Filter, Lookup, Normalizer (relational), Router, Sequence Generator, Stored Procedure, etc.
Transaction: Applies the transformation logic to all rows in a transaction. With this scope, the PowerCenter Server preserves incoming transaction boundaries. Examples: Aggregator, Joiner, Custom transformation, Rank, Sorter, etc.
All Input: Applies the transformation logic to all incoming data. With this scope, the PowerCenter Server drops incoming transaction boundaries and outputs all rows from the transformation as one open transaction. For transformations like Aggregator, Joiner, Rank, Sorter and Custom transformation, this is the default scope
Lookup - Basics
A Lookup transformation is created when data is to be looked up
The Lookup transformation is passive and can be connected or unconnected
A Lookup is used to perform tasks such as:
Getting a related value (e.g. keys from a dimension table)
Performing a calculation
Lookup - Cache
The PowerCenter Server builds a cache in memory when it processes the first row of data in a cached Lookup transformation
The PowerCenter Server stores Condition Values in the Index Cache and Output Values in Data Cache
Cache files are created in $PMCacheDir by default. These files are deleted and cache memory is released on completion of the session, the exception being persistent caches
The Lookup cache can be:
Persistent Cache
Static Cache
Dynamic Cache
Shared Cache
Lookup - Cache
Persistent Cache
With this option, lookup cache files are created by the Informatica Server the first time and can be reused the next time the Informatica Server processes the Lookup transformation, eliminating the time required to read the lookup table. Configure the Lookup to use a persistent cache when the lookup table does not change between sessions
Static Cache
By default, the PowerCenter Server creates a static, or read-only, cache for any lookup table. The PowerCenter Server does not update the cache while it processes the Lookup transformation
Dynamic Cache
The usual scenario for a dynamic cache is when the lookup table is also a target table. The PowerCenter Server dynamically inserts or updates data in the lookup cache
Shared Cache
The lookup cache can be shared between multiple transformations. An unnamed cache can be shared between transformations within a single mapping; a named cache can be shared between transformations in a single mapping or across mappings
Unconnected Lookup
An unconnected Lookup transformation receives input values from the result of a :LKP expression in another transformation
An unconnected Lookup transformation can be called more than once in a mapping
Unlike a connected Lookup, an unconnected Lookup can return only a single column; this column is designated as the Return Port
The unconnected Lookup does not support a user-defined default value for the return port; by default, the return value is NULL
An unconnected Lookup cannot use a dynamic cache
An unconnected Lookup can be used in scenarios such as:
Updating slowly changing dimension tables
Calling the lookup multiple times
Conditional lookups
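For example (a sketch with hypothetical transformation and port names), an Expression transformation can call an unconnected Lookup conditionally:
IIF(ISNULL(CUST_KEY), :LKP.LKP_CUSTOMER(CUST_ID), CUST_KEY)
The lookup LKP_CUSTOMER is evaluated only for rows where CUST_KEY is NULL, and its designated return port supplies the value.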
Use an Update Strategy transformation before or after the Lookup to flag rows for insert or update to the target
Ignore NULL Property
Set per port: ignore NULL values from the input row and update the cache only with non-NULL values from the input
Additional Information: the following options are available to work around this:
Make sure there are no duplicate rows in the table before starting the session
Use a static cache instead of a dynamic cache
Do a SELECT DISTINCT in the lookup SQL override
Problem Description: While editing a mapping, the following transformations are grayed out and cannot be added to a mapping or created in the Transformation Developer: Union, Midstream XML Generator, Midstream XML Parser
Solution: The transformations are grayed out because they need to be registered in the PowerCenter repository. These transformation types were introduced in PowerCenter 7.x and are treated as "native" plug-ins to the PowerCenter repository. Like any plug-in, they need to be registered in the repository. In normal operations, they are registered automatically when the repository is created. If the registration did not occur, these transformations are not available in the Designer.
On UNIX, enter the following command:
pmrepagent registerplugin -r RepositoryName -n Administrator -x Administrator -t "Oracle" -u DatabaseUserName -p DatabasePassword -c DatabaseConnectString -i native/pmuniontransform.xml
Union Transformation
[Table: column sizes used in cache size calculations - Double, Real, Integer and Small integer columns: 16 bytes each; string columns: Unicode mode: 2*(precision + 5), ASCII mode: precision + 9]
Cache Sizes
AGGREGATOR
Min Index Cache Size: Number of groups * [(Sum of column sizes) + 17]
Max Index Cache Size: Double the min index cache size
Data Cache Size: Number of groups * [(Sum of column sizes) + 7]
LOOKUP
Min Index Cache Size: 200 * [(Sum of column sizes) + 16]
Max Index Cache Size: Number of rows in lookup table * [(Sum of column sizes) + 16] * 2
Data Cache Size: Number of rows in lookup table * [(Sum of column sizes) + 8]
SORTER
Cache Size: Number of input rows * [(Sum of column sizes) + 16]
JOINER
Index Cache Size: (Sum of master column sizes in join condition + 16) * number of rows in master table
Data Cache Size: (Sum of master column sizes NOT in join condition but on output ports + 8) * number of rows in master table
RANK
Min Index Cache Size: Number of groups * [(Sum of column sizes) + 17]
Data Cache Size: Number of groups * [(number of ranks * ((Sum of column sizes) + 10)) + 20]
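As a worked illustration of the Aggregator formulas above (the numbers are made up): with 1,000 groups and the relevant column sizes summing to 80 bytes, the minimum index cache would be about 1,000 * (80 + 17) = 97,000 bytes and the data cache about 1,000 * (80 + 7) = 87,000 bytes; configured cache sizes are normally rounded up from such estimates.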
Performance Tuning
There are two general areas to optimize and tune:
External to Informatica (OS, memory, etc.) - getting data into and out of the Informatica engine
This involves optimizing the system to make sure Informatica runs well, usually by tuning non-Informatica components; the engine can't run faster than the source or target
Internal to Informatica (tasks, mappings, workflows, etc.) - getting data through the Informatica engine
This involves optimizing at the task and mapping level
Measuring Performance
For the purpose of identifying bottlenecks we will use:
Wall clock time, as a relative measurement of elapsed time
Rows per second (rows/sec), which allows performance measurement of a session over a period of time and across changes in the environment. Rows per second can have a very wide range depending on the size of the row (number of bytes), the type of source/target (flat file or relational) and the underlying hardware.
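For example (illustrative numbers only): a session that loads 1,200,000 rows in 600 seconds of wall clock time is running at 1,200,000 / 600 = 2,000 rows/sec; re-measuring after each change shows whether throughput improved.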
Identifying Bottlenecks
Reader Processes
Configure the session to read from a flat file source instead of the relational source
Measure the performance
Writer Processes
Configure the session to write to a flat file target instead of the relational target
Measure the performance
Mapping
Generally, if the bottleneck is not with the reader or writer process, the next step is to review the mapping
Mapping bottlenecks can be created by improper configuration of Aggregator, Joiner, Sorter, Rank and Lookup caches
Mapping Optimizing
Single-Pass Read
Use a single SQL statement when reading multiple tables from the same database
Data type conversions are expensive
Watch out for hidden port-to-port data type conversions
Avoid overuse of string and character conversion functions
Use filters early and often
Filters can be used as SQL overrides (Source Qualifiers, Lookups) and as transformations to reduce the amount of data processed
Simplify expressions
Factor out common logic
Use variable ports to reduce the number of times a function is evaluated (see the sketch below)
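A small sketch (hypothetical ports): in an Expression transformation, define a variable port
v_CLEAN_NAME = LTRIM(RTRIM(FIRST_NAME)) || ' ' || LTRIM(RTRIM(LAST_NAME))
and reference v_CLEAN_NAME in several output ports, so the trimming functions are evaluated once per row instead of once per output port.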
Mapping Optimizing
Use operators instead of functions when possible
The concatenation operator (||) is faster than the CONCAT function
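For example (hypothetical ports):
FIRST_NAME || ' ' || LAST_NAME
is preferred over
CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)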
Simplify nested IIFs when possible (for example by using DECODE; see the sketch below)
Use proper cache sizing for Aggregator, Rank, Sorter, Joiner and Lookup transformations
Incorrect cache sizing creates additional disk swaps that can cause a large performance degradation
Use the performance counters to verify correct sizing
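As a sketch (hypothetical values), a nested IIF such as
IIF(STATUS = 'A', 1, IIF(STATUS = 'B', 2, IIF(STATUS = 'C', 3, 0)))
can often be written more simply as
DECODE(STATUS, 'A', 1, 'B', 2, 'C', 3, 0)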
Key Range partitioning distributes rows based on pre-defined range values for keys
Available in PowerCenter 5, but only for the Source Qualifier; Key Range partitioning can be applied to other transformations in 6.x and up
Common uses for this new functionality:
Apply to a Target Definition to align output with the physical partition scheme of the target table
Apply to a Target Definition to write all data to a single file to stream into a database bulk loader that does not support concurrent loads (e.g. Teradata, DB2)
The partition key must be an input or input/output port; variable and output-only ports are not allowed because they have not yet been evaluated
You can select multiple keys to form a composite key
The range specification is a start range and an end range; you can also specify an open range (an example follows the warning message below)
NULL values will go to the first partition
All unmatched rows will also go into the first partition, and the user will see the following warning message once in the log file:
TRANSF_1_1_2_1> TT_11083 WARNING! A row did not match any of the key ranges specified at transformation [EXPTRANS]. This row and all subsequent unmatched rows will be sent to the first target partition.
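For example (a hypothetical key), a key-range specification on a CUSTOMER_ID port with two partitions might be: partition 1 with start range 1 and end range 1000000, and partition 2 with start range 1000001 and the end range left open; NULL and unmatched key values would then land in the first partition, as described above.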
Round-robin partitioning distributes rows evenly to each partition. The user need not specify anything because key values are not interrogated
Common uses:
Apply to a flat file source qualifier when dealing with unequal input file sizes
Use hash user keys partitioning instead when there are downstream lookups/joiners
Trick: all but one of the input files can be empty, so you no longer have to physically partition input files
Note: there are performance implications with doing this; sometimes it's better, sometimes it's worse
Hash partitioning distributes rows based on a hash function applied to the key values (a simplified illustration follows below)
PowerCenter supports auto hash partitioning automatically for Aggregator and Rank transformations
Goal: evenly distribute data, but make sure that like key values are always processed by the same partition
A hash function is applied to a set of ports; it returns a value between 1 and the number of partitions, and a good hash function provides a uniform distribution of return values
Hash auto-keys uses the grouped or sorted key ports (e.g. the Group By key or Sort key) as a composite key
Only valid for unsorted Aggregator, Rank and Sorter transformations
This is the default partition type for the unsorted Aggregator and Rank
NULL values will be converted to zero for hashing
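A simplified illustration (not the actual PowerCenter hash function): with 4 partitions, a key value such as DEPT_ID = 17 could map to partition (17 mod 4) + 1 = 2; every row with DEPT_ID = 17 then goes to partition 2, so a whole group is always aggregated within a single partition.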
Hash Key
Just like Auto Hash Key, but the user explicitly specifies the ports
Common uses:
When dealing with input files of unequal sizes, hash partition (vs. round-robin) data into downstream lookups and joiners to improve locality of reference of the caches (hash on the ports used in the lookup/join condition)
Override the auto hash for performance reasons
Pass-through partitioning passes rows through without redistributing them across partitions. Since data is not distributed, the user need not specify anything
Common use:
Create additional stages (processing threads) within the pipeline to improve performance
Partitioning Dos
Cache as much as possible in memory
Spread cache files across multiple physical devices, both within and across partitions
Unless the directory is hosted on some kind of disk array, configure disk-based caches to use as many disk devices as possible
Round-robin or hash partition a single input file until you determine this is the bottleneck
Key Range partition to align the source/target with the physical partitioning scheme of the source/target tables
Pass-through partition to apply more CPU resources to a pipeline (when a transformation is the bottleneck)
Partitioning Don'ts
Don't add partition points if the session is already source or target constrained; tune the source or target to eliminate the bottleneck
Don't add partition points if the CPUs are already maxed out (%idle < ~5%); eliminate unnecessary processing and/or buy more CPUs
Don't add partition points if the system performance monitor shows regular page-out activity; eliminate unnecessary processing and/or buy more memory
Don't add multiple partitions until you've tested and tuned a single partition; you'll be glad you did
Default partition points:
Reader: Source Qualifier or Normalizer transformation - Pass-through - Controls how the Server reads data from the source and passes data into the source qualifier
Transformation: Rank and unsorted Aggregator transformations - Hash auto-keys - Ensures that the Server groups rows before it sends them to the transformation
Writer: Target instances - Pass-through - Controls how the instances distribute data to the targets
Informatica Templates
This section will cover:
Template for generating mapping specifications
Template for generating session/workflow specifications
Unit test plan for Informatica mappings
Standards and best practices
Mapping Specifications
Session/Workflow Specifications