Ab Initio
Presenters Name
Role Month, Year
Australia | Canada | France | India | New Zealand | Singapore | Switzerland | United Arab Emirates | United Kingdom | United States 2010 Keane. This document is the property of Keane. No part of this document shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, to parties outside of your organization without prior written permission from Keane.
Agenda
DWH Concept
Data warehouses store large volumes of data that are frequently used by Decision Support Systems.
A data warehouse is maintained separately from the organization's operational databases.
Data warehouses are relatively static, with only infrequent updates.
A data warehouse is a stand-alone repository of information, integrated from several, possibly heterogeneous, operational databases.
[Figure: Source Systems, Extract, Transfer]
What is ETL?
Extract, Transform, and Load (ETL) is a process that involves extracting data from outside sources, transforming it to fit business needs (which can include quality levels), and ultimately loading it into the end target, i.e. the data warehouse.
Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization or format. Common data source formats are relational databases and flat files, but they may also include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format suitable for transformation processing.
Transform
The transform stage applies a series of rules or functions to the data extracted from the source to derive the data to be loaded into the end target. Some data sources require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:
Selecting only certain columns to load (or selecting null columns not to load).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female).
Encoding free-form values (e.g., mapping "Male" to "1" and "Mr." to "M").
Deriving a new calculated value (e.g., sale amount = qty * unit price).
Joining together data from multiple sources (e.g., lookup, merge, etc.).
Summarizing multiple rows of data (e.g., total sales for each store, and for each region).
Generating surrogate key values.
Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns).
Applying any form of simple or complex data validation. If validation fails, the data may be rejected in full, in part, or not at all, so that none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling.
Most of the above transformations may themselves result in an exception, e.g. when a code translation encounters an unknown code in the extracted data.
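A few of the transformation types above can be sketched in plain Python. This is an illustration only, not Ab Initio code; the field names (`customer_id`, `gender_code`, `qty`, `unit_price`) and the function names are assumptions made for the example.

```python
# Hypothetical lookup table: source system codes -> warehouse codes.
GENDER_CODES = {"1": "M", "2": "F"}


def transform_row(row):
    """Apply three sample transformations to one extracted row (a dict)."""
    code = row["gender_code"]
    if code not in GENDER_CODES:
        # An unknown code is the exception case described in the text.
        raise ValueError(f"unknown gender code: {code!r}")
    return {
        # Select only the columns the target needs.
        "customer_id": row["customer_id"],
        # Translate a coded value.
        "gender": GENDER_CODES[code],
        # Derive a new calculated value.
        "sale_amount": row["qty"] * row["unit_price"],
    }


def transform_all(rows):
    """Return (loaded, rejected) so that failing rows are kept, not lost."""
    loaded, rejected = [], []
    for row in rows:
        try:
            loaded.append(transform_row(row))
        except (KeyError, ValueError) as exc:
            rejected.append((row, str(exc)))
    return loaded, rejected
```

Here a failing row is rejected on its own while the rest of the data moves on; a stricter rule design could stop the whole run instead.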
Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW) add new data in a historized form, e.g. hourly. The timing and scope of replacing or appending are strategic design choices that depend on the time available and the business needs. Some systems maintain a history and audit trail of all changes to the data loaded into the DW. Because the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated on data load, apply (e.g. uniqueness, referential integrity, mandatory fields). These constraints also contribute to the overall data quality performance of the ETL process.
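The effect of schema constraints during the load can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch, assuming a hypothetical `sales` table; a real warehouse load would normally use the bulk loader of the target database.

```python
import sqlite3


def load_rows(conn, rows):
    """Append rows to the target table; return the rows the schema rejected."""
    rejected = []
    for row in rows:
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "INSERT INTO sales (sale_id, store, amount) VALUES (?, ?, ?)",
                    row,
                )
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " sale_id INTEGER PRIMARY KEY,"  # uniqueness
    " store TEXT NOT NULL,"          # mandatory field
    " amount REAL NOT NULL)"
)
```

Rows that violate a constraint never reach the table, which is how the schema contributes to data quality on top of the transform rules.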
Introduction to Ab Initio
1. What is Ab Initio
2. Ab Initio Platforms
3. Architecture of Ab Initio
4. Run Process
5. Components
6. Parallelism
7. Sandbox and Projects
8. Basic Graph Development
9. Multifile
10. Performance Tuning
1. What is Ab Initio
Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com).
Ab Initio is Latin for "from the beginning".
Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, clickstream processing, data movement, data transformation and analytics.
Designed to support the largest and most complex business applications.
Proven best-of-breed ETL solution.
Applications of Ab Initio:
ETL for data warehouses, data marts and operational data sources.
Parallel data cleansing and validation.
Parallel data transformation and filtering.
High-performance analytics.
Real-time, parallel data capture.
Co>Operating System
The Co>Operating System is core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe reliability. The Co>Operating System is layered on top of the native operating systems of a collection of computers. It provides a distributed model for process execution, file management, process monitoring, checkpointing, and debugging.
Co>Operating System
On a typical installation, the Co>Operating System is installed on a Unix or Windows NT server, while the GDE is installed on a Pentium PC.
[Figure: EME with locking, and three GDE clients]
Ab Initio runs on many operating systems:
Compaq Tru64 UNIX
Digital UNIX
Hewlett-Packard HP-UX
IBM AIX Unix
NCR MP-RAS
Red Hat Linux
IBM/Sequent DYNIX/ptx
Siemens Pyramid Reliant UNIX
Silicon Graphics IRIX
Sun Solaris
Windows NT and Windows 2000
3. Architecture of Ab Initio
[Figure: architecture layers]
Applications
Application Development Environments: Graphical (GDE), C++, Shell (.ksh)
Ab Initio Metadata Repository (EME)
Component Library, User-defined Components, Third Party Components
Ab Initio Co>Operating System
Architecture of Ab Initio
Unix Shell Script or NT Batch File
Supplies parameter values to underlying programs through arguments and environment variables.
Controls the flow of data through pipes.
Usually generated using the GDE.
GDE
Ability to graphically design batch programs comprising Ab Initio components, connected by pipes.
Ability to test run the graphical design and monitor its progress.
Ability to generate a shell script or batch file from the graphical design.
[Figure: Host Machine 1 and Host Machine 2, each showing User Programs and Ab Initio Built-in Component Programs (Partitions, Transforms, etc.) above the Co>Operating System and the native Operating System (Unix, Windows NT)]
4. Run Process
What happens when you push the Run button?
1. Your graph is translated into a script that can be executed in the Shell Development Environment.
2. This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
3. The script is invoked (via REXEC or TELNET) on the server.
4. The script creates and runs a job that may run across many nodes.
5. Monitoring information is sent back to the GDE client.
Run Process
Consider the sample graph below and follow what happens when the Run button (top right of the screenshot) is pressed.
[Figures: job start-up diagrams. Labels: GDE (Client), Host process (Host), Agents (Processing nodes)]
Component Execution
Component processes do their jobs. Component processes communicate directly with datasets and each other to move data around.
[Figures: GDE (Client), Host process (Host), Agents (Processing nodes)]
Agent Termination
When all of an Agent's component processes exit, the Agent informs the Host process that those components are finished. The Agent process then exits.
[Figure: Client, Host process (Host), Processing nodes]
Host Termination
When all Agents have exited, the Host process informs the GDE that the job is complete. The Host process then exits.
[Figure: GDE (Client), Host process (Host), Processing nodes]
5. Components Overview
There are two main sets of components available in Ab Initio:
Dataset Components: components which hold data.
Program Components: components which process data.
Dataset Components
Input file:
INPUT FILE represents records read as input to a graph from one or more serial files or from a multifile.
Input table:
INPUT TABLE unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.
Output file:
OUTPUT FILE represents records written as output from a graph into one or more serial files or a multifile. When the target of an OUTPUT FILE component is a special file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and recreates that file, nor does it ever truncate it.
Output table:
OUTPUT TABLE loads records from a graph into a database, letting you specify the destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.
Program Components
Sort:
SORT sorts and merges records. You can use SORT to order records before you send them to a component that requires grouped or sorted records.
key (key specifier, required): Name(s) of the key field(s) and the sequence specifier(s) you want the component to use when it orders records.
max-core (integer, required): Maximum memory usage in bytes. Default is 100663296 (96 MiB, roughly 100 MB). When the component reaches the number of bytes specified in the max-core parameter, it sorts the records it has read and writes a temporary file to disk.
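The spill-to-disk behaviour described for max-core is the general external sort technique. The sketch below shows that technique in Python, not the SORT component itself; for simplicity it limits memory by a record count (`max_records_in_memory`, an assumed name) rather than by bytes.

```python
import heapq
import pickle
import tempfile


def external_sort(records, key, max_records_in_memory):
    """Sort an iterable with bounded memory: sorted runs spill to temp files."""
    runs = []
    buffer = []

    def spill():
        # Sort what has been read so far and write it out as one run.
        buffer.sort(key=key)
        run = tempfile.TemporaryFile()
        for rec in buffer:
            pickle.dump(rec, run)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_records_in_memory:
            spill()
    if buffer:
        spill()

    def read_run(run):
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    try:
        # Merge the sorted runs back into one ordered stream.
        yield from heapq.merge(*(read_run(r) for r in runs), key=key)
    finally:
        for run in runs:
            run.close()
```

A larger memory limit means fewer temporary runs and less disk traffic, which is why the setting matters for performance.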
Reformat:
REFORMAT changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.
1. Reads a record from the input port.
2. The record is passed as an argument to the transform function (xfr).
3. The record is written to the out port if the function returns a success status.
4. The record is written to the reject port if the function returns a failure status.
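The four steps above can be mimicked in Python as follows. This is a sketch only: real transform functions are written in DML, and here a raised exception stands in for a failure status. The names `reformat` and `full_name` and the record fields are hypothetical.

```python
def reformat(records, transform):
    """Send each record through `transform`; split results into out/reject."""
    out, reject = [], []
    for rec in records:
        try:
            out.append(transform(rec))
        except Exception as exc:  # failure status: record goes to reject
            reject.append((rec, str(exc)))
    return out, reject


def full_name(rec):
    """Hypothetical transform: combine two fields and drop the rest."""
    return {"id": rec["id"], "name": rec["first"] + " " + rec["last"]}
```

Calling `reformat(records, full_name)` returns the reformatted records together with any records that lacked a required field.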
Join:
1. Reads records from multiple input ports.
2. Operates on records with matching keys using a multi-input transform function.
3. Writes the result to the output port.
PORTS
in, out, unused, reject (optional), error (optional), log (optional)
PARAMETERS
count, key, override key, transform, limit, ramp
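The matching-key behaviour can be illustrated with a small in-memory inner join in Python. This is a sketch under assumptions: two inputs only, records as dicts, and a function name and return shape chosen for the example rather than taken from the JOIN component.

```python
from collections import defaultdict


def join(in0, in1, key, transform):
    """Inner-join two record lists on `key`; also return unmatched records."""
    index = defaultdict(list)
    for rec in in1:
        index[rec[key]].append(rec)

    out, unused0 = [], []
    matched_keys = set()
    for left in in0:
        matches = index.get(left[key])
        if not matches:
            unused0.append(left)  # no partner on the other input
            continue
        matched_keys.add(left[key])
        for right in matches:
            # The two-input transform builds one output record per match.
            out.append(transform(left, right))
    unused1 = [rec for rec in in1 if rec[key] not in matched_keys]
    return out, unused0, unused1
```

The two extra return values play the role of the unused ports: records that found no match on the other input.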
Filter by Expression:
FILTER BY EXPRESSION filters records according to a DML expression.
1. Reads data records from the in port.
2. Applies the expression in the select_expr parameter to each record. If the expression returns:
Non-0 value: FILTER BY EXPRESSION writes the record to the out port.
0: FILTER BY EXPRESSION writes the record to the deselect port. If you do not connect a flow to the deselect port, FILTER BY EXPRESSION discards the records.
NULL: FILTER BY EXPRESSION writes the record to the reject port and a descriptive error message to the error port.
FILTER BY EXPRESSION stops execution of the graph when the number of reject events exceeds the result of the following formula:
limit + (ramp * number_of_records_processed_so_far)
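The routing rules and the reject threshold can be sketched in Python. This is an illustration, not the component's implementation: `None` stands in for NULL, and the sketch assumes the record being examined already counts as processed when the formula is evaluated.

```python
def filter_by_expression(records, select_expr, limit=0, ramp=0.0):
    """Route records to out / deselect / reject; abort on too many rejects."""
    out, deselect, reject = [], [], []
    for processed, rec in enumerate(records, start=1):
        result = select_expr(rec)
        if result is None:  # NULL result: a reject event
            reject.append(rec)
            if len(reject) > limit + ramp * processed:
                raise RuntimeError("reject threshold exceeded")
        elif result:  # non-0 value
            out.append(rec)
        else:  # 0
            deselect.append(rec)
    return out, deselect, reject
```

With limit=0 and ramp=0 the first reject aborts the run; a ramp of 0.01 would tolerate roughly one reject per hundred records processed.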
Normalize:
NORMALIZE generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.
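As a rough Python analogy, normalization takes a function that says how many outputs an input record produces and a function that builds each output. The parameter names `length` and `transform` and the sample fields are assumptions for this sketch.

```python
def normalize(records, length, transform):
    """Emit length(rec) output records for each input record."""
    for rec in records:
        for index in range(length(rec)):
            yield transform(rec, index)


orders = [
    {"cust": "A", "items": ["pen", "ink"]},
    {"cust": "B", "items": []},
    {"cust": "C", "items": ["cap"]},
]
flat = list(
    normalize(
        orders,
        length=lambda rec: len(rec["items"]),
        transform=lambda rec, i: {"cust": rec["cust"], "item": rec["items"][i]},
    )
)
```

Each element of a vector field becomes its own output record, and an input record with an empty vector produces no output at all.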
[Figures: sample records before normalization and after normalization]
DENORMALIZE SORTED:
DENORMALIZE SORTED consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group. DENORMALIZE SORTED requires grouped input. For example, if you have a record for each person that includes the households to which that person belongs, DENORMALIZE SORTED can consolidate those records into a record for each household that contains a variable number of people.
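The household example can be sketched in Python with `itertools.groupby`, which, like DENORMALIZE SORTED, relies on records with the same key arriving next to each other. The function signature and field names are assumptions for illustration.

```python
from itertools import groupby


def denormalize_sorted(records, key, element, vector_field):
    """Collapse each run of equal-key records into one record with a vector."""
    for key_value, group in groupby(records, key=lambda rec: rec[key]):
        members = [rec[element] for rec in group]
        # "count" is an example of an optional summary field.
        yield {key: key_value, vector_field: members, "count": len(members)}


people = [
    {"household": "H1", "person": "Ann"},
    {"household": "H1", "person": "Raj"},
    {"household": "H2", "person": "Li"},
]
households = list(denormalize_sorted(people, "household", "person", "people"))
```

If the input were not grouped, the same household would appear in several output records, which is why grouped input is required.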
[Figures: sample records before denormalization and after denormalization]
Multistage components
Data transformation takes place in multiple stages, following several sets of rules.
Each set of rules forms one transform function.
Information is passed across stages in temporary variables.
Stages include initialization, iteration, finalization and more.
Examples of multistage components are AGGREGATE, ROLLUP and SCAN.
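The staged structure can be sketched in Python as a rollup over grouped input, with one function per stage and a temporary value carried between them. The function and parameter names are assumptions; actual multistage components define these stages as DML transform functions.

```python
from itertools import groupby


def rollup(records, key, initialize, iterate, finalize):
    """Run the three stage functions over each group of equal-key records."""
    for key_value, group in groupby(records, key=key):
        temp = initialize()              # initialization: set up temporary state
        for rec in group:
            temp = iterate(temp, rec)    # iteration: once per record in the group
        yield finalize(key_value, temp)  # finalization: build the output record


sales = [("north", 10), ("north", 5), ("south", 7)]
totals = list(
    rollup(
        sales,
        key=lambda rec: rec[0],
        initialize=lambda: 0,
        iterate=lambda temp, rec: temp + rec[1],
        finalize=lambda region, temp: (region, temp),
    )
)
```

The sketch assumes the input is already grouped by key, as in the sorted variants of these components.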
Partition components:
Data can be partitioned using:
Partition by Round-robin
Broadcast
Partition by Key
Partition by Expression
Partition by Range
Partition by Percentage
Partition with Load Balance
Partition by Round-robin
PARTITION BY ROUND-ROBIN distributes blocks of records evenly to each output flow in round-robin fashion. Suppose you attach four flows to the PARTITION BY ROUND-ROBIN output port, as shown in the following figure. PARTITION BY ROUND-ROBIN writes to Load-1, then Load-2, then Load-3, then Load-4, then back to Load-1 again.
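The round-robin rule is easy to express in Python. In this sketch the partitions are plain lists and `block_size` is an assumed parameter name for the number of records written before moving to the next flow.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Deal blocks of records to the partitions in turn."""
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        partitions[(i // block_size) % n_partitions].append(rec)
    return partitions


loads = partition_round_robin(range(10), 4)
```

The partition sizes differ by at most one block, which is why the method balances well without looking at the record contents.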
Broadcast
BROADCAST arbitrarily combines all records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.
Partition by Key
PARTITION BY KEY distributes records to its output flow partitions according to key values. PARTITION BY KEY does the following:
1. Reads records in arbitrary order from the in port.
2. Distributes records to the flows connected to the out port, according to the key parameter, writing records with the same key value to the same output flow.
PARTITION BY KEY is typically followed by SORT.
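One common way to send equal keys to the same flow is to hash the key and take the remainder. The sketch below uses `zlib.crc32` as a stable hash; this is an assumption made for the illustration, since the source does not say which hash function PARTITION BY KEY uses.

```python
import zlib


def partition_by_key(records, key, n_partitions):
    """Send records with the same key value to the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        # A stable hash: the same key always maps to the same partition.
        digest = zlib.crc32(str(rec[key]).encode("utf-8"))
        partitions[digest % n_partitions].append(rec)
    return partitions
```

Because all records for a key land in one partition, a SORT or ROLLUP on that key can then run independently in every partition.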
Partition by Expression
PARTITION BY EXPRESSION distributes records to its output flow partitions according to a specified DML expression.
Partition by Range
PARTITION BY RANGE distributes records to its output flow partitions according to the ranges of key values specified for each partition. PARTITION BY RANGE distributes the records relatively equally among the partitions. Use PARTITION BY RANGE when you want to divide data into useful, approximately equal, groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted, the output is unsorted. The records with the key values that come first in the key order go to partition 0, the records with the key values that come next in the order go to partition 1, and so on. The records with the key values that come last in the key order go to the partition with the highest number.
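Range partitioning can be sketched with a sorted list of boundary values (splitters) and a binary search. In this sketch a key equal to a splitter goes to the lower partition; that boundary rule and the function signature are assumptions.

```python
from bisect import bisect_left


def partition_by_range(records, key, splitters):
    """Place each record by key range; len(splitters) + 1 partitions result."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        # Index of the first splitter that is >= the record's key.
        partitions[bisect_left(splitters, key(rec))].append(rec)
    return partitions


parts = partition_by_range([15, 3, 42, 10, 27, 20], key=lambda x: x, splitters=[10, 20])
```

Every key in partition 0 is lower than every key in partition 1, and so on, but the records inside a partition keep their input order.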
Partition by Percentage
PARTITION BY PERCENTAGE distributes a specified percentage of the total number of input records to each output flow.
Partition with Load Balance
PARTITION WITH LOAD BALANCE distributes records to its output flow partitions by writing more records to the flow partitions that consume records faster. The output port for PARTITION WITH LOAD BALANCE is ordered.
Method        Key-Based   Balancing                      Uses
Round robin   No          Good                           Record-independent parallelism
Hash          Yes         Good                           Key-dependent parallelism
Function      Yes         Depends on data and function   Application specific
Range         Yes         Depends on splitters           Key-dependent parallelism, Global Ordering
Load-level    No          Depends on load                Record-independent parallelism
De-partition components:
Data can be de-partitioned using:
Gather
Concatenate
Merge
Interleave
Gather
Reads data records from the flows connected to the input port.
Combines the records arbitrarily and writes them to the output.
Concatenate
Appends multiple flow partitions of data records one after another.
Merge
Combines data records from multiple flow partitions that have been sorted on a key.
Maintains the sort order.
Interleave
Combines blocks of records from multiple flow partitions in round-robin fashion. You can use INTERLEAVE to undo the effects of PARTITION BY ROUND-ROBIN.
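Three of these de-partitioning methods differ only in the order in which they read the partitions, which a short Python sketch makes visible. The function names mirror the component names but are otherwise assumptions, and `interleave` uses a block size of one record.

```python
import heapq
from itertools import chain, zip_longest


def concatenate(partitions):
    """Append the partitions one after another."""
    return list(chain.from_iterable(partitions))


def merge(partitions, key):
    """Combine partitions that are each sorted on `key`, keeping the order."""
    return list(heapq.merge(*partitions, key=key))


def interleave(partitions):
    """Take one record from each partition in turn (round-robin)."""
    missing = object()  # marks partitions that have run out of records
    return [
        rec
        for group in zip_longest(*partitions, fillvalue=missing)
        for rec in group
        if rec is not missing
    ]
```

Gather makes no promise about order, so any of these orderings would be a valid gather result.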
6. Parallelism
Parallel Runtime Environment
A parallel runtime environment is one where some or all of the components of an application (datasets and processing modules) are replicated into a number of partitions, each spawning a process. Ab Initio can process data in a parallel runtime environment.
Forms of Parallelism
- Component Parallelism
- Pipeline Parallelism
- Data Parallelism
Parallelism
Data Parallelism
Data is processed on different servers at the same time.
Pipeline Parallelism
Pipeline parallelism occurs when several connected program components on the same branch of a graph execute simultaneously.
Component Parallelism
Components on different branches of a graph working in parallel.
Data Parallelism
Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.
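As a rough illustration of data parallelism, the Python sketch below runs one copy of a processing step per partition at the same time. The names `run_data_parallel` and `score_partition`, and the scoring rule itself, are hypothetical. Threads are used only to show the structure; in CPython they do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(partition):
    """Hypothetical component: score each customer by the length of the name."""
    return [(customer, len(customer)) for customer in partition]


def run_data_parallel(partitions, component):
    """Run one copy of component per partition, all copies at the same time."""
    # Assumes at least one partition; max_workers must be positive.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in partition order.
        return list(pool.map(component, partitions))
```

The result has one output partition per input partition, so the data stays partitioned for the next component.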
Global View:
Pipeline Parallelism
The following graph divides a list of customers into two groups, GOOD CUSTOMERS and OTHER CUSTOMERS. The SCORE component assigns a score to each customer in the CUSTOMERS dataset, then the SELECT component directs each customer to the proper group based on that score.
Processing Record: 55
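The SCORE and SELECT example can be sketched in Python with two threads connected by a small queue, so that SELECT starts consuming records while SCORE is still producing them. This is a sketch only: the scoring rule, the threshold, and all function names are assumptions, not part of the graph described above.

```python
import queue
import threading

_DONE = object()  # end-of-flow marker passed down the pipeline


def score_stage(customers, out_q):
    """SCORE: attach a score to each customer and pass it downstream."""
    for name, spend in customers:
        out_q.put((name, spend // 100))  # hypothetical scoring rule
    out_q.put(_DONE)


def select_stage(in_q, good, other, threshold):
    """SELECT: route each scored customer to the proper group."""
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        (good if item[1] >= threshold else other).append(item)


def run_pipeline(customers, threshold=5):
    flow = queue.Queue(maxsize=2)  # small buffer keeps both stages busy
    good, other = [], []
    stages = [
        threading.Thread(target=score_stage, args=(customers, flow)),
        threading.Thread(target=select_stage, args=(flow, good, other, threshold)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return good, other
```

Because the queue is bounded, neither stage waits for the other to finish the whole dataset; each record moves on as soon as it has been scored.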
Component Parallelism
The following graph takes the CUSTOMERS and TRANSACTIONS datasets, sorts them, then merges them into a dataset named MERGED INFORMATION. Because the SORT CUSTOMERS and SORT TRANSACTIONS components are on different branches of the graph, they execute at the same time, creating component parallelism.
Sorting Transactions
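A minimal Python sketch of the same shape: two independent sorts run at the same time on separate branches, and their outputs are then merged. The function name is hypothetical, and the merge is modelled as a sorted merge on a shared key, which is an assumption about what the graph's merge step does.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def sort_and_merge(customers, transactions, key):
    """Sort both inputs concurrently, then merge the two sorted flows."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two sorts do not depend on each other, so they can overlap.
        sorted_customers = pool.submit(sorted, customers, key=key)
        sorted_transactions = pool.submit(sorted, transactions, key=key)
        return list(heapq.merge(sorted_customers.result(),
                                sorted_transactions.result(),
                                key=key))
```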
/Projects
Create a new graph: go to File > New, then File > Save As (e.g., my_graph) and save it in the appropriate sandbox so that the new graph picks up the proper environment.
- A port is a connection point that allows data to flow into or out of a component. Most components have at least one port.
- The data streaming into or out of a component is called a flow.
Adding Flows
Enter expression
The Dedup component removes duplicate records. The dedup criterion is one of unique-only, first, or last.
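The three dedup criteria can be illustrated with a short Python sketch. It assumes the input is already grouped (sorted) on the key, and the function name and the `keep` parameter spelling are hypothetical choices for this example.

```python
from itertools import groupby


def dedup_sorted(records, key, keep="first"):
    """Remove duplicates from records that are already sorted on key."""
    out = []
    for _, group in groupby(records, key=key):
        group = list(group)
        if keep == "first":
            out.append(group[0])       # keep the first record of each key group
        elif keep == "last":
            out.append(group[-1])      # keep the last record of each key group
        elif keep == "unique-only":
            if len(group) == 1:        # keep only keys that occur exactly once
                out.append(group[0])
        else:
            raise ValueError(f"unknown keep value: {keep!r}")
    return out
```

Note that unique-only drops every record of a duplicated key, while first and last keep exactly one record per key.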
9. MULTIFILES
Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the partitions of the multifile. Understanding the concept of multifiles is essential when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application. An Ab Initio multifile organizes all partitions of a multifile into one single virtual file that you can reference as one entity.
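The idea of one virtual file backed by several partition files can be sketched in Python as follows. This is a conceptual model only: the class `MultifileSketch` and its methods are hypothetical and do not reflect how Ab Initio stores or addresses multifiles.

```python
from pathlib import Path


class MultifileSketch:
    """One logical file whose data lives in several partition files."""

    def __init__(self, partition_paths):
        # The partitions may sit on separate disks; here they are plain paths.
        self.partition_paths = [Path(p) for p in partition_paths]

    @property
    def degree_of_parallelism(self):
        return len(self.partition_paths)

    def write_round_robin(self, lines):
        """Spread the lines over the partitions, one line at a time."""
        buckets = [[] for _ in self.partition_paths]
        for i, line in enumerate(lines):
            buckets[i % len(buckets)].append(line)
        for path, bucket in zip(self.partition_paths, buckets):
            path.write_text("".join(line + "\n" for line in bucket))

    def read_partition(self, index):
        """Read one partition, as a single parallel worker would."""
        return self.partition_paths[index].read_text().splitlines()

    def read_all(self):
        """Read the whole multifile as one entity, partition by partition."""
        return [line
                for i in range(self.degree_of_parallelism)
                for line in self.read_partition(i)]
```

A caller refers to the single object, while each parallel copy of a component would read only its own partition through `read_partition`.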
Multifile Commands
m_mkfs m_mkdir m_ls m_expand m_dump m_cp m_mv m_touch m_rm
m_mkfs
Creates a multifile system rooted at mfsurl and having as partitions the new directories dir-url1, dir-url2, ...
$ m_mkfs //host1/u/jo/mfs3 \
    //host1/vol4/dat/mfs3_p0 \
    //host2/vol3/dat/mfs3_p1 \
    //host3/vol7/dat/mfs3_p2
$ m_mkfs my-mfs my_mfs_p0 my_mfs_p1 my_mfs_p2
m_mkdir
Creates the named multidirectory. The url must refer to a pathname within an existing multifile system.
m_ls
Lists information on the files or directories specified by the urls. The information presented is controlled by the options, which follow the form of ls.
m_dump
Displays the contents of files or multifiles, or selected records from them, similar to View Data in the GDE.
$ m_dump simple.dml simple.dat -start 10 -end 20
$ m_dump simple.dml -describe
$ m_dump simple.dml simple.dat -end 1 -print 'id*2'
$ m_dump -help
m_cp
Copies files or multifiles that have the same degree of parallelism. Behind the scenes, m_cp builds and runs a small graph, so it can copy from one machine to another, provided Ab Initio is installed on both.
$ m_cp foo bar
$ m_cp mfile:foo \
    mfile://OtherHost/path/to/the/mdir/bar
$ m_cp mfile:foo mfile:bar \
    //OtherHost/path/to/the/mdir
m_mv
Moves a single file, multifile, directory, or multidirectory from one path to another path on the same host. The move is done by renaming; it does not actually move data.
$ m_mv foo bar
$ m_mv mfile:foo mfile:/path/to/the/mdir/bar
m_touch
Creates an empty file or multifile in the specified location. If some or all of the data partitions already exist in the expected locations, they will not be destroyed.
$ m_touch foo
$ m_touch mfile:/path/to/the/mdir/bar
Other Commands
m_kill
Kills a running job. It should be executed by the user who started the job, from the node where the job was launched, and must be given the name of the job's recovery file.
$ m_kill my_graph.rec
m_eval
Evaluates a DML expression outside a graph. It is useful for quickly testing or debugging a complex expression.
$ m_eval "1+1"
2
$ m_eval "reinterpret_as(record string('|') f1,f2,f3; end, 'a|b|c|').f2"
"b"
10. Performance Tuning
What is Good Performance?
- Minimizing wall clock time
- Minimizing overall CPU usage
- Minimizing memory usage
- Minimizing disk usage
Parallelism
- Go parallel as soon as possible. Ask yourself why any serial input isn't followed immediately by a Partition component.
- Once data is partitioned, do not bring it down to serial and then partition it back to parallel. Repartition instead.
- For very small processing jobs (hundreds or thousands of records, runtime in minutes), serial may be better because of reduced startup costs.
Serial Inputs
If you need to reformat serial input data to find the true partition key, do not do this serially. Instead, do this:
- Do not access large files across NFS. Use Ab Initio to transfer the data instead, or an FTP component.
- Use Ad Hoc Multifiles to read many serial files (with the same record format) in parallel.
- To read a very large number of input files in parallel, use Ad Hoc Multifiles and a fan-in flow to a Concatenate.
- M must evenly divide N. Pad the file list with /dev/null if it doesn't.
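The padding rule can be sketched as a small Python helper. The slide does not define M and N; this sketch assumes N is the number of files in the list and M is the number of partitions they are spread across. The function name is hypothetical.

```python
def pad_file_list(files, m, filler="/dev/null"):
    """Pad files with filler entries until its length is a multiple of m."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    # Number of entries missing to reach the next multiple of m (0 if already even).
    shortfall = (-len(files)) % m
    return list(files) + [filler] * shortfall
```

For example, five input files spread over four partitions would be padded with three /dev/null entries, giving eight entries in total; the empty filler files contribute no records.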
Phase breaks (and checkpoints)
- Often, phase breaks will not add to wall clock time, since the graph will be mostly CPU-bound and some additional I/O will not be an issue.
- Phase breaks let you allocate more memory to individual components. Visualize what happens in each component. Separate components that would benefit from using large amounts of memory.
- Try to avoid landing multiple copies of the same data to disk in a phase break after a Replicate component.
Record Formats
- In general, completely fixed-format records take less CPU to process than variable-length records.
- Drop fields that aren't needed as soon as possible. This is often done for free in transform components.
- Flatten out conditional fields as soon as possible.
- Often, conditional fields are used to store multiple record types in a single format. Split these into separate processing streams as soon as possible. Join them back at the end of the graph, if required.
Sorting
- If you can't make all your fields fixed length, you can still benefit from having the key fields fixed length and at the beginning of the record.
- If your data is sorted by a primary key and you need to re-sort by a secondary key, use the Sort within Groups component.
- If you wish to checkpoint near a sort that will land data on disk, consider a Checkpointed Sort component instead.
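Re-sorting by a secondary key when the data is already sorted on the primary key only needs to sort inside each primary-key group, which is cheaper than a full sort. A minimal Python sketch of that idea, with a hypothetical function name and key arguments:

```python
from itertools import groupby


def sort_within_groups(records, major_key, minor_key):
    """Sort on minor_key inside each run of equal major_key values.

    Assumes records are already sorted (grouped) on major_key.
    """
    out = []
    for _, group in groupby(records, key=major_key):
        # Only one group is held in memory at a time, not the whole dataset.
        out.extend(sorted(group, key=minor_key))
    return out
```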
In-Memory Components
- Join, Rollup, and Scan can operate either in-memory or on sorted data.
- If your data does not fit in memory, and you need to do multiple joins or rollups on the same key, it will be most efficient to sort once and set the rollups and joins to expect sorted input.
- In-memory components run efficiently when there is enough memory allocated to them. If the data volume grows until these components need to drop their data to disk, performance may suddenly decrease one day.
- A graph that relies on sorted data and does not use in-memory components will have more uniform performance characteristics as data volume grows.
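The trade-off between the two modes can be seen in a small Python sketch of a rollup (a sum per key). The function names are hypothetical. The in-memory version holds one entry per distinct key, while the sorted-input version holds only the current group, which is why its memory use stays flat as volume grows.

```python
from itertools import groupby


def rollup_in_memory(records, key, value):
    """Accepts unsorted input; memory grows with the number of distinct keys."""
    totals = {}
    for rec in records:
        k = key(rec)
        totals[k] = totals.get(k, 0) + value(rec)
    return totals


def rollup_sorted(records, key, value):
    """Requires input sorted on key; only one group is in flight at a time."""
    for k, group in groupby(records, key=key):
        yield k, sum(value(rec) for rec in group)
```

If the data must be sorted anyway for several joins or rollups on the same key, the one sort is paid once and every downstream step can use the streaming form.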
Exceeding max-core
- If an in-memory Join cannot fit its non-driving inputs (plus overhead) in the provided max-core, it will drop all the inputs to disk.
- Similarly, Sort and Rollup will drop all their data to disk if max-core does not fit all the data plus overhead.
- It is better to set max-core too low than too high, because a value that is too high risks OS swapping. Ab Initio does a better job than the OS at staging working data to disk.
Reduce number of records
- Use Rollup or Filter by Expression as soon as possible if they will reduce the number of records being processed.
- Join as early as possible if this will reduce the number of records being processed.
- Join as late as possible if this will increase the number of records or the width of records being processed.