Download as DOCX, PDF, TXT or read online from Scribd
Download as docx, pdf, or txt
You are on page 1of 8
Azure Data Factory Syllabus
This course covers basic to advanced ETL concepts, Data
Warehouse (DWH), Data Mashups, Data Flow concepts using SQL server and Azure SaaS components, Azure SQL Database Migrations, Azure Storage, Azure Data Warehouse (ADW), Incremental Loads, Power Query, Azure Data Lakes and 4 Real- time projects in Manufacturing, Healthcare, Banking and BA domains.
Main topics highlighted in the course
Azure Data Factory Azure Databricks Azure Synapse Azure Cosmos DB ADF Resources, Monitor Power Query in ADF Azure Storage Explorer Azure Data Explorer Data Migrations Big Data Storage Performance Tuning Security Management Prepping, Ingestions Spark Clusters Python, Scala, ETL tools, Staging
Chapter 1: Cloud Basics, Azure SQL
Cloud Introduction and Azure Basics Azure Implementation: IaaS, PaaS, SaaS Azure Data Engineer: Job Roles Azure Storage Components Azure ETL & Streaming Components Need for Azure Data Factory (ADF) Need for Azure Synapse Analytics Azure Resources and Resource Types Azure Account, Subscription (Free) Azure SQL Server [Logical Server] Firewall Rules and Azure Services Azure SQL Database Deployment Azure SQL Pool Deployment Compute: DTU Versus DWU Test Connections from SSMS Chapter 2: Synapse SQL Pools (DWH) Dedicated SQL Pools in Azure Enterprise Data Warehouse with Synapse Massively Parallel Processing (MPP) Control Nodes and Compute Nodes DMS: Data Movement Service Start/Resume/Pause & Scaling SQL Pool Config @ TSQL Scripts Start/Resume/Pause, Scaling Options Table Creations @ TSQL Scripts Table Partitions: Left & Right Distributions: Round Robin, Hash Distributions: Replicate and Usage Auto Indexing & Column Store Planning for Big Data Loads Need for ADF: Azure Data Factory Chapter 3: Azure Data Factory Concepts Azure Data Factory (ADF) Concepts Hybrid Data Integration at Scale ADF Pipelines: Architecture Integration Runtime (IR) & Use Linked Services and Datasets Pipeline Design: Activities Copy Data Tool, Data Flow Pipeline Triggers and Schedules ADF Pipeline with Copy Data Tool Azure SQL DB to Synapse Data Loads Working with Multi Tables Data Loads Creating Linked Services, Datasets Basic Data Loads: Publish, Trigger Copy Method: Bulk Insert DIU: Data Integration Units Chapter 4: On-Premise Data Loads Copy Data Tool for ETL Operations On-Premise Data Sources with Azure Self-Hosted Integration Runtime (IR) Access Keys, Remote Linked Services Synapse SQL Pool (DW) with On-Premise Staged Data Copy and Performance Pipeline Executions and Monitoring Pipeline RunIDs and Audits / Tracing Creating Azure Storage Account Storage Container, BLOB File Uploads DIU Allocations and Concurrency Pipeline Trigger, Author, and Monitor Staging with Storage Account, Container Polybase For Azure Synapse, Advantages Pipeline Execution: DIU & DOCP Chapter 5: Incremental Loads with ADF Incremental Loads with Files (BLOB) Pipeline Executions and Schedules Regular Schedules and Tumbling Window Execution Retry and Delay Options Binary Copy, Last Modified Date in Blob Automated Loops and Trigger Schedules Incremental Loads Verification Tests Incompatible Rows Skips, Fault Tolerance Database Tables: Incremental Loads Copy Method: UPSERT, Business Keys ETL Staging Advantages & Performance ADF Pipelines: Execution Settings ADF Logging Options, Consistency Check Compression Option, DOP and DOCP ADF Pipeline Triggers and Monitoring
Chapter 6: ADF Data Flow – 1
Data Flow Task, Data Flow Activity Transformations with Data Flow Spark Cluster for Debugging Cluster Node Configurations Spark Cluster Types & Sizing Transaction Optimized – Capacity Memory Optimized – Capacity Data Cleansing with ADF Data Orchestration with Data Flow SELECT Transformation & Options Conditional Split Transformation UNION, SELECT Transformation Spark Cluster for Pipeline Executions Pipeline Monitoring & Run IDs Adding Data Flow into Pipelines Chapter 7: ADF Data Flow – 2 ADF Pipelines for ETL Operations Data Flow Tasks and Activities in Synapse JOIN & EXISTS Transformations Aggregate & Group by Transformations Window Functions & Rank in Data Flow Rank / Dense Rank / Row Number Derived Column Transformation Lookup, Surrogate Key, Parse Type Convert, Cast Transformations Reusing Data Flow Tasks in Synapse Pipeline Validations & Executions Inline Datasets, Schema Drift Data Deduplication with ADF DFT Optimization Techniques Data Flow Task – Staging, Logging Chapter 8: Azure Synapse Analytics Azure Synapse Analytics Resource Azure Synapse Analytics Workspace Managed Resource Group, SQL Account Synapse Workspace & Synapse Studio Operations with Synapse Workspace ADLS Gen 2 Storage Account, Container Synapse Studio: Scripts & Pipelines Dedicated SQL Pools: Creation, Use Synapse Tables, Data Loads with TSQL COPY INTO Statements with T-SQL Row Terminator and Compressions T-SQL Queries and Aggregations Aggregation Data Loads in Synapse Creating Synapse Pipelines with TSQL Stored Procedure Activity & Triggers Chapter 9: Synapse Analytics with Spark Synapse Pipelines: Performance Advantages Pivot Transformation for Normalization Generating Pivot Column, Aggregations Pivot Transformation and Pivot Settings Pivot Key Selection, Value and Nulls Pivoted Columns and Column Pattern Column Prefix, Help Graphic & Metadata Denormalized Data and Aggregations Apache Spark Pool in Azure Synapse Spark Cluster Nodes: Vcores, Memory Notebooks: Purpose, Usage Options Python Notebooks for Remote Access Creating Databases in Apache Spark Pool Data Loads from Dedicated SQL Pools PySpark Code for Data Operations, Writes Chapter 10: Synapse Security & Parameters Azure Active Directory (AAD) Users, Groups IAM: Identity & Access Management Synapse Workspace Security with RBAC ADF Security with RBAC: Owner, Contributor Azure Synapse SQL Pool Security: Logins Creating SQL Logins & Users: master SQL Users in Azure SQL DB and SQL Pool Grant, Control, Revoke: Security Roles Parameters – Creation and Use in Pipelines Dynamic Connections with Credentials User Name and Password Connectivity Dynamic Dataset Configurations Pipeline Expressions with Parameters Resource Classes and Usage with SQL Pool
Chapter 11: Change Data Capture (CDC)
Change Data Capture (CDC) Data Loads Incremental Loads with CDC Types SQL Server CDC: ETL Load Dates Run Mode Options and CDC Types Output Pipeline Expression, Data Window Azure SQL DB Destinations, Watermarks JSON Parameters, Pipeline Scheduling Pipeline Validation, Trigger, Monitoring Synapse SQL Pool: Data Loads (DWH) ETL Optimization Techniques SQL Pool (Synapse) Optimizations Pipeline Optimization Techniques Chapter 12: Pipeline Monitoring, Security Azure Monitor Resource and Usage Pipeline Monitoring Techniques ADF: Pipeline Monitoring and Alerts Synapse: Pipeline Monitoring and Alerts Synapse: Storage Monitoring and Alerts Conditions, Signal Rules, and Metrics Email Notifications with Azure Serverless Pool in Azure Synapse Connections, Usage with Serverless Pool Using Azure Open Datasets in Synapse OPENROWSET and BULK Data Loads Working with Parquet Files in Synapse Python Notebooks (Pyspark) in Synapse