DP 900

Download as pdf or txt
Download as pdf or txt
You are on page 1of 229
At a glance
Powered by AI
The document discusses concepts related to working with data in Azure including relational and non-relational data, data analytics techniques, Azure Data Factory pipelines and triggers, and Power BI components.

The document discusses descriptive, diagnostic, predictive, prescriptive and cognitive analytics techniques.

The document discusses scheduled, tumbling window, and event-based triggers that can be used in an Azure Data Factory pipeline.

DP-900

Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


August 2020

This course is regularly


updated! This might not be
the most recent version.

Get the latest videos at


http://sjd.ca/dp900
Introduction
Beginning to
work with data
in the cloud
DP-900
Optional
certification
Less expensive
than other
certifications
Can lead to
Azure DBA or
Azure Data
Engineer paths
Microsoft Azure
focus
Core Data
Concepts
(15-20%)
Relational Data
(25-30%)
Non-Relational
Data (25-30%)
Data Analytics
(25-30%)
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Core Data
Concepts
(15-20%)
Data Workloads
Batch Data
Examples of Batch Formats
CSV (comma-separated) or TSV (tab-separated) file

JSON or XML

Apache Parquet - more efficient than CSV or JSON

Blob files

Another database

Cache for offline viewing


Streaming Data
Examples of Streaming Data
Event Hub or IoT Hub

Blob storage - Log processing

Apache Kafka

Netflix or YouTube video

This course video


Characteristics
of Relational
Data
Characteristics
Tables - rows and columns

Views - a query that runs on a table that can be itself queried

Primary Key (PK) uniquely identifies a row

Relationships between the tables using a foreign key - parent-child

Database enforces integrity - referential integrity


DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Data Analytics
Data
Visualization
Exam tip:
Microsoft is not
going to ask you
what chart
format is best
The Problem of Too Much Data
Increasingly easier and cheaper to collect data

Hard to see trends and insights from raw numbers

Data visualization let’s you distill thousands or millions of rows of data into an easily
digestible format

Let’s you see relevant trends


Chart Types
http://sql.bi/visual-reference
Analytics
Techniques
Exam tip: Pay
particular
attention to this
5 Types of Analytics
● Descriptive

● Diagnostic

● Predictive

● Prescriptive

● Cognitive
Descriptive Analytics
What happened?

● Today’s sales
● Service appointments booked
● Net margins

Summary of existing data records

“Hindsight”
Diagnostic Analytics
Why did it happen?

Drill down into the data

● Change to sales per location


● Comparing one region against another over time
● Customers who had their problems solved on the first visit vs customers who
had to come back for 2 or more visits
Predictive Analytics
What will likely happen in the future?

Based on past trends

● Projected sales reports


● Azure Portal predicts your bill at the end of the month
● Weather prediction
Prescriptive Analytics
What should I do?

Advise on best approach for maximum success

● Google Maps navigation


● Recommendations - If you liked this movie, you might like that one
● Search Engine Optimization tools
Cognitive Analytics
Artificial intelligence and machine learning

Analyzing data to come up with a “model” of how the world works

Makes predictions based on that model

Learn and improve over time

● Reading Twitter to determine brand sentiment


● Self-driving car
ELT and ETL
ELT and ETL
A general name for the style in which you take data (from outside a system) and get
it into a system ready for use by the system

Very rarely is data from outside 100% formatted exactly how you need it

Extraction is how you get the data from outside

Loading is the action of getting that data into a database

Some Transformation might be required to modify columns type, format, order, and
even generate new data (i.e. fname + ‘ ‘ + lname = full_name)
ELT
E - Extract

L - Load

T - Transform
ETL
E - Extract

T - Transform

L - Load
Data Processing
Two types...

● Batch processing

● Stream processing
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Relational Data
(25-30%)
Azure Relational
DBs
Relational DB Services
● SQL Server in a VM
● SQL Managed Instance
● Azure SQL Database
● Azure Database for MySQL, PostgreSQL, or MariaDB
SQL Server in a VM
Guaranteed compatibility to SQL Server on premises

You manage everything - OS upgrades, software upgrades, backups, replication

Push the maximum performance out of the CPU through expert tweaks

No limitations (i.e. can support data above 4TB)

Pay for the server and licensing, not per DB


SQL Managed Instance
Close to 100% compatibility to SQL Server on premises

Fully-managed service

4 to 80 vCores

32 GB to 8 TB storage
Azure SQL Database
Close to 100% compatibility to SQL Server on premises

Lots of options for provisioned and serverless databases

Pay for performance or pay for hardware

2 to 80 vCores

5 GB and 4 TB storage

Starting at $5 per month


Azure Database
Managed versions of MySQL, PostgreSQL, and MariaDB

Azure manages the servers and software upgrades

Compatible with your legacy systems


Relational Data
Structures
How is data stored?
Relational data is stored in “tables”

Tables have rows and columns

Like an Excel spreadsheet

The columns of a table are defined in advance, called a schema


Tables
Typically, they are intended to store a single “type” of data

“Employees”, “Orders”, “Products”, “OrderDetails”

Best practice is for every table to have a “primary key”, also called an index

This is often an “ID” field when no other suitable column exists


Example - Employees
EmployeeID - int (PRIMARY KEY)

FirstName - string

LastName - string

PhoneNumber - string (NULL)

DepartmentID - int (FOREIGN KEY) ← Another table called Departments

ManagerID - int (FOREIGN KEY) ← Points back to the Employees table (recursive)

DOB - string (UNIQUE KEY)


Example - Employees
EmployeeID - int (PRIMARY KEY)

FirstName - string

LastName - string

PhoneNumber - string (NULL)

DepartmentID - int (FOREIGN KEY)

ManagerID - int (FOREIGN KEY)

DOB - string (UNIQUE KEY) ← date of birth cannot be unique


Indexes
Indexes are used to improve the performance of queries

The primary key is by default an index

The data of the table is physically stored by primary key sorted order

But you can define other indexes if it’s common to query on them

It might be common to query on ManagerID, and that could be an index


Views
A view is like a table - you can run a query on it

But the data it returns is based on another table

It can be a simplified view of a table, or a more complex one


Manager View
SELECT Emp.EmployeeID, Emp.FirstName, Emp.LastName,
Manager.EmployeeID, Manager.FirstName, Manager.LastName

FROM Employees Emp

INNER JOIN Employees Manager ON


Manager.EmployeeID = Emp.ManagerID
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Azure
Relational DBs
IaaS, PaaS and
SaaS
SQL Server in a
VM
SQL Server in a VM
Virtual Machine options - Windows or Linux

SQL Server options - 2008, 2012, 2014, 2017, 2019

Tiers - Free Developer, Web, Standard, Enterprise


Advantages
Guaranteed to act identically to your on premises DB

No retraining, no retooling

Just change the connection string


Azure SQL
Database
Azure SQL Database
Uses SQL Server engine underneath

Designed to work in the cloud

Easy to grow to larger plans as your needs grow without downtime

Additional tools like advisors to tune your DB


Types
Single Database - allocate resources to a specific database

Elastic Database - allocate resources to a group of databases to share (pool)


Advantages
Mostly compatible with SQL Server

You can scale easily

Modern, cloud approach


Azure Synapse
Analytics
(SQL DW)
Synapse Analytics
Used to be called SQL Data Warehouse (SQL DW)

Evolution past DW

Serverless on-demand or provisioned

A database designed for reporting and analytics


Azure Database
for PostgreSQL,
MySQL,
MariaDB
Other Azure Database Options
Azure is adding other database engines, as a service

PostgreSQL, MySQL and MariaDB

Instead of you hosting those and maintaining them

Allows you to maintain your existing solution and migrate to the cloud
SQL Managed
Instance
SQL Managed Instance
Aims to bridge the gap between Azure SQL Database and SQL Server in a VM

Let Microsoft manage the server and the database engine software

Compatible with SQL Server

Not a common option, in my view


DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Managing
Relational Data
Provisioning
and Deployment
Deployment Options
Create through the Azure Portal
ARM Templates
Data Security
Connectivity
Query Tools
Query Editor (Portal)
Query editor built into the portal

Access from the SQL Database database module

Execute queries
Azure Data Studio
Cross-platform software to work with SQL Server relational data

- Windows, Mac OS, and Linux

Intellisense, code snippets, source control, integrated terminal

Charting of query results

Open source, free


SQL Server Management Studio (SSMS)
The “OG” of SQL Server management

Supports database administration tasks

- User management, vulnerability assessment, security features


- Performance tuning advisors
- Import and export of DACPAC files
sqlcmd
Command line utility

Execute T-SQL statements, stored procs, and script files

Windows and Linux options (RHEL, Ubuntu, SUSE)

MacOS in preview

Can run in a docker container


DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


SQL
DDL vs DML
Data Definition Language (DDL)
The data management SQL commands around database structures

CREATE, ALTER, DROP, TRUNCATE, COMMENT, RENAME


Data Manipulation Language (DML)
The SQL commands around data manipulation

SELECT, INSERT, UPDATE, DELETE, LOCK TABLE


Data Control Language (DCL)
Granting rights and permissions to others

GRANT and REVOKE


Querying Other
Relational DBs
Structured Query Language (SQL)
First developed as SEQUEL in the early 1970s

Had to change to SQL due to an existing trademark

SQL Standard first formalized in 1986 as SQL-86, now up to SQL:2019


Vendors!
Each database engine has its own version of SQL that extends the standard for its
own purpose, usually to add programming elements (stored procs!)

Most vendor version of SQL are not 100% compatible with the standard

SQL Server uses T-SQL, Oracle uses PL/SQL, MySQL uses SQL/PSM, PostgreSQL
uses PL/pgSQL

They are not often compatible - an Oracle PL/SQL command would not work
unaltered on SQL Server
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Non-Relational
Data (25-30%)
Non-Relational
Data
What is a Non-Relational Database?
Technically: any database that is not based on tables, rows and columns

Often, the way the data is stored better supports the application’s use
Not only SQL
(NoSQL)
The internet has
changed the way
we package data
to send around
Facebook - 1994
Google - 1998
Twitter - 2006
HTML - 1991
XML - 1998
JSON - 2005+
There are 200
billion tweets
per year
Imagine trying
to run a search
query on a table
with 1 trillion
rows...
Non-relational
databases are
optimized for
different uses
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Data types
Document Data - Cosmos DB Core SQL

*** Used to be named DocumentDB


Column-Family Data - Cosmos DB
Cassandra API

*** Also HBase in HDInsight


Key-Value Data - Cosmos DB Table API

*** Also Table Storage, Redis Cache


Graph Data - Cosmos DB Graph API
Time Series Data - Time Series Insights

*** Also OpenTSDB with HBase on HDInsight


Object Data - Blob Storage

*** Also Data Lake Storage, File Storage


Azure Search
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


How to choose?
One Ring to Rule Them All...
There is no single data store that is best for all use cases

Sometimes you use more than one type


Benefits of Relational Databases
● Normalization reduces duplication, reduces mistakes

● Data relationships drive data insights

● Data integrity

● Many of the established products are decades old, bug free

● Business intelligence tools like Power BI makes the data accessible

● Database enforces constraints, fixed schema


Benefits of NoSQL Databases
● Optimized for specific data types

● Can be designed for performance at internet scales

● Might not require massive hardware, reduced expense

● Open source
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Azure
Non-Relational
DBs
Azure
Non-Relational
Options
Cosmos DB
● Enterprise-grade non-relational database
● Supports many data models (graph, document, table, column-family)
● Compatible with established APIs (Cassandra, MongoDB, Gremlin, Etcd)
● Select the data consistency you need
● Easy to scale worldwide
● Sub-10 ms latency
● Service Level Agreements for throughput, latency, availability and consistency
Table Storage
● Azure Storage account
● 5 PB maximum limit
● $0.045 per GB - cheapest data storage option on Azure
● Plus pay for operations - $0.00036 per 10,000 transactions
● Service Level Agreement shockingly poor (10 seconds for a query)
Blob Storage
● Azure Storage account
● 5 PB maximum limit
● $0.0208 per GB - cheapest data storage option on Azure
● Supports premium, hot, cool, archive access tiers
● Supports “reserved capacity” - ~$0.014 per GB for 100TB/3 years
● Supports blob indexing
● Plus pay for operations - $0.00036 per 10,000 transactions
● Service Level Agreement shockingly poor (2 seconds per MB)
File Storage
● Azure Storage account
● 5 PB maximum limit
● $0.06 per GB - cheapest data storage option on Azure
● Standard and premium options
● Service Level Agreement shockingly poor (2 seconds per MB)
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Management
Provisioning
and Deployment
ARM Templates
Data Security
Connectivity
Management
Tools
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Data Analytics
(25-30%)
Analytics
Workloads
Types of Workloads
There are three main (relational) database workloads:

● As a place to store business transactions as they occur (OLTP)


● As a place to hold data for complex analysis (OLAP)
● As a centralized repository for data from different sources (data warehouse)
OLTP - Online Transaction Processing
● Most business applications require a place to store and retrieve data
● As business transactions occur, they are recorded to the database
● Existing rows can be updated
● Data can be retrieved by SQL queries
● Optimized for general use
Traits of OLTP
● Database normalization
● Schema heavily enforced, data integrity
● Strong consistency
● Heavy writes, moderate reads
● Updateable
● Data size MBs to TBs
Azure OLTP
● Azure SQL Database
● SQL Server in a VM
● Azure Database for MySQL
● Azure Database for PostgreSQL
OLAP - Online Analytical Processing
● Data stored in a transactional database was not designed for complex analysis
● Data stored in a transactional database can change at any time
● Running complex reports can slow down a transactional database
● Can take time to prepare the data for analysis
● Cubes, dimensions, measures
Traits of OLAP
● No locking
● No updates
● Heavy reads, read-only
● Multi-dimensional indexing
● Data size GBs
Azure OLAP
● SQL Server with Columnstore indexes
● Azure Analysis Services
● SQL Server Analysis Services
Data Warehousing
● Central repository of data from one or more different sources
● Current and historical data used for reporting and analysis
● Can rename or reformat columns to make it easier for users to create reports
● Users can run reports without affecting the day-to-day business data systems
Azure Data Warehousing
Symmetric Multiprocessing (SMP):

● Azure SQL Database


● SQL Server in a VM

Massively Parallel Processing (MPP):

● Azure Synapse Analytics (SQL DW)


● Apache Hive on HDInsight
● Interactive Query (Hive LLAP) on HDInsight
When to Use a DW
When queries are long running or affect day-to-day operations

When data needs further processing (ETL or ELT) before it can be analyzed

When you want to remove historical data from your day-to-day systems (archiving)

When you need to integrate data from several sources

When users are confused by the data structures, table names or column names
when building reports in PowerBI
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Data Warehouse
Describe the
components of a
modern data
warehouse
Data Sources
All data comes from somewhere else

One or more data sources

It can be structured data - existing SQL databases

It can be unstructured data - CSVs, JSON, log files


Azure Data Factory
Azure Data Factory is a hybrid data integration
service that allows you to create, schedule and
orchestrate your ETL / ELT workflows

Azure Data Factory moves the data from outside of


Azure to inside of Azure
Azure Blob Storage / Data Lake
Azure Blob Storage is a massively scalable object
storage for any type of unstructured data - images,
videos, audio, documents and other

Azure Data Lake is a type of blob storage that is


designed to handle even larger amounts of data

The unprocessed data is stored here


Azure Databricks
Azure Databricks is a fast, easy and
collaborative Apache Spark-based analytics
platform

Databricks allows you to manipulate data at


large scales, by linking to one data source (like
Azure Data Lake), modifying the data, and
storing it in another data source (like Azure
Synapse Analytics / SQL DW)
Azure Synapse Analytics
Azure Synapse Analytics is the fast, flexible
and trusted cloud data warehouse that lets
you scale, compute and store elastically and
independently, with a massively parallel
processing architecture

Data is optimized for read-only queries


Azure Analysis Services
Azure Analysis Services is an enterprise grade
analytics as a service that lets you govern, deploy,
test, and deliver your BI solution with confidence

Data is optimized for complex queries with cubes


and dimensions
Power BI
Power BI is a suite of business intelligence
tools that make is easier for users to look at
data, analyze it, and create reports. They can
then publish them to the organization to
consume on the web or on mobile devices.
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Data Ingestion
and Processing
The job of
“Data Engineer”
Data Engineer
Azure data engineers are responsible for data-related implementation tasks that
include:

● provisioning data storage services;


● ingesting streaming and batch data;
● transforming data;
● implementing security requirements;
● implementing data retention policies;
● identifying performance bottlenecks; and
● accessing external data sources.
Move on to
ingesting and
transforming
data
Azure Data
Factory
Data Factory = Data Orchestration
Brings data from external sources…

… through a series of transformations

… into an end data store


Data Factory Features
Orchestration platform - data workflows from A to B

Data transformation

Create and schedule jobs

Cloud version of SQL Server Integration Services (SSIS)


Host SSIS Packages in the Cloud
Data Factory Activities
Activities define actions to perform on your data.

● Data movement activities


● Data transformation activities
● Control activities
Data Factory Pipelines
Pipeline - A logical grouping of activities to perform some task

A data factory can contain multiple pipelines

Pipelines can perform their tasks sequentially (serial) or perform several tasks in
parallel
Pipeline Triggers
Pipeline Run - an instance of a pipeline execution

Pipelines can be run manually, or started using a trigger

Trigger - an event that causes a pipeline to run:

● Scheduled trigger
● Tumbling window
● Event-based
Manual Pipeline Run
Just as it sounds

You manually start the pipeline to run by executing it


Scheduled Trigger
Runs at a predetermined recurring schedule

● Every Monday at 8:00 AM


● Every Monday at 8:00 AM and Friday at 5:00 PM
● Every Hour

You can even set start and end times on a trigger, so they don’t fire after a certain
date
Tumbling Window
It’s a type of scheduled trigger that has some interesting properties

You set it up to run at a predetermined interval (i.e. hourly)

It can be set up to run “in the past”

Good for when the pipeline is designed to process data by the time period
specified

“Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous


time intervals.”
DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Data
Visualization
Exam tip: know
the names of the
various Power
BI components
Paginated
reports
Paginated Reports
Designed to be printed and shared

Formatted to fit well on a page

Contains all the data, even if it spans multiple pages

Examples: Invoice, Sales Detail Report, Profit/Loss Statement


PowerBI Feature
SQL Server Reporting Services (SSRS) capability within Power BI

PowerBI Report Builder creates paginated reports

Standalone tool separate from Power BI

Reports get published to Power BI Service


Interactive
reports
Interactive
Reports designed to be viewed on screen

Can click for more details, drill down on the data

Report is “interactive”

Very visual, not a dump of data rows

Make use of “hover”

Report can change design / layout if based on user action


Power BI Feature
Uses Power BI Server installed on premises

Part of Power BI Premium


Dashboards
What are Dashboards?
Typically a mixture of chart types on a single page

All the relevant data, at a glance

Click to go to the individual report

Helps identify anomalies visually


Power BI
Content
Workflow
Typical workflow in Power BI
Connect to the data source that has the data

Pull what you need into the in-memory data model

Edit, transform the data as you require

Build reports using Power BI Desktop

Share the report


DP-900
Microsoft Certified: Azure Data Fundamentals

© 2020 Scott Duffy, SoftwareArchitect.ca


Thank you!

You might also like