Azure Databricks An Introduction

Azure Data Services Overview
Bryan Cafferky
Technical Solutions Professional
© Copyright Microsoft Corporation. All rights reserved.

Why So Many Options?

https://www.quebechebdo.com/content/dam/tc/quebec-hebdo/images/2010/7/16/une-nouvelle-salle-d-operation-pour-cont-1075320.jpg
Data Science, ML, and AI Perspective
MICROSOFT CONFIDENTIAL – INTERNAL ONLY
Use Cases
Readmissions
Patient Risks like Infection, Cardiac Arrest, Pulmonary Distress
Patient Managed Care – Risks and Prevention
Customer Patient Analytics Portal
Sentiment Analysis
Azure Data Platform
[Diagram: Data Analysis Platform. Labels: Client Tools; ML Workbench; Streaming; Azure Data Factory; Data Ingestion; On-Prem; Scale. Services shown: Analysis Services, Data Lake Analytics, SQL Data Warehouse (Massively Parallel Processing), SQL DB / Managed Instance, Cosmos DB, HDInsight, Databricks. Annotations: DEMO; GA 3/2018; * Planned.]
• PostgreSQL and MySQL are offered as Azure services (in preview).
An Introduction to Azure Databricks
Bryan Cafferky
Technical Solutions Professional
Global Boot Camp: https://bostonazurebootcamp.com/2018
GitHub: https://github.com/bcafferky/shared
Making Things Easier
First Class Azure
Easy
• Big Data
• Scheduling
• Cluster Creation
• Notebooks
• Security
• Collaboration
• Machine Learning
DATABRICKS - COMPANY OVERVIEW
• Founded in late 2013
• By the creators of Apache Spark, the original team from UC Berkeley AMPLab
• Largest code contributor to Apache Spark
• Level 2/3 support partnership with:
    • Hortonworks
    • MapR
    • DataStax
• Provides certifications such as Databricks Certified Application, Databricks Certified Distribution and Databricks Certified Developer
• Main Product: The Unified Analytics Platform
• In Oct 2017, introduced Databricks Delta (currently in private preview).
AZURE DATABRICKS
• Azure Databricks is a first-party service on Azure.
    • Unlike with other clouds, it is not an Azure Marketplace offering or a third-party hosted service.
• Azure Databricks is integrated seamlessly with Azure services:
    • Azure Portal: The service can be launched directly from the Azure Portal
    • Azure Storage Services: Directly access data in Azure Blob Storage and Azure Data Lake Store
    • Azure Active Directory: For user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure
    • Azure SQL DW and Azure Cosmos DB: Enables you to combine structured and unstructured data for analytics
    • Apache Kafka for HDInsight: Enables you to use Kafka as a streaming data source or sink
    • Azure Billing: You get a single bill from Azure
    • Power BI: For rich data visualization
• Eliminates the need to create a separate account with Databricks.
Jumpstart with Sample Notebooks
https://clipartion.com/free-clipart-12075/
Azure Databricks Notebooks
[Screenshot callouts: Toolbar; Annotations; Code Cells]
Importing Notebooks
• Under Workspace/Users/User, select Import from the dropdown
• Import from a file or URL
• Import library
• Lots of sample notebooks to try: https://databricks.com/resources/type/example-notebooks
• Open a sample and click on the Import Notebook button
• Copy the link to the clipboard
• Paste the link and click Import
• The imported notebook appears in the workspace
Why Spark?
• Open-source data processing engine built around speed, ease of use, and
sophisticated analytics
• In-memory engine that is up to 100 times faster than Hadoop MapReduce
• Largest open-source data project with 1000+ contributors
• Highly extensible with support for Scala, Java and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib)
APACHE SPARK
A unified, open source, parallel, data processing framework for Big Data Analytics
[Diagram: Spark Core Engine with libraries on top: Spark SQL (interactive queries), Spark Structured Streaming (stream processing), Spark MLlib (machine learning), Spark GraphX (graph computation). Schedulers underneath: Yarn, Mesos, Standalone Scheduler.]
Spark Unifies:
• Batch Processing
• Interactive SQL
• Real-time processing
• Machine Learning
• Deep Learning
• Graph Processing
AZURE DATABRICKS
[Diagram: Azure Databricks platform]
• Collaborative Workspace: Data Engineer, Data Scientist, Business Analyst
• Deploy Production Jobs & Workflows: Multi-stage Pipelines, Job Scheduler, Notification & Logs
• Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, Serverless
• Connected systems: IoT / streaming data, cloud storage, data warehouses, Hadoop storage, machine learning models, BI tools, data exports, REST APIs
• Enhance Productivity; Build on secure & trusted cloud; Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE
• The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
• The results of the operations are collected by the driver.
• The worker nodes read and write data from/to Data Sources including HDFS.
• Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
• Worker nodes and the Driver Node execute as VMs in public clouds (AWS, Google and Azure).
[Diagram: Driver Program (SparkContext) connected through the Cluster Manager to three Worker Nodes, each with a Cache and Tasks, on top of Data Sources (HDFS, SQL, NoSQL, …)]

SECURE COLLABORATION
Azure Databricks enables secure collaboration between colleagues
• With Azure Databricks, colleagues can securely share key artifacts such as Clusters, Notebooks, Jobs and Workspaces
• Secure collaboration is enabled through a combination of:
    • Fine-grained permissions: Define who can do what on which artifacts (access control)
    • AAD-based authentication: Ensures that users are actually who they claim to be
AZURE DATABRICKS INTEGRATION WITH AAD
Azure Databricks is integrated with AAD, so Azure Databricks users are just regular AAD users
• There is no need to define users, and their access control, separately in Databricks.
• AAD users can be used directly in Azure Databricks for all user-based access control (Clusters, Jobs, Notebooks etc.).
• Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication.
• Notebooks, and their outputs, are stored in the Databricks account. However, AAD-based access control ensures that only authorized users can access them.
DATABRICKS ACCESS CONTROL
Access control can be defined at the user level via the Admin Console
Access control can be defined for Workspaces, Clusters, Jobs and REST APIs
• Workspace Access Control: Defines who can view, edit, and run notebooks in their workspace
• Cluster Access Control: Allows users to control who can attach to, restart, and manage (resize/delete) clusters. Allows Admins to specify which users have permissions to create clusters
• Jobs Access Control: Allows owners of a job to control who can view job results or manage runs of a job (run now/cancel)
• REST API Tokens: Allows users to use personal access tokens instead of passwords to access the Databricks REST API
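As a rough illustration of token-based access, the sketch below builds (but does not send) a request that carries a personal access token as a bearer token. The workspace URL and token value are hypothetical placeholders, and the helper name build_clusters_list_request is made up for this example; the endpoint path follows the Clusters REST API and should be checked against the current documentation.

```python
import urllib.request


def build_clusters_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the clusters list endpoint.

    The request is only constructed here, never sent.
    """
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        # The personal access token replaces a username/password pair.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical placeholder values; substitute your own workspace URL and token.
request = build_clusters_list_request(
    "https://example.azuredatabricks.net/", "dapi-EXAMPLE-TOKEN"
)
print(request.full_url)
print(request.get_method())
```

Sending the request (for example with urllib.request.urlopen) would require a real workspace and a valid token, so that step is left out. Tokens should be kept out of notebooks and source control.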
AZURE DATABRICKS CORE ARTIFACTS
• Clusters
• Workspaces
• Notebooks
• Jobs
• Libraries
Demo
It’s Time for a Closer Look
Compute and Storage are Separate
Azure Databricks – service home page
Search for Resource
Azure Databricks – creating a workspace
Azure Databricks – workspace deployment
Help All Along the Way
• Quick Start
• Documentation
Azure Databricks – workspace home page
A Deeper Dive
CLUSTERS
• Azure Databricks clusters are the set of Azure Linux VMs that host the Spark Worker and Driver Nodes
• Your Spark application code (i.e. Jobs) runs on the provisioned clusters.
• Azure Databricks clusters are launched in your subscription, but are managed through the Azure Databricks portal.
• Azure Databricks provides a comprehensive set of graphical wizards to manage the complete lifecycle of clusters, from creation to termination.
CLUSTER CREATION
• You can create two types of clusters: Standard and Serverless Pool (see next slide)
• While creating a cluster you can specify:
    • Number of nodes
    • Autoscaling
    • Auto Termination policy
    • Spark Configuration details
    • The Azure VM instance types for the Driver and Worker Nodes
Graphical wizard in the Azure Databricks portal to create a Standard Cluster
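The same settings the wizard collects can also be written down as a cluster specification. The sketch below is a minimal, hedged example: the helper build_cluster_spec and all values are hypothetical, the runtime version is left as a placeholder, and the field names are modelled on the Clusters REST API, so verify them against the current documentation before use.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes,
                       worker_vm, driver_vm, spark_conf=None):
    """Return a dict describing a cluster with autoscaling and auto termination."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, not a real version string
        "node_type_id": worker_vm,             # Azure VM instance type for worker nodes
        "driver_node_type_id": driver_vm,      # Azure VM instance type for the driver node
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,
        "spark_conf": dict(spark_conf or {}),
    }


# Hypothetical example values.
spec = build_cluster_spec(
    name="demo-cluster",
    min_workers=2,
    max_workers=8,
    idle_minutes=60,
    worker_vm="Standard_DS3_v2",
    driver_vm="Standard_DS3_v2",
    spark_conf={"spark.sql.shuffle.partitions": "64"},
)
print(json.dumps(spec, indent=2))
```

Keeping the specification as plain data makes it easy to review or version alongside notebooks, whichever way the cluster is eventually created.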

CLUSTER ACCESS CONTROL
• There are two configurable types of permissions for Cluster Access Control:
    • Individual Cluster Permissions - This controls a user’s ability to attach notebooks to a cluster, as well as to restart/resize/terminate/start clusters.
    • Cluster Creation Permissions - This controls a user’s ability to create clusters
• Individual permissions can be configured on the Clusters Page by clicking on Permissions under the ‘More Actions’ icon of an existing cluster
• There are 4 different individual cluster permission levels: No Permissions, Can Attach To, Can Restart, and Can Manage. Privileges are shown below

Abilities                        No Permissions   Can Attach To   Can Restart   Can Manage
Attach notebooks to cluster                       x               x             x
View Spark UI                                     x               x             x
View cluster metrics (Ganglia)                    x               x             x
Terminate cluster                                                 x             x
Start cluster                                                     x             x
Restart cluster                                                   x             x
Resize cluster                                                                  x
Modify permissions                                                              x
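The permission table above can be read as an ordered hierarchy in which each level includes everything the previous level allows. The sketch below encodes that reading in plain Python; the names LEVELS, MIN_LEVEL and is_allowed are made up for illustration and are not part of any Databricks API.

```python
# Permission levels, ordered from least to most privileged.
LEVELS = ["No Permissions", "Can Attach To", "Can Restart", "Can Manage"]

# Lowest level that grants each ability, taken from the table.
MIN_LEVEL = {
    "attach notebooks to cluster": "Can Attach To",
    "view Spark UI": "Can Attach To",
    "view cluster metrics (Ganglia)": "Can Attach To",
    "terminate cluster": "Can Restart",
    "start cluster": "Can Restart",
    "restart cluster": "Can Restart",
    "resize cluster": "Can Manage",
    "modify permissions": "Can Manage",
}


def is_allowed(level: str, ability: str) -> bool:
    """Return True if a user holding `level` may perform `ability`."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[ability])


# Print the abilities granted at each level.
for level in LEVELS:
    granted = [ability for ability in MIN_LEVEL if is_allowed(level, ability)]
    print(f"{level}: {', '.join(granted) or 'none'}")
```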
Administration
Administration Settings
What do you want to restrict?
Job – Advanced Settings
Job Schedule – CRON Format Support
Calling Other Notebooks
Getting to the Spark Console
Spark GUI
Spark Job Metrics
Dashboard
• Show Existing Dashboard
• New Dashboard
Widgets
• Widget Settings
Dynamic Visualizations with Widgets
Wrapping Up
