Azure Databricks An Introduction


Azure Data Services Overview

Bryan Cafferky
© Copyright Microsoft Corporation. All rights reserved.

Technical Solutions Professional


Why So Many Options?


https://www.quebechebdo.com/content/dam/tc/quebec-hebdo/images/2010/7/16/une-nouvelle-salle-d-operation-pour-cont-1075320.jpg
Data Science, ML, and AI Perspective

MICROSOFT CONFIDENTIAL – INTERNAL ONLY
Use Cases

Readmissions

Patient Risks like Infection, Cardiac Arrest, Pulmonary Distress

Patient Managed Care – Risks and Prevention

Customer Patient Analytics Portal

Sentiment Analysis
Azure Data Platform
Data Analysis Platform
Client Tools

ML Workbench

Streaming

Azure Data Factory


Data Ingestion

DEMO

GA 3/2018
On-Prem

Analysis Services, Data Lake Analytics, SQL Data Warehouse (Massively Parallel Processing), Managed Instance / SQL DB, Cosmos DB, HDInsight, Databricks
* Planned

• PostgreSQL and MySQL Offered as Azure Services (in preview).

Scale
An Introduction to
Azure Databricks

Bryan Cafferky
Technical Solutions Professional

Global Boot Camp: https://bostonazurebootcamp.com/2018
GitHub: https://github.com/bcafferky/shared
Making Things
Easier
First Class
Azure

Easy Cluster Creation
Big Data
Scheduling
Notebooks
Security
Collaboration
Machine Learning
DATABRICKS - COMPANY OVERVIEW

▪ Founded in late 2013
▪ By the creators of Apache Spark, original team from UC Berkeley AMPLab
▪ Largest code contributor to Apache Spark
▪ Level 2/3 support partnership with
• Hortonworks
• MapR
• DataStax
▪ Provides certifications such as Databricks Certified Application, Databricks Certified Distribution and Databricks Certified Developer
▪ Main Product: The Unified Analytics Platform
▪ In Oct 2017, introduced Databricks Delta (currently in private preview).
AZURE DATABRICKS

▪ Azure Databricks is a first party service on Azure.
• Unlike with other clouds, it is not an Azure Marketplace offering or a 3rd party hosted service.
▪ Azure Databricks is integrated seamlessly with Azure services:
• Azure Portal: Service can be launched directly from Azure Portal
• Azure Storage Services: Directly access data in Azure Blob Storage and Azure Data Lake Store
• Azure Active Directory: For user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure.
• Azure SQL DW and Azure Cosmos DB: Enables you to combine structured and unstructured data for analytics
• Apache Kafka for HDInsight: Enables you to use Kafka as a streaming data source or sink
• Azure Billing: You get a single bill from Azure
• Azure Power BI: For rich data visualization
▪ Eliminates need to create a separate account with Databricks.
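
Example (sketch): direct access to Azure Blob Storage usually means pointing Spark at a wasbs:// URI. The helper below only builds such a URI; the function name, storage account, container, and path are made up for illustration and are not from the deck.

```python
def blob_uri(account: str, container: str, path: str = "", secure: bool = True) -> str:
    """Build a wasb/wasbs URI for a path inside an Azure Blob Storage container.

    All argument values used below are hypothetical examples.
    """
    scheme = "wasbs" if secure else "wasb"  # wasbs = the TLS-secured variant
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


uri = blob_uri("mystorageacct", "raw", "/sales/2018/data.csv")
print(uri)  # wasbs://raw@mystorageacct.blob.core.windows.net/sales/2018/data.csv
```

In a notebook, a URI like this could then be handed to a reader such as spark.read.csv(uri), assuming the cluster has already been given credentials for the storage account.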
Jumpstart with Sample Notebooks

https://clipartion.com/free-clipart-12075/
Azure Databricks Notebooks Toolbar

Annotations

Code
Cells
Importing Notebooks
Under Workspace/Users/User, select Import from the dropdown.

Importing Notebooks
Import from file or URL.
Import library.

Importing Notebooks
Lots of sample notebooks to try:
https://databricks.com/resources/type/example-notebooks

Importing Notebooks
Open a sample notebook and click on the Import Notebook button.
Copy the link to the clipboard.

Importing Notebooks
Paste the link and click Import.

Importing Notebooks
Imported notebook
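
Example (sketch): the same import can be scripted. The Databricks Workspace REST API has an import call (POST /api/2.0/workspace/import) that takes base64-encoded notebook source. The snippet below only builds the request body and sends nothing; the helper name, target path, and notebook source are hypothetical.

```python
import base64
import json


def build_import_payload(source: str, target_path: str, language: str = "PYTHON") -> dict:
    """Build a JSON body for the Workspace API import call.

    The notebook content must be base64-encoded. target_path is a
    hypothetical workspace location chosen for this example.
    """
    return {
        "path": target_path,
        "format": "SOURCE",      # plain source code rather than an archive
        "language": language,
        "content": base64.b64encode(source.encode("utf-8")).decode("ascii"),
        "overwrite": False,
    }


payload = build_import_payload("print('hello')", "/Users/someone@example.com/hello")
print(json.dumps(payload, indent=2))
```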
Why Spark?

• Open-source data processing engine built around speed, ease of use, and
sophisticated analytics

• In-memory engine that is up to 100 times faster than Hadoop MapReduce

• Largest open-source data project with 1000+ contributors

• Highly extensible with support for Scala, Java and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib)
APACHE SPARK
A unified, open source, parallel data processing framework for Big Data Analytics

Spark Unifies:
▪ Batch Processing
▪ Interactive SQL
▪ Real-time processing
▪ Machine Learning
▪ Deep Learning
▪ Graph Processing

[Diagram: Spark SQL (interactive queries), Spark MLlib (machine learning), Spark Structured Streaming (stream processing), and Spark GraphX (graph computation) sit on top of the Spark Core Engine, which runs on the Standalone Scheduler, Yarn, or Mesos.]
AZURE DATABRICKS

[Diagram: Azure Databricks platform]
Sources: IoT / streaming data, Cloud storage, Data warehouses, Hadoop storage
Collaborative Workspace: Data Engineer, Data Scientist, Business Analyst
Deploy Production Jobs & Workflows: Multi-stage pipelines, Job scheduler, Notification & logs
Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, Serverless, Rest APIs
Outputs: Machine learning models, BI tools, Data exports, Data warehouses

Enhance Productivity | Build on secure & trusted cloud | Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE

▪ The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
▪ The results of the operations are collected by the driver.
▪ The worker nodes read and write data from/to Data Sources including HDFS.
▪ Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
▪ Worker nodes and the Driver Node execute as VMs in public clouds (AWS, Google and Azure).

[Diagram: the Driver Program (SparkContext) connects through the Cluster Manager to the Worker Nodes; each Worker Node holds a Cache and runs Tasks against Data Sources (HDFS, SQL, NoSQL, …).]
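
Example (sketch): the driver/worker split can be imitated on one machine with a thread pool. This is an analogy in plain Python, not Spark code; the function names are invented, and threads stand in for worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split the data into one chunk per worker (round-robin)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def worker_task(chunk):
    """Work done on one 'worker node': square and sum its own partition."""
    return sum(x * x for x in chunk)


def driver(data, num_workers=3):
    """The 'driver': hand a task to each worker, then collect and combine results."""
    chunks = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, chunks))
    return sum(partial_sums)


total = driver(list(range(1, 11)))
print(total)  # 385, the sum of squares of 1..10
```

In real Spark the partitions live on separate VMs and the cluster manager decides where tasks run; only the shape of the flow (split, run in parallel, collect on the driver) carries over.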


SECURE COLLABORATION
Azure Databricks enables secure collaboration between colleagues

• With Azure Databricks colleagues can securely share key artifacts such as Clusters, Notebooks, Jobs and Workspaces
• Secure collaboration is enabled through a combination of:
  • Fine grained permissions: Defines who can do what on which artifacts (access control)
  • AAD-based authentication: Ensures that users are actually who they claim to be

[Diagram labels: Fine Grained Permissions, AAD-based User Authentication]
AZURE DATABRICKS INTEGRATION WITH AAD
Azure Databricks is integrated with AAD, so Azure Databricks users are just regular AAD users

▪ There is no need to define users, and their access control, separately in Databricks.
▪ AAD users can be used directly in Azure Databricks for all user-based access control (Clusters, Jobs, Notebooks etc.).
▪ Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication.
▪ Notebooks, and their outputs, are stored in the Databricks account. However, AAD-based access control ensures that only authorized users can access them.

[Diagram labels: Azure Databricks, Access Control, Authentication]
DATABRICKS ACCESS CONTROL
Access control can be defined at the user level via the Admin Console

Access Control can be defined for Workspaces, Clusters, Jobs and REST APIs

Workspace Access Control: Defines who can view, edit, and run notebooks in their workspace

Cluster Access Control: Controls who can attach to, restart, and manage (resize/delete) clusters. Allows Admins to specify which users have permissions to create clusters

Jobs Access Control: Allows owners of a job to control who can view job results or manage runs of a job (run now/cancel)

REST API Tokens: Allows users to use personal access tokens instead of passwords to access the Databricks REST API
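
Example (sketch): a personal access token is sent as a Bearer token in the Authorization header. The snippet builds a request for the clusters/list endpoint without sending it; the helper name, workspace URL, and token value are placeholders.

```python
import urllib.request


def build_request(workspace_url: str, token: str, endpoint: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated Databricks REST API request.

    workspace_url and token are placeholders; a real token would come from
    the user's settings page and should never be hard-coded.
    """
    url = f"{workspace_url.rstrip('/')}/api/2.0/{endpoint.lstrip('/')}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


req = build_request("https://example.azuredatabricks.net", "dapi-EXAMPLE-TOKEN", "clusters/list")
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would perform the call against a real workspace.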
AZURE DATABRICKS CORE ARTIFACTS

[Diagram: Azure Databricks at the center, surrounded by Clusters, Workspaces, Notebooks, Jobs, and Libraries]
Demo
It’s Time for a Closer
Look
Compute and
Storage are
Separate
Azure Databricks – service home page

Search for Resource


Azure Databricks – service home page
Azure Databricks – creating a workspace
Azure Databricks – workspace deployment
Help All Along the Way

Quick Start
Documentation
Azure Databricks – workspace home page
A Deeper Dive
CLUSTERS

▪ Azure Databricks clusters are the set of Azure Linux VMs that host the Spark Worker and Driver Nodes
▪ Your Spark application code (i.e. Jobs) runs on the provisioned clusters.
▪ Azure Databricks clusters are launched in your subscription, but are managed through the Azure Databricks portal.
▪ Azure Databricks provides a comprehensive set of graphical wizards to manage the complete lifecycle of clusters, from creation to termination.
CLUSTER CREATION

▪ You can create two types of clusters: Standard and Serverless Pool (see next slide)
▪ While creating a cluster you can specify:
• Number of nodes
• Autoscaling policy
• Auto Termination policy
• Spark Configuration details
• The Azure VM instance types for the Driver and Worker Nodes

Graphical wizard in the Azure Databricks portal to create a Standard Cluster
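
Example (sketch): the same choices can be expressed as a cluster specification for the Clusters REST API (POST /api/2.0/clusters/create). The helper and every value below are illustrative; the runtime version is deliberately left as a placeholder.

```python
import json


def cluster_spec(name, worker_vm, driver_vm, min_workers, max_workers,
                 autoterminate_minutes, spark_conf=None):
    """Assemble a cluster definition from the options listed on the slide.

    VM sizes, worker counts, and config values are example choices only.
    """
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",   # placeholder, pick from the portal
        "node_type_id": worker_vm,               # Azure VM type for workers
        "driver_node_type_id": driver_vm,        # Azure VM type for the driver
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": autoterminate_minutes,
        "spark_conf": spark_conf or {},
    }


spec = cluster_spec("demo-cluster", "Standard_DS3_v2", "Standard_DS3_v2",
                    min_workers=2, max_workers=8, autoterminate_minutes=60,
                    spark_conf={"spark.sql.shuffle.partitions": "64"})
print(json.dumps(spec, indent=2))
```

For a fixed-size cluster, the API accepts a num_workers field in place of the autoscale block.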


CLUSTER ACCESS CONTROL
• There are two configurable types of permissions for Cluster Access Control:
  • Individual Cluster Permissions - This controls a user’s ability to attach notebooks to a cluster, as well as to restart/resize/terminate/start clusters.
  • Cluster Creation Permissions - This controls a user’s ability to create clusters

• Individual permissions can be configured on the Clusters Page by clicking on Permissions under the ‘More Actions’ icon of an existing cluster
• There are 4 different individual cluster permission levels: No Permissions, Can Attach To, Can Restart, and Can Manage. Privileges are shown below

Abilities                        No Permissions   Can Attach To   Can Restart   Can Manage
Attach notebooks to cluster                       x               x             x
View Spark UI                                     x               x             x
View cluster metrics (Ganglia)                    x               x             x
Terminate cluster                                                 x             x
Start cluster                                                     x             x
Restart cluster                                                   x             x
Resize cluster                                                                  x
Modify permissions                                                              x
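
Example (sketch): because each level includes everything the level before it allows, the matrix above reduces to "minimum level required per ability". The names below are invented for illustration and are not a Databricks API.

```python
# Permission levels, ordered from least to most privileged.
LEVELS = ["No Permissions", "Can Attach To", "Can Restart", "Can Manage"]

# Lowest level that grants each ability, read off the matrix above.
MIN_LEVEL = {
    "attach notebooks": "Can Attach To",
    "view spark ui": "Can Attach To",
    "view cluster metrics": "Can Attach To",
    "terminate cluster": "Can Restart",
    "start cluster": "Can Restart",
    "restart cluster": "Can Restart",
    "resize cluster": "Can Manage",
    "modify permissions": "Can Manage",
}


def allowed(level: str, ability: str) -> bool:
    """Return True if a user holding `level` may perform `ability`."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[ability])


print(allowed("Can Restart", "terminate cluster"))  # True
print(allowed("Can Restart", "resize cluster"))     # False
```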
Administration
Administration Settings

What do you want to restrict?
Job – Advanced Settings
Job Schedule – CRON Format Support
Calling Other Notebooks
Getting to the Spark Console

Spark GUI
Spark Job Metrics
Dashboard

Show Existing Dashboard

New Dashboard
Widgets
Widget
Settings
Dynamic Visualizations with Widgets
Wrapping Up
