Azure Databricks An Introduction
Bryan Cafferky
© Copyright Microsoft Corporation. All rights reserved.
Data Science, ML, and AI Perspective
MICROSOFT CONFIDENTIAL – INTERNAL ONLY
Use Cases
Readmissions
Sentiment Analysis
Azure Data Platform
Data Analysis Platform
Client Tools
ML Workbench
Streaming
DEMO
GA 3/2018
On-Prem
Analysis Services
Data Lake Analytics
SQL Data Warehouse (Massively Parallel Processing)
Managed Instance / SQL DB
Cosmos DB
HDInsight
Databricks
* Planned
Scale
An Introduction to
Azure Databricks
Bryan Cafferky
Technical Solutions Professional
Global Boot Camp: https://bostonazurebootcamp.com/2018
GitHub: https://github.com/bcafferky/shared
Making Things Easier
First Class Azure
• Easy Cluster Creation
• Big Data
• Scheduling
• Notebooks
• Security
• Collaboration
• Machine Learning
DATABRICKS - COMPANY OVERVIEW
Azure Databricks Notebooks Toolbar
Annotations
Code
Cells
Importing Notebooks
Under Workspace/Users/User, select Import from the dropdown
Importing Notebooks
From file or URL
Import library
Importing Notebooks
Lots of sample notebooks to try:
https://databricks.com/resources/type/example-notebooks
Importing Notebooks
Open a sample notebook and click the Import Notebook button
Importing Notebooks
Paste the link and click Import
Importing Notebooks
Imported notebook
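The import steps above use the workspace UI. The same operation is also available through the Databricks Workspace REST API (`POST /api/2.0/workspace/import`). The sketch below only builds the JSON body for such a request and does not send it. The workspace path, user name, and notebook source are hypothetical values chosen for illustration.

```python
import base64
import json


def build_import_request(workspace_path, notebook_source, language="PYTHON", overwrite=False):
    """Build the JSON body for a Workspace API import call.

    The API expects the notebook content as a base64-encoded string.
    """
    encoded = base64.b64encode(notebook_source.encode("utf-8")).decode("ascii")
    return {
        "path": workspace_path,   # target location in the workspace (hypothetical)
        "format": "SOURCE",       # plain source file, as opposed to DBC, HTML or JUPYTER
        "language": language,     # SCALA, PYTHON, SQL or R
        "content": encoded,
        "overwrite": overwrite,
    }


# Hypothetical notebook source and destination path.
source = "print('hello from an imported notebook')\n"
payload = build_import_request("/Users/someone@example.com/hello", source)
print(json.dumps(payload, indent=2))
```

Sending the body would additionally require the workspace URL and an access token, which are left out here on purpose.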
Why Spark?
• Open-source data processing engine built around speed, ease of use, and
sophisticated analytics
• Highly extensible with support for Scala, Java and Python alongside Spark SQL,
GraphX, Streaming and the Machine Learning Library (MLlib)
APACHE SPARK
A unified, open-source, parallel data processing framework for Big Data Analytics
Azure Databricks
Collaborative Workspace
Databricks I/O, Apache Spark, Serverless
Hadoop storage, Data warehouses, REST APIs
Enhance Productivity | Build on secure & trusted cloud | Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE
Diagram: Driver Program (SparkContext), Cluster Manager, and three Worker Nodes, each with a Cache and Tasks
• The driver runs the user's main function and executes the various parallel operations on the worker nodes.
• The results of the operations are collected by the driver.
• The worker nodes read and write data from/to data sources, including HDFS.
• Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
• Worker nodes and the driver node execute as VMs in public clouds (AWS, Google and Azure).
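The driver/worker split described above can be imitated in plain Python. This is only an analogy, not Spark code: a thread pool stands in for the worker nodes, and the function names (`split_into_partitions`, `worker_task`, `driver`) are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal partitions, one per worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def worker_task(partition):
    """A 'task' run on a worker: square each value and return a partial sum."""
    return sum(x * x for x in partition)


def driver(data, num_workers=3):
    """The 'driver': hand out partitions, then collect and combine the results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    return sum(partial_sums)


total = driver(list(range(1, 11)))
print(total)  # sum of squares of 1..10 = 385
```

In Spark the partitions live on separate machines and the cluster manager assigns the tasks, but the shape is the same: work is split, run in parallel, and the results come back to the driver.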
Access Control can be defined for Workspaces, Clusters, Jobs and REST APIs
Workspace Access Control: defines who can view, edit, and run notebooks in their workspace
Azure Databricks: Clusters, Libraries, Workspaces, Jobs, Notebooks
Demo
It's Time for a Closer Look
Compute and Storage are Separate
Azure Databricks – service home page
Quick Start
Documentation
Azure Databricks – workspace home page
A Deeper Dive
CLUSTERS
Ability             Can Restart   Can Manage
Terminate cluster   x             x
Start cluster       x             x
Restart cluster     x             x
Resize cluster                    x
Modify permissions                x
Administration
Administration Settings
What do you want to restrict?
Job – Advanced Settings
Job Schedule – CRON Format Support
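Databricks job schedules use Quartz cron syntax, which starts with a seconds field and allows an optional trailing year field. The sketch below splits such an expression into named fields so each position is easy to read. The helper name and the example expression are illustrative, and the function does not validate the field values themselves.

```python
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week",
    "year",  # optional seventh field
]


def describe_quartz_cron(expression):
    """Map each position of a Quartz cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 fields, got %d" % len(parts))
    # zip stops at the shorter sequence, so 'year' is omitted when absent.
    return dict(zip(QUARTZ_FIELDS, parts))


# Example: 06:30:00 on weekdays. '?' means "no specific value" for day-of-month.
schedule = describe_quartz_cron("0 30 6 ? * MON-FRI")
for name, value in schedule.items():
    print(name, "=", value)
```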
Calling Other Notebooks
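Inside a Databricks notebook, another notebook can be called with `dbutils.notebook.run(path, timeout_seconds, arguments)`, which returns the child notebook's exit value as a string. Because `dbutils` only exists inside Databricks, the sketch below takes the runner as a parameter and uses a stand-in function so it can run anywhere. The retry wrapper, the notebook path, and the argument names are assumptions made for illustration.

```python
def run_with_retry(run_notebook, path, timeout_seconds, arguments=None, max_retries=2):
    """Call a notebook runner, retrying a limited number of times on failure.

    In a Databricks notebook, pass dbutils.notebook.run as run_notebook.
    """
    arguments = arguments or {}
    attempts = 0
    while True:
        try:
            return run_notebook(path, timeout_seconds, arguments)
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise


# Stand-in for dbutils.notebook.run: fails once, then succeeds.
calls = []


def fake_run(path, timeout_seconds, arguments):
    calls.append(path)
    if len(calls) < 2:
        raise RuntimeError("simulated transient failure")
    return "done:" + arguments.get("date", "")


# "./child_notebook" and the "date" argument are hypothetical.
result = run_with_retry(fake_run, "./child_notebook", 60, {"date": "2018-04-21"})
print(result)
```

Injecting the runner keeps the retry logic testable outside the workspace; inside a notebook the only change is passing the real `dbutils.notebook.run`.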
Getting to the Spark Console
Spark GUI
Spark Job Metrics
Dashboard
Show Existing Dashboard
New Dashboard
Widgets
Widget Settings
Dynamic Visualizations with Widgets
Wrapping Up