Scalable Machine Learning with Apache Spark™
©2023 Databricks Inc. — All rights reserved
Introductions
▪ Introductions
  ▪ Name
  ▪ Spark/ML/Databricks Experience
  ▪ Professional Responsibilities
  ▪ Fun Personal Interest/Fact
  ▪ Expectations for the Course
©2023 Databricks Inc. — All rights reserved
Course Objectives
1 Create data processing pipelines with Spark
2 Build and tune machine learning models with Spark ML
3 Track, version, and deploy machine learning models with MLflow
4 Perform distributed hyperparameter tuning with Hyperopt
5 Scale the inference of single-node models with Spark
©2023 Databricks Inc. — All rights reserved
Agenda (half-days)
Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1

Day 2
1. Linear Regression, pt. 1 Lab
2. Linear Regression, pt. 2
3. Break
4. Linear Regression, pt. 2 Lab
5. MLflow Tracking
6. Break
7. MLflow Model Registry
8. MLflow Lab

Day 3
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Hyperparameter Tuning Lab
5. Break
6. Hyperopt
7. Hyperopt Lab

Day 4
1. AutoML
2. AutoML Lab
3. Feature Store
4. Break
5. XGBoost
6. Inference with Pandas UDFs
7. Pandas UDFs Lab
8. Break
9. Training with Pandas Function API
10. Pandas API on Spark
©2023 Databricks Inc. — All rights reserved *Optional
Agenda (full days)
Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1
9. Linear Regression, pt. 1 Lab
10. Linear Regression, pt. 2
11. Break
12. Linear Regression, pt. 2 Lab
13. MLflow Tracking
14. Break
15. MLflow Model Registry
16. MLflow Lab

Day 2
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Hyperparameter Tuning Lab
5. Break
6. Hyperopt
7. Hyperopt Lab
8. AutoML
9. AutoML Lab
10. Feature Store
11. Break
12. XGBoost
13. Inference with Pandas UDFs
14. Pandas UDFs Lab
15. Break
16. Training with Pandas Function API
17. Koalas
©2023 Databricks Inc. — All rights reserved *Optional
Survey
▪ Apache Spark
▪ Machine Learning
▪ Programming Language
©2023 Databricks Inc. — All rights reserved
Databricks Certified ML Associate
Certification helps you gain industry recognition, competitive
differentiation, greater productivity, and results.
• This course helps you prepare for the
Databricks Certified Machine Learning
Associate exam
• Please see the Databricks Academy for
additional prep materials
For more information visit:
databricks.com/learn/certification
©2022 Databricks Inc. — All rights reserved
LET’S GET STARTED
©2023 Databricks Inc. — All rights reserved
Apache Spark™
Overview
©2023 Databricks Inc. — All rights reserved
Apache Spark Background
▪ Founded as a research project at
UC Berkeley in 2009
▪ Open-source unified data analytics
engine for big data
▪ Built-in APIs in SQL, Python, Scala,
R, and Java
©2023 Databricks Inc. — All rights reserved
Have you ever counted the number of M&Ms in a jar?
©2023 Databricks Inc. — All rights reserved
Spark Cluster
▪ One Driver (JVM)
▪ Many Workers, each running an Executor (JVM)
©2023 Databricks Inc. — All rights reserved
Spark’s Structured Data APIs
RDD (2011)
▪ Distributed collection of JVM objects
▪ Functional operators (map, filter, etc.)

DataFrame (2013)
▪ Distributed collection of row objects
▪ Expression-based operations and UDFs
▪ Logical plans and optimizer
▪ Fast/efficient internal representations

Dataset (2015)
▪ Internally rows, externally JVM objects
▪ Almost the "best of both worlds": type safe + fast
▪ But still slower than DataFrames
©2023 Databricks Inc. — All rights reserved
Spark DataFrame Execution
PySpark, Java/Scala, and SparkR DataFrames → Logical Plan → Catalyst Optimizer → Physical Execution
©2023 Databricks Inc. — All rights reserved
Under the Catalyst Optimizer’s Hood
SQL Query / DataFrame
→ Unresolved Logical Plan
→ (Analysis) → Logical Plan
→ (Logical Optimization) → Optimized Logical Plan
→ (Physical Planning) → Physical Plans
→ (Cost Model) → Selected Physical Plan
→ (Code Generation) → RDDs
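You can ask Spark to print these plans for any DataFrame query. A minimal sketch, assuming a SparkSession named spark and a hypothetical "events" table:

# Build a small DataFrame query; nothing executes until an action is called.
df = (spark.table("events")
          .filter("price > 0")
          .groupBy("category")
          .count())

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(mode="extended")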
©2023 Databricks Inc. — All rights reserved
When to Use Spark
▪ Scaling Out: the data or model is too large to process on a single machine, commonly resulting in out-of-memory errors
▪ Speeding Up: the data or model is processing slowly and could benefit from shorter processing times and faster results
©2023 Databricks Inc. — All rights reserved
Delta Lake Overview
©2023 Databricks Inc. — All rights reserved
Open-source Storage Layer
©2023 Databricks Inc. — All rights reserved
Delta Lake’s Key Features
▪ ACID transactions
▪ Time travel (data versioning)
▪ Schema enforcement and evolution
▪ Audit history
▪ Parquet format
▪ Compatible with the Apache Spark APIs
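A minimal sketch of these features in PySpark, assuming a DataFrame df and write access to a hypothetical /tmp/demo/users_delta path:

# Write the DataFrame in Delta format (Parquet files plus a transaction log).
df.write.format("delta").mode("overwrite").save("/tmp/demo/users_delta")

# Read it back as a regular Spark DataFrame.
current_df = spark.read.format("delta").load("/tmp/demo/users_delta")

# Time travel: read an earlier version of the table by version number.
v0_df = (spark.read.format("delta")
             .option("versionAsOf", 0)
             .load("/tmp/demo/users_delta"))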
©2023 Databricks Inc. — All rights reserved
Machine Learning
Overview (Optional)
©2023 Databricks Inc. — All rights reserved
What is Machine Learning?
▪ Learn patterns and relationships in your data without explicitly programming them
▪ Derive an approximation function to map features to an output or to relate them to each other

Features → Machine Learning → Output
©2023 Databricks Inc. — All rights reserved
Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)

Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
©2023 Databricks Inc. — All rights reserved
Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training

Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
©2023 Databricks Inc. — All rights reserved
Machine Learning Workflow
Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collection → Feature Engineering → Modeling → Deployment
©2023 Databricks Inc. — All rights reserved
Defining and Measuring Success: Establish a Baseline!
©2023 Databricks Inc. — All rights reserved
DATA CLEANSING DEMO
©2023 Databricks Inc. — All rights reserved
Importance of Data Visualization
©2023 Databricks Inc. — All rights reserved
How do we build and evaluate models?
©2023 Databricks Inc. — All rights reserved
DATA EXPLORATION LAB
©2023 Databricks Inc. — All rights reserved
Linear Regression
©2023 Databricks Inc. — All rights reserved
Linear Regression
Goal: Find the line of best fit.

ŷ = w₀ + w₁x
y ≈ ŷ + ε

where...
x: feature
y: label
w₀: y-intercept
w₁: slope of the line of best fit
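A minimal Spark ML sketch of fitting this line, assuming a hypothetical train_df with a numeric feature column sqft and a label column price:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the feature column(s) into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["sqft"], outputCol="features")
train_vec = assembler.transform(train_df)

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(train_vec)

print(lr_model.intercept)      # w0
print(lr_model.coefficients)   # w1 (one weight per feature)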
©2023 Databricks Inc. — All rights reserved
Minimizing the Residuals
▪ Red point: true value
▪ Purple & orange dotted lines: residuals
▪ Green line: line of best fit

The goal is to draw a line that minimizes the sum of the squared residuals.
©2023 Databricks Inc. — All rights reserved
Regression Evaluators
Measure the "closeness" between the actual value and the predicted value.

Evaluation Metrics
▪ Loss: (y - ŷ)
▪ Absolute loss: |y - ŷ|
▪ Squared loss: (y - ŷ)²
©2023 Databricks Inc. — All rights reserved
Evaluation Metric: Root Mean Squared Error (RMSE)

RMSE = sqrt( (1/n) Σ (yᵢ - ŷᵢ)² )
©2023 Databricks Inc. — All rights reserved
Linear Regression Assumptions
▪ Linear relationship between X and the mean of Y (linearity)
▪ Observations are independent from one another (independence)
▪ Y is normally distributed for any fixed value of X (normality)
▪ The variance of the residuals is the same for any value of X (homoscedasticity)
©2023 Databricks Inc. — All rights reserved
Linear Regression Assumptions
So, which datasets are suited for linear regression?
©2023 Databricks Inc. — All rights reserved
Train vs. Test RMSE
Which is more important? Why?
[plot: train and test RMSE curves]
©2023 Databricks Inc. — All rights reserved
Evaluation Metric: R²
What is the range of R²? Do we want it to be higher or lower?
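A minimal sketch of computing both metrics with Spark ML's RegressionEvaluator, assuming a hypothetical pred_df holding price labels and a prediction column:

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")

rmse = evaluator.setMetricName("rmse").evaluate(pred_df)
r2 = evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE = {rmse:.2f}, R2 = {r2:.3f}")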
©2023 Databricks Inc. — All rights reserved
Machine Learning Libraries
Scikit-learn is a popular single-node machine learning library.
But what if our data or model gets too big?
©2023 Databricks Inc. — All rights reserved
Machine Learning in Spark
Scale Out and Speed Up: Machine learning in Spark allows us to work with bigger data and train models faster by distributing the data and computations across multiple workers.

Spark Machine Learning Libraries
▪ MLlib: the original ML API for Spark; based on RDDs; now in maintenance mode
▪ Spark ML: the newer ML API for Spark; based on DataFrames
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
DEMO I
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
LAB I
©2023 Databricks Inc. — All rights reserved
Non-numeric Features
Two primary types of non-numeric features
Categorical Features
▪ A series of categories of a single feature
▪ No intrinsic ordering
▪ e.g. Dog, Cat, Fish

Ordinal Features
▪ A series of categories of a single feature
▪ Relative ordering, but not necessarily consistent spacing
▪ e.g. Infant, Toddler, Adolescent, Teen, Young Adult, etc.
©2023 Databricks Inc. — All rights reserved
Non-numeric Features in Linear Regression
How do we handle non-numeric features for linear regression?
▪ The X-axis is numeric, so features need to be numeric
▪ Convert our non-numeric features to numeric features?

Could we assign numeric values to each of the categories?
▪ "Dog" = 1, "Cat" = 2, "Fish" = 3, etc.
▪ Does this make sense? This implies 1 Cat is equal to 2 Dogs!
©2023 Databricks Inc. — All rights reserved
Non-numeric Features in Linear Regression
Instead, we commonly use a practice known as one-hot encoding (OHE).
▪ Creates a binary “dummy” feature for each category
OHE:
Animal | Dog | Cat | Fish
Dog    |  1  |  0  |  0
Cat    |  0  |  1  |  0
Fish   |  0  |  0  |  1
▪ Doesn’t force a uniformly-spaced, ordered numeric representation
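A minimal Spark ML sketch of OHE, assuming a hypothetical DataFrame df with a string column animal:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map each category to an index, then one-hot encode the index.
indexer = StringIndexer(inputCol="animal", outputCol="animal_idx")
encoder = OneHotEncoder(inputCols=["animal_idx"], outputCols=["animal_ohe"])

ohe_df = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)
# Note: Spark drops the last category by default (dropLast=True),
# so three animals produce a 2-element sparse vector.
ohe_df.select("animal", "animal_ohe").show(truncate=False)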
©2023 Databricks Inc. — All rights reserved
One-hot Encoding at Scale
You might be thinking...
▪ Okay, I see what's happening here … this works for a handful of animals.
▪ But what if we have an entire zoo of animals? That would result in really wide data!
Spark uses sparse vectors for this…
DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])
▪ Sparse vectors take the form:
(Number of elements, [indices of non-zero elements], [values of non-zero elements])
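A quick sketch of the two representations using pyspark.ml.linalg; the values match, only the storage differs:

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([0, 0, 0, 7, 0, 2, 0, 0, 0, 0])
# (number of elements, [indices of non-zero elements], [values of non-zero elements])
sparse = Vectors.sparse(10, [3, 5], [7.0, 2.0])

assert dense.toArray().tolist() == sparse.toArray().tolist()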
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
DEMO II
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
LAB II
©2023 Databricks Inc. — All rights reserved
MLflow Tracking
©2023 Databricks Inc. — All rights reserved
MLflow
▪ Open-source platform for machine learning lifecycle
▪ Operationalizing machine learning
▪ Developed by Databricks
▪ Pre-installed on the Databricks Runtime for ML
©2023 Databricks Inc. — All rights reserved
Core Machine Learning Issues
▪ Keeping track of experiments or model development
▪ Reproducing code
▪ Comparing models
▪ Standardization of packaging and deploying models
MLflow addresses these issues.
©2023 Databricks Inc. — All rights reserved
MLflow Components
▪ Tracking: record and query experiments: code, data, config, results
▪ Projects: packaging format for reproducible runs on any platform
▪ Models: general model format that supports diverse deployment tools
▪ Model Registry: centralized and collaborative model lifecycle management
▪ APIs: CLI, Python, R, Java, REST
©2023 Databricks Inc. — All rights reserved
MLflow Tracking and Autologging

Track ML development with one line of code: parameters, metrics, data lineage, model, and environment.

mlflow.autolog()

Ensure reproducibility:
▪ Model, environment, and artifacts
▪ Metrics
▪ Parameters and tags, including data version

Analyze results in the UI or programmatically:
● How does tuning parameter X affect my metric?
● What is the best model?
● Did I run training for long enough?
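A minimal tracking sketch, assuming a hypothetical train_df with a features vector column and a price label; autologging for supported libraries is one line, and manual logging still works inside the run:

import mlflow
from pyspark.ml.regression import LinearRegression

mlflow.autolog()   # enable autologging for supported libraries

with mlflow.start_run(run_name="lr-baseline"):
    lr_model = LinearRegression(featuresCol="features", labelCol="price").fit(train_df)
    mlflow.log_metric("train_rows", train_df.count())  # manual logging alongside autolog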
©2023 Databricks Inc. — All rights reserved
Model Deployment Options
Serving options: in-line code, containers, batch & stream scoring, OSS inference solutions, cloud inference services
©2023 Databricks Inc. — All rights reserved
The Full ML Lifecycle
©2023 Databricks Inc. — All rights reserved
MLFLOW TRACKING
DEMO
©2023 Databricks Inc. — All rights reserved
MLflow Model Registry
©2023 Databricks Inc. — All rights reserved
MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitate experimentation, testing, and production
▪ Integrate with approval and governance workflows
▪ Monitor ML deployments and their performance
Databricks MLflow Blog Post
©2023 Databricks Inc. — All rights reserved
One Collaborative Hub for Model Management

Centralized Model Management and Discovery
● Overview of all registered models and their versions at Staging and Production
● Search by name, tags, etc.
● Model-based ACLs

Full lineage from deployed models to training code / data
● Full lineage from Model Version to:
  ○ Run that produced the model
  ○ Notebook that produced the run
  ○ Exact revision history of the notebook that produced the run
©2023 Databricks Inc. — All rights reserved
Version Control and Visibility into the Deployment Process

Versioning of ML artifacts; visibility and auditability of the deployment process
● Audit log of stage transitions and requests per model
● Overview of active model versions and their deployment stage
● Comparison of versions and their logged metrics, parameters, etc.
©2023 Databricks Inc. — All rights reserved
Review Processes and CI/CD Integration

Manual review process and automation through CI/CD integration: webhooks allow registering callbacks (e.g. for tests / deployment) on events in the Model Registry.

[diagram: model versions v1-v3 moving through Staging, Production, and Archived stages, managed by Data Scientists and Deployment Engineers]

● Stage-based access controls
● Request and approval workflow for stage transitions
● Webhooks for events like model creation, version creation, transition request, etc.
● Mechanisms to store results / metadata through tags and comments
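A minimal sketch of the registration and stage-transition APIs, assuming a hypothetical run that logged a model under the artifact path "model" and a registered model named churn_model:

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by a tracked run (the run ID is a placeholder here).
mv = mlflow.register_model("runs:/<run_id>/model", name="churn_model")

# Move the new version into Staging; webhooks can react to this transition.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model", version=mv.version, stage="Staging"
)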
©2023 Databricks Inc. — All rights reserved
MLFLOW MODEL REGISTRY
DEMO
©2023 Databricks Inc. — All rights reserved
MLFLOW
LAB
©2023 Databricks Inc. — All rights reserved
Decision Trees
©2023 Databricks Inc. — All rights reserved
Decision Making
Salary > $50,000? (root node)
├─ No → Decline Offer
└─ Yes → Commute > 1 hr?
    ├─ Yes → Decline Offer
    └─ No → Offers Free Coffee?
        ├─ Yes → Accept Offer (leaf node)
        └─ No → Decline Offer (leaf node)

Example offers:
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: Yes
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: No
©2023 Databricks Inc. — All rights reserved
Decision Making
Salary > $50,000?
├─ No → Decline Offer
└─ Yes → Commute > 1 hr?
    ├─ Yes → Decline Offer
    └─ No → Offers Free Coffee?
        ├─ Yes → Accept Offer
        └─ No → Salary > $60,000?
            ├─ Yes → Accept Offer
            └─ No → Decline Offer

Example offer: Salary: 61,000; Commute: 30 mins; Free Coffee: No
©2023 Databricks Inc. — All rights reserved
Decision Making
Tracing the example offer (Salary: 61,000; Commute: 30 mins; Free Coffee: No) through the tree:
Salary > $50,000? → Yes → Commute > 1 hr? → No → Offers Free Coffee? → No → Salary > $60,000? → Yes → Accept Offer
©2023 Databricks Inc. — All rights reserved
Determining Splits
Commute? (< 1 hr vs. > 1 hr)    Commute? (< 10 min vs. > 10 min)

1 hr is a better splitting point for Commute because it provides information about the classification.
©2023 Databricks Inc. — All rights reserved
Determining Splits
Commute? (< 1 hr vs. > 1 hr)    Bonus? (Yes vs. No)

Commute is a better choice because it provides information about the classification.
©2023 Databricks Inc. — All rights reserved
Creating Decision Boundaries
Salary > $50,000?
├─ No → Decline Offer
└─ Yes → Commute > 1 hr?
    ├─ Yes → Decline Offer
    └─ No → Accept Offer

[plot: decision boundaries in the Commute vs. Salary plane; Decline Offer when Salary ≤ $50,000 or Commute > 1 hour, Accept Offer otherwise]
©2023 Databricks Inc. — All rights reserved
Lines vs. Boundaries
Linear Regression
▪ Lines through data
▪ Assumed linear relationship

Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships

[plots: a fitted regression line vs. rectangular decision regions over Commute and Salary]
©2023 Databricks Inc. — All rights reserved
Linear Regression or Decision Tree?
It depends on the data...
©2023 Databricks Inc. — All rights reserved
Tree Depth
Tree Depth: the length of the longest path from the root node to a leaf node.

Salary > $50,000? (root node, depth 0)
├─ No → Decline Offer (depth 1)
└─ Yes → Commute > 1 hr? (depth 1)
    ├─ Yes → Decline Offer (depth 2)
    └─ No → Offers Free Coffee? (depth 2)
        ├─ Yes → Accept Offer (leaf node, depth 3)
        └─ No → Decline Offer (leaf node, depth 3)

Note: shallow trees tend to underfit, and deep trees tend to overfit.
©2023 Databricks Inc. — All rights reserved
Underfitting vs. Overfitting
Underfitting Just Right Overfitting
©2023 Databricks Inc. — All rights reserved
Additional Resource
R2D3 has an excellent visualization of
how decision trees work.
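A minimal Spark ML sketch showing how tree depth is capped, assuming a hypothetical train_df with a features vector column and a binary label column:

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=3)
dt_model = dt.fit(train_df)

print(dt_model.depth)          # depth of the fitted tree
print(dt_model.toDebugString)  # text view of the learned splits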
©2023 Databricks Inc. — All rights reserved
DECISION TREE DEMO
©2023 Databricks Inc. — All rights reserved
Random Forests
©2023 Databricks Inc. — All rights reserved
Decision Trees
Pros
▪ Interpretable
▪ Simple
▪ Classification/Regression
▪ Nonlinear relationships

Cons
▪ Poor accuracy
▪ High variance
©2023 Databricks Inc. — All rights reserved
Bias vs. Variance
©2023 Databricks Inc. — All rights reserved
Bias-Variance Tradeoff
Error = Variance + Bias² + noise

▪ Reduce Bias: build more complex models
▪ Reduce Variance: use a lot of data; build simple models
▪ What about the noise?

[plot: Total Error, Variance, and Bias² vs. Model Complexity; the optimum model complexity minimizes total error]
©2023 Databricks Inc. — All rights reserved
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
©2023 Databricks Inc. — All rights reserved
Bootstrap Sampling
A method for simulating N new datasets:
1. Take a sample with replacement from the original training set
2. Repeat N times
©2023 Databricks Inc. — All rights reserved
Bootstrap Visualization
[figure: Training Set (N = 100) resampled into Bootstrap 1-4, each with N = 100]
Why are some points in the bootstrapped
samples not selected?
©2023 Databricks Inc. — All rights reserved
Training Set Coverage
Assume we are bootstrapping N draws from a training set with N observations...
▪ Probability of an element getting picked in each draw: 1/N
▪ Probability of an element not getting picked in each draw: 1 - 1/N
▪ Probability of an element not getting picked in the entire sample: (1 - 1/N)^N

As N → ∞, the probability for each element of not getting picked in a sample approaches 1/e ≈ 0.368.
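A quick numeric check of that limit:

import math

# Probability that a given element is never drawn in N draws with replacement.
for n in (10, 100, 1_000, 100_000):
    print(n, (1 - 1 / n) ** n)

print("limit:", 1 / math.e)  # ≈ 0.368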
©2023 Databricks Inc. — All rights reserved
Bootstrap Aggregating
▪ Train a tree on each sample, and average the predictions
▪ This is bootstrap aggregating, commonly referred to as bagging

Bootstrap 1 → Decision Tree 1
Bootstrap 2 → Decision Tree 2
Bootstrap 3 → Decision Tree 3
Bootstrap 4 → Decision Tree 4
→ predictions combined into a Final Prediction
©2023 Databricks Inc. — All rights reserved
Random Forest Algorithm
Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K

At each split, a subset of features is considered to ensure each tree is different.
©2023 Databricks Inc. — All rights reserved
Random Forest Aggregation
Scoring record → each tree makes a prediction → Aggregation → Final Prediction
▪ Majority voting for classification
▪ Mean for regression
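A minimal Spark ML sketch, assuming hypothetical train_df/test_df DataFrames with a features vector column and a price label:

from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="price",
    numTrees=100,                  # number of bootstrapped trees
    featureSubsetStrategy="auto",  # features considered at each split
    seed=42,
)
rf_model = rf.fit(train_df)
preds = rf_model.transform(test_df)  # regression: mean of the trees' predictions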
©2023 Databricks Inc. — All rights reserved
RANDOM FOREST DEMO
©2023 Databricks Inc. — All rights reserved
Hyperparameter
Tuning
©2023 Databricks Inc. — All rights reserved
What is a Hyperparameter?
A parameter whose value is used to control the training process.

▪ Examples for Random Forest:
  ▪ Tree depth
  ▪ Number of trees
  ▪ Number of features to consider
©2023 Databricks Inc. — All rights reserved
Selecting Hyperparameter Values
▪ Build a model for each hyperparameter value
▪ Evaluate each model to identify the optimal hyperparameter value
▪ What dataset should we use to train and evaluate?

Training | Validation | Test

What if there isn't enough data to split into three separate sets?
©2023 Databricks Inc. — All rights reserved
K-Fold Cross Validation
Pass 1: [ Training | Training | Validation ]
Pass 2: [ Training | Validation | Training ]
Pass 3: [ Validation | Training | Training ]

Average the validation errors across passes to identify the optimal hyperparameter values.

Final Pass: train with the optimal hyperparameters on all training data, then evaluate on the held-out Test set.
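A minimal Spark ML sketch of 3-fold cross-validation over a small grid, assuming a hypothetical train_df with features and price columns:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor(featuresCol="features", labelCol="price")
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 8])
        .addGrid(rf.numTrees, [2, 4])
        .build())

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="price", metricName="rmse"),
    numFolds=3,
)
cv_model = cv.fit(train_df)
best_rf = cv_model.bestModel  # refit on all training data with the best params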
©2023 Databricks Inc. — All rights reserved
HYPERPARAMETER TUNING
DEMO
©2023 Databricks Inc. — All rights reserved
Optimizing Hyperparameter Values
Grid Search
▪ Train and validate every unique combination of hyperparameters

Hyperparameter values
▪ Tree Depth: 5, 8
▪ Number of Trees: 2, 4

Resulting combinations
▪ Tree Depth = 5, Number of Trees = 2
▪ Tree Depth = 5, Number of Trees = 4
▪ Tree Depth = 8, Number of Trees = 2
▪ Tree Depth = 8, Number of Trees = 4
Question: With 3-fold cross validation, how many models will this build?
©2023 Databricks Inc. — All rights reserved
HYPERPARAMETER TUNING
LAB
©2023 Databricks Inc. — All rights reserved
Hyperparameter
Tuning
with Hyperopt
©2023 Databricks Inc. — All rights reserved
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have many hyperparameters to tune
▪ You want to pick your hyperparameters based on past results
©2023 Databricks Inc. — All rights reserved
Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces (real-valued,
discrete, and conditional dimensions)
▪ Supports serial or parallel optimization
▪ Spark integration
▪ Core algorithms for optimization:
▪ Random Search
▪ Adaptive Tree of Parzen Estimators (TPE)
©2023 Databricks Inc. — All rights reserved
Optimizing Hyperparameter Values
Random Search
Generally outperforms grid search
©2023 Databricks Inc. — All rights reserved
Optimizing Hyperparameter Values
Tree of Parzen Estimators
▪ Bayesian process
▪ Creates a meta-model that maps hyperparameters to the probability of a score on the objective function
▪ Provide a range and distribution for continuous and discrete values
▪ Adaptive TPE better tunes the search space by:
  ▪ Freezing hyperparameters
  ▪ Tuning the number of random trials before TPE
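A minimal Hyperopt sketch using TPE with SparkTrials to distribute trials; train_and_score is a hypothetical helper that trains a model with the given hyperparameters and returns a validation loss:

from hyperopt import SparkTrials, fmin, hp, tpe

def objective(params):
    # Hypothetical helper: trains a model and returns the loss to minimize.
    return train_and_score(params)

search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,                   # Tree of Parzen Estimators
    max_evals=32,
    trials=SparkTrials(parallelism=4),  # run trials in parallel on the cluster
)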
©2023 Databricks Inc. — All rights reserved
HYPEROPT
DEMO
©2023 Databricks Inc. — All rights reserved
HYPEROPT
LAB
©2023 Databricks Inc. — All rights reserved
AutoML
©2023 Databricks Inc. — All rights reserved
Databricks AutoML
A glass-box solution that empowers data teams without taking away control

▪ UI and API to start AutoML training
▪ MLflow experiment: auto-created MLflow Experiment to track metrics; easily deploy models to the Model Registry
▪ Data exploration notebook: generated notebook with feature summary statistics and distributions to understand and debug data quality and preprocessing
▪ Reproducible trial notebooks: generated notebooks with source code for every model, so you can iterate further on models from AutoML, adding your expertise
©2023 Databricks Inc. — All rights reserved
AutoML Solves Two Key Pain Points for Data Scientists

Quickly verify the predictive power of a dataset
▪ A Marketing Team hands a dataset to the Data Science Team: "Can this dataset be used to predict customer churn?"

Get a baseline model to guide project direction
▪ The Data Science Team builds a baseline model from the dataset: "What direction should I go in for this ML project, and what benchmark should I aim to beat?"
©2023 Databricks Inc. — All rights reserved
Problems with Existing AutoML Solutions
Opaque-Box and Production Cliff Problems in AutoML

[diagram: AutoML Configuration → AutoML Training ("opaque box") → Returned Best Model → production cliff → Deployed Model]

Problems
1. A "production cliff" exists where data scientists need to modify the returned "best" model using their domain expertise before deployment.
2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR, etc.), and most AutoML solutions have "opaque box" models.

Result / Pain Points
● The "best" model returned is often not good enough to deploy.
● Data scientists must spend time and energy reverse engineering these "opaque-box" returned models so that they can modify them and/or explain them.
©2023 Databricks Inc. — All rights reserved
“Glass-Box” AutoML
Configure → Train and Evaluate with a UI → Customize → Deploy
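A minimal API sketch, assuming the databricks-automl client available on the Databricks ML Runtime and a hypothetical Spark DataFrame df with a numeric price target:

from databricks import automl

summary = automl.regress(dataset=df, target_col="price", timeout_minutes=30)

print(summary.best_trial.metrics)       # metrics of the best run
print(summary.best_trial.notebook_url)  # generated notebook for that trial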
©2023 Databricks Inc. — All rights reserved
AutoML Lab
©2023 Databricks Inc. — All rights reserved
Feature Store
©2023 Databricks Inc. — All rights reserved
Feature Store
The first feature store codesigned with a data and MLOps platform

[diagram: Feature Registry and Feature Provider serving features in Batch (high throughput) and Online (low latency) modes]

Feature Registry
▪ Discoverability and reusability
▪ Versioning
▪ Upstream and downstream lineage

Feature Provider
▪ Batch and online access to features
▪ Feature lookup packaged with models
▪ Simplified deployment process

Co-designed with Delta Lake
▪ Open format
▪ Built-in data versioning and governance
▪ Native access through PySpark, SQL, etc.

Co-designed with MLflow
▪ Open model format that supports all ML frameworks
▪ Feature version and lookup logic hermetically logged with the model
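A minimal sketch of the Feature Store client on the Databricks ML Runtime, assuming a hypothetical features_df keyed by user_id and a table name chosen for illustration:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

fs.create_table(
    name="demo_db.user_features",   # hypothetical database.table name
    primary_keys=["user_id"],
    df=features_df,
    description="Demo user-level features",
)

features = fs.read_table("demo_db.user_features")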
©2023 Databricks Inc. — All rights reserved
Gradient Boosted
Decision Trees
©2023 Databricks Inc. — All rights reserved
Decision Tree Ensembles
▪ Combine many decision trees
▪ Random Forest
  ▪ Bagging
  ▪ Independent trees
  ▪ Results aggregated to a final prediction
▪ There are other methods of ensembling decision trees

[diagram: Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K]
©2023 Databricks Inc. — All rights reserved
Boosting
▪ Sequential (one tree at a time)
▪ Each tree learns from the last
▪ The sequence of trees is the final model

[diagram: trees trained sequentially on the full training data]
©2023 Databricks Inc. — All rights reserved
Gradient Boosted Decision Trees
▪ Common boosted trees algorithm
▪ Fits each tree to the residuals of the previous tree
▪ On the first iteration, residuals are the actual label values
Model 1:
Y  | Prediction | Residual
40 | 35         | 5
60 | 67         | -7
30 | 28         | 2
33 | 32         | 1

Model 2 (fit to Model 1's residuals):
Y  | Prediction | Residual
5  | 3          | 2
-7 | -4         | -3
2  | 3          | -1
1  | 0          | 1

Final Prediction (Model 1 + Model 2):
Y  | Prediction
40 | 38
60 | 63
30 | 31
33 | 32
©2023 Databricks Inc. — All rights reserved
Boosting vs. Bagging
GBDT
▪ Starts with high bias, low variance
▪ Works right on the model-complexity curve, reducing bias

RF
▪ Starts with high variance, low bias
▪ Works left on the model-complexity curve, reducing variance

[plot: Total Error, Variance, and Bias² vs. Model Complexity, with the optimum in between]
©2023 Databricks Inc. — All rights reserved
Gradient Boosted Decision Tree Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Pre-installed in Databricks Runtime for ML (Python & Scala APIs)
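A minimal sketch of distributed XGBoost training with the xgboost.spark estimator (available in recent XGBoost releases), assuming hypothetical train_df/test_df DataFrames with a features vector column and a price label:

from xgboost.spark import SparkXGBRegressor

xgb = SparkXGBRegressor(
    features_col="features",
    label_col="price",
    num_workers=2,      # parallelize training across Spark tasks
    max_depth=6,
    n_estimators=200,
)
xgb_model = xgb.fit(train_df)
preds = xgb_model.transform(test_df)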
©2023 Databricks Inc. — All rights reserved
XGBOOST DEMO
©2023 Databricks Inc. — All rights reserved
Appendix
©2023 Databricks Inc. — All rights reserved
ML Deployment
Options
©2023 Databricks Inc. — All rights reserved
What is ML Deployment?
▪ Data Science != Data Engineering
▪ Data science is scientific
▪ Business problems → data problems
▪ Model mathematically
▪ Optimize performance
▪ Data engineers are concerned with
▪ Reliability
▪ Scalability
▪ Maintainability
▪ SLAs
▪ ...
©2023 Databricks Inc. — All rights reserved
DevOps vs. ModelOps
▪ DevOps = software development + IT operations
▪ Manages deployments
▪ CI/CD of features, patches, updates, rollbacks
▪ ModelOps = data modeling + deployment operations
▪ Artifact management (Continuous Training)
▪ Model performance monitoring (Continuous Monitoring)
▪ Data management
▪ Use of containers and managed services
©2023 Databricks Inc. — All rights reserved
The Four Deployment Paradigms
1. Batch
▪ 80-90% of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
2. Streaming (continuous)
▪ 10-15% of deployments
▪ Moderately fast scoring on new data
3. Real Time
▪ 5-10% of deployments
▪ Usually using REST (Azure ML, SageMaker, containers)
4. On-device (edge)
©2023 Databricks Inc. — All rights reserved
Latency Requirements (roughly)
[scale: Real Time ≈ 10 ms - 100 ms | Streaming ≈ 100 ms - 1 min | Batch ≈ 1 hour - 1 day]
©2023 Databricks Inc. — All rights reserved
Overview of a typical Databricks CI/CD
pipeline
Continuous integration and continuous delivery: Code → Build → Release → Deploy → Test → Operate
See CI/CD Templates for a starting point
©2023 Databricks Inc. — All rights reserved
Logistic Regression
©2023 Databricks Inc. — All rights reserved
Types of Supervised Learning
Regression
▪ Predicting a continuous output

Classification
▪ Predicting a categorical/discrete output
©2023 Databricks Inc. — All rights reserved
Types of Classification
Binary Classification: two label classes
Multiclass Classification: three or more label classes

Model output is commonly the probability of a record belonging to each of the classes.
©2023 Databricks Inc. — All rights reserved
Binary Classification
Two label classes.

▪ Outputs:
  ▪ Probability that the record is Green given a set of features
  ▪ Probability that the record is Red given a set of features
▪ Reminders:
  ▪ Probabilities are bounded between 0 and 1
  ▪ Linear regression returns any real number
©2023 Databricks Inc. — All rights reserved
Bounding Binary Classification Probabilities
How can we keep model outputs between 0 and 1?
▪ Logistic function: σ(z) = 1 / (1 + e^(-z))
  ▪ Large positive inputs → 1
  ▪ Large negative inputs → 0
©2023 Databricks Inc. — All rights reserved
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly
complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
But we need class predictions, not probability predictions
▪ Set a threshold on the probability predictions
▪ 𝐏[y = 1 | x] < 0.5 → y = 0
▪ 𝐏[y = 1 | x] ≥ 0.5 → y = 1
©2023 Databricks Inc. — All rights reserved
Evaluating Binary Classification Models
▪ How can the model be wrong?
▪ Type I Error: False Positive
▪ Type II Error: False Negative
▪ Representing these errors with a confusion matrix.
©2023 Databricks Inc. — All rights reserved
Binary Classification Metrics
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
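A minimal Spark ML sketch computing these metrics, assuming hypothetical train_df/test_df DataFrames with a features vector column and a binary label column:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

preds = (LogisticRegression(featuresCol="features", labelCol="label")
         .fit(train_df)
         .transform(test_df))

evaluator = MulticlassClassificationEvaluator(labelCol="label")
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    print(metric, evaluator.setMetricName(metric).evaluate(preds))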
©2023 Databricks Inc. — All rights reserved
Collaborative Filtering
©2023 Databricks Inc. — All rights reserved
Recommendation Systems
©2023 Databricks Inc. — All rights reserved
Naive Approaches to Recommendation
▪ Hand-curated
▪ Aggregates
Question: What are problems with these
approaches?
©2023 Databricks Inc. — All rights reserved
Content-based Recommendation
▪ Idea: Recommend items to a customer that are similar to other
items the customer liked
▪ Creates a profile for each user or product
▪ User: demographic info, ratings, etc.
▪ Item: genre, flavor, brand, actor list, etc.
©2023 Databricks Inc. — All rights reserved
Content-based Recommendation
▪ Advantages
▪ No need for data from other users
▪ New item recommendations
▪ Disadvantages
▪ Cold-start problem
▪ Determining appropriate features
▪ Implicit information
©2023 Databricks Inc. — All rights reserved
Collaborative Filtering
▪ Idea: Make recommendations for one customer (filtering) by
collecting and analyzing the interests of many users
(collaboration)
▪ Advantages over content-based recommendation
▪ Relies only on past user behavior (no profile creation)
▪ Domain independent
▪ Generally more accurate
▪ Disadvantages
▪ Extremely susceptible to cold-start problem (user and item)
©2023 Databricks Inc. — All rights reserved
Types of Collaborative Filtering
▪ Neighborhood Methods: compute relationships between items or users
  ▪ Computationally expensive
  ▪ Not empirically as good
▪ Latent Factor Models: explain the ratings by characterizing items and users by a small number of inferred factors
  ▪ Matrix factorization
    ▪ Characterizes both items and users by vectors of factors inferred from the item-rating pattern
    ▪ Explicit feedback: sparse matrix
    ▪ Scalable
©2023 Databricks Inc. — All rights reserved
Latent Factor Approach
©2023 Databricks Inc. — All rights reserved
Ratings Matrix
©2023 Databricks Inc. — All rights reserved
Matrix Factorization
©2023 Databricks Inc. — All rights reserved
Alternating Least Squares
▪ Step 1: Randomly initialize user and movie factors
▪ Step 2: Repeat the following
1. Fix the movie factors, and optimize user factors
2. Fix the user factors, and optimize movie factors
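A minimal Spark ML sketch of ALS, assuming a hypothetical ratings_df with userId, movieId, and rating columns:

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=10,                   # number of latent factors
    maxIter=10,
    coldStartStrategy="drop",  # drop predictions for unseen users/items
)
als_model = als.fit(ratings_df)
user_recs = als_model.recommendForAllUsers(5)  # top-5 item recommendations per user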
©2023 Databricks Inc. — All rights reserved