Scalable Machine Learning with Apache Spark™
©2023 Databricks Inc. — All rights reserved
Introductions
▪ Introductions
  ▪ Name
  ▪ Spark/ML/Databricks Experience
  ▪ Professional Responsibilities
  ▪ Fun Personal Interest/Fact
  ▪ Expectations for the Course
©2023 Databricks Inc. — All rights reserved
Course Objectives
1 Create data processing pipelines with Spark
2 Build and tune machine learning models with Spark ML
3 Track, version, and deploy machine learning models with MLflow
4 Perform distributed hyperparameter tuning with Hyperopt
5 Scale the inference of single-node models with Spark
©2023 Databricks Inc. — All rights reserved
Agenda (half-days)
Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1

Day 2
1. Linear Regression, pt. 1 Lab
2. Linear Regression, pt. 2
3. Break
4. Linear Regression, pt. 2 Lab
5. MLflow Tracking
6. Break
7. MLflow Model Registry
8. MLflow Lab

Day 3
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Hyperparameter Tuning Lab
5. Break
6. Hyperopt
7. Hyperopt Lab

Day 4
1. AutoML
2. AutoML Lab
3. Feature Store
4. Break
5. XGBoost
6. Inference with Pandas UDFs
7. Pandas UDFs Lab
8. Break
9. Training with Pandas Function API
10. Pandas API on Spark
©2023 Databricks Inc. — All rights reserved *Optional
Agenda (full days)
Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1
9. Linear Regression, pt. 1 Lab
10. Linear Regression, pt. 2
11. Break
12. Linear Regression, pt. 2 Lab
13. MLflow Tracking
14. Break
15. MLflow Model Registry
16. MLflow Lab

Day 2
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Hyperparameter Tuning Lab
5. Break
6. Hyperopt
7. Hyperopt Lab
8. AutoML
9. AutoML Lab
10. Feature Store
11. Break
12. XGBoost
13. Inference with Pandas UDFs
14. Pandas UDFs Lab
15. Break
16. Training with Pandas Function API
17. Koalas
©2023 Databricks Inc. — All rights reserved *Optional
Survey
▪ Apache Spark
▪ Machine Learning
▪ Programming Language
©2023 Databricks Inc. — All rights reserved
Databricks Certified ML Associate
Certification helps you gain industry recognition, competitive
differentiation, greater productivity, and results.
• This course helps you prepare for the
Databricks Certified Machine Learning
Associate exam
• Please see the Databricks Academy for
additional prep materials
For more information visit:
databricks.com/learn/certification
©2022 Databricks Inc. — All rights reserved
LET’S GET STARTED
©2023 Databricks Inc. — All rights reserved
Apache Spark™
Overview
©2023 Databricks Inc. — All rights reserved
Apache Spark Background
▪ Founded as a research project at
UC Berkeley in 2009
▪ Open-source unified data analytics
engine for big data
▪ Built-in APIs in SQL, Python, Scala,
R, and Java
©2023 Databricks Inc. — All rights reserved
Have you ever counted the number of M&Ms in a jar?
©2023 Databricks Inc. — All rights reserved
Spark Cluster
▪ One Driver (JVM)
▪ Many Workers, each running an Executor (JVM)
©2023 Databricks Inc. — All rights reserved
Spark’s Structured Data APIs
RDD (2011)
▪ Distributed collection of JVM objects
▪ Functional operators (map, filter, etc.)

DataFrame (2013)
▪ Distributed collection of row objects
▪ Expression-based operations and UDFs
▪ Logical plans and optimizer
▪ Fast/efficient internal representations

Dataset (2015)
▪ Internally rows, externally JVM objects
▪ Almost the "best of both worlds": type safe + fast
▪ But still slower than DataFrames
©2023 Databricks Inc. — All rights reserved
Spark DataFrame Execution
PySpark, Java/Scala, and SparkR DataFrames → Logical Plan → Catalyst Optimizer → Physical Execution
©2023 Databricks Inc. — All rights reserved
Under the Catalyst Optimizer’s Hood
SQL Query / DataFrame
→ Unresolved Logical Plan
→ (Analysis) → Logical Plan
→ (Logical Optimization) → Optimized Logical Plan
→ (Physical Planning) → Physical Plans
→ (Cost Model) → Selected Physical Plan
→ (Code Generation) → RDDs
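You can ask Spark to print these plans for any DataFrame query. A minimal sketch, assuming a SparkSession named spark and a hypothetical "events" table:

# Build a small DataFrame query; nothing executes until an action is called.
df = (spark.table("events")
          .filter("price > 0")
          .groupBy("category")
          .count())

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(mode="extended")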
©2023 Databricks Inc. — All rights reserved
When to Use Spark
▪ Scaling Out: the data or model is too large to process on a single machine, commonly resulting in out-of-memory errors
▪ Speeding Up: the data or model is processing slowly and could benefit from shorter processing times and faster results
©2023 Databricks Inc. — All rights reserved
Delta Lake Overview
©2023 Databricks Inc. — All rights reserved
Open-source Storage Layer
©2023 Databricks Inc. — All rights reserved
Delta Lake’s Key Features
▪ ACID transactions
▪ Time travel (data versioning)
▪ Schema enforcement and evolution
▪ Audit history
▪ Parquet format
▪ Compatible with the Apache Spark APIs
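A minimal sketch of these features in PySpark, assuming a DataFrame df and write access to a hypothetical /tmp/demo/users_delta path:

# Write the DataFrame in Delta format (Parquet files plus a transaction log).
df.write.format("delta").mode("overwrite").save("/tmp/demo/users_delta")

# Read it back as a regular Spark DataFrame.
current_df = spark.read.format("delta").load("/tmp/demo/users_delta")

# Time travel: read an earlier version of the table by version number.
v0_df = (spark.read.format("delta")
             .option("versionAsOf", 0)
             .load("/tmp/demo/users_delta"))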
©2023 Databricks Inc. — All rights reserved
Machine Learning
Overview (Optional)
©2023 Databricks Inc. — All rights reserved
What is Machine Learning?
▪ Learn patterns and relationships in your data without explicitly programming them
▪ Derive an approximation function to map features to an output or to relate them to each other

Features → Machine Learning → Output
©2023 Databricks Inc. — All rights reserved
Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)

Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
©2023 Databricks Inc. — All rights reserved
Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training

Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
©2023 Databricks Inc. — All rights reserved
Machine Learning Workflow
Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collection → Feature Engineering → Modeling → Deployment
©2023 Databricks Inc. — All rights reserved
Defining and Measuring Success: Establish a Baseline!
©2023 Databricks Inc. — All rights reserved
DATA CLEANSING DEMO
©2023 Databricks Inc. — All rights reserved
Importance of Data Visualization
©2023 Databricks Inc. — All rights reserved
How do we build and evaluate models?
©2023 Databricks Inc. — All rights reserved
DATA EXPLORATION LAB
©2023 Databricks Inc. — All rights reserved
Linear Regression
©2023 Databricks Inc. — All rights reserved
Linear Regression
Goal: Find the line of best fit.

ŷ = w₀ + w₁x
y ≈ ŷ + ε

where...
x: feature
y: label
w₀: y-intercept
w₁: slope of the line of best fit
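A minimal Spark ML sketch of fitting this line, assuming a hypothetical train_df with a numeric feature column sqft and a label column price:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the feature column(s) into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["sqft"], outputCol="features")
train_vec = assembler.transform(train_df)

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(train_vec)

print(lr_model.intercept)      # w0
print(lr_model.coefficients)   # w1 (one weight per feature)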
©2023 Databricks Inc. — All rights reserved
Minimizing the Residuals
▪ Red point: true value
▪ Purple & orange dotted lines: residuals
▪ Green line: line of best fit

The goal is to draw a line that minimizes the sum of the squared residuals.
©2023 Databricks Inc. — All rights reserved
Regression Evaluators
Measure the "closeness" between the actual value and the predicted value.

Evaluation Metrics
▪ Loss: (y - ŷ)
▪ Absolute loss: |y - ŷ|
▪ Squared loss: (y - ŷ)²
©2023 Databricks Inc. — All rights reserved
Evaluation Metric: Root Mean Squared Error (RMSE)

RMSE = sqrt( (1/n) Σ (yᵢ - ŷᵢ)² )
©2023 Databricks Inc. — All rights reserved
Linear Regression Assumptions
▪ Linear relationship between X and the mean of Y (linearity)
▪ Observations are independent from one another (independence)
▪ Y is normally distributed for any fixed value of X (normality)
▪ The variance of the residuals is the same for any value of X (homoscedasticity)
©2023 Databricks Inc. — All rights reserved
Linear Regression Assumptions
So, which datasets are suited for linear regression?
©2023 Databricks Inc. — All rights reserved
Train vs. Test RMSE
Which is more important? Why?
[plot: train and test RMSE curves]
©2023 Databricks Inc. — All rights reserved
Evaluation Metric: R²
What is the range of R²? Do we want it to be higher or lower?
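A minimal sketch of computing both metrics with Spark ML's RegressionEvaluator, assuming a hypothetical pred_df holding price labels and a prediction column:

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")

rmse = evaluator.setMetricName("rmse").evaluate(pred_df)
r2 = evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE = {rmse:.2f}, R2 = {r2:.3f}")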
©2023 Databricks Inc. — All rights reserved
Machine Learning Libraries
Scikit-learn is a popular single-node machine learning library.
But what if our data or model gets too big?
©2023 Databricks Inc. — All rights reserved
Machine Learning in Spark
Scale Out and Speed Up: Machine learning in Spark allows us to work with bigger data and train models faster by distributing the data and computations across multiple workers.

Spark Machine Learning Libraries
▪ MLlib: the original ML API for Spark; based on RDDs; now in maintenance mode
▪ Spark ML: the newer ML API for Spark; based on DataFrames
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
DEMO I
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
LAB I
©2023 Databricks Inc. — All rights reserved
Non-numeric Features
Two primary types of non-numeric features
Categorical Features
▪ A series of categories of a single feature
▪ No intrinsic ordering
▪ e.g. Dog, Cat, Fish

Ordinal Features
▪ A series of categories of a single feature
▪ Relative ordering, but not necessarily consistent spacing
▪ e.g. Infant, Toddler, Adolescent, Teen, Young Adult, etc.
©2023 Databricks Inc. — All rights reserved
Non-numeric Features in Linear Regression
How do we handle non-numeric features for linear regression?
▪ The X-axis is numeric, so features need to be numeric
▪ Convert our non-numeric features to numeric features?

Could we assign numeric values to each of the categories?
▪ "Dog" = 1, "Cat" = 2, "Fish" = 3, etc.
▪ Does this make sense? This implies 1 Cat is equal to 2 Dogs!
©2023 Databricks Inc. — All rights reserved
Non-numeric Features in Linear Regression
Instead, we commonly use a practice known as one-hot encoding (OHE).
▪ Creates a binary “dummy” feature for each category
OHE:
Animal | Dog | Cat | Fish
Dog    |  1  |  0  |  0
Cat    |  0  |  1  |  0
Fish   |  0  |  0  |  1
▪ Doesn’t force a uniformly-spaced, ordered numeric representation
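A minimal Spark ML sketch of OHE, assuming a hypothetical DataFrame df with a string column animal:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map each category to an index, then one-hot encode the index.
indexer = StringIndexer(inputCol="animal", outputCol="animal_idx")
encoder = OneHotEncoder(inputCols=["animal_idx"], outputCols=["animal_ohe"])

ohe_df = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)
# Note: Spark drops the last category by default (dropLast=True),
# so three animals produce a 2-element sparse vector.
ohe_df.select("animal", "animal_ohe").show(truncate=False)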
©2023 Databricks Inc. — All rights reserved
One-hot Encoding at Scale
You might be thinking...
▪ Okay, I see what's happening here … this works for a handful of animals.
▪ But what if we have an entire zoo of animals? That would result in really wide data!
Spark uses sparse vectors for this…
DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])
▪ Sparse vectors take the form:
(Number of elements, [indices of non-zero elements], [values of non-zero elements])
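A quick sketch of the two representations using pyspark.ml.linalg; the values match, only the storage differs:

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([0, 0, 0, 7, 0, 2, 0, 0, 0, 0])
# (number of elements, [indices of non-zero elements], [values of non-zero elements])
sparse = Vectors.sparse(10, [3, 5], [7.0, 2.0])

assert dense.toArray().tolist() == sparse.toArray().tolist()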
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
DEMO II
©2023 Databricks Inc. — All rights reserved
LINEAR REGRESSION
LAB II
©2023 Databricks Inc. — All rights reserved
MLflow Tracking
©2023 Databricks Inc. — All rights reserved
MLflow
▪ Open-source platform for machine learning lifecycle
▪ Operationalizing machine learning
▪ Developed by Databricks
▪ Pre-installed on the Databricks Runtime for ML
©2023 Databricks Inc. — All rights reserved
Core Machine Learning Issues
▪ Keeping track of experiments or model development
▪ Reproducing code
▪ Comparing models
▪ Standardization of packaging and deploying models
MLflow addresses these issues.
©2023 Databricks Inc. — All rights reserved
MLflow Components
▪ Tracking: record and query experiments: code, data, config, results
▪ Projects: packaging format for reproducible runs on any platform
▪ Models: general model format that supports diverse deployment tools
▪ Model Registry: centralized and collaborative model lifecycle management
▪ APIs: CLI, Python, R, Java, REST
©2023 Databricks Inc. — All rights reserved
MLflow Tracking and Autologging

Track ML development with one line of code: parameters, metrics, data lineage, model, and environment.

mlflow.autolog()

Ensure reproducibility:
▪ Model, environment, and artifacts
▪ Metrics
▪ Parameters and tags, including data version

Analyze results in the UI or programmatically:
● How does tuning parameter X affect my metric?
● What is the best model?
● Did I run training for long enough?
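A minimal tracking sketch, assuming a hypothetical train_df with a features vector column and a price label; autologging for supported libraries is one line, and manual logging still works inside the run:

import mlflow
from pyspark.ml.regression import LinearRegression

mlflow.autolog()   # enable autologging for supported libraries

with mlflow.start_run(run_name="lr-baseline"):
    lr_model = LinearRegression(featuresCol="features", labelCol="price").fit(train_df)
    mlflow.log_metric("train_rows", train_df.count())  # manual logging alongside autolog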
©2023 Databricks Inc. — All rights reserved
Model Deployment Options
Serving options: in-line code, containers, batch & stream scoring, OSS inference solutions, cloud inference services
©2023 Databricks Inc. — All rights reserved
The Full ML Lifecycle
©2023 Databricks Inc. — All rights reserved
MLFLOW TRACKING
DEMO
©2023 Databricks Inc. — All rights reserved
MLflow Model Registry
©2023 Databricks Inc. — All rights reserved
MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitate experimentation, testing, and production
▪ Integrate with approval and governance workflows
▪ Monitor ML deployments and their performance
Databricks MLflow Blog Post
©2023 Databricks Inc. — All rights reserved
One Collaborative Hub for Model Management

Centralized Model Management and Discovery
● Overview of all registered models and their versions at Staging and Production
● Search by name, tags, etc.
● Model-based ACLs

Full lineage from deployed models to training code / data
● Full lineage from Model Version to:
  ○ Run that produced the model
  ○ Notebook that produced the run
  ○ Exact revision history of the notebook that produced the run
©2023 Databricks Inc. — All rights reserved
Version Control and Visibility into the Deployment Process

Versioning of ML artifacts; visibility and auditability of the deployment process
● Audit log of stage transitions and requests per model
● Overview of active model versions and their deployment stage
● Comparison of versions and their logged metrics, parameters, etc.
©2023 Databricks Inc. — All rights reserved
Review Processes and CI/CD Integration

Manual review process and automation through CI/CD integration: webhooks allow registering callbacks (e.g. for tests / deployment) on events in the Model Registry.

[diagram: model versions v1-v3 moving through Staging, Production, and Archived stages, managed by Data Scientists and Deployment Engineers]

● Stage-based access controls
● Request and approval workflow for stage transitions
● Webhooks for events like model creation, version creation, transition request, etc.
● Mechanisms to store results / metadata through tags and comments
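A minimal sketch of the registration and stage-transition APIs, assuming a hypothetical run that logged a model under the artifact path "model" and a registered model named churn_model:

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by a tracked run (the run ID is a placeholder here).
mv = mlflow.register_model("runs:/<run_id>/model", name="churn_model")

# Move the new version into Staging; webhooks can react to this transition.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model", version=mv.version, stage="Staging"
)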
©2023 Databricks Inc. — All rights reserved
MLFLOW MODEL REGISTRY
DEMO
©2023 Databricks Inc. — All rights reserved
MLFLOW
LAB
©2023 Databricks Inc. — All rights reserved
Decision Trees
©2023 Databricks Inc. — All rights reserved
Decision Making
Salary > $50,000? (root node)
├─ No → Decline Offer
└─ Yes → Commute > 1 hr?
    ├─ Yes → Decline Offer
    └─ No → Offers Free Coffee?
        ├─ Yes → Accept Offer (leaf node)
        └─ No → Decline Offer (leaf node)

Example offers:
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: Yes
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: No
©2023 Databricks Inc. — All rights reserved
Decision Making
Salary > $50,000?
├─ No → Decline Offer
└─ Yes → Commute > 1 hr?
    ├─ Yes → Decline Offer
    └─ No → Offers Free Coffee?
        ├─ Yes → Accept Offer
        └─ No → Salary > $60,000?
            ├─ Yes → Accept Offer
            └─ No → Decline Offer

Example offer: Salary: 61,000; Commute: 30 mins; Free Coffee: No
©2023 Databricks Inc. — All rights reserved
Decision Making
Tracing the example offer (Salary: 61,000; Commute: 30 mins; Free Coffee: No) through the tree:
Salary > $50,000? → Yes → Commute > 1 hr? → No → Offers Free Coffee? → No → Salary > $60,000? → Yes → Accept Offer
©2023 Databricks Inc. — All rights reserved
Determining Splits
Commute? (< 1 hr vs. > 1 hr)    Commute? (< 10 min vs. > 10 min)

1 hr is a better splitting point for Commute because it provides information about the classification.
©2023 Databricks Inc. — All rights reserved
Determining Splits
Commute? (< 1 hr vs. > 1 hr)    Bonus? (Yes vs. No)

Commute is a better choice because it provides information about the classification.
©2023 Databricks Inc. — All rights reserved
Creating Decision Boundaries
Salary > $50,000?
├─ No → Decline Offer
└─ Yes → Commute > 1 hr?
    ├─ Yes → Decline Offer
    └─ No → Accept Offer

[plot: decision boundaries in the Commute vs. Salary plane; Decline Offer when Salary ≤ $50,000 or Commute > 1 hour, Accept Offer otherwise]
©2023 Databricks Inc. — All rights reserved
Lines vs. Boundaries
Linear Regression
▪ Lines through data
▪ Assumed linear relationship

Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships

[plots: a fitted regression line vs. rectangular decision regions over Commute and Salary]
©2023 Databricks Inc. — All rights reserved
Linear Regression or Decision Tree?
It depends on the data...
©2023 Databricks Inc. — All rights reserved
Tree Depth
Tree Depth: the length of the longest path from the root node to a leaf node.

Salary > $50,000? (root node, depth 0)
├─ No → Decline Offer (depth 1)
└─ Yes → Commute > 1 hr? (depth 1)
    ├─ Yes → Decline Offer (depth 2)
    └─ No → Offers Free Coffee? (depth 2)
        ├─ Yes → Accept Offer (leaf node, depth 3)
        └─ No → Decline Offer (leaf node, depth 3)

Note: shallow trees tend to underfit, and deep trees tend to overfit.
©2023 Databricks Inc. — All rights reserved
Underfitting vs. Overfitting
Underfitting Just Right Overfitting
©2023 Databricks Inc. — All rights reserved
Additional Resource
R2D3 has an excellent visualization of
how decision trees work.
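A minimal Spark ML sketch showing how tree depth is capped, assuming a hypothetical train_df with a features vector column and a binary label column:

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=3)
dt_model = dt.fit(train_df)

print(dt_model.depth)          # depth of the fitted tree
print(dt_model.toDebugString)  # text view of the learned splits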
©2023 Databricks Inc. — All rights reserved
DECISION TREE DEMO
©2023 Databricks Inc. — All rights reserved
Random Forests
©2023 Databricks Inc. — All rights reserved
Decision Trees
Pros
▪ Interpretable
▪ Simple
▪ Classification/Regression
▪ Nonlinear relationships

Cons
▪ Poor accuracy
▪ High variance
©2023 Databricks Inc. — All rights reserved
Bias vs. Variance
©2023 Databricks Inc. — All rights reserved
Bias-Variance Tradeoff
Error = Variance + Bias² + noise

▪ Reduce Bias: build more complex models
▪ Reduce Variance: use a lot of data; build simple models
▪ What about the noise?

[plot: Total Error, Variance, and Bias² vs. Model Complexity; the optimum model complexity minimizes total error]
©2023 Databricks Inc. — All rights reserved
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
©2023 Databricks Inc. — All rights reserved
Bootstrap Sampling
A method for simulating N new datasets:
1. Take a sample with replacement from the original training set
2. Repeat N times
©2023 Databricks Inc. — All rights reserved
Bootstrap Visualization
[figure: Training Set (N = 100) resampled into Bootstrap 1-4, each with N = 100]
Why are some points in the bootstrapped
samples not selected?
©2023 Databricks Inc. — All rights reserved
Training Set Coverage
Assume we are bootstrapping N draws from a training set with N observations...
▪ Probability of an element getting picked in each draw: 1/N
▪ Probability of an element not getting picked in each draw: 1 - 1/N
▪ Probability of an element not getting picked in the entire sample: (1 - 1/N)^N

As N → ∞, the probability for each element of not getting picked in a sample approaches 1/e ≈ 0.368.
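A quick numeric check of that limit:

import math

# Probability that a given element is never drawn in N draws with replacement.
for n in (10, 100, 1_000, 100_000):
    print(n, (1 - 1 / n) ** n)

print("limit:", 1 / math.e)  # ≈ 0.368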
©2023 Databricks Inc. — All rights reserved
Bootstrap Aggregating
▪ Train a tree on each sample, and average the predictions
▪ This is bootstrap aggregating, commonly referred to as bagging

Bootstrap 1 → Decision Tree 1
Bootstrap 2 → Decision Tree 2
Bootstrap 3 → Decision Tree 3
Bootstrap 4 → Decision Tree 4
→ predictions combined into a Final Prediction
©2023 Databricks Inc. — All rights reserved
Random Forest Algorithm
Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K

At each split, a subset of features is considered to ensure each tree is different.
©2023 Databricks Inc. — All rights reserved
Random Forest Aggregation
Scoring record → each tree makes a prediction → Aggregation → Final Prediction
▪ Majority voting for classification
▪ Mean for regression
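A minimal Spark ML sketch, assuming hypothetical train_df/test_df DataFrames with a features vector column and a price label:

from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="price",
    numTrees=100,                  # number of bootstrapped trees
    featureSubsetStrategy="auto",  # features considered at each split
    seed=42,
)
rf_model = rf.fit(train_df)
preds = rf_model.transform(test_df)  # regression: mean of the trees' predictions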
©2023 Databricks Inc. — All rights reserved
RANDOM FOREST DEMO
©2023 Databricks Inc. — All rights reserved
Hyperparameter
Tuning
©2023 Databricks Inc. — All rights reserved
What is a Hyperparameter?
A parameter whose value is used to control the training process.

▪ Examples for Random Forest:
  ▪ Tree depth
  ▪ Number of trees
  ▪ Number of features to consider
©2023 Databricks Inc. — All rights reserved
Selecting Hyperparameter Values
▪ Build a model for each hyperparameter value
▪ Evaluate each model to identify the optimal hyperparameter value
▪ What dataset should we use to train and evaluate?

Training | Validation | Test

What if there isn't enough data to split into three separate sets?
©2023 Databricks Inc. — All rights reserved
K-Fold Cross Validation
Pass 1: [ Training | Training | Validation ]
Pass 2: [ Training | Validation | Training ]
Pass 3: [ Validation | Training | Training ]

Average the validation errors across passes to identify the optimal hyperparameter values.

Final Pass: train with the optimal hyperparameters on all training data, then evaluate on the held-out Test set.
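A minimal Spark ML sketch of 3-fold cross-validation over a small grid, assuming a hypothetical train_df with features and price columns:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor(featuresCol="features", labelCol="price")
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 8])
        .addGrid(rf.numTrees, [2, 4])
        .build())

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="price", metricName="rmse"),
    numFolds=3,
)
cv_model = cv.fit(train_df)
best_rf = cv_model.bestModel  # refit on all training data with the best params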
©2023 Databricks Inc. — All rights reserved
HYPERPARAMETER TUNING
DEMO
©2023 Databricks Inc. — All rights reserved
Optimizing Hyperparameter Values
Grid Search
▪ Train and validate every unique combination of hyperparameters

Hyperparameter values
▪ Tree Depth: 5, 8
▪ Number of Trees: 2, 4

Resulting combinations
▪ Tree Depth = 5, Number of Trees = 2
▪ Tree Depth = 5, Number of Trees = 4
▪ Tree Depth = 8, Number of Trees = 2
▪ Tree Depth = 8, Number of Trees = 4
Question: With 3-fold cross validation, how many models will this build?
©2023 Databricks Inc. — All rights reserved
HYPERPARAMETER TUNING
LAB
©2023 Databricks Inc. — All rights reserved
Hyperparameter
Tuning
with Hyperopt
©2023 Databricks Inc. — All rights reserved
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have many hyperparameters to tune
▪ You want to pick your hyperparameters based on past results
©2023 Databricks Inc. — All rights reserved
Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces (real-valued,
discrete, and conditional dimensions)
▪ Supports serial or parallel optimization
▪ Spark integration
▪ Core algorithms for optimization:
▪ Random Search
▪ Adaptive Tree of Parzen Estimators (TPE)
©2023 Databricks Inc. — All rights reserved
Optimizing Hyperparameter Values
Random Search
Generally outperforms grid search
©2023 Databricks Inc. — All rights reserved
Optimizing Hyperparameter Values
Tree of Parzen Estimators
▪ Bayesian process
▪ Creates a meta-model that maps hyperparameters to the probability of a score on the objective function
▪ Provide a range and distribution for continuous and discrete values
▪ Adaptive TPE better tunes the search space by:
  ▪ Freezing hyperparameters
  ▪ Tuning the number of random trials before TPE
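A minimal Hyperopt sketch using TPE with SparkTrials to distribute trials; train_and_score is a hypothetical helper that trains a model with the given hyperparameters and returns a validation loss:

from hyperopt import SparkTrials, fmin, hp, tpe

def objective(params):
    # Hypothetical helper: trains a model and returns the loss to minimize.
    return train_and_score(params)

search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,                   # Tree of Parzen Estimators
    max_evals=32,
    trials=SparkTrials(parallelism=4),  # run trials in parallel on the cluster
)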
©2023 Databricks Inc. — All rights reserved
HYPEROPT
DEMO
©2023 Databricks Inc. — All rights reserved
HYPEROPT
LAB
©2023 Databricks Inc. — All rights reserved
AutoML
©2023 Databricks Inc. — All rights reserved
Databricks AutoML
A glass-box solution that empowers data teams without taking away control

▪ UI and API to start AutoML training
▪ MLflow experiment: auto-created MLflow Experiment to track metrics; easily deploy models to the Model Registry
▪ Data exploration notebook: generated notebook with feature summary statistics and distributions to understand and debug data quality and preprocessing
▪ Reproducible trial notebooks: generated notebooks with source code for every model, so you can iterate further on models from AutoML, adding your expertise
©2023 Databricks Inc. — All rights reserved
AutoML Solves Two Key Pain Points for Data Scientists

Quickly verify the predictive power of a dataset
▪ A Marketing Team hands a dataset to the Data Science Team: "Can this dataset be used to predict customer churn?"

Get a baseline model to guide project direction
▪ The Data Science Team builds a baseline model from the dataset: "What direction should I go in for this ML project, and what benchmark should I aim to beat?"
©2023 Databricks Inc. — All rights reserved
Problems with Existing AutoML Solutions
Opaque-Box and Production Cliff Problems in AutoML

[diagram: AutoML Configuration → AutoML Training ("opaque box") → Returned Best Model → production cliff → Deployed Model]

Problems
1. A "production cliff" exists where data scientists need to modify the returned "best" model using their domain expertise before deployment.
2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR, etc.), and most AutoML solutions have "opaque box" models.

Result / Pain Points
● The "best" model returned is often not good enough to deploy.
● Data scientists must spend time and energy reverse engineering these "opaque-box" returned models so that they can modify them and/or explain them.
©2023 Databricks Inc. — All rights reserved
“Glass-Box” AutoML
Configure → Train and Evaluate with a UI → Customize → Deploy
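A minimal API sketch, assuming the databricks-automl client available on the Databricks ML Runtime and a hypothetical Spark DataFrame df with a numeric price target:

from databricks import automl

summary = automl.regress(dataset=df, target_col="price", timeout_minutes=30)

print(summary.best_trial.metrics)       # metrics of the best run
print(summary.best_trial.notebook_url)  # generated notebook for that trial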
©2023 Databricks Inc. — All rights reserved
AutoML Lab
©2023 Databricks Inc. — All rights reserved
Feature Store
©2023 Databricks Inc. — All rights reserved
Feature Store
The first feature store codesigned with a data and MLOps platform

[diagram: Feature Registry and Feature Provider serving features in Batch (high throughput) and Online (low latency) modes]

Feature Registry
▪ Discoverability and reusability
▪ Versioning
▪ Upstream and downstream lineage

Feature Provider
▪ Batch and online access to features
▪ Feature lookup packaged with models
▪ Simplified deployment process

Co-designed with Delta Lake
▪ Open format
▪ Built-in data versioning and governance
▪ Native access through PySpark, SQL, etc.

Co-designed with MLflow
▪ Open model format that supports all ML frameworks
▪ Feature version and lookup logic hermetically logged with the model
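A minimal sketch of the Feature Store client on the Databricks ML Runtime, assuming a hypothetical features_df keyed by user_id and a table name chosen for illustration:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

fs.create_table(
    name="demo_db.user_features",   # hypothetical database.table name
    primary_keys=["user_id"],
    df=features_df,
    description="Demo user-level features",
)

features = fs.read_table("demo_db.user_features")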
©2023 Databricks Inc. — All rights reserved
Gradient Boosted
Decision Trees
©2023 Databricks Inc. — All rights reserved
Decision Tree Ensembles
▪ Combine many decision trees
▪ Random Forest
  ▪ Bagging
  ▪ Independent trees
  ▪ Results aggregated to a final prediction
▪ There are other methods of ensembling decision trees

[diagram: Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K]
©2023 Databricks Inc. — All rights reserved
Boosting
▪ Sequential (one tree at a time)
▪ Each tree learns from the last
▪ The sequence of trees is the final model

[diagram: trees trained sequentially on the full training data]
©2023 Databricks Inc. — All rights reserved
Gradient Boosted Decision Trees
▪ Common boosted trees algorithm
▪ Fits each tree to the residuals of the previous tree
▪ On the first iteration, residuals are the actual label values
Model 1:
Y  | Prediction | Residual
40 | 35         | 5
60 | 67         | -7
30 | 28         | 2
33 | 32         | 1

Model 2 (fit to Model 1's residuals):
Y  | Prediction | Residual
5  | 3          | 2
-7 | -4         | -3
2  | 3          | -1
1  | 0          | 1

Final Prediction (Model 1 + Model 2):
Y  | Prediction
40 | 38
60 | 63
30 | 31
33 | 32
©2023 Databricks Inc. — All rights reserved
Boosting vs. Bagging
GBDT
▪ Starts with high bias, low variance
▪ Works right on the model-complexity curve, reducing bias

RF
▪ Starts with high variance, low bias
▪ Works left on the model-complexity curve, reducing variance

[plot: Total Error, Variance, and Bias² vs. Model Complexity, with the optimum in between]
©2023 Databricks Inc. — All rights reserved
Gradient Boosted Decision Tree Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Pre-installed in Databricks Runtime for ML (Python & Scala APIs)
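A minimal sketch of distributed XGBoost training with the xgboost.spark estimator (available in recent XGBoost releases), assuming hypothetical train_df/test_df DataFrames with a features vector column and a price label:

from xgboost.spark import SparkXGBRegressor

xgb = SparkXGBRegressor(
    features_col="features",
    label_col="price",
    num_workers=2,      # parallelize training across Spark tasks
    max_depth=6,
    n_estimators=200,
)
xgb_model = xgb.fit(train_df)
preds = xgb_model.transform(test_df)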
©2023 Databricks Inc. — All rights reserved
XGBOOST DEMO
©2023 Databricks Inc. — All rights reserved
Appendix
©2023 Databricks Inc. — All rights reserved
ML Deployment
Options
©2023 Databricks Inc. — All rights reserved
What is ML Deployment?
▪ Data Science != Data Engineering
▪ Data science is scientific
▪ Business problems → data problems
▪ Model mathematically
▪ Optimize performance
▪ Data engineers are concerned with
▪ Reliability
▪ Scalability
▪ Maintainability
▪ SLAs
▪ ...
©2023 Databricks Inc. — All rights reserved
DevOps vs. ModelOps
▪ DevOps = software development + IT operations
▪ Manages deployments
▪ CI/CD of features, patches, updates, rollbacks
▪ ModelOps = data modeling + deployment operations
▪ Artifact management (Continuous Training)
▪ Model performance monitoring (Continuous Monitoring)
▪ Data management
▪ Use of containers and managed services
©2023 Databricks Inc. — All rights reserved
The Four Deployment Paradigms
1. Batch
▪ 80-90% of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
2. Streaming (continuous)
▪ 10-15% of deployments
▪ Moderately fast scoring on new data
3. Real Time
▪ 5-10% of deployments
▪ Usually using REST (Azure ML, SageMaker, containers)
4. On-device (edge)
©2023 Databricks Inc. — All rights reserved
Latency Requirements (roughly)
[scale: Real Time ≈ 10 ms - 100 ms | Streaming ≈ 100 ms - 1 min | Batch ≈ 1 hour - 1 day]
©2023 Databricks Inc. — All rights reserved
Overview of a typical Databricks CI/CD
pipeline
Continuous integration and continuous delivery: Code → Build → Release → Deploy → Test → Operate
See CI/CD Templates for a starting point
©2023 Databricks Inc. — All rights reserved
Logistic Regression
©2023 Databricks Inc. — All rights reserved
Types of Supervised Learning
Regression
▪ Predicting a continuous output

Classification
▪ Predicting a categorical/discrete output
©2023 Databricks Inc. — All rights reserved
Types of Classification
Binary Classification: two label classes
Multiclass Classification: three or more label classes

Model output is commonly the probability of a record belonging to each of the classes.
©2023 Databricks Inc. — All rights reserved
Binary Classification
Two label classes.

▪ Outputs:
  ▪ Probability that the record is Green given a set of features
  ▪ Probability that the record is Red given a set of features
▪ Reminders:
  ▪ Probabilities are bounded between 0 and 1
  ▪ Linear regression returns any real number
©2023 Databricks Inc. — All rights reserved
Bounding Binary Classification Probabilities
How can we keep model outputs between 0 and 1?
▪ Logistic function: σ(z) = 1 / (1 + e^(-z))
  ▪ Large positive inputs → 1
  ▪ Large negative inputs → 0
©2023 Databricks Inc. — All rights reserved
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly
complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
But we need class predictions, not probability predictions
▪ Set a threshold on the probability predictions
▪ 𝐏[y = 1 | x] < 0.5 → y = 0
▪ 𝐏[y = 1 | x] ≥ 0.5 → y = 1
©2023 Databricks Inc. — All rights reserved
Evaluating Binary Classification Models
▪ How can the model be wrong?
▪ Type I Error: False Positive
▪ Type II Error: False Negative
▪ Representing these errors with a confusion matrix.
©2023 Databricks Inc. — All rights reserved
Binary Classification Metrics
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
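A minimal Spark ML sketch computing these metrics, assuming hypothetical train_df/test_df DataFrames with a features vector column and a binary label column:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

preds = (LogisticRegression(featuresCol="features", labelCol="label")
         .fit(train_df)
         .transform(test_df))

evaluator = MulticlassClassificationEvaluator(labelCol="label")
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    print(metric, evaluator.setMetricName(metric).evaluate(preds))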
©2023 Databricks Inc. — All rights reserved
Collaborative Filtering
©2023 Databricks Inc. — All rights reserved
Recommendation Systems
©2023 Databricks Inc. — All rights reserved
Naive Approaches to Recommendation
▪ Hand-curated
▪ Aggregates
Question: What are problems with these
approaches?
©2023 Databricks Inc. — All rights reserved
Content-based Recommendation
▪ Idea: Recommend items to a customer that are similar to other
items the customer liked
▪ Creates a profile for each user or product
▪ User: demographic info, ratings, etc.
▪ Item: genre, flavor, brand, actor list, etc.
©2023 Databricks Inc. — All rights reserved
Content-based Recommendation
▪ Advantages
▪ No need for data from other users
▪ New item recommendations
▪ Disadvantages
▪ Cold-start problem
▪ Determining appropriate features
▪ Implicit information
©2023 Databricks Inc. — All rights reserved
Collaborative Filtering
▪ Idea: Make recommendations for one customer (filtering) by
collecting and analyzing the interests of many users
(collaboration)
▪ Advantages over content-based recommendation
▪ Relies only on past user behavior (no profile creation)
▪ Domain independent
▪ Generally more accurate
▪ Disadvantages
▪ Extremely susceptible to cold-start problem (user and item)
©2023 Databricks Inc. — All rights reserved
Types of Collaborative Filtering
▪ Neighborhood Methods: compute relationships between items or users
  ▪ Computationally expensive
  ▪ Not empirically as good
▪ Latent Factor Models: explain the ratings by characterizing items and users by a small number of inferred factors
  ▪ Matrix factorization
    ▪ Characterizes both items and users by vectors of factors inferred from the item-rating pattern
    ▪ Explicit feedback: sparse matrix
    ▪ Scalable
©2023 Databricks Inc. — All rights reserved
Latent Factor Approach
©2023 Databricks Inc. — All rights reserved
Ratings Matrix
©2023 Databricks Inc. — All rights reserved
Matrix Factorization
©2023 Databricks Inc. — All rights reserved
Alternating Least Squares
▪ Step 1: Randomly initialize user and movie factors
▪ Step 2: Repeat the following
1. Fix the movie factors, and optimize user factors
2. Fix the user factors, and optimize movie factors
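A minimal Spark ML sketch of ALS, assuming a hypothetical ratings_df with userId, movieId, and rating columns:

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    rank=10,                   # number of latent factors
    maxIter=10,
    coldStartStrategy="drop",  # drop predictions for unseen users/items
)
als_model = als.fit(ratings_df)
user_recs = als_model.recommendForAllUsers(5)  # top-5 item recommendations per user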
©2023 Databricks Inc. — All rights reserved