Apache Spark Programming With Databricks
Introductions
©2022 Databricks Inc. — All rights reserved
Welcome!
Let’s get to know you
▪ Name
▪ Role and team
▪ Programming experience
▪ Motivation for attending
▪ Personal interest
Spark Core
Databricks Ecosystem
Spark Overview
Spark SQL
Reader & Writer
DataFrame & Column
Lakehouse Platform
✓ Simple
✓ Open
✓ Collaborative
Workloads: Data Engineering, BI and SQL Analytics, Data Science and ML, Real-Time Data Applications
Centralized Governance
Integrations shown: Azure Data Factory, Synapse, Google BigQuery, Amazon Redshift, Google AI Platform, AWS Glue
Data Providers
[Slide figures, labels not recovered: 5000+ ... across the globe; 450+]
Spark components: Spark SQL + DataFrames, Streaming, MLlib
Spark application → Jobs → Stages (Stage 1, Stage 2) → Tasks (Task 1, Task 2)
[Diagram: tasks assigned to cores, one task per core]
[Diagram: example DataFrame with columns Name, Score, ID]
Functions
Aggregation
Datetimes
Complex Types
Additional Functions
User-Defined Functions
Performance
Spark Architecture
Query Optimization
Partitioning
Driver
Executor
Core
Partition
[Diagram sequence: records A to L held in partitions and redistributed across cores; per-partition counts 5, 6, 4, 5 combined into a total of 20]
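The numbers in the partition diagram (5, 6, 4, 5, then 20) are consistent with a count that is computed per partition and then combined into one total. The following is a minimal sketch of that idea in plain Python, not Spark code. The partition contents and the names `partial_counts` and `merge_counts` are made up for illustration.

```python
def partial_counts(partitions):
    """Count the rows of each partition independently (one task per partition)."""
    return [len(rows) for rows in partitions]


def merge_counts(partials):
    """Combine the per-partition counts into a single total."""
    return sum(partials)


# Hypothetical data: four partitions holding 5, 6, 4 and 5 rows
partitions = [list(range(5)), list(range(6)), list(range(4)), list(range(5))]

partials = partial_counts(partitions)
total = merge_counts(partials)
print(partials, total)
```

Only the small per-partition results need to move between tasks in this model, which is why a partial aggregation before the exchange keeps the amount of shuffled data low.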
Catalyst optimizer (diagram): Query → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → Selected Physical Plan → RDDs
Phases: ANALYSIS, PHYSICAL PLANNING, WHOLE-STAGE CODE GENERATION
Inputs: Metadata Catalog, Catalyst Catalog, Cost Model
With ADAPTIVE QUERY EXECUTION: Runtime Statistics feed back into planning
Structured Streaming
Streaming Query
Stream Aggregations
Advantages
Use Cases
Sources
▪ Notifications
▪ Real-time reporting
▪ Incremental ETL
▪ Update data to serve in real-time
▪ Real-time decision making
▪ Online ML
Micro-Batch Processing
Micro-batches = new rows appended to unbounded table
Sources: Files; others FOR TESTING
Sinks: Files, Foreach; others FOR DEBUGGING
APPEND: Add new records only
UPDATE: Update changed records in place
COMPLETE: Rewrite full output
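To make the difference between the output modes concrete, here is a simplified model in plain Python of a running count over micro-batches. It is a sketch of the semantics only, not Spark's implementation, and the function name `run_modes` and the sample keys are hypothetical. APPEND is left out because, for an aggregation, it emits a row only once that row is final, which requires a watermark.

```python
from collections import Counter


def run_modes(batches):
    """Return what COMPLETE and UPDATE would emit after each micro-batch."""
    totals = Counter()              # running count per key (the result table)
    complete_out, update_out = [], []
    for batch in batches:
        changed = set()
        for key in batch:
            totals[key] += 1
            changed.add(key)
        # COMPLETE: rewrite the full result table every time
        complete_out.append(dict(totals))
        # UPDATE: emit only the rows whose value changed in this batch
        update_out.append({k: totals[k] for k in changed})
    return complete_out, update_out


# Hypothetical stream of device ids arriving in two micro-batches
complete_out, update_out = run_modes([["a", "b", "a"], ["b", "c"]])
print(complete_out)
print(update_out)
```

In the second batch, COMPLETE still includes the unchanged row for `a`, while UPDATE carries only `b` and `c`.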
Idempotent sinks
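from pyspark.sql.functions import col, window

# Count events per device in 1-hour tumbling windows on event time
(streamingDF
  .groupBy(col("device"),
           window(col("time"), "1 hour"))
  .count())

# Set the number of shuffle partitions to the cluster's default parallelism
spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

# Same aggregation with a 2-hour watermark, which bounds the state kept for late data
(streamingDF
  .withWatermark("time", "2 hours")
  .groupBy(col("device"),
           window(col("time"), "1 hour"))
  .count())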
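The effect of a 2-hour watermark on 1-hour windows can be traced by hand with a small simulation. The following is a simplified model in plain Python, not Spark's implementation. It assumes the watermark for a batch is the maximum event time seen in earlier batches minus the delay, and that an event is dropped when its window has already ended at or before the watermark. The names `run`, `window_start` and `t`, and the sample events, are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)   # mirrors window(col("time"), "1 hour")
DELAY = timedelta(hours=2)    # mirrors withWatermark("time", "2 hours")


def window_start(ts):
    """Start of the 1-hour tumbling window that contains ts."""
    return ts.replace(minute=0, second=0, microsecond=0)


def run(batches):
    counts = Counter()        # (device, window start) -> count
    dropped = []              # events that arrived too late
    max_event_time = None
    for batch in batches:
        # Watermark for this batch is based on event times from earlier batches
        watermark = None if max_event_time is None else max_event_time - DELAY
        for device, ts in batch:
            start = window_start(ts)
            if watermark is not None and start + WINDOW <= watermark:
                dropped.append((device, ts))
                continue
            counts[(device, start)] += 1
        # Advance the maximum event time after the batch is processed
        for _, ts in batch:
            if max_event_time is None or ts > max_event_time:
                max_event_time = ts
    return counts, dropped


def t(hour, minute):
    return datetime(2022, 1, 1, hour, minute)


batches = [
    [("a", t(10, 5)), ("a", t(10, 40))],
    [("a", t(13, 10))],
    [("a", t(10, 50)), ("a", t(12, 30))],   # 10:50 arrives late
]
counts, dropped = run(batches)
print(dict(counts))
print(dropped)
```

By the third batch the watermark is 11:10, so the 10:00 to 11:00 window is closed and the 10:50 event is discarded, while the 12:30 event is still counted. This model never evicts old window state. In a real query, bounding that state is the main reason to set a watermark.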
Delta Lake
Using Spark with Delta Lake
Delta Lake: storage layer, Delta tables, Delta Engine
▪ Ensures consistency
▪ Auto-optimized writes