Get Started with PySpark Programming
©2022 Databricks Inc. — All rights reserved 1
Module Agenda
Get Started with PySpark Programming
Spark SQL Overview
DE 0.1 - Spark SQL
DE 0.2L - Spark SQL Lab
DE 0.3 - DataFrame & Column
DE 0.4L - Purchase Revenues Lab
DE 0.5 - Aggregation
DE 0.6L - Revenue by Traffic Lab
Spark SQL Overview
Spark SQL is a module for structured data processing with multiple interfaces
SQL
DataFrame API (Python, Scala, Java, R)
The same Spark SQL query can be expressed with SQL and the DataFrame API

SQL:
SELECT id, result
FROM exams
WHERE result > 70
ORDER BY result

DataFrame API (Python):
(spark.table("exams")
    .select("id", "result")
    .where("result > 70")
    .orderBy("result"))
Spark SQL executes all queries on the same engine
SQL Queries, Python DataFrame API, Scala DataFrame API -> Query Plans -> RDDs -> Execution
Spark SQL optimizes queries before execution
Query Plan -> Optimized Query Plan -> RDDs -> Execution