Data Engineering Fundamentals
Pandas
vs PySpark
Eren Han
LOAD CSV

Pandas:
df = pd.read_csv('sample.csv')

PySpark:
df = spark.read \
    .options(header=True, inferSchema=True) \
    .csv('sample.csv')
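A minimal pandas sketch of the load step above, using an in-memory buffer in place of a file on disk (the `sample.csv` contents here are hypothetical):

```python
import io
import pandas as pd

# In-memory stand-in for 'sample.csv' (hypothetical contents).
csv_text = "name,score\nalice,75\nbob,90\n"

# read_csv infers column types by default, much like inferSchema=True in Spark.
df = pd.read_csv(io.StringIO(csv_text))
```

With inference on, `score` comes back as an integer column rather than strings.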
VIEW DATAFRAME

Pandas:
df
df.head(10)

PySpark:
df.show()
df.show(10)
CHECK COLUMNS AND DATA TYPES

Pandas:
df.columns
df.dtypes

PySpark:
df.columns
df.dtypes
RENAME COLUMNS

Pandas:
df.columns = ["x", "y", "z"]
df.rename(columns={"old": "new"})

PySpark:
df.toDF("x", "y", "z")
df.withColumnRenamed("old", "new")
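A quick pandas sketch of the rename above (column names hypothetical):

```python
import pandas as pd

# Hypothetical single-column frame.
df = pd.DataFrame({"old": [1, 2]})

# rename returns a new frame; the original is left untouched
# unless you pass inplace=True.
renamed = df.rename(columns={"old": "new"})
```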
DROP COLUMN

Pandas:
df.drop("column", axis=1)

PySpark:
df.drop("column")
FILTERING

Pandas:
df[df.column < 80]
df[(df.column < 80) & (df.column2 == 50)]

PySpark:
df[df.column < 80]
df[(df.column < 80) & (df.column2 == 50)]

Note: The bracket syntax is identical; PySpark also accepts df.filter(df.column < 80).
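A runnable pandas sketch of the boolean-mask filtering above (the data is hypothetical; the same expressions work unchanged on a Spark DataFrame):

```python
import pandas as pd

# Hypothetical table to filter on.
df = pd.DataFrame({"column": [70, 85, 60], "column2": [50, 50, 10]})

# Single condition, then two conditions combined with & (note the parentheses:
# & binds tighter than the comparisons, so each clause must be wrapped).
under_80 = df[df.column < 80]
both = df[(df.column < 80) & (df.column2 == 50)]
```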
ADD COLUMN

Pandas:
df["new"] = 1 / df.column
Note: Division by zero is infinite.

PySpark:
df.withColumn("new", 1 / df.column)
Note: Division by zero is NULL.
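The division-by-zero divergence on the pandas side can be seen directly (hypothetical data; in Spark the same row would come back NULL instead):

```python
import pandas as pd

# Hypothetical column containing a zero.
df = pd.DataFrame({"column": [2.0, 0.0, 4.0]})

# Pandas follows NumPy semantics: 1/0.0 becomes inf, not an error and not NULL.
df["new"] = 1 / df.column
```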
FILL NULLS

Pandas:
df.fillna(0)

PySpark:
df.fillna(0)
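A small pandas sketch of fillna (hypothetical data; the call is spelled the same in PySpark):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value.
df = pd.DataFrame({"sales": [10.0, np.nan, 30.0]})

# fillna returns a new frame with every NaN/NULL replaced by 0.
filled = df.fillna(0)
```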
AGGREGATION

Pandas:
df.groupby(["date", "product"]) \
  .agg({"sales": "mean", "revenue": "max"})

PySpark:
df.groupby(["date", "product"]) \
  .agg({"sales": "mean", "revenue": "max"})
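A runnable pandas version of the groupby-agg above, on hypothetical sales data:

```python
import pandas as pd

# Hypothetical sales data grouped by date and product.
df = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "product": ["a", "a", "b"],
    "sales": [10, 20, 30],
    "revenue": [100, 200, 300],
})

# Dict-style agg: one aggregation function per output column.
agg = df.groupby(["date", "product"]).agg({"sales": "mean", "revenue": "max"})
```

In pandas the group keys become a MultiIndex; in PySpark they stay as ordinary columns of the result.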
STANDARD TRANSFORMATIONS

Pandas:
import numpy as np
df["logcolumn"] = np.log(df.column)

PySpark:
import pyspark.sql.functions as F
df.withColumn("logcolumn", F.log(df.column))
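The pandas/NumPy side of the log transform, runnable as-is (hypothetical values chosen so the expected logs are easy to check):

```python
import numpy as np
import pandas as pd

# Hypothetical positive-valued column.
df = pd.DataFrame({"column": [1.0, np.e, np.e ** 2]})

# np.log is vectorised over the whole Series, much like F.log over a Spark column.
df["logcolumn"] = np.log(df.column)
```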
CONDITIONAL STATEMENTS

Pandas:
df["cond"] = df.apply(lambda x: 1 if x.col1 > 20
                      else 2 if x.col2 == 6
                      else 3, axis=1)

PySpark:
import pyspark.sql.functions as F
df.withColumn("cond",
    F.when(df.col1 > 20, 1)
     .when(df.col2 == 6, 2)
     .otherwise(3))
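A runnable pandas sketch of the row-wise conditional above, with hypothetical data covering all three branches:

```python
import pandas as pd

# Hypothetical frame: one row per branch of the conditional.
df = pd.DataFrame({"col1": [25, 5, 5], "col2": [0, 6, 0]})

# The first matching condition wins, mirroring chained F.when calls:
# col1 > 20 -> 1, else col2 == 6 -> 2, else 3.
df["cond"] = df.apply(
    lambda x: 1 if x.col1 > 20 else 2 if x.col2 == 6 else 3, axis=1
)
```

For large frames, vectorised alternatives like `np.select` are much faster than `apply` with `axis=1`.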
MERGE / JOIN DATAFRAMES

Pandas:
df.merge(df2, on="key")
df.merge(df2, left_on="a", right_on="b")

PySpark:
df.join(df2, on="key")
df.join(df2, df.a == df2.b)
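A minimal pandas sketch of the shared-key join above (both frames are hypothetical):

```python
import pandas as pd

# Two hypothetical frames sharing a key column.
df = pd.DataFrame({"key": [1, 2], "left_val": ["a", "b"]})
df2 = pd.DataFrame({"key": [2, 3], "right_val": ["x", "y"]})

# merge defaults to an inner join, as does PySpark's join:
# only key=2 appears in both frames, so only that row survives.
joined = df.merge(df2, on="key")
```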
SUMMARY STATISTICS

Pandas:
df.describe()

PySpark:
df.describe().show()
Note: Only count, mean, stddev, min, max.
CHANGE DATA TYPES

Pandas:
df['A'] = df['A'].astype(int)

PySpark:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col
df = df.withColumn('A', col('A').cast(IntegerType()))
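The pandas cast above, runnable on a hypothetical string column:

```python
import pandas as pd

# Hypothetical column of numeric strings.
df = pd.DataFrame({"A": ["1", "2", "3"]})

# astype(int) parses each string into an integer dtype.
df["A"] = df["A"].astype(int)
```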
Thank you for reading. I hope you enjoyed it.

Eren Han