CHEATSHEET: PANDAS VS PYSPARK
Vanessa Afolabi
Import Libraries and Set System Options:
  PANDAS                                            PYSPARK
  import pandas as pd                               from pyspark.sql.types import *
  pd.options.display.max_colwidth = 1000            from pyspark.sql.functions import *
                                                    from pyspark.sql import SQLContext
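
  The PySpark snippets in this sheet assume a live SparkContext named sc.
  A minimal setup sketch (local mode; the app name is illustrative):
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext('local[*]', 'cheatsheet')  # assumption: local Spark install
    sqlContext = SQLContext(sc)                  # the same object as SQLContext(sc) below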
Define and create a dataset:
  PANDAS                                            PYSPARK
  data = {'col1': [ , , ], 'col2': [ , , ]}         StructField('Col1', IntegerType())
  df = pd.DataFrame(data, columns=['col1','col2'])  StructField('Col2', StringType())
                                                    schema = StructType([list of StructFields])
                                                    df = SQLContext(sc).createDataFrame(sc.emptyRDD(), schema)
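
  Example building the same three-row table both ways (the sample values and
  the names pdf/sdf are illustrative):
    data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    pdf = pd.DataFrame(data, columns=['col1', 'col2'])

    schema = StructType([StructField('col1', IntegerType()),
                         StructField('col2', StringType())])
    sdf = SQLContext(sc).createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], schema)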
Read and Write to CSV:
  PANDAS                                            PYSPARK
  pd.read_csv()                                     SQLContext(sc).read.csv()
  df.to_csv()                                       df.toPandas().to_csv()
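
  Example round trip (the paths and the header/inferSchema options are illustrative):
    pdf = pd.read_csv('data.csv')
    pdf.to_csv('out.csv', index=False)

    sdf = SQLContext(sc).read.csv('data.csv', header=True, inferSchema=True)
    sdf.toPandas().to_csv('out.csv', index=False)  # write via pandas, as above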
Indexing and Splitting:
  PANDAS                                            PYSPARK
  df.loc[ ]                                         df.randomSplit(weights=[ ], seed=n)
  df.iloc[ ]
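
  Example (the weights and seed are illustrative; randomSplit returns a list
  of DataFrames, unpacked here into two):
    first_two = pdf.iloc[0:2]       # rows selected by position
    col1_only = pdf.loc[:, 'col1']  # rows/columns selected by label

    train, test = sdf.randomSplit(weights=[0.8, 0.2], seed=42)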
Inspect Data:
  PANDAS                                            PYSPARK
  df.head()                                         df.show()
  df.head(n)
  df.columns                                        df.printSchema()
                                                    df.columns
  df.shape                                          df.count()
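
  Example (n is illustrative; note that df.count() returns only the row count,
  so pair it with len(df.columns) for a pandas-style shape):
    pdf.head(5)        # first 5 rows
    print(pdf.shape)   # (rows, columns)

    sdf.show(5)        # print first 5 rows
    sdf.printSchema()  # column names and types
    print(sdf.count(), len(sdf.columns))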
Handling Duplicate Data:
  PANDAS                                            PYSPARK
  df['col'].unique()                                df.distinct().count()
  df.duplicated()
  df.drop_duplicates()                              df.dropDuplicates()
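
  Example de-duplicating on a single column (the subset is illustrative):
    pdf = pdf.drop_duplicates(subset=['col1'])  # keeps the first occurrence
    sdf = sdf.dropDuplicates(['col1'])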
Rename Columns:
  PANDAS                                            PYSPARK
  df.rename(columns={'old_col': 'new_col'})         df.withColumnRenamed('old_col', 'new_col')
Handling Missing Data:
  PANDAS                                            PYSPARK
  df.dropna()                                       df.na.drop()
  df.fillna()                                       df.na.fill()
  df.replace()                                      df.na.replace()
  df['col'].isna()                                  df.col.isNull()
  df['col'].isnull()
  df['col'].notna()                                 df.col.isNotNull()
  df['col'].notnull()
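
  Example (the fill value 0 is illustrative):
    pdf = pdf.dropna()                   # drop rows containing NaN
    pdf['col1'] = pdf['col1'].fillna(0)

    sdf = sdf.na.drop()                  # drop rows containing null
    sdf = sdf.na.fill(0)                 # fill numeric nulls with 0
    sdf = sdf.filter(sdf.col2.isNotNull())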
Common Column Functions:
  PANDAS                                            PYSPARK
  df['col'] = df['col'].str.lower()                 df = df.withColumn('col', lower(df.col))
  df['col'] = df['col'].str.replace()               df = df.select('*', regexp_replace().alias())
                                                    df = df.select('*', regexp_extract().alias())
  df['col'] = df['col'].str.split()                 df = df.withColumn('col', split('col'))
  df['col'] = df['col'].str.join()                  df = df.withColumn('col', UDF_JOIN(df.col, lit(' ')))
  df['col'] = df['col'].str.strip()                 df = df.withColumn('col', trim(df.col))
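
  Example lower-casing, trimming, and regex-replacing a string column
  (the pattern 'a' -> 'b' and the column names are illustrative):
    pdf['col2'] = pdf['col2'].str.lower().str.strip()

    sdf = sdf.withColumn('col2', trim(lower(sdf.col2)))
    sdf = sdf.select('*', regexp_replace(sdf.col2, 'a', 'b').alias('col2_sub'))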
Apply User Defined Functions:
  PANDAS                                            PYSPARK
  df['col'] = df['col'].map(UDF)                    df = df.withColumn('col', UDF(df.col))
  df.apply(f)                                       df = df.withColumn('col', when(cond, UDF(df.col)).otherwise())
  df.applymap(f)
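
  The UDF placeholders above stand for a function you register yourself.
  A minimal sketch (the lambda and all names are illustrative):
    from pyspark.sql.functions import udf, when
    from pyspark.sql.types import StringType

    shout = udf(lambda s: s.upper() if s else s, StringType())

    pdf['col2'] = pdf['col2'].map(lambda s: s.upper())  # pandas equivalent
    sdf = sdf.withColumn('col2', shout(sdf.col2))
    sdf = sdf.withColumn('col2',
                         when(sdf.col1 > 1, shout(sdf.col2)).otherwise(sdf.col2))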
Join two dataset columns:
  PANDAS                                            PYSPARK
  df['new_col'] = df['col1'] + df['col2']           df = df.withColumn('new_col', concat_ws(' ', df.col1, df.col2))
                                                    df.select('*', concat(df.col1, df.col2).alias('new_col'))
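
  Example joining two columns with a space separator (concat_ws, unlike
  concat, takes the separator first; the column names are illustrative):
    pdf['new_col'] = pdf['col1'].astype(str) + ' ' + pdf['col2']

    sdf = sdf.withColumn('new_col', concat_ws(' ', sdf.col1, sdf.col2))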
Convert dataset column to a list:
  PANDAS                                            PYSPARK
  list(df['col'])                                   df.select('col').rdd.flatMap(lambda x: x).collect()
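
  Example (the column name is illustrative):
    vals = list(pdf['col2'])  # or pdf['col2'].tolist()
    vals = sdf.select('col2').rdd.flatMap(lambda x: x).collect()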
Filter Dataset:
  PANDAS                                            PYSPARK
  df = df[df['col'] != ' ']                         df = df[df['col'] == val]
                                                    df = df.filter(df['col'] == val)
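
  Example keeping rows where col1 equals 2 (the value is illustrative):
    pdf = pdf[pdf['col1'] == 2]

    sdf = sdf.filter(sdf['col1'] == 2)  # the df[df['col'] == val] form also works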
Select Columns:
  PANDAS                                            PYSPARK
  df = df[['col1','col2','col3']]                   df = df.select('col1', 'col2', 'col3')
Drop Columns:
  PANDAS                                            PYSPARK
  df.drop(['B','C'], axis=1)                        df.drop('col1', 'col2')
  df.drop(columns=['B','C'])
Grouping Data:
  PANDAS                                            PYSPARK
  df.groupby(by=['col1','col2']).count()            df.groupBy('col').count().show()
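
  Example counting rows per group (the grouping column is illustrative):
    pdf.groupby(by=['col2']).count()

    sdf.groupBy('col2').count().show()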
Combining Data:
  PANDAS                                            PYSPARK
  pd.concat([df1, df2])                             df1.union(df2)
  df1.append(df2)
  df1.join(df2)                                     df1.join(df2)
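
  Example, assuming two frames df1/df2 (Spark frames sdf1/sdf2) with matching
  schemas and a shared col1; all names are illustrative:
    rows = pd.concat([df1, df2])          # stack rows; append() is the older spelling
    keyed = df1.join(df2, lsuffix='_l')   # pandas join() matches on the index

    rows_s = sdf1.union(sdf2)             # schemas and column order must match
    keyed_s = sdf1.join(sdf2, on='col1')  # PySpark join() is keyed, SQL-style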
Cartesian Product:
  PANDAS                                            PYSPARK
  df1['key'] = 1                                    df1.crossJoin(df2)
  df2['key'] = 1
  df1.merge(df2, how='outer', on='key')
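
  Example (the helper 'key' column exists only to force the product and is
  dropped afterwards; sdf1/sdf2 are illustrative Spark frames):
    df1['key'] = 1
    df2['key'] = 1
    cross = df1.merge(df2, how='outer', on='key').drop(columns=['key'])

    cross_s = sdf1.crossJoin(sdf2)  # no helper column needed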
Sorting Data:
  PANDAS                                            PYSPARK
  df.sort_values()                                  df.sort()
  df.sort_index()                                   df.orderBy()
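
  Example sorting by one column, descending (sort() is an alias of orderBy();
  the column name is illustrative):
    pdf.sort_values(by='col1', ascending=False)

    sdf.orderBy(sdf.col1.desc())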