Pyspark_Coding_Interview_Questions
Interview Code
Raushan Kumar
https://www.linkedin.com/in/raushan-kumar-553154297/
✓ Remove duplicates
✓ Calculate Average Salary
✓ Top 5 most populous cities
✓ Word count
✓ Average salary + emp count
✓ Running Total stock price
✓ Transaction + Approved
✓ Pivot
✓ Find manager
INTERVIEW PYSPARK CODING QUESTIONS
P1) Write a PySpark code snippet to remove duplicate records from a
DataFrame based on a composite key consisting of 'customer_id' and
'transaction_date'.
Input Dataset
Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("CustomerTransactions").getOrCreate()
# Define the data
data = [
Row(customer_id=1, transaction_id=1001, transaction_date="2025-02-01"),
Row(customer_id=2, transaction_id=1002, transaction_date="2025-02-03"),
Row(customer_id=1, transaction_id=1003, transaction_date="2025-02-01"),
Row(customer_id=3, transaction_id=1004, transaction_date="2025-02-10"),
Row(customer_id=2, transaction_id=1005, transaction_date="2025-02-15"),
Row(customer_id=1, transaction_id=1006, transaction_date="2025-02-07")]
# Create DataFrame from the list of Rows
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()
df1=df.dropDuplicates(subset=["customer_id","transaction_date"])
df1.show()
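Note that dropDuplicates keeps an arbitrary row from each duplicate group. If a deterministic pick is needed (say, the lowest transaction_id per key), a window-based sketch works; the tie-breaking column here is an assumption, not part of the original question:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each (customer_id, transaction_date) group and keep the first
win = Window.partitionBy("customer_id", "transaction_date").orderBy(col("transaction_id"))
df2 = (df.withColumn("rn", row_number().over(win))
         .filter(col("rn") == 1)
         .drop("rn"))
df2.show()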
P2) Given a DataFrame containing employee details, write a PySpark code
snippet to group employees by their department and calculate the average
salary for each department.
Input Dataset
Pyspark Code
from pyspark.sql import Row
from pyspark.sql.functions import avg

data = [
Row(EmployeeID=1, Department="HR", Salary=50000),
Row(EmployeeID=2, Department="IT", Salary=75000),
Row(EmployeeID=3, Department="Finance", Salary=62000),
Row(EmployeeID=4, Department="IT", Salary=82000),
Row(EmployeeID=5, Department="HR", Salary=52000),
Row(EmployeeID=6, Department="Finance", Salary=60000)
]
# Create DataFrame
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()
(df.groupBy('Department')
   .agg(avg('Salary').alias('Average_Salary'))
   .show())
Output
P3) Given a dataset of Indian cities with their respective populations, write
a PySpark code snippet to find the top 5 most populous cities.
Input Dataset
Pyspark Code
from pyspark.sql.functions import col

data=[('Mumbai',20411000),
('Delhi',16787941),
('Bangalore',8443675),
('Chennai',4681087),
('Kolkata',4486679),
('Hyderabad',6809970),
('Ahmedabad',5570585),
('Surat',4467797),
('Pune',3124458),
('Jaipur',3046163)]
schema=["City","Population"]
df=spark.createDataFrame(data,schema)
df1=df.orderBy(col('Population').desc()).limit(5)
df1.show()
Output
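limit(5) cuts off ties arbitrarily. If cities with equal population should rank together, a dense_rank variant is an alternative (a sketch, not part of the original answer):
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Un-partitioned window: fine for small data, but it pulls all rows to one partition
win = Window.orderBy(col('Population').desc())
df2 = df.withColumn('rank', dense_rank().over(win)).filter(col('rank') <= 5)
df2.show()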
P4) Find the top N most frequent words in a large text file
Input Dataset
sample.txt with the following content:
Hello world
Hello from PySpark
PySpark is awesome
Hello PySpark world
Pyspark Code
Step 1 – Load the sample.txt file into a DataFrame
from pyspark.sql.functions import split, explode, col
df=spark.read.text('/FileStore/sample.txt')
df.show()
Output
Approach 1 (Using Dataframe)
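Step 2 – Split each line on spaces into a list of words
splitted_df=df.select(split(col('value'),' ').alias('words_list'))
splitted_df.show()
Step 3 – Explode the word list so that each word gets its own row
exploded_df=splitted_df.select(explode(col('words_list')).alias('words'))
exploded_df.show()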
Step 4 – Group By based on column ‘words’ and apply ‘count’
final_df=exploded_df.groupBy(col('words')).count()
final_df.show()
Complete Code
df=spark.read.text('/FileStore/sample.txt')
splitted_df=df.select(split(col('value'),' ').alias('words_list'))
exploded_df=splitted_df.select(explode(col('words_list')).alias('words'))
final_df=exploded_df.groupBy(col('words')).count()
final_df.show(truncate=False)
Output
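The grouped counts above are unordered. To return only the top N words (say N = 3), sort by the count column and limit; a small extension of the code above:
# Sort by frequency, descending, and keep the N most frequent words
top_n_df=final_df.orderBy(col('count').desc()).limit(3)
top_n_df.show(truncate=False)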
Approach 2 (Using RDD)
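A minimal sketch of the same word count with the RDD API, assuming the same file path as in Approach 1:
# Read the file as an RDD of lines
rdd = spark.sparkContext.textFile('/FileStore/sample.txt')
# Split lines into words, pair each word with 1, then sum the 1s per word
word_counts = (rdd.flatMap(lambda line: line.split(' '))
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))
# Sort by count, descending, and take the top N (here N = 3)
print(word_counts.sortBy(lambda kv: kv[1], ascending=False).take(3))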
P5) Calculate the average salary and count of employees for each
department.
Sample Data
data = [
("Sales", 5000, "John"),
("Sales", 6000, "Doe"),
("HR", 7000, "Jane"),
("HR", 8000, "Alice"),
("IT", 4500, "Bob"),
("IT", 5500, "Charlie"),
]
Pyspark Code
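A minimal sketch; the column names department, salary, and name are an assumption inferred from the sample data, not stated in the original:
from pyspark.sql.functions import avg, count

# Build the DataFrame and aggregate per department
df = spark.createDataFrame(data, ["department", "salary", "name"])
result_df = (df.groupBy("department")
               .agg(avg("salary").alias("average_salary"),
                    count("*").alias("employee_count")))
result_df.show()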
P6) You are given a dataset containing daily stock prices. Write a PySpark
program to calculate the running total of stock prices for each stock symbol
in the dataset.
Pyspark Code
from pyspark.sql.window import Window
from pyspark.sql.functions import *
data = [ ("2024-09-01", "AAPL", 150), ("2024-09-02", "AAPL", 160),
("2024-09-03", "AAPL", 170), ("2024-09-01", "GOOGL", 1200),
("2024-09-02", "GOOGL", 1250), ("2024-09-03", "GOOGL", 1300) ]
# Create DataFrame
df = spark.createDataFrame(data, ["date", "symbol", "price"])
winSpec=Window.partitionBy(col('symbol')).orderBy(col('date').asc())
df1=df.withColumn('Running_Total',sum(col('price')).over(winSpec))
df1.show()
Output
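One subtlety: an ordered window with no explicit frame defaults to a range frame, so rows that tie on date within a symbol are added to the running total together. For a strictly row-by-row total, spell the frame out (a variant, not required by the question):
# Explicit row frame: accumulate from the first row of the partition up to the current row
winSpecRows = (Window.partitionBy(col('symbol'))
                     .orderBy(col('date').asc())
                     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df2 = df.withColumn('Running_Total', sum(col('price')).over(winSpecRows))
df2.show()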
P7) Write a PySpark code snippet to find, for each year-month and country:
✓ Number of transactions and their total amount
✓ Number of approved transactions and their total amount
Input
Output
Pyspark Code
from pyspark.sql.functions import to_date, concat, year, month, when, count, sum, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
schema = StructType([
StructField("id", IntegerType(), True),
StructField("country", StringType(), True),
StructField("state", StringType(), True),
StructField("amount", IntegerType(), True),
StructField("trans_date", StringType(), True) ])
# Define the data with trans_date as a string
data = [
(121, "US", "approved", 1000, "2018-12-18"),
(122, "US", "declined", 2000, "2018-12-19"),
(123, "US", "approved", 2000, "2019-01-01"),
(124, "DE", "approved", 2000, "2019-01-07")
]
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
# Convert trans_date to DateType using to_date()
df = df.withColumn("trans_date", to_date(df["trans_date"], "yyyy-MM-dd"))
# Add year_month, approved_indicator, approved_amount columns
# (note: concat(year, month) yields '20191' for Jan 2019; date_format(col('trans_date'), 'yyyy-MM')
# would give a zero-padded '2019-01' key instead)
df1 = (df.withColumn('year_month', concat(year(col('trans_date')), month(col('trans_date'))))
         .withColumn('approved_indicator', when(col('state') == 'approved', 1).otherwise(0))
         .withColumn('approved_amount', when(col('state') == 'approved', col('amount')).otherwise(0)))
# Do the final aggregation
final_df = (df1.groupBy(col('country'), col('year_month'))
               .agg(count('*').alias('transaction_count'),
                    sum(col('approved_indicator')).alias('approved_count'),
                    sum(col('amount')).alias('transaction_total_amount'),
                    sum(col('approved_amount')).alias('approved_total_amount')))
final_df.show()
Output
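P8) Given a DataFrame in which each row holds a list of skills and a date, write a PySpark code snippet to pivot the data so that each distinct skill becomes a row, each date becomes a column, and each cell holds the number of occurrences.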
Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lit
from pyspark.sql import functions as F
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("SkillsPivot").getOrCreate()
# Define the data
data = [ (['A', 'B'], '01/11/20'), (['B', 'I', 'R'], '01/11/20'),(['S', 'H'], '02/11/20'),
(['A', 'H', 'S'], '02/11/20')]
# Define the schema
schema = ['all_skills', 'dates']
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
# Step 1: Exploding the 'all_skills' column to create one row for each skill
df_exploded = df.withColumn("skill",
explode(col("all_skills"))).drop("all_skills")
# Step 2: Count the occurrences of each skill per date
df_counts = df_exploded.groupBy("skill", "dates").count()
# Step 3: Pivot the DataFrame to have dates as columns
df_pivot = df_counts.groupBy("skill").pivot("dates").agg(F.sum("count"))
# Replace Null with 0
final_df = df_pivot.na.fill(0)
# Show the result
final_df.show(truncate=False)
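Supplying the distinct pivot values up front spares Spark an extra job to discover them; a variant, with the value list taken from the sample data:
# Same pivot, but with the column values provided explicitly
df_pivot = df_counts.groupBy("skill").pivot("dates", ["01/11/20", "02/11/20"]).agg(F.sum("count"))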
Output
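P9) Given an employee DataFrame with empid, empname, deptid, salary, and managerid, write a PySpark code snippet that uses a self-join to find each employee's manager name.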
Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("EmployeeDataFrame").getOrCreate()
# Sample data: list of employee records (empid, empname, deptid, salary, managerid)
data = [
(1, "John Doe", 101, 100.00, None),
(2, "Jane Smith", 102, 3000.00, 1),
(3, "Sam Brown", 101, 20.00, 1),
(4, "Lucy Black", 103, 1000.00, 2),
(5, "Mike White", 102, 200.00, 3),
(6, "Mike Tyson", 102, 200.00, 2),
(7, "Taylor White", 102, 200.00, 2),
(8, "Andrew Flintoff", 102, 200.00, 3)
]
# Define the schema
schema = StructType([
StructField("empid", IntegerType(), True),
StructField("empname", StringType(), True),
StructField("deptid", IntegerType(), True),
StructField("salary", FloatType(), True),
StructField("managerid", IntegerType(), True)])
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
joined_df=(df.alias('emp')
.join(df.alias('manager'),
col("emp.managerid")==col("manager.empid"),
"left")
.select(col("emp.empid").alias('Employee_Id'),
col("emp.empname").alias('Employee_name'),
col("manager.empname").alias('Manager_Name')))
joined_df.show()
Output
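Because the join is a left join, employees whose managerid is null (John Doe here) stay in the result, with Manager_Name returned as null.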