Pyspark_Coding_Interview_Questions
Interview Code
Raushan Kumar
https://www.linkedin.com/in/raushan-kumar-553154297/
✓ Remove duplicates
✓ Calculate Average Salary
✓ Top 5 most populous cities
✓ Word count
✓ Average salary + emp count
✓ Running Total stock price
✓ Transaction + Approved
✓ Pivot
✓ Find manager
INTERVIEW PYSPARK CODING QUESTIONS
P1) Write a PySpark code snippet to remove duplicate records from a
DataFrame based on a composite key consisting of 'customer_id' and
'transaction_date'.
Input Dataset
Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("CustomerTransactions").getOrCreate()
# Define the data
data = [
Row(customer_id=1, transaction_id=1001, transaction_date="2025-02-01"),
Row(customer_id=2, transaction_id=1002, transaction_date="2025-02-03"),
Row(customer_id=1, transaction_id=1003, transaction_date="2025-02-01"),
Row(customer_id=3, transaction_id=1004, transaction_date="2025-02-10"),
Row(customer_id=2, transaction_id=1005, transaction_date="2025-02-15"),
Row(customer_id=1, transaction_id=1006, transaction_date="2025-02-07")]
# Create DataFrame from the list of Rows
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()
df1=df.dropDuplicates(subset=["customer_id","transaction_date"])
df1.show()
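Note that dropDuplicates keeps an arbitrary row from each duplicate group. If a deterministic pick is needed (say, the lowest transaction_id per key), a window-based sketch works; the tie-breaking column here is an assumption, not part of the original question:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each (customer_id, transaction_date) group and keep the first
win = Window.partitionBy("customer_id", "transaction_date").orderBy(col("transaction_id"))
df2 = (df.withColumn("rn", row_number().over(win))
         .filter(col("rn") == 1)
         .drop("rn"))
df2.show()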
P2) Given a DataFrame containing employee details, write a PySpark code
snippet to group employees by their department and calculate the average
salary for each department.
Input Dataset
Pyspark Code
from pyspark.sql import Row
from pyspark.sql.functions import avg

data = [
Row(EmployeeID=1, Department="HR", Salary=50000),
Row(EmployeeID=2, Department="IT", Salary=75000),
Row(EmployeeID=3, Department="Finance", Salary=62000),
Row(EmployeeID=4, Department="IT", Salary=82000),
Row(EmployeeID=5, Department="HR", Salary=52000),
Row(EmployeeID=6, Department="Finance", Salary=60000)
]
# Create DataFrame
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()
(df.groupBy('Department')
   .agg(avg('Salary').alias('Average_Salary'))
   .show())
Output
P3) Given a dataset of Indian cities with their respective populations, write
a PySpark code snippet to find the top 5 most populous cities.
Input Dataset
Pyspark Code
from pyspark.sql.functions import col

data=[('Mumbai',20411000),
('Delhi',16787941),
('Bangalore',8443675),
('Chennai',4681087),
('Kolkata',4486679),
('Hyderabad',6809970),
('Ahmedabad',5570585),
('Surat',4467797),
('Pune',3124458),
('Jaipur',3046163)]
schema=["City","Population"]
df=spark.createDataFrame(data,schema)
df1=df.orderBy(col('Population').desc()).limit(5)
df1.show()
Output
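limit(5) cuts off ties arbitrarily. If cities with equal population should rank together, a dense_rank variant is an alternative (a sketch, not part of the original answer):
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Un-partitioned window: fine for small data, but it pulls all rows to one partition
win = Window.orderBy(col('Population').desc())
df2 = df.withColumn('rank', dense_rank().over(win)).filter(col('rank') <= 5)
df2.show()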
P4) Find the top N most frequent words in a large text file
Input Dataset
sample.txt with the following content:
Hello world
Hello from PySpark
PySpark is awesome
Hello PySpark world
Pyspark Code
Step 1 – Load the sample.txt file into a DataFrame
from pyspark.sql.functions import split, explode, col
df=spark.read.text('/FileStore/sample.txt')
df.show()
Output
Approach 1 (Using Dataframe)
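Step 2 – Split each line on spaces into a list of words
splitted_df=df.select(split(col('value'),' ').alias('words_list'))
splitted_df.show()
Step 3 – Explode the word list so that each word gets its own row
exploded_df=splitted_df.select(explode(col('words_list')).alias('words'))
exploded_df.show()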
Step 4 – Group By based on column ‘words’ and apply ‘count’
final_df=exploded_df.groupBy(col('words')).count()
final_df.show()
Complete Code
df=spark.read.text('/FileStore/sample.txt')
splitted_df=df.select(split(col('value'),' ').alias('words_list'))
exploded_df=splitted_df.select(explode(col('words_list')).alias('words'))
final_df=exploded_df.groupBy(col('words')).count()
final_df.show(truncate=False)
Output
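The grouped counts above are unordered. To return only the top N words (say N = 3), sort by the count column and limit; a small extension of the code above:
# Sort by frequency, descending, and keep the N most frequent words
top_n_df=final_df.orderBy(col('count').desc()).limit(3)
top_n_df.show(truncate=False)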
Approach 2 (Using RDD)
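A minimal sketch of the same word count with the RDD API, assuming the same file path as in Approach 1:
# Read the file as an RDD of lines
rdd = spark.sparkContext.textFile('/FileStore/sample.txt')
# Split lines into words, pair each word with 1, then sum the 1s per word
word_counts = (rdd.flatMap(lambda line: line.split(' '))
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))
# Sort by count, descending, and take the top N (here N = 3)
print(word_counts.sortBy(lambda kv: kv[1], ascending=False).take(3))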
P5) Calculate the average salary and count of employees for each
department.
Sample Data
data = [
("Sales", 5000, "John"),
("Sales", 6000, "Doe"),
("HR", 7000, "Jane"),
("HR", 8000, "Alice"),
("IT", 4500, "Bob"),
("IT", 5500, "Charlie"),
]
Pyspark Code
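A minimal sketch; the column names department, salary, and name are an assumption inferred from the sample data, not stated in the original:
from pyspark.sql.functions import avg, count

# Build the DataFrame and aggregate per department
df = spark.createDataFrame(data, ["department", "salary", "name"])
result_df = (df.groupBy("department")
               .agg(avg("salary").alias("average_salary"),
                    count("*").alias("employee_count")))
result_df.show()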
P6) You are given a dataset containing daily stock prices. Write a PySpark
program to calculate the running total of stock prices for each stock symbol
in the dataset.
Pyspark Code
from pyspark.sql.window import Window
from pyspark.sql.functions import *
data = [ ("2024-09-01", "AAPL", 150), ("2024-09-02", "AAPL", 160),
("2024-09-03", "AAPL", 170), ("2024-09-01", "GOOGL", 1200),
("2024-09-02", "GOOGL", 1250), ("2024-09-03", "GOOGL", 1300) ]
# Create DataFrame
df = spark.createDataFrame(data, ["date", "symbol", "price"])
winSpec=Window.partitionBy(col('symbol')).orderBy(col('date').asc())
df1=df.withColumn('Running_Total',sum(col('price')).over(winSpec))
df1.show()
Output
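One subtlety: an ordered window with no explicit frame defaults to a range frame, so rows that tie on date within a symbol are added to the running total together. For a strictly row-by-row total, spell the frame out (a variant, not required by the question):
# Explicit row frame: accumulate from the first row of the partition up to the current row
winSpecRows = (Window.partitionBy(col('symbol'))
                     .orderBy(col('date').asc())
                     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df2 = df.withColumn('Running_Total', sum(col('price')).over(winSpecRows))
df2.show()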
P7) Write a PySpark code snippet to find, for each year-month and country:
✓ Number of transactions and their total amount
✓ Number of approved transactions and their total amount
Input
Output
Pyspark Code
from pyspark.sql.functions import to_date, concat, year, month, when, count, sum, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType
schema = StructType([
StructField("id", IntegerType(), True),
StructField("country", StringType(), True),
StructField("state", StringType(), True),
StructField("amount", IntegerType(), True),
StructField("trans_date", StringType(), True) ])
# Define the data with trans_date as a string
data = [
(121, "US", "approved", 1000, "2018-12-18"),
(122, "US", "declined", 2000, "2018-12-19"),
(123, "US", "approved", 2000, "2019-01-01"),
(124, "DE", "approved", 2000, "2019-01-07")
]
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
# Convert trans_date to DateType using to_date()
df = df.withColumn("trans_date", to_date(df["trans_date"], "yyyy-MM-dd"))
# Add year_month, approved_indicator, approved_amount columns
# (note: concat(year, month) yields '20191' for Jan 2019; date_format(col('trans_date'), 'yyyy-MM')
# would give a zero-padded '2019-01' key instead)
df1 = (df.withColumn('year_month', concat(year(col('trans_date')), month(col('trans_date'))))
         .withColumn('approved_indicator', when(col('state') == 'approved', 1).otherwise(0))
         .withColumn('approved_amount', when(col('state') == 'approved', col('amount')).otherwise(0)))
# Do the final aggregation
final_df = (df1.groupBy(col('country'), col('year_month'))
               .agg(count('*').alias('transaction_count'),
                    sum(col('approved_indicator')).alias('approved_count'),
                    sum(col('amount')).alias('transaction_total_amount'),
                    sum(col('approved_amount')).alias('approved_total_amount')))
final_df.show()
Output
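P8) Given a DataFrame in which each row holds a list of skills and a date, write a PySpark code snippet to pivot the data so that each distinct skill becomes a row, each date becomes a column, and each cell holds the number of occurrences.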
Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lit
from pyspark.sql import functions as F
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("SkillsPivot").getOrCreate()
# Define the data
data = [ (['A', 'B'], '01/11/20'), (['B', 'I', 'R'], '01/11/20'),(['S', 'H'], '02/11/20'),
(['A', 'H', 'S'], '02/11/20')]
# Define the schema
schema = ['all_skills', 'dates']
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
# Step 1: Exploding the 'all_skills' column to create one row for each skill
df_exploded = df.withColumn("skill",
explode(col("all_skills"))).drop("all_skills")
# Step 2: Count the occurrences of each skill per date
df_counts = df_exploded.groupBy("skill", "dates").count()
# Step 3: Pivot the DataFrame to have dates as columns
df_pivot = df_counts.groupBy("skill").pivot("dates").agg(F.sum("count"))
# Replace Null with 0
final_df = df_pivot.na.fill(0)
# Show the result
final_df.show(truncate=False)
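Supplying the distinct pivot values up front spares Spark an extra job to discover them; a variant, with the value list taken from the sample data:
# Same pivot, but with the column values provided explicitly
df_pivot = df_counts.groupBy("skill").pivot("dates", ["01/11/20", "02/11/20"]).agg(F.sum("count"))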
Output
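P9) Given an employee DataFrame with empid, empname, deptid, salary, and managerid, write a PySpark code snippet that uses a self-join to find each employee's manager name.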
Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("EmployeeDataFrame").getOrCreate()
# Sample data: list of employee records (empid, empname, deptid, salary, managerid)
data = [
(1, "John Doe", 101, 100.00, None),
(2, "Jane Smith", 102, 3000.00, 1),
(3, "Sam Brown", 101, 20.00, 1),
(4, "Lucy Black", 103, 1000.00, 2),
(5, "Mike White", 102, 200.00, 3),
(6, "Mike Tyson", 102, 200.00, 2),
(7, "Taylor White", 102, 200.00, 2),
(8, "Andrew Flintoff", 102, 200.00, 3)
]
# Define the schema
schema = StructType([
StructField("empid", IntegerType(), True),
StructField("empname", StringType(), True),
StructField("deptid", IntegerType(), True),
StructField("salary", FloatType(), True),
StructField("managerid", IntegerType(), True)])
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
joined_df=(df.alias('emp')
.join(df.alias('manager'),
col("emp.managerid")==col("manager.empid"),
"left")
.select(col("emp.empid").alias('Employee_Id'),
col("emp.empname").alias('Employee_name'),
col("manager.empname").alias('Manager_Name')))
joined_df.show()
Output
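Because the join is a left join, employees whose managerid is null (John Doe here) stay in the result, with Manager_Name returned as null.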