
PYSPARK

Interview Code
Raushan Kumar
✓ Remove duplicates
✓ Calculate average salary
✓ Top 5 most populous cities
✓ Word count
✓ Average salary + employee count
✓ Running total of stock prices
✓ Transactions: total + approved
✓ Pivot
✓ Find manager

Raushan Kumar
https://www.linkedin.com/in/raushan-kumar-553154297/
INTERVIEW PYSPARK CODING QUESTIONS
P1) Write a PySpark code snippet to remove duplicate records from a
DataFrame based on a composite key consisting of 'customer_id' and
'transaction_date'.
Input Dataset

Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("CustomerTransactions").getOrCreate()
# Define the data
data = [
Row(customer_id=1, transaction_id=1001, transaction_date="2025-02-01"),
Row(customer_id=2, transaction_id=1002, transaction_date="2025-02-03"),
Row(customer_id=1, transaction_id=1003, transaction_date="2025-02-01"),
Row(customer_id=3, transaction_id=1004, transaction_date="2025-02-10"),
Row(customer_id=2, transaction_id=1005, transaction_date="2025-02-15"),
Row(customer_id=1, transaction_id=1006, transaction_date="2025-02-07")]

# Create DataFrame from the list of Rows
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()

df1=df.dropDuplicates(subset=["customer_id","transaction_date"])
df1.show()
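Note that dropDuplicates() keeps an arbitrary row for each (customer_id, transaction_date) pair. If the interviewer asks to keep a specific record, say the earliest transaction_id per key (an assumption made here for illustration), a window-function sketch like the following can be used on the same df:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Number the rows within each composite key, earliest transaction_id first
win = Window.partitionBy("customer_id", "transaction_date").orderBy(col("transaction_id").asc())

deduped_df = (df.withColumn("rn", row_number().over(win))
                .filter(col("rn") == 1)
                .drop("rn"))
deduped_df.show()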

P2) Given a DataFrame containing employee details, write a PySpark code
snippet to group employees by their department and calculate the average
salary for each department.
Input Dataset

Pyspark Code

from pyspark.sql.functions import avg

data = [
Row(EmployeeID=1, Department="HR", Salary=50000),
Row(EmployeeID=2, Department="IT", Salary=75000),
Row(EmployeeID=3, Department="Finance", Salary=62000),
Row(EmployeeID=4, Department="IT", Salary=82000),
Row(EmployeeID=5, Department="HR", Salary=52000),
Row(EmployeeID=6, Department="Finance", Salary=60000)
]
# Create DataFrame
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()

(df.groupBy('Department')
   .agg(avg('Salary').alias('Average_Salary'))
   .show())
Output

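A Spark SQL version of the same aggregation is a common follow-up. A minimal sketch, assuming the df created above (the view name emp is only illustrative):

df.createOrReplaceTempView("emp")
spark.sql("""
    SELECT Department, AVG(Salary) AS Average_Salary
    FROM emp
    GROUP BY Department
""").show()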
P3) Given a dataset of Indian cities with their respective populations, write
a PySpark code snippet to find the top 5 most populous cities.
Input Dataset

Pyspark Code
from pyspark.sql.functions import col

data=[('Mumbai',20411000),
('Delhi',16787941),
('Bangalore',8443675),
('Chennai',4681087),
('Kolkata',4486679),
('Hyderabad',6809970),
('Ahmedabad',5570585),
('Surat',4467797),
('Pune',3124458),
('Jaipur',3046163)]
schema=["City","Population"]
df=spark.createDataFrame(data,schema)
df1=df.orderBy(col('Population').desc()).limit(5)
df1.show()
Output

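Note that orderBy(...).limit(5) returns exactly five rows and breaks ties arbitrarily. If cities with equal population should all be included, a ranking sketch like the one below can be used (it assumes the df defined above; the window has no partitionBy, which pulls all rows into a single partition and is acceptable only for small data):

from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

win = Window.orderBy(col('Population').desc())
top5_with_ties = (df.withColumn('rnk', dense_rank().over(win))
                    .filter(col('rnk') <= 5)
                    .drop('rnk'))
top5_with_ties.show()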
P4) Find the top N most frequent words in a large text file
Sample Data
Let’s assume you have a text file named sample.txt with the following
content:
Hello world
Hello from PySpark
PySpark is awesome
Hello PySpark world

Pyspark Code
Step 1 – Load the sample.txt file into a DataFrame
df=spark.read.text('/FileStore/sample.txt')
df.show()
Output

Approach 1 (Using DataFrame)

Step 2 – Split the line based on whitespace


from pyspark.sql.functions import split, col, explode

splitted_df=df.select(split(col('value'),' ').alias('words_list'))
splitted_df.show(truncate=False)

Step 3 – Explode the list


exploded_df=splitted_df.select(explode(col('words_list')).alias('words'))
exploded_df.show()

Step 4 – Group By based on column ‘words’ and apply ‘count’
final_df=exploded_df.groupBy(col('words')).count()
final_df.show()

Complete Code
from pyspark.sql.functions import split, col, explode

df=spark.read.text('/FileStore/sample.txt')
splitted_df=df.select(split(col('value'),' ').alias('words_list'))
exploded_df=splitted_df.select(explode(col('words_list')).alias('words'))
final_df=exploded_df.groupBy(col('words')).count()
final_df.show(truncate=False)
Output

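The question asks for the top N most frequent words, so the grouped counts still need to be ordered and limited. A short sketch, taking N = 3 purely for illustration:

n = 3  # illustrative value of N
top_n_df = final_df.orderBy(col('count').desc()).limit(n)
top_n_df.show(truncate=False)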
Approach 2 (Using RDD)

from pyspark.sql import SparkSession


# Initialize Spark session
spark = SparkSession.builder.master("local").appName("WordCount").getOrCreate()
# Create RDD by reading the text file (replace with your file path)
rdd = spark.sparkContext.textFile("/FileStore/sample.txt")
# Perform word count using RDD operations
word_counts = (
# Split each line into words
rdd.flatMap(lambda line: line.split())
# Map each word to (word, 1)
.map(lambda word: (word, 1))
# Reduce by key (sum the counts)
.reduceByKey(lambda a, b: a + b)
)
# Collect the results and print them
print(word_counts.collect())
Output

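To get the top N words with the RDD API, takeOrdered() can be applied to the (word, count) pairs. A minimal sketch, again taking N = 3 for illustration:

n = 3  # illustrative value of N
# takeOrdered returns the n smallest elements by key, so negate the count to get the most frequent words
top_n = word_counts.takeOrdered(n, key=lambda pair: -pair[1])
print(top_n)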
P5) Calculate the average salary and count of employees for each
department.

Sample Data
data = [
("Sales", 5000, "John"),
("Sales", 6000, "Doe"),
("HR", 7000, "Jane"),
("HR", 8000, "Alice"),
("IT", 4500, "Bob"),
("IT", 5500, "Charlie"),
]

Pyspark Code

schema=["department", "salary", "employee_name"]


df=spark.createDataFrame(data,schema)
groupped_df=(df.groupBy(col('department'))
.agg(
avg(col('salary')).alias('Average_Salary'),
count(col('employee_name')).alias('Total_Employee')
))
groupped_df.show()
Output

P6) You are given a dataset containing daily stock prices. Write a PySpark
program to calculate the running total of stock prices for each stock symbol
in the dataset.

Pyspark Code
from pyspark.sql.window import Window
from pyspark.sql.functions import *
data = [ ("2024-09-01", "AAPL", 150), ("2024-09-02", "AAPL", 160),
("2024-09-03", "AAPL", 170), ("2024-09-01", "GOOGL", 1200),
("2024-09-02", "GOOGL", 1250), ("2024-09-03", "GOOGL", 1300) ]
# Create DataFrame
df = spark.createDataFrame(data, ["date", "symbol", "price"])
winSpec=Window.partitionBy(col('symbol')).orderBy(col('date').asc())
df1=df.withColumn('Running_Total',sum(col('price')).over(winSpec))
df1.show()
Output

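One detail worth mentioning in an interview: with an orderBy on the window, Spark's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that share the same date within a symbol would receive the same running total. If a strict row-by-row running total is wanted, the frame can be made explicit; a sketch using the same df:

# Explicit ROWS frame: accumulate one row at a time within each symbol
winSpecRows = (Window.partitionBy(col('symbol'))
                     .orderBy(col('date').asc())
                     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df2 = df.withColumn('Running_Total', sum(col('price')).over(winSpecRows))
df2.show()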
P7) Write PySpark code to find, for each year-month and country:
✓ the number of transactions and their total amount
✓ the number of approved transactions and their total amount
Input

Output

Pyspark Code
from pyspark.sql.functions import to_date, concat, year, month, when, count, sum, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("country", StringType(), True),
    StructField("state", StringType(), True),
    StructField("amount", IntegerType(), True),
    StructField("trans_date", StringType(), True)])

# Define the data with trans_date as a string
data = [
(121, "US", "approved", 1000, "2018-12-18"),
(122, "US", "declined", 2000, "2018-12-19"),
(123, "US", "approved", 2000, "2019-01-01"),
(124, "DE", "approved", 2000, "2019-01-07")
]
# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
# Convert trans_date to DateType using to_date()
df = df.withColumn("trans_date", to_date(df["trans_date"], "yyyy-MM-dd"))
# Add year_month, approved_indicator, approved_amount column
df1=(df.withColumn('year_month',concat(year(col('trans_date')),month(col('trans_date'))))
       .withColumn('approved_indicator',when(col('state')=='approved',1).otherwise(0))
       .withColumn('approved_amount',when(col('state')=='approved',col('amount')).otherwise(0))
    )
# Do the final aggregation
final_df=(df1.groupBy(col('country'),col('year_month'))
.agg(
count('*').alias('transaction_count'),
sum(col('approved_indicator')).alias('approved_count'),
sum(col('amount')).alias('transaction_total_amount'),
sum(col('approved_amount')).alias('approved_total_amount')))
final_df.show()

Output

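The same result can also be computed in a single pass, without the helper columns, by placing the when() conditions directly inside the aggregations. A sketch using the df with the parsed trans_date from above:

compact_df = (df.groupBy(col('country'),
                         concat(year(col('trans_date')), month(col('trans_date'))).alias('year_month'))
                .agg(count('*').alias('transaction_count'),
                     sum(when(col('state') == 'approved', 1).otherwise(0)).alias('approved_count'),
                     sum(col('amount')).alias('transaction_total_amount'),
                     sum(when(col('state') == 'approved', col('amount')).otherwise(0)).alias('approved_total_amount')))
compact_df.show()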
P8) Given a dataset where each row has a list of skills and a date, write PySpark code to count how often each skill appears per date and pivot the result so that each date becomes a column.

Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql import functions as F

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("SkillsPivot").getOrCreate()

# Define the data
data = [(['A', 'B'], '01/11/20'), (['B', 'I', 'R'], '01/11/20'), (['S', 'H'], '02/11/20'),
        (['A', 'H', 'S'], '02/11/20')]

# Define the schema
schema = ['all_skills', 'dates']

# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)

# Step 1: Explode the 'all_skills' column to create one row per skill
df_exploded = df.withColumn("skill", explode(col("all_skills"))).drop("all_skills")

# Step 2: Count the occurrences of each skill per date
df_counts = df_exploded.groupBy("skill", "dates").count()

# Step 3: Pivot the DataFrame to have dates as columns
df_pivot = df_counts.groupBy("skill").pivot("dates").agg(F.sum("count"))

# Replace null with 0
final_df = df_pivot.na.fill(0)

# Show the result
final_df.show(truncate=False)

Output

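pivot() without a value list makes Spark run an extra job just to discover the distinct dates. When the pivot columns are known in advance they can be passed explicitly, which also fixes the column order. A sketch assuming the two dates present in the sample data:

known_dates = ['01/11/20', '02/11/20']  # taken from the sample data above
df_pivot_explicit = (df_counts.groupBy('skill')
                              .pivot('dates', known_dates)
                              .agg(F.sum('count'))
                              .na.fill(0))
df_pivot_explicit.show(truncate=False)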
P9) For each employee, find their manager's name.

Pyspark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("EmployeeDataFrame").getOrCreate()
# Sample data: list of employee records (empid, empname, deptid, salary, managerid)
data = [
(1, "John Doe", 101, 100.00, None),
(2, "Jane Smith", 102, 3000.00, 1),
(3, "Sam Brown", 101, 20.00, 1),
(4, "Lucy Black", 103, 1000.00, 2),
(5, "Mike White", 102, 200.00, 3),
(6, "Mike Tyson", 102, 200.00, 2),
(7, "Taylor White", 102, 200.00, 2),
(8, "Andrew Flintoff", 102, 200.00, 3)
]
# Define the schema
schema = StructType([
StructField("empid", IntegerType(), True),
StructField("empname", StringType(), True),
StructField("deptid", IntegerType(), True),
StructField("salary", FloatType(), True),
StructField("managerid", IntegerType(), True)])

# Create DataFrame from the data with the defined schema
df = spark.createDataFrame(data, schema)
joined_df=(df.alias('emp')
.join(df.alias('manager'),
col("emp.managerid")==col("manager.empid"),
"left")
.select(col("emp.empid").alias('Employee_Id'),
col("emp.empname").alias('Employee_name'),
col("manager.empname").alias('Manager_Name')))
joined_df.show()

Output

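A common follow-up is how to display employees who have no manager (managerid is null). One option is to replace the null with a placeholder label; a sketch building on joined_df (the label 'No Manager' is arbitrary):

from pyspark.sql.functions import coalesce, lit

no_mgr_df = joined_df.withColumn('Manager_Name', coalesce(col('Manager_Name'), lit('No Manager')))
no_mgr_df.show()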
By: Raushan Kumar


Please follow for more such content:
https://www.linkedin.com/in/raushan-kumar-553154297/
