PySpark Interview Questions

The document provides an overview of PySpark, including its differences from traditional Spark, the concept of Resilient Distributed Datasets (RDDs), and the use of DataFrames. It includes informative and scenario-based questions related to PySpark operations, as well as practical code examples for creating DataFrames, filtering data, performing aggregations, handling missing values, and executing SQL queries. The content is aimed at understanding and applying PySpark for data processing tasks.

Informative Questions

1. What is PySpark, and how does it differ from traditional Spark?
2. Explain the concept of Resilient Distributed Datasets (RDDs) in PySpark.
3. How do DataFrames in PySpark differ from RDDs?
4. What are some common transformations and actions available in PySpark? (See the sketch after this list.)
5. Describe how PySpark handles partitioning and shuffling of data.
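
A short sketch for question 4, assuming an active SparkSession named spark (created in the coding section below): transformations such as map and filter are lazy and only record lineage, while actions such as count and collect actually trigger a job.

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)              # transformation (lazy)
evens = squared.filter(lambda x: x % 2 == 0)    # transformation (lazy)
print(evens.count())    # action: runs the job, prints 2
print(evens.collect())  # action: returns [4, 16]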

Scenario-Based Questions

1. You need to process streaming data from Kafka using PySpark Streaming. How would you set this up?
2. Imagine you have to join two large DataFrames that do not fit into memory; what strategies would you employ?
3. How would you optimize a slow-running PySpark job that processes large datasets?
4. You need to perform aggregations on a dataset that has missing values; how would you handle this in PySpark?
5. If you encounter skewed data during processing, what techniques can you use to mitigate its effects? (A salting sketch follows this list.)
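
One common answer to question 5 is key salting: spread a hot key over several artificial sub-keys, pre-aggregate on the salted key, then combine the partial results. A minimal sketch, assuming a hypothetical orders DataFrame whose customer_id column is skewed:

from pyspark.sql import functions as F

NUM_SALTS = 8  # how many sub-keys to spread each hot key across (tuning assumption)

# Stage 1: add a random salt and pre-aggregate on (customer_id, salt)
salted = orders.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
partial = salted.groupBy("customer_id", "salt").agg(F.sum("amount").alias("partial_amount"))

# Stage 2: combine the partial sums back to one row per customer_id
totals = partial.groupBy("customer_id").agg(F.sum("partial_amount").alias("total_amount"))
totals.show()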

Coding Questions

1. Write PySpark code to create a DataFrame from a list of tuples and show its content:

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession that the remaining snippets refer to as spark
spark = SparkSession.builder.appName("pyspark-interview").getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, schema=columns)
df.show()

2. Implement code to filter rows from a DataFrame based on a condition (e.g., Age > 30):

filtered_df = df.filter(df.Age > 30)
filtered_df.show()

3. Write code to group data by a column and calculate the average of another column (e.g., average age by name):

avg_age_df = df.groupBy("Name").agg({"Age": "avg"})
avg_age_df.show()
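
The dictionary form above names the result column avg(Age); an equivalent version using pyspark.sql.functions lets you choose the column name explicitly:

from pyspark.sql import functions as F

avg_age_df = df.groupBy("Name").agg(F.avg("Age").alias("Avg_Age"))
avg_age_df.show()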

4. Create a DataFrame from an external JSON file and display its schema and content:

json_df = spark.read.json("data.json")
json_df.printSchema()
json_df.show()
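
If the layout of data.json is known in advance, supplying an explicit schema avoids the cost of schema inference; the field names below are only an assumption for illustration:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumed fields; replace with the real structure of data.json
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
])
json_df = spark.read.schema(schema).json("data.json")
json_df.printSchema()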

5. Write code to perform an inner join between two DataFrames and show the result:

df1 = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["Name", "ID"])
df2 = spark.createDataFrame([(1, "HR"), (2, "Finance")], ["ID", "Department"])
joined_df = df1.join(df2, "ID", "inner")
joined_df.show()
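
When one side of a join is small enough to fit in executor memory, a broadcast hint avoids a shuffle (a sketch, assuming df2 is the small side):

from pyspark.sql.functions import broadcast

joined_df = df1.join(broadcast(df2), "ID", "inner")
joined_df.show()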

6. Implement code to write a DataFrame to Parquet format and read it back into another DataFrame:

df.write.parquet("output.parquet")  # add .mode("overwrite") if the path may already exist
parquet_df = spark.read.parquet("output.parquet")
parquet_df.show()
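
Partitioning the output by a column can speed up later reads that filter on that column; this variant is only a sketch, with the partition column chosen arbitrarily:

df.write.mode("overwrite").partitionBy("Name").parquet("output_by_name.parquet")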

7. Create a new column in an existing DataFrame by applying a transformation on another column (e.g., double the age):

df_with_new_col = df.withColumn("Double_Age", df.Age * 2)
df_with_new_col.show()

8. Write code to handle missing values in a DataFrame by filling them with default values (e.g., fill null ages with 0):

filled_df = df.fillna({"Age": 0})
filled_df.show()
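
If rows with missing values should be dropped instead of filled, dropna on a subset of columns is the usual alternative (a sketch of the same idea):

cleaned_df = df.dropna(subset=["Age"])
cleaned_df.show()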

9. Implement code to calculate the total number of records in a DataFrame using an action (e.g., count):

total_count = df.count()
print(f"Total records: {total_count}")

10. Write PySpark code to create and use a temporary view for SQL queries on DataFrames:

df.createOrReplaceTempView("people")
sql_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
sql_result.show()
