PySpark Interview Questions
1. Write code to create a DataFrame from in-memory data and display it:
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
2. Implement code to filter rows from a DataFrame based on a condition (e.g., Age > 30):
filtered_df = df.filter(df.Age > 30)
filtered_df.show()
3. Write code to group data by a column and calculate the average of another column (e.g., average age by name):
avg_age_df = df.groupBy("Name").avg("Age")
avg_age_df.show()
4. Create a DataFrame from an external JSON file and display its schema and content:
json_df = spark.read.json("data.json")
json_df.printSchema()
json_df.show()
5. Write code to perform an inner join between two DataFrames and show the result:
df2 = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["Name", "City"])
joined_df = df.join(df2, on="Name", how="inner")
joined_df.show()
6. Implement code to write a DataFrame to Parquet format and read it back into another DataFrame:
df.write.parquet("output.parquet")
parquet_df = spark.read.parquet("output.parquet")
parquet_df.show()
7. Create a new column in an existing DataFrame by applying a transformation on another column (e.g., double the age):
df_with_new_col = df.withColumn("DoubleAge", df.Age * 2)
df_with_new_col.show()
8. Write code to handle missing values in a DataFrame by filling them with default values (e.g., fill null ages with 0):
filled_df = df.fillna({"Age": 0})
filled_df.show()
9. Implement code to calculate the total number of records in a DataFrame using an action (e.g., count):
total_count = df.count()
print(total_count)
10. Write PySpark code to create and use a temporary view for SQL queries on DataFrames:
df.createOrReplaceTempView("people")
sql_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 30")
sql_result.show()