15 Asked Questions in KPMG
Q 14: Explain a scenario where we want to reduce
the number of partitions but prefer repartition over coalesce.
• Suppose you have a PySpark DataFrame that is heavily skewed, meaning that a
few partitions contain significantly more data than others. This skew can lead to
straggler tasks and inefficient resource utilization during computations, as the
workload is not evenly distributed across the cluster.
• In this scenario, coalesce might not be sufficient because it only merges existing
partitions without performing a full shuffle. Since the data is heavily skewed,
simply merging partitions may not redistribute the data evenly, and the skew may
persist in the resulting partitions. Instead, you can use repartition to explicitly
shuffle and redistribute the data evenly across a smaller number of partitions.
• By using repartition, you ensure that the data is evenly distributed across the
specified number of partitions, which helps mitigate the effects of skew and
improves performance by balancing the workload across the cluster. Although
repartition involves a full shuffle, in cases of heavy skew, the benefits of evenly
distributed data may outweigh the overhead of shuffling.
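• A minimal sketch of the contrast described above. The input path "/data/events" and the DataFrame are placeholders for illustration; the key point is that coalesce merges partitions in place while repartition shuffles rows evenly:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

# Hypothetical skewed input (path is an assumption for this sketch).
df = spark.read.parquet("/data/events")

# coalesce(8) only merges existing partitions with no shuffle,
# so an oversized, skewed partition stays oversized after the merge.
merged = df.coalesce(8)

# repartition(8) triggers a full shuffle and redistributes rows
# round-robin, evening out the 8 output partitions.
balanced = df.repartition(8)

# Compare per-partition row counts to see the effect of each approach.
balanced.groupBy(spark_partition_id().alias("pid")).count().show()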
Q 15: What is bucketing in Spark?
• Bucketing is a technique used in Apache Spark to organize data into
more manageable and evenly sized partitions based on a hash of one
or more columns. It's particularly useful when you have large datasets
and want to optimize certain types of operations, such as joins or
aggregations, by ensuring that related data is colocated within the
same partitions.
• df.write.bucketBy(4, "id").saveAsTable("bucketed_table", format="parquet")