
15 Asked Questions in KPMG

Theoretical Questions for Data Engineers (Spark)


Part 1 and Part 2.
Q 1: Which one is better to use, Hadoop or Spark?
Q 2: What is the difference between Spark Transformations and Spark Actions?
Q 3: What distinguishes PySpark, Databricks, and Amazon Elastic MapReduce (EMR)?
Q 4: What are RDDs and DataFrames?
• RDD (Resilient Distributed Dataset) is the core abstraction in Spark,
representing an immutable distributed collection of objects that can
be processed in parallel across a cluster. RDDs provide a low-level API
for manipulating distributed datasets.
• DataFrames, on the other hand, are a higher-level abstraction built on
top of RDDs, introduced in Spark 1.3. DataFrames organize data into
named columns, similar to a table in a relational database or a data
frame in R/Python pandas. They provide a more structured and
efficient way to work with structured and semi-structured data
compared to RDDs.
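• As a minimal sketch (assuming an existing SparkSession named spark), the same data can be handled through either abstraction:

rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
adults_rdd = rdd.filter(lambda row: row[1] >= 40)            # low-level, positional access

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
adults_df = df.filter(df.age >= 40)                          # named columns, optimized by Catalyst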
Q 5: What is a Spark Partition?
• A Spark partition is a logical division of data in an RDD or DataFrame.
Partitions are the basic units of parallelism in Spark, with each
partition being processed by a single task (or thread) within a Spark
executor. The number of partitions in an RDD or DataFrame
determines the degree of parallelism during computation. Spark
automatically determines the number of partitions based on the
input data and the cluster configuration, but you can also explicitly
control the partitioning of data using partitioning functions.
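• A small illustration of inspecting and explicitly controlling partitioning (assuming a SparkSession named spark):

rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())      # 8

df = spark.range(1000)
print(df.rdd.getNumPartitions())   # chosen by Spark from the input and cluster config
df16 = df.repartition(16, "id")    # explicit control: 16 partitions, hashed by "id"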
Q 6: What is shuffling in PySpark?
• Shuffling in PySpark refers to the process of redistributing data across
partitions during certain operations, such as joins or aggregations,
where data from multiple partitions needs to be combined or
reshuffled. Shuffling involves transferring data between nodes in the
cluster and can be a performance-intensive operation, especially for
large datasets. Minimizing shuffling is crucial for optimizing the
performance of Spark jobs.
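• For example (a rough sketch assuming a SparkSession named spark; the column names are made up), an aggregation by key forces a shuffle, and the number of shuffle partitions can be tuned:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("c1", 100), ("c2", 250), ("c1", 75)],
    ["customer_id", "amount"],
)

# groupBy is a shuffle operation: rows with the same customer_id must be
# brought to the same partition before they can be aggregated
totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# Partitions produced by shuffles (default 200); a common tuning knob
spark.conf.set("spark.sql.shuffle.partitions", "64")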
Q 7: What is the difference between narrow and wide transformations in Spark?
• Narrow transformations are transformations where each input
partition contributes to only one output partition. Examples of
narrow transformations include map, filter, flatMap, etc. These
transformations can be executed in parallel without shuffling data
across partitions.
• Wide transformations, on the other hand, are transformations where
each input partition may contribute to multiple output partitions.
Examples of wide transformations include groupBy, join, sortByKey,
etc. These transformations typically involve shuffling data across
partitions and may require data movement between nodes in the
cluster.
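• A minimal RDD sketch of the difference (assuming a SparkSession named spark):

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each input partition feeds exactly one output partition, no shuffle
doubled = rdd.mapValues(lambda v: v * 2)
large = doubled.filter(lambda kv: kv[1] > 2)

# Wide: values for the same key may sit in different partitions,
# so Spark must shuffle data across the cluster
summed = large.reduceByKey(lambda a, b: a + b)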
Q 8: What is a DAG in Spark?
• A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs.
• In a Spark DAG, every edge points from an earlier step to a later one in the sequence. When an action is called, the DAG built so far is submitted to the DAG Scheduler, which splits the graph into stages of tasks. Spark lets you dive into any stage and expand its details; the stage view lists all the RDDs belonging to that stage. The scheduler splits the RDD lineage into stages based on the transformations applied, and each stage consists of tasks, one per partition of the RDD, which perform the same computation in parallel. "Graph" here refers to the flow of computation, while "directed" and "acyclic" describe how it proceeds: always forward, never looping back.
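• A small sketch of how the DAG is built lazily and then split into stages (assuming a SparkSession named spark; toDebugString shows the lineage behind the DAG):

rdd = spark.sparkContext.parallelize(range(10))
pairs = rdd.map(lambda x: (x % 2, x))          # narrow: stays inside one stage
sums = pairs.reduceByKey(lambda a, b: a + b)   # wide: creates a stage boundary

# Nothing has executed yet; the action below submits the DAG, which the
# DAG Scheduler splits into two stages at the shuffle boundary
print(sums.toDebugString().decode())
print(sums.collect())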
Q 9: How do you handle data skewness?
Data skewness in Spark can lead to performance issues and uneven resource
utilization. To handle data skewness, you can use the following techniques:
• Partitioning: Choose an appropriate partitioning strategy to evenly distribute data
across partitions. Avoid relying solely on default partitioning, especially for skewed
keys.
• Salting: Add random or sequential numbers (a "salt") to skewed keys to distribute them more evenly across partitions (a short sketch follows this list).
• Aggregation Strategies: Use alternative aggregation strategies (e.g., sampling,
approximation algorithms) to reduce the impact of skewness during aggregation
operations.
• Broadcasting Small Tables: Broadcast small lookup tables to all worker nodes to avoid
shuffling large skewed datasets during join operations.
• Data Replication: Replicate small skewed partitions across multiple nodes to improve
parallelism and reduce processing time for skewed partitions.
• Example of a skewed column: Customer_order_status → completed: 1 lakh (100,000) records, in progress: 500, failed: 200, to be continued: 400, no status: 100
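• As referenced above, a rough sketch of salting for an aggregation on a skewed status column (assuming a SparkSession named spark and a hypothetical DataFrame named orders with a Customer_order_status column):

from pyspark.sql import functions as F

NUM_SALTS = 10  # number of salt buckets; chosen arbitrarily for this sketch

# Stage 1: add a random salt so the one hot status value is spread over many partitions
salted = orders.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
partial = salted.groupBy("Customer_order_status", "salt").count()

# Stage 2: drop the salt and combine the partial counts
final = partial.groupBy("Customer_order_status").agg(F.sum("count").alias("count"))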
Q 10: Explain a challenging situation you faced in your project.
Theoretical Questions for Data Engineers (Spark)
Part 2.
Q 11: What is the difference between map and flatMap?
• The Map operation is a transformation operation that applies a given
function to each element of an RDD or DataFrame, creating a new
RDD or DataFrame with the transformed elements. The function
takes a single input element and returns a single output element,
maintaining a one-to-one relationship between input and output
elements.
• FlatMap, on the other hand, is a transformation operation that
applies a given function to each element of an RDD or DataFrame
and "flattens" the result into a new RDD or DataFrame. Unlike Map,
the function applied in FlatMap can return multiple output elements
(in the form of an iterable) for each input element, resulting in a one-
to-many relationship between input and output elements.
Cont.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Example using map
mapped_rdd = rdd.map(lambda x: x * 2)
print("Using map:", mapped_rdd.collect())            # Output: [2, 4, 6, 8, 10]

# Example using flatMap
flat_mapped_rdd = rdd.flatMap(lambda x: [x, x * 2])
print("Using flatMap:", flat_mapped_rdd.collect())   # Output: [1, 2, 2, 4, 3, 6, 4, 8, 5, 10]
Q 12: What is the Catalyst Optimizer?
• The Catalyst Optimizer is a query optimization framework that Apache
Spark uses to improve the performance of DataFrame operations. It is
a sophisticated optimization engine that transforms DataFrame
operations into an optimized execution plan, taking advantage of
various optimizations such as filter pushdown, projection pruning,
and join reordering.
Cont.
• Logical Plan Optimization: The Catalyst Optimizer starts by creating a logical plan representing the
DataFrame operations specified by the user. It then applies various logical optimizations to this
plan, such as predicate pushdown and constant folding, to simplify and optimize the logical
representation of the query.
• Physical Plan Generation: Once the logical plan is optimized, the Catalyst Optimizer generates
multiple physical execution plans from the optimized logical plan. These physical plans represent
different ways to execute the DataFrame operations using Spark's execution engine.
• Cost-Based Optimization: Catalyst employs cost-based optimization techniques to estimate the
cost of executing each physical plan and selects the plan with the lowest estimated cost. This
approach allows Catalyst to make decisions based on the characteristics of the data and the
available resources, leading to more efficient query execution.
• Rule-Based Optimization: Catalyst uses a set of optimization rules to transform the logical and
physical plans. These rules cover a wide range of optimizations, including predicate pushdown,
constant folding, join reordering, and more. By applying these rules, Catalyst can generate more
efficient execution plans tailored to specific query patterns.
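• You can watch Catalyst at work with explain; extended output prints the parsed, analyzed and optimized logical plans followed by the physical plan it finally selects (a small sketch assuming a SparkSession named spark):

from pyspark.sql import functions as F

df = spark.range(1000000).withColumn("bucket", F.col("id") % 10)
df.filter(F.col("bucket") == 3).groupBy("bucket").count().explain(True)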
Q 13: What are broadcast variables and accumulators in PySpark?
• Broadcast Variables:
• Broadcast variables allow you to efficiently distribute large read-only
variables to all worker nodes in a PySpark cluster. This is particularly
useful when you have large datasets or lookup tables that are needed
for tasks across all nodes. Instead of sending the variable separately
with each task, broadcast variables are sent once to each node and
reused across multiple tasks.

broadcast_var = spark.sparkContext.broadcast([1, 2, 3, 4, 5])
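• Continuing the line above, tasks read the broadcast value through .value; the list is shipped to each executor once rather than with every task (a small sketch assuming a SparkSession named spark):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
in_list = rdd.filter(lambda x: x in broadcast_var.value)
print(in_list.collect())   # [1, 2, 3, 4, 5]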


Cont.
• Accumulators are variables that are only "added" to through an
associative and commutative operation and are used to implement
counters or sums across the cluster. They are mainly used as a way for
tasks running on worker nodes to update a shared variable that lives
on the driver node. Accumulators are primarily used for debugging
purposes or for monitoring the progress of a computation.

even_counter = spark.sparkContext.accumulator(0)
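• Continuing the line above, a minimal sketch of counting even numbers with the accumulator (assuming a SparkSession named spark); tasks on the workers can only add to it, and the driver reads the total after an action has run:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
rdd.foreach(lambda x: even_counter.add(1) if x % 2 == 0 else None)
print(even_counter.value)   # 3, read back on the driver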
Q 14: Explain a scenario where we want to reduce the number of partitions but prefer repartition over coalesce.
• Suppose you have a PySpark DataFrame that is heavily skewed, meaning that a
few partitions contain significantly more data than others. This skew can lead to
straggler tasks and inefficient resource utilization during computations, as the
workload is not evenly distributed across the cluster.
• In this scenario, coalesce might not be sufficient because it only merges existing
partitions without performing a full shuffle. Since the data is heavily skewed,
simply merging partitions may not redistribute the data evenly, and the skew may
persist in the resulting partitions. Instead, you can use repartition to explicitly
shuffle and redistribute the data evenly across a smaller number of partitions.
• By using repartition, you ensure that the data is evenly distributed across the
specified number of partitions, which helps mitigate the effects of skew and
improves performance by balancing the workload across the cluster. Although
repartition involves a full shuffle, in cases of heavy skew, the benefits of evenly
distributed data may outweigh the overhead of shuffling.
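• A minimal sketch of the two options (skewed_df is a hypothetical, heavily skewed DataFrame):

# coalesce only merges existing partitions (no full shuffle), so the skew can persist
merged = skewed_df.coalesce(8)

# repartition performs a full shuffle and spreads the rows evenly over 8 partitions
balanced = skewed_df.repartition(8)

print(merged.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())   # 8 8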
Q 15: What is bucketing in Spark?
• Bucketing is a technique used in Apache Spark to organize data into
more manageable and evenly sized partitions based on a hash of one
or more columns. It's particularly useful when you have large datasets
and want to optimize certain types of operations, such as joins or
aggregations, by ensuring that related data is colocated within the
same partitions.

df.write.bucketBy(4, "id").saveAsTable("bucketed_table", format="parquet")
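• As a follow-up sketch (the second table and DataFrame names are hypothetical): if another table is bucketed the same way on the join key, Spark can typically join the two bucketed tables without shuffling either side, which you can verify in the physical plan:

other_df.write.bucketBy(4, "id").saveAsTable("other_bucketed_table", format="parquet")

left = spark.table("bucketed_table")
right = spark.table("other_bucketed_table")
left.join(right, "id").explain()   # no Exchange (shuffle) on the bucketed sides when bucket counts match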
