Azure DE Interview Questions


1. What is the logic used while moving from the Silver layer to the Bronze layer?
 The Bronze Layer typically contains raw, unprocessed data ingested from
the source.
 The Silver Layer contains cleaned, transformed, and enriched data.
 Moving from Silver to Bronze is uncommon unless you're archiving
transformed data back to the raw layer for tracking purposes. If this
occurs:
o Data transformations (e.g., data standardization) are reverted.

o Metadata mapping is retained for traceability.

2. Difference between Repartitioning and Coalesce?


 Repartitioning: Increases or decreases the number of partitions by
shuffling data across the cluster. It is resource-intensive.
df.repartition(10)
 Coalesce: Reduces the number of partitions without a full shuffle by
merging existing partitions (typically on the same executors).
df.coalesce(4)

3. PySpark Optimization Techniques


 Use cache() or persist() for reused datasets.
 Optimize partitioning using repartition() and coalesce().
 Avoid wide transformations where possible.
 Use broadcast joins for small datasets (see the sketch after this list).
 Enable predicate pushdown by using proper filters.
 Write data using the Parquet or Delta format for efficient reads/writes.
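
As an illustration of two of these techniques, here is a minimal sketch assuming a
large DataFrame df_large and a small lookup DataFrame df_small that share a key
column id (names are illustrative):

from pyspark.sql.functions import broadcast

# Cache a DataFrame that is reused across several actions
df_large.cache()

# Broadcast the small lookup table so the join avoids shuffling df_large
joined = df_large.join(broadcast(df_small), on="id", how="left")
joined.show()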

4. What is Serialization?
 Serialization is the process of converting an object into a format that can
be stored or transmitted and later reconstructed. PySpark serializes Python
objects with Pickle, while Spark's JVM side can use Java or Kryo serialization.
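
For example, Kryo can be enabled for JVM-side serialization through Spark
configuration (a minimal sketch; the application name is illustrative, and Python
objects are still serialized with Pickle):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("SerializationDemo")  # illustrative name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
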
5. What are Different Types of Triggers in PySpark?
 Default Trigger: Processes data as soon as it's available.
 Fixed-interval Trigger: Processes data at a set time interval.
.trigger(processingTime='1 minute')
 One-time Trigger: Processes data once and stops.
.trigger(once=True)
 Continuous Trigger: Processes data continuously (low latency).
.trigger(continuous="1 second")
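
Putting a trigger into a full streaming write (a minimal sketch using the built-in
rate source and console sink purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TriggerDemo").getOrCreate()

# The rate source emits rows continuously; any streaming source works here
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream_df.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="1 minute")  # swap for once=True or continuous="1 second"
    .start()
)
query.awaitTermination()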

6. Difference Between Tumbling Window Trigger and Schedule Trigger


 Tumbling Window Trigger:
o Fires on fixed-size, non-overlapping, contiguous time windows and can
backfill past windows.

o Commonly used for incremental, time-sliced data loads.

 Schedule Trigger:
o Initiates pipelines or workflows at specific wall-clock times on a defined
schedule.

7. Copy SQL On-Prem Data to ADLS


 Use Azure Data Factory (ADF) or Databricks.
 Steps:
1. Install a self-hosted integration runtime so ADF can reach the on-premises SQL Server.
2. Set up linked services for SQL Server and ADLS in ADF.
3. Use the Copy Data tool/activity to transfer data incrementally or fully.

8. Different SQL Joins


 Inner Join: Matches rows in both tables.
 Left Join: Includes all rows from the left table and matching rows from the
right.
 Right Join: Includes all rows from the right table and matching rows from
the left.
 Full Join: Includes rows from both tables, even without matches.
 Cross Join: Cartesian product of both tables.
 Self Join: Table joins with itself.
 Exclusive Joins: Left/right outer joins filtered to keep only non-matching
rows (anti joins; see the sketch below).
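
A minimal PySpark sketch of an exclusive (anti) join, assuming DataFrames df_left
and df_right that share a key column id (names are illustrative):

# Rows of df_left that have no match in df_right (left-exclusive join)
exclusive_left = df_left.join(df_right, on="id", how="left_anti")
exclusive_left.show()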

9. PySpark Query and SQL for Second-Highest Salary Department-Wise


 PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank, desc

window_spec = Window.partitionBy("department").orderBy(desc("salary"))
df = df.withColumn("rank", dense_rank().over(window_spec))
df.filter(df.rank == 2).show()
 SQL:
SELECT department, employee_name, salary
FROM (
    SELECT department, employee_name, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
) AS ranked
WHERE rnk = 2;

10. Why Use Dense Rank Over Rank?


 Dense Rank does not skip rank values after ties, whereas Rank does.
 Example with a tie at second place: 1, 2, 2, 3 (DENSE_RANK) vs 1, 2, 2, 4 (RANK).

11. Sort a PySpark List


data = [5, 1, 4, 3, 2]
sorted_data = sorted(data)
print(sorted_data)
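
If the intent is to sort a PySpark DataFrame rather than a Python list, a minimal
sketch (assuming a DataFrame df with a column value):

df.orderBy("value").show()              # ascending
df.orderBy(df.value.desc()).show()      # descending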

12. Data Warehouses Used


 Snowflake, Azure Synapse, Google BigQuery, Amazon Redshift.

13. Purpose of Linked Services in ADF


 Provides connection details for external resources (e.g., SQL Server,
ADLS).

14. How Do You Monitor Pipelines?


 Using ADF monitoring tools.
 Configure alerts and logs in Azure Monitor.

15. SQL Optimization Techniques


 Use indexes.
 Avoid SELECT *.
 Optimize joins and subqueries.
 Use partitioning and clustering.

16. Challenges Faced


 Handling large datasets efficiently.
 Managing schema evolution in Delta tables.
 Optimizing pipeline performance.

17. Tools Used


 Azure Data Factory, Databricks, Power BI, SQL, Snowflake.

18. Source and Sink


 Source: Origin of data (e.g., SQL, API).
 Sink: Destination for processed data (e.g., ADLS, Data Warehouse).

19. Incremental vs Full Data


 Incremental: Only new/changed data is processed.
 Full: Entire dataset is processed.

20) What is metadata?


ans- In Databricks, metadata is information that describes data assets, such as
schemas, columns, dashboards, and queries. It helps users understand and use
data for various purposes.
What does metadata do?
 Helps with data governance: Metadata helps enforce data governance policies.
 Improves data quality: Metadata helps maintain data quality and compliance.
 Simplifies data integration: Metadata provides information about a dataset's
source, format, and transformations.
 Improves data lineage and discovery: Metadata helps users find and analyze
data assets.
 Helps with auditing: Metadata can be used for auditing.
How is metadata managed in Databricks?
 Metadata is collected, organized, maintained, and managed in a central
repository.
 Metadata can be added to databases, tables, columns, and individual commits.
 Metadata can be made searchable and easy to understand.
This metadata and governance layer is the Databricks Unity Catalog, which
indexes all data within the lakehouse architecture, such as schemas, tables,
views, and more. Without this metadata layer, each Databricks workspace would
be a silo.

What is Metastore in Databricks?


A metastore is the top-level container of objects in Unity Catalog. It stores data
assets (tables and views) and the permissions that govern access to them.
Databricks account admins can create metastores and assign them to Databricks
workspaces to control which workloads use each metastore.
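
Objects in a Unity Catalog metastore are addressed with a three-level namespace,
catalog.schema.table. A minimal query sketch with illustrative names:

spark.sql("SELECT * FROM main.sales.orders LIMIT 10").show()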

21) Explain PySpark's architecture.

ans- PySpark's architecture is based on Apache Spark's distributed computing
framework and exposes a Python interface that makes data analysis and
manipulation easier. It enables distributed data processing by using Spark's
parallel processing capabilities.
Key components of PySpark's architecture
 Spark driver: Manages the application's lifecycle, schedules tasks, and
monitors job status.
 Spark executors: Worker processes that run on worker nodes and execute
tasks assigned by the driver.
 Resilient Distributed Dataset (RDD): The foundation of Spark applications;
data is divided into immutable partitions spread across the cluster.
 Spark DataFrames API: Supports modern big data and data science
applications.
What can PySpark be used for? Creating scalable analyses and pipelines, and
data analysis and manipulation.
The Driver Program is the main entry point for a PySpark application: it contains
the user's application code and coordinates the execution of tasks on the cluster,
while the SparkContext is the entry point to any Spark functionality.
As mentioned before, Apache Spark lets you treat many machines as one
machine. This is done via a master-worker architecture in which a driver
(master) node in the cluster is accompanied by worker nodes.

22 - You have your data in an on-premises SQL Server and want to load it into
ADLS, but the data keeps changing and arrives incrementally. How will you design
your pipeline?
ans- To design a pipeline that loads incrementally changing data from on-premises
SQL Server to Azure Data Lake Storage (ADLS), leverage a change data capture
mechanism in the SQL database, typically a timestamp column or a dedicated
change tracking table, and then use an ETL tool such as Azure Data Factory to
extract only the new or updated data based on a "watermark" value, effectively
capturing only the incremental changes between runs.
Key components of the pipeline:
Source Data Store (SQL on-premise):
Change Tracking: Ensure your SQL tables have a suitable column (like a
timestamp or a change flag) to identify newly added or updated records.
Stored Procedures: Create stored procedures to efficiently retrieve incremental
data based on the watermark value.
Azure Data Factory (ETL Tool):
Linked Services: Establish connections to your SQL on-premise database and
ADLS.
Datasets: Define source datasets (pointing to the SQL table with change tracking
information) and sink datasets (in ADLS).
Lookup Activity:
Retrieve the current "watermark" value from a dedicated table in the SQL
database (or potentially store it in Azure Key Vault for better management).
Copy Activity:
Use a SQL query in the Copy Activity to filter data based on the stored watermark
value, only extracting records that were modified after the last run.
Configure partitioning in ADLS to organize data by date or other relevant criteria
for efficient querying.
Sink Transformation:
Transform the data as needed before writing it to ADLS, including adding
appropriate metadata like timestamps or change flags.
Pipeline execution flow:
1. Initialize Watermark:
On the first run, retrieve the initial watermark value (potentially the oldest
timestamp in the data).
2. Extract Incremental Data:
Use a Lookup activity to fetch the current watermark value.
Execute a SQL query in the Copy activity that filters data based on the
watermark value, selecting only records modified after the last run.
3. Load to ADLS:
Copy the extracted data to the designated ADLS location, potentially using
partitioning for better organization.
4. Update Watermark:
After successful data load, update the watermark value in the SQL database to
the latest timestamp processed to ensure the next run picks up new changes.
Considerations:
Change Data Capture (CDC):
If your SQL database supports CDC features, utilize them for more efficient
incremental data capture.
Error Handling:
Implement error handling mechanisms to retry failed operations and log issues
for debugging.
Performance Optimization:
Optimize SQL queries to efficiently filter data based on the watermark.
Consider using bulk loading options for large data transfers.
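
The answer above describes this pattern with ADF activities; purely as an
illustration of the same watermark logic, here is a minimal PySpark/JDBC sketch
(connection details, table, and column names are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("IncrementalLoadSketch").getOrCreate()

# Watermark from the previous successful run (e.g., read from a control table)
last_watermark = "2024-01-01 00:00:00"

# Extract only rows modified after the watermark; the query is pushed down to SQL Server
incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://onprem-host;databaseName=SalesDB")  # placeholder
    .option("query", f"SELECT * FROM dbo.Orders WHERE LastModified > '{last_watermark}'")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the increment in ADLS, partitioned by load date for easier querying
(incremental_df
    .withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("abfss://bronze@mystorageaccount.dfs.core.windows.net/orders/"))

# Persist the new watermark (max LastModified just processed) for the next run
new_watermark = incremental_df.agg(F.max("LastModified")).first()[0]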

23) What are the types of clusters?

ans- Databricks offers several types of clusters, including:


Single-node clusters: These clusters are not designed for sharing and are best for
single jobs. They have restricted resources and can be configured to stop after a
period of inactivity.
High concurrency clusters: These clusters are good for sharing and enable low
query latencies and maximum resource utilization.
Auto-scaling clusters: These clusters can automatically add or remove nodes
based on workload metrics.
GPU clusters: These clusters contain pre-installed packages and libraries that can
be used for building data pipelines, Notebook analysis, and more.
Standard clusters: These clusters are good for a single user and can run
workloads created in languages such as SQL, Python, Scala, and R.
Job clusters: Created automatically to run a scheduled job and terminated when
the job finishes.
All-purpose clusters: Created manually for interactive, collaborative analysis in
notebooks. Job clusters and all-purpose clusters are the two fundamental cluster
categories in Databricks.
Each cluster type has unique benefits, including cost-effectiveness, scalability,
and processing power. A common summary lists five types: single-node, multi-node,
auto-scaling, high-concurrency, and GPU-enabled clusters.

24) What is Liquid Clustering?


Liquid clustering is a data management technique in Databricks that organizes
data based on clustering keys. It's designed to improve query performance and
streamline data layout in Delta Lake tables.
Benefits of liquid clustering
Dynamic data layout: Liquid clustering adapts to new workloads and query
patterns over time.
Self-tuning: Liquid clustering automatically determines the best data layout.
Minimal data layout decisions: Users can choose clustering keys based on query
access patterns.
Incremental: Data is only rewritten when necessary.
Cost-effective: Liquid clustering simplifies data layout decisions, which can be
more cost-effective.
How to use liquid clustering
Databricks recommends using Databricks Runtime 15.2 or higher for tables with
liquid clustering.
Schedule regular OPTIMIZE jobs to cluster data.
Use the delta.targetFileSize table property to control file sizes.
Monitor cluster performance metrics, such as CPU and memory usage.
Monitor the performance of Spark jobs to understand how Liquid Clustering
impacts job execution times.
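
A minimal Databricks SQL sketch executed from PySpark (table and column names
are illustrative; assumes a Databricks Runtime that supports liquid clustering):

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_events (
        event_id BIGINT,
        event_date DATE,
        amount DOUBLE
    )
    CLUSTER BY (event_date)
""")

# Re-cluster data incrementally as new files arrive
spark.sql("OPTIMIZE sales_events")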

25) What are Integration Runtimes?


ans-
Integration runtime (IR) is a compute infrastructure that allows data integration
across different network environments. There are three types of IRs in Azure
Data Factory: Azure, Self-hosted, and Azure-SSIS.
Types of IRs
Azure IR
A fully managed IR that can scan Azure and non-Azure data sources. It supports
connections to data stores and compute services with publicly accessible
endpoints.
Self-hosted IR
An IR that can scan data sources in an on-premises network or a virtual network.
You need to install it on an on-premises machine or a virtual machine inside your
private network.
Azure-SSIS IR
A VM that runs the SSIS engine and allows you to execute SSIS packages. It can
access resources in both public and private networks.
What IRs do
Provide data integration capabilities like data movement and data flows
Support data movement, transformation, and activity dispatch across different
data stores and compute services
Allow you to execute pipelines to get data from a source and write it to a
destination

26) What is the difference between Self-hosted and AutoResolve integration runtime?
The self-hosted integration runtime is used to connect to private networks or on-
premises data sources, while the AutoResolve Azure integration runtime is a fully
managed, Microsoft-hosted runtime whose region is resolved automatically, used
when you don't want to manage infrastructure. The self-hosted runtime is more
flexible but requires the user to manage the underlying infrastructure.
Self-hosted runtime
Use case: Connect to private networks or on-premises data sources
Flexibility: More flexible and can be configured to meet specific needs
Infrastructure: User is responsible for providing and maintaining the
infrastructure
Security: Requires additional security measures to protect data
Autoresolve runtime
Use case: Data integration scenarios where the user doesn't want to manage the
infrastructure
Flexibility: Automatically scales based on data needs, but the user can't define
the cluster size or compute type
Infrastructure: Automatically scales based on data needs
Security: Can be used with Private Endpoint and Private Link to ensure data
security

27) What is the difference between ADLS and Blob storage?


Azure Data Lake Storage (ADLS) is designed for big data analytics, while Azure
Blob Storage is a general-purpose object store. ADLS Gen2 builds on Blob Storage
and adds a hierarchical namespace, giving it distributed file system semantics for
storing and processing large amounts of data, whereas Blob Storage is a flat
object store for unstructured data.
Storage type:
ADLS
A distributed file system that can store structured, semi-structured, and
unstructured data
Blob Storage
A cloud storage system for unstructured data, such as email, video, and binary
data
Storage tiers:
ADLS: Has hot and cool storage tiers, with hot being for frequently accessed data
and cool for infrequently accessed data
Blob Storage: Has different pricing tiers based on how often the data is accessed
Access:
ADLS: Integrates with other Azure services, such as Azure Databricks, Azure
HDInsight, and Azure Synapse Analytics
Blob Storage: Accessible via a REST API that can be accessed from anywhere
Cost:
ADLS: Has a pay-per-use model, where users pay for the data at rest, the number
of gigabytes stored, and the number of transactions
Blob Storage: Has different pricing tiers based on how often the data is accessed

28) What is the difference between Parquet format and Delta format?
Parquet is a columnar storage format for data, while Delta is a file format that's
built on top of Parquet. Delta adds features like data versioning and ACID
compliance, making it a good choice for transactional workloads.
Storage:
Parquet
Stores data in columns, which is efficient for read-intensive tasks like analytical
queries.
Delta
Stores metadata about Parquet files in a transaction log, which makes it easier to
read and skip files.
Features:
Parquet: Known for its high compression and query performance.
Delta: Adds transactional capabilities, data versioning, and ACID compliance.
Use cases:
Parquet: Well-suited for big data processing and data lake applications.
Delta: Preferable for transactional workloads, real-time analytics, and managing
data lakes.
Benefits:
Parquet: Efficient storage and query performance.
Delta: Improved storage layer with caching and indexing for better read and
write performance.
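
A minimal PySpark sketch of the two write paths (paths are placeholders; the Delta
write assumes the Delta Lake libraries bundled with Databricks):

# Plain Parquet: columnar data files only
df.write.mode("overwrite").parquet("/mnt/lake/sales_parquet")

# Delta: Parquet data files plus a _delta_log transaction log (ACID, time travel)
df.write.format("delta").mode("overwrite").save("/mnt/lake/sales_delta")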

29) What is Auto Loader in Databricks?


Databricks Auto Loader is a tool that ingests data from cloud storage services
into Databricks. It can process files in near real-time or in batches.
How it works
Auto Loader detects new or updated files in a specified cloud storage location
It triggers a Spark job to load and process the data
Auto Loader can process files in batches or near real-time
What it can do: process billions of files, handle late-arriving data, optimize
compute resource usage, detect schema drift, and recover data that might
otherwise be ignored or lost.
What file types it supports: JSON, CSV, PARQUET, AVRO, TEXT, BINARYFILE, and ORC.
Where it can be used: Azure Storage accounts, AWS S3, and GCP storage.
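
A minimal Auto Loader sketch (paths, schema location, and target table are
placeholders; assumes a Databricks environment where the cloudFiles source is
available):

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "abfss://meta@mystorageaccount.dfs.core.windows.net/schemas/orders")
    .load("abfss://landing@mystorageaccount.dfs.core.windows.net/orders/")
)

(df.writeStream
    .option("checkpointLocation", "abfss://meta@mystorageaccount.dfs.core.windows.net/checkpoints/orders")
    .trigger(availableNow=True)   # process all pending files, then stop
    .toTable("bronze.orders"))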

30) File format of Delta tables?
Delta tables use the Apache Parquet file format. Delta Lake is an open-source
storage layer that extends the Parquet file format.
Explanation
Delta tables are made up of Parquet files that store data and a transaction log.
The transaction log tracks metadata about the transactions, such as the files
added to the table, the schema, and column level metadata.
Delta Lake provides enhancements over traditional Parquet files, including
transactional capabilities, schema evolution, and time travel.
Delta tables allow you to create, read, update, and delete data in a
transactionally consistent manner.
You can update table data in Delta Lake using SQL, Python, Scala, and Java.
Delta Lake is built on top of Parquet, and as such, Databricks also has optimized
readers and writers for interacting with Parquet files.
31) Subqueries in SQL?
ans- An SQL subquery is a query nested inside another query. We use a subquery
when the result of one query is needed inside another, for example to fetch or
compare data across two tables. A subquery is often also referred to as an inner
query, while the statement containing it is called an outer query or outer SELECT.
Subqueries can be used with SELECT, INSERT, UPDATE, and DELETE statements
along with operators such as =, <, >, >=, <=, IN, and BETWEEN.
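
A minimal subquery sketch run through PySpark's SQL interface (table and column
names are illustrative):

# Employees earning more than the overall average salary
spark.sql("""
    SELECT employee_name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""").show()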

32)a) Aggregate function in Pyspark?


ans - Aggregate functions in PySpark are essential for summarizing data across
distributed datasets. They allow computations like sum, average, count,
maximum, and minimum to be performed efficiently in parallel across multiple
nodes in a cluster.
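
A minimal sketch (assuming a DataFrame df with columns department and salary):

from pyspark.sql import functions as F

df.groupBy("department").agg(
    F.count("*").alias("employees"),
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary"),
).show()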

whereas

32) b) aggregate function in sql


An aggregate function is a function that performs a calculation on a set of values,
and returns a single value.

Aggregate functions are often used with the GROUP BY clause of the SELECT
statement. The GROUP BY clause splits the result-set into groups of values and
the aggregate function can be used to return a single value for each group.

The most commonly used SQL aggregate functions are:

MIN() - returns the smallest value within the selected column


MAX() - returns the largest value within the selected column
COUNT() - returns the number of rows in a set
SUM() - returns the total sum of a numerical column
AVG() - returns the average value of a numerical column
Aggregate functions ignore null values (except for COUNT()).

33) What is the Delta table in ADF?


Delta is an open-source storage layer that provides ACID (Atomicity, Consistency,
Isolation, Durability) transactions and time travel capabilities on top of data
lakes. Delta files are stored as a collection of Parquet files with a transaction log
that maintains a record of all operations performed on the data

What is delta file format in Azure?


Delta table is the default data table format in Azure Databricks and is a feature
of the Delta Lake open source data framework. Delta tables are typically used for
data lakes, where data is ingested via streaming or in large batches. See the
Delta Lake quickstart for creating a table and for updating and modifying Delta
Lake tables.

34) What is a managed private endpoint in Azure?


Managed private endpoints are private endpoints created in a Managed Virtual
Network associated with your Azure Synapse workspace. Managed private
endpoints establish a private link to Azure resources. Azure Synapse manages
these private endpoints on your behalf. You can create Managed private
endpoints from your Azure Synapse workspace to access Azure services (such as
Azure Storage or Azure Cosmos DB) and Azure hosted customer/partner services.

When you use Managed private endpoints, traffic between your Azure Synapse
workspace and other Azure resources traverses entirely over the Microsoft
backbone network. Managed private endpoints protect against data exfiltration.
A Managed private endpoint uses a private IP address from your Managed Virtual
Network to effectively bring the Azure service that your Azure Synapse
workspace is communicating with into your Virtual Network. Managed private
endpoints are mapped to a specific resource in Azure and not the entire service.
Customers can limit connectivity to a specific resource approved by their
organization.

35) Dedicated SQL pool vs. serverless SQL ?

Dedicated SQL pool


The dedicated SQL pool was known as Azure SQL Data Warehouse before becoming
part of Synapse. When using Synapse SQL, a dedicated pool provides provisioned
analytical resources. A dedicated SQL pool is a big data solution that stores data
in relational tables with columnar storage, which improves query performance and
limits storage cost. Dedicated SQL pool compute is measured in Data Warehousing
Units (DWUs). After the data is stored in the dedicated SQL pool, users can utilize
it for analytics at large scale.

Serverless SQL pool


A serverless SQL pool queries and computes over large-scale data using a
distributed data processing system. The Azure serverless SQL pool doesn't require
users to set up infrastructure or maintain clusters. Instead, it uses a pay-per-use
model, which means no cost is incurred for unused resources; the charge is based
on the data processed by each query run.

Dedicated SQL Pool | Serverless SQL Pool
Allows you to run a query and ingest data from the business's data lake files | Allows users to query their data lake files
Demands an infrastructure setup | No infrastructure to set up or clusters to maintain
Requires dedicated servers to be acquired before conducting any operation | Seamless data transformation and exploration without any infrastructure in place
Data is stored in relational tables | Data is stored in the data lake
Cost is managed by pausing the SQL pool or scaling down the warehouse | Cost is automatically handled and invoiced on a pay-per-query basis
Cost is incurred for the resources reserved | Cost is incurred for the data processed per query
Pay per DWU provisioned | Pay per TB processed
Cost
Unlike dedicated SQL pools, where the user pays for reserved resources at a pre-
decided scale, the cost of serverless SQL is calculated per TB of data processed
based on the queries run. The pricing of a serverless SQL pool is around 5 USD per
TB of data processed, but costs can rise quickly if queries run over large multi-TB
datasets.

Some users might intentionally run large queries under a pay-per-query price
model, but you can always manage your expense. The Cost Control feature in
Azure Synapse Analytics allows users to set budgets on a daily, weekly, and
monthly basis. Users can configure the feature through Synapse Studio or via
stored procedures in T-SQL.

Use cases for Synapse serverless SQL pool


 Ad hoc analysis and data discovery: exploring analytical data as an alternative
to Apache Spark, particularly for personnel already hands-on with T-SQL, without
any need for cluster configuration.
 Ad hoc queries for data analysis and profiling, to work out transformation
requirements for newly added data sources.
 Providing data to the business for initial analysis before the ETL pipelines are
developed.
 Ad hoc analysis of historical data ingested in the data lake, compared with the
existing data in the data warehouse.

36) What are Managed Identities?


Managed identities for Azure resources provide Azure services with an
automatically managed identity in Microsoft Entra ID. You can use this identity to
authenticate to any service that supports Microsoft Entra authentication, without
having credentials in your code.

37 - What are the different modes available in PySpark's .mode() function?

In PySpark, the .mode() function is used to specify the behavior when writing
data to a data source. It determines how the data is saved if the target location
(e.g., a file or table) already exists. Below are the different modes available:

1. append
Description: Adds the data to the existing data at the target location.
Use Case: When you want to add new records to an existing dataset without
deleting or replacing the existing data.
Example:
df.write.mode("append").csv("path/to/output")
2. overwrite
Description: Overwrites the existing data at the target location with the new
data.
Use Case: When you want to replace the entire dataset with a new dataset.
Example:
df.write.mode("overwrite").parquet("path/to/output")
3. error (or errorifexists)
Description: Throws an error if the target location already exists.
Use Case: When you want to avoid accidental overwriting of data and ensure the
location is empty before writing.
Example:
df.write.mode("error").json("path/to/output")
Note: error and errorifexists are synonymous.
4. ignore
Description: Does nothing if the target location already exists. The write
operation is skipped.
Use Case: When you want to ensure no changes are made if the dataset already
exists at the location.
Example:
df.write.mode("ignore").orc("path/to/output")
Choosing the Right Mode
append: Use when you need to preserve existing data and add more.
overwrite: Use when you need to replace the existing data entirely.
error: Use when overwriting existing data is undesirable and should raise an
exception.
ignore: Use when you want a "do-nothing" approach if the data already exists.

38 - What is the runtime version of the PySpark cluster you have used?
Databricks Runtime version | Python version
12.2 LTS ML, 12.2 LTS | 3.9
11.3 LTS ML, 11.3 LTS | 3.9
10.4 LTS ML, 10.4 LTS | 3.8
9.1 LTS ML, 9.1 LTS | 3.8
7.3 LTS | 3.7

In Databricks, the runtime version of the PySpark cluster I would typically use is
dependent on the task and environment, but the runtime versions commonly
used include:

Databricks Runtime for Machine Learning (e.g., ML 12.0 or ML 13.x)

Includes pre-installed libraries for data science and machine learning.


Example: Spark 3.4.x with Python 3.10.
Databricks Runtime for General Purposes (e.g., 10.4 LTS, 12.2, or newer versions
like 13.x).

These include a specific Spark and Scala version with Python compatibility.
Example: Spark 3.4.x and Python 3.10.x.
How to Check Runtime Version in Databricks:
Go to Clusters in the Databricks workspace.
Select your cluster.
Look at the Cluster Configuration, where the Databricks Runtime Version (e.g.,
12.2 LTS) is specified.
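
This can also be checked from a notebook (a minimal sketch; spark is the session
Databricks provides):

import sys

print(spark.version)   # Spark version bundled with the Databricks Runtime
print(sys.version)     # Python version on the cluster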

39 - How are delta tables stored?


A Delta table stores data as a directory of files in cloud object storage and
registers that table's metadata to the metastore within a catalog and schema. All
Unity Catalog managed tables and streaming tables are Delta tables.

39. In Python, which libraries have you used?


For Data Analysis and Manipulation: pandas, numpy
For Data Visualization: matplotlib, seaborn, plotly
For Machine Learning: scikit-learn, tensorflow, keras, xgboost
For Big Data: pyspark, dask
For API Calls: requests, flask
For Database Operations: sqlalchemy, pyodbc, sqlite3

40. What is PySpark?


PySpark is the Python API for Apache Spark. It allows Python developers to
leverage Spark's distributed computing framework for processing large-scale
data and performing machine learning, SQL-like operations, and stream
processing.

41. Why should we use PySpark over Python?


Scalability: Processes massive datasets distributed across a cluster.
Speed: Optimized for parallel processing using Spark's in-memory computation.
Ease of Use: Provides a high-level API for working with structured and
unstructured data.
Integration: Works with tools like Hadoop, Hive, and HDFS.
42. How do we create the Spark session in PySpark?
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

43. How do we read files in PySpark?


# Reading CSV file
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Reading Parquet file


df = spark.read.parquet("path/to/file.parquet")

44. Difference between CSV and Parquet files?


Feature | CSV | Parquet
Format | Text-based | Binary, columnar storage
Compression | Not compressed | Highly compressed
Schema | No schema enforcement | Schema enforced
Read/Write Speed | Slower | Faster

45. How can you cache your data in PySpark for performance?
df.cache()    # Keeps the DataFrame in memory
df.persist()  # Allows choosing a storage level (memory, disk, or both)

46. What is partitioning, and what’s its role in PySpark?


Partitioning: Divides data into smaller, manageable chunks (partitions) across
nodes in a cluster.
Role: Improves parallelism and query performance.

47. When using partitioning, does it create 100 partitions or how is it?
The number of partitions depends on the input data and Spark's configuration
(the default shuffle partition count is 200, set by spark.sql.shuffle.partitions).
You can set it explicitly with:
df.repartition(100) # Manually set number of partitions

48. What is DAG (Directed Acyclic Graph)?


A logical representation of Spark's computation steps. It ensures fault tolerance
and optimizes task execution.

49. How do we implement custom transformations?


Using user-defined functions (UDFs):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def custom_function(x):
    return x * 2

udf_func = udf(custom_function, IntegerType())


df = df.withColumn("new_col", udf_func(df["existing_col"]))

50. What is a window function?


A function that performs calculations across a set of rows related to the current
row (a window), without collapsing them into a single output row:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

window_spec = Window.partitionBy("category").orderBy("value")
df = df.withColumn("rank", rank().over(window_spec))

51. Difference between WHERE and HAVING clause?


WHERE: Filters rows before grouping.
HAVING: Filters aggregated results after GROUP BY.
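
A minimal sketch through PySpark's SQL interface (table and column names are
illustrative):

spark.sql("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    WHERE active = true                -- filters rows before grouping
    GROUP BY department
    HAVING AVG(salary) > 50000         -- filters aggregated groups
""").show()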

52. What is Star Schema?


A data modeling design with a central fact table connected to dimension tables
in a star-like structure.
53. What are stored procedures?
Precompiled SQL code blocks stored in a database, used for repetitive
operations.

54. What are views?


Virtual tables created using SQL queries. They do not store data but provide a
layer for querying.

55. What are temporary tables?


Tables that exist only during the session or query execution.

56. Types of temporary tables?


Local Temporary Table: Specific to a session.
Global Temporary Table: Accessible to all sessions.

57. Different types of indexing?


Clustered Index: Sorts and stores data rows in order.
Non-Clustered Index: Stores pointers to data.
Unique Index: Ensures unique values in a column.

58. Difference between ETL and ELT?


ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform)
Transformation before loading | Transformation after loading
Used in traditional data warehouses | Used in modern data lakes

59. Difference between OLTP and OLAP database?


OLTP | OLAP
Transactional processing | Analytical processing
Normalized schema | Denormalized schema
60. Between DELETE, TRUNCATE, and DROP, which can be rolled back?
DELETE: Can be rolled back.
TRUNCATE & DROP: Generally cannot be rolled back (DDL operations), although
behavior can vary by database engine and transaction context.

61. Different ways to import data into Power BI?


Directly from databases (SQL, Oracle, etc.)
Files (CSV, Excel, JSON)
APIs
Azure and other cloud sources.

62. SQL Query to Get Customer with the Highest Order Amount
SELECT c.customer_name, c.customer_id, SUM(o.amount) AS total_amount
FROM customer c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name, c.customer_id
ORDER BY total_amount DESC
LIMIT 1;
