Azure DE Interview Questions
What is the logic used while moving from the Silver layer to the
Bronze layer?
The Bronze Layer typically contains raw, unprocessed data ingested from
the source.
The Silver Layer contains cleaned, transformed, and enriched data.
Moving from Silver to Bronze is uncommon unless you're archiving
transformed data back to the raw layer for tracking purposes. If this
occurs:
Data transformations (e.g., data standardization) are reverted.
4. What is Serialization?
Serialization is the process of converting an object into a format that can
be stored or transmitted and later reconstructed. PySpark uses Pickle for
Python objects and can be configured to use Kryo for JVM object serialization.
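For example, Kryo can be enabled for JVM-side serialization through Spark configuration. A minimal sketch; the app name and buffer size below are arbitrary placeholders:
from pyspark.sql import SparkSession

# Enable Kryo for JVM object serialization; Python objects are still pickled.
spark = (
    SparkSession.builder
    .appName("serialization-demo")  # placeholder name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "128m")  # optional buffer tuning
    .getOrCreate()
)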
5. What are Different Types of Triggers in PySpark?
Default (micro-batch) Trigger: Runs the next micro-batch as soon as the previous one completes.
Fixed-interval Trigger: Processes data at a set time interval.
.trigger(processingTime='1 minute')
One-time Trigger: Processes data once and stops.
.trigger(once=True)
Continuous Trigger: Processes data continuously (low latency).
.trigger(continuous="1 second")
Schedule Trigger (Azure Data Factory): Initiates pipelines or workflows at specific times; this is an ADF trigger rather than a Structured Streaming trigger.
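For context, here is a minimal Structured Streaming write showing where .trigger() fits; the rate source, output path, and checkpoint location are placeholders:
# Placeholder streaming source and paths, for illustration only.
stream_df = spark.readStream.format("rate").load()

query = (stream_df.writeStream
    .format("parquet")
    .option("path", "/tmp/stream_out")
    .option("checkpointLocation", "/tmp/stream_chk")
    .trigger(processingTime="1 minute")  # fixed-interval micro-batches (see the options above)
    .start())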
PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank
window_spec = Window.partitionBy("department").orderBy(col("salary").desc())
df = df.withColumn("rank", dense_rank().over(window_spec))
df.filter(df.rank == 2).show()
SQL:
SELECT department, employee_name, salary
FROM (
SELECT department, employee_name, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS
rnk
FROM employees
) AS ranked
WHERE rnk = 2;
22 - You have your data in an on-premises SQL database and you want to load it into ADLS, but the
data keeps changing and arrives incrementally. How will you design your pipeline?
ans- To load incrementally changing data from an on-premises SQL database into Azure Data Lake
Storage (ADLS), leverage a change data capture mechanism in the SQL database, typically a
timestamp column or a dedicated change tracking table, and then use an ETL tool like
Azure Data Factory to extract only the new or updated data based on a
"watermark" value, effectively capturing only the incremental changes between
runs.
Key components of the pipeline:
Source Data Store (SQL on-premise):
Change Tracking: Ensure your SQL tables have a suitable column (like a
timestamp or a change flag) to identify newly added or updated records.
Stored Procedures: Create stored procedures to efficiently retrieve incremental
data based on the watermark value.
Azure Data Factory (ETL Tool):
Linked Services: Establish connections to your on-premises SQL database (through a self-hosted
integration runtime) and to ADLS.
Datasets: Define source datasets (pointing to the SQL table with change tracking
information) and sink datasets (in ADLS).
Lookup Activity:
Retrieve the current "watermark" value from a dedicated table in the SQL
database (or potentially store it in Azure Key Vault for better management).
Copy Activity:
Use a SQL query in the Copy Activity to filter data based on the stored watermark
value, only extracting records that were modified after the last run.
Configure partitioning in ADLS to organize data by date or other relevant criteria
for efficient querying.
Sink Transformation:
Transform the data as needed before writing it to ADLS, including adding
appropriate metadata like timestamps or change flags.
Pipeline execution flow:
1. Initialize Watermark:
On the first run, retrieve the initial watermark value (potentially the oldest
timestamp in the data).
2. Extract Incremental Data:
Use a Lookup activity to fetch the current watermark value.
Execute a SQL query in the Copy activity that filters data based on the
watermark value, selecting only records modified after the last run.
3. Load to ADLS:
Copy the extracted data to the designated ADLS location, potentially using
partitioning for better organization.
4. Update Watermark:
After successful data load, update the watermark value in the SQL database to
the latest timestamp processed to ensure the next run picks up new changes.
Considerations:
Change Data Capture (CDC):
If your SQL database supports CDC features, utilize them for more efficient
incremental data capture.
Error Handling:
Implement error handling mechanisms to retry failed operations and log issues
for debugging.
Performance Optimization:
Optimize SQL queries to efficiently filter data based on the watermark.
Consider using bulk loading options for large data transfers.
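The same watermark pattern can also be sketched directly in PySpark as an illustration; the JDBC URL, table dbo.orders, column modified_date, credentials, and ADLS paths below are all hypothetical placeholders, not part of the original design:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

jdbc_url = "jdbc:sqlserver://onprem-host:1433;databaseName=sales"  # placeholder

# 1. Look up the last watermark (stored here as a small Parquet file in ADLS).
last_wm = (spark.read.parquet("abfss://meta@account.dfs.core.windows.net/watermark")
           .agg(F.max("watermark_value")).first()[0])

# 2. Extract only rows modified after the last watermark.
incremental_df = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", f"SELECT * FROM dbo.orders WHERE modified_date > '{last_wm}'")
    .option("user", "<user>").option("password", "<secret>")
    .load())

# 3. Load to ADLS, partitioned by date for efficient querying.
(incremental_df
    .withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("abfss://data@account.dfs.core.windows.net/orders"))

# 4. Update the watermark with the newest timestamp processed.
new_wm = incremental_df.agg(F.max("modified_date")).first()[0]
if new_wm is not None:
    (spark.createDataFrame([(new_wm,)], ["watermark_value"])
        .write.mode("overwrite")
        .parquet("abfss://meta@account.dfs.core.windows.net/watermark"))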
26) What is the difference between Self Hosted and Autoresolve runtime?
Self-hosted runtime is used to connect to private networks or on-premises data
sources, while Autoresolve runtime is used for data integration scenarios. Self-
hosted runtime is more flexible, but requires the user to manage the underlying
infrastructure.
Self-hosted runtime
Use case: Connect to private networks or on-premises data sources
Flexibility: More flexible and can be configured to meet specific needs
Infrastructure: User is responsible for providing and maintaining the
infrastructure
Security: Requires additional security measures to protect data
AutoResolve runtime
Use case: Data integration scenarios where the user doesn't want to manage the
infrastructure
Flexibility: Automatically scales based on data needs, but the user can't define
the cluster size or compute type
Infrastructure: Managed by Azure; scales automatically based on data needs
Security: Can be used with Private Endpoint and Private Link to ensure data
security
28) What is the difference between Parquet format and Delta format?
Parquet is a columnar storage format for data, while Delta is a storage format built on top of
Parquet. Delta adds features like data versioning and ACID compliance, making it a good choice
for transactional workloads.
Storage:
Parquet
Stores data in columns, which is efficient for read-intensive tasks like analytical
queries.
Delta
Stores metadata about Parquet files in a transaction log, which makes it easier to
read and skip files.
Features:
Parquet: Known for its high compression and query performance.
Delta: Adds transactional capabilities, data versioning, and ACID compliance.
Use cases:
Parquet: Well-suited for big data processing and data lake applications.
Delta: Preferable for transactional workloads, real-time analytics, and managing
data lakes.
Benefits:
Parquet: Efficient storage and query performance.
Delta: Improved storage layer with caching and indexing for better read and
write performance.
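As a quick illustration (paths are placeholders; the Delta write assumes a Databricks runtime or the delta-spark package), the two formats are written from PySpark like this:
# Plain Parquet: columnar data files only.
df.write.mode("overwrite").parquet("/mnt/lake/events_parquet")

# Delta: the same Parquet files plus a _delta_log transaction log,
# enabling ACID writes, versioning, and time travel.
df.write.format("delta").mode("overwrite").save("/mnt/lake/events_delta")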
30) File format of Delta tables?
Delta tables use the Apache Parquet file format. Delta Lake is an open-source
storage layer that extends the Parquet file format.
Explanation
Delta tables are made up of Parquet files that store data and a transaction log.
The transaction log tracks metadata about the transactions, such as the files
added to the table, the schema, and column level metadata.
Delta Lake provides enhancements over traditional Parquet files, including
transactional capabilities, schema evolution, and time travel.
Delta tables allow you to create, read, update, and delete data in a
transactionally consistent manner.
You can update table data in Delta Lake using: SQL, Python, Scala, and Java
Delta Lake is built on top of Parquet, and as such, Databricks also has optimized
readers and writers for interacting with Parquet files.
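A brief sketch of these transactional operations from Python, assuming the delta-spark package (or a Databricks runtime) and a placeholder table path:
from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "/mnt/lake/events_delta")  # placeholder path

# Transactionally update and delete rows in the underlying Parquet files.
tbl.update(condition="status = 'stale'", set={"status": "'archived'"})
tbl.delete("event_date < '2020-01-01'")

# Time travel: read an earlier version recorded in the transaction log.
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/events_delta")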
31) Subqueries in SQL?
ans- An SQL subquery is simply a query nested inside another query. A subquery is often used to
fetch data that depends on the result of another query, for example values coming from a second
table. The subquery is also referred to as the inner query, while the statement containing it is
called the outer query or outer select. Subqueries can be used with SELECT, INSERT, UPDATE, and
DELETE statements, along with operators such as =, <, >, >=, <=, IN, and BETWEEN.
By contrast, aggregate functions are often used with the GROUP BY clause of the SELECT
statement. The GROUP BY clause splits the result set into groups of values, and the aggregate
function returns a single value for each group.
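For illustration, both ideas can be run through spark.sql; the employees table and its columns here are hypothetical:
# Subquery: the inner query feeds the outer one.
above_avg = spark.sql("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""")

# Aggregate function with GROUP BY: one value per group.
dept_stats = spark.sql("""
    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
""")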
When you use managed private endpoints, traffic between your Azure Synapse
workspace and other Azure resources traverses entirely over the Microsoft
backbone network. Managed private endpoints protect against data exfiltration.
A managed private endpoint uses a private IP address from your Managed Virtual
Network to effectively bring the Azure service that your Azure Synapse
workspace is communicating with into your Virtual Network. Managed private
endpoints are mapped to a specific resource in Azure and not the entire service.
Customers can limit connectivity to a specific resource approved by their
organization.
Some users might intentionally run large queries under a pay-per-query price
model, but you can always manage your expense. The Cost Control feature in
Azure Synapse Analytics allows users to set budgets based on daily,
weekly, and monthly spend. Users can configure the feature through Synapse
Studio or via a stored procedure in T-SQL.
In PySpark, the .mode() function is used to specify the behavior when writing
data to a data source. It determines how the data is saved if the target location
(e.g., a file or table) already exists. Below are the different modes available:
1. append
Description: Adds the data to the existing data at the target location.
Use Case: When you want to add new records to an existing dataset without
deleting or replacing the existing data.
Example:
df.write.mode("append").csv("path/to/output")
2. overwrite
Description: Overwrites the existing data at the target location with the new
data.
Use Case: When you want to replace the entire dataset with a new dataset.
Example:
df.write.mode("overwrite").parquet("path/to/output")
3. error (or errorifexists)
Description: Throws an error if the target location already exists.
Use Case: When you want to avoid accidental overwriting of data and ensure the
location is empty before writing.
Example:
df.write.mode("error").json("path/to/output")
Note: error and errorifexists are synonymous.
4. ignore
Description: Does nothing if the target location already exists. The write
operation is skipped.
Use Case: When you want to ensure no changes are made if the dataset already
exists at the location.
Example:
df.write.mode("ignore").orc("path/to/output")
Choosing the Right Mode
append: Use when you need to preserve existing data and add more.
overwrite: Use when you need to replace the existing data entirely.
error: Use when overwriting existing data is undesirable and should raise an
exception.
ignore: Use when you want a "do-nothing" approach if the data already exists.
38 - What is the Runtime version of the PySpark cluster you have used?
Databricks Runtime version: Python version
12.2 LTS ML, 12.2 LTS: 3.9
11.3 LTS ML, 11.3 LTS: 3.9
10.4 LTS ML, 10.4 LTS: 3.8
9.1 LTS ML, 9.1 LTS: 3.8
7.3 LTS: 3.7
In Databricks, the runtime version of the PySpark cluster depends on the task and environment,
but a recent LTS runtime (such as 12.2 LTS) is commonly used. Each runtime bundles a specific
Spark and Scala version with a compatible Python version, for example Spark 3.4.x with
Python 3.10.x.
How to Check Runtime Version in Databricks:
Go to Clusters in the Databricks workspace.
Select your cluster.
Look at the Cluster Configuration, where the Databricks Runtime Version (e.g.,
12.2 LTS) is specified.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
print(spark.version)  # the Spark version bundled with the runtime
45. How can you cache your data in PySpark for performance?
df.cache()    # keeps the DataFrame in memory across actions
df.persist()  # like cache(), but lets you choose a storage level (memory, disk, or both)
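If finer control is needed, a storage level can be passed explicitly; this short sketch uses standard PySpark storage levels:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when it does not fit in memory
df.count()                                # an action materializes the cache
df.unpersist()                            # release the cached data when done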
47. When using partitioning, does it create 100 partitions or how is it?
The number of partitions depends on the input data (file splits) and Spark's configuration;
shuffles default to 200 partitions (spark.sql.shuffle.partitions). You can set the count
explicitly with:
df.repartition(100) # Manually set number of partitions
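A few related calls, all standard DataFrame methods, shown as a short sketch:
print(df.rdd.getNumPartitions())                       # inspect the current partition count
df_repart = df.repartition(100)                        # full shuffle into exactly 100 partitions
df_small = df_repart.coalesce(10)                      # shrink partitions without a full shuffle
spark.conf.set("spark.sql.shuffle.partitions", "100")  # default partition count for shuffles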
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def custom_function(x):
    return x * 2

custom_udf = udf(custom_function, IntegerType())          # wrap the function as a UDF
df = df.withColumn("value_doubled", custom_udf("value"))  # "value" is an example column
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
window_spec = Window.partitionBy("category").orderBy("value")
df = df.withColumn("rank", rank().over(window_spec))
62. SQL Query to Get Customer with the Highest Order Amount
SELECT c.customer_name, c.customer_id, SUM(o.amount) AS total_amount
FROM customer c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name, c.customer_id
ORDER BY total_amount DESC
LIMIT 1;