Data Engineering Questionnaire
SQL
1. Types of joins
● Inner Join:
○ An Inner Join retrieves records that have matching values in both the
left (or first) and right (or second) tables. It effectively combines rows
from both tables where the specified condition is met.
○ If there are no matching values for a particular row in one of the tables,
that row is excluded from the result.
● Left Join (Left Outer Join):
○ A Left Join retrieves all records from the left table and the matching
records from the right table. If there are no matches in the right table,
NULL values are included for the columns from the right table.
○ This type of join ensures that all records from the left table are retained,
with matching records from the right table.
● Right Join (Right Outer Join):
○ A Right Join is similar to a Left Join, but it retrieves all records from the
right table and the matching records from the left table. If there are no
matches in the left table, NULL values are included for the columns
from the left table.
○ This join type ensures that all records from the right table are retained,
with matching records from the left table.
● Full Outer Join:
○ A Full Outer Join combines all records from both the left and right
tables. It includes matching records as well as non-matching records
from both tables.
○ If there is no match for a row in one table, NULL values are included for
columns from that table. This type of join is used when you want to
retrieve all the data from both tables.
● Cross Join (Cartesian Join):
○ A Cross Join combines each row from the left table with every row
from the right table. This results in a Cartesian product, producing a
large output set.
○ This type of join is not based on a specified condition or key, and it
effectively combines every row from one table with every row from the
other, resulting in a comprehensive combination of all rows.
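As an illustration of the difference between an Inner Join and a Left Join, here is a minimal, runnable sketch using Python's built-in sqlite3 module (the table names and sample rows are made up for this example):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'John', 1), (2, 'Sarah', 2), (3, 'Eve', NULL);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'Marketing');
""")

# Inner join: only rows with a matching dept_id in both tables are returned
print(conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
""").fetchall())

# Left join: all employees are kept; unmatched rows get NULL (None) for dept_name
print(conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.dept_id
""").fetchall())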
WHERE Clause:
● The WHERE clause is used to filter rows from the result set based on a
specified condition or criteria.
● It is typically applied before any grouping or aggregation in a query, and it
operates on individual rows of the data.
● The WHERE clause is used in the SELECT, UPDATE, and DELETE statements to
restrict the rows that are affected by these operations.
● It allows you to filter rows based on various conditions, such as comparisons
(e.g., equal, not equal, less than, greater than), logical operators (e.g., AND,
OR), and wildcard characters (e.g., LIKE, %).
● It is commonly used to filter data to include only the rows that meet specific
criteria. For example, you can use it to retrieve all employees with a salary
above a certain threshold, or products with a price less than a given value.
HAVING Clause:
● The HAVING clause is used in conjunction with the GROUP BY clause to filter
groups of rows rather than individual rows.
● It is applied after the GROUP BY operation and allows you to filter groups based on
conditions that involve aggregate functions (e.g., SUM, COUNT, AVG).
● The HAVING clause is primarily used in the SELECT statement when you want to filter
grouped data.
● It allows you to specify conditions for aggregated values. For example, you can use it
to filter groups to retrieve only those groups where the sum of values in a particular
column is greater than a specified amount.
● The HAVING clause is especially useful when performing calculations on grouped
data and you want to include only certain groups based on aggregate results. It
ensures that only the groups meeting the specified criteria are included in the result set.
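To make the contrast between WHERE and HAVING concrete, here is a small runnable sketch using Python's built-in sqlite3 module (the orders table and its values are illustrative): WHERE filters individual rows before grouping, while HAVING filters the groups produced by GROUP BY.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('A', 50), ('A', 200), ('B', 30), ('B', 20), ('C', 500);
""")

# WHERE keeps only orders over 40 before grouping; HAVING then keeps only
# customer groups whose remaining total exceeds 100.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 40
    GROUP BY customer
    HAVING SUM(amount) > 100
""").fetchall()
print(rows)  # e.g. [('A', 250), ('C', 500)]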
To find the second-highest salary in the Employee table, you can use a SQL
query such as the one shown below.
Suppose you have an Employee table with the following sample data:
EmployeeID | Name    | Salary | DepartmentID
---------- | ------- | ------ | ------------
1          | John    | 60000  | 1
2          | Sarah   | 55000  | 2
3          | Michael | 75000  | 1
4          | Lisa    | 58000  | 2
5          | David   | 70000  | 3
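One common approach is to select the maximum salary that is lower than the overall maximum. A runnable sketch using Python's built-in sqlite3 module and the sample data above (a LIMIT/OFFSET variant is shown in the comment as an alternative):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (EmployeeID INTEGER, Name TEXT, Salary INTEGER, DepartmentID INTEGER);
    INSERT INTO Employee VALUES
        (1, 'John', 60000, 1), (2, 'Sarah', 55000, 2), (3, 'Michael', 75000, 1),
        (4, 'Lisa', 58000, 2), (5, 'David', 70000, 3);
""")

second_highest = conn.execute("""
    SELECT MAX(Salary) FROM Employee
    WHERE Salary < (SELECT MAX(Salary) FROM Employee)
""").fetchone()[0]
print(second_highest)  # 70000

# Alternative: SELECT DISTINCT Salary FROM Employee ORDER BY Salary DESC LIMIT 1 OFFSET 1;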
In this example, the result will be 70000, which is the second-highest salary in the
Employee table.
To retrieve employees by department using two tables, Employee and Department, you
can use SQL like this:
Suppose you have an Employee table with the following sample data:
EmployeeID | Name    | Salary | DepartmentID
---------- | ------- | ------ | ------------
1          | John    | 60000  | 1
2          | Sarah   | 55000  | 2
3          | Michael | 75000  | 1
4          | Lisa    | 58000  | 2
5          | David   | 70000  | 3
Suppose you also have a Department table with the following sample data:
DepartmentID | DepartmentName
------------ | --------------
1            | Sales
2            | Marketing
3            | Engineering
Now, to retrieve employees by department using both tables, you can use this SQL
query:
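A runnable sketch of that query, using Python's built-in sqlite3 module with the sample data above (an inner join on DepartmentID):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (EmployeeID INTEGER, Name TEXT, Salary INTEGER, DepartmentID INTEGER);
    CREATE TABLE Department (DepartmentID INTEGER, DepartmentName TEXT);
    INSERT INTO Employee VALUES
        (1, 'John', 60000, 1), (2, 'Sarah', 55000, 2), (3, 'Michael', 75000, 1),
        (4, 'Lisa', 58000, 2), (5, 'David', 70000, 3);
    INSERT INTO Department VALUES (1, 'Sales'), (2, 'Marketing'), (3, 'Engineering');
""")

rows = conn.execute("""
    SELECT e.Name, d.DepartmentName
    FROM Employee e
    INNER JOIN Department d ON e.DepartmentID = d.DepartmentID
""").fetchall()
for name, dept in rows:
    print(name, dept)
The query returns the following result: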
Name    | DepartmentName
------- | --------------
John    | Sales
Sarah   | Marketing
Michael | Sales
Lisa    | Marketing
David   | Engineering
This result shows each employee's name alongside their respective department name,
which is obtained by joining the Employee and Department tables based on the
DepartmentID.
5. What is the Primary Key?
A Primary Key is a column (or combination of columns) that uniquely identifies each
row in a table. The primary difference between a Primary Key and a Unique Key lies
in their handling of NULL values and their specific roles within a database:
Primary Key:
● Uniquely identifies each row in the table and cannot contain NULL values.
● A table can have only one Primary Key, and it is commonly referenced by
foreign keys in other tables.
Unique Key:
● Enforces uniqueness on a column or set of columns but, unlike a Primary
Key, it can typically allow NULL values (the exact behavior depends on the
database system).
● A table can have multiple Unique Keys.
DELETE:
● Operation Type:
○ DELETE is a Data Manipulation Language (DML) operation.
● Granularity:
○ It operates at the row level. You can specify conditions in a WHERE
clause to delete specific rows.
● Rollback:
○ DELETE can be rolled back, meaning you can undo the changes made
by a DELETE operation if it is part of a transaction.
● Logging:
○ DELETE operations are typically logged in the transaction log, which
allows for recovery in case of errors or accidental data deletion.
● Performance:
○ DELETE can be slower than TRUNCATE, especially when deleting a large
number of rows, as it generates more transaction log records and
triggers associated constraints and triggers.
● Constraints and Triggers:
○ DELETE triggers any associated triggers on the table, and it respects
foreign key constraints and other referential integrity constraints. This
means that if there are dependent records in other tables, DELETE
operations may fail or require additional actions.
TRUNCATE:
● Operation Type:
○ TRUNCATE is a Data Definition Language (DDL) operation.
● Granularity:
○ It operates at the table level. It removes all rows from the table.
● Rollback:
○ TRUNCATE cannot be rolled back. Once executed, the data is
permanently removed from the table. It does not generate individual log
records for each deleted row, making it irreversible.
● Logging:
○ TRUNCATE minimizes the amount of logging compared to DELETE. It
generates fewer log records and is more efficient for large-scale data
removal.
● Performance:
○ TRUNCATE is generally faster and more efficient than DELETE, especially
when you need to remove all rows from a table, as it bypasses
transaction log records and constraints.
● Constraints and Triggers:
○ TRUNCATE does not trigger associated triggers on the table, and it does
not enforce foreign key constraints. It's important to ensure data
integrity when using TRUNCATE by handling related tables or constraints
manually.
In summary, the primary differences between DELETE and TRUNCATE are their
operation type, granularity, rollback capability, logging, performance, and their
interaction with constraints and triggers. DELETE is a DML operation that works at
the row level, can be rolled back, logs individual row deletions, and triggers
constraints and triggers. TRUNCATE is a DDL operation that works at the table level,
cannot be rolled back, logs minimal information, and does not trigger constraints or
triggers, making it more efficient for removing all rows from a table but requiring
extra caution to maintain data integrity.
A view in SQL is a virtual table created by a SQL query. Unlike physical tables,
views do not store data on their own; instead, they are based on the result of a
SELECT statement. Views serve several important purposes in a relational
database system.
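As a small runnable sketch (Python's built-in sqlite3 module; the table and view names are illustrative), a view is defined once with CREATE VIEW and can then be queried like a table:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER, dept TEXT);
    INSERT INTO employees VALUES ('John', 60000, 'Sales'), ('Sarah', 55000, 'Marketing');
    -- The view stores only the SELECT definition, not the data itself
    CREATE VIEW high_earners AS
        SELECT name, salary FROM employees WHERE salary >= 60000;
""")
print(conn.execute("SELECT * FROM high_earners").fetchall())  # [('John', 60000)]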
The "BETWEEN" clause in SQL is used to filter results within a specified range of
values. It is typically used with numerical or date values, ensuring that the result
set includes values within the specified range.
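A minimal runnable sketch (Python's built-in sqlite3 module; the products table and prices are illustrative) of a BETWEEN filter over the 10 to 20 price range:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT, price REAL);
    INSERT INTO products VALUES ('Pen', 5), ('Notebook', 12), ('Bag', 20), ('Lamp', 35);
""")
rows = conn.execute("""
    SELECT name, price FROM products
    WHERE price BETWEEN 10 AND 20   -- BETWEEN is inclusive on both ends
""").fetchall()
print(rows)  # [('Notebook', 12.0), ('Bag', 20.0)]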
This query will return all products with prices equal to or greater than $10 and
equal to or less than $20.
Both "RANK" and "DENSE_RANK" are window functions in SQL used for ranking
rows in a result set based on the values in a specified column. However, the key
difference between them is how they handle tied values (i.e., rows with the same
value that receive the same rank):
RANK:
● "RANK" assigns the same rank to tied values but leaves gaps in the
ranking sequence for the next rank. For example, if two rows have the
same value and rank 1, the next row will receive rank 3, and there will be a
gap in the ranking sequence. If the next row has the same value, it will also
receive rank 3.
● "RANK" provides the same rank to all tied values and then increments the
rank by the number of tied values plus one for the next row.
● The result of using "RANK" might look like this:
Rank | Value
---- | -----
1 | 10
1 | 10
3 | 15
4 | 20
DENSE_RANK:
● "DENSE_RANK" assigns the same rank to tied values and does not leave
gaps in the ranking sequence. For tied values, it will have the same rank
for each of them without incrementing the rank for the next row.
● "DENSE_RANK" provides a more compact ranking, ensuring that tied
values receive the same rank, and the next row receives the next rank
without any gaps.
● The result of using "DENSE_RANK" might look like this:
Rank | Value
---- | -----
1 | 10
1 | 10
2 | 15
3 | 20
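The following runnable sketch reproduces the two tables above (Python's sqlite3 module; window functions require SQLite 3.25+, which ships with recent Python versions):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (value INTEGER);
    INSERT INTO scores VALUES (10), (10), (15), (20);
""")
rows = conn.execute("""
    SELECT value,
           RANK()       OVER (ORDER BY value) AS rnk,
           DENSE_RANK() OVER (ORDER BY value) AS dense_rnk
    FROM scores
""").fetchall()
for value, rnk, dense_rnk in rows:
    print(value, rnk, dense_rnk)
# Output: 10 1 1 / 10 1 1 / 15 3 2 / 20 4 3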
An index in SQL is a database object that speeds up data retrieval operations by
creating data structures (indexes) that provide fast access to data rows. Indexes are
like a table of contents in a book, allowing the database management system to
quickly locate the rows associated with specific column values.
Purpose:
● Indexes are created to accelerate lookups, joins, and sorting on frequently
queried columns, at the cost of additional storage and slightly slower writes
(INSERT, UPDATE, DELETE), since the index must be maintained.
How It Works:
● The database maintains a separate, ordered data structure (commonly a
B-tree) on the indexed column(s). When a query filters or joins on those
columns, the engine traverses the index to locate matching rows directly
instead of scanning the entire table.
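A brief runnable sketch (Python's sqlite3 module; names are illustrative) showing how an index is created and how the query planner can use it:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary INTEGER)")
conn.execute("CREATE INDEX idx_employees_salary ON employees (salary)")

# EXPLAIN QUERY PLAN shows whether the filter on salary can use the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM employees WHERE salary > 60000"
).fetchall()
print(plan)  # typically mentions idx_employees_salary; exact wording varies by SQLite version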
The "GROUP BY" clause in SQL is used to group rows from a result set into summary
rows based on the values in one or more columns. It is primarily used with
aggregate functions like SUM, COUNT, AVG, and MAX to perform calculations on groups
of data.
● Purpose:
● The "GROUP BY" clause is used to summarize and aggregate data by
grouping rows with common values in one or more columns. It allows
you to perform calculations on each group of data.
● Syntax:
● The "GROUP BY" clause is usually followed by one or more columns
that specify how to group the data. These columns can be the same as
those used in the "SELECT" statement or can be expressions.
● The result of a "GROUP BY" query includes one row for each group,
along with the results of aggregate functions applied to the columns
that are not part of the grouping.
● Aggregate Functions:
● In conjunction with "GROUP BY," you typically use aggregate functions
to calculate values for each group. Common aggregate functions
include SUM, COUNT, AVG, MAX, and MIN.
● For example, you can use "GROUP BY" to find the total sales per
product category by grouping sales records by category and using the
SUM function to calculate the total for each group.
● HAVING Clause:
● You can use the "HAVING" clause in combination with "GROUP BY" to
filter groups based on specific conditions that involve aggregate
functions. It's used for filtering grouped data, similar to the "WHERE"
clause for individual rows.
● Use Cases:
● "GROUP BY" is commonly used in reporting, data analysis, and business
intelligence. It helps you summarize data and gain insights from large
datasets.
● Examples of use cases include generating sales reports, counting the
number of orders per customer, and calculating average scores for
students in each subject.
In summary, the "GROUP BY" clause is a powerful SQL feature that allows you to
group and summarize data based on specific criteria, enabling you to perform
calculations on groups of data and generate meaningful reports and insights from
your database.
Window functions in SQL are a powerful set of functions that allow you to perform
calculations across a set of table rows that are related to the current row. These
functions provide a way to analyze and report on data in a more advanced manner
than traditional aggregate functions. Window functions are often used in analytical
and reporting scenarios where you need to calculate values over a specific window
or partition of data.
Key Concepts:
● ROW_NUMBER(): Assigns a unique integer value to each row in the result set
based on a specified order. This is useful for ranking rows.
● RANK(): Assigns a unique rank to each row based on a specified order, and it
allows for tied rankings.
● DENSE_RANK(): Similar to RANK but without gaps in ranking when tied values
exist.
● NTILE(n): Divides the result set into "n" roughly equal parts and assigns each
row to one of those parts.
● LEAD(): Provides the value of a specified column from the next row within the
window.
● LAG(): Provides the value of a specified column from the previous row within
the window.
● SUM(), AVG(), MIN(), MAX(): These are standard aggregate functions that
can be used as window functions with an OVER() clause to calculate values
over a window of rows.
● FIRST_VALUE(): Provides the value of a specified column from the first row
within the window.
● LAST_VALUE(): Provides the value of a specified column from the last row
within the window.
Use Cases:
● Ranking and Top-N Queries: You can use window functions like
ROW_NUMBER(), RANK(), and DENSE_RANK() to rank rows or retrieve the top
N rows based on specific criteria, such as sales performance or test scores.
● Partitioned Data Analysis: You can partition data into groups and apply
window functions separately within each group. For example, you can
calculate the rank of students within their respective classes.
● Lead and Lag Analysis: LEAD() and LAG() are used to analyze changes over
time. For example, you can calculate the difference between the current and
previous month's sales.
● Data Smoothing and Trend Analysis: Window functions can help smooth out
noisy data and identify trends or patterns in time series data.
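For instance, a runnable sketch of the lead/lag use case (month-over-month sales difference) using Python's sqlite3 module (SQLite 3.25+ for window functions; the sales figures are made up):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (month TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('2023-01', 100), ('2023-02', 140), ('2023-03', 120);
""")
rows = conn.execute("""
    SELECT month,
           amount,
           amount - LAG(amount) OVER (ORDER BY month) AS change_vs_prev_month
    FROM sales
""").fetchall()
print(rows)  # [('2023-01', 100, None), ('2023-02', 140, 40), ('2023-03', 120, -20)]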
PYTHON
Exception handling in Python is a crucial feature that allows you to gracefully manage
and recover from unexpected errors or exceptions that might occur during the execution
of a program. Python provides a structured way to handle exceptions using the try,
except, else, and finally blocks.
try:
    # Code that may raise an exception
    result = 10 / 0  # This will raise a ZeroDivisionError
except ZeroDivisionError:
    # Handle a specific type of exception
    print("Division by zero is not allowed.")
except (ValueError, TypeError):
    # Handle multiple types of exceptions
    print("Value or type error occurred.")
else:
    # This block is executed if no exceptions occur
    print("No exceptions were raised.")
finally:
    # This block is always executed
    print("Cleanup code or resource release.")
Lists and tuples are both data structures in Python, but they have some key differences:
Mutability:
● Lists: Lists are mutable, which means you can add, remove, or change
elements after the list is created. You can use methods like append(),
extend(), and pop() to modify a list.
● Tuples: Tuples are immutable, so once you create a tuple, you cannot
change its elements. This immutability can provide data integrity and
security in certain situations.
Syntax:
● Lists: Lists are defined using square brackets, e.g., my_list = [1, 2, 3].
● Tuples: Tuples are defined using parentheses, e.g., my_tuple = (1, 2, 3).
Performance:
● Lists: Lists are slightly slower than tuples because of their mutability.
When elements are added or removed from a list, it may need to allocate
new memory or shift elements, which can affect performance.
● Tuples: Tuples are faster and have slightly better performance, especially
in situations where data does not change. Their immutability allows for
optimizations.
Use Cases:
● Lists: Lists are typically used for collections of items that may need to be
modified. They are suitable for sequences of data where elements can
change, such as a to-do list or a list of names.
● Tuples: Tuples are often used for data that should not be changed, such
as coordinates, database records, or function return values. They provide
a sense of data integrity.
In Python, there is no native data type called "array" as you might find in some other
programming languages. Instead, you can use lists to achieve similar functionality.
Mutability:
● Lists: Lists are mutable, meaning you can add, remove, or modify
elements in a list after it's created.
● Arrays: Arrays (from the built-in array module or NumPy) are also
mutable, but they are constrained to a single element type, which makes
them less flexible than lists.
Element Types:
● Lists: Lists can hold elements of different types (numbers, strings,
objects) in the same collection.
● Arrays: Arrays store elements of one fixed type, which makes them more
compact in memory.
Performance:
● Lists: Lists are versatile but may not be as performant as arrays for
certain operations, especially when dealing with large amounts of
numerical data or scientific computing tasks.
● Arrays: NumPy arrays support fast, vectorized numerical operations and
use less memory, making them the usual choice for large numerical
datasets.
● Lists: Lists are native to Python, making them easy to work with for
general-purpose data structures and collections.
● Arrays: Arrays require importing the array module or installing a library
such as NumPy, but they provide specialized numerical functionality in
return.
def is_palindrome(s):
    s = s.lower()  # Convert the string to lowercase for case-insensitive comparison
    s = ''.join(c for c in s if c.isalnum())  # Remove non-alphanumeric characters
    return s == s[::-1]

# Example usage:
input_string = "A man, a plan, a canal, Panama"
result = is_palindrome(input_string)
if result:
    print("It's a palindrome!")
else:
    print("It's not a palindrome.")
This code first converts the input string to lowercase, removes non-alphanumeric
characters, and then checks if the modified string is the same when reversed.
21. Fibonacci Sequence Code:
The Fibonacci sequence is a series of numbers where each number is the sum
of the two preceding ones, usually starting with 0 and 1. Here's Python code to
generate the Fibonacci sequence up to a specified number of terms:
def generate_fibonacci(n):
    fibonacci_sequence = []
    a, b = 0, 1
    for _ in range(n):
        fibonacci_sequence.append(a)
        a, b = b, a + b
    return fibonacci_sequence
# Example usage:
num_terms = 10
fib_sequence = generate_fibonacci(num_terms)
print(f"Fibonacci sequence with {num_terms} terms: {fib_sequence}")
In this code, we initialize the first two Fibonacci numbers (0 and 1), and then use
a loop to generate the subsequent terms by summing the previous two. The
result is stored in the fibonacci_sequence list. You can change the num_terms
variable to generate a different number of Fibonacci terms.
23. How to Extract Data from an API and Libraries for API Data
Extraction?
To extract data from an API, you typically need to make HTTP requests to the
API's endpoints and parse the responses. Python provides several libraries that
facilitate API data extraction. Here's a step-by-step explanation of the process
and the libraries you can use:
import requests
import json

# Fetch a response from an API endpoint (the URL is illustrative) and parse the JSON body
response = requests.get("https://api.example.com/items")
data_dict = json.loads(response.text)  # equivalent to response.json()
For example, if the API response is a list of items, you can loop through them to
extract specific values:
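A minimal sketch, continuing from the snippet above and assuming the parsed response is a list of dictionaries with a hypothetical "name" field:
for item in data_dict:
    # Each item is a dictionary; pull out the fields you care about
    print(item["name"])  # "name" is a hypothetical field used for illustration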
Step by step, the overall process looks like this:
● Identify the API: Determine which API you want to extract data from. This
could be a public API, a third-party API, or an internal API.
● Build the request: Determine the API endpoint you want to access and
create an HTTP request. This typically involves specifying the HTTP method
(e.g., GET, POST, PUT), the API endpoint URL, and any query parameters or
request body data.
● Make the API request: Use the chosen HTTP client to send the request to
the API endpoint. Ensure that the request is properly formatted according to
the API documentation.
● Receive the response: Once the API receives your request, it will send back
a response. This response usually includes an HTTP status code and the
actual data you requested in a specific format (e.g., JSON, XML).
● Check the status code: Examine the HTTP status code in the API response.
A status code of 200 typically indicates a successful request, while other
codes may indicate errors or specific conditions.
● Parse the response: If the response contains data (e.g., in JSON or XML
format), parse the response to extract the information you need. Use a
relevant library or tool to perform the parsing.
● Extract the data: Once the data is parsed, extract the specific pieces of
information you're interested in using Python or any other programming
language.
● Post-process the data (optional): Some APIs implement pagination for
large data sets. Follow the API's pagination guidelines to retrieve all relevant
data in multiple requests.
● Store or use the data: Finally, store the extracted data in a data structure
(e.g., a list, dictionary, database) or use it as needed for your application,
analysis, or reporting.
This process demonstrates the basics of extracting data from an API using
Python. The specific library and method you use may vary depending on the
API's authentication method, the type of data (e.g., XML or CSV), and the
structure of the API. Be sure to refer to the API's documentation for details on
how to make requests and interpret responses.
25. What are the different status code series: 1xx, 2xx, 3xx,
4xx, and 5xx?
HTTP status codes are three-digit numbers that the server sends as part of
the response header to indicate the outcome of the client's request. These
status codes are grouped into five categories based on their first digit, which
provides information about the response class. Here are the different status
code categories:
1xx (Informational):
● Status codes in the 1xx range provide informational responses and don't
contain a response body. They typically inform the client that the
request has been received and the server is continuing to process it.
2xx (Successful):
● Status codes in the 2xx range indicate that the request was received,
understood, and successfully processed. They often represent success
or positive outcomes.
● Common 2xx status codes:
○ 200 OK: The request was successful, and the server is returning
the requested data.
○ 201 Created: A new resource has been successfully created as a
result of the request.
○ 204 No Content: The request was successful, but there is no
response body to return.
3xx (Redirection):
● Status codes in the 3xx range indicate that further action is required by
the client to complete the request. They often involve redirecting the
client to a different URL.
● Common 3xx status codes:
○ 301 Moved Permanently: The requested resource has been
permanently moved to a new URL, and the client should update its
references.
○ 302 Found (originally "Moved Temporarily"): The requested resource
has been temporarily moved to a different URL, and the client should
continue to use the original URL for future requests.
○ 303 See Other: The server is sending a redirect, and the client
should issue a GET request to the new URL.
○ 304 Not Modified: The client's cached version of the resource is
still valid, and the server returns this status to indicate that the
client should use its cached data.
4xx (Client Error):
● Status codes in the 4xx range indicate that the client has made an error
or an invalid request. These status codes often highlight issues with the
client's request.
● Common 4xx status codes:
○ 400 Bad Request: The server cannot understand the client's
request due to a malformed or invalid syntax.
○ 401 Unauthorized: Authentication is required, and the client's
credentials are either missing or invalid.
○ 403 Forbidden: The client does not have permission to access the
requested resource.
○ 404 Not Found: The requested resource could not be found on the
server.
5xx (Server Error):
● Status codes in the 5xx range indicate that the server has encountered
an error or is incapable of performing the request due to server-related
issues.
● Common 5xx status codes:
○ 500 Internal Server Error: A generic error message indicating that
an unexpected condition prevented the server from fulfilling the
request.
○ 502 Bad Gateway: The server, while acting as a gateway or proxy,
received an invalid response from an upstream server.
○ 503 Service Unavailable: The server is currently unable to handle
the request due to temporary overloading or maintenance.
○ 504 Gateway Timeout: The server, while acting as a gateway or
proxy, did not receive a timely response from an upstream server
or some other auxiliary server it needed to access to complete the
request.
These HTTP status codes provide important information about the outcome
of a client's request and help both clients and servers communicate
effectively during the HTTP transaction.
26. What libraries are used to scrape data from PDFs and
images?
To extract data from PDFs and images, you can use several Python libraries.
Here are some popular options for each:
For PDFs:
● PyPDF2: PyPDF2 is a Python library for working with PDF files. It allows
you to extract text, split, merge, and perform various operations on PDF
documents.
For Images:
● OpenCV: OpenCV (Open Source Computer Vision Library) is a versatile
library for image processing; it is often used to prepare images (e.g.,
denoising, thresholding, deskewing) before text extraction.
● pytesseract: pytesseract is a Python wrapper for the Tesseract OCR
(Optical Character Recognition) engine and is commonly used to extract
the text itself from images.
These libraries provide tools and functions to extract data from PDFs and
images, depending on your specific needs and the format of the data in the
documents.
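A brief sketch of both paths (assuming PyPDF2 3.x and pytesseract with the Tesseract engine installed; the file names are placeholders):
from PyPDF2 import PdfReader   # pip install pypdf2
from PIL import Image          # pip install pillow
import pytesseract             # pip install pytesseract (requires Tesseract to be installed)

# Extract text from each page of a PDF
reader = PdfReader("document.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Extract text from an image via OCR
image_text = pytesseract.image_to_string(Image.open("scan.png"))

print(pdf_text[:200])
print(image_text[:200])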
Pandas is a popular Python library that is primarily used for data manipulation
and analysis. It provides data structures and functions to work with structured
data, making it an essential tool for data scientists and analysts.
● Data Import and Export: Pandas can read data from various file formats,
such as CSV, Excel, SQL databases, and more. It can also export data to
these formats.
● Data Exploration: You can use Pandas to explore data, perform summary
statistics, and gain insights into the dataset's structure and characteristics.
● Data Filtering and Selection: You can filter and select data based on
conditions and criteria, allowing you to extract specific subsets of the data.
● Data Visualization: While not a visualization library itself, Pandas can
work seamlessly with libraries like Matplotlib and Seaborn to create data
visualizations and plots.
● Time Series Analysis: Pandas includes functionality for time series data,
making it suitable for tasks like stock market analysis and financial data
manipulation.
● Grouping and Aggregation: You can group data by specific criteria and
perform aggregate calculations on grouped data, such as calculating the
mean, sum, or count.
● Data Integration: Pandas can integrate data from different sources and
formats into a unified, structured dataset.
● Machine Learning: Data scientists often use Pandas to prepare data for
machine learning tasks. It helps in feature engineering and data
preprocessing.
Step 1: Import Pandas: You need to import the Pandas library to use its
functions.
import pandas as pd
Step 2: Prepare Your Data: Data for a DataFrame can be a list of dictionaries, a
list of lists, or other structured data. You'll want to have your data ready in a
suitable format.
Step 3: Create the DataFrame: Use the pd.DataFrame() constructor to create a
DataFrame. Pass your data as the argument.
df = pd.DataFrame(data)
Step 4: Customize the DataFrame (Optional): You can further customize the
DataFrame by specifying row and column labels, data types, and other options.
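Putting the steps together, a minimal sketch with made-up sample data:
import pandas as pd

# Step 2: data as a list of dictionaries (each dictionary becomes a row)
data = [
    {"name": "Alice", "age": 30, "city": "London"},
    {"name": "Bob", "age": 25, "city": "Paris"},
]

# Step 3: create the DataFrame
df = pd.DataFrame(data)

# Step 4 (optional): customize, e.g. use 'name' as the index
df = df.set_index("name")
print(df)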
The resulting df is a Pandas DataFrame containing your data. You can now
perform various operations on it, such as data analysis, filtering, grouping, and
visualization.
Pandas offers many more advanced techniques for data manipulation, including
reading data from external sources, merging DataFrames, and handling missing
values. Creating a DataFrame is just the beginning of what you can do with
Pandas for data analysis and manipulation.
Data Type Conversion:
● df.astype(): Convert the data type of a column.
Indexing:
● df.set_index(): Set a specific column as the index.
● df.reset_index(): Reset the index.
These are just some of the many functions available in Pandas for data
manipulation and analysis. The choice of function depends on the specific task
you want to perform with your data.
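A small illustrative sketch of a few of these functions (the sample data is made up):
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": ["30", "25"]})
df = df.astype({"age": int})   # convert the 'age' column from string to integer
df = df.set_index("name")      # use 'name' as the index
df = df.reset_index()          # move 'name' back to a regular column
print(df.dtypes)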
To convert a JSON object or file into a Pandas DataFrame, you can use the
pd.read_json() function (for JSON strings and file paths) or pd.DataFrame() /
pd.json_normalize() (for Python dictionaries). Here's how to do it:
From a JSON Object: You can convert a JSON string or a Python dictionary
directly into a DataFrame.
From a JSON File: You can read data from a JSON file and create a
DataFrame.
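A small sketch of both paths (the JSON content and file name are illustrative; recent pandas versions expect file-like objects for literal JSON strings, hence the StringIO wrapper):
import pandas as pd
from io import StringIO

# From a JSON string held in memory
json_str = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
df_from_string = pd.read_json(StringIO(json_str))

# From a Python dictionary / list of records
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df_from_records = pd.DataFrame(records)

# From a JSON file on disk
df_from_file = pd.read_json("data.json")

print(df_from_string)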
Pandas is a widely used Python library in data science, data analysis, and data
manipulation for several key reasons:
● Data Handling: Pandas provides data structures (DataFrames and Series)
that allow users to efficiently handle and work with structured data, such as
tables, spreadsheets, and time series data.
● Data Import/Export: It can read data from various file formats (e.g., CSV,
Excel, SQL databases) and export data in the same or other formats. This
makes it a versatile tool for data extraction and preparation.
● Data Aggregation: Users can aggregate data easily using functions like
groupby, allowing for the calculation of statistics (e.g., sum, mean, count)
across groups of data.
● Time Series Analysis: Pandas has specialized tools for time series data,
making it a valuable resource for financial analysis, sensor data, and more.
Pandas is a crucial tool for data professionals, from data analysts and scientists
to engineers and business analysts. It simplifies the data manipulation and
analysis process, making it more accessible and efficient.
32. What is an API Gateway?
An API Gateway is a server or service that acts as an entry point for clients (e.g.,
web applications, mobile apps, other services) to access multiple APIs
(Application Programming Interfaces) and microservices within a system. It
serves as a central point for managing, routing, and securing API requests.
Here's an explanation of key functions and components of an API Gateway:
Key Functions:
● Request Routing: API Gateways route client requests to the appropriate
backend services or APIs based on the request's path, headers, or other
criteria. This enables clients to interact with various services through a
single entry point.
● Security: API Gateways provide security features such as authentication,
authorization, and encryption to protect the APIs and the data they expose.
This includes validating API keys, implementing OAuth, or other
authentication mechanisms.
● Rate Limiting: API Gateways often enforce rate limits to prevent abuse or
excessive usage of APIs. They can restrict the number of requests a client
can make within a certain time frame.
● Logging and Monitoring: API Gateways log API requests and responses,
allowing for auditing and troubleshooting. They may also integrate with
monitoring tools to provide insights into API performance.
● Caching: Some API Gateways offer caching capabilities to store
responses from backend services. Cached responses can be quickly
served to clients, reducing the load on backend services and improving
response times.
● Transformation: API Gateways can transform requests and responses by
modifying data formats, headers, or payloads to ensure compatibility
between clients and backend services.
● Load Balancing: API Gateways may distribute incoming requests to
multiple instances of a service to balance the load and ensure high
availability and scalability.
● Error Handling: They handle errors gracefully, providing meaningful error
messages and responses to clients when issues occur.
Components:
● API Endpoints: These are the URLs through which clients access the API
Gateway. Each endpoint corresponds to a specific API or service.
● Proxy Server: The API Gateway acts as a proxy server, forwarding
requests from clients to the appropriate backend services.
● Security Features: This includes authentication and authorization
mechanisms, as well as encryption to secure the data in transit.
● Traffic Management: API Gateways manage the flow of traffic and
enforce rate limiting and throttling.
● Routing Rules: Configuration to determine how incoming requests are
mapped to backend services.
● Cache: Caching mechanisms for storing responses and reducing the load
on backend services.
● Monitoring and Analytics: Tools for logging, monitoring, and analytics to
track API usage and performance.
List comprehensions are efficient and make your code more readable when you
need to perform simple transformations and filtering operations on iterable data.
They can also be used with other iterable types, such as strings, to process and
generate new sequences.
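For example, a simple sketch of a transformation plus a filter in one comprehension:
numbers = [1, 2, 3, 4, 5, 6]

# Square only the even numbers
even_squares = [n ** 2 for n in numbers if n % 2 == 0]
print(even_squares)  # [4, 16, 36]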
Lambda functions are typically used when you need a small function for a short
period, especially as an argument to higher-order functions like map(), filter(), or
sorted(). Here's an example of a lambda function that squares a number:
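A minimal sketch (the variable name square is just for illustration):
square = lambda x: x ** 2
print(square(5))  # 25

# Typical use with a higher-order function:
print(list(map(lambda x: x ** 2, [1, 2, 3])))  # [1, 4, 9]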
Lambda functions are often used for simple operations where it's more
convenient and concise to define the function directly at the point of use, rather
than writing a separate named function.
35. Different File Types: Avro, Parquet, JSON, XML, CSV, and Building
DataFrames
Different file formats are used for various purposes, and each has its own
characteristics. Here are some common file types and how to build a Pandas
DataFrame from them.
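As a brief sketch of building DataFrames from several of these formats (the file names are placeholders; Parquet requires pyarrow or fastparquet, XML requires lxml, and Avro is usually read with a separate library such as fastavro):
import pandas as pd

df_csv = pd.read_csv("data.csv")              # CSV: comma-separated, row-oriented text
df_json = pd.read_json("data.json")           # JSON: nested text format
df_parquet = pd.read_parquet("data.parquet")  # Parquet: compressed, columnar binary format
df_xml = pd.read_xml("data.xml")              # XML: tag-based text format (pandas >= 1.3)
# Avro (row-oriented binary format) has no built-in pandas reader; a common approach:
# import fastavro; records = list(fastavro.reader(open("data.avro", "rb"))); pd.DataFrame(records)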
Python also provides a more concise way to count occurrences using the
str.count() method, which counts non-overlapping occurrences of a substring in
the given string:
text = "the quick brown fox jumps over the lazy dog"  # sample text (illustrative)
target_substring = "the"  # substring to count (illustrative)
occurrences = text.count(target_substring)
print(f"The substring '{target_substring}' appears {occurrences} times in the text.")
This method is efficient and easy to use when counting the occurrences of a
specific substring within a string.
AIRFLOW
Apache Airflow is an open-source platform for programmatically authoring,
scheduling, and monitoring workflows, which are defined as Python code.
● Web Interface: Airflow comes with a web-based user interface that allows
users to monitor and manage their workflows, view logs, and trigger tasks
manually.
The architecture of Apache Airflow consists of several key components that work
together to execute and manage workflows.
● Scheduler: The scheduler monitors DAGs and their schedules, decides
which tasks need to run, and hands task instances to the executor.
● Web Server: The web server hosts the Airflow user interface used to
monitor, trigger, and troubleshoot DAGs and tasks.
● Metadata Database: The metadata database stores the state of DAGs, task
instances, connections, variables, and other Airflow metadata.
● Executor: The executor determines how and where task instances are run
(e.g., locally, on Celery workers, or on Kubernetes).
● Worker: Workers are nodes in the Airflow cluster responsible for running
task instances. Workers communicate with the metadata database to fetch
task details and execute them according to the instructions provided by the
scheduler.
Apache Airflow consists of several core components that work together to create,
schedule, and execute workflows.
● Hooks: Hooks are a type of interface for interacting with external systems
or databases. They provide a consistent way to connect to various external
resources from within tasks.
● Acyclic: DAGs are acyclic, which means they do not contain cycles or
loops. Tasks are connected in a way that prevents circular dependencies.
● Error Handling: DAGs allow you to specify how to handle errors and
exceptions during task execution, including options like retries, failure
tolerance, and alerting.
● Configurable Schedule: You can configure when and how often a DAG
should be scheduled for execution. Airflow's scheduler uses the DAG
definition and schedule to decide when to run tasks.
Apache Airflow's use of DAGs makes it a powerful tool for orchestrating and
automating a wide range of data-related workflows, from ETL processes and
data analysis to report generation and beyond.
● Define Tasks: Define individual tasks within the DAG. You can use a
variety of operators to represent different types of tasks, such as Python
functions, SQL queries, or file operations.
● Configure the Schedule: Configure when and how often the DAG should
be scheduled for execution by setting the schedule_interval parameter
when creating the DAG object. This determines the frequency and timing
of task runs.
● Define Error Handling: Configure error handling for tasks, including
options like retries, timeouts, and the behavior on failure.
● Save the DAG to a Python File: Save the complete DAG script to a
Python file in the Airflow DAG folder, which is monitored by the Airflow
scheduler.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

# Instantiate the DAG (the dag_id, start date, and schedule are illustrative)
dag = DAG("example_dag", start_date=datetime(2023, 1, 1), schedule_interval="@daily")

def task_1():
    print("Task 1 executed")

def task_2():
    print("Task 2 executed")

task1 = PythonOperator(
    task_id='task_1',
    python_callable=task_1,
    dag=dag,
)
task2 = PythonOperator(
    task_id='task_2',
    python_callable=task_2,
    dag=dag,
)
task1 >> task2  # task_1 runs before task_2
In Apache Airflow, operators are components used to define how a task should
execute within a DAG. Airflow provides a wide range of built-in operators for
common use cases. Some of the common types of operators in Airflow include:
● PythonOperator: Executes a Python callable as a task.
● BashOperator: Executes a bash command or shell script.
● EmailOperator: Sends an email as part of a workflow.
● Sensors (e.g., FileSensor, HttpSensor, SqlSensor): Wait for a condition to
be met before allowing downstream tasks to run.
These operators provide the flexibility to perform various types of tasks within a
workflow. Users can choose the appropriate operator based on the task's
requirements and functionality. Custom operators can also be created to meet
specific use cases that are not covered by the built-in operators.
43. What are the types of functions or use cases on each of
the operators? E.g. python operator, sql operator, bash
operator.
Each type of operator in Apache Airflow is designed for specific use cases or
functions, depending on the task or operation it needs to perform within a
workflow. Here's an explanation of the common operator types and their typical
use cases:
1. PythonOperator:
● Use Case: Execute Python Code
● Function: Runs a Python function as a task. This is useful for custom data
processing, calculations, or other Python-based operations within a
workflow.
2. BashOperator:
● Use Case: Run Shell Commands or Scripts
● Function: Executes shell commands or scripts. It's suitable for tasks that
involve running command-line utilities, scripts, or shell commands.
3. DockerOperator:
● Use Case: Run Docker Containers
● Function: Executes a task within a Docker container. Useful for running
tasks that require a specific environment, isolated from the host system.
4. SqlSensor:
● Use Case: Wait for Database Conditions
● Function: Monitors a database and waits until a specified condition is met.
It's used when you need to ensure that specific data is available in a
database before proceeding with a task.
5. HttpSensor:
● Use Case: Wait for Web Responses
● Function: Monitors a web endpoint and waits for a specific HTTP response.
Useful when tasks depend on the availability or status of a web service.
6. DummyOperator:
● Use Case: No Operation or Placeholder
● Function: Acts as a placeholder task with no actual operation. It can be
used for structuring DAGs or creating conditional branches.
7. BranchPythonOperator:
● Use Case: Conditionally Choose Next Task
● Function: Executes a Python callable to determine which task to run next,
enabling conditional execution paths in a workflow.
8. EmailOperator:
● Use Case: Send Email Notifications
● Function: Sends email notifications as part of the workflow. Useful for
alerting stakeholders or sending reports.
9. FileSensor:
● Use Case: Wait for File or Directory Changes
● Function: Monitors a file or directory and waits for its presence or specific
changes. Ideal for tasks dependent on file availability or changes.
10. JdbcOperator:
● Use Case: Execute SQL Against JDBC Databases
● Function: Executes SQL commands against JDBC databases. Useful for
ETL processes involving database interactions.
11. SparkSubmitOperator:
● Use Case: Submit Spark Jobs
● Function: Submits Apache Spark jobs to a Spark cluster. It's used in
workflows involving big data processing.
12. SubDagOperator:
● Use Case: Embed Sub-DAGs
● Function: Allows embedding sub-DAGs within a main DAG, facilitating
modular and reusable workflow design.
13. HttpToHttpOperator:
● Use Case: Transfer Data Between HTTP Endpoints
● Function: Transfers data between HTTP endpoints, making HTTP requests
and processing responses within a workflow.
These are common operator types in Apache Airflow, each tailored for specific
tasks and use cases. You can choose the appropriate operator based on the type
of operation you need to perform within your workflow. Custom operators can
also be created to address specific use cases not covered by the built-in
operators.
AWS
Amazon Web Services (AWS) offers a wide range of services and components
that can be leveraged in data engineering tasks. Here are some of the key services:
● Amazon EC2 (Elastic Compute Cloud): EC2 instances can be used for
data processing tasks, running ETL jobs, and hosting databases. You can
choose the instance type that best suits your computational needs.
● AWS Glue DataBrew: Glue DataBrew is a visual data preparation tool that
helps clean, transform, and combine data for analytics and machine
learning.
● Amazon SageMaker: SageMaker is a machine learning platform that can
be used for building, training, and deploying machine learning models on
AWS.
● Elastic IP: Elastic IP addresses are static public IP addresses that can be
associated with EC2 instances, providing a consistent public-facing
address for your instances.
● Storage Options: EC2 instances can be attached to various types of
storage, including EBS (Elastic Block Store) for persistent block storage
and instance store volumes for temporary storage.
● Auto Scaling: EC2 instances can be part of an Auto Scaling group, which
automatically adjusts the number of instances based on traffic and
application demand.
Amazon EC2 is a fundamental service for running applications in the AWS cloud,
offering flexibility, scalability, and a wide range of options to meet different
computational needs. It is widely used in various use cases, from hosting web
applications to running data engineering and data science workloads.
Amazon S3, or Simple Storage Service, is a scalable and highly available object
storage service provided by Amazon Web Services (AWS). It is designed to store
and retrieve any amount of data from anywhere on the web. S3 is known for its
durability, scalability, and ease of use, making it a fundamental component in
many cloud-based applications and data storage solutions.
● Scalable: S3 is designed to scale with your storage needs. You can store
vast amounts of data in S3, and it can automatically scale to accommodate
your requirements.
● Secure: S3 offers a range of security features, including access control
lists (ACLs), bucket policies, and fine-grained access control. Data can be
encrypted at rest and during transit.
Amazon Redshift is widely used for data warehousing and business intelligence
applications. Its architecture, along with features like data compression and
parallel processing, allows for fast query performance on large datasets, making
it suitable for analytical workloads.
● Standard SQL: Athena uses standard SQL for querying data, making it
accessible to users familiar with SQL.
● Data in S3: It can query data directly from Amazon S3, without the need to
load the data into a separate data warehouse.
AWS Athena is often used for on-demand or exploratory data analysis, log
analysis, and querying data lakes in scenarios where data is stored in Amazon
S3 and doesn't require the up-front structure of a data warehouse.
Amazon Athena and Amazon Redshift are both data analysis services in the
AWS ecosystem, but they have distinct differences in terms of their use cases,
architecture, and features:
● Use Cases:
○ Athena: Best for ad-hoc, exploratory querying of data stored in
Amazon S3. Ideal for data lakes, log analysis, and scenarios where
data doesn't require ETL processes.
○ Redshift: Designed for data warehousing and business intelligence
workloads. Suitable for structured, high-performance analytics on
large datasets.
● Storage:
○ Athena: Queries data directly from Amazon S3. Data remains in S3
without the need to be loaded into a separate data store.
○ Redshift: Requires data to be loaded into the Redshift data
warehouse, which has its own storage format.
● Query Language:
○ Athena: Uses standard SQL for querying data stored in S3.
○ Redshift: Also uses SQL but requires data to be loaded into
Redshift tables.
● Schema:
○ Athena: Supports schema-on-read, meaning you define the schema
as you query the data.
○ Redshift: Requires schema-on-write, where data is loaded into
predefined tables with a fixed schema.
● Performance:
○ Athena: Provides good performance for ad-hoc queries but may not
be as performant as Redshift for complex, high-concurrency
workloads.
○ Redshift: Optimized for complex analytics and high-concurrency
queries, offering excellent performance for structured data.
● Scalability:
○ Athena: Automatically scales to handle query concurrency and data
volume, but may have query latency.
○ Redshift: Scales by resizing the cluster (adding nodes or changing
node types) to handle increased workloads.
● Cost Model:
○ Athena: Pay-as-you-go model based on the amount of data
scanned by queries.
○ Redshift: Based on the cluster size and usage, with reserved
instance options for cost savings.
In summary, the choice between Athena and Redshift depends on your specific
data analysis requirements. Athena is more flexible and cost-effective for
exploring data stored in S3, while Redshift is designed for structured data
warehousing and high-performance analytics. Some organizations use both
services to optimize their data analysis workflows.
Security in AWS is a shared responsibility, with AWS responsible for the security
of the cloud infrastructure, and customers responsible for the security of their
data, applications, and configurations. By implementing security measures and
following best practices, organizations can build secure and compliant
environments in AWS.
52. What Are Lambda Triggers?
Lambda triggers are events or conditions that invoke AWS Lambda functions to
execute specific actions in response to the trigger. Lambda is a serverless
compute service that allows you to run code in response to various events and
automatically scales based on the workload. Lambda triggers play a crucial role
in event-driven serverless computing.
Here's an in-depth explanation of Lambda triggers:
● Event Sources:
○ Lambda triggers are often associated with event sources, which are
AWS services or external systems that generate events. These
events can include changes in resources, file uploads, database
updates, or custom events.
● Event-Driven Architecture:
○ Lambda functions are designed to work in an event-driven
architecture. They are idle until triggered by an event, which allows
you to respond to events in near real-time without the need to
manage server infrastructure.
● Built-in Triggers:
○ AWS Lambda supports built-in triggers for various AWS services.
For example, S3 bucket events (object created, deleted),
DynamoDB stream events, API Gateway HTTP requests, and SNS
notifications can trigger Lambda functions.
● Custom Triggers:
○ You can also create custom triggers by configuring other AWS
services to call a Lambda function in response to specific conditions.
● Data Processing:
○ Lambda triggers are commonly used for data processing tasks, such
as transforming data, generating thumbnails from uploaded images,
or analyzing logs.
● Workflow Automation:
○ Lambda functions can be used to automate workflows. For example,
they can be triggered by changes in an S3 bucket and process files
automatically.
● IoT and Real-time Data:
○ In IoT applications, Lambda functions can be triggered by sensor
data, allowing for real-time processing and actions like device
control.
● API Endpoints:
○ Lambda can be triggered by API Gateway requests, making it
suitable for building RESTful APIs and serverless applications.
● Asynchronous and Synchronous Triggers:
○ Lambda supports both synchronous and asynchronous triggers.
Synchronous triggers respond immediately to an event, while
asynchronous triggers queue events for processing later.
● Scaling and Parallelism:
○ Lambda automatically scales and parallelizes the execution of
functions in response to high numbers of incoming events, ensuring
low latency and high throughput.
● Error Handling:
○ Lambda provides error handling capabilities, including retries and
dead-letter queues for managing failed function invocations.
Lambda triggers enable you to build highly responsive and efficient serverless
applications by executing code in response to various events. They are essential
for creating event-driven architectures and automating tasks in the cloud.
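As an illustration, a minimal sketch of a Python Lambda handler for an S3 "object created" trigger (the processing logic is a placeholder; the event structure shown is the standard S3 notification format):
def lambda_handler(event, context):
    # Each S3 event can contain one or more records
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Placeholder for real processing (e.g., validate, transform, or catalog the object)
        print(f"New object uploaded: s3://{bucket}/{key}")
    return {"status": "ok"}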
● Data Security: EMR provides data encryption at rest and in transit. It also
supports fine-grained access control using AWS Identity and Access
Management (IAM) and Amazon EMRFS authorization.
● Spot Instances: You can reduce costs by using Amazon EC2 Spot
Instances in EMR clusters, which are cost-effective but may be preempted
with short notice.
● Custom Tasks: You can include custom tasks or code in your state
machine using AWS Lambda functions or AWS Fargate tasks.
● Error Handling: The service handles errors and retries based on your
defined error conditions, improving reliability and resilience of workflows.
● Wait States: Wait states allow you to pause a workflow for a specified
duration, or until a specific time or event occurs.
56. What is AWS Glue, and How Can We Use It?
AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service
provided by Amazon Web Services. It is designed to help users prepare and
transform data for analytics and data processing tasks. AWS Glue simplifies the
ETL process by providing tools for data cataloging, schema inference, data
transformation, and data loading. Here's how AWS Glue can be used and its
benefits:
● ETL Jobs: AWS Glue enables you to create ETL jobs using a visual
interface or by writing code in Python or Scala. ETL jobs can transform and
clean data, making it suitable for analysis or reporting.
● Data Integration: AWS Glue can integrate with a wide range of data
sources and destinations, including Amazon Redshift, Amazon RDS, S3,
and various data warehouses and data lakes.
● Job Scheduling: You can schedule ETL jobs to run at specific times or in
response to events, ensuring data is always up to date.
AWS Glue is a valuable tool for data engineers, data analysts, and data scientists
who need to prepare and transform data for analysis. It streamlines the ETL
process, reduces operational overhead, and enhances the quality and
accessibility of data assets.
57. What is ETL, and How Does AWS Glue Help in ETL?
ETL stands for Extract, Transform, Load, which is a common data integration
process used to move data from source systems to a destination system while
transforming and reshaping the data along the way.
Here's an explanation of each phase of ETL and how AWS Glue helps in
this process:
● Extract: In the extraction phase, data is collected from various source
systems, which can include databases, applications, log files, and more.
This data is typically stored in a raw or unstructured form.
● Transform: In the transformation phase, the extracted data is cleaned,
validated, and reshaped (e.g., filtered, joined, aggregated, or converted to
new formats) so that it matches the structure and quality required by the
target system.
● Load: Once the data is extracted and transformed, it is loaded into a target
system, such as a data warehouse, data lake, or database, where it can be
used for reporting, analysis, or other purposes.
AWS Glue is a service that simplifies and automates the ETL process in the
following ways:
● Data Catalog: Glue provides a data catalog that automatically discovers
and catalogs metadata about your source data. This catalog makes it easy
to understand and access your data assets.
● Schema Inference: Glue can automatically infer the schema of your data,
reducing the need for manual schema definition. This is especially useful
for semi-structured or unstructured data.
● ETL Job Creation: Glue allows you to create ETL jobs using a visual
interface or by writing code in Python or Scala. It provides a flexible
environment for designing and running ETL workflows.
AWS Glue is particularly well-suited for managing data in data lakes and data
warehousing scenarios, where data often arrives in a raw or semi-structured form
and needs to be prepared for analysis. It simplifies the ETL process, automates
much of the data preparation, and provides a flexible environment for data
transformation and integration.
● Data Integration: EMR can integrate with a wide range of data sources,
including Amazon S3, Amazon RDS, Amazon Redshift, and on-premises
data sources. This flexibility allows you to access and analyze data from
various locations.
● Integration with Other AWS Services: EMR integrates with other AWS
services like AWS Glue, AWS Lambda, and Amazon QuickSight, allowing
you to build end-to-end data pipelines and analytics solutions.
Amazon EMR is used in a wide range of industries and use cases, from data
warehousing and log analysis to machine learning and real-time data processing.
Its flexibility and scalability make it a valuable tool for organizations working with
big data.
59. Explain how AWS Lambda functions can be integrated into
data engineering workflows. Provide some use cases.
Here's how AWS Lambda functions can be used in data engineering, along with
some use cases:
1. Data Transformation:
● Use Lambda functions to transform data on-the-fly. For example, you can
convert data formats, filter and aggregate data, or perform enrichment
tasks.
3. Data Ingestion:
6. Data Enrichment:
● Lambda functions can extract metadata from files and catalog it in a central
repository, such as an AWS Glue Data Catalog.
10. Data Routing: - Lambda functions can route data to different destinations
based on conditions or business rules, enabling dynamic data flow control.
Use Cases:
● Log Analysis: Lambda functions can parse and analyze log files
generated by applications, servers, and IoT devices.
● Real-time Recommendation Engines: Lambda functions can calculate
and serve real-time recommendations to users based on their actions.
● Data Lake Orchestration: Lambda functions can automate data ingestion
and cataloging tasks in data lakes.
● Data Enrichment for Customer 360: Lambda functions can enhance
customer profiles with external data, creating a comprehensive view of
customers.
● Event-Driven ETL: Lambda functions can process data in real-time as it
arrives, allowing for event-driven ETL pipelines.
AWS Step Functions offer several advantages when used in data processing
pipelines:
● Orchestration: Step Functions provide a way to orchestrate complex
workflows and dependencies between multiple AWS services, making it
easy to build, manage, and monitor data processing pipelines.
AWS Step Functions are especially beneficial for orchestrating complex data
processing pipelines in a serverless, scalable, and reliable manner. They simplify
workflow management, improve the overall efficiency of data processing tasks,
and provide clear visibility into the progress of your pipelines.
61. What is Amazon Athena, and How Does It Work with Data
Stored in Amazon S3?
● Data Catalog: Athena integrates with the AWS Glue Data Catalog,
allowing it to discover and catalog metadata about data stored in S3. This
makes it easier to explore and query datasets.
How It Works:
● Data Ingestion: Data is ingested into Amazon S3, where it is stored in one
or more buckets.
● Cataloging: Data stored in S3 can be cataloged using the AWS Glue Data
Catalog, which automatically detects and records metadata about the data,
such as the schema.
● Querying: Users can connect to Athena using a web console, SQL clients,
or business intelligence tools. They write SQL queries to analyze the data.
Benefits:
● No Data Loading: Athena eliminates the need to load data into a separate
database or data warehouse, as it directly queries data in S3. This
simplifies data processing workflows.
● Cost-Efficiency: You pay only for the queries you run, which makes it a
cost-effective solution for ad-hoc querying and analysis.
● Flexible Data Model: It supports a flexible data model with support for
documents, key-value pairs, and more. This flexibility accommodates
various types of data in real-time applications.
63. How can you ensure data security and compliance when
working with sensitive data in AWS data engineering projects?
Here are key practices and services to help you achieve this:
● Identity and Access Management (IAM):
● Encryption:
● Data Classification:
● Data Auditing:
● Data Governance:
● Access Controls:
● Network Security:
● Compliance Standards:
● Monitoring and Logging:
○ Use AWS CloudWatch and CloudTrail to monitor and set alarms for
security-related events. Implement centralized logging and analysis
of security logs.
● Secure Configuration:
○ Ensure that all AWS resources and services are configured securely,
following best practices and recommendations provided by AWS.
● Vendor Security:
○ If you use third-party tools or services, ensure they comply with your
security and compliance requirements.
● Users: IAM allows you to create individual user accounts for people who
need access to AWS resources. Each user has their own credentials
(username and password or access keys) and permissions.
● Policies: IAM policies are JSON documents that define the permissions
and access controls for users and groups. Policies specify which actions
are allowed or denied on which resources.
● IAM Roles for AWS Services: AWS services like EC2 instances, Lambda
functions, and Redshift clusters assume IAM roles to access AWS
resources securely.
PySpark is the Python library for Apache Spark, an open-source, distributed, and
high-performance data processing framework. Spark is designed for big data
processing and analytics, and PySpark allows Python developers to interact with
Spark, leveraging its capabilities.
● Data Source Agnostic: RDDs can be created from various data sources,
including HDFS, local file systems, and external data storage systems.
RDDs serve as the core data structure in Spark, and many higher-level
abstractions in Spark, such as DataFrames and Datasets, are built on top of
RDDs. RDDs are essential for distributed data processing and enable the
fault-tolerant and parallel processing capabilities that make Spark a powerful tool
for big data analytics and processing.
67. What are the different types of Slowly Changing Dimensions
(SCDs) and what are the features of each?
● Type 1 - Overwrite:
○ Use Case: Suitable when historical values are not important, and
you only need the latest information.
● Type 2 - Add New Row:
○ Use Case: Used when you need to track historical changes while
still having access to the latest dimension values.
● Type 3 - Add New Attribute:
○ Use Case: Used when you want to maintain historical data efficiently
and minimize the impact on the current schema.
● Hybrid (Mixed) SCD:
○ Use Case: Used when different attributes within the same dimension
require different SCD handling.
68. What are DataFrames in PySpark and how do we create them?
Creating from an RDD of Row objects:
For example:
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName("example").getOrCreate()
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
schema = ["name", "age"]
df = spark.createDataFrame(rdd, schema)
Creating from an external data source (for example, a JSON file):
For example:
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.json("data.json")
Creating from a list of Row objects:
For example:
spark = SparkSession.builder.appName("example").getOrCreate()
data = [Row(name="Alice", age=30), Row(name="Bob", age=25)]
schema = ["name", "age"]
df = spark.createDataFrame(data, schema)
Creating from an RDD of Tuples:
● You can create a DataFrame from an RDD of tuples, specifying the
column names.
For example:
spark = SparkSession.builder.appName("example").getOrCreate()
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
schema = ["name", "age"]
df = spark.createDataFrame(rdd, schema)
Once a DataFrame is created, you can apply transformations such as select:
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.json("data.json")
selected_df = df.select("name", "age")
DataFrames provide a structured and efficient way to work with data in PySpark,
and they offer numerous transformations and actions for data processing and
analysis. PySpark DataFrames are particularly well-suited for big data processing
tasks, as they can leverage distributed computing resources for high
performance.
You can adjust the read API and specify various options for reading CSV files,
such as custom delimiter, header inclusion, and more. Refer to the PySpark
documentation for additional details on reading and configuring CSV data
sources.
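For example, a minimal sketch (the file path and the option values are illustrative assumptions):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True, sep=",")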
In PySpark, you can perform joins and filter data using DataFrame operations.
Here's an explanation of how to do both:
Joins in PySpark:
PySpark supports various types of joins, including inner joins, outer joins (left,
right, and full outer joins), and cross joins. To perform a join, you typically use the
join method on two DataFrames.
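For example, a minimal sketch (the contents of df1 and df2 are illustrative assumptions; the join key is the "id" column described below):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "Engineering")], ["id", "dept"])
result_df = df1.join(df2, on="id", how="inner")
result_df.show()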
In this example, result_df contains the result of the inner join between df1 and df2
based on the "id" column.
Filtering Data in PySpark:
To filter rows, you use the filter (or its alias where) method with a condition. Here's an example:
# Create a DataFrame
df = spark.createDataFrame([(1, "Alice", 30), (2, "Bob", 25), (3, "Charlie", 35)], ["id", "name", "age"])
# Keep only rows where age is greater than 30
filtered_df = df.filter(df.age > 30)
In this example, filtered_df contains rows where the "age" column is greater than
30. You can use various conditions in the filter or where methods to filter data
based on your criteria.
These are basic examples of performing joins and filtering data in PySpark.
PySpark provides a wide range of DataFrame operations for more complex join
and filter scenarios, as well as other data manipulation and analysis tasks.
71. Explain the concept of lazy evaluation in PySpark. Why is
it important in Spark computations?
In PySpark, transformations (such as filter, select, or map) are not executed immediately; they are recorded in an execution plan, and computation is deferred until an action (such as show, count, or collect) is called. This is known as lazy evaluation.
Benefits:
● Optimization: Lazy evaluation allows PySpark to optimize the execution
plan. It can reorder and combine operations to minimize the amount of
data shuffling and reduce the computational load.
● Reduced Data Movement: Spark can optimize data locality, reducing the
need to move data between nodes in the cluster, which is particularly
important for performance.
● Query Optimization: For Spark SQL queries, lazy evaluation allows for
query optimization and predicate pushdown, further improving query
performance.
Example:
spark = SparkSession.builder.appName("example").getOrCreate()
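# Minimal sketch; the CSV path and the "age" and "name" columns are illustrative assumptions.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
filtered_df = df.filter(df.age > 30)        # transformation: recorded, not executed
selected_df = filtered_df.select("name")    # transformation: still not executed
selected_df.show()                          # action: triggers execution of the whole plan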
In the code above, transformations like filter and select are lazily evaluated until
the show action is called. Lazy evaluation allows Spark to optimize the execution
plan and minimize data processing.
72. What is a Spark job, and what happens when you submit a Spark
job in PySpark?
In PySpark, a Spark job refers to a set of Spark tasks and operations that are
executed together as a unit of work. When you submit a Spark job, you are
instructing the Spark cluster to perform a specific computation or analysis on
your data.
● Task Scheduling: Spark divides the logical plan into smaller units of work
called tasks. These tasks are scheduled to run on the worker nodes of the
Spark cluster. The level of parallelism depends on the available resources
and the cluster configuration.
● Data Processing: Each task processes its portion of the data, applying
transformations and operations as specified in the logical plan.
● Output and Result: The results of each task are collected and
aggregated. The final output or result of the job is typically displayed,
saved to storage, or used for further analysis.
● Job Completion: Once all tasks have been completed successfully, the
job is considered complete, and the resources are released. If any tasks
fail, Spark can automatically recompute lost data partitions to ensure fault
tolerance.
● Lookup Tables: When you have a small lookup table, such as a dictionary
or mapping, you can broadcast it to worker nodes to avoid repeated data
transfer.
spark = SparkSession.builder.appName("example").getOrCreate()
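# Minimal sketch; lookup_df, another_df, and their contents are illustrative assumptions.
from pyspark.sql.functions import broadcast
lookup_df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "code"])
another_df = spark.createDataFrame([(1, 100), (2, 200), (1, 300)], ["id", "value"])
joined_df = another_df.join(broadcast(lookup_df), on="id", how="inner")
joined_df.show()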
In this example, the lookup_df DataFrame is broadcast, and its values are used
in a join operation with another_df. This reduces data transfer and improves the
efficiency of the join.
PySpark offers several advantages over traditional Hadoop MapReduce for data
processing tasks:
● Ease of Use: PySpark is a more developer-friendly framework compared
to MapReduce. It allows you to write data processing tasks in Python,
which is known for its simplicity and readability.
● Unified API: Spark offers a unified API for batch processing, interactive
queries, machine learning, and streaming data, making it easier to work
with various data processing use cases within a single framework.
● Integration with Big Data Tools: Spark can seamlessly integrate with
other big data tools and storage systems, such as HDFS, HBase, and data
lakes, making it a versatile choice for big data processing.
PySpark also provides convenient DataFrame methods for common data-cleaning tasks, such as handling null values and removing columns. For example, to drop a column that is no longer needed:
df = df.drop("column_name")
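Another common task is filling null values in a numerical column with that column's mean. A minimal sketch (the DataFrame df and the "age" column are illustrative assumptions):
from pyspark.sql.functions import mean
mean_value = df.select(mean("age")).collect()[0][0]  # compute the column mean
df = df.fillna({"age": mean_value})                  # replace nulls in "age" with the mean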
Here are the key differences between them, along with examples:
Narrow Transformation:
● A narrow transformation does not require data shuffling or data exchange
between partitions. It operates on a single partition independently, and
each output partition depends on a single input partition.
● Narrow transformations are executed in a pipelined fashion, where each
partition processes its data and produces the final result.
● Examples of narrow transformations include map, filter, and union. By
contrast, wide transformations such as groupByKey and reduceByKey
require shuffling data across partitions (a short sketch follows this list).
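For illustration, a minimal sketch (the data is an illustrative assumption): map and filter operate within partitions, while reduceByKey forces a shuffle:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
mapped = rdd.map(lambda kv: (kv[0], kv[1] * 10))    # narrow: no shuffle
filtered = mapped.filter(lambda kv: kv[1] > 5)      # narrow: no shuffle
reduced = filtered.reduceByKey(lambda x, y: x + y)  # wide: requires a shuffle
print(reduced.collect())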
Spark Context:
● The SparkContext is the entry point for any Spark functionality in a Spark
application. It was the first and primary entry point in earlier versions of
Spark.
● It is responsible for coordinating and managing the resources of a Spark
application. It sets up various configurations, connects to the cluster
manager (e.g., YARN or standalone cluster manager), and controls the
execution of tasks.
● The SparkContext is generally used in Spark applications with the RDD
API, which is the original core data structure in Spark.
● In modern Spark applications, especially those using DataFrames or
Datasets, the SparkContext is often encapsulated within the SparkSession.
Spark Session:
● The SparkSession was introduced in Spark 2.0 and serves as the entry point
for Spark functionality in newer versions of Spark.
● It encapsulates the functionalities of both the original SparkContext and
SQLContext. This means it provides access to the Spark Core, SQL, and
Hive functionality in a unified interface.
● The SparkSession is designed to work seamlessly with DataFrames and
Datasets, which are higher-level abstractions for structured and
semi-structured data.
● It simplifies application development and provides a more user-friendly and
unified interface for working with structured data.
In summary, the primary difference is that the SparkContext is used for older
RDD-based applications, while the SparkSession is the recommended entry
point for modern Spark applications, especially those using DataFrames and
Datasets. The SparkSession simplifies development, provides better integration
with SQL and structured data, and offers a more user-friendly interface for
working with Spark.
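For illustration, a minimal sketch showing how a modern application creates a SparkSession and still reaches the underlying SparkContext when the RDD API is needed (the data is an illustrative assumption):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext                                    # the encapsulated SparkContext (RDD API)
rdd = sc.parallelize([1, 2, 3])
df = spark.createDataFrame([(1, "a")], ["id", "value"])    # DataFrame API via the SparkSession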
spark = SparkSession.builder.appName("example").getOrCreate()
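# Minimal sketch; the data and column names below are illustrative assumptions.
from pyspark.sql.functions import explode
df = spark.createDataFrame([("Alice", ["apple", "banana", "cherry"])], ["name", "fruits"])
exploded_df = df.select("name", explode(df.fruits).alias("fruit"))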
exploded_df.show()
In this example, the explode operation is applied to the "fruits" column, creating a
new row for each element in the array.
80. What are persist and cache in Spark?
In PySpark, persist and cache are operations used to optimize the performance
of DataFrames or RDDs by storing them in memory or on disk. They allow you to
control how data is stored and accessed during Spark computations.
● persist: The persist operation lets you specify a storage level (for example,
memory only, memory and disk, or disk only) that controls where and how
the data is stored.
● cache: The cache operation is a shorthand for calling persist with the
default storage level, which is to cache the data in memory as deserialized
objects.
Both persist and cache are used to avoid recomputing the same data multiple
times in Spark jobs, which can significantly improve performance.
spark = SparkSession.builder.appName("example").getOrCreate()
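# Minimal sketch; the DataFrame contents are an illustrative assumption.
df = spark.range(0, 1000000)
df.cache()     # shorthand for persist() with the default storage level
df.count()     # first action computes the data and stores it
df.count()     # later actions reuse the cached data instead of recomputing it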
Stages:
● Job: A Spark application is divided into jobs. A job represents a high-level
unit of work and consists of one or more stages. Each job typically
corresponds to a Spark action, like count, saveAsTextFile, or collect.
● Stage: A stage is a logical division within a job. A job can be composed of
one or more stages. Stages are derived from the execution plan of the
Spark application: narrow transformations, which do not require data
shuffling, are pipelined together within a stage, while wide transformations,
which require data shuffling, introduce stage boundaries.
○ Shuffle Map Stage: In a Spark application, the stages that require
data shuffling are called shuffle map stages. These stages produce
data that will be shuffled and exchanged between worker nodes.
○ Result Stage: A result stage is the final stage of a job and typically
represents an action. It collects and aggregates the results produced
by previous stages and performs the final computations to produce
the output of the job.
Tasks:
● Task: A task is the smallest unit of work in Spark. Each stage is divided
into tasks, and tasks are distributed across the worker nodes in the Spark
cluster. A task is a single computation operation that processes a portion of
the data. The number of tasks within a stage depends on the number of
partitions of the input data.
○ Shuffle Map Task: These tasks are responsible for producing data
that needs to be shuffled, such as the output of a reduceByKey
operation. Shuffle map tasks read data, perform operations, and
write data that will be shuffled.
○ Result Task: Result tasks process the shuffled data produced by
shuffle map tasks. They collect and aggregate the shuffled data to
compute the final result.
Partition: A partition is a logical division of data within a stage. It is a subset of
the data that a task processes. The number of partitions is determined by the
level of parallelism in the Spark application and can be configured when reading
data or using repartition and coalesce operations.
Stages and tasks are crucial for Spark's parallel and distributed data processing,
allowing it to efficiently handle large datasets and take advantage of the
resources of a cluster of worker nodes. Spark's execution plan is optimized to
minimize data shuffling and improve the overall performance of data processing
tasks.
KAFKA
82. What is Apache Kafka, where do we use it, and what are its
benefits?
Apache Kafka is an open-source, distributed event streaming platform used to publish, store, and process streams of records in real time. Common use cases include:
● IoT (Internet of Things): Kafka is used to collect and process sensor data
from IoT devices, allowing for real-time insights and actions.
● Metrics and Monitoring: Kafka can be used for collecting and distributing
metrics and monitoring data, enabling real-time monitoring and alerting.
Benefits of Kafka:
● Scalability: Kafka is designed to scale horizontally, making it suitable for
handling massive data streams. You can add more brokers to
accommodate increased data volumes.
● Batch and Real-Time Processing: Kafka can be used for both batch
processing and real-time processing, making it versatile for different use
cases.
● Broker: Kafka brokers are the Kafka server instances that store the data
and serve client requests. They are responsible for data storage and
distribution.
● Topic: Topics are logical channels or categories where data is published
by producers and consumed by consumers. Topics are used to organize
data streams.
● Partition: Each topic can be divided into partitions, which allows for
parallel processing and distribution of data. Partitions are the basic unit of
parallelism in Kafka.
85. What is a Kafka topic and how is it different from a Kafka partition?
Kafka Architecture:
Kafka has a distributed and fault-tolerant architecture, making it highly scalable
and suitable for real-time data streaming. The key components in Kafka's
architecture include producers, brokers, topics, partitions, consumers, and
ZooKeeper (for coordination, although Kafka is moving away from direct
ZooKeeper dependence). Here's how Kafka's publish and subscribe mechanism
works within its architecture:
● Brokers: Brokers are Kafka server instances responsible for storing data
and serving client requests. They manage data distribution and replication.
Brokers work together in a Kafka cluster to provide scalability and fault
tolerance.
● Partitions: Each topic can be divided into multiple partitions. Partitions are
ordered, immutable logs of data. Producers write messages to partitions,
and consumers read from them. The data within a partition is distributed
across brokers for fault tolerance and can be replicated for durability.
● Consumers: Consumers subscribe to one or more topics and read data
from specific partitions. Kafka allows multiple consumers to subscribe to
the same topic and read data independently. Consumers process data
based on their own application logic.
● Fault Tolerance: Kafka clusters are designed for fault tolerance. Data is
replicated across multiple brokers, ensuring that even if some brokers fail,
data remains available and durable.
● Partition Selection: The partitioner ensures that messages with the same
key always go to the same partition, allowing for ordered storage and
processing.
Kafka producers play a crucial role in real-time data pipelines and systems,
allowing applications to publish data to Kafka topics, which can then be
consumed by various downstream consumers for real-time processing and
analytics. The message key, as mentioned earlier, is an optional but important
component of the message that helps determine partitioning and ordering.
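For illustration, a minimal sketch using the third-party kafka-python client (the client library, broker address, topic name, key, and payload are assumptions; other Kafka clients expose similar produce APIs):
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Messages that share a key are routed to the same partition, preserving per-key ordering.
producer.send("user-events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()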
Kafka provides different message delivery semantics that allow you to control
how messages are produced and consumed. These semantics are
"at-most-once," "at-least-once," and "exactly-once." Each semantic offers a
trade-off between message delivery guarantee and processing overhead:
● At-Most-Once (Zero or One Delivery):
Characteristics:
○ Low processing overhead on the producer side.
○ Messages may be lost if a failure occurs before successful delivery.
○ Suitable for scenarios where occasional message loss is acceptable,
and low latency is a priority.
● At-Least-Once (One or More Delivery):
○ Producer Side: The producer sends a message and waits for
acknowledgment (ack) from Kafka. It retries sending the message
until it receives an ack.
○ Consumer Side: The consumer reads messages from Kafka. It may
process a message multiple times, but it ensures that no message is
lost. A deduplication mechanism is often needed to prevent
processing the same message multiple times.
Characteristics:
○ Guaranteed message delivery, but potential for duplicate processing.
○ Higher processing overhead on the producer side due to
acknowledgment and retries.
○ Suitable for scenarios where message loss is unacceptable, and
some level of duplicate processing can be handled.
● Exactly-Once (Exactly One Delivery):
Characteristics:
○ Guarantees both no message loss and no duplicate processing.
○ Highest processing overhead on both the producer and consumer
sides.
○ Suitable for scenarios where message loss is unacceptable, and
strict deduplication is required.
It's important to note that achieving "exactly-once" semantics can be more
challenging and may involve additional complexities in the consumer application.
The producer and consumer applications need to coordinate and handle
deduplication effectively.
Git:
● Git is a distributed version control system (DVCS) designed for tracking
changes in source code during software development.
● It allows multiple developers to collaborate on a project by maintaining a
history of changes, merging contributions, and providing tools for
managing code and project versions.
● Git operates locally on a developer's machine, and it doesn't require a
constant network connection to function.
● Git focuses on version control, branching, merging, and local development
workflow.
GitHub:
● GitHub, on the other hand, is a web-based platform for hosting Git
repositories and provides additional collaboration and project management
features.
● It offers a cloud-based environment where developers can store, share,
and collaborate on Git repositories.
● GitHub provides tools for issue tracking, pull requests, code reviews, team
collaboration, and project management.
● GitHub is a popular hosting platform for open-source and private software
projects, providing visibility and accessibility to a wider community of
developers.
Key Differences:
● Git is the version control system itself, whereas GitHub is a web-based
hosting platform for Git repositories.
● Git is a command-line tool and operates locally, while GitHub is a web
interface and a cloud-based service.
● Git is used for version control and local development, while GitHub adds
collaboration, social coding, and project management features on top of
Git.
● GitHub allows you to host your Git repositories publicly or privately, making
it a central platform for open-source and private projects.
Fork:
● Forking is an action typically associated with online Git hosting platforms
like GitHub.
● It involves creating a copy of someone else's Git repository in your own
GitHub account.
● Forking is often used when you want to contribute to an open-source
project, as it allows you to create a personal copy of the project that you
can modify and then create pull requests to merge your changes back into
the original repository.
Clone:
● Cloning is the process of creating a local copy of a Git repository, typically
from a remote repository.
● Cloning allows you to work on a project locally and interact with the remote
repository for pulling and pushing changes.
● You can clone a repository from GitHub or any other Git hosting service to
your local machine using the git clone command.
Branch:
● A branch in Git is a parallel line of development, allowing you to work on
features or fixes without affecting the main codebase.
● Creating a branch is a local action, and it's used for isolating changes and
experimental work.
● Branches can be used to implement new features, fix bugs, or work on
different aspects of a project.
● Branches can be merged back into the main branch or other branches
when the work is complete.
A Pull Request (PR) is a feature offered by Git hosting platforms like GitHub,
GitLab, and Bitbucket that enables developers to propose changes to a
codebase and request that these changes be reviewed, discussed, and
eventually merged into the main branch of the repository. Pull Requests are a
fundamental part of the collaborative development workflow and are commonly
used for the following purposes:
● Code Review: A PR provides a platform for other team members to review
the changes introduced in a branch. Reviewers can leave comments,
suggest improvements, and discuss the code.
● Quality Assurance: PRs are an effective way to ensure the quality and
correctness of code changes. Reviewers can catch and address issues,
potential bugs, or security vulnerabilities.
A merge conflict occurs when Git cannot automatically merge changes from one
branch into another, typically because the same part of the code has been
modified in both branches. Handling merge conflicts involves manual intervention
to resolve the conflicting changes.
● Identify the Conflict: When a conflict occurs, Git will mark the conflicting
sections in the affected files. You'll see markers like <<<<<<<, =======,
and >>>>>>> to indicate the conflicting sections.
● Manually Resolve the Conflict: Open the conflicted file(s) in a text editor
or code editor. Review the conflicting sections and decide which changes
to keep. Remove the conflict markers and ensure the file contains the
desired code.
● Add and Commit Changes: After resolving the conflict, stage the
modified files using git add. Then, commit the changes with a commit
message explaining that you resolved the conflict.
● Complete the Merge: After resolving all conflicts, use git merge --continue
(or simply commit the resolved files with git commit) to complete the
merge operation.
● Push the Changes: If you are working in a shared repository, push the
merged changes to the remote repository. Use the git push command.
● Inform the Team: Communicate with your team to let them know that the
conflict has been resolved and the merge is complete.
git fetch and git pull are both Git commands used to update your local repository
with changes from a remote repository. However, they work differently:
● git fetch:
● It only fetches changes from the remote repository, updating your
local copy of remote branches and their history. It doesn't
automatically merge or apply these changes to your current working
branch.
● It is a read-only operation that retrieves the latest commits and
updates the remote tracking branches (e.g., origin/master) in your
local repository.
● It allows you to inspect and review changes before merging them
into your working branch. This is useful for avoiding unintended
merges or conflicts.
● git pull:
● It combines git fetch with an automatic merge or rebase. After
fetching the changes from the remote, it attempts to integrate them
into your current working branch.
● It is a more aggressive operation because it automatically updates
your working branch with changes from the remote branch,
potentially leading to conflicts.
● It is convenient when you want to quickly update your branch with
the latest changes and are confident that it won't result in conflicts or
unintended merges.
In summary, git fetch fetches changes from the remote repository but doesn't
automatically apply them to your working branch, while git pull fetches changes
and attempts to merge or rebase them into your working branch.
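For example (remote and branch names are illustrative):
git fetch origin            # update remote-tracking branches only; your working branch is untouched
git log main..origin/main   # review incoming commits before integrating them
git merge origin/main       # integrate when ready
git pull origin main        # fetch + merge in a single step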
97. How to Revert a Commit That Has Already Been Pushed and
Made Public?
If you need to revert a commit that has already been pushed to a remote
repository and made public, you should follow these steps:
Create a Revert Commit:
● Checkout the branch where the commit to be reverted was made.
● Use the git revert command followed by the commit hash of the
commit to be reverted. This creates a new commit that undoes the
changes introduced by the reverted commit.
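For example (the branch name and commit hash are placeholders):
git checkout main
git revert <commit-hash>     # creates a new commit that undoes the change
git push origin main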
● Add Commits: Make changes to your code in the branch and commit your
changes regularly. Each commit should represent a single logical change
or unit of work. Commits provide a clear history and help in code review.
● Open a Pull Request (PR): When you're ready to share your work or
request feedback, open a Pull Request. This action initiates a discussion
and code review process. The PR includes a clear description of the
changes, making it easier for reviewers to understand your work.
● Discuss and Review: Team members can review the code, leave
comments, and suggest improvements in the PR. This collaborative
process is a critical step in maintaining code quality and catching potential
issues early.
● Run Tests: Automated tests and checks can be integrated into the
workflow using Continuous Integration (CI) services. GitHub Actions, for
example, can run tests to ensure that the changes do not introduce
regressions.
● Merge the PR: Once the code is reviewed, approved, and passes all tests,
the PR can be merged into the main branch (often main or master). This
integrates the changes into the main codebase.
● Delete the Branch: After merging, the feature branch can be deleted. This
keeps the repository clean and helps avoid clutter.
● Repeat: The process is cyclical. The next piece of work starts with a new
branch, and the cycle continues.
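For example, a typical command sequence for the workflow described above (the branch name and commit message are illustrative):
git checkout -b feature/add-report      # create and switch to a feature branch
git add .
git commit -m "Add report generation"
git push -u origin feature/add-report   # publish the branch, then open a PR on GitHub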
● Education and Training: Educate your team about best practices for
handling sensitive information in a public repository.
Remember that even with these precautions, it's often best to avoid storing highly
sensitive information in a public repository altogether, if possible. Consider using
private repositories for sensitive projects or use alternatives like environment
variables, encrypted configuration files, and external secrets management
services.
101. Tracking and Managing Project Enhancements and Bugs in
GitHub:
GitHub provides several features and best practices for tracking and managing
project enhancements and bugs:
● Issues: Use GitHub Issues to track and manage both bugs and feature
requests. Issues can be labeled, assigned to team members, and
categorized.
● Project Boards: GitHub Project Boards help organize and prioritize issues.
You can create custom boards, such as "To-Do," "In Progress," and
"Done," to visualize the progress of issues.
● Pull Requests: Link issues to pull requests. This shows the connection
between code changes and the issues they address.
● Closing and Referencing: Closing an issue is a way to indicate that it has
been resolved. You can reference issues from commit messages and pull
request descriptions.
By using these GitHub features and best practices, you can effectively track,
prioritize, and manage enhancements and bugs in your projects, promoting
transparency and collaboration within your development team and with the
broader community.
GitHub Actions are workflows defined in YAML files that allow you to automate
various aspects of your software development and CI/CD pipelines. There are
different types of GitHub Actions:
● Matrix Builds: You can define matrix builds in your workflows, allowing
you to test your code against multiple versions of dependencies or on
different platforms.
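For illustration, a minimal matrix-build workflow sketch (the file path, workflow name, Python versions, and action versions are assumptions):
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt && pytest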