
ADTHEORY4


What is ETL?

Extract, transform, and load (ETL) is the process of combining data from
multiple sources into a large, central repository called a data warehouse. ETL
uses a set of business rules to clean and organize raw data and prepare it for
storage, data analytics, and machine learning (ML). You can address specific
business intelligence needs through data analytics (such as predicting the
outcome of business decisions, generating reports and dashboards, reducing
operational inefficiency, and more).

Why is ETL important?


Organizations today have both structured and unstructured data from various
sources including:

 Customer data from online payment and customer relationship management (CRM) systems

 Inventory and operations data from vendor systems

 Sensor data from Internet of Things (IoT) devices

 Marketing data from social media and customer feedback

 Employee data from internal human resources systems

By applying the process of extract, transform, and load (ETL), individual raw
datasets can be prepared in a format and structure that is more consumable
for analytics purposes, resulting in more meaningful insights.

INTRODUCTION:

ETL stands for Extract, Transform, Load. It is a process used in data
warehousing to extract data from various sources, transform it into a
format suitable for loading into a data warehouse, and then load it into
the warehouse. The ETL process can be broken down into the following
three stages:

1. Extract: The first stage in the ETL process is to extract data from
various sources such as transactional systems, spreadsheets, and flat
files. This step involves reading data from the source systems and
storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a
format that is suitable for loading into the data warehouse. This may
involve cleaning and validating the data, converting data types,
combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data
warehouse. This step involves creating the physical data structures and
loading the data into the warehouse.

The ETL process is iterative and is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in
the data warehouse is accurate, complete, and up-to-date. It also helps
to ensure that the data is in the format required for data mining and
reporting.

The ETL process can also use the pipelining concept: as soon as some data is
extracted, it can be transformed, and during that period new data can be
extracted. Likewise, while the transformed data is being loaded into the data
warehouse, the already extracted data can be transformed.
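
A minimal sketch of this pipelined flow is shown below. It assumes a
hypothetical orders.csv source file and a local SQLite database standing in
for the warehouse; the generator-based stages let extraction, transformation,
and loading overlap instead of running strictly one after another.

# Pipelined ETL sketch: extract, transform, and load overlap via generators.
# "orders.csv" and the SQLite warehouse file are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    # Extract stage: read source rows one at a time.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def transform(rows):
    # Transform stage: clean and reshape each row as soon as it is extracted.
    for row in rows:
        yield {
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            "amount": round(float(row["amount"]), 2),
        }

def load(rows, conn):
    # Load stage: write transformed rows into the warehouse table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    for row in rows:
        conn.execute("INSERT INTO orders VALUES (:order_id, :customer, :amount)", row)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("orders.csv")), warehouse)

Because each stage is a generator, a row can be loaded while later rows are
still being extracted and transformed.
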
How does Data Extraction work?
Identifying Data Sources

The data extraction process starts with identifying data sources. You need to be
clear on what data you need and where it is located. It can be in documents,
databases, or social media apps. Once you have identified your data sources,
you need to select the appropriate extraction method for each source. For
images, you might need OCR; for websites, you might need web scraping
software; and so on.

Source Connection

Then, you need to establish a connection to the selected data sources. The
connection method may vary depending on the source type. You may use a
database connection string, username, and password for databases. You may
need to use APIs for web-based sources. Some data extraction software offers a
complete solution with various inbuilt connectors so you can connect to all
sources simultaneously.
Query or Retrieval

For databases, you can use SQL queries to retrieve specific data from tables.
Documents may require text extraction using OCR or specific document
parsers. Most data extraction tools are now code-free, which means all you need
to do is drag and drop a connector and connect to any data source without
learning extensive SQL queries or programming languages.
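
For example, a database source might be queried directly with SQL; in the
sketch below, the source.db file, the sales table, and its column names are
illustrative assumptions.

# Retrieval sketch: pull specific columns from a source database table.
# The database file, table, and column names are placeholders.
import sqlite3

conn = sqlite3.connect("source.db")
cursor = conn.execute(
    "SELECT customer_id, order_date, amount FROM sales WHERE order_date >= ?",
    ("2024-01-01",),
)
rows = cursor.fetchall()  # extracted records, ready for the staging area
conn.close()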

Data Transformation and Loading

Once the data is extracted, it often doesn’t comply with the format required by
the end destination or even for analysis. For example, you could have data in
XML or JSON, and you might need to convert it into Excel for analysis. There
could be multiple scenarios, which is why data transformation is essential. Some
common transformation tasks include:

 Cleaning data to remove duplicates, handle missing values, and correct errors.
 Normalizing data by converting date formats or standardizing units of
measurement.
 Enriching data by adding external information or calculated fields.

The transformed data is then fed into a destination, which varies according to
the objective of the data. You could store data in flat files such as CSV, JSON, or
Parquet files or put it in a relational database (e.g., MySQL, PostgreSQL) or
NoSQL database (e.g., MongoDB).
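
A small pandas sketch of these transformation tasks, assuming a hypothetical
orders.json extract with customer_id, order_date, and amount_cents fields,
might look like this:

# Transformation sketch with pandas: cleaning, normalizing, and enriching,
# then loading the result into a flat file and a SQLite table.
# The input file and column names are assumptions for illustration.
import sqlite3
import pandas as pd

df = pd.read_json("orders.json")                      # hypothetical extracted data
df = df.drop_duplicates()                             # cleaning: remove duplicates
df = df.dropna(subset=["customer_id"])                # cleaning: handle missing keys
df["order_date"] = pd.to_datetime(df["order_date"])   # normalizing: one date format
df["amount_usd"] = df["amount_cents"] / 100           # normalizing: standard unit
df["order_year"] = df["order_date"].dt.year           # enriching: calculated field

df.to_csv("orders_clean.csv", index=False)            # load to a flat file
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("orders", conn, if_exists="replace", index=False)  # load to a database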

Types of data extraction

1. Deferred Update: It is a technique for the maintenance of the transaction log
files of the DBMS. It is also called the NO-UNDO/REDO technique. It is used for
the recovery of transaction failures that occur due to power, memory, or OS
failures. Whenever any transaction is executed, the updates are not made
immediately to the database. They are first recorded in the log file, and the
changes are applied only once the commit is done. This is called the "re-doing"
process (a toy code sketch of this idea appears below). If a rollback is done,
none of the changes are applied to the database and the changes in the log file
are discarded. If the commit is done before the system crashes, then after
restarting the system the changes recorded in the log file are applied to the
database.
2. Immediate Update: It is a technique for the maintenance of the transaction log
files of the DBMS. It is also called UNDO/REDO technique. It is used for the
recovery of transaction failures that occur due to power, memory, or OS failures.
Whenever any transaction is executed, the updates are made directly to the
database and the log file is also maintained which contains both old and new
values. Once the commit is done, all the changes get stored permanently in the
database, and records in the log file are thus discarded. Once rollback is done the
old values get restored in the database and all the changes made to the database
are also discarded. This is called the “Un-doing” process. If the commit is done
before crashing the system, then after restarting the system the changes are stored
permanently in the database.
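
A toy sketch of the deferred-update idea (changes buffered in a log and
applied to the database only on commit) is shown below; it is an illustration
of the concept, not a real recovery manager.

# Toy illustration of deferred update: writes go to a log first and are applied
# to the "database" (a dict here) only on commit; on rollback the log is discarded.
class DeferredUpdateTxn:
    def __init__(self, database):
        self.database = database   # dict standing in for the database
        self.log = []              # (key, new_value) log records

    def write(self, key, value):
        self.log.append((key, value))   # recorded in the log, not yet applied

    def commit(self):
        for key, value in self.log:     # "re-doing": apply the logged changes
            self.database[key] = value
        self.log.clear()

    def rollback(self):
        self.log.clear()                # nothing was applied, so nothing to undo

db = {"balance": 100}
txn = DeferredUpdateTxn(db)
txn.write("balance", 150)
txn.commit()                            # db["balance"] is now 150
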
Difference between Deferred update and Immediate update:
Deferred Update:
 The changes are not applied immediately to the database.
 The log file contains all the changes that are to be applied to the database.
 Once rollback is done, all the records of the log file are discarded and no
changes are applied to the database.
 Concepts of buffering and caching are used in this method.
 The major disadvantage of this method is that it requires a lot of time for
recovery in case of a system failure.
 In this method of recovery, the changes carried out by a transaction are
first recorded in the log file and then applied to the database on commit.
On rollback, the maintained log records are discarded and never applied to
the database.

Immediate Update:
 The changes are applied directly to the database.
 The log file contains both the old and the new values.
 Once rollback is done, the old values are restored to the database using the
records of the log file.
 The concept of shadow paging is used in this method.
 The major disadvantage of this method is that there are frequent I/O
operations while the transaction is active.
 In this method of recovery, the database is updated directly after the
changes made by the transaction, and the log file keeps the old and new
values. In the case of rollback, these records are used to restore the old
values.

The Data Transformation Process


In a cloud data warehouse, the data transformation process most typically takes the form
of ELT (Extract Load Transform) or ETL (Extract Transform Load).
With cloud storage costs becoming cheaper by the year, many teams opt for ELT— the
difference being that all data is loaded in cloud storage, then transformed and added to a
warehouse.

The transformation process generally follows 6 stages:

1. Data Discovery: During the first stage, data teams work to
understand and identify applicable raw data. By profiling data,
analysts/engineers can better understand the transformations that
need to occur.
2. Data Mapping: During this phase, analysts determine how
individual fields are modified, matched, filtered, joined, and
aggregated.
3. Data Extraction: During this phase, data is moved from a
source system to a target system. Extraction may include
structured data (databases) or unstructured data (event
streaming, log files) sources.
4. Code Generation and Execution: Once extracted and loaded, the raw
data needs to be transformed and stored in a format appropriate for
BI and analytic use. This is frequently accomplished by analytics
engineers, who write SQL/Python to programmatically transform data
(a sketch follows this list). This code is executed daily/hourly to
provide timely and appropriate analytic data.
5. Review: Once implemented, code needs to be reviewed and
checked to ensure a correct and appropriate implementation.
6. Sending: The final step involves sending data to its target
destination. The target might be a data warehouse or other
database in a structured format.
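
For example, the code generation and execution step often amounts to a
scheduled SQL transformation run from Python; the warehouse file, table
names, and columns below are hypothetical.

# Sketch of a scheduled transformation step: raw, loaded data is reshaped into
# an analytics-friendly table with SQL executed from Python.
# The warehouse file, tables, and columns are hypothetical.
import sqlite3

TRANSFORM_SQL = """
CREATE TABLE IF NOT EXISTS daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM raw_orders
GROUP BY order_date;
"""

def run_transformation():
    with sqlite3.connect("warehouse.db") as conn:
        conn.executescript(TRANSFORM_SQL)   # typically triggered daily/hourly by a scheduler

if __name__ == "__main__":
    run_transformation()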

Data loading

What is loading?
Loading is the final step in the ETL process. In this step, the extracted and
transformed data is loaded into the target database. To make the data load
efficient, it is necessary to index the database and disable the constraints
before loading the data. All three steps in the ETL process can be run in
parallel. Data extraction takes time, so the transformation phase is executed
simultaneously. This prepares the data for the third phase, loading. As soon as
some data is ready, it is loaded without waiting for the previous steps to be
completed.
The loading process is the physical movement of the data from the computer
systems storing the source database(s) to that which will store the data warehouse
database. The entire process of transferring data to a data warehouse repository is
referred to in the following ways:
1. Initial Load: For the very first time loading all the data warehouse
tables.
2. Incremental Load: Periodically applying ongoing changes as per the
requirement. After the data is loaded into the data warehouse database,
verify the referential integrity between the dimensions and the fact
tables to ensure that all records belong to the appropriate records in the
other tables. The DBA must verify that each record in the fact table is
related to one record in each dimension table that will be used in
combination with that fact table.
3. Full Refresh: Deleting the contents of a table and reloading it with
fresh data.
Refresh versus Update
After the initial load, the data warehouse needs to be maintained and updated and
this can be done by the following two methods:
 Update– application of incremental changes in the data sources.
 Refresh– complete reloads at specified intervals.
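
The difference between an incremental load (update) and a full refresh can be
illustrated with two simple SQL patterns, executed here from Python; the table
names and the updated_at watermark column are hypothetical.

# Sketch of incremental load (apply only new/changed rows) versus full refresh
# (delete and reload). Table names and the watermark column are hypothetical.
import sqlite3

def incremental_load(conn, last_loaded_at):
    # Update: apply only the rows that changed since the last load.
    conn.execute(
        "INSERT INTO dw_orders SELECT * FROM staging_orders WHERE updated_at > ?",
        (last_loaded_at,),
    )
    conn.commit()

def full_refresh(conn):
    # Refresh: delete the table contents and reload everything.
    conn.execute("DELETE FROM dw_orders")
    conn.execute("INSERT INTO dw_orders SELECT * FROM staging_orders")
    conn.commit()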
Data Loading-
Data is physically moved to the data warehouse. The loading takes place within a
"load window." The tendency is toward near real-time updates, as data
warehouses are increasingly used for operational applications.
Loading the Dimension Tables
The procedure for maintaining the dimension tables includes two functions: the
initial loading of the tables and thereafter applying the changes on an ongoing
basis. System-generated (surrogate) keys are used in a data warehouse, whereas
the records in the source systems have their own production keys. Therefore,
before an initial load or an ongoing load, the production keys must be
converted to the system-generated keys in the data warehouse. Another issue is
the application of Type 1, Type 2, and Type 3 changes to the dimension tables.
Loading changes to a dimension table

Loading the Fact tables: History and Incremental Loads

 The key in the fact table is the concatenation of the keys from the
dimension tables.
 For this reason, the dimension records are loaded first.
 A concatenated key is then created from the keys of the corresponding
dimension tables.
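
A small sketch of this key handling follows: production keys from a source
record are replaced by the surrogate keys maintained for the dimension tables
before the fact row is built. All names and values are hypothetical.

# Sketch of building a fact row: production keys are mapped to the surrogate
# keys of the dimension tables, and the combination of those keys identifies
# the fact row. All names and values are hypothetical.
customer_dim = {"CUST-001": 1, "CUST-002": 2}   # production key -> surrogate key
product_dim = {"SKU-9": 10, "SKU-12": 11}

def build_fact_row(source_record):
    return {
        "customer_key": customer_dim[source_record["customer_id"]],
        "product_key": product_dim[source_record["sku"]],
        "quantity": source_record["qty"],
        "amount": source_record["amount"],
    }

fact_row = build_fact_row(
    {"customer_id": "CUST-001", "sku": "SKU-9", "qty": 3, "amount": 29.97}
)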

Methods for data loading

 Cloud-based: ETL solutions in the cloud are frequently able to process
data in real time and are designed for speed and scalability. They also
come with the vendor's experience and ready-made infrastructure, and the
vendor may offer advice on best practices for each organization's
particular configuration and requirements.
 Batch processing: Data is moved every day or every week via ETL
systems that use batch processing. Large data sets and organizations
that don’t necessarily require real-time access to their data are the
greatest candidates for it.
 Open-source: Since their codebases are shared, editable, and publicly
available, many open-source ETL systems are extremely affordable.
Despite being a decent substitute for commercial solutions, many tools
may still need some hand-coding or customization.

What is data quality?


Data quality measures how well a dataset meets criteria for
accuracy, completeness, validity, consistency, uniqueness,
timeliness and fitness for purpose, and it is critical to all data
governance initiatives within an organization.
Dimensions of data quality (*)

Data quality is evaluated based on a number of dimensions, which can differ based on the source of
information. These dimensions are used to categorize data quality metrics:

 Completeness: This represents the amount of data that is usable or complete. If there is a
high percentage of missing values, it may lead to a biased or misleading analysis if the data is
not representative of a typical data sample.

 Uniqueness: This accounts for the amount of duplicate data in a dataset. For example, when
reviewing customer data, you should expect that each customer has a unique customer ID.

 Validity: This dimension measures how much data matches the required format for any
business rules. Formatting usually includes metadata, such as valid data types, ranges,
patterns, and more.

 Timeliness: This dimension refers to the readiness of the data within an expected time
frame. For example, customers expect to receive an order number immediately after they
have made a purchase, and that data needs to be generated in real-time.

 Accuracy: This dimension refers to the correctness of the data values based on the agreed
upon “source of truth.” Since there can be multiple sources which report on the same metric,
it’s important to designate a primary data source; other data sources can be used to confirm
the accuracy of the primary one. For example, tools can check to see that each data source is
trending in the same direction to bolster confidence in data accuracy.

 Consistency: This dimension evaluates data records from two different datasets. As
mentioned earlier, multiple sources can be identified to report on a single metric. Using
different sources to check for consistent data trends and behavior allows organizations to
trust any actionable insights from their analyses. This logic can also be applied to
relationships between data. For example, the number of employees in a department should
not exceed the total number of employees in a company.

 Fitness for purpose: Finally, fitness for purpose helps to ensure that the data asset meets a
business need. This dimension can be difficult to evaluate, particularly with new, emerging
datasets.
These metrics help teams conduct data quality assessments across their organizations to evaluate
how informative and useful data is for a given purpose.
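
A minimal sketch of how a few of these dimensions might be measured on a
pandas DataFrame (completeness, uniqueness, and a simple validity rule); the
column names and the email format rule are assumptions for illustration.

# Sketch of simple data quality metrics on a customer DataFrame.
# Column names and the "@" validity rule are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
})

completeness = df["email"].notna().mean()                  # share of non-missing values
uniqueness = 1 - df["customer_id"].duplicated().mean()     # share of non-duplicate IDs
validity = df["email"].str.contains("@", na=False).mean()  # share matching the format rule

print(f"completeness={completeness:.2f} uniqueness={uniqueness:.2f} validity={validity:.2f}")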

Why is data quality important?(*)

Over the last decade, developments within hybrid cloud, artificial intelligence, the Internet of Things
(IoT), and edge computing have led to the exponential growth of big data. As a result, the practice of
master data management (MDM) has become more complex, requiring more data stewards and
rigorous safeguards to ensure good data quality.

Businesses rely on data quality management to support their data analytics initiatives, such as
business intelligence dashboards. Without this, there can be devastating consequences, even ethical
ones, depending on the industry (e.g. healthcare). Data quality solutions exist to help companies
maximize the use of their data, and they have driven key benefits, such as:

Better business decisions: High-quality data allows organizations to identify key performance
indicators (KPIs) to measure the performance of various programs, which allows teams to improve or
grow them more effectively. Organizations that prioritize data quality will have an advantage over
their competitors.

Improved business processes: Good data also means that teams can identify where there are
breakdowns in operational workflows. This is particularly true for the supply chain industry, which
relies on real-time data to determine appropriate inventory levels and the location of goods after
shipment.

Increased customer satisfaction: High data quality provides organizations, particularly marketing and
sales teams, with deep insight into their target buyers. They are able to integrate different data
across the sales and marketing funnel, which enables them to sell their products more effectively. For
example, the combination of demographic data and web behavior can inform how organizations
create their messaging, invest their marketing budget, or staff their sales teams to service existing or
potential clients.

Challenges and Problems in Data Cleaning


Below are some of the major challenges and problems faced while cleaning
data:

Incomplete and Missing Data

One of the main challenges in data cleansing is the management of
inadequate or absent data. In practical situations, datasets frequently
contain gaps or missing values, which can greatly influence the precision of
analysis and modeling. Data analysts are confronted with the challenge of
identifying the best strategy to address missing data, whether by utilizing
imputation methods or eliminating incomplete records. Finding the
appropriate balance is essential to prevent the introduction of bias or the
distortion of the inherent patterns within the data. Additionally, the choice
of imputation technique should take into account the specific characteristics
of the dataset and the nature of the missing data to ensure reliable and
accurate results.
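
A hedged example of two common choices, dropping incomplete records versus
imputing with a summary statistic, using pandas; the DataFrame and column
names are illustrative only.

# Sketch of handling missing values: drop incomplete records or impute them.
# The DataFrame and column names are illustrative only.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31], "income": [50000, 62000, None, 48000]})

dropped = df.dropna()                               # option 1: remove incomplete records
imputed = df.fillna(df.median(numeric_only=True))   # option 2: impute with the column median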

Inconsistencies and Outliers

Inconsistencies and outliers may emerge within datasets for a multitude of
reasons, encompassing human oversight, inaccuracies in data input, or
technical malfunctions. These discrepancies may manifest in diverse
manners, like typographical errors, inconsistencies in formatting, or data
points that deviate from realistic ranges. It is of utmost importance to
detect and rectify such irregularities to uphold the integrity of the data.

Applying data cleansing approaches, such as algorithms designed to
identify outliers and rules for data validation, proves invaluable in
efficiently addressing these concerns. Additionally, employing automated
procedures and rigorous quality control measures during the data collection
phase can help minimize errors and ensure the overall reliability of the
dataset.
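
One common outlier-detection approach is the interquartile-range (IQR) rule; a
small pandas sketch follows, with illustrative transaction amounts.

# Sketch of outlier detection with the interquartile range (IQR) rule.
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
import pandas as pd

amounts = pd.Series([12.0, 14.5, 13.2, 15.1, 250.0, 14.8])

q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]
print(outliers)   # flags the 250.0 entry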

Duplicate Data

In the field of data cleaning, organizations often face a common obstacle
known as duplicated or replicated data. This issue arises when data
originates from various sources, as well as from human slip-ups or glitches
in systems. Duplicate entries not only occupy unnecessary storage space but
also contribute to erroneous analysis and distorted outcomes. To address this
challenge, organizations can employ reliable data deduplication approaches,
such as record linkage algorithms and fuzzy matching techniques.

By utilizing these methods, they can effectively detect and remove redundant
data, thus improving the overall quality of the dataset. Additionally,
incorporating these techniques also ensures that the dataset contains
accurate and reliable information for further analysis and decision-making
purposes.
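
An illustrative sketch: exact duplicates are removed with pandas, and
near-duplicate names are compared with a simple similarity score from the
standard library as a lightweight stand-in for full record-linkage or
fuzzy-matching tools.

# Sketch of deduplication: exact duplicates dropped, then near-duplicate names
# flagged with a simple similarity ratio (a stand-in for full record linkage).
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({"name": ["Acme Corp", "Acme Corp", "ACME Corporation", "Globex"]})
df = df.drop_duplicates()   # remove exact duplicates

def similar(a, b, threshold=0.7):
    # Case-insensitive similarity; the 0.7 threshold is an arbitrary example value.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

names = df["name"].tolist()
candidates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:] if similar(a, b)]
print(candidates)   # near-duplicate pairs to review manually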

Data Standardization

Data gathered from various sources may display disparities in formats,
measurement units, or terminology. These disparities can hamper the
integration and analysis of data. Data standardization encompasses the
process of converting data into a standardized format, guaranteeing
uniformity across diverse datasets. Approaches such as data normalization,
establishing consistent naming conventions, and creating data dictionaries
play a crucial role in accomplishing data standardization. By implementing
data standardization, organizations can improve data coherence and
streamline precise comparisons and analysis. Moreover, standardized data
promotes compatibility among different systems and facilitates seamless
collaboration.
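
A small sketch of standardization, assuming one hypothetical source that
reports weights in pounds, uses a single US date format, and has inconsistent
column naming:

# Sketch of data standardization: unify column naming, measurement units,
# and date formats. The columns and conversion are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "Product Name": ["widget", "gadget"],
    "weight_lb": [2.2, 11.0],
    "ship_date": ["01/31/2024", "02/15/2024"],
})

df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]   # naming convention
df["weight_kg"] = (df["weight_lb"] * 0.453592).round(3)                  # standard unit
df["ship_date"] = pd.to_datetime(df["ship_date"], format="%m/%d/%Y")     # single date format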

Scalability and Performance

Data cleaning becomes increasingly challenging as the volume and
complexity of data grow. Large-scale datasets require efficient algorithms
and techniques to handle cleaning operations effectively. Moreover, as
organizations strive for real-time data analysis, the speed and performance
of data-cleaning processes become crucial. Leveraging parallel computing,
distributed systems, and optimized algorithms can help overcome scalability
and performance challenges, ensuring timely data cleaning without
compromising quality.
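
A hedged sketch of parallelizing a cleaning step across chunks of records with
the standard library's multiprocessing pool; the cleaning function here is a
trivial placeholder.

# Sketch of scaling a cleaning step: split records into chunks and clean them
# in parallel with a process pool. The cleaning logic is a trivial placeholder.
from multiprocessing import Pool

def clean_chunk(chunk):
    # Placeholder cleaning: trim whitespace and drop empty records.
    return [value.strip() for value in chunk if value and value.strip()]

if __name__ == "__main__":
    records = ["  alice ", "bob", "", "  carol"] * 1000
    chunks = [records[i:i + 500] for i in range(0, len(records), 500)]
    with Pool(processes=4) as pool:
        cleaned = [row for chunk in pool.map(clean_chunk, chunks) for row in chunk]
    print(len(cleaned))   # 3000 non-empty, trimmed records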

Data Privacy and Security

Data cleaning involves working with sensitive and confidential information,
which raises concerns regarding data privacy and security. Organizations must
ensure compliance with relevant data protection regulations, such as the
General Data Protection Regulation (GDPR) or industry-specific standards.
Implementing robust data anonymization techniques, secure data storage
practices, and access controls helps safeguard sensitive data during the
cleaning process.
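
An illustrative sketch of pseudonymizing a direct identifier with a keyed hash
before records enter the cleaning process; the key handling shown here is
deliberately simplified.

# Sketch of pseudonymization: replace a direct identifier with a keyed hash so
# cleaning can proceed without exposing the raw value. In practice the secret
# key would come from a managed secret store, not the source code.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "purchase_total": 42.50}
record["email"] = pseudonymize(record["email"])
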
Domain Knowledge and Expertise

Data cleaning is not solely a technical endeavor but also requires domain
knowledge and expertise. Understanding the context, semantics, and
intricacies of the data is crucial for making informed decisions during the
cleaning process. Collaborating with subject matter experts and domain
specialists can significantly improve the quality and accuracy of data-cleaning
outcomes. Domain knowledge helps in identifying patterns, resolving
ambiguities, and making data-driven decisions specific to the industry or
organization.

In this article, we have examined several common obstacles encountered during
data cleansing, such as missing data, inconsistencies and outliers, duplicates,
data standardization, scalability and performance, data privacy, and the need
for domain knowledge and expertise.
