ADTHEORY4
Extract, transform, and load (ETL) is the process of combining data from
multiple sources into a large, central repository called a data warehouse. ETL
uses a set of business rules to clean and organize raw data and prepare it for
storage, data analytics, and machine learning (ML). You can address specific
business intelligence needs through data analytics (such as predicting the
outcome of business decisions, generating reports and dashboards, reducing
operational inefficiency, and more).
By applying the process of extract, transform, and load (ETL), individual raw
datasets can be prepared in a format and structure that is more consumable
for analytics purposes, resulting in more meaningful insights.
INTRODUCTION:
1. ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a
format suitable for loading into a data warehouse, and then load it into
the warehouse. The process of ETL can be broken down into the
following three stages:
2. Extract: The first stage in the ETL process is to extract data from
various sources such as transactional systems, spreadsheets, and flat
files. This step involves reading data from the source systems and
storing it in a staging area.
3. Transform: In this stage, the extracted data is transformed into a
format that is suitable for loading into the data warehouse. This may
involve cleaning and validating the data, converting data types,
combining data from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data
warehouse. This step involves creating the physical data structures and
loading the data into the warehouse.
5. The ETL process is an iterative process that is repeated as new data is
added to the warehouse. The process is important because it ensures
that the data in the data warehouse is accurate, complete, and up-to-
date. It also helps to ensure that the data is in the format required for
data mining and reporting.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. Likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed, so the three stages overlap in time, as illustrated below.
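As an illustration only, the following Python sketch pipelines the three stages with threads and queues; the extract, transform, and load functions and the record format are invented placeholders, not any particular tool's API.

# Sketch of a pipelined ETL flow using threads and queues.
# The extract/transform/load steps below are placeholders; in a real
# pipeline they would read from source systems and write to a warehouse.
import queue
import threading

SENTINEL = object()  # marks the end of a stream

def extract(out_q):
    for record in ({"id": i, "value": i * 10} for i in range(5)):  # stand-in source
        out_q.put(record)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    while (record := in_q.get()) is not SENTINEL:
        record["value_squared"] = record["value"] ** 2  # stand-in transformation
        out_q.put(record)
    out_q.put(SENTINEL)

def load(in_q):
    while (record := in_q.get()) is not SENTINEL:
        print("loaded:", record)  # stand-in for writing to the warehouse

extracted, transformed = queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=extract, args=(extracted,)),
    threading.Thread(target=transform, args=(extracted, transformed)),
    threading.Thread(target=load, args=(transformed,)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

In practice an ETL tool or scheduler handles this orchestration, but the overlap of the three stages is the essential idea.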
How does Data Extraction work?
Identifying Data Sources
The data extraction process starts with identifying data sources. You need to be
clear on what data you need and where your data is located. It can be in
documents, databases, or social media apps. Once you have identified your data
sources, you need to select the appropriate extraction method for each source. For images, you might need OCR; for websites, you might need web scraping software; and so on.
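For instance, extraction from a website source might look roughly like the following Python sketch; it assumes the requests and beautifulsoup4 packages are installed, and the URL and table layout are purely hypothetical.

# Sketch of extracting tabular data from a web page; requests and
# beautifulsoup4 are assumed to be installed, and the URL is hypothetical.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")  # placeholder source
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)
print(rows)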
Source Connection
For databases, you can use SQL queries to retrieve specific data from tables. Documents may require text extraction using OCR or specific document parsers. Most data extraction tools are now code-free, which means all you need to do is drag and drop a connector to connect to any data source, without writing extensive SQL queries or learning a programming language.
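As a rough illustration of the extract step, the following Python sketch pulls rows from a source database with a SQL query into a simple staging list; sqlite3 stands in for the source system, and the orders table and its columns are invented for the example.

# Sketch of the extract step: query a source database and hold the
# result in a staging structure. The in-memory database and the orders
# table are stand-ins for a real source system.
import sqlite3

source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 101, 250.0, "2024-01-05"), (2, 102, 99.5, "2024-02-11")],
)

# Extract: a SQL query pulls the rows of interest into a staging list.
cursor = source.execute(
    "SELECT order_id, customer_id, amount, order_date FROM orders WHERE order_date >= ?",
    ("2024-01-01",),
)
staging = cursor.fetchall()
print(f"extracted {len(staging)} rows into the staging area")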
Once the data is extracted, it often does not comply with the format required by the end destination, or even the format needed for analysis. For example, you could have data in XML or JSON and need to convert it into Excel for analysis. There could be multiple such scenarios, which is why data transformation is essential. Some common transformation tasks include cleaning and validating the data, converting data types, combining data from multiple sources, and creating new, derived fields.
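A minimal pandas sketch of a few such transformation tasks is shown below; pandas is assumed to be installed, and the nested JSON structure and field names are invented for illustration.

# Sketch of common transformation tasks: flattening JSON, converting
# data types, and deriving a new field. Field names are hypothetical.
import pandas as pd

raw = [
    {"id": "1", "customer": {"name": "Ada"}, "amount": "250.0", "date": "2024-01-05"},
    {"id": "2", "customer": {"name": "Bob"}, "amount": "99.5",  "date": "2024-02-11"},
]

df = pd.json_normalize(raw)                      # flatten nested JSON
df["amount"] = df["amount"].astype(float)        # convert data types
df["date"] = pd.to_datetime(df["date"])
df["amount_with_tax"] = df["amount"] * 1.18      # create a new derived field
df.to_csv("orders.csv", index=False)             # hand off for analysis
# (pandas can also write Excel with to_excel if openpyxl is installed)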
The transformed data is then fed into a destination, which varies according to the objective of the data. You could store the data in flat files such as CSV, JSON, or Parquet, or put it in a relational database (e.g., MySQL, PostgreSQL) or a NoSQL database (e.g., MongoDB).
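Continuing the same illustration, the following sketch feeds a transformed table into two destinations, a flat file and a relational database; sqlite3 stands in for MySQL or PostgreSQL, and the file and table names are hypothetical.

# Sketch of loading transformed data into different destinations.
import sqlite3
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "amount": [250.0, 99.5]})

df.to_csv("orders.csv", index=False)              # flat-file destination
# df.to_parquet("orders.parquet")                 # Parquet, if pyarrow is installed

with sqlite3.connect("warehouse.db") as conn:     # relational destination
    df.to_sql("orders", conn, if_exists="append", index=False)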
Deferred Update versus Immediate Update
1. In a deferred update, the changes are not applied immediately to the database; in an immediate update, the changes are applied directly to the database.
2. In a deferred update, the log file contains all the changes that are to be applied to the database; in an immediate update, the log file contains both the old and the new values.
3. In the deferred update method, once a rollback is done all the records of the log file are discarded and no changes are applied to the database; in the immediate update method, once a rollback is done the old values are restored to the database using the records of the log file.
4. The deferred update method uses the concepts of buffering and caching; the immediate update method uses the concept of shadow paging.
5. The major disadvantage of the deferred update method is that it requires a lot of time for recovery in case of a system failure; the major disadvantage of the immediate update method is that there are frequent I/O operations while the transaction is active.
Data loading
What is loading?
Loading is the final step in the ETL process. In this step, the extracted and transformed data is loaded into the target database. To make the data load efficient, it is necessary to index the database and disable the constraints before loading the data. All three steps in the ETL process can be run in parallel. Data extraction takes time, so the transformation of already-extracted data is executed simultaneously; this prepares the data for the third stage, loading. As soon as some data is ready, it is loaded without waiting for the previous steps to be completed.
The loading process is the physical movement of the data from the computer
systems storing the source database(s) to that which will store the data warehouse
database. The entire process of transferring data to a data warehouse repository is
referred to in the following ways:
1. Initial Load: For the very first time loading all the data warehouse
tables.
2. Incremental Load: Periodically applying ongoing changes as per the
requirement. After the data is loaded into the data warehouse database,
verify the referential integrity between the dimensions and the fact
tables to ensure that all records belong to the appropriate records in the
other tables. The DBA must verify that each record in the fact table is
related to one record in each dimension table that will be used in
combination with that fact table.
3. Full Refresh: Deleting the contents of a table and reloading it with
fresh data.
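The following Python sketch illustrates these three load modes against a toy warehouse, together with a simple referential-integrity check between a fact table and a dimension table; sqlite3 stands in for the warehouse, and all table and column names are invented.

# Sketch of initial load, incremental load, and full refresh.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_sales (order_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL)")

def initial_load(rows):
    """First-time population of the fact table."""
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

def incremental_load(changed_rows):
    """Apply ongoing changes: existing rows are replaced, new rows inserted."""
    conn.executemany("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?)", changed_rows)

def full_refresh(rows):
    """Delete the table contents and reload with fresh data."""
    conn.execute("DELETE FROM fact_sales")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

def orphan_fact_rows():
    """Referential-integrity check: fact rows without a matching dimension row."""
    return conn.execute(
        "SELECT f.order_id FROM fact_sales f "
        "LEFT JOIN dim_customer d ON f.customer_key = d.customer_key "
        "WHERE d.customer_key IS NULL"
    ).fetchall()

conn.execute("INSERT INTO dim_customer VALUES (1, 'Ada')")
initial_load([(10, 1, 250.0)])
incremental_load([(10, 1, 275.0), (11, 2, 99.5)])  # order 11 references a missing customer
print("orphan fact rows:", orphan_fact_rows())      # -> [(11,)]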
Refresh versus Update
After the initial load, the data warehouse needs to be maintained and updated and
this can be done by the following two methods:
Update– application of incremental changes in the data sources.
Refresh– complete reloads at specified intervals.
Data Loading-
Data is physically moved to the data warehouse. The loading takes place within a "load window." The tendency is toward near-real-time updates, as data warehouses are increasingly used for operational applications.
Loading the Dimension Tables
The procedure for maintaining the dimension tables includes two functions: the initial loading of the tables and thereafter applying the changes on an ongoing basis. System-generated (surrogate) keys are used in a data warehouse, whereas the records in the source systems carry their own production keys. Therefore, before an initial load or an ongoing load, the production keys must be converted to the system-generated keys in the data warehouse. Another issue is the application of Type 1, Type 2, and Type 3 changes when loading changes to a dimension table; a sketch of how such changes can be applied is given below.
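The sketch below illustrates one way this can be handled with pandas: production keys are kept alongside system-generated surrogate keys, a Type 1 change overwrites an attribute in place, and a Type 2 change expires the current row and adds a new row with a new surrogate key. The column names are hypothetical and the logic is only illustrative.

# Sketch of dimension-table maintenance: surrogate keys plus
# Type 1 and Type 2 change handling. pandas is assumed to be installed.
import pandas as pd

dim_customer = pd.DataFrame({
    "customer_key": [1],            # system-generated surrogate key
    "customer_id": ["C001"],        # production (natural) key from the source
    "city": ["Pune"],
    "is_current": [True],
})

def apply_type1(dim, customer_id, new_city):
    """Type 1: overwrite the attribute in place (no history kept)."""
    dim.loc[dim["customer_id"] == customer_id, "city"] = new_city
    return dim

def apply_type2(dim, customer_id, new_city):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    dim.loc[(dim["customer_id"] == customer_id) & dim["is_current"], "is_current"] = False
    new_row = {
        "customer_key": dim["customer_key"].max() + 1,
        "customer_id": customer_id,
        "city": new_city,
        "is_current": True,
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim_customer = apply_type2(dim_customer, "C001", "Mumbai")  # keep history of the change
print(dim_customer)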
The key in the fact table is the concatenation of the keys from the corresponding dimension tables. For this reason, the dimension records are loaded first. A concatenated key is then created from the keys of the corresponding dimension tables for each fact record.
Data quality is evaluated based on a number of dimensions, which can differ based on the source of
information. These dimensions are used to categorize data quality metrics:
Completeness: This represents the amount of data that is usable or complete. If there is a
high percentage of missing values, it may lead to a biased or misleading analysis if the data is
not representative of a typical data sample.
Uniqueness: This accounts for the amount of duplicate data in a dataset. For example, when
reviewing customer data, you should expect that each customer has a unique customer ID.
Validity: This dimension measures how much data matches the required format for any
business rules. Formatting usually includes metadata, such as valid data types, ranges,
patterns, and more.
Timeliness: This dimension refers to the readiness of the data within an expected time
frame. For example, customers expect to receive an order number immediately after they
have made a purchase, and that data needs to be generated in real-time.
Accuracy: This dimension refers to the correctness of the data values based on the agreed
upon “source of truth.” Since there can be multiple sources which report on the same metric,
it’s important to designate a primary data source; other data sources can be used to confirm
the accuracy of the primary one. For example, tools can check to see that each data source is
trending in the same direction to bolster confidence in data accuracy.
Consistency: This dimension evaluates data records from two different datasets. As
mentioned earlier, multiple sources can be identified to report on a single metric. Using
different sources to check for consistent data trends and behavior allows organizations to
trust any actionable insights from their analyses. This logic can also be applied to
relationships within the data. For example, the number of employees in a department should
not exceed the total number of employees in a company.
Fitness for purpose: Finally, fitness for purpose helps to ensure that the data asset meets a
business need. This dimension can be difficult to evaluate, particularly with new, emerging
datasets.
These metrics help teams conduct data quality assessments across their organizations to evaluate
how informative and useful data is for a given purpose.
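As an illustration, the following pandas sketch computes simple completeness, uniqueness, and validity scores for a small, invented customer table; the columns and the validity rule are placeholders, not a standard set of metrics.

# Sketch of simple data quality metrics for completeness, uniqueness,
# and validity. Column names and rules are illustrative only.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@example.com", None, "b@example", "c@example.com"],
})

completeness = 1 - customers["email"].isna().mean()                # share of non-missing values
uniqueness = customers["customer_id"].nunique() / len(customers)   # duplicate IDs lower this
validity = customers["email"].str.contains(r"@.+\..+", na=False).mean()  # crude format rule

print(f"completeness={completeness:.2f}, uniqueness={uniqueness:.2f}, validity={validity:.2f}")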
Over the last decade, developments within hybrid cloud, artificial intelligence, the Internet of Things
(IoT), and edge computing have led to the exponential growth of big data. As a result, the practice of
master data management (MDM) has become more complex, requiring more data stewards and
rigorous safeguards to ensure good data quality.
Businesses rely on data quality management to support their data analytics initiatives, such as
business intelligence dashboards. Without this, there can be devastating consequences, even ethical
ones, depending on the industry (e.g. healthcare). Data quality solutions exist to help companies
maximize the use of their data, and they have driven key benefits, such as:
Better business decisions: High quality data allows organizations to identify key performance
indicators (KPIs) to measure the performance of various programs, which allows teams to improve or
grow them more effectively. Organizations that prioritize data quality will undoubtedly have an advantage
over their competitors.
Improved business processes: Good data also means that teams can identify where there are
breakdowns in operational workflows. This is particularly true for the supply chain industry, which
relies on real-time data to determine appropriate inventory levels and to track goods after shipment.
Increased customer satisfaction: High data quality provides organizations, particularly marketing and
sales teams, with incredible insight into their target buyers. They are able to integrate different data
across the sales and marketing funnel, which enables them to sell their products more effectively. For
example, the combination of demographic data and web behavior can inform how organizations
create their messaging, invest their marketing budget, or staff their sales teams to service existing or
potential clients.
Duplicate Data
By utilizing deduplication techniques, teams can effectively detect and remove redundant
data, thus improving the overall quality of the dataset. Incorporating these techniques also
ensures that the dataset contains accurate and reliable information for further analysis and
decision-making purposes.
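A minimal pandas sketch of such duplicate detection is shown below; the columns and the normalization rule are invented for illustration.

# Sketch of detecting and removing duplicate records with pandas.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada Lovelace", "ada lovelace ", "Grace Hopper"],
})

# Normalize before comparing so that trivially different spellings match.
normalized = customers["name"].str.strip().str.lower()
duplicates = customers[normalized.duplicated(keep="first")]
print("likely duplicates:\n", duplicates)

deduplicated = customers[~normalized.duplicated(keep="first")]
print("deduplicated dataset:\n", deduplicated)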
Data Standardization
Data cleaning is not solely a technical endeavor but also requires domain
knowledge and expertise. Understanding the context, semantics, and
intricacies of the data is crucial for making informed decisions during the
cleaning process. Collaborating with subject matter experts and domain
specialists can significantly improve the quality and accuracy of data-cleaning
outcomes. Domain knowledge helps in identifying patterns, resolving
ambiguities, and making data-driven decisions specific to the industry or
organization.
In this article, we have examined several common obstacles encountered during data cleaning, such as missing data, inconsistencies, duplicate records, data standardization, scalability, data privacy, and the need for domain expertise.