Slide for Chapter 2
Slide for Chapter 2
CHAPTER 02
Mastering Data
Objectives
1
11/03/2024
Contents
2
11/03/2024
3
11/03/2024
4
11/03/2024
10
5
11/03/2024
11
12
6
11/03/2024
The ETL process begins with identifying which data you need and is
complete when the clean data are loaded in the appropriate format
into the tool to be used for analysis. The Requesting data is an iterative
practice involving 5 steps:
• Extract
Step 1: Determine the purpose and scope of the data request.
Step 2: Obtain the data.
• Transform
Step 3: Validate the data for completeness and integrity.
Step 4: Clean the data.
• Load
Step 5: Load the data for data analysis.
14
7
11/03/2024
Extract
Extract
16
8
11/03/2024
Request Date
Number Received Date Date
Completed Provided
Assigned
Received by
to
Initial review comments (discussion with client— Revisions
revisions required? agreement to proceed? etc.) Required
9
11/03/2024
19
Transform
20
10
11/03/2024
Transform
21
Knowledge check
22
11
11/03/2024
23
24
12
11/03/2024
25
26
13
11/03/2024
27
28
14
11/03/2024
Load
29
15
11/03/2024
Chapter 2 Summary
• The first step in the IMPACT cycle is to identify the • Once you have the data, they will need to be validated
questions that you intend to answer through your data for completeness and integrity—that is, you will need to
analysis project. Once a data analysis problem or question ensure that all of the data you need were extracted, and
has been identified, the next step in the IMPACT cycle is that all data are correct. Sometimes when data are
mastering the data, which can be broken down to mean extracted, some formatting or sometimes even entire
obtaining the data needed and preparing it for analysis. records will get lost, resulting in inaccuracies. Correcting
the errors and cleaning the data is an integral step in
• In order to obtain the right data, it is important to have a mastering the data.
firm grasp of what data are available to you and how that
information is stored. • Finally, after the data have been cleaned, there may be
• Data are often stored in a relational database, which one last step of mastering the data, which is to load
helps to ensure that an organization’s data are them into the tool that will be used for analysis. Often,
complete and to avoid redundancy. Relational the cleaning and correcting of data occur in Excel and
databases are made up of tables with uniquely the analysis will also be done in Excel. In this case, there
identified records (this is done through primary keys) is no need to load the data elsewhere. However, if you
and are related through the usage of foreign keys. intend to do more rigorous statistical analysis than Excel
provides, or if you intend to do more robust data
• To obtain the data, you will either have access to extract the visualization than can be done in Excel, it may be
data yourself or you will need to request the data from a necessary to load the data into another tool following
database administrator or the information systems team. If the transformation process.
the latter is the case, you will complete a data request form,
indicating exactly which data you need and why.
31
Problems
• P1
• P2
• P3
• P4
• P5
• P6
32
16