0% found this document useful (0 votes)
36 views105 pages

Lesson 3 Data Cleaning and Preparation

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 105

Business Analytics with Excel

Data Cleaning and Preparation


Learning Objectives

By the end of this lesson, you will be able to:

Implement sort and filter functionalities to order or filter data

Organize the data using group by and ungroup functions

Execute Remove duplicates function to rid the data of


duplicates

Implement data validation function to a given data


A Day in the Life of Business Analyst

As a business analyst of an organization:

You are required to sort and filter data.Also, improper data needs to be eliminated and data
must be cleaned and be meaningful which serves the business purpose of the organization

To achieve these tasks, you will be learning a few concepts, such as sort, filter, group by,
subtotal and removing duplicates.
Sort and Filter
Sort and Filter

The sort and filter functionalities are available in Excel to order or filter the data for further analysis.
Sort and Filter

Example: The results of nine students for the Maths subject.


Steps to Sort Value

To sort the data based on the CGPA in descending order, choose the CGPA column name and then click on
Sort under the Data tab
Steps to Sort Value

Under sort columns, choose CGPA and then select ‘Largest to Smallest’
Steps to Sort Value

These are the sorted values:


Steps to Sort Value

Sorting can be done based on character values from either A-Z or Z-A.

Any type of data


can be sorted
based on multiple
columns
Steps to Sort Value

Under the Sort by tab, select Name and choose order as A to Z and click OK
Steps to Sort Value

The results will be in the following order.


Filter

Filter option allows us to choose any column we would like to filter the data on.
Steps to Filter Data

Choose a column to filter and in the Data tab, click on Filter


Steps to Filter Data

Example: To view CGPA’s that are greater than or equal to four.


Steps to Filter Data

Choose the greater than or equal to option from the dropdown, and then mention the number

In this case, it is 4.0.


Steps to Filter Data

The result will be in the following order.


Group by and Subtotal
Group by and Ungroup

Group by and ungroup allow data to group data by collapse and expanding rows with similar content to
create more compact and understandable views.

Group by and ungroup by are available under the Data tab within the outline section.
Group By

The group by functionality in Excel allows us to show necessary data for easy viewing and analysis.

It is possible to create subtotals and outline for a given set of data.


Group By

Group by can be done for rows or columns.

Grouping for Rows Grouping for Columns


Steps for Grouping

Let us discuss the steps for grouping data.

Step 1: To group data, select the rows and columns you want to group
Steps for Grouping

Step 2: Click on Group under Data tab


Grouping for Columns

Step 3: Choose Columns and click on OK


Grouping for Columns

Step 4: This groups the three columns chosen, and applies a control to show or hide the grouped content
Grouping for Columns

Clicking on – hides the content, while clicking on + shows the grouped content.
Grouping for Rows

Similarly, for row-wise grouping, select the rows you want to group.
Grouping for Rows

Step 1: Click on Group under the Data tab, and then select rows option from the dialog box

Step 2: Click on OK
Grouping for Rows

Clicking on – hides the content and clicking on + shows the grouped content.
Grouping for Rows

We can create a group within a group by choosing rows or columns within the grouped data.

Create a row or column group again.


Grouping for Rows

This will be the result.


Ungrouping

The ungroup option allows us to remove the groups created by group.

Step 1:
Choose the data already chosen for grouping
(row/column)
Ungrouping

Step 2: Click on Ungroup under the Data tab


Ungrouping

Step 3: Choose Rows to remove row-level grouping


Ungrouping

Step 4: The group chosen will be removed


Subtotal

Subtotal allows us to create groups and have a subtotal for each group.
Subtotal: Example

Let us understand this by taking an example.

For the following data set,


find the total per student by
grouping students and adding their
marks.
Subtotal: Example

Step 1: Select the data we need to group by and subtotal


Subtotal: Example

Step 2: Click on Subtotal under Data tab


Subtotal: Example

Step 3: Click on the column to which the sum function has to be applied
Subtotal: Example

The subtotaling provides control to the group and shows subtotals per student.
Text to Column
Text to Column

It converts raw text into columns in excel, which can save a user the time of manually separating the text
in a cell into several columns.

Name, age, address, phone number, university


Tom Smith, 22,4th street, 8998798901, St Gallen
University

Raw text Text put in excel columns


Text to Column: Example

Step 1: Open Excel and paste the content into a sheet


Text to Column: Example

Step 2: Choose Column A


Text to Column: Example

Step 2:Step
Choose column
3: Go to the AData
and tab
clickand
on click
text to
oncolumns under Data tab.
Text to Columns
Text to Column: Example

Step 4: Select the Delimited


option in the dialog box and click
Next
Text to Column: Example

Step 5: Choose Comma and click Next,


since the delimiter is a comma here.
Text to Column: Example

The text to column function puts each element separated by comma in an individual box
Removing Duplicates
Duplicate

Duplicate refers to a copy of the original.


Removing Duplicates in Excel

In any data analytics work, there will always be cases where we get duplicates in
different columns.

Excel is very handy in removing duplicates in the data.


Causes of Duplicates

Duplicates can occur in data and cause errors in analytics.

Duplicates occur when there is an incorrect submission of user data.


Causes of Duplicates

When there is a missing validation in the data set.


Causes of Duplicates

Duplicates occur when we merge multiple data sources using Joins.


Causes of Duplicates

When data is copy pasted multiple times.

When duplicates are removed using Excel, we can choose a single column or
multiple columns to check the data.
Removing Duplicates Using Single Column: Example

Step 1: Choose the column with a set of rows to remove duplicates

There are many duplicates in this column.


Removing Duplicates Using Single Column: Example

Step 2:
• Select the entire column
• Click on Data
• Click on Remove Duplicates
Removing Duplicates Using Single Column: Example

Step 3: After clicking the option a pop up will appear

Click OK
Removing Duplicates Using Single Column: Example

Step 4: It is clearly visible that all the duplicates are removed


Removing Duplicates Using Multiple Columns: Example

Let us consider the following data set as an example for removing duplicates with
multiple columns:

• Here, there are two entries for Maths


subject under the same name, Albert Dane.
• When removing duplicates for this, only the
first row is retained.
Removing Duplicates Using Multiple Columns: Example

Step 1: Let us choose the data to


remove duplicates
Removing Duplicates Using Multiple Columns: Example

Step 2: Now click on Remove Duplicates from Data tab


Removing Duplicates Using Multiple Columns: Example

A pop up will occur to remove duplicates.

Step 3: Choose the columns where duplicates need to be checked


Removing Duplicates Using Multiple Columns: Example

Step 4: Once it is checked, click OK


Removing Duplicates Using Multiple Columns: Example

Another pop up will appear which notifies that, 1 duplicate value was found and removed,
also 3 unique values remain.
Removing Duplicates Using Multiple Columns: Example

This is the final data set


Data Validation
Data Validation

Data in Excel can be validated using some rules set in data validation dialog.

This helps in reducing the amount of unstandardized data, errors, or irrelevant information in the
worksheet.
Data Validation: Example

Let us understand data validation through an example.

• Choose a cell or a group of cells to validate

• Click on Data Validation under Data tab


Data Validation: Example

It is important to remember that:

• Validation applies to new data entered in the cells where rules are placed.

• Existing data is not validated.


Data Validation: Example

After clicking on the data validation, a pop-up appears regarding the validation criteria
and the following validations are possible.

‘Any value’ allows any alphanumeric value


in the cells.
Data Validation: Example

‘Whole number’ allows whole numbers


and a set of rules including a range of
minimum and maximum to be set.
Data Validation: Example

‘List’ allows only a list of values specified in a


range of cells or written manually in the ‘source’
input box.
Data Validation: Example

‘Date’ allows only dates and a set of rules


including a range of minimum and
maximum to be set.
Data Validation: Example

‘Time’ allows only time values and a set of


rules including a range of minimum and
maximum to be set.
Data Validation: Example

‘Text length’ allows only text within the


specified length and a set of rules on
the length to be set.
Data Validation: Example

‘Custom’ allows custom rules on data to


be set.
Key Takeaways

The sort and filter functionalities are available to order or filter the
data for further analysis.

Group by functionality in Excel allows us to show necessary data for


easy viewing and analysis.

The ungroup option allows us to remove the groups created by


group.

While removing duplicates, we can choose a single column or


multiple columns to check the data.

Data validation applies only to new data entered in the cells


where rules are placed.
Knowledge Check
Knowledge
Check In which of the following sections can we find Group By and Subtotal under the data
tab?
1

a. Sort & Filter

b. Data Tools

c. Outline

d. Analyze
Knowledge
Check In which of the following sections can we find Group By and Subtotal under the data
tab?
1

a. Sort & Filter

b. Data Tools

c. Outline

d. Analyze

The correct answer is c

Outline section under Data tab allows group by and subtotal.


Knowledge
Check
Group By within a Group By is possible. True or False.
2

a. True

b. False

c.

d.
Knowledge
Check
Group By within a Group By is possible. True or False.
2

a. True

b. False

c.

d.

The correct answer is a

True. Group By within a Group By is possible.


Knowledge
Check
Which of the following options can be used for sorting on multiple columns?
3

a. Options

b. Add Level

c. Sort On

d. Order
Knowledge
Check
Which of the following options can be used for sorting on multiple columns?
3

a. Options

b. Add Level

c. Sort On

d. Order

The correct answer is b

Add Level helps to add multiple columns for sorting.


Knowledge
Check
Pattern matching is possible in filters. True or False.
4

a. True

b. False

c.

d.
Knowledge
Check
Pattern matching is possible in filters. True or False.
4

a. True

b. False

c.

d.

The correct answer is a

True. Pattern matching is done using regular expressions such as ? and *.


Knowledge
Check Which of the following options is used to convert text to columns when there is no
delimiters?
5

a. Delimiter

b. Fixed Width

c. Comma

d. Space
Knowledge
Check Which of the following options is used to convert text to columns when there is no
delimiters?
5

a. Delimiter

b. Fixed Width

c. Comma

d. Space

The correct answer is b

Fixed width allows us to convert data into columns based on the length of each column.
Knowledge
Check
How to convert a CSV format data into excel?
6

a. Use text to columns

b. Use remove duplicates

c. Use copy paste to take out each CSV value

d.
Knowledge
Check
How to convert a CSV format data into excel?
6

a. Use text to columns

b. Use remove duplicates

c. Use copy paste to take out each CSV value

d.

The correct answer is a

Text to columns is the easiest way to convert data to columns


Knowledge
Check Is it possible to separate data with multiple delimiters into columns?(Example
7 1,2,3;4,5|6)? True or False.

a. True

b. False

c.

d.
Knowledge
Check Is it possible to separate data with multiple delimiters into columns?(Example
7 1,2,3;4,5|6)? True or False.

a. True

b. False

c.

d.

The correct answer is a

True. Multiple delimiters can be specified in Text to Columns


Knowledge
Check
Why do duplicates occur in a dataset?
8

a. Missing validation

b. Duplicates cannot occur in a dataset

c. Excel has a feature to create duplicates

d.
Knowledge
Check
Why do duplicates occur in a dataset?
8

a. Missing validation

b. Duplicates cannot occur in a dataset

c. Excel has a feature to create duplicates

d.

The correct answer is a

Duplicates occur if the input feed has not validated the data and allowed duplicates.
Knowledge
Check
How do you specify that data has header while removing duplicates?
9

a. Click on "My data has headers" Checkbox

b. Remove headers manually

c. Cannot be specified

d.
Knowledge
Check
How do you specify that data has header while removing duplicates?
9

a. Click on "My data has headers" Checkbox

b. Remove headers manually

c. Cannot be specified

d.

The correct answer is a

The "My data has headers" checkbox specifies that the data has headers
Knowledge
Check Is it possible to remove rows in a dataset where only one row has duplicates? True or
10 False.

a. True

b. False

c.

d.
Knowledge
Check Is it possible to remove rows in a dataset where only one row has duplicates? True or
10 False.

a. True

b. False

c.

d.

The correct answer is a

True. It is possible to remove all rows in a dataset where one column only has duplicates.
Knowledge
Check
Which of the following options in data validation allows us to validate a list of values?
11

a. Any Value

b. Data

c. List

d. Custom
Knowledge
Check
Which of the following options in data validation allows us to validate a list of values?
11

a. Any Value

b. Data

c. List

d. Custom

The correct answer is b

Outline section under Data tab allows group by and subtotal.


Knowledge
Check
Which of the following range of values can be provided in data validation?
12

a. not between

b. equal to

c. greater than

d. between
Knowledge
Check
Which of the following range of values can be provided in data validation?
12

a. not between

b. equal to

c. greater than

d. between

The correct answer is d

Between allows us to set range of values.

You might also like