DM 02 01 Data Understanding

This document provides an overview of data understanding and preparation techniques. It discusses measuring central tendency through the mean, median, mode, and other measures. It also covers measuring data dispersion using the range, variance, standard deviation, and other statistical techniques. Finally, it explores graphic displays, such as histograms, quantile plots, scatter plots, and loess curves, that can be used to visualize and better understand datasets.

Uploaded by

Pallavi Bharti


Data Mining

Part 2. Data Understanding and Preparation

2.1 Data Understanding

Spring 2010

Instructor: Dr. Masoud Yaghini

Data Understanding
Outline

Introduction
Measuring the Central Tendency
Measuring the Dispersion of Data
Graphic Displays
References

Data Understanding
Introduction

Data Understanding
– To highlight which data values should be treated as noise or
outliers.
Measures
– Central tendency
Mean, median, mode, and midrange
– Data dispersion
Variance, range, quartiles, and interquartile range (IQR)

Data Understanding
Introduction

Such measures have been studied extensively in the
statistical literature.
From the data mining point of view, we need to
examine how they can be computed efficiently in large
databases.

Data Understanding
Measuring the Central Tendency

Measures of Central tendency:

– Mean
– Weighted mean
– Trimmed mean
– Median
– Mode
– Midrange

Data Understanding
Mean

Mean: The most common and most effective numerical measure
of the “center” of a set of data is the (arithmetic) mean: the sum
of the values divided by the number of values. It can be computed
for a sample or for an entire population.

Weighted (arithmetic) mean: Sometimes each value in a set
is associated with a weight. The weights reflect the
significance, importance, or occurrence frequency attached to
their respective values.
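
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the sample values are assumptions chosen for the example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Illustrative data only (not taken from the slides).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))                            # 58.0
print(weighted_mean([10, 20, 30], [1, 1, 2]))    # 22.5
```

With all weights equal, the weighted mean reduces to the ordinary mean.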

Data Understanding
Trimmed mean

Disadvantage of mean
– A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values.
– Even a small number of extreme values can corrupt the mean.
Trimmed mean
– The trimmed mean is the mean obtained after cutting off values at
the high and low extremes.
– For example, we can sort the values and remove the top and
bottom 2% before computing the mean.
– We should avoid trimming too large a portion (such as 20%) at
both ends as this can result in the loss of valuable information.
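
A trimmed mean along these lines might look as follows. This is a sketch under stated assumptions: the function name and the integer `percent` parameter (the percentage removed from each end) are hypothetical, and the data are invented to show the effect of one extreme value.

```python
def trimmed_mean(values, percent=2):
    """Mean after dropping `percent` percent of the values at each end.

    `percent` is an integer percentage (assumed interface); the number of
    values dropped from each end is rounded down.
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    ordered = sorted(values)
    k = len(ordered) * percent // 100      # values removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# 49 ordinary values plus one extreme value (illustrative data).
data = list(range(1, 50)) + [1000]
print(sum(data) / len(data))       # 44.5, pulled up by the outlier
print(trimmed_mean(data, 2))       # 25.5, one value removed from each end
```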

Data Understanding
Median

Suppose that a given data set of N distinct values is
sorted in numerical order.
The median is the middle value if N is odd, or the
average of the middle two values otherwise.
For skewed (asymmetric) data, a better measure of the
center of data is the median.
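
The odd/even rule can be written directly. The sketch below is an illustration with an assumed function name and invented data; it also shows how a single large value moves the mean more than the median.

```python
def median(values):
    """Middle value for an odd count, average of the two middle values otherwise."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Illustrative, right-skewed data (not taken from the slides).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(median([5, 1, 3]))                 # 3
print(median(salaries))                  # 54.0
print(sum(salaries) / len(salaries))     # 58.0, above the median
```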

Data Understanding
Mode & Midrange

Mode is another measure of central tendency

– The mode for a set of data is the value that occurs most
frequently in the set.
– If each data value occurs only once, then there is no mode.
The midrange can also be used to assess the central
tendency of a data set
– It is the average of the largest and smallest values in the set.
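
Both measures are short to compute. In the sketch below the function names are assumptions, and returning a list of modes (empty when every value occurs once) is a design choice made for this example so that data sets with several modes are handled too.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); empty list if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # no mode
    return sorted(v for v, c in counts.items() if c == highest)


def midrange(values):
    """Average of the largest and smallest values."""
    return (min(values) + max(values)) / 2


# Illustrative data only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(modes(salaries))       # [52, 70]
print(modes([1, 2, 3]))      # []
print(midrange(salaries))    # 70.0
```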

Data Understanding
Mean, Median, and Mode

[Figure: mean, median, and mode of (a) symmetric, (b) positively
skewed, and (c) negatively skewed data.]

In positively skewed data (b), the mode is smaller than the median.
In negatively skewed data (c), the mode is greater than the median.

Data Understanding
Measuring the Dispersion of Data

The degree to which numerical data tend to spread is
called the dispersion, or variance of the data.
The measures of data dispersion:
– Range
– Five-number summary (based on quartiles)
– Interquartile range (IQR)
– Standard deviation
Range
– difference between highest and lowest observed values

Data Understanding
Inter-Quartile Range

For the remainder of this section, let’s assume that the
data are sorted in increasing numerical order.
The kth percentile of a set of data in numerical order
is the value x_i having the property that k percent of
the data entries lie at or below x_i.
– The median (discussed in the previous subsection) is the
50th percentile.
Quartiles:
– First quartile (Q1): The first quartile is the value where
25% of the values are smaller than Q1 and 75% are larger.
– Third quartile (Q3): The third quartile is the value where
75% of the values are smaller than Q3 and 25% are larger.
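
One simple way to turn the percentile definition into code is the nearest-rank rule: take the smallest value that has at least k percent of the data at or below it. This is only one of several conventions in use (others interpolate between neighbouring values, so results can differ slightly, for example for the median of an even-sized set). The function name and data are assumptions for illustration.

```python
import math


def percentile(values, k):
    """Nearest-rank kth percentile (0 < k <= 100).

    Returns the smallest value such that at least k percent of the
    data lie at or below it.
    """
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(k * len(ordered) / 100)   # 1-based position
    return ordered[rank - 1]


data = list(range(1, 11))       # 1, 2, ..., 10 (illustrative)
print(percentile(data, 25))     # 3  (Q1 under this convention)
print(percentile(data, 50))     # 5
print(percentile(data, 75))     # 8  (Q3 under this convention)
```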
Data Understanding
Inter-Quartile Range

Inter-quartile range (IQR)

– IQR = Q3 – Q1
– IQR is a simple measure of spread that gives the range
covered by the middle half of the data
Outlier
– Usually, a value falling at least 1.5 * IQR above the third
quartile or below the first quartile.
Five number summary
– min, Q1, Median, Q3, max
– Contains information about the endpoints (e.g., tails) of the
data
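
The five-number summary and the 1.5 * IQR outlier rule can be sketched together. The quartile convention used here (Q1 and Q3 as the medians of the lower and upper halves of the sorted data) is an assumption, since the slides do not fix one; the function names and data are also illustrative. The sketch expects at least two values.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # values below the median position
    upper = ordered[(n + 1) // 2:]       # values above the median position
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])


def iqr_outliers(values):
    """Values at least 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    reach = 1.5 * (q3 - q1)
    return [v for v in values if v <= q1 - reach or v >= q3 + reach]


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))   # (30, 48.5, 54.0, 66.5, 110)
print(iqr_outliers(salaries))          # [110]
```

Here IQR = 66.5 - 48.5 = 18, so the cut-offs are 21.5 and 93.5, and only 110 is flagged.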

Data Understanding
Five Number Summary

Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e.,
the height of the box is the IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to the minimum
and maximum
– To show outliers, the whiskers are extended to the extreme
low and high observations only if these values are less than
1.5 * IQR beyond the quartiles. Otherwise, the whiskers end at
the most extreme observations within 1.5 * IQR of the
quartiles, and the remaining values are plotted individually.
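
The whisker rule can be expressed as a small helper that takes the quartiles as inputs. This is a sketch: the function name is hypothetical, and the quartile values passed in the example are assumed to have been computed separately for the illustrative data.

```python
def whisker_ends(values, q1, q3):
    """Most extreme observations lying less than 1.5 * IQR beyond the quartiles.

    Values outside this span would be drawn as individual outlier points.
    """
    reach = 1.5 * (q3 - q1)
    inside = [v for v in values if q1 - reach < v < q3 + reach]
    if not inside:                 # degenerate case, e.g. IQR of zero
        return q1, q3
    return min(inside), max(inside)


# Illustrative data with quartiles Q1 = 48.5 and Q3 = 66.5 (assumed given).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(whisker_ends(salaries, 48.5, 66.5))   # (30, 70); 110 is plotted separately
```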

Data Understanding
Five Number Summary

Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.

Data Understanding
Variance and Standard Deviation

Variance (σ²)
– is the average of the squared deviations of the
observations from the mean

Standard deviation (σ)

– is the square root of the variance σ²
– σ measures spread about the mean and should be used only
when the mean is chosen as the measure of center.
– σ = 0 only when there is no spread, that is, when all
observations have the same value.
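
The formula from this slide did not survive extraction, so the sketch below assumes the population form, dividing by N (a sample estimate would divide by N - 1 instead). Function names and data are illustrative.

```python
import math


def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def std_dev(values):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative data, mean 5
print(variance(data))                # 4.0
print(std_dev(data))                 # 2.0
print(std_dev([3, 3, 3]))            # 0.0, no spread
```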

Data Understanding
Graphic Displays

There are many types of graphs for the display of data
summaries and distributions, such as:
– Bar charts
– Pie charts
– Line graphs
– Boxplot
– Histograms
– Quantile plots
– Scatter plots
– Loess curves

Data Understanding
Histogram Analysis

Histograms or frequency histograms

– A univariate graphical method
– Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data
– If the attribute A is categorical, then one rectangle is drawn
for each known value of A, and the resulting graph is more
commonly referred to as a bar chart.
– If the attribute A is numeric, the term histogram is preferred.
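
For a numeric attribute, the counts behind a histogram can be computed by assigning each value to an equal-width bin. The sketch below is illustrative only: the function name, the `bin_width` and `start` parameters, and the price values are assumptions, not data from the slides.

```python
def histogram_counts(values, bin_width, start):
    """Count values per equal-width bin; keys are the left edges of the bins."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)    # which bin v falls into
        left_edge = start + index * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


# Illustrative unit prices, binned as [40, 60), [60, 80), ...
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]
print(histogram_counts(prices, bin_width=20, start=40))
# {40: 3, 60: 3, 100: 2, 120: 1}
```

Empty bins (here the one starting at 80) simply do not appear in the result. For a categorical attribute, `collections.Counter(values)` gives the bar-chart counts directly.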

Data Understanding
Histogram Analysis

Example: A set of unit price data for items sold at a
branch of AllElectronics

Data Understanding
Histogram Analysis

Example: A histogram

Data Understanding
Quantile Plot

A quantile plot is a simple and effective way to have a
first look at a univariate data distribution.
Plots quantile information
– For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i% of the data are below or equal to the
value x_i
Note that
– the 0.25 quantile corresponds to quartile Q1,
– the 0.50 quantile is the median, and
– the 0.75 quantile is Q3.
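
The points of a quantile plot can be generated by pairing each sorted value x_i with a fraction f_i. The slides do not give the formula for f_i; the sketch below assumes the common choice f_i = (i - 0.5) / N, and the function name and data are illustrative.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N for sorted data.

    The f_i formula is an assumed convention; plotting x_i against f_i
    gives the quantile plot.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


print(quantile_points([30, 10, 20, 40]))
# [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```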

Data Understanding
Quantile Plot

A quantile plot for the unit price data of AllElectronics.

Data Understanding
Scatter plot

Scatter plot
– is one of the most effective graphical methods for
determining if there appears to be a relationship, clusters
of points, or outliers between two numerical attributes.
Each pair of values is treated as a pair of coordinates
and plotted as points in the plane

Data Understanding
Scatter plot

A scatter plot for the data set of AllElectronics.

Data Understanding
Scatter plot

Scatter plots can be used to find (a) positive or (b)
negative correlations between attributes.
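
A scatter plot shows the direction of a relationship visually. As a numeric companion, and not something the slides themselves present, the Pearson correlation coefficient gives a value near +1 for a positive linear relationship and near -1 for a negative one. The sketch below uses an assumed function name and invented data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)    # undefined if either attribute is constant


xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]))   # close to +1 (positive correlation)
print(pearson_r(xs, [8, 6, 4, 2]))   # close to -1 (negative correlation)
```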

Data Understanding
Scatter plot

Three cases where there is no observed correlation between the
two plotted attributes in each of the data sets.

Data Understanding
Loess Curve

Adds a smooth curve to a scatter plot in order to
provide better perception of the pattern of dependence.
The word loess is short for local regression.
Loess curve is fitted by setting two parameters:
– a smoothing parameter, and
– the degree of the polynomials that are fitted by the
regression
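
To make the idea of local regression concrete, the sketch below evaluates a simplified loess fit at a single point. It rests on several assumptions that the slides do not specify: the smoothing parameter is taken to be the fraction of points used in each local fit (`span`), the polynomial degree is fixed at 1 (local linear), and neighbours are weighted with the tricube function. The function name is hypothetical.

```python
import math


def loess_point(xs, ys, x0, span=0.5):
    """Local linear fit at x0 using the nearest span * N points (tricube weights)."""
    n = len(xs)
    k = min(n, max(3, math.ceil(span * n)))          # neighbourhood size
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    width = max(abs(xs[i] - x0) for i in nearest)
    if width == 0:                                   # all neighbours sit at x0
        return sum(ys[i] for i in nearest) / k

    sw = swx = swy = swxx = swxy = 0.0
    for i in nearest:
        w = (1 - (abs(xs[i] - x0) / width) ** 3) ** 3    # tricube weight
        sw += w
        swx += w * xs[i]
        swy += w * ys[i]
        swxx += w * xs[i] ** 2
        swxy += w * xs[i] * ys[i]

    denom = sw * swxx - swx ** 2
    if sw == 0 or abs(denom) < 1e-12:                # fit not determined
        return sum(ys[i] for i in nearest) / k
    slope = (sw * swxy - swx * swy) / denom          # weighted least squares
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0


# Illustrative data on an exact line y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
curve = [loess_point(xs, ys, x) for x in xs]         # smoothed value at each x
print(round(loess_point(xs, ys, 4.5), 6))            # 10.0
```

A larger `span` uses more neighbours per fit and gives a smoother curve; a smaller one follows the data more closely.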

Data Understanding
Loess Curve

A loess curve for the data set of AllElectronics

Data Understanding
References

J. Han, M. Kamber, Data Mining: Concepts and
Techniques, Elsevier Inc. (2006). (Chapter 2)

Data Understanding
The end
