DM 02 01 Data Understanding
Spring 2010
Data Understanding
Introduction
Outline
Introduction
Measuring the Central Tendency
Measuring the Dispersion of Data
Graphic Displays
References
Introduction
Data Understanding
– To highlight which data values should be treated as noise or
outliers.
Measures
– Central tendency
Mean, median, mode, and midrange
– Data dispersion
Variance, range, quartiles, and interquartile range (IQR)
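As a rough illustration of the central tendency measures listed above (plus the range), the following sketch uses Python's standard `statistics` module. The helper name `summarize` and the sample values are assumptions made for this example, not part of the original slides.

```python
import statistics

def summarize(values):
    """Return basic central tendency measures and the range of `values`."""
    lowest, highest = min(values), max(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # every most-frequent value
        "midrange": (lowest + highest) / 2,     # average of min and max
        "range": highest - lowest,
    }

# Made-up sample: the single large value (40) pulls the mean and the
# midrange upward, while the median and mode stay near the bulk of the data.
sample = [2, 4, 4, 5, 7, 8, 40]
print(summarize(sample))
```

Here the mean is 10 and the midrange is 21, although six of the seven values are below 10; the median (5) and mode (4) are much less affected by the extreme value.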
Measuring the Central Tendency
Mean
Trimmed mean
Disadvantage of mean
– A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values.
– Even a small number of extreme values can corrupt the mean.
Trimmed mean
– The trimmed mean is the mean obtained after cutting off values at
the high and low extremes.
– For example, we can sort the values and remove the top and
bottom 2% before computing the mean.
– We should avoid trimming too large a portion (such as 20%) at
both ends as this can result in the loss of valuable information.
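The trimming procedure described above can be sketched as follows. The function name `trimmed_mean`, its `proportion` parameter, and the sample data are assumptions for illustration; the number of values cut from each end is truncated to a whole number, which is one of several possible conventions.

```python
def trimmed_mean(values, proportion=0.02):
    """Mean after dropping `proportion` of the sorted values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # values removed from EACH end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up data: 99 ordinary values and one extreme value.
data = list(range(1, 100)) + [10_000]
plain_mean = sum(data) / len(data)       # 149.5, distorted by 10_000
robust_mean = trimmed_mean(data, 0.02)   # 50.5, top and bottom 2% removed
print(plain_mean, robust_mean)
```

With 100 values and a 2% trim, two values are removed from each end, so the extreme value no longer influences the result.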
Median
Mode & Midrange
Mean, Median, and Mode
Measuring the Dispersion of Data
Inter-Quartile Range
Five Number Summary
Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e.,
the height of the box is the IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to the minimum
and maximum
– To show outliers, the whiskers are extended to the extreme
low and high observations only if these values are less than
1.5 * IQR beyond the quartiles; otherwise the whiskers stop at
the most extreme observations within 1.5 * IQR of the quartiles,
and the remaining values are plotted individually as outliers.
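The five-number summary and the 1.5 * IQR rule can be sketched with the standard library as below. The function names and sample values are assumptions for illustration, and quartile conventions differ between textbooks and tools; this sketch uses the "inclusive" method of `statistics.quantiles`, so other software may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def boxplot_outliers(values):
    """Values lying more than 1.5 * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Made-up sample with one suspiciously large value.
prices = [2, 4, 5, 7, 8, 9, 11, 12, 40]
print(five_number_summary(prices))   # min, Q1, median, Q3, max
print(boxplot_outliers(prices))      # values a boxplot would draw individually
```

For this sample the quartiles are 5 and 11, so the IQR is 6 and the fences sit at -4 and 20; only the value 40 falls outside them.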
Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.
Variance and Standard Deviation
Variance (σ²)
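The slide's formula is not preserved here, so the following is only a sketch under the usual population definition: the variance is the average squared deviation from the mean, and the standard deviation σ is its square root. The function names and the sample are assumptions for illustration. The sample variance, which divides by N - 1 instead of N, is not shown.

```python
import math
import statistics

def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_std(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))

# Made-up sample with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(sample))    # 4.0
print(population_std(sample))         # 2.0
print(statistics.pvariance(sample))   # standard-library cross-check
```

Because the standard deviation is expressed in the same units as the data, it is usually easier to interpret than the variance.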
Graphic Displays
Histogram Analysis
Example: A histogram
Quantile Plot
Scatter plot
Scatter plot
– One of the most effective graphical methods for determining
whether two numerical attributes appear to be related, and for
spotting clusters of points and outliers.
– Each pair of values is treated as a pair of coordinates
and plotted as a point in the plane.
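A minimal sketch of how the points of a scatter plot are formed: the two attributes are paired value by value into (x, y) coordinates. The function names, labels, and data are assumptions for illustration. Drawing the plot relies on the third-party matplotlib library, which is imported only inside `plot_scatter` so that the pairing step runs without it.

```python
def scatter_points(xs, ys):
    """Pair two equal-length attributes into (x, y) coordinates."""
    if len(xs) != len(ys):
        raise ValueError("both attributes need the same number of values")
    return list(zip(xs, ys))

def plot_scatter(xs, ys, x_label="x", y_label="y"):
    """Draw the points with matplotlib (assumed to be installed)."""
    import matplotlib.pyplot as plt   # third-party, imported lazily
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    return fig

# Made-up attributes: unit price and number of items sold.
unit_price = [40, 43, 47, 74, 75, 78]
items_sold = [275, 300, 250, 360, 515, 540]
print(scatter_points(unit_price, items_sold))
```

Calling `plot_scatter(unit_price, items_sold, "unit price", "items sold")` would return a figure in which any trend, cluster, or isolated point can be inspected visually.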
Loess Curve
References