Unit 3
Data discrimination produces discrimination rules: a comparison of the general features of data objects in a target class
with the general features of objects from one or a set of contrasting classes. The user can define the target and
contrasting classes. The methods used for data discrimination are very similar to those used for data characterization,
except that data discrimination results include comparative measures.
Reasons for attribute relevance analysis
• It helps decide which dimensions must be included.
• It reduces the number of attributes, which makes the resulting patterns easier to read.
The basic idea behind attribute relevance analysis is to compute a measure that quantifies the relevance of an
attribute with respect to a given class or concept. Such measures include information gain, ambiguity, and the
correlation coefficient.
Attribute relevance analysis for concept description is implemented as follows −
Data collection − Collect data for both the target class and the contrasting class by query processing.
Preliminary relevance analysis using conservative AOI − This step identifies a set of dimensions and attributes to
which the selected relevance measure will be applied. Attribute-oriented induction (AOI) can perform a preliminary
analysis on the data by removing attributes that have a very large number of distinct values. To be conservative, the
AOI should use attribute generalization thresholds that are set reasonably large, so that more attributes remain for
further relevance analysis by the selected measure.
Remove − Remove irrelevant and weakly relevant attributes using the selected relevance analysis measure.
Generate the concept description using AOI − Perform AOI using a less conservative set of attribute generalization
thresholds. If the descriptive mining function is class characterization, only the original target class working
relation is included at this point.
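The "Remove" step can be sketched with information gain as the relevance measure. This is a minimal illustration only: the toy rows, the attribute names (major, gender), the class labels, and the 0.1 threshold are all assumptions made up for the example, not values from these notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, class_key):
    """Class entropy minus the weighted class entropy after splitting on `attribute`."""
    base = entropy([r[class_key] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[class_key])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder

# Hypothetical toy data: a target class and a contrasting class.
rows = [
    {"major": "science", "gender": "M", "cls": "target"},
    {"major": "science", "gender": "F", "cls": "target"},
    {"major": "arts", "gender": "M", "cls": "contrasting"},
    {"major": "arts", "gender": "F", "cls": "contrasting"},
]

gains = {a: information_gain(rows, a, "cls") for a in ("major", "gender")}

# Keep only attributes whose gain exceeds an (assumed) relevance threshold.
THRESHOLD = 0.1
relevant = [a for a, g in gains.items() if g > THRESHOLD]
print(gains, relevant)
```

In this toy data, major separates the two classes perfectly while gender carries no information about the class, so only major survives the threshold.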
Distributions and central tendency
A dataset is a distribution of n number of scores or values.
Normal distribution
In a normal distribution, data is symmetrically distributed with no skew. Most values cluster around a central region, with
values tapering off as they go further away from the center. The mean, mode and median are exactly the same in a normal
distribution.
The representative value of a data set, generally the central value or the most frequently occurring value, that gives a
general idea of the whole data set is called a measure of central tendency.
Measures of central tendency help to find the middle, or the average, of a dataset. The three most common measures of
central tendency are the mode, median, and mean.
• Mean
• Median
• Mode
Mode
The mode is the most frequently occurring value in the dataset. It’s possible to have no mode, one
mode, or more than one mode. To find the mode, sort your dataset numerically or categorically and
select the response that occurs most frequently.
Mode=Liberal
When to use the mode
The mode is most applicable to data from a nominal level of measurement. Nominal data is classified into
mutually exclusive categories, so the mode tells you the most popular category.
For continuous variables or ratio levels of measurement, the mode may not be a helpful measure of central
tendency. That’s because there are many more possible values than there are in a nominal or ordinal level of
measurement. It’s unlikely for a value to repeat in a ratio level of measurement.
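Finding the mode amounts to counting how often each value occurs. The sketch below assumes a convention that a dataset in which no value repeats has no mode; the list of responses is hypothetical and the function name `modes` is chosen only for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # assumed convention: no repeated value means no mode
    return [v for v, c in counts.items() if c == top]

# Hypothetical nominal data (category labels).
responses = ["Liberal", "Moderate", "Liberal", "Conservative", "Liberal"]
print(modes(responses))     # one mode
print(modes([1, 1, 2, 2]))  # more than one mode
print(modes([1, 2, 3]))     # no mode
```

Because the function only counts labels, it works for nominal categories as well as for numbers.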
Median
The median of a dataset is the value that’s exactly in the middle when it is ordered from low to high.
Participant   1        2      3      4      5        6      7
Speed         Medium   Slow   Fast   Fast   Medium   Fast   Slow
To find the median, you first order all values from low to high. Then, you find the value in the middle of the ordered
dataset—in this case, the value in the 4th position.
Ordered dataset: Slow, Slow, Medium, Medium, Fast, Fast, Fast
Median: Medium
Median of an odd-numbered dataset
For an odd-numbered dataset, find the value that lies at the (n+1)/2 position, where n is the number of values in the
dataset.
Example
You measure the reaction times in milliseconds of 5 participants and order the dataset.
Median position = (5 + 1)/2 = 3
That means the median is the 3rd value in your ordered dataset.
Median of an even-numbered dataset
For an even-numbered dataset, find the two values in the middle of the dataset: the values at the n/2 and
(n/2)+1 positions. Then, find their mean.
Example
You measure the reaction times of 6 participants and order the dataset.
The middle positions are calculated using n/2 and (n/2)+1, where n = 6.
6/2 = 3
(6/2) + 1 = 4
That means the middle values are the 3rd value, which is 345, and the 4th value, which is 357.
To get the median, take the mean of the 2 middle values by adding them together and dividing by 2.
(345+357)/2=351
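The odd and even rules above can be combined into one function. This is a minimal sketch for numeric data; the six reaction times are made-up values, chosen only so that the two middle values are 345 and 357 as in the example.

```python
def median(values):
    """Median of numeric data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # 0-based index n // 2 is the (n + 1) / 2 position when counting from 1
        return ordered[mid]
    # 0-based indices mid - 1 and mid are the n/2 and (n/2) + 1 positions
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical reaction times in milliseconds (only 345 and 357 come from the example).
reaction_times = [357, 287, 400, 345, 298, 380]
print(median(reaction_times))  # 351.0
```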
Mean
In general usage, "mean" refers to the arithmetic mean of the data. There are also the geometric mean and the
harmonic mean, which are calculated using different formulas.
Mean for Ungrouped Data
The arithmetic mean $\bar{x}$ is defined as the sum of the individual observations $x_i$ divided by the total number of
observations $N$:
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean is given by
⇒ mean = (27 + 11 + 17 + 19 + 21) ÷ 5
⇒ mean = 95 ÷ 5
⇒ mean = 19
Mean for Grouped Data
For grouped data, the mean is defined as the sum of the products of the observations $x_i$ and their corresponding
frequencies $f_i$, divided by the sum of all the frequencies:
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
Example: If the values (xi) of the observations and their frequencies (fi) are given as follows:
xi   4    6    15   10   9
fi   5    10   8    7    10
Mean = (4×5 + 6×10 + 15×8 + 10×7 + 9×10) ÷ (5 + 10 + 8 + 7 + 10)
= 360 ÷ 40
= 9
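Both mean calculations can be checked with a few lines of Python. This is a minimal sketch using the numbers from the two examples above; the function names are chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean of ungrouped data: sum of observations / number of observations."""
    return sum(values) / len(values)

def grouped_mean(xs, fs):
    """Mean of grouped data: sum of x_i * f_i divided by the sum of the frequencies f_i."""
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)

observations = [27, 11, 17, 19, 21]
xs = [4, 6, 15, 10, 9]   # values x_i
fs = [5, 10, 8, 7, 10]   # frequencies f_i

print(mean(observations))    # 19.0
print(grouped_mean(xs, fs))  # 9.0
```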
Measures of Dispersion
• Which of the distributions of scores has the larger dispersion?
[Figure: two distributions of scores, one above the other; horizontal axis 1 to 10, vertical axis 0 to 125]
• The upper distribution has more dispersion because the scores are more spread out
• That is, they are less similar to each other
Measures of dispersion are numbers that describe how scattered the data are. They summarize different aspects of how
the data are spread. There are various measures of dispersion, including:
• Standard Deviation
• Mean Deviation
• Quartile Deviation
• Variance
• Range, etc
Dispersion in the general sense is the state of scattering. Suppose we have to study data with thousands of values;
we then need a few parameters that summarize how widely the data are spread. These parameters are called measures
of dispersion.
Example of Measures of Dispersion
We can understand measures of dispersion through the following example. Suppose there are 10 students in a class and
the marks they scored in a Mathematics test are 12, 14, 18, 9, 11, 7, 9, 16, 19, and 20 out of 20. Then the average mark
scored by the students in the class is
Mean = (12 + 14 + 18 + 9 + 11 + 7 + 9 + 16 + 19 + 20)/10 = 135/10 = 13.5
Mean Deviation = {|12-13.5| + |14-13.5| + |18-13.5| + |9-13.5| + |11-13.5| + |7-13.5| + |9-13.5| + |16-13.5| +
|19-13.5| + |20-13.5|}/10 = 39/10 = 3.9
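The mean deviation of the ten marks can be recomputed directly, which is a useful check on the hand arithmetic. A minimal sketch; the function name is chosen only for illustration.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

marks = [12, 14, 18, 9, 11, 7, 9, 16, 19, 20]
print(sum(marks) / len(marks))  # mean of the marks
print(mean_deviation(marks))    # mean deviation about the mean
```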
Types of Measures of Dispersion
Measures of dispersion can be classified into two categories shown below:
• Absolute Measures of Dispersion
• Relative Measures of Dispersion
Absolute Measures of Dispersion
These measures of dispersion are expressed in the units of the data themselves, for example meters, dollars, or
kilograms. Some absolute measures of dispersion are:
• Range
• Mean Deviation
• Standard Deviation
• Variance
• Quartile Deviation
• Interquartile Range
The Range
• The range is defined as the difference between the largest score in the set of data and the smallest score in the
set of data, $X_L - X_S$
• What is the range of the following data:
4 8 1 6 6 2 9 3 6 9
• The largest score ($X_L$) is 9; the smallest score ($X_S$) is 1; the range is $X_L - X_S$ = 9 - 1 = 8
Range for Grouped Data
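For grouped data, the range is the difference between the upper limit of the highest class and the lower limit of the
lowest class, as the following example shows.
Example: Find the range for the following frequency distribution table of the marks scored by class 10 students.
Marks    Number of students
0-10     5
10-20    8
20-30    15
30-40    9
Range = upper limit of the highest class - lower limit of the lowest class = 40 - 0 = 40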
Sample standard deviation
When you collect data from a sample, the sample standard deviation is used to make estimates
or inferences about the population standard deviation.
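The sample standard deviation formula looks like this:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
where $x_i$ are the sample values, $\bar{x}$ is the sample mean, and $n$ is the sample size.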
Solution:
S.D = 15.5
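The sample standard deviation divides by n - 1 rather than n. The sketch below writes the calculation out by hand and compares it with `statistics.stdev` from the Python standard library, which uses the same n - 1 denominator. The data are the ten scores from the range example, reused here only as an illustration; they are not the dataset behind the result quoted above.

```python
from math import sqrt
from statistics import stdev

def sample_std(values):
    """Sample standard deviation: sqrt of the summed squared deviations over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    xbar = sum(values) / n
    return sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
print(sample_std(scores))  # hand-written formula
print(stdev(scores))       # standard-library equivalent
```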
Mean Deviation
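Range as a measure of dispersion depends only on the highest and the lowest values in the data. Mean deviation, on
the other hand, measures the deviation of the observations from the mean of the distribution. Since the mean is the
central value of the data, some deviations are positive and some are negative. If they are added as they are,
their sum reveals little because they cancel each other out. For example, the scores 1, 2, 3 have mean 2 and
deviations -1, 0, +1, which sum to 0. Mean deviation therefore averages the absolute values of the deviations.
The variance formula is:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
where $\mu$ is the mean of the scores and $N$ is the number of scores.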
What Does the Variance Formula Mean?
• First, it says to subtract the mean from each of the scores
• This difference is called a deviate or a deviation score
• The deviate tells us how far a given score is from the typical, or average, score
• Thus, the deviate is a measure of dispersion for a given score
• Why can’t we simply take the average of the deviates? That is, why isn’t variance defined as:
$$\sigma^2 = \frac{\sum (X - \mu)}{N}$$
This is not the formula for variance!
• One of the definitions of the mean was that it always made the sum
of the scores minus the mean equal to 0
• Thus, the average of the deviates must be 0 since the sum of the
deviates must equal 0
• To avoid this problem, statisticians square the deviate score prior to
averaging them
• Squaring the deviate score makes all the squared scores positive
• Variance is the mean of the squared deviation scores
• The larger the variance is, the more the scores deviate, on average,
away from the mean
• The smaller the variance is, the less the scores deviate, on average,
from the mean
Standard Deviation
• Standard deviation = $\sqrt{\text{variance}}$
• Variance = (standard deviation)$^2$
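The reasoning about deviates can be followed step by step in code. This minimal sketch uses the ten scores from the range example and the population form of the variance (dividing by N); the variable names are chosen only for illustration.

```python
from math import sqrt

scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
n = len(scores)
mu = sum(scores) / n

# Step 1: subtract the mean from each score to get the deviates.
deviates = [x - mu for x in scores]

# Averaging the raw deviates is useless: they always sum to 0 (up to rounding).
mean_of_deviates = sum(deviates) / n

# Step 2: square each deviate so every term is positive, then average.
variance = sum(d ** 2 for d in deviates) / n

# Step 3: the standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print(mean_of_deviates, variance, std_dev)
```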
Quartile Deviation & interquartile
• Quartile deviation is a statistic that measures the spread of the middle of the data, which helps to understand how
the data are distributed.
• There are three quartiles, Q1, Q2, and Q3, which divide the ordered data into four quarters. The median of the data
is the second quartile Q2. The first quartile Q1 is the median of the first half of the data, and the third
quartile Q3 is the median of the second half of the data.
• Quartile deviation is the dispersion in the middle of the data. The difference between the third quartile Q3 and
the first quartile Q1 is called the interquartile range, and half of the interquartile range is called the quartile
deviation. The quartile deviation is also referred to as the semi-interquartile range.
Example 1: Find the quartile deviation for the following given data.
23, 8, 5, 16, 33, 7, 24, 5, 30, 33, 37, 30, 9, 11, 26, 32
Ordered data: 5, 5, 7, 8, 9, 11, 16, 23, 24, 26, 30, 30, 32, 33, 33, 37
From the ordered data we have Q1 = (8 + 9)/2 = 17/2 = 8.5, and Q3 = (30 + 32)/2 = 62/2 = 31
Quartile Deviation = (Q3 − Q1)/2 = (31 − 8.5)/2
= 22.5/2
= 11.25
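The example can be reproduced with the "median of each half" definition of the quartiles used above. This is a minimal sketch; note that library routines such as `statistics.quantiles` use interpolation rules that can give different quartile values, so the calculation is written out by hand here. The helper names are chosen only for illustration.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 with Q1 and Q3 taken as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return _median(lower), _median(ordered), _median(upper)

data = [23, 8, 5, 16, 33, 7, 24, 5, 30, 33, 37, 30, 9, 11, 26, 32]
q1, q2, q3 = quartiles(data)
interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2
print(q1, q3, interquartile_range, quartile_deviation)  # 8.5 31.0 22.5 11.25
```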
Relative Measures of Dispersion
When we have to compare two quantities that have different units, we use relative measures of dispersion to get a
better idea of the scatter of the data. Various relative measures of dispersion are:
• Coefficient of Range: The coefficient of range is defined as the ratio of the difference between the highest
and lowest value in a data set to the sum of the highest and lowest value.
• Coefficient of Variation: The coefficient of Variation is defined as the ratio of the standard deviation to the
mean of the data set. We use percentages to express the coefficient of variation.
• Coefficient of Mean Deviation: The coefficient of the Mean Deviation is defined as the ratio of the mean
deviation to the value of the central point of the data set.
• Coefficient of Quartile Deviation: The coefficient of the Quartile Deviation is defined as the ratio of the
difference between the third quartile and the first quartile to the sum of the third and first quartiles.
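Three of these coefficients can be sketched as small functions. The coefficient of variation below uses the population standard deviation (`statistics.pstdev`), which is an assumption; the sample version would use `statistics.stdev` instead. The input numbers are the ten scores from the range example and the quartiles Q1 = 8.5 and Q3 = 31 from the quartile deviation example.

```python
from statistics import mean, pstdev

def coefficient_of_range(values):
    """(highest - lowest) / (highest + lowest)."""
    hi, lo = max(values), min(values)
    return (hi - lo) / (hi + lo)

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return pstdev(values) / mean(values) * 100

def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)

scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
print(coefficient_of_range(scores))                 # (9 - 1) / (9 + 1)
print(coefficient_of_variation(scores))             # in percent
print(coefficient_of_quartile_deviation(8.5, 31))   # (31 - 8.5) / (31 + 8.5)
```

Because each coefficient is a ratio of two quantities in the same unit, the result is unit-free, which is what makes comparisons across differently scaled data possible.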