Unit 3


UNIT_III

Analytical Characterization in Data Mining –


Attribute Relevance Analysis
Analytical characterization
Analytical characterization helps identify weakly relevant or irrelevant attributes so that they can be
excluded when preparing data for mining. It is a statistical approach to preprocessing that filters out
irrelevant attributes or ranks the relevant ones. Measures of attribute relevance can be used to recognize
irrelevant attributes that can be excluded from the concept description process. Incorporating this
preprocessing step into class characterization or comparison is called analytical characterization.
1. Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define the class or a concept. A class can be a
category of items on a shop floor, and a concept could be the abstract idea on which data may be categorized
like products to be put on clearance sale and non-sale products. There are two concepts here, one that helps
with grouping and the other that helps in differentiating.
Data Characterization: This refers to the summary of general characteristics or features of the class, resulting in
specific rules that define a target class. A data analysis technique called Attribute-oriented Induction is employed on
the data set for achieving characterization.
Data Discrimination: Discrimination separates distinct data sets based on the disparity in attribute values.
It compares features of a class with features of one or more contrasting classes. The results can be presented
in forms such as bar charts, curves and pie charts.

Data discrimination makes discrimination rules which are a comparison of the general features of objects between
two classes defined as the target class and the contrasting class.

It is a comparison of the general characteristics of target class data objects with the general characteristics of
objects from one or a set of contrasting classes. The user can define the target and contrasting classes. The methods
used for data discrimination are very similar to the approaches used for data characterization with the exception
that data discrimination results include comparative measures.
Reasons for attribute relevance analysis
• It can decide which dimensions must be included.

• It can produce a high level of generalization.

• It can reduce the number of attributes, which makes the resulting patterns easier to read.

The basic idea behind attribute relevance analysis is to compute a measure that quantifies the relevance of
an attribute with respect to a given class or concept. Such measures include information gain, uncertainty,
and the correlation coefficient.
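As an illustration of one such measure, the sketch below computes information gain for each attribute of a tiny, hypothetical data set. The attribute names (season, colour), the class labels and the helper functions entropy and information_gain are assumptions made for this example only. An attribute with a gain close to zero would be a candidate for removal.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the class minus the expected entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy data: "class" separates clearance-sale items from the rest.
rows = [
    {"season": "winter", "colour": "red", "class": "sale"},
    {"season": "winter", "colour": "blue", "class": "sale"},
    {"season": "summer", "colour": "red", "class": "non-sale"},
    {"season": "summer", "colour": "blue", "class": "non-sale"},
]

# Relevance score per attribute; higher means more relevant to the class.
gains = {a: information_gain(rows, a, "class") for a in ("season", "colour")}
print(gains)
```

In this toy data, season fully determines the class while colour carries no information about it, so colour is the attribute that relevance analysis would flag.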
Attribute relevance analysis for concept description is implemented as follows −

Data collection − Collect data for both the target class and the contrasting class by query processing.
Preliminary relevance analysis using conservative AOI − This step identifies the set of dimensions and attributes to
which the selected relevance measure will be applied. AOI performs a preliminary analysis of the data by removing
attributes that have a very large number of distinct values. To be conservative, the AOI performed here should
use attribute generalization thresholds that are set reasonably large, so that more attributes remain for
further relevance analysis by the selected measure.
Remove − This step removes irrelevant and weakly relevant attributes using the selected relevance analysis
measure.
Generate the concept description using AOI − Perform AOI using a less conservative set of attribute
generalization thresholds. If the descriptive mining function is class characterization, only the original target class
working relation is included at this stage.
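The preliminary step can be sketched as a simple filter on the number of distinct values per attribute. This is only a rough illustration: the function name drop_high_cardinality, the threshold value and the sample rows are hypothetical, and a full AOI implementation would also try to generalize an attribute through a concept hierarchy before discarding it.

```python
def drop_high_cardinality(rows, threshold):
    """Return the attribute names whose number of distinct values is at most `threshold`."""
    kept = []
    for attribute in rows[0].keys():
        distinct = {row[attribute] for row in rows}
        if len(distinct) <= threshold:
            kept.append(attribute)
    return kept


# Hypothetical working relation: item_id is unique per row, so it cannot be generalized.
rows = [
    {"item_id": 101, "category": "shoes", "season": "winter"},
    {"item_id": 102, "category": "shoes", "season": "summer"},
    {"item_id": 103, "category": "shirts", "season": "winter"},
    {"item_id": 104, "category": "shirts", "season": "summer"},
]

# A larger threshold is more conservative: it keeps more attributes for later analysis.
kept = drop_high_cardinality(rows, threshold=2)
print(kept)
```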
Distributions and central tendency
A dataset is a distribution of n number of scores or values.
Normal distribution

In a normal distribution, data is symmetrically distributed with no skew. Most values cluster around a central region, with
values tapering off as they go further away from the center. The mean, mode and median are exactly the same in a normal
distribution.

Example: Normal distribution


You survey a sample in your local community on the number of books they read in the last year.

A histogram of your data shows the frequency of responses for each possible number of books. From looking at
the chart, you see that there is a normal distribution.
The mean, median and mode are all equal; the central tendency of this dataset is 8.
Skewed distributions
In skewed distributions, more values fall on one side of the center than the other, and the mean, median and mode
all differ from each other. One side has a more spread out and longer tail with fewer scores at one end than the
other. The direction of this tail tells you the side of the skew. In a positively skewed distribution, there’s a cluster of
lower scores and a spread out tail on the right. In a negatively skewed distribution, there’s a cluster of higher scores
and a spread out tail on the left.

In this histogram, your distribution is skewed to the right, and the central tendency of your dataset is on the
lower end of possible scores.

In a positively skewed distribution, mode < median < mean.

In this histogram, your distribution is skewed to the left, and the central tendency of your dataset is
towards the higher end of possible scores.

In a negatively skewed distribution, mean < median < mode.
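The ordering of the three measures can be checked on a small sample. The scores below are hypothetical and were chosen to have a long right tail. The sketch uses Python's standard statistics module, and the ordering should be read as a rule of thumb for typical skewed data rather than a guarantee for every data set.

```python
import statistics

# Hypothetical right-skewed sample: most scores are low, a few are large.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 12]

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # middle of the ordered values
mean = statistics.mean(scores)      # sum of the values divided by their count

# For this sample the ordering matches the positively skewed pattern.
positively_skewed = mode < median < mean
print(mode, median, mean, positively_skewed)
```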
Measures of Central Tendency Meaning

The representative value of a data set, generally the central value or the most occurring value that gives a general idea
of the whole data set is called the Measure of Central Tendency.
Measures of central tendency help to find the middle, or the average, of a dataset. The 3 most common measures of
central tendency are the mode, median, and mean.
• Mean
• Median
• Mode
Mode
The mode is the most frequently occurring value in the dataset. It’s possible to have no mode, one
mode, or more than one mode. To find the mode, sort your dataset numerically or categorically and
select the response that occurs most frequently.

Example: Finding the mode


In a survey, you ask 9 participants whether they identify as conservative, moderate, or liberal.
To find the mode, sort your data by category and find which response was chosen most frequently. To make it easier,
you can create a frequency table to count up the values for each category.

Political ideology    Frequency
Conservative          2
Moderate              3
Liberal               4

Mode=Liberal
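The frequency-table approach can be sketched with collections.Counter from the Python standard library. The list of responses is rebuilt from the counts in the table above. Returning a list of modes is a choice made for this sketch so that ties (more than one mode) are handled as well.

```python
from collections import Counter

# Survey responses rebuilt from the frequency table (2, 3 and 4 answers).
responses = ["Conservative"] * 2 + ["Moderate"] * 3 + ["Liberal"] * 4

# Frequency table: category -> number of responses.
frequency = Counter(responses)

# Every category that reaches the highest count is a mode.
highest = max(frequency.values())
modes = [category for category, count in frequency.items() if count == highest]
print(frequency, modes)
```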
When to use the mode
The mode is most applicable to data from a nominal level of measurement. Nominal data is classified into
mutually exclusive categories, so the mode tells you the most popular category.

For continuous variables or ratio levels of measurement, the mode may not be a helpful measure of central
tendency. That’s because there are many more possible values than there are in a nominal or ordinal level of
measurement. It’s unlikely for a value to repeat in a ratio level of measurement.
Median

The median of a dataset is the value that’s exactly in the middle when it is ordered from low to high.

Example: Finding the median


You measure the reaction times of 7 participants on a computer task and categorize them into 3 groups: slow, medium
or fast.

Participant 1 2 3 4 5 6 7
Speed Medium Slow Fast Fast Medium Fast Slow

To find the median, you first order all values from low to high. Then, you find the value in the middle of the ordered
dataset—in this case, the value in the 4th position.
Ordered dataset: Slow Slow Medium Medium Fast Fast Fast

Median: Medium
Median of an odd-numbered dataset

For an odd-numbered dataset, find the value that lies at the (n+1)/2 position, where n is the number of values in the
dataset.
Example
You measure the reaction times in milliseconds of 5 participants and order the dataset.

Reaction time (milliseconds) 287 298 345 365 380

The middle position is calculated using (n+1)/2, where n = 5.

=(5+1)/2 =3

That means the median is the 3rd value in your ordered dataset.
Median of an even-numbered dataset

For an even-numbered dataset, find the two values in the middle of the dataset: the values at the n/2 and
(n/2)+1 positions. Then, find their mean.

Example
You measure the reaction times of 6 participants and order the dataset.

Reaction time (milliseconds) 287 298 345 357 365 380

The middle positions are calculated using n/2 and (n/2)+1, where n = 6.

( 6/2)=3

(6/2)+1=4

That means the middle values are the 3rd value, which is 345, and the 4th value, which is 357.

To get the median, take the mean of the 2 middle values by adding them together and dividing by 2.

(345+357)/2=351
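Both position rules can be combined into one small function. This is a minimal sketch. The function name median is chosen for the example, and positions are converted from the 1-based numbering used in the text to Python's 0-based indexing.

```python
def median(values):
    """Median using the (n+1)/2 rule for odd n and the n/2, (n/2)+1 rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # Position (n+1)/2 in 1-based counting is index (n+1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Positions n/2 and (n/2)+1 in 1-based counting are indexes n//2 - 1 and n//2.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_times = [287, 298, 345, 365, 380]
even_times = [287, 298, 345, 357, 365, 380]
print(median(odd_times), median(even_times))
```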
Mean

Mean in general terms is used for the arithmetic mean of the data, but other than the arithmetic mean there
are geometric mean and harmonic mean as well that are calculated using different formulas.
Mean for Ungrouped Data
The arithmetic mean (x̄) is defined as the sum of the individual observations (xi) divided by the total number of
observations N.
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21 then the mean is given by

mean = (27 + 11 + 17 + 19 + 21) ÷ 5

⇒ mean = 95 ÷ 5

⇒ mean = 19
Mean for Grouped Data

Mean is defined for the grouped data as the sum of the product of observations (xi) and their corresponding
frequencies (fi) divided by the sum of all the frequencies (fi).

Example: If the values (xi) of the observations and their frequencies (fi) are given as follows:

xi 4 6 15 10 9

fi 5 10 8 7 10
mean = (4×5 + 6×10 + 15×8 + 10×7 + 9×10) ÷ (5 + 10 + 8 + 7 + 10)

⇒ mean = (20 + 60 + 120 + 70 + 90) ÷ 40

⇒ mean = 360 ÷ 40

⇒ mean = 9
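Both definitions of the mean can be written as short functions. This is a minimal sketch. The function names mean and grouped_mean are chosen for the example, and the data are the ungrouped and grouped observations from the two examples above.

```python
def mean(values):
    """Arithmetic mean of ungrouped data: sum of observations / number of observations."""
    return sum(values) / len(values)


def grouped_mean(values, frequencies):
    """Mean of grouped data: sum of xi * fi divided by the sum of fi."""
    weighted_sum = sum(x * f for x, f in zip(values, frequencies))
    return weighted_sum / sum(frequencies)


observations = [27, 11, 17, 19, 21]
xi = [4, 6, 15, 10, 9]
fi = [5, 10, 8, 7, 10]
print(mean(observations), grouped_mean(xi, fi))
```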
Measure of Dispersion
Definition
• Measures of dispersion are descriptive statistics that describe how
similar a set of scores are to each other
• The more similar the scores are to each other, the lower the measure of
dispersion will be
• The less similar the scores are to each other, the higher the measure of
dispersion will be
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be

Measures of Dispersion
• Which of the distributions of scores has the larger dispersion?
[Figure: two bar charts of the scores 1 to 10, with a frequency axis running from 0 to 125]
• The upper distribution has more dispersion because the scores are more spread out
• That is, they are less similar to each other
Measures of dispersion are numbers that represent the scattering of the data. They describe various aspects of how
the data are spread out. Commonly used measures of dispersion include:
• Standard Deviation
• Mean Deviation
• Quartile Deviation
• Variance
• Range, etc

Dispersion in the general sense is the state of scattering. Suppose we have to study data for thousands of
variables. We then need a few parameters that capture the crux of the given data set. The parameters that
describe its spread are called measures of dispersion.
Example of Measures of Dispersion

We can understand the measure of dispersion through the following example. Suppose we have 10 students in a class and
the marks scored by them in a Mathematics test are 12, 14, 18, 9, 11, 7, 9, 16, 19, and 20 out of 20. Then the average mark
scored by the students in the class is,

Mean (Average) = (12 + 14 + 18 + 9 + 11 + 7 + 9 + 16 + 19 + 20)/10

= 135/10 = 13.5

So the average value of the marks is 13.5, and the mean deviation about this average is,

Mean Deviation = {|12-13.5| + |14-13.5| + |18-13.5| + |9-13.5| + |11-13.5| + |7-13.5| + |9-13.5| + |16-13.5| +
|19-13.5| + |20-13.5|}/10 = 39/10 = 3.9
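The calculation can be reproduced in a few lines, which is a convenient way to check the arithmetic by hand. This is a minimal sketch that recomputes the mean and the mean deviation for the ten marks. The variable names are chosen for the example.

```python
marks = [12, 14, 18, 9, 11, 7, 9, 16, 19, 20]

# Average mark of the class.
mean = sum(marks) / len(marks)

# Absolute distance of every mark from the mean, so that signs cannot cancel.
absolute_deviations = [abs(mark - mean) for mark in marks]

# Mean deviation: the average of the absolute deviations.
mean_deviation = sum(absolute_deviations) / len(marks)
print(mean, sum(absolute_deviations), mean_deviation)
```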
Types of Measures of Dispersion
Measures of dispersion can be classified into two categories shown below:
•Absolute Measures of Dispersion
•Relative Measures of Dispersion
Absolute Measures of Dispersion
These measures of dispersion are measured and expressed in the units of data themselves. For example – Meters, Dollars,
Kg, etc. Some absolute measures of dispersion are:

• Range
• Mean Deviation
• Standard Deviation
• Variance
• Quartile Deviation
• Interquartile Range
The Range
• The range is defined as the difference between the largest score in
the set of data and the smallest score in the set of data, XL - XS
• What is the range of the following data:
4 8 1 6 6 2 9 3 6 9
• The largest score (XL) is 9; the smallest score (XS) is 1; the range is XL -
XS = 9 - 1 = 8

Range for Grouped Data
The range of the data set for the grouped data set is found by studying the following example,
Example: Find out the range for the following frequency distribution table for the marks scored by class
10 students.

Marks Intervals    Number of Students
0-10               5
10-20              8
20-30              15
30-40              9

•For Largest Value: Taking the higher limit of Highest Class = 40
•For Smallest Value: Taking the lower limit of Lowest Class = 0
Thus, the range of the given data set is,
Range = 40 – 0
Range = 40
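The two ways of finding the range can be sketched as follows. The function names value_range and grouped_range are chosen for the example, and the class intervals are represented as (lower limit, upper limit) pairs, which is an assumption about how the table would be stored.

```python
def value_range(values):
    """Range of ungrouped data: largest score minus smallest score."""
    return max(values) - min(values)


def grouped_range(intervals):
    """Range of grouped data: upper limit of the highest class minus lower limit of the lowest class."""
    lowest = min(lower for lower, _ in intervals)
    highest = max(upper for _, upper in intervals)
    return highest - lowest


scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
print(value_range(scores), grouped_range(intervals))
```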
When To Use the Range
• The range is used when
• you have ordinal data or
• you are presenting your results to people with little or no knowledge of
statistics
• The range is rarely used in scientific work as it is fairly insensitive
• It depends on only two scores in the set of data, XL and XS
• Two very different sets of data can have the same range:
1 1 1 1 9 vs 1 3 5 7 9
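The insensitivity of the range is easy to see numerically. The sketch below compares the two data sets above. Using the population standard deviation from Python's statistics module as the second measure is a choice made for this illustration.

```python
import statistics

first = [1, 1, 1, 1, 9]
second = [1, 3, 5, 7, 9]

# Both sets share the same largest and smallest score, so the range is identical.
range_first = max(first) - min(first)
range_second = max(second) - min(second)

# A measure that uses every score tells the two sets apart.
spread_first = statistics.pstdev(first)
spread_second = statistics.pstdev(second)
print(range_first, range_second, spread_first, spread_second)
```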

Sample standard deviation
When you collect data from a sample, the sample standard deviation is used to make estimates
or inferences about the population standard deviation.
The sample standard deviation formula looks like this:

$$ s = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}} $$

• s = sample standard deviation
• ∑ = sum of…
• X = each value
• x̄ = sample mean
• n = number of values in the sample
Example: A garden contains 39 plants. The following plants were chosen at random, and their heights were
recorded in cm: 38, 51, 46, 79, and 57. Calculate the standard deviation of their heights.

Solution:

Given that, Number of observations = 5

Hence, the mean of 5 observations is:

Mean = (38 + 51 + 46 + 79 + 57)/5 = 54.2

Now, the standard deviation is calculated as follows:

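Standard Deviation, SD = √[(Σ(xi – x̄)²) / (N-1)]

Now, substituting the values in the formula, we get

SD = √[((38 – 54.2)² + (51 – 54.2)² + (46 – 54.2)² + (79 – 54.2)² + (57 – 54.2)²) / (5 – 1)]

On solving the above expression, we get

SD = √(962.8 / 4) = √240.7

S.D ≈ 15.5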
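The same calculation can be sketched in code. The function name sample_std is chosen for the example. The result is also compared with statistics.stdev from the Python standard library, which likewise divides by n - 1.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: square root of the summed squared deviations over (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


heights = [38, 51, 46, 79, 57]
sd = sample_std(heights)

# Cross-check against the standard library implementation.
library_sd = statistics.stdev(heights)
print(round(sd, 1), round(library_sd, 1))
```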
Mean Deviation

Range as a measure of dispersion only depends on the highest and the lowest values in the data. Mean deviation on
the other hand measures the deviation of the observations from the mean of the distribution. Since the average is the
central value of the data, some deviations might be positive and some might be negative. If they are added like that,
their sum will not reveal much as they tend to cancel each other’s effect. For example,

Consider the data given below, -5, 10, 25

Mean = (-5 + 10 + 25)/3 = 10

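Now the deviations from the mean for the different values are,

(-5 – 10) = -15
(10 – 10) = 0
(25 – 10) = 15

Added as they are, the deviations cancel out: -15 + 0 + 15 = 0. Mean deviation therefore averages the absolute
deviations instead: (15 + 0 + 15)/3 = 10.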
Standard Deviation
The standard deviation is the average amount of variability in your dataset. It tells you, on average, how far each value
lies from the mean. In normal distributions, data is symmetrically distributed with no skew. Most values cluster
around a central region, with values tapering off as they go further away from the center. The standard deviation tells
you how spread out from the center of the distribution your data is on average.

Population standard deviation


When you have collected data from every member of the population that you’re interested in, you can get an exact
value for population standard deviation.
The population standard deviation formula looks like this:

$$ \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} $$

• σ = population standard deviation
• ∑ = sum of…
• X = each value
• μ = population mean
• N = number of values in the population
Variance
• Variance is defined as the average of the squared deviations:

$$ \sigma^2 = \frac{\sum (X - \mu)^2}{N} $$
What Does the Variance Formula Mean?
• First, it says to subtract the mean from each of the scores
• This difference is called a deviate or a deviation score
• The deviate tells us how far a given score is from the typical, or average, score
• Thus, the deviate is a measure of dispersion for a given score

• Why can’t we simply take the average of the deviates? That is, why isn’t variance defined as:

$$ \sigma^2 = \frac{\sum (X - \mu)}{N} $$

This is not the formula for variance!
• One of the definitions of the mean was that it always made the sum
of the scores minus the mean equal to 0
• Thus, the average of the deviates must be 0 since the sum of the
deviates must equal 0
• To avoid this problem, statisticians square the deviate score prior to
averaging them
• Squaring the deviate score makes all the squared scores positive

• Variance is the mean of the squared deviation scores
• The larger the variance is, the more the scores deviate, on average,
away from the mean
• The smaller the variance is, the less the scores deviate, on average,
from the mean

Standard Deviation
• Standard deviation = √variance
• Variance = (standard deviation)²
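The steps described for the variance can be followed in code: compute the deviates, note that they sum to zero, square them, average them, and take the square root for the standard deviation. This is a minimal sketch that treats the ten scores from the range example as a complete population, which is an assumption made for the illustration.

```python
import math
import statistics

scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
mean = sum(scores) / len(scores)

# Deviates: each score minus the mean. Their sum is always zero (up to rounding).
deviates = [x - mean for x in scores]
sum_of_deviates = sum(deviates)

# Variance: the mean of the squared deviates.
variance = sum(d ** 2 for d in deviates) / len(scores)

# Standard deviation: the square root of the variance.
standard_deviation = math.sqrt(variance)
print(sum_of_deviates, variance, standard_deviation)
```

Squaring before averaging is what keeps the positive and negative deviates from cancelling.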

Quartile Deviation & interquartile
• Quartile deviation is a statistic that measures the spread of the middle half of the data, which helps
to understand the distribution of the data.
• There are three quartiles, Q1, Q2 and Q3, which divide the data into four quarters. The median of the data
is the second quartile Q2. The first quartile Q1 is the median of the first half of the data, and the third
quartile Q3 is the median of the second half of the data.

• Quartile deviation is the dispersion in the middle of the data. The difference between the third
quartile Q3 and the first quartile Q1 is called the interquartile range, and half of this interquartile
range is called the quartile deviation. The quartile deviation is also referred to as the
semi-interquartile range.
Example 1: Find the quartile deviation for the following given data.

23, 8, 5, 16, 33, 7, 24, 5, 30, 33, 37, 30, 9, 11, 26, 32

Let us arrange this data in the following ascending order.


5, 5, 7, 8, 9, 11, 16, 23, 24, 26, 30, 30, 32, 33, 33, 37

From the above data we have Q1 = (8 + 9)/2 = 17/2 = 8.5, and Q3 = (30 + 32)/2 = 62/2 = 31
Quartile Deviation = (Q3 − Q1)/2 = (31 − 8.5)/2
= 22.5/2
= 11.25
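The median-of-halves method used in this example can be sketched as below. The function names are chosen for the example. Leaving the overall median out of both halves when the number of values is odd is one common convention and is an assumption here. Statistical libraries often use interpolation instead and may give slightly different quartiles.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Q1, Q2, Q3 where Q1 and Q3 are the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so that both halves have the same size.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [23, 8, 5, 16, 33, 7, 24, 5, 30, 33, 37, 30, 9, 11, 26, 32]
q1, q2, q3 = quartiles(data)
interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2
print(q1, q2, q3, interquartile_range, quartile_deviation)
```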
Relative Measures of Dispersion
When we have to compare the spread of two quantities that have different units, we use relative measures of
dispersion, which are unit-free ratios, to get a better idea of the scatter of the data. Various relative measures of dispersion are:

• Coefficient of Range: The coefficient of range is defined as the ratio of the difference between the highest
and lowest value in a data set to the sum of the highest and lowest value.
• Coefficient of Variation: The coefficient of Variation is defined as the ratio of the standard deviation to the
mean of the data set. We use percentages to express the coefficient of variation.
• Coefficient of Mean Deviation: The coefficient of the Mean Deviation is defined as the ratio of the mean
deviation to the value of the central point of the data set.
• Coefficient of Quartile Deviation: The coefficient of the Quartile Deviation is defined as the ratio of the
difference between the third quartile and the first quartile to the sum of the third and first quartiles.
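The four coefficients can be written directly from their definitions. This is a sketch under stated assumptions: the function names are chosen for the example, the coefficient of variation uses the population standard deviation, and the mean is taken as the central point for the coefficient of mean deviation. The marks and the quartile values are reused from the earlier examples.

```python
import math


def coefficient_of_range(values):
    """(highest - lowest) / (highest + lowest)."""
    highest, lowest = max(values), min(values)
    return (highest - lowest) / (highest + lowest)


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance) / mean * 100


def coefficient_of_mean_deviation(values):
    """Mean deviation divided by the central point (the mean is assumed here)."""
    mean = sum(values) / len(values)
    mean_deviation = sum(abs(x - mean) for x in values) / len(values)
    return mean_deviation / mean


def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)


marks = [12, 14, 18, 9, 11, 7, 9, 16, 19, 20]
print(coefficient_of_range(marks))
print(coefficient_of_variation(marks))
print(coefficient_of_mean_deviation(marks))
print(coefficient_of_quartile_deviation(8.5, 31))
```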
