Exploratory Data Analysis

Exploratory Data Analysis (EDA)
■ In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often with visual methods.
■ A statistical model can be used or not, but primarily EDA is for seeing what the data
can tell us beyond the formal modeling or hypothesis testing task.
■ Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test
hypotheses, and check assumptions with the help of summary statistics and
graphical representations.
■ Exploratory Data Analysis is a crucial step before you jump to machine learning or
modeling of your data. It provides the context needed to develop an appropriate
model and to interpret the results correctly.
■ Exploratory Data Analysis (EDA) is arguably the most important part of machine learning
modeling on new data sets. If EDA is not executed correctly, we may start
modeling with "unclean" data, and the problems then grow like a snowball rolling
downhill: they get bigger and worse.
■ EDA is presented in two ways: digital exploration and visual exploration.
[Figure: data analysis workflow. Data sources feed data preprocessing (data cleaning, data summarization, feature transformation, feature engineering, data aggregation, data reduction), supported by S.T. and Exploratory Data Analysis (EDA). The preprocessed data then go to A.I., D.M., or S.T. model creation for various data analyses based on different requirements, followed by model evaluation and selection.]
* Nowadays, data preprocessing includes data cleaning, because data sets are getting bigger and more complicated.
Used Data Set
■ This chapter uses the Insurance data set from the MASS package.
■ The relevant instructions for loading and viewing the data set are below.

■ The data set describes car insurance policyholders and is structured as below.
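The loading and viewing steps can be sketched as follows. This is a minimal sketch, assuming the MASS package is installed (it ships with standard R distributions); the exact commands on the original slide are not shown, so the calls chosen here are an assumption.

```r
# Load the MASS package, which provides the Insurance data set
library(MASS)
data(Insurance)

# View the first rows and the overall shape (rows x columns)
head(Insurance)
dim(Insurance)

# List the variable names
names(Insurance)
```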


Digital Exploration
■ The idea is to retrieve relevant indicators for data exploration, such as the data
structure (data set shape), variable types and conditions, data distribution, missing and
null values, data correlation, duplicated values, and descriptive statistics.
■ EDA focuses on exploring data to understand the data’s underlying structure and variables,
to develop intuition about the data set, to consider how that data set came into existence,
and to decide how it can be investigated with more formal statistical methods.
Understand Variables Structure –summary() 1/n

■ The following are used to understand the variables:

■ The execution is as illustrated:


■ Based on what you have learned,
what are the data types?
■ The mean and the median are both measures of central
tendency (along with the mode and
the midrange). The mean is the sum
of the data points' values divided by
the number of data points. The
median is the middle value of
the data when the list of data is put
in ascending order.
Understand Variables Structure - summary() 2/n

■ You can see the variables’ structures based on the following:

■ Then, use str() to see the internal structure. As you can see, there are 64
observations of 5 variables, and the first 3 variables (District, Group, and Age)
have 4 levels each.
■ Next, use summary() to obtain the descriptive statistics. The first 3 variables have
different data types from the last 2 variables.
■ We can compare the mean and the median to see how skewed the data are. If the
difference between the two statistics is large, the distribution is clearly left or
right skewed.
■ For example, the variable Holders shows a right-skewed distribution, since its
mean is about double its median or more. Furthermore, the variable Holders
contains abnormal values.
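The str() and summary() checks described above can be sketched as follows. This is a minimal sketch, assuming the MASS package is installed; the variable names holders_mean and holders_median are introduced here only for illustration.

```r
library(MASS)
data(Insurance)

# Internal structure: number of observations, variables, types, factor levels
str(Insurance)

# Descriptive statistics for every variable
summary(Insurance)

# Compare mean and median of Holders as a quick check for skew
holders_mean <- mean(Insurance$Holders)
holders_median <- median(Insurance$Holders)
cat("mean:", holders_mean, " median:", holders_median, "\n")
cat("mean / median ratio:", holders_mean / holders_median, "\n")
```

A ratio well above 1 points to a long right tail; a ratio well below 1 points to a long left tail.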
Variables Description - describe() 1/n

■ The Harrell Miscellaneous (Hmisc) package contains many functions useful for data analysis, high-
level graphics, utility operations, functions for computing sample size and power, importing
and annotating datasets, imputing missing values, advanced table making, variable
clustering, character string manipulation, conversion of R objects to LaTeX and html code,
and recoding variables.
■ Use describe() to view the data distribution in more detail.
n: number of samples.
missing: number of missing samples.
distinct: number of distinct (unique) values.
Info: (about the continuity of the variable) a data distribution measure
using the relative efficiency of a proportional
odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info is related to how
continuous the variable values are; ties are less
harmful the more untied values there are.
Gmd: Gini's Mean Difference.
Quantiles: quantile the values to see the data distribution.

■ Use describe() on variables of different types


Variables Description - describe() 2/n

■ Use describe() to view the data distribution in more detail.


■ However, describe() decides how to present a variable as follows:
- If a numerical variable has no more than 10 distinct values,
the variable is treated as categorical.
- If a variable has more than 10 but no more than 20 distinct values,
describe() shows a frequency table.
- If a variable has more than 20 distinct values, the 5 lowest and 5 highest
values are listed.
- Namely, describe() uses a biased estimator for small samples and an unbiased
estimator for larger samples. (Note: in statistics, the bias (or bias function) of an estimator
is the difference between the estimator's expected value and the true value of the
parameter being estimated. An estimator or decision rule with zero bias is called
unbiased; otherwise the estimator is said to be biased.)
- The variable Claims has more tied values than Holders,
according to distinct and Info, where Info is calculated through
a proportional odds/Wilcoxon test.
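The n, missing, and distinct columns that describe() reports can be reproduced in base R, as sketched below. This is a minimal sketch: the helper table overview is a name introduced here for illustration, and the call to Hmisc::describe() is guarded because it assumes the Hmisc package is installed.

```r
library(MASS)
data(Insurance)

# Base-R counterpart of the n / missing / distinct columns
overview <- data.frame(
  n        = sapply(Insurance, function(v) sum(!is.na(v))),
  missing  = sapply(Insurance, function(v) sum(is.na(v))),
  distinct = sapply(Insurance, function(v) length(unique(v[!is.na(v)])))
)
print(overview)

# Full describe() output, only if Hmisc is available (assumption)
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(Insurance))
}
```

Comparing the distinct counts of Claims and Holders gives a first impression of which variable has more ties.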
Variables Description - describe() 3/n

■ Of all measures of variability, the variance is the most popular by far.


■ However, Gini's Mean Difference (Gmd) is an alternative index of variability. It shares many
properties with the variance, but can be more informative about the properties of distributions
that depart from normality.
■ Gini's Mean Difference can help when checking a non-normal distribution that is left or right
skewed.
■ For example, the distributions of the variables Holders and Claims are right skewed
according to the mean and median, and their Gmd values are 4971. and
60.66 respectively. However, the efficiency of Gmd is sensitive to the number of divisions and the degree of
dispersion.
■ The more divisions the quantiles use, the more complete a picture of the data distribution you get.
However, this can sometimes lead to an inaccurate Gmd value.
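Gini's Mean Difference is the mean absolute difference between all pairs of distinct observations, which can be computed directly in base R. The sketch below is an illustration under that definition; gini_md is a hypothetical helper name, not a function from the slides.

```r
library(MASS)
data(Insurance)

# Gini's mean difference: average |x_i - x_j| over all ordered pairs i != j
gini_md <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# Compare spread measured by Gmd with the standard deviation
c(Gmd = gini_md(Insurance$Holders), Stdev = sd(Insurance$Holders))
c(Gmd = gini_md(Insurance$Claims),  Stdev = sd(Insurance$Claims))
```

Because Gmd uses absolute rather than squared differences, extreme values influence it less than they influence the variance.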
fBasics package – basicStats() 1/n
■ The Rmetrics "fBasics" package is a collection of functions to explore and to investigate basic
properties of financial returns and related quantities.
■ The covered fields include techniques of explorative data analysis and the investigation of
distributional properties, including parameter estimation and hypothesis testing.
■ Even more there are several utility functions for data handling and management.
■ basicStats() computes basic financial time series statistics. For now, we check the data in the
variable Holders as illustrated.

■ The S.T. values in the red frame are different from summary():
■ nobs: number of observations
NAs: number of missing values
Sum: sum of the values
SE Mean: standard error of the mean
LCL/UCL Mean: lower/upper confidence limit of the mean
Variance: measures how far a set of (random) numbers is
spread out from its average value
Stdev: standard deviation
Skewness
Kurtosis
fBasics package - basicStats() 2/n

■ According to the Sum value, there are 23,359 insurance holders in total in the data set.
■ As listed, the average number of insurance holders is 365 (i.e. 23,359/64) per combination of District,
Group, and Age.
■ The interval [209, 521] (i.e. [LCL Mean, UCL Mean]) is the 95% confidence interval for the mean.
Namely, we are 95% confident that the true mean lies in this interval, and the sample mean 365
is its center.
■ Skewness and kurtosis are discussed under data distribution.
■ Skewness is a measure of the asymmetry of the probability distribution of
a real-valued random variable about its mean. The skewness value can
be positive, zero, negative, or undefined.
■ Kurtosis is a measure of the "tailedness" of the probability distribution of
a real-valued random variable.
■ When analyzing data, explore the data according to the information you
require, rather than exploring as much as you can.
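The Sum, Mean, SE Mean, and confidence limits can be recomputed by hand to see where they come from. The sketch below uses base R only and assumes a t-based 95% interval for the mean; the variable names are introduced here for illustration.

```r
library(MASS)
data(Insurance)

x <- Insurance$Holders
n <- length(x)                    # nobs
total <- sum(x)                   # Sum
m <- mean(x)                      # Mean = Sum / nobs
se <- sd(x) / sqrt(n)             # SE Mean
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value (assumed t-based)
lcl <- m - t_crit * se            # LCL Mean
ucl <- m + t_crit * se            # UCL Mean

round(c(nobs = n, Sum = total, Mean = m, SE_Mean = se,
        LCL_Mean = lcl, UCL_Mean = ucl), 2)
```

The sample mean always lies at the center of this interval; the width of the interval, not the position of the mean inside it, shows how precisely the mean is estimated.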
Distribution Index
■ In statistics, the binomial distribution, Poisson distribution, and geometric
distribution are used for discrete variables.
■ In statistics, the uniform distribution, exponential distribution, and normal
distribution are used for continuous variables.
■ A data distribution is a function or a listing which shows all the possible values (or
intervals) of the data. It also (and this is important) tells you how often each value occurs.
■ In this section, skewness and kurtosis are discussed and explored for continuous variables by using
basicStats() in the fBasics package, and skewness() and kurtosis() in the timeDate package.
■ Kurtosis is the degree of peakedness of a distribution.
Distribution Index - skewness 1/n
■ In probability theory and statistics, the skew normal distribution is a continuous probability
distribution that generalizes the normal distribution to allow for non-zero skewness.
■ If the skewness lies in the interval [-1, 1], the data distribution is fairly symmetric, and a
normal distribution has a skewness of 0, as illustrated. While the normal distribution is
one of the most common forms of distribution, not all data sets follow this basic curve.
■ As a rule of thumb, the data distribution is right skewed if its skewness > 1, and left skewed if its skewness < -1.
■ Note that if your data do not follow a normal
distribution, this does not mean you need to transform them to a normal
distribution.
Distribution Index - skewness 2/n
■ Data can be "skewed", meaning it tends to have a long tail on one side or the other. For example,

■ Why is it called negative/ positive skew? Because the long "tail" is on the negative/ positive side
of the peak.
■ The Normal Distribution has No Skew
■ For example,
Distribution Index - skewness 3/n

■ As computed by skewness(), the data distributions of the variables Holders and Claims have
a long tail on the right side, since their skewness values are both greater than 1.
■ Namely, the density of each distribution has a long tail on the right side.

■ Skewness quantifies the extent to which a distribution differs from a normal
distribution. In a right-skewed distribution, usually (but not always) the mean is greater
than the median and the mean is greater than the mode, in which case
the skewness is greater than zero; the reverse holds for a left-skewed one.
■ To summarize, generally if the distribution of data is skewed to the left, the mean is less than
the median, which is often less than the mode. If the distribution of data is skewed to the right,
the mode is often less than the median, which is less than the mean.
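The skewness values can be checked with a few lines of base R. The sketch below uses one common moment-based definition (mean cubed deviation divided by the cubed standard deviation); this definition is an assumption, and packages may scale the result slightly differently. The helper name skew is hypothetical.

```r
library(MASS)
data(Insurance)

# Moment-based sample skewness
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Skewness of the two count variables
skew(Insurance$Holders)
skew(Insurance$Claims)

# A perfectly symmetric sample has skewness 0
skew(c(-2, -1, 0, 1, 2))
```

Positive results indicate a long right tail, negative results a long left tail.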
Distribution Index - skewness 4/n

■ Skewness and symmetry become important when we discuss probability distributions.


■ An important characteristic of any set of data is the variation in the data. In some data sets, the
data values are concentrated closely near the mean; in other data sets, the data values are
more widely spread out from the mean. The most common measure of variation, or spread, is
the standard deviation. The standard deviation is a number that measures how far data values
are from their mean.
■ You will find that in symmetrical distributions, the standard deviation can be very helpful but in
skewed distributions, the standard deviation may not be much help. The reason is that the two
sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look
at the first quartile, the median, the third quartile, the smallest value, and the largest value.
Because numbers can be confusing, always graph your data. Display your data in a histogram or
a box plot.
Distribution Index - Kurtosis 1/n
■ Another index of data distribution is kurtosis. It is often used together with skewness.
■ Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the
tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given
distribution contain extreme values.
■ The normal curve is called a mesokurtic curve (M. curve for short). If the curve of a distribution is
more peaked than a normal or mesokurtic curve, it is referred to as a leptokurtic curve (L. curve
for short). If a curve is less peaked than a normal curve, it is called a platykurtic curve (P.
curve for short). The kurtosis of the normal distribution equals 3, which serves as the reference value.
■ A mesokurtic distribution is one whose tails behave like those of a normal
distribution. This type of distribution has a coefficient of kurtosis of 3,
the same as that of a normal distribution, so its excess kurtosis is zero.

Distribution Index - Kurtosis 2/n
■ Kurtosis is a measure of the combined sizes of the two tails. It measures the amount of
probability in the tails. The value is often compared to the kurtosis of the normal distribution,
which is equal to 3.
■ Usually, 3 is subtracted so that the kurtosis of the normal distribution is 0 (excess kurtosis).
■ Values of asymmetry (skewness) and kurtosis between -2 and +2 are considered acceptable as
evidence of a normal univariate distribution (George & Mallery, 2010). However, this depends mainly on
the sample size. Most software packages that compute the skewness and kurtosis also
compute their standard errors.
■ In the data set we used, the kurtosis values of the variables Holders and Claims are greater than 0, and even
greater than +2. That means there are some abnormal data in the data set. For the abnormal
data, either investigate the problem further, or choose an algorithm that does not depend on
the data distribution.
Distribution Index - Kurtosis 3/n
■ A further characterization of the data includes skewness and kurtosis. Skewness is a measure of
symmetry, or more precisely, the lack of symmetry. Kurtosis is a measure of whether the data
are heavy-tailed or light-tailed relative to a normal distribution.
■ Kurtosis measures the amount of probability in the tails. The value is often compared to the kurtosis of
the normal distribution, which is equal to 3. If the kurtosis is greater than 3, then the data set
has heavier tails than a normal distribution (more in the tails).
■ Like skewness, kurtosis is a statistical measure that is used to describe the distribution.
Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution
(e.g., five or more standard deviations from the mean).
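The relationship between kurtosis and excess kurtosis can be sketched in base R. The moment-based definition below (mean fourth-power deviation divided by the squared variance, minus 3) is an assumption, and the helper name excess_kurtosis is hypothetical.

```r
library(MASS)
data(Insurance)

# Moment-based excess kurtosis: subtract 3 so a normal distribution gives about 0
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^4) / sd(x)^4 - 3
}

# Excess kurtosis of the two count variables
excess_kurtosis(Insurance$Holders)
excess_kurtosis(Insurance$Claims)

# For a large simulated normal sample the result should be near 0
set.seed(1)  # assumed seed, only for reproducibility
excess_kurtosis(rnorm(100000))
```

Adding 3 to the result gives the plain kurtosis that is compared against the reference value 3.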
Data Sparseness 1/n
■ In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in
which most of the elements are zero or null. By contrast, if most of the elements are nonzero
and not null, then the matrix is considered dense.
■ R provides the Matrix package to explore data sparseness; it offers functions to process dense
and sparse matrices.
■ sample() takes a sample of the specified size from the elements of x, either with or without
replacement. As illustrated, sample() draws 10 values at random with replacement (i.e.
each drawn value is put back before the next draw), and each value lies between 1 and 10. For example,
sample(1:6, 10, replace=TRUE) means we roll a die 10 times, with each roll giving a value from 1 to 6.
The rolls are independent events, so replace is set to TRUE. Namely, the
population stays the same after each roll.
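The die-rolling example can be run as follows. This is a minimal sketch; the seed value is an assumption added only so that repeated runs give the same draws.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

# Roll a die 10 times: sampling with replacement from 1..6
rolls <- sample(1:6, 10, replace = TRUE)
rolls

# Frequency of each face among the 10 rolls
table(rolls)

# Without replacement each value can be drawn only once,
# so at most 6 draws are possible from 1..6
sample(1:6, 6, replace = FALSE)
```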
Data Sparseness
■ sparseMatrix() builds a 10 x 10 matrix and sets elements to 1, so that at most 10
elements are filled in.
■ which() picks out the positions in the matrix whose value is 1. Then the plot() function plots the sparse matrix.
■ The argument pch selects the plotting symbol used by the plot() function, as illustrated below. However, the
available symbols depend on the version of plot() you use; they differ slightly.
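The sparse-matrix steps can be sketched as follows. This is a minimal sketch, assuming the Matrix package is installed (it ships with standard R distributions); the seed, the variable names, and the choice of pch = 15 are assumptions. Note that when the same (row, column) pair is drawn twice, sparseMatrix() adds the values, so the sketch looks for nonzero entries rather than entries equal to 1.

```r
library(Matrix)

set.seed(1)  # assumed seed, only for reproducibility
rows <- sample(1:10, 10, replace = TRUE)
cols <- sample(1:10, 10, replace = TRUE)

# 10 x 10 sparse matrix with at most 10 filled positions
m <- sparseMatrix(i = rows, j = cols, x = 1, dims = c(10, 10))
m

# Locate the filled positions (converted to a base matrix first)
dense <- as.matrix(m)
pos <- which(dense != 0, arr.ind = TRUE)

# Plot filled positions; the y axis is reversed so row 1 is at the top
plot(pos[, "col"], pos[, "row"], pch = 15,
     xlim = c(1, 10), ylim = c(10, 1),
     xlab = "column", ylab = "row")
```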
Missing Values
■ md.pattern() in the mice package is used to display the pattern of missing values in a data set.

■ However, we first build a small data set with random missing values (i.e. 64 x 5 elements) before using md.pattern().
We use the following loop to set 10 values to NA (i.e. missing value) and store them in Insurance[ ].
Then, we retrieve and present the missing values of Insurance[ ] by using md.pattern().
■ In the pattern table, “1” indicates no missing value and
“0” indicates a missing value.
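The missing-value steps can be sketched as follows. This is a minimal sketch rather than the loop from the original slide, which is not shown: it works on a copy named ins (a hypothetical name) instead of overwriting Insurance, the seed is an assumption, and the call to mice::md.pattern() is guarded because it assumes the mice package is installed.

```r
library(MASS)
data(Insurance)

ins <- Insurance  # work on a copy so the original data stay intact
set.seed(123)     # assumed seed, only for reproducibility

# Set up to 10 randomly chosen cells to NA
# (the same cell may be picked twice, so the total can be below 10)
for (k in 1:10) {
  row_id <- sample(nrow(ins), 1)
  col_id <- sample(ncol(ins), 1)
  ins[row_id, col_id] <- NA
}

# Base-R check: missing values per variable and in total
colSums(is.na(ins))
sum(is.na(ins))

# Missing-data pattern table, only if mice is available (assumption)
if (requireNamespace("mice", quietly = TRUE)) {
  print(mice::md.pattern(ins, plot = FALSE))
}
```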
Visual Exploration
