Exploratory Data Analysis
Exploratory Data Analysis (EDA)
■ In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often with visual methods.
■ A statistical model can be used or not, but primarily EDA is for seeing what the data
can tell us beyond the formal modeling or hypothesis testing task.
■ Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test
hypotheses, and check assumptions with the help of summary statistics and
graphical representations.
■ Exploratory Data Analysis is a crucial step before you jump to machine learning or
modeling of your data. It provides the context needed to develop an appropriate
model and interpret the results correctly.
■ Exploratory Data Analysis (EDA) is the most important part of machine learning
modeling on new data sets. If EDA is not executed correctly, we may start
modeling with "unclean" data, and the problem behaves like a snowball rolling
downhill: it gets bigger and worse.
■ EDA is presented in two ways: numerical exploration and visual exploration.
[Figure: data analysis workflow. Data Sources → Data Cleaning → Data Preprocessing (some S.T.) → Data Summarization and Feature Transformation → Feature Engineering (some S.T.) → Model Creation (S.T., A.I., or D.M.) for various data analyses based on different requirements → Model Evaluation and Selection]
* Nowadays, data preprocessing includes data cleaning, because data sets are becoming larger and more complicated.
Used Data Set
■ The MASS package and its Insurance data set are used for this chapter.
■ First, load the MASS package and the Insurance data set, then view the data.
■ Then, use str() to see the internal structure. The data set has 64 observations
of 5 variables, and each of the first 3 variables (District, Group, and Age) has
4 levels.
■ Next, use summary() to obtain the descriptive statistics. The first 3 variables
(factors) have different data types from the last 2 variables (integers).
■ We can compare the mean and the median to see how skewed the data is. If the
difference between the two statistics is large, the distribution is clearly left
or right skewed.
■ For example, the variable Holders has a right-skewed distribution: its mean is
about double its median or more. Furthermore, the variable Holders contains
abnormal values (outliers).
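The loading and inspection steps described above can be sketched as follows. This is a minimal sketch that assumes only base R and the MASS package; the variable names `holders_mean`, `holders_median`, and `holders_outliers` are illustrative, and using `boxplot.stats()` to flag abnormal values is an assumption about how such values might be identified, not necessarily the method used on the original slide.

```r
# Load the MASS package, which provides the Insurance data set.
library(MASS)
data(Insurance)

# Internal structure: number of observations, variables, and factor levels.
str(Insurance)

# Descriptive statistics for every variable.
summary(Insurance)

# Compare mean and median of Holders as a quick check for skewness.
holders_mean   <- mean(Insurance$Holders)
holders_median <- median(Insurance$Holders)
cat("mean:", holders_mean,
    "median:", holders_median,
    "ratio:", holders_mean / holders_median, "\n")

# Values that lie beyond the boxplot whiskers (one possible definition of
# "abnormal" values; an assumption made for this sketch).
holders_outliers <- boxplot.stats(Insurance$Holders)$out
print(holders_outliers)
```

A mean that is much larger than the median, together with several large values beyond the upper whisker, points to a right-skewed distribution.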
Variables Description - describe() 1/n
■ The Harrell Miscellaneous (Hmisc) package contains many functions useful for data analysis,
high-level graphics, utility operations, functions for computing sample size and power, importing
and annotating datasets, imputing missing values, advanced table making, variable
clustering, character string manipulation, conversion of R objects to LaTeX and HTML code,
and recoding variables.
■ Use describe() to view the data distribution in more detail.
n: number of samples.
missing: number of missing samples.
distinct: number of distinct (unique) values.
Info: (about the continuity of the variable) a measure of the data distribution
using the relative efficiency of a proportional odds/Wilcoxon test on the variable
relative to the same test on a variable that has no ties. Info is related to how
continuous the variable values are; ties are less harmful the more untied values
there are.
Gmd: Gini's mean difference.
Quantiles: the values are summarized by quantiles to show the data distribution.
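Several of the quantities listed above can be reproduced with base R, which may help in reading the describe() output. This is a rough sketch: the names `n_obs`, `n_missing`, `n_distinct`, and `gmd` are illustrative, Gini's mean difference is computed here as the mean absolute difference over all pairs of observations, and the quantile levels are chosen for illustration. The call to `Hmisc::describe()` runs only if the Hmisc package happens to be installed.

```r
library(MASS)
data(Insurance)

x <- Insurance$Holders

# n, missing, and distinct, computed by hand.
n_obs      <- sum(!is.na(x))
n_missing  <- sum(is.na(x))
n_distinct <- length(unique(x[!is.na(x)]))

# Gini's mean difference: mean absolute difference between all pairs.
# dist() returns every pairwise distance |x_i - x_j| for i < j.
gmd <- mean(dist(x[!is.na(x)]))

# Quantiles summarizing the distribution (levels chosen for illustration).
q <- quantile(x, probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95),
              na.rm = TRUE)

print(c(n = n_obs, missing = n_missing, distinct = n_distinct, Gmd = gmd))
print(q)

# Full describe() output, if Hmisc is available.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(Insurance))
}
```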
■ The statistics in the red frame are different from those given by summary():
nobs: number of observations
NAs: number of missing values
Sum: sum of the values
SE Mean: standard error of the mean
LCL/UCL Mean: lower/upper confidence limit for the mean
Variance: measures how far a set of (random) numbers is spread out from its
average value
Stdev: standard deviation
Skewness
Kurtosis
"fBasics" package - basicStats() 2/n
■ According to the Sum value, there are 23,359 insurance holders in total in the data set.
■ As listed, the average number of insurance holders is 365 (i.e., 23,359/64) per combination
of District, Group, and Age.
■ The 95% confidence interval for the mean is [209, 521] (i.e., [LCL Mean, UCL Mean]). The
sample mean of 365 always lies inside this interval by construction; the interval instead
expresses how uncertain the estimate is. An interval built this way is expected to cover the
true mean in about 95% of repeated samples, and here it is quite wide.
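The Sum, Mean, SE Mean, and confidence limits reported above can be recomputed by hand as a check. This is a sketch under the assumption that the limits are the usual t-based 95% interval, mean ± t(0.975, n − 1) × SE; the variable names are illustrative. The `fBasics::basicStats()` call runs only if fBasics is installed.

```r
library(MASS)
data(Insurance)

x <- Insurance$Holders

n_obs    <- length(x)
total    <- sum(x)                      # Sum
avg      <- mean(x)                     # Mean
se_mean  <- sd(x) / sqrt(n_obs)         # SE Mean
t_crit   <- qt(0.975, df = n_obs - 1)   # two-sided 95% critical value
lcl_mean <- avg - t_crit * se_mean      # LCL Mean
ucl_mean <- avg + t_crit * se_mean      # UCL Mean

print(round(c(nobs = n_obs, Sum = total, Mean = avg,
              "SE Mean" = se_mean,
              "LCL Mean" = lcl_mean, "UCL Mean" = ucl_mean), 2))

# Compare with basicStats(), if fBasics is available.
if (requireNamespace("fBasics", quietly = TRUE)) {
  print(fBasics::basicStats(x))
}
```

Because the interval is centered on the sample mean, its width, not the fact that the mean falls inside it, is what indicates how reliable the estimate is.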
■ Skewness and kurtosis are discussed under data distribution.
■ Skewness is a measure of the asymmetry of the probability distribution of
a real-valued random variable about its mean. The skewness value can
be positive, zero, negative, or undefined.
■ Kurtosis is a measure of the "tailedness" of the probability distribution of
a real-valued random variable.
■ When analyzing data, explore the data according to the information you
need, rather than exploring as much as you can.
Distribution Index
■ In statistics, the binomial distribution, the Poisson distribution, and the geometric
distribution are used for discrete variables.
■ In statistics, the uniform distribution, the exponential distribution, and the normal
distribution are used for continuous variables.
■ More generally, a data distribution is a function or a listing that shows all the possible values (or
intervals) of the data and, importantly, how often each value occurs.
■ In this section, skewness and kurtosis are discussed and explored for continuous variables by using
basicStats() in the fBasics package, and skewness() and kurtosis() in the timeDate package.
■ Kurtosis is traditionally described as the degree of peakedness of a distribution; it also reflects how heavy the tails are.
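For reference, the two indices discussed in this section can be written as standardized moments. These are the standard population definitions; the symbols $\mu$, $\sigma$, and $\gamma_1$ are notation introduced here, not taken from the slides, and software packages may apply slightly different sample corrections.

```latex
\gamma_1 = \operatorname{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right],
\qquad
\operatorname{Kurt}[X] = \operatorname{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right],
\qquad
\text{excess kurtosis} = \operatorname{Kurt}[X] - 3 .
```

For a normal distribution, $\gamma_1 = 0$ and $\operatorname{Kurt}[X] = 3$, so its excess kurtosis is 0.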
Distribution Index - skewness 1/n
■ In probability theory and statistics, the skew normal distribution is a continuous probability
distribution that generalizes the normal distribution to allow for non-zero skewness.
■ If the skewness lies in the interval [-1, 1], the data distribution is fairly symmetric. A normal
distribution is perfectly symmetric, and its skewness is 0, as illustrated. While the normal distribution is
one of the most common forms of distribution, not all data sets follow this basic curve.
■ The data distribution is right skewed if its skewness > 1, and is left skewed if its skewness < -1.
■ Note that if your data does not follow a normal distribution, this does not mean you need to
transform it into a normal distribution.
Distribution Index - skewness 2/n
■ Data can be "skewed", meaning it tends to have a long tail on one side or the other.
■ Why is it called negative/positive skew? Because the long "tail" is on the negative/positive side
of the peak.
■ The normal distribution has no skew.
Distribution Index - skewness 3/n
■ As the skewness() output shows, the data distributions of the variables Holders and Claims both
have a long tail on the right side, since their skewness values are both greater than 1.
■ Namely, the density of each distribution has a long tail on the right side.
■ Skewness quantifies the extent to which a distribution differs from a symmetric distribution such
as the normal. In a right-skewed distribution, the mean is usually (but not always) greater
than the median and the mode, and the skewness is greater than zero; the reverse holds for a
left-skewed distribution.
■ To summarize, generally if the distribution of data is skewed to the left, the mean is less than
the median, which is often less than the mode. If the distribution of data is skewed to the right,
the mode is often less than the median, which is less than the mean.
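The skewness check on Holders and Claims can be sketched in base R. The helper functions `moment_skewness()` and `excess_kurtosis()` are hypothetical names written for this example; they use one common moment-based convention, and the values returned by `timeDate::skewness()` and `timeDate::kurtosis()` may differ slightly depending on the method selected there.

```r
library(MASS)
data(Insurance)

# Moment-based sample skewness: mean cubed deviation divided by sd^3.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Moment-based excess kurtosis: mean fourth-power deviation divided by
# the squared variance, minus 3 (the value for a normal distribution).
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^4) / var(x)^2 - 3
}

skew_holders <- moment_skewness(Insurance$Holders)
skew_claims  <- moment_skewness(Insurance$Claims)
print(c(Holders = skew_holders, Claims = skew_claims))

# Right skew also shows up as mean > median.
print(c(mean = mean(Insurance$Claims), median = median(Insurance$Claims)))

# Compare with the timeDate implementations, if the package is available.
if (requireNamespace("timeDate", quietly = TRUE)) {
  print(timeDate::skewness(Insurance$Holders))
  print(timeDate::kurtosis(Insurance$Holders))
}
```

A perfectly symmetric sample such as `c(-1, 0, 1)` gives a skewness of 0 under this definition.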
Distribution Index - skewness 4/n
■ Before using md.pattern(), we first build a small data set with randomly placed missing values
(i.e., 64 x 5 elements). We use a loop to set 10 values to NA (i.e., missing values) and store them
in Insurance[ ]. Then, we display and retrieve the missing values of Insurance[ ] by using md.pattern().
■ A "1" in the pattern table indicates no missing value.
A "0" indicates a missing value.
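The missing-value setup described above might look like the sketch below. The original loop is not reproduced in the text, so this is an assumed reconstruction: the seed, the copy named `ins_na`, and the choice to draw 10 distinct cells are all illustrative. md.pattern() is provided by the mice package, so that call runs only if mice is installed.

```r
library(MASS)
data(Insurance)

set.seed(1)            # arbitrary seed, only for reproducibility
ins_na <- Insurance    # work on a copy so the original stays intact

n_rows <- nrow(ins_na)
n_cols <- ncol(ins_na)

# Draw 10 distinct cell positions out of the 64 x 5 cells.
cells <- sample(n_rows * n_cols, 10)

# Loop over the chosen cells and set each one to NA.
for (cell in cells) {
  row_idx <- (cell - 1) %% n_rows + 1
  col_idx <- (cell - 1) %/% n_rows + 1
  ins_na[row_idx, col_idx] <- NA
}

# Number of missing values per variable, and in total.
na_per_column <- colSums(is.na(ins_na))
print(na_per_column)
print(sum(is.na(ins_na)))

# Missing-data pattern table: 1 = observed, 0 = missing.
if (requireNamespace("mice", quietly = TRUE)) {
  print(mice::md.pattern(ins_na, plot = FALSE))
}
```

Sampling distinct cells guarantees exactly 10 missing values; drawing a row and a column independently on each iteration could hit the same cell twice and produce fewer.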
Visual Exploration