Statistics & Machine Learning: Team 2
MACHINE
LEARNING TEAM
2
MEMBER OF TEAM 2
Eva Priscilla
Liana Indah Nisrina Syifa Putri Nur Rizky Eka
Simorangkir Sari Syauqiyah Alifah Purnama
OUTLINE
1 Probability
2 Parametric Test
3 Non-Parametric Test
4 Preprocessing
PROBABILITY TERMINOLOGY
An understanding of probability is the basis for the Inferential Statistics material.
Experiment is a process that results in a probability of an event. A coin toss is an experiment.
Outcome is the result of an experiment. In the above case, if the result is heads, then the outcome is "heads".
Sample Space is the set of all possible outcomes in an experiment.
1
dependent, then it can be simplified
to
2 Spearman
Spearman correlation is one of the applications of correlation coefficients in non-parametric statistical data analysis methods.
1 Cramér's V
Cramér's V is one of the correlation measures used to determine the strength of the relationship in nominal data.
2 Theil's U
Theil's U is almost the same as Cramér's V; the difference is that Theil's U covers the weakness of Cramér's V.
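The two measures above that ship with SciPy can be sketched as follows. This is a minimal illustration, not the team's own code: the arrays and the contingency table are made-up values, and `scipy.stats.contingency.association` assumes SciPy 1.7 or newer. Theil's U is left out because SciPy has no ready-made function for it.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import association

# Spearman: rank correlation between two numeric (or ordinal) variables.
# y grows monotonically with x but not linearly (made-up values).
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 9, 20, 50])
rho, p_value = stats.spearmanr(x, y)

# Cramér's V: strength of association between two nominal variables,
# computed from their contingency table (made-up counts).
table = np.array([[30, 10],
                  [10, 30]])
cramers_v = association(table, method="cramer")

print(f"Spearman rho = {rho:.3f}, Cramér's V = {cramers_v:.3f}")
```

Spearman only looks at the ordering of the values, so a perfectly monotonic relationship scores the maximum even when it is not linear.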
Parametric Tests
1. Random sampling
2. Sample size is more than 30
3. Population variance can be determined
Wilcoxon
A type of statistical test for the hypothesis that one population is larger or smaller relative to a certain mean of the data from another population.
Wilcoxon signed-rank test
Statistical test to determine the median difference between 2 dependent data sets.
Mann-Whitney U test
Statistical test to determine the median difference between 2 independent data sets.
Kruskal-Wallis test
Statistical test to determine the median difference between 3 or more independent data sets.
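The three tests above map onto functions in `scipy.stats`. The sketch below is only an illustration under assumed data: the samples are randomly generated, and the group names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two dependent samples: the same 20 subjects measured before and after.
before = rng.normal(50, 5, size=20)
after = before + rng.normal(2, 1, size=20)
w_stat, w_p = stats.wilcoxon(before, after)

# Two independent samples.
group_a = rng.normal(50, 5, size=20)
group_b = rng.normal(55, 5, size=20)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Three independent samples.
group_c = rng.normal(60, 5, size=20)
h_stat, h_p = stats.kruskal(group_a, group_b, group_c)

for name, p in [("Wilcoxon signed-rank", w_p),
                ("Mann-Whitney U", u_p),
                ("Kruskal-Wallis", h_p)]:
    print(f"{name}: p = {p:.4f}")
```

A small p-value (commonly below 0.05) is read as evidence that the groups differ.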
Hypothesis testing summary
Goodness of Fit test
Statistical test to determine whether the sample distribution is similar to a certain type of distribution, such as the normal distribution.
D'Agostino's K² test --> normality test that uses skewness and kurtosis to determine whether the data follow a Gaussian distribution
Shapiro-Wilk test --> normality test that examines the spread of the data and is effective for a small number of samples
Quantile-quantile plot --> normality check that uses quantiles to determine the distribution
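The three normality checks listed above can be run with `scipy.stats`. This is a minimal sketch on a randomly generated sample (an assumption, not data from the slides); `probplot` is used here only to compute the Q-Q points, without drawing the plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # made-up sample

# D'Agostino's K^2 test: combines skewness and kurtosis.
k2_stat, k2_p = stats.normaltest(sample)

# Shapiro-Wilk test: often preferred for small samples.
sw_stat, sw_p = stats.shapiro(sample)

# Q-Q plot data: theoretical normal quantiles vs. ordered sample values,
# plus a least-squares line; r close to 1 suggests normality.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)

print(f"K^2 p = {k2_p:.3f}, Shapiro-Wilk p = {sw_p:.3f}, Q-Q r = {r:.3f}")
```

For both tests, a p-value above the chosen significance level means normality is not rejected.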
Goodness of Fit test
QQ Plot test
MACHINE
LEARNING
Machine learning (ML) is a field of inquiry
devoted to understanding and building
methods that 'learn', that is, methods
that leverage data to improve
performance on some set of tasks. It is
seen as a part of artificial intelligence.
Machine learning algorithms build a
model based on sample data, known as
training data, in order to make
predictions or decisions without being
explicitly programmed to do so.
MACHINE LEARNING VS ARTIFICIAL INTELLIGENCE
VS DEEP LEARNING
PREPROCESSING
Information, DType check, Null & Outlier, Discrepancy, etc.
Change DType, Handling Null, Handling Outlier, etc.
Normalization, Generalization, etc.
Dimensionality reduction, Numerosity reduction, etc.
DATA QUALITY
If the nulls are MNAR (Missing Not At Random), we must fill them (with any fill method) with a value such as 0 or 'unknown' that represents the empty data without changing the meaning of the blank.
If the nulls are MCAR (Missing Completely At Random), we can simply drop them or use ffill() or bfill(), because the missing values are still safe to handle this way. Usually the share of missing data is small (<3%).
If the nulls are MAR (Missing At Random), we must fill in the missing values as precisely as possible so that the intent of the data is still conveyed properly. For example, filter the data according to the information implied in other columns of the same dataset, then fill the null values with the mean, median, or mode, chosen according to the type and distribution of the data. This method is often called proxying.
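The MNAR and MCAR cases can be sketched in pandas as follows. The DataFrame, its column names, and the decision about which column is MNAR or MCAR are all assumptions made for illustration; in practice that decision comes from data understanding.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are made up.
df = pd.DataFrame({
    "cabin": ["C85", None, "E46", None],    # assumed MNAR: the blank is meaningful
    "fare": [7.25, np.nan, 8.05, 53.10],    # assumed MCAR: missing by chance
})

# MNAR: fill with a placeholder that keeps the meaning of "empty".
df["cabin"] = df["cabin"].fillna("unknown")

# MCAR: either forward-fill from the previous row ...
filled = df["fare"].ffill()

# ... or drop the rows where the value is missing.
dropped = df.dropna(subset=["fare"])

print(df["cabin"].tolist())
print(filled.tolist())
print(len(dropped))
```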
PROXYING
For example, from the data above, we decide, based on data understanding, to use the Sex and Pclass columns as proxies to find the value of the Age column.
Because the data is highly skewed, the imputation value we choose is the median.
From the table above, we can immediately find the "appropriate" value to fill in the Age column for each group.
When:
Sex = female and Pclass = 1, then Age = 35 years
Sex = female and Pclass = 2, then Age = 28 years
etc.
There are a number of ways to implement this, but generally one can use loops or manual filters.
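Besides loops or manual filters, one possible shortcut is a pandas `groupby` with `transform`, which computes the group median and aligns it back to each row. The sketch below uses a tiny made-up table with the Sex, Pclass, and Age columns mentioned above; the values are not taken from the slides' dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the example above.
df = pd.DataFrame({
    "Sex":    ["female", "female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [30.0, 40.0, np.nan, 20.0, np.nan, 26.0],
})

# Median Age of each (Sex, Pclass) group, repeated for every row of the group.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages with the median of their own group.
df["Age"] = df["Age"].fillna(group_median)

print(df)
```

Here the missing female/Pclass 1 age is filled with that group's median, and likewise for male/Pclass 3, while the existing values stay untouched.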
OUTLIERS
group of statistics.
DETECTING OUTLIERS
1 Boxplot
2 Scatterplot
3 Z-Score
4 Interquartile Range (IQR)
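The two numeric methods in this list, Z-score and IQR, can be sketched with NumPy. The data are made up, and the cut-offs used here (|z| > 3 and 1.5 × IQR) are common conventions assumed for illustration, not values stated in the slides.

```python
import numpy as np

# Made-up data: 19 ordinary values and one obvious outlier (95).
values = np.array([10, 12, 11, 13, 12, 11, 10, 13, 12, 11,
                   12, 10, 13, 11, 12, 12, 11, 13, 10, 95], dtype=float)

# Z-score: distance from the mean in units of standard deviation.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR: flag points beyond 1.5 * IQR from the first or third quartile.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

With very small samples the Z-score rule can miss an outlier, because the outlier itself inflates the standard deviation; the IQR rule is less affected by this.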
Example plots: Boxplot, Scatterplot, Jointplot, Z-Score
HANDLING
OUTLIERS
into.
THANK
YOU!
Have a
great day
ahead.