Statistics & Machine Learning: Team 2


STATISTICS & MACHINE LEARNING: TEAM 2
MEMBERS OF TEAM 2

Eva Priscilla Simorangkir
Liana Indah Sari
Nisrina Syifa Syauqiyah
Putri Nur Alifah
Rizky Eka Purnama
OUTLINE

1. Probability
2. Parametric Test
3. Non-Parametric Test
4. Preprocessing
PROBABILITY TERMINOLOGY

Probability is a measure of how likely an event is to occur. An understanding of probability is the basis for the Inferential Statistics material.

Experiment: a process that produces a count, measurement, or response. In the case above, the coin toss is an experiment.

Outcome: the result of an experiment. In the case above, if the result is head, then the outcome is "head".

Sample Space: the set of all possible outcomes of a probability experiment. In this case, it is the set of results of the 10 tosses.

Event: a subset of the sample space; it can be one or more outcomes. In the case above, "tail" is a possible event in this experiment.
CASE:
If we pick a record at random, what is the probability that its Neighborhood is Old Town?

CONDITIONAL PROBABILITY
Conditional probability is the probability that an event will occur, given that another event has already occurred.

Independent events: the occurrence of one event does not affect the probability of the second event occurring.

Dependent events: the occurrence of one event does affect the probability of the second event occurring.
MULTIPLICATION RULE

The probability of two events (A and B) occurring in sequence:

1. If the two events (A and B) are dependent, the general form applies:
P(A and B) = P(A) × P(B|A)

2. If the two events (A and B) are independent, it simplifies to:
P(A and B) = P(A) × P(B)
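As a minimal sketch of these rules (the two-toss records below are hypothetical toy data, not the deck's dataset), pandas can verify that P(A and B) = P(A) × P(B|A):

import pandas as pd

# Hypothetical records of two coin tosses per trial.
df = pd.DataFrame({
    "first":  ["head", "head", "tail", "tail", "head", "tail"],
    "second": ["head", "tail", "head", "tail", "tail", "head"],
})

p_a = (df["first"] == "head").mean()                                      # P(A)
p_b_given_a = (df.loc[df["first"] == "head", "second"] == "head").mean()  # P(B|A)
p_a_and_b = ((df["first"] == "head") & (df["second"] == "head")).mean()   # P(A and B)

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B|A)
assert abs(p_a_and_b - p_a * p_b_given_a) < 1e-9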
MUTUALLY AND NON-MUTUALLY EXCLUSIVE

1. Mutually exclusive is the condition in which event A and event B cannot occur at the same time, so P(A or B) = P(A) + P(B).

2. Non-mutually exclusive is the opposite condition, namely that events A and B can occur at the same time, so P(A or B) = P(A) + P(B) - P(A and B).
PROBABILITY DISTRIBUTION
A statistical function that describes all the possible values a random variable can take, together with the probability of each value occurring.

In statistics, a probability distribution has several properties:
1. Added together, all the probabilities in the distribution equal 1, so each individual probability can only lie between 0 and 1.
2. Depending on the type of variable we have, the type of probability distribution differs:
- Continuous probability distributions for continuous (quantitative) variables
- Discrete probability distributions for discrete (categorical) variables
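A small sketch with scipy.stats (an assumption; the slides show no code here) illustrating property 1 for both kinds of distribution:

import numpy as np
from scipy import stats

# Discrete: the binomial PMF over all possible outcomes sums to 1.
n, p = 10, 0.5                                         # e.g. 10 coin tosses
print(stats.binom.pmf(np.arange(n + 1), n, p).sum())   # ~1.0

# Continuous: the normal PDF integrates to 1, approximated here
# by the CDF evaluated over a very wide interval.
print(stats.norm.cdf(10) - stats.norm.cdf(-10))        # ~1.0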
CONTINUOUS FEATURES

1. Pearson (linear correlation): used to measure linearity; it works best when the data are normally distributed and the relationship is linear.

2. Spearman: an application of correlation coefficients in non-parametric statistical data analysis, based on ranks.

3. Kendall's Tau: a non-parametric statistic that requires a data measurement scale of at least ordinal.
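A minimal sketch of all three coefficients with scipy.stats (the arrays x and y are hypothetical toy data):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p_r = stats.pearsonr(x, y)        # Pearson: linear correlation (parametric)
rho, p_rho = stats.spearmanr(x, y)   # Spearman: rank correlation (non-parametric)
tau, p_tau = stats.kendalltau(x, y)  # Kendall: ordinal association (non-parametric)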
CATEGORICAL FEATURES

1. Cramér's V: a correlation measure used to determine the strength of association between nominal variables.

2. Theil's U: almost the same as Cramér's V; the difference is that Theil's U covers Cramér's V's weakness (it is asymmetric, so it preserves the direction of the association).

3. Point-Biserial Correlation: an appropriate method to analyze the strength of the relationship between a binary variable and a continuous variable.
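scipy has no built-in Cramér's V or Theil's U. The sketch below derives Cramér's V from the chi-square statistic (a standard formula) and uses scipy's point-biserial function with toy data; Theil's U needs a conditional-entropy implementation (e.g. the dython package) and is omitted:

import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    # Cramér's V for two nominal variables: 0 = no association, 1 = perfect.
    table = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Point-biserial: a binary variable against a continuous one.
binary = np.array([0, 1, 0, 1, 1, 0, 1, 0])
continuous = np.array([2.0, 5.1, 1.8, 6.3, 4.9, 2.2, 5.5, 1.9])
r_pb, p_val = stats.pointbiserialr(binary, continuous)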
Parametric Tests
A type of statistics that assumes the data follow a normal distribution, e.g. the z-test, t-test, and ANOVA test.

Z-test
A statistical test to determine the hypothesis that a population mean is bigger/smaller with respect to a certain mean from another population.
Requirements:

1. Random sampling
2. Sample size greater than 30
3. Population variance is known
Parametric Tests

t-test: the same as the z-test, but applied to smaller samples (n < 30).

ANOVA test: a statistical test to compare hypotheses across 3 or more populations using their variances.
Sampling

After random sampling, we must compare the sample distribution to the population distribution to answer: "is the sample size large enough?" A sufficient sample is indicated by a similar distribution.

Solution: use a larger sample size to obtain a distribution more similar to the population's.
Z-test
Using statsmodels, we can run a z-test as in the following example.
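The slide's code screenshot is not reproduced here; a minimal equivalent sketch with statsmodels.stats.weightstats.ztest, on simulated data:

import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.2, scale=1.0, size=50)   # n > 30, as required

# H0: the population mean equals 5 (two-sided by default).
z_stat, p_value = ztest(sample, value=5)
print(z_stat, p_value)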
t-test
Using statsmodels, we can run a t-test as in the following examples:

- One-sample t-test
- Two independent samples t-test
- Paired samples t-test
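The slides' screenshots are not reproduced here; a minimal equivalent sketch of all three variants with statsmodels, on simulated small samples:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW, ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(10, 2, size=20)   # n < 30, so a t-test applies
b = rng.normal(11, 2, size=20)

# One-sample: H0: the mean of a equals 10.
t1, p1, df1 = DescrStatsW(a).ttest_mean(10)

# Two independent samples: H0: a and b share a mean.
t2, p2, df2 = ttest_ind(a, b)

# Paired samples: a one-sample test on the pairwise differences.
t3, p3, df3 = DescrStatsW(a - b).ttest_mean(0)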
ANOVA test
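The slide's code image is missing; a minimal sketch with scipy.stats.f_oneway (scipy rather than statsmodels, for brevity), using hypothetical scores from three groups:

from scipy import stats

g1 = [85, 86, 88, 75, 78]
g2 = [81, 92, 94, 89, 88]
g3 = [97, 79, 84, 95, 90]

# H0: all three group means are equal.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)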
Non-Parametric Tests
A type of statistics that does not assume the data follow a normal distribution, e.g. the Mann-Whitney U, Kruskal-Wallis, and Wilcoxon tests.

Wilcoxon
A rank-based statistical test to determine the hypothesis that one population tends to be bigger/smaller than another, comparing medians rather than means.
Wilcoxon signed-rank test

A statistical test to determine the median difference between 2 dependent (paired) samples.
Mann-Whitney U test

A statistical test to determine the median difference between 2 independent samples.
Kruskal-Wallis test

A statistical test to determine the median difference between 3 or more independent samples.
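A minimal sketch of all three tests with scipy.stats (the measurement lists are hypothetical toy data):

from scipy import stats

before = [120, 118, 125, 130, 122, 128]
after = [115, 117, 121, 126, 120, 125]
other = [119, 124, 129, 131, 118, 127]

# Wilcoxon signed-rank: two dependent (paired) samples.
w, p_w = stats.wilcoxon(before, after)

# Mann-Whitney U: two independent samples.
u, p_u = stats.mannwhitneyu(before, other)

# Kruskal-Wallis: three or more independent samples.
h, p_h = stats.kruskal(before, after, other)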
Hypothesis testing summary
Goodness of Fit test
A statistical test to determine whether the sample distribution is similar to a certain type of distribution, e.g. the normal distribution.

D'Agostino's K² test --> a normality test that uses skewness and kurtosis to check for a Gaussian distribution
Shapiro-Wilk test --> a normality test based on the spread of the data, effective for small sample sizes
Quantile-quantile plot --> a normality test that uses quantiles to assess the distribution
Goodness of Fit test

D'Agostino's test and Shapiro-Wilk test
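The slide's screenshots are not reproduced here; a minimal sketch of both tests with scipy.stats on simulated data (scipy's normaltest implements D'Agostino's K²):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=100)

k2, p_k2 = stats.normaltest(data)   # D'Agostino's K²: skewness + kurtosis
w, p_w = stats.shapiro(data)        # Shapiro-Wilk: effective for small samples

# For either test, p > 0.05 means we fail to reject normality.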


Goodness of Fit test

QQ Plot test
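A minimal QQ-plot sketch with statsmodels, on simulated data; points hugging the 45-degree line suggest normality:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = rng.normal(size=200)

sm.qqplot(data, line="45")   # sample quantiles vs. theoretical normal quantiles
plt.show()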
MACHINE LEARNING
Machine learning (ML) is a field of inquiry
devoted to understanding and building
methods that 'learn', that is, methods
that leverage data to improve
performance on some set of tasks. It is
seen as a part of artificial intelligence.
Machine learning algorithms build a
model based on sample data, known as
training data, in order to make
predictions or decisions without being
explicitly programmed to do so.
MACHINE LEARNING VS ARTIFICIAL INTELLIGENCE VS DEEP LEARNING
PREPROCESSING

Data preprocessing is a technique used to convert raw data into digestible data. Its purpose is to assess the level of completeness, quality, and accuracy of the data. Broadly, preprocessing involves data validation and data imputation.
PREPROCESSING STEPS

1. Data Quality Assessment: DType check, null & outlier check, discrepancy check, etc.
2. Data Cleaning: change DType, handle nulls, handle outliers, etc.
3. Data Transformation: normalization, generalization, etc.
4. Data Reduction: dimensionality reduction, numerosity reduction, etc.
DATA QUALITY

Data quality assessment refers to the process of evaluating the initial data. The goal is to ensure that the data fits its intended use, so that its quality is good and appropriate. Good-quality data is of course free from data quality problems such as duplicates, nulls, outliers, inconsistencies, noise, etc. Usually the process only uses simple methods from the pandas library, such as head(), tail(), info(), isna(), etc., as sketched below.
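A minimal assessment sketch (the file name data.csv is hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical dataset

df.head()              # eyeball the first rows and column names
df.info()              # dtypes and non-null counts per column
df.isna().sum()        # null count per column
df.duplicated().sum()  # number of fully duplicated rows
df.describe()          # summary statistics: spot outliers and oddities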
DATA CLEANING

Data cleaning is the process of cleaning the raw data that we assessed in the previous step. There are several things we must do when cleaning data, namely:
1. Datatype transformation
2. Null values handling
3. Outliers handling
4. Reducing discrepancy & noise
NULL VALUES HANDLING

Drop Missing Values
This method has two approaches: a filter combined with notna(), or dropna(). Both have the same function but different characteristics, as sketched below.
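A minimal sketch of both approaches on a toy column:

import pandas as pd

df = pd.DataFrame({"age": [22.0, None, 35.0, None, 41.0]})

# Approach 1: a boolean filter keeps rows where 'age' is not null.
kept = df[df["age"].notna()]

# Approach 2: dropna() drops rows containing nulls;
# subset= limits the check to the chosen columns.
dropped = df.dropna(subset=["age"])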
MISSING VALUES IMPUTATION
We need to do data understanding first, before we finally fill in the missing values. We have to identify whether the missing values are MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random).

If the nulls are MNAR, then we must fill them (with any fill method) with a value such as 0 or 'unknown' that represents the empty data without changing the meaning of the blank.
If the nulls are MCAR, then we can simply drop them, or use ffill() or bfill(), because the missing values are still safe. Usually the missing data is small, < 3%.
If the nulls are MAR, then we must fill in the missing values as precisely as possible, so that the intent of the data is still conveyed properly; for example, by filtering the data according to information implied in other columns of the same dataset, then filling the null values with the mean, median, or mode, following the type and distribution of the data. This method is often called proxying.
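For the MCAR case, the simple options look like this (toy series):

import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

s.dropna()   # drop the few missing rows (safe when < 3% are missing)
s.ffill()    # forward-fill from the previous observation
s.bfill()    # backward-fill from the next observation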
PROXYING

For example, from the data above, we decide through data understanding to proxy the Sex and Pclass columns to find the value of the Age column.

Because the data is highly skewed, the value we choose for imputation is the median.

From the resulting table, we can immediately read off the "appropriate" value to fill into the Age column. When:
Sex = female and Pclass = 1, then Age = 35 years old
Sex = female and Pclass = 2, then Age = 28 years old
etc.

There are a number of ways to implement this, but generally one can use loops or manual filters, as sketched below.
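Instead of loops or manual filters, a groupby-based sketch achieves the same proxying in one step (Titanic-style toy data; the column names are assumed from the example above):

import pandas as pd

df = pd.DataFrame({
    "Sex":    ["female", "female", "male", "female", "male", "male"],
    "Pclass": [1, 1, 3, 2, 3, 3],
    "Age":    [35.0, None, 22.0, None, 28.0, None],
})

# Fill each missing Age with the median Age of passengers
# sharing the same Sex and Pclass (the proxy columns).
df["Age"] = df["Age"].fillna(
    df.groupby(["Sex", "Pclass"])["Age"].transform("median")
)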
OUTLIERS

An outlier is a single data point that lies far outside the average value of a group of statistics.
DETECTING OUTLIERS

1. Boxplot
2. Scatterplot
3. Z-Score
4. Interquartile Range (IQR)

The Z-score and IQR rules are sketched in code after the plots below.
DETECTING OUTLIERS
Plots: boxplot, scatterplot, jointplot, and Z-score.
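A minimal sketch of the two numerical rules (toy series; the cutoffs of 1.5 × IQR and 2 standard deviations are common conventions):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 looks suspicious

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: flag points far from the mean in standard-deviation
# units (a cutoff of 2 or 3 is typical).
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]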
HANDLING OUTLIERS

Outliers should be discarded when the situation allows for re-collection of the data and when the outliers do not represent the data set at all. However, when 10% of your data consists of outliers, it is better not to remove them; instead, you need to look further into them.
THANK YOU!
Have a great day ahead.
