Statistics & Machine Learning: Team 2


STATISTICS & MACHINE LEARNING: TEAM 2
MEMBERS OF TEAM 2

Eva Priscilla Simorangkir
Liana Indah Sari
Nisrina Syifa Syauqiyah
Putri Nur Alifah
Rizky Eka Purnama
OUTLINE

1. Probability
2. Parametric Test
3. Non-Parametric Test
4. Preprocessing
PROBABILITY TERMINOLOGY

Probability is a measure of how likely an event is to occur. An understanding of probability is the basis for the Inferential Statistics material.

Experiment: a process that produces a count, measurement, or response. In the case above, the coin toss is an experiment.

Outcome: the result of an experiment. In the case above, if the result is head, then the outcome is "head".

Sample Space: the set of all possible outcomes of a probability experiment. In this case, it is the set of results of the 10 tosses.

Event: a subset of the sample space; it can be one or more outcomes. In the case above, "tail" is a possible event in this experiment.
CASE:
If we pick a record at random, what is the probability that its Neighborhood is Old Town?

CONDITIONAL PROBABILITY
Conditional probability is the probability that an event will occur, given that another event has already occurred.

Independent events: the occurrence of one event does not affect the probability of the second event occurring.

Dependent events: the occurrence of one event does affect the probability of the second event occurring.
MULTIPLICATION RULE

The probability of two events (A and B) occurring in sequence:

1. If the two events (A and B) are dependent, the general form applies:
P(A and B) = P(A) × P(B|A)

2. If the two events (A and B) are independent, it simplifies to:
P(A and B) = P(A) × P(B)
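As a minimal sketch of these rules (the two-toss records below are hypothetical toy data, not the deck's dataset), pandas can verify that P(A and B) = P(A) × P(B|A):

import pandas as pd

# Hypothetical records of two coin tosses per trial.
df = pd.DataFrame({
    "first":  ["head", "head", "tail", "tail", "head", "tail"],
    "second": ["head", "tail", "head", "tail", "tail", "head"],
})

p_a = (df["first"] == "head").mean()                                      # P(A)
p_b_given_a = (df.loc[df["first"] == "head", "second"] == "head").mean()  # P(B|A)
p_a_and_b = ((df["first"] == "head") & (df["second"] == "head")).mean()   # P(A and B)

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B|A)
assert abs(p_a_and_b - p_a * p_b_given_a) < 1e-9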
MUTUALLY AND NON-MUTUALLY EXCLUSIVE

1. Mutually exclusive is the condition in which event A and event B cannot occur at the same time, so P(A or B) = P(A) + P(B).

2. Non-mutually exclusive is the opposite condition, namely that events A and B can occur at the same time, so P(A or B) = P(A) + P(B) - P(A and B).
PROBABILITY DISTRIBUTION
A statistical function that describes all the possible values a random variable can take, together with the probability of each value occurring.

In statistics, a probability distribution has several properties:
1. Added together, all the probabilities in the distribution equal 1, so each individual probability can only lie between 0 and 1.
2. Depending on the type of variable we have, the type of probability distribution differs:
- Continuous probability distributions for continuous (quantitative) variables
- Discrete probability distributions for discrete (categorical) variables
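A small sketch with scipy.stats (an assumption; the slides show no code here) illustrating property 1 for both kinds of distribution:

import numpy as np
from scipy import stats

# Discrete: the binomial PMF over all possible outcomes sums to 1.
n, p = 10, 0.5                                         # e.g. 10 coin tosses
print(stats.binom.pmf(np.arange(n + 1), n, p).sum())   # ~1.0

# Continuous: the normal PDF integrates to 1, approximated here
# by the CDF evaluated over a very wide interval.
print(stats.norm.cdf(10) - stats.norm.cdf(-10))        # ~1.0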
CONTINUOUS FEATURES

1. Pearson (linear correlation): used to measure linearity; it works best when the data are normally distributed and the relationship is linear.

2. Spearman: an application of correlation coefficients in non-parametric statistical data analysis, based on ranks.

3. Kendall's Tau: a non-parametric statistic that requires a data measurement scale of at least ordinal.
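A minimal sketch of all three coefficients with scipy.stats (the arrays x and y are hypothetical toy data):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p_r = stats.pearsonr(x, y)        # Pearson: linear correlation (parametric)
rho, p_rho = stats.spearmanr(x, y)   # Spearman: rank correlation (non-parametric)
tau, p_tau = stats.kendalltau(x, y)  # Kendall: ordinal association (non-parametric)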
CATEGORICAL FEATURES

1. Cramér's V: a correlation measure used to determine the strength of association between nominal variables.

2. Theil's U: almost the same as Cramér's V; the difference is that Theil's U covers Cramér's V's weakness (it is asymmetric, so it preserves the direction of the association).

3. Point-Biserial Correlation: an appropriate method to analyze the strength of the relationship between a binary variable and a continuous variable.
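scipy has no built-in Cramér's V or Theil's U. The sketch below derives Cramér's V from the chi-square statistic (a standard formula) and uses scipy's point-biserial function with toy data; Theil's U needs a conditional-entropy implementation (e.g. the dython package) and is omitted:

import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    # Cramér's V for two nominal variables: 0 = no association, 1 = perfect.
    table = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Point-biserial: a binary variable against a continuous one.
binary = np.array([0, 1, 0, 1, 1, 0, 1, 0])
continuous = np.array([2.0, 5.1, 1.8, 6.3, 4.9, 2.2, 5.5, 1.9])
r_pb, p_val = stats.pointbiserialr(binary, continuous)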
Parametric Tests
A type of statistics that assumes the data follow a normal distribution, e.g. the z-test, t-test, and ANOVA test.

Z-test
A statistical test to determine the hypothesis that a population mean is bigger/smaller with respect to a certain mean from another population.
Requirements:

1. Random sampling
2. Sample size greater than 30
3. Population variance is known
Parametric Tests

t-test: the same as the z-test, but applied to smaller samples (n < 30).

ANOVA test: a statistical test to compare hypotheses across 3 or more populations using their variances.
Sampling

After random sampling, we must compare the sample distribution to the population distribution to answer: "is the sample size large enough?" A sufficient sample is indicated by a similar distribution.

Solution: use a larger sample size to obtain a distribution more similar to the population's.
Z-test
Using statsmodels, we can run a z-test as in the following example.
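The slide's code screenshot is not reproduced here; a minimal equivalent sketch with statsmodels.stats.weightstats.ztest, on simulated data:

import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.2, scale=1.0, size=50)   # n > 30, as required

# H0: the population mean equals 5 (two-sided by default).
z_stat, p_value = ztest(sample, value=5)
print(z_stat, p_value)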
t-test
Using statsmodels, we can run a t-test as in the following examples:

- One-sample t-test
- Two independent samples t-test
- Paired samples t-test
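The slides' screenshots are not reproduced here; a minimal equivalent sketch of all three variants with statsmodels, on simulated small samples:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW, ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(10, 2, size=20)   # n < 30, so a t-test applies
b = rng.normal(11, 2, size=20)

# One-sample: H0: the mean of a equals 10.
t1, p1, df1 = DescrStatsW(a).ttest_mean(10)

# Two independent samples: H0: a and b share a mean.
t2, p2, df2 = ttest_ind(a, b)

# Paired samples: a one-sample test on the pairwise differences.
t3, p3, df3 = DescrStatsW(a - b).ttest_mean(0)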
ANOVA test
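The slide's code image is missing; a minimal sketch with scipy.stats.f_oneway (scipy rather than statsmodels, for brevity), using hypothetical scores from three groups:

from scipy import stats

g1 = [85, 86, 88, 75, 78]
g2 = [81, 92, 94, 89, 88]
g3 = [97, 79, 84, 95, 90]

# H0: all three group means are equal.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)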
Non-Parametric Tests
A type of statistics that does not assume the data follow a normal distribution, e.g. the Mann-Whitney U, Kruskal-Wallis, and Wilcoxon tests.

Wilcoxon
A rank-based statistical test to determine the hypothesis that one population tends to be bigger/smaller than another, comparing medians rather than means.
Wilcoxon signed-rank test

A statistical test to determine the median difference between 2 dependent (paired) samples.
Mann-Whitney U test

A statistical test to determine the median difference between 2 independent samples.
Kruskal-Wallis test

A statistical test to determine the median difference between 3 or more independent samples.
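A minimal sketch of all three tests with scipy.stats (the measurement lists are hypothetical toy data):

from scipy import stats

before = [120, 118, 125, 130, 122, 128]
after = [115, 117, 121, 126, 120, 125]
other = [119, 124, 129, 131, 118, 127]

# Wilcoxon signed-rank: two dependent (paired) samples.
w, p_w = stats.wilcoxon(before, after)

# Mann-Whitney U: two independent samples.
u, p_u = stats.mannwhitneyu(before, other)

# Kruskal-Wallis: three or more independent samples.
h, p_h = stats.kruskal(before, after, other)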
Hypothesis testing summary
Goodness of Fit test
A statistical test to determine whether the sample distribution is similar to a certain type of distribution, e.g. the normal distribution.

D'Agostino's K² test --> a normality test that uses skewness and kurtosis to check for a Gaussian distribution
Shapiro-Wilk test --> a normality test based on the spread of the data, effective for small sample sizes
Quantile-quantile plot --> a normality test that uses quantiles to assess the distribution
Goodness of Fit test

D'Agostino's test and Shapiro-Wilk test
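The slide's screenshots are not reproduced here; a minimal sketch of both tests with scipy.stats on simulated data (scipy's normaltest implements D'Agostino's K²):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=100)

k2, p_k2 = stats.normaltest(data)   # D'Agostino's K²: skewness + kurtosis
w, p_w = stats.shapiro(data)        # Shapiro-Wilk: effective for small samples

# For either test, p > 0.05 means we fail to reject normality.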


Goodness of Fit test

QQ Plot test
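A minimal QQ-plot sketch with statsmodels, on simulated data; points hugging the 45-degree line suggest normality:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = rng.normal(size=200)

sm.qqplot(data, line="45")   # sample quantiles vs. theoretical normal quantiles
plt.show()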
MACHINE LEARNING
Machine learning (ML) is a field of inquiry
devoted to understanding and building
methods that 'learn', that is, methods
that leverage data to improve
performance on some set of tasks. It is
seen as a part of artificial intelligence.
Machine learning algorithms build a
model based on sample data, known as
training data, in order to make
predictions or decisions without being
explicitly programmed to do so.
MACHINE LEARNING VS ARTIFICIAL INTELLIGENCE VS DEEP LEARNING
PREPROCESSING

Data preprocessing is a technique used to convert raw data into digestible data. Its purpose is to assess the level of completeness, quality, and accuracy of the data. Broadly, preprocessing involves data validation and data imputation.
PREPROCESSING STEPS

1. Data Quality Assessment: DType check, null & outlier check, discrepancy check, etc.
2. Data Cleaning: change DType, handle nulls, handle outliers, etc.
3. Data Transformation: normalization, generalization, etc.
4. Data Reduction: dimensionality reduction, numerosity reduction, etc.
DATA QUALITY

Data quality assessment refers to the process of evaluating the initial data. The goal is to ensure that the data fits its intended use, so that its quality is good and appropriate. Good-quality data is of course free from data quality problems such as duplicates, nulls, outliers, inconsistencies, noise, etc. Usually the process only uses simple methods from the pandas library, such as head(), tail(), info(), isna(), etc., as sketched below.
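A minimal assessment sketch (the file name data.csv is hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical dataset

df.head()              # eyeball the first rows and column names
df.info()              # dtypes and non-null counts per column
df.isna().sum()        # null count per column
df.duplicated().sum()  # number of fully duplicated rows
df.describe()          # summary statistics: spot outliers and oddities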
DATA CLEANING

Data cleaning is the process of cleaning the raw data that we assessed in the previous step. There are several things we must do when cleaning data, namely:
1. Datatype transformation
2. Null values handling
3. Outliers handling
4. Reducing discrepancy & noise
NULL VALUES HANDLING

Drop Missing Values
This method has two approaches: a filter combined with notna(), or dropna(). Both have the same function but different characteristics, as sketched below.
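A minimal sketch of both approaches on a toy column:

import pandas as pd

df = pd.DataFrame({"age": [22.0, None, 35.0, None, 41.0]})

# Approach 1: a boolean filter keeps rows where 'age' is not null.
kept = df[df["age"].notna()]

# Approach 2: dropna() drops rows containing nulls;
# subset= limits the check to the chosen columns.
dropped = df.dropna(subset=["age"])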
MISSING VALUES IMPUTATION
We need to do data understanding first, before we finally fill in the missing values. We have to identify whether the missing values are MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random).

If the nulls are MNAR, then we must fill them (with any fill method) with a value such as 0 or 'unknown' that represents the empty data without changing the meaning of the blank.
If the nulls are MCAR, then we can simply drop them, or use ffill() or bfill(), because the missing values are still safe. Usually the missing data is small, < 3%.
If the nulls are MAR, then we must fill in the missing values as precisely as possible, so that the intent of the data is still conveyed properly; for example, by filtering the data according to information implied in other columns of the same dataset, then filling the null values with the mean, median, or mode, following the type and distribution of the data. This method is often called proxying.
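For the MCAR case, the simple options look like this (toy series):

import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])

s.dropna()   # drop the few missing rows (safe when < 3% are missing)
s.ffill()    # forward-fill from the previous observation
s.bfill()    # backward-fill from the next observation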
PROXYING

For example, from the data above, we decide through data understanding to proxy the Sex and Pclass columns to find the value of the Age column.

Because the data is highly skewed, the value we choose for imputation is the median.

From the resulting table, we can immediately read off the "appropriate" value to fill into the Age column. When:
Sex = female and Pclass = 1, then Age = 35 years old
Sex = female and Pclass = 2, then Age = 28 years old
etc.

There are a number of ways to implement this, but generally one can use loops or manual filters, as sketched below.
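Instead of loops or manual filters, a groupby-based sketch achieves the same proxying in one step (Titanic-style toy data; the column names are assumed from the example above):

import pandas as pd

df = pd.DataFrame({
    "Sex":    ["female", "female", "male", "female", "male", "male"],
    "Pclass": [1, 1, 3, 2, 3, 3],
    "Age":    [35.0, None, 22.0, None, 28.0, None],
})

# Fill each missing Age with the median Age of passengers
# sharing the same Sex and Pclass (the proxy columns).
df["Age"] = df["Age"].fillna(
    df.groupby(["Sex", "Pclass"])["Age"].transform("median")
)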
OUTLIERS

An outlier is a single data point that lies far outside the average value of a group of statistics.
DETECTING OUTLIERS

1. Boxplot
2. Scatterplot
3. Z-Score
4. Interquartile Range (IQR)

The Z-score and IQR rules are sketched in code after the plots below.
DETECTING OUTLIERS
Plots: boxplot, scatterplot, jointplot, and Z-score.
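A minimal sketch of the two numerical rules (toy series; the cutoffs of 1.5 × IQR and 2 standard deviations are common conventions):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 looks suspicious

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: flag points far from the mean in standard-deviation
# units (a cutoff of 2 or 3 is typical).
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]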
HANDLING OUTLIERS

Outliers should be discarded when the situation allows for re-collection of the data and when the outliers do not represent the data set at all. However, when 10% of your data consists of outliers, it is better not to remove them; instead, you need to look further into them.
THANK YOU!
Have a great day ahead.
