Lecture-4: Introduction To Data Science

Lecture 4
Introduction to Data Science:
Basics of Statistics and Probability

Statistics and Probability
• Foundation of all machine learning algorithms and data science.
Source: Statistics And Probability Tutorial | Statistics And Probability for Data Science | Edureka
Red Wine Data Set
Statistics is essential for successfully working
through a predictive modeling problem
• Problem Framing
• Data Understanding
• Data Cleaning
• Data Selection
• Data Preparation
• Model Evaluation
• Model Configuration
• Model Selection
• Model Presentation
• Model Predictions
https://machinelearningmastery.com/statistical-methods-in-an-applied-machine-learning-project/
1. Problem Framing
Statistical methods that can aid in the exploration of the data
during the framing of a problem include:
• Exploratory Data Analysis. Summarization and visualization
in order to explore ad hoc views of the data.
• Data Mining. Automatic discovery of structured relationships
and patterns in the data.
2. Data Understanding
Two large branches of statistical methods are used to aid in
understanding data; they are:
• Summary Statistics. Methods used to summarize the
distribution and relationships between variables using
statistical quantities.
• Data Visualization. Methods used to summarize the
distribution and relationships between variables using
visualizations such as charts, plots, and graphs.
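As a rough illustration of summary statistics, the sketch below computes a few common quantities with Python's standard `statistics` module. The function name `summarize` and the sample values are hypothetical, chosen only for this example.

```python
import statistics

def summarize(values):
    """Return a few common summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

# Hypothetical measurements, made up for illustration only.
values = [9.4, 9.8, 9.8, 9.8, 9.4, 10.0, 9.5, 10.5]
summary = summarize(values)
print(summary)
```

The mean and median describe the centre of the distribution, while the standard deviation, minimum, and maximum describe its spread.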
3. Data Cleaning
Observations from a domain are often not pristine.
Some examples include:
• Data corruption.
• Data errors.
• Data loss.
The process of identifying and repairing issues with the data is
called data cleaning.
Statistical methods are used for data cleaning; for example:
• Outlier detection. Methods for identifying observations that
are far from the expected value in a distribution.
• Imputation. Methods for repairing or filling in corrupt or
missing values in observations.
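A minimal sketch of both ideas, assuming the interquartile-range (IQR) rule for outlier detection and median imputation for missing values. These are only two of many possible choices; the function names and data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical data, made up for illustration only.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)

with_gaps = [1.0, None, 3.0, None, 5.0]
filled = impute_median(with_gaps)
print(outliers, filled)
```

The median is used for imputation here because it is less sensitive to extreme values than the mean.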
4. Data Selection
Not all observations or all variables may be relevant when
modeling.
The process of reducing the scope of data to those elements that
are most useful for making predictions is called data selection.
Two types of statistical methods that are used for data selection
include:
• Data Sample. Methods to systematically create smaller
representative samples from larger datasets.
• Feature Selection. Methods to automatically identify those
variables that are most relevant to the outcome variable.
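One simple way to sketch feature selection is to rank each variable by the absolute value of its Pearson correlation with the outcome. This is an illustrative assumption, not the only method: it captures only linear relationships. The feature names `x1`, `x2`, `x3` and all values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_features(features, target):
    """Return feature names ordered from most to least correlated with target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical variables and outcome, made up for illustration only.
features = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [3, 1, 4, 1, 5],
}
target = [2, 4, 6, 8, 10]
ranking = rank_features(features, target)
print(ranking)
```

A feature with constant values would cause a division by zero in this sketch and should be dropped beforehand.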
5. Data Preparation
• Data often cannot be used directly for modeling.
• Some transformation is often required to change
the shape or structure of the data to make it more suitable
for the chosen framing of the problem or learning
algorithms.
• Data preparation is performed using statistical methods.
Some common examples include:
• Scaling. Methods such as standardization and
normalization.
• Encoding. Methods such as integer encoding and one-hot
encoding.
• Transforms. Methods such as power transforms like the
Box-Cox method.
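The scaling and encoding methods above can be sketched in plain Python as follows. The function names are hypothetical, and standardization here uses the population standard deviation, which is one common convention.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def integer_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels]

def one_hot_encode(labels):
    """Map each label to a 0/1 vector with one position per distinct label."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Hypothetical inputs, made up for illustration only.
scaled = normalize([2, 4, 6])
z_scores = standardize([2, 4, 6])
codes = integer_encode(["red", "white", "red"])
one_hot = one_hot_encode(["red", "white", "red"])
print(scaled, z_scores, codes, one_hot)
```

Both scaling functions assume the values are not all identical; otherwise the divisor would be zero.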
6. Model Evaluation
A crucial part of a predictive modeling problem is evaluating a
learning method.
Generally, the planning of this process of training and evaluating
a predictive model is called experimental design. This is a whole
subfield of statistical methods.
• Experimental Design. Methods to design systematic
experiments to compare the effect of independent variables on
an outcome, such as the choice of a machine learning
algorithm on prediction accuracy.
• Resampling Methods. Methods for systematically splitting a
dataset into subsets for the purposes of training and evaluating a
predictive model.
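As an example of a resampling method, the sketch below implements k-fold cross-validation splits over row indices. The function name `k_fold_splits` and its parameters are hypothetical; libraries such as scikit-learn provide their own versions.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Split 10 hypothetical observations into 5 folds.
splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each observation appears in exactly one test fold, so every row is used for evaluation once and for training k - 1 times.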
7. Model Configuration
• A given machine learning algorithm often has a suite of
hyperparameters that allow the learning method to be tailored
to a specific problem.
• The configuration of the hyperparameters is often
empirical in nature, rather than analytical, requiring large
suites of experiments in order to evaluate the effect of
different hyperparameter values on the skill of the model.
7. Model Configuration (Contd..)
The interpretation and comparison of the results between
different hyperparameter configurations is made using one of two
subfields of statistics, namely:
• Statistical Hypothesis Tests. Methods that quantify the
likelihood of observing the result given an assumption or
expectation about the result (presented using critical values
and p-values).
• Estimation Statistics. Methods that quantify the uncertainty
of a result using confidence intervals.
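To illustrate estimation statistics, the sketch below computes an approximate 95% confidence interval for the mean score of one configuration. It assumes a normal approximation (z = 1.96); for a handful of scores, a t-distribution critical value would usually be more appropriate. The scores are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(scores, z=1.96):
    """Approximate confidence interval for the mean, using a normal approximation."""
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical accuracy scores from repeated evaluations of one configuration.
scores = [0.80, 0.82, 0.78, 0.81, 0.79]
low, high = mean_confidence_interval(scores)
print(f"mean accuracy is roughly between {low:.3f} and {high:.3f}")
```

Computing such an interval for each hyperparameter configuration gives a rough sense of whether an observed difference in skill is larger than the uncertainty in the estimates.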
8. Model Selection
• One among many machine learning algorithms may be
appropriate for a given predictive modeling problem.
• The process of selecting one method as the solution is called
model selection.
• This may involve a suite of criteria both from stakeholders
in the project and the careful interpretation of the
estimated skill of the methods evaluated for the problem.
9. Model Presentation
Once a final model has been trained, it can be presented to
stakeholders prior to being used or deployed to make actual
predictions on real data.
Methods from the field of estimation statistics can be used to
quantify the uncertainty in the estimated skill of the machine
learning model through the use of tolerance intervals and
confidence intervals.
• Estimation Statistics. Methods that quantify the uncertainty
in the skill of a model via confidence intervals.
10. Model Predictions
• Finally, the final model is used to make predictions for
new data where the real outcome is not known.
Summary
• Exploratory data analysis, data summarization, and data
visualizations can be used to help frame your predictive
modeling problem and better understand the data.
• Statistical methods can be used to clean and prepare data
for modeling.
• Statistical hypothesis tests and estimation statistics can
aid in model selection and in presenting the skill and
predictions from final models.
Data Selection
• Feature Reduction
• Data sampling: reduction of data points (objects)
Basic Terminologies of Statistics
Sample & Population
• Sampling is a shortcut for studying an entire population.
• Why do we need sampling?
– Sample selection is a cost-efficient method.
– Analysis of the sample is less cumbersome and more
practical than an analysis of the entire population.
• Example: You want to find the happiness index of a particular
country. What are the necessary steps?
https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/
Steps involved in Sampling
Step 1
• The first stage in the sampling process is to clearly define the
target population.
• So, to carry out opinion polls, polling agencies consider only
the people who are above 18 years of age and are eligible to
vote in the population.
Step 2
• Sampling Frame – It is a list of items or people forming a
population from which the sample is taken.
• So, the sampling frame would be the list of all the people
whose names appear on the voter list of a constituency.
Step 3
• Sampling Technique – Generally, probability sampling methods
are used because every vote has equal value and any person can
be included in the sample irrespective of their caste, community,
or religion. Different samples are taken from different regions
all over the country.
Step 4
• Sample Size – It is the number of individuals or items to be
taken in a sample that would be enough to make inferences about
the population with the desired level of accuracy and precision.
• The larger the sample size, the more accurate our inference
about the population generally is.
• For the polls, agencies try to get as many people as possible of
diverse backgrounds to be included in the sample as it would
help in predicting the number of seats a political party can
win.
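The effect of sample size on precision can be sketched with the standard margin-of-error formula for an estimated proportion, z * sqrt(p(1 - p) / n). This assumes simple random sampling and a 95% confidence level (z = 1.96); the function name is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion p estimated from a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# p = 0.5 is the worst case, giving the widest margin for a given n.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(n), 4))
```

Quadrupling the sample size halves the margin of error, so the gain in precision from each additional respondent shrinks as the sample grows.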
Step 5
• Once the target population, sampling frame, sampling
technique, and sample size have been established, the next
step is to collect data from the sample.
• In opinion polls, agencies generally put questions to the
people, such as which political party they are going to vote for
or whether the previous party has done any work.
• Based on the answers, agencies try to interpret who the people
of a constituency are going to vote for and approximately how
many seats a political party is going to win.
Random Sampling
Systematic Sampling
• Suppose we have a population of 20, begin with person
number 3, and want a sample size of 5. The sampling interval
is 20/5 = 4, so the next individual selected is person 7
(3 + 4), and so on:
3, 3+4=7, 7+4=11, 11+4=15, 15+4=19, giving the sample 3, 7, 11, 15, 19.
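The same procedure can be sketched in Python. The function name and parameters are hypothetical, and the starting position is assumed to fall within the first interval.

```python
def systematic_sample(population_size, sample_size, start):
    """Select every k-th individual (1-based), where k = population_size // sample_size."""
    step = population_size // sample_size
    if not 1 <= start <= step:
        raise ValueError("start must lie within the first sampling interval")
    return [start + i * step for i in range(sample_size)]

# Population of 20, sample of 5, starting from person number 3.
sample = systematic_sample(20, 5, 3)
print(sample)  # [3, 7, 11, 15, 19]
```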
Stratified Sampling
• In this type of sampling, we divide the population into
subgroups (called strata) based on different traits like
gender, category, etc., and then select the sample(s)
from these subgroups.
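A minimal sketch of stratified sampling, assuming proportional allocation (the same fraction is drawn at random from every stratum). The function name, the record layout, and the data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 20 people, split evenly into two strata.
people = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(1, 21)]
sample = stratified_sample(people, key=lambda p: p["gender"], fraction=0.2)
print(sample)
```

Because each stratum is sampled separately, every subgroup is represented in the sample in proportion to its size in the population.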
Cluster Sampling
• In a clustered sample, we use the subgroups of the population as the
sampling unit rather than individuals. The population is divided into
subgroups, known as clusters, and a whole cluster is randomly selected
to be included in the study.
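A minimal sketch of cluster sampling: whole clusters are chosen at random, and every individual in a chosen cluster is included. The function name, the cluster labels, and the membership lists are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every member of each picked cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population of 20 individuals grouped into 4 clusters.
clusters = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
}
selected = cluster_sample(clusters, n_clusters=2)
print(selected)
```

Unlike stratified sampling, only some subgroups appear in the sample, but those that do appear are included in full.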
Types of Non-Probability Sampling:
Convenience Sampling
• This is perhaps the easiest method of sampling because individuals are
selected based on their availability and willingness to take part.
• Here, let’s say individuals numbered 4, 7, 12, 15 and 20 want to be part of
our sample, and hence, we will include them in the sample.
Types of Non-Probability Sampling: Quota Sampling
• In this type of sampling, we choose items based on predetermined
characteristics of the population. For example, suppose we have to select
individuals whose number is a multiple of four for our sample.
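The multiples-of-four example can be sketched as a simple filter. The population of 20 numbered individuals is an assumption carried over from the other sampling examples.

```python
def quota_sample(population, has_trait):
    """Keep the individuals that match a predetermined characteristic."""
    return [person for person in population if has_trait(person)]

# Assumed population: individuals numbered 1 to 20.
population = list(range(1, 21))
sample = quota_sample(population, has_trait=lambda number: number % 4 == 0)
print(sample)  # [4, 8, 12, 16, 20]
```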
Judgment Sampling
• It is also known as selective sampling. It depends on the
judgment of the experts when choosing whom to ask to
participate.
Snowball Sampling
• Existing people are asked to nominate further people
known to them so that the sample increases in size like
a rolling snowball. This method of sampling is effective
when a sampling frame is difficult to identify.