Lecture-4: Introduction To Data Science
Source: Statistics And Probability Tutorial | Statistics And Probability for Data Science | Edureka
Red Wine Data Set
Statistics is essential for successfully working
through a predictive modeling problem
• Problem Framing
• Data Understanding
• Data Cleaning
• Data Selection
• Data Preparation
• Model Evaluation
• Model Configuration
• Model Selection
• Model Presentation
• Model Predictions
https://machinelearningmastery.com/statistical-methods-in-an-applied-machine-learning-project/
1. Problem Framing
2. Data Understanding
3. Data Cleaning
Observations from a domain are often not pristine.
Some examples include:
• Data corruption.
• Data errors.
• Data loss.
The process of identifying and repairing issues with the data is
called data cleaning.
Statistical methods are used for data cleaning; for example:
• Outlier detection. Methods for identifying observations that
are far from the expected value in a distribution.
• Imputation. Methods for repairing or filling in corrupt or
missing values in observations.
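The two cleaning methods above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes the interquartile-range (IQR) rule for outlier detection and median imputation for missing values, and the measurements are hypothetical.

```python
import statistics

def iqr_outlier_bounds(values, k=1.5):
    """Return (low, high) fences; values outside them are flagged as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Hypothetical measurements: None marks a missing value, 42.0 looks suspicious.
raw = [7.4, 7.8, None, 7.9, 8.1, 42.0, 7.6, None, 8.0]

observed = [v for v in raw if v is not None]
low, high = iqr_outlier_bounds(observed)
outliers = [v for v in observed if v < low or v > high]
filled = impute_median(raw)

print("outliers:", outliers)
print("after imputation:", filled)
```

The median is used for imputation here because, unlike the mean, it is barely affected by the outlier itself.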
4. Data Selection
Not all observations or all variables may be relevant when
modeling.
The process of reducing the scope of data to those elements that
are most useful for making predictions is called data selection.
Two types of statistical methods that are used for data selection
include:
• Data Sample. Methods to systematically create smaller
representative samples from larger datasets.
• Feature Selection. Methods to automatically identify those
variables that are most relevant to the outcome variable.
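As a rough illustration of feature selection, the sketch below ranks candidate variables by the absolute value of their Pearson correlation with the outcome variable. This is only one of many possible relevance scores. The column names and values are hypothetical toy data, not taken from any real data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_features(features, target):
    """Order feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy columns and outcome variable.
features = {
    "ph": [3.5, 3.2, 3.4, 3.3, 3.5, 3.2],
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
}
quality = [5, 5, 6, 6, 7, 7]

ranking = rank_features(features, quality)
print(ranking)
```

Correlation only captures linear relationships, so a low score does not prove that a variable is irrelevant.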
5. Data Preparation
Data often cannot be used directly for modeling.
Some transformation is often required to change the shape or
structure of the data so that it better suits the chosen framing
of the problem or the learning algorithm.
Data preparation is performed using statistical methods.
Some common examples include:
• Scaling. Methods such as standardization and
normalization.
• Encoding. Methods such as integer encoding and one
hot encoding.
• Transforms. Methods such as power transforms like the
Box-Cox method.
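The preparation methods listed above can be sketched with the standard library alone. The function names and sample values below are illustrative assumptions. The Box-Cox transform is shown for a fixed, hand-picked lambda, whereas in practice lambda is usually estimated from the data.

```python
import math
import statistics

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def normalize(values):
    """Min-max rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def integer_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    index = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [index[label] for label in labels], index

def one_hot_encode(labels):
    """Turn each label into a 0/1 vector with a single 1 at its integer code."""
    codes, index = integer_encode(labels)
    return [[1 if code == i else 0 for i in range(len(index))] for code in codes]

def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values and a fixed lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical numeric and categorical columns.
values = [2.0, 4.0, 6.0, 8.0]
colours = ["red", "white", "red", "rose"]

print(standardize(values))
print(normalize(values))
print(one_hot_encode(colours))
print(box_cox(values, 0.5))
```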
6. Model Evaluation
7. Model Configuration
8. Model Selection
9. Model Presentation
10. Model Predictions
Summary
• Exploratory data analysis, data summarization, and data
visualizations can be used to help frame your predictive
modeling problem and better understand the data.
• Statistical methods can be used to clean and prepare data
for modeling.
• Statistical hypothesis tests and estimation statistics can
aid in model selection and in presenting the skill and
predictions of final models.
Data Selection
• Feature reduction
• Data sampling: reduction of data points (objects)
Basic Terminologies of Statistics
Sample & Population
https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/
Steps involved in Sampling
Step 1
• The first stage in the sampling process is to clearly define the
target population.
• For example, to carry out an opinion poll, polling agencies
consider only people who are above 18 years of age and
eligible to vote.
Step 2
• Sampling Frame – It is a list of items or people forming a
population from which the sample is taken.
• For an opinion poll, the sampling frame would be the list of all
people whose names appear on the voter list of a constituency.
Step 3
• Choose a sampling technique. For opinion polls, probability
sampling methods are generally used because every vote has
equal value and any person can be included in the sample
irrespective of their caste, community, or religion.
• Different samples are taken from different regions all over
the country.
Step 4
• Sample Size – It is the number of individuals or items to be
taken in a sample that would be enough to make inferences about
the population with the desired level of accuracy and precision.
• In general, the larger the sample size, the more precise our
inference about the population.
• For the polls, agencies try to include as many people from
diverse backgrounds as possible in the sample, as this helps
in predicting the number of seats a political party can win.
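To make the link between sample size and precision concrete, the sketch below uses the standard large-population formula for estimating a proportion under simple random sampling, n = z² · p(1 − p) / e². The defaults are assumptions chosen for illustration: z = 1.96 corresponds to roughly 95% confidence, and p = 0.5 is the most conservative guess for the true proportion.

```python
import math

def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5):
    """Smallest n whose margin of error for a proportion is at most the target.

    Assumes simple random sampling from a large population.
    """
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)

# Respondents needed for a +/-5% and a +/-3% margin of error.
n_5_percent = sample_size_for_proportion(0.05)
n_3_percent = sample_size_for_proportion(0.03)
print(n_5_percent, n_3_percent)
```

Because the margin of error appears squared in the denominator, halving it requires roughly four times as many respondents.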
Step 5
• Once the target population, sampling frame, sampling
technique, and sample size have been established, the next
step is to collect data from the sample.
• In opinion polls, agencies generally ask people questions such
as which political party they are going to vote for, or
whether the previous party has done any work.
• Based on the answers, agencies try to infer who the people
of a constituency are going to vote for and approximately how
many seats a political party is going to win.
Random Sampling
Systematic Sampling
• Suppose we have a population of 20 people, begin with
person number 3, and want a sample size of 5. Each
subsequent individual is selected at an interval of
20/5 = 4 from the previous one: the next is person 7
(3 + 4), and so on.
• Selected individuals: 3, 7, 11, 15, 19
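The selection rule above can be written as a short function. This is a minimal sketch: the function name and the check on the starting position are illustrative choices, and it assumes the population size is a multiple of the sample size.

```python
def systematic_sample(population, sample_size, start):
    """Pick every k-th item, beginning with the item at 1-based position `start`."""
    interval = len(population) // sample_size
    if not 1 <= start <= interval:
        raise ValueError("start must lie within the first interval")
    return population[start - 1::interval][:sample_size]

# Twenty people numbered 1 to 20, sample size 5, starting from person 3.
people = list(range(1, 21))
sample = systematic_sample(people, sample_size=5, start=3)
print(sample)  # [3, 7, 11, 15, 19]
```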
Stratified Sampling
• In this type of sampling, we divide the population into
subgroups (called strata) based on traits such as gender,
category, etc. We then select the sample(s) from each of
these subgroups.
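A minimal sketch of stratified sampling follows. It assumes proportional allocation, meaning the same fraction is drawn at random from every stratum. The records, the `gender` field, and the fixed seed are hypothetical and only there to make the example reproducible.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum (at least one each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # group records by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 20 people: 12 in stratum "F" and 8 in stratum "M".
people = [{"id": i, "gender": "F" if i % 5 < 3 else "M"} for i in range(1, 21)]
sample = stratified_sample(people, key=lambda p: p["gender"], fraction=0.25)
print(sample)
```

With a fraction of 0.25, the sample keeps the 12:8 proportion of the population by drawing 3 and 2 people from the two strata.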
Cluster Sampling
• In a clustered sample, we use subgroups of the population as the
sampling unit rather than individuals. The population is divided into
subgroups, known as clusters, and one or more whole clusters are
randomly selected to be included in the study.
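In contrast to stratified sampling, cluster sampling randomly picks entire subgroups and keeps every member of each chosen one. A minimal sketch, with hypothetical cluster names and a fixed seed for reproducibility:

```python
import random

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly choose whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population of 20 people split into four clusters of five.
clusters = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
}
selected = cluster_sample(clusters, n_clusters=2)
print(selected)
```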
Types of Non-Probability Sampling:
Convenience Sampling
• This is perhaps the easiest method of sampling because individuals are
selected based on their availability and willingness to take part.
• Here, let’s say individuals numbered 4, 7, 12, 15 and 20 want to be part of
our sample, and hence, we will include them in the sample.
Types of Non-Probability Sampling:
Quota Sampling
• In this type of sampling, we choose items based on predetermined
characteristics of the population. Consider that we have to select
individuals whose numbers are multiples of four for our sample.
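The multiples-of-four example can be sketched as a filter that stops once a quota is filled. The `quota` parameter and the population of 20 numbered individuals are assumptions made for illustration.

```python
def quota_sample(population, has_trait, quota):
    """Take items with the predetermined trait, in order, until the quota is met."""
    sample = []
    for item in population:
        if has_trait(item):
            sample.append(item)
            if len(sample) == quota:
                break
    return sample

# Individuals numbered 1 to 20; the predetermined trait is "multiple of four".
people = list(range(1, 21))
sample = quota_sample(people, lambda number: number % 4 == 0, quota=5)
print(sample)  # [4, 8, 12, 16, 20]
```

No randomness is involved, which is what makes this a non-probability method.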
Judgment Sampling
• It is also known as selective sampling. It depends on the
judgment of the experts when choosing whom to ask to
participate.
Snowball Sampling
• I quite like this sampling technique. Existing participants
are asked to nominate further people known to them, so that
the sample grows in size like a rolling snowball. This
method of sampling is effective when a sampling frame is
difficult to identify.
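Snowball sampling can be viewed as following referrals outward from a few starting participants. The sketch below assumes a hypothetical referral table and a cap on the sample size. It processes nominations in the order they are received.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Grow the sample by following each participant's nominations in turn."""
    sample = []
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in sample:
            continue  # already recruited through an earlier nomination
        sample.append(person)
        queue.extend(referrals.get(person, []))
    return sample

# Hypothetical nominations: each person lists the people they would refer.
referrals = {
    "Asha": ["Ben", "Chen"],
    "Ben": ["Dara"],
    "Chen": ["Asha", "Eli"],
    "Dara": [],
    "Eli": ["Fay"],
}
sample = snowball_sample(referrals, seeds=["Asha"], max_size=5)
print(sample)  # ['Asha', 'Ben', 'Chen', 'Dara', 'Eli']
```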