Introduction to Data Science_students
• Hypothesis Testing
• Regression Analysis
• Confidence Intervals
Hypothesis Testing
• Hypothesis testing involves making decisions
about a population parameter based on
sample data. It typically involves formulating a
null hypothesis (H0) and an alternative
hypothesis (Ha), collecting sample data, and
using statistical tests to determine whether
there is enough evidence to reject the null
hypothesis in favor of the alternative
hypothesis.
Regression Analysis
• Regression analysis is used to examine the
relationship between one or more
independent variables and a dependent
variable. It helps in predicting the value of the
dependent variable based on the values of the
independent variables.
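As a minimal sketch of the idea above, the following fits a straight line with one independent variable and uses it to predict a new value. The data (hours studied and exam scores) are made-up illustrative numbers, and np.polyfit is just one simple way to do an ordinary least squares fit.

```python
import numpy as np

# Hypothetical data: hours studied (independent) and exam score (dependent)
hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 64, 70, 73])

# Fit score = slope * hours + intercept by ordinary least squares
slope, intercept = np.polyfit(hours, score, deg=1)

# Predict the dependent variable for a new value of the independent variable
predicted = slope * 7 + intercept
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, prediction at 7 hours = {predicted:.1f}")
```

The slope estimates how much the dependent variable changes for each one-unit increase in the independent variable.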
Confidence Intervals
• Confidence intervals provide a range of values
within which the true population parameter is
likely to fall, at a stated level of confidence.
For example, a 95% confidence interval for the
population mean is built by a procedure that
captures the true mean in about 95% of repeated
samples. This is what we mean when we say we are
95% confident that the true population mean falls
within the interval.
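A rough sketch of how a 95% confidence interval for a mean could be computed. The sample values are made up for illustration, and the t-based interval from scipy.stats is an assumption about the method, since the slides do not name one.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 8 measurements
sample = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (sample std / sqrt(n))

# 95% interval from the t distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

A larger sample or a lower confidence level would give a narrower interval.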
Hypothesis Testing
• Hypothesis testing is an important part of
statistics. It's like a detective game where we
have two guesses about something in a group:
one saying there's no difference, and the
other saying there is. We collect data from a
smaller group and use statistics to see if the
evidence against the "no difference" guess is
strong enough to reject it. It helps us decide
whether our ideas about the whole group are
supported by the data.
Hypothesis Testing
• The null hypothesis and the alternative
hypothesis are two competing statements in
statistical hypothesis testing. They are used to
make inferences about a population based on
sample data.
Null Hypothesis (H0)
• Definition: The null hypothesis states that there is no
effect, no difference, or no relationship between
variables. It represents the status quo or the
assumption that nothing has changed.
• Purpose: It is the hypothesis that researchers aim to
test or reject.
• Example: Suppose we are testing whether a new drug
is effective in reducing blood pressure. The null
hypothesis would be: the mean blood pressure after
using the drug is equal to the mean blood pressure
before using the drug, indicating no effect.
Alternative Hypothesis (Ha)
• Definition: The alternative hypothesis states that
there is an effect, a difference, or a relationship
between variables. It is the claim that researchers
want to support.
• Purpose: It is accepted if there is sufficient
evidence to reject the null hypothesis.
• Example: Continuing the same example, the
alternative hypothesis could be: the mean blood
pressure after using the drug is different from the
mean blood pressure before, indicating the drug
has an effect.
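The blood pressure example could be tested roughly as follows. This is only a sketch: the readings are made-up values for eight hypothetical patients, and the paired t-test (scipy.stats.ttest_rel) is one reasonable choice of test, not one named in these slides.

```python
from scipy import stats

# Hypothetical systolic blood pressure (mmHg) for the same 8 patients
before = [150, 142, 160, 155, 148, 152, 158, 145]
after = [144, 140, 151, 150, 147, 145, 150, 143]

# H0: the mean before/after difference is zero (no effect)
# Ha: the mean before/after difference is not zero (two-sided)
t_stat, p_value = stats.ttest_rel(before, after)

alpha = 0.05  # significance level, chosen before looking at the data
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, decision: {decision}")
```

The test is paired because each patient contributes both a "before" and an "after" reading.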
Key Differences
What is a P-value?
• The p-value is a statistical measure that quantifies the
probability of observing a result at least as extreme as the
one obtained, assuming the null hypothesis (H0) is true.
• Low p-value: Suggests that the observed data is unlikely
under H0 and therefore provides evidence to reject H0.
• High p-value: Indicates the observed data is consistent with
H0 , and there is insufficient evidence to reject it.
• Key Points:
• Range: 0 ≤ p ≤ 1
• Interpretation:
– p ≤ α: Reject the null hypothesis (statistically significant).
– p > α: Fail to reject the null hypothesis (not statistically
significant).
What is Alpha (α)?
from scipy.stats import binomtest
# Observed data: 60 heads in 100 flips
n_flips = 100
observed_heads = 60
# H0: the coin is fair, so P(heads) = 0.5
p_fair = 0.5
# Exact two-sided binomial test (binomtest replaces the older binom_test)
result = binomtest(observed_heads, n=n_flips, p=p_fair, alternative='two-sided')
p_value = result.pvalue
alpha = 0.05
print(f"P-value: {p_value}")
if p_value <= alpha:
    print("Reject the null hypothesis: The coin is not fair.")
else:
    print("Fail to reject the null hypothesis: no evidence that the coin is unfair.")
What is Distribution?
• Distribution = Probability Distribution
• A distribution is a function that shows the
possible values for a variable and how often
they occur.
• Fair Coin Problem –
Types of Distribution
Discrete Uniform Distribution: All
Outcomes are Equally Likely
• In statistics, a uniform distribution is one in which all
outcomes are equally likely. Consider rolling a six-sided
die. Each of the six numbers 1, 2, 3, 4, 5, and 6 has the
same probability, 1/6, of appearing on your next roll,
which makes this an example of a discrete uniform
distribution.
• As a result, the graph of a uniform distribution contains
bars of equal height, one for each outcome. In our
example, the height of each bar is a probability of 1/6
(about 0.1667).
Drawback
• A uniform distribution is written U(a, b), where a and b
are the smallest and largest values, respectively.
Alongside the discrete uniform distribution, there is a
continuous uniform distribution for continuous variables.
• The drawback of this distribution is that it often
gives us little useful information. For the die example,
the expected value is 3.5, which is not an outcome we
can actually roll, since no face of a die shows 3.5. Since
all values are equally likely, the distribution gives us
no real predictive power.
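The expected value of 3.5 for a fair six-sided die can be checked with a short calculation. This sketch uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction

# Discrete uniform distribution for a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, len(outcomes)) for x in outcomes}

# Expected value: sum of each outcome times its probability
expected = sum(x * p for x, p in pmf.items())
print(expected, float(expected))
```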
Bernoulli Distribution: Single-trial
with Two Possible Outcomes
• The Bernoulli distribution is one of the easiest distributions to
understand. It can be used as a starting point to derive more
complex distributions. Any event with a single trial and only two
outcomes follows a Bernoulli distribution. Flipping a coin or
choosing between True and False in a quiz are examples of a
Bernoulli distribution.
• Each has a single trial and only two outcomes. Let’s assume you
flip a coin once; this is a single trial. The only two outcomes are
heads and tails. This is an example of a Bernoulli distribution.
• Usually, when following a Bernoulli distribution, we have the
probability of one of the outcomes (p). From (p), we can deduce the
probability of the other outcome by subtracting it from the total
probability (1), represented as (1-p).
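A small sketch of a Bernoulli trial, coding one outcome as 1 (probability p) and the other as 0 (probability 1 - p). The function name bernoulli_pmf and the simulation settings are illustrative choices, not part of the slides.

```python
import random

def bernoulli_pmf(k, p):
    """Return P(X = k) for one Bernoulli trial: p if k == 1, 1 - p if k == 0."""
    if k not in (0, 1):
        raise ValueError("a Bernoulli outcome must be 0 or 1")
    return p if k == 1 else 1 - p

p = 0.5  # probability of heads for a fair coin
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))

# Simulate many independent single flips; the share of heads should be close to p
rng = random.Random(0)
flips = [1 if rng.random() < p else 0 for _ in range(10_000)]
share_of_heads = sum(flips) / len(flips)
print(share_of_heads)
```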
Poisson Distribution: The Probability
that an Event May or May not Occur
• Poisson distribution deals with the frequency with which an
event occurs within a specific interval. Instead of the
probability of an event, Poisson distribution requires
knowing how often it happens in a particular period or
distance. For example, a cricket chirps two times in 7
seconds on average. We can use the Poisson distribution to
determine the likelihood of it chirping five times in 15
seconds.
• A Poisson distribution is written Po(λ), where λ is the
expected number of events in the given period. The
expected value and the variance of a Poisson distribution
are both equal to λ. With X as the discrete random variable
counting the events, the probability of observing exactly
k events is:
P(X = k) = (λ^k · e^(−λ)) / k!, for k = 0, 1, 2, ...
• The main characteristics which describe the
Poisson Processes are:
• The events are independent of each other.
• An event can occur any number of times
(within the defined period).
• Two events can’t take place simultaneously.
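The cricket example can be worked through with the standard Poisson probability mass function, P(X = k) = λ^k e^(−λ) / k!. In this sketch the average rate of 2 chirps per 7 seconds is rescaled to a 15-second window, giving λ = 30/7; the helper name poisson_pmf is an illustrative choice.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# 2 chirps per 7 seconds on average, rescaled to a 15-second window
rate_per_second = 2 / 7
lam = rate_per_second * 15

# Probability of exactly 5 chirps in 15 seconds
p_five = poisson_pmf(5, lam)
print(f"lambda = {lam:.4f}, P(X = 5) = {p_five:.4f}")
```

The key step is rescaling λ to match the length of the interval being asked about.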
Numpy Choice Function
• The choice() method draws random samples from a
one-dimensional array and returns them as a
NumPy array.
• Syntax: numpy.random.choice(a, size=None,
replace=True, p=None)
• Parameters:
• 1) a – 1-D array (or list) to sample from. If a is an
int, samples are drawn from np.arange(a).
• 2) size – Output shape of the returned samples.
• 3) replace – Whether to sample with replacement (True)
or without replacement (False).
• 4) p – The probabilities associated with each entry in a.
If not given, all entries are equally likely.
• Output: Returns a NumPy array of random samples (a
single value when size is None).
# import numpy library
import numpy as np
# create a list
num_list = [10, 20, 30, 40, 50]
# pick one random element from the list
number = np.random.choice(num_list)
print(number)
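# import numpy library
import numpy as np
# create a list
num_list = [10, 20, 30, 40, 50]
# pick three random elements (the size of 3 is an illustrative assumption)
number_list = np.random.choice(num_list, size=3)
print(number_list)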
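import numpy as np
import matplotlib.pyplot as plt
# draw random samples to plot (the range and sample size are illustrative assumptions)
gfg = np.random.choice(13, 5000)
plt.hist(gfg, bins=13)
plt.show()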
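The replace and p parameters of numpy.random.choice are easiest to see side by side. The sketch below uses a fixed seed so the run is repeatable; the weights are arbitrary illustrative values.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable
num_list = [10, 20, 30, 40, 50]

# Three draws without replacement: no value can appear twice
unique_draws = np.random.choice(num_list, size=3, replace=False)
print(unique_draws)

# 1000 draws with replacement, weighted towards 50 (weights must sum to 1)
weights = [0.1, 0.1, 0.1, 0.1, 0.6]
weighted_draws = np.random.choice(num_list, size=1000, replace=True, p=weights)

# The share of 50s should be close to its weight of 0.6
share_of_50 = np.mean(weighted_draws == 50)
print(share_of_50)
```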
Sampling error in inferential statistics
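import pandas as pd
# example lists (illustrative values; the original definitions are not shown)
nme = ["Ann", "Ben", "Cara"]
deg = ["MBA", "BCA", None]
scr = [90, 40, None]
# dictionary of lists
data = {'name': nme, 'degree': deg, 'score': scr}
df = pd.DataFrame(data)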
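• isnull() – flags missing values (True where a value is missing)
• isna() – same as isnull()
• fillna() – replaces missing values with a given value
• dropna() – removes rows (or columns) that contain missing values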
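A short sketch of the four methods on a tiny DataFrame. The names and scores are made-up values, and filling with the column mean is just one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame with one missing score
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara"],
    "score": [90.0, np.nan, 75.0],
})

# isna() and isnull() return the same True/False mask of missing values
missing_per_column = df.isna().sum()
print(missing_per_column)

# fillna() returns a new DataFrame with the gap filled by the column mean
filled = df.fillna({"score": df["score"].mean()})
print(filled)

# dropna() returns a new DataFrame without the rows that have missing values
dropped = df.dropna()
print(dropped)
```

Neither fillna() nor dropna() changes df here, because inplace=True is not used.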
Remove Rows & Return a new Data
Frame
• import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
• import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Replace Empty Values
• import pandas as pd
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv')
• df.drop_duplicates(inplace = True)
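To see what drop_duplicates() does without needing data.csv, here is a sketch on a small made-up DataFrame. duplicated() marks every repeat of an earlier row, and drop_duplicates() keeps only the first copy.

```python
import pandas as pd

# Illustrative data: the third row repeats the first
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann"],
    "score": [90, 40, 90],
})

# Count rows that are exact repeats of an earlier row
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Keep the first occurrence of each row and drop the repeats
deduped = df.drop_duplicates()
print(deduped)
```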
Applications of Data Science in
Various Industries
Tools:
• Excel
• Python (Pandas, NumPy)
• R
• SQL
Applications:
• Examining sales data to identify top-performing products.
Data Analytics
Definition:
• Data Analytics is a broader field that includes
Data Analysis and focuses on applying
techniques and algorithms to predict future
trends, make recommendations, and
automate decision-making.
Objectives:
• Gain actionable insights for strategic
decisions.
• Predict future events or behaviors.
• Optimize processes and systems.
Characteristics:
Data Analytics
Key Techniques:
1. Predictive Analytics: Uses machine learning and statistical
models to forecast future outcomes.
2. Prescriptive Analytics: Recommends actions to optimize
results.
3. Diagnostic Analytics: Identifies the causes of past outcomes.
4. Real-Time Analytics: Analyzes data as it is generated.
Tools:
• Tableau
• Power BI
• Python (Scikit-learn, TensorFlow)
• Big Data Platforms (Hadoop, Spark)
Applications:
• Building recommendation systems (e.g., Netflix, Amazon).
• Fraud detection in financial transactions.
• Demand forecasting in supply chain management.
Differences
Aspect Data Analysis Data Analytics