Unit 1 Machine Learning
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # sample 2-D array (assumed; the original definition is not shown)
A NumPy array object has a number of attributes that give information about the array. Its important attributes are described below.
1 ndim: This gives the number of dimensions of the array. The following shows that the array we defined has two dimensions.
a.ndim
2 shape: This gives the size of each dimension of the array.
a.shape
3 size: This gives the number of elements.
a.size
4 dtype: This gives the data type of the elements in the array:
a.dtype.name
Example: Create a NumPy array of dimension 3x3 and display all of its attributes.
import numpy as np
a=np.array([[1,3,6],[2,4,7],[2,5,9]])
print(a)
print("#Dimensions:",a.ndim)
print("Shape:",a.shape)
print("#Elements:",a.size)
print("Data Type:",a.dtype.name)
Mathematical Operations
1. Array Subtraction
The following commands subtract the b array from the a array to get the resultant c array. The subtraction happens element by element:
a = np.array([11, 12, 13, 14])  # sample array (assumed; a is not defined for this snippet in the source)
b = np.array([1, 2, 3, 4])
c = a - b
2. Squaring an Array
The following command raises each element to the power of 2 to obtain this result:
r = b ** 2
3. Trigonometric Functions
NumPy applies trigonometric functions element-wise. The following command takes the cosine of each element of the b array:
r = np.cos(b)
4. Conditional Operations
The following command will apply a conditional operation to each of the elements
of the b array, in order to generate the respective Boolean values.
r=b<2
5. Matrix Multiplication
Two matrices can be multiplied element by element or as a dot product. The following commands will perform the element-by-element multiplication:
c=a * b
Example: Create two NumPy arrays of dimension 3x4, say a and b, and then perform the following operations.
import numpy as np
a=np.array([[1,-3,4,7],[2,4,-5,9],[0,-1,8,3]])
b=np.array([[2,3,6,7],[2,5,5,8],[3,1,7,3]])
r=a+b
print("a+b=",r)
r=a-b
print("a-b=",r)
r=a**2
print("a^2=",r)
r=b**3
print("b^3=",r)
r=np.sin(a)+np.cos(b)
print("sin(a)+cos(b)=",r)
r=(a>=0)
print("(a>=0)=",r)
r=a*b
print("Element wise Multiplication of a and b=",r)
a=np.transpose(a)
print("Transpose of a=",a)
r=np.dot(a,b)
print("Matrix Multiplication of a and b=",r)
6. Indexing and Slicing
a[0, 1]
The preceding command selects the first row and then the second value in that row. It can also be seen as the intersection of the first row and the second column of the matrix. If a range of values has to be selected from a row, then we can use the following command.
a[0, 0:3]
The 0:3 value selects the first three values of the first row. The whole row of values can be selected with the following command.
a[0, :]
Similarly, a whole column can be selected; the following selects the second column.
a[:, 1]
7. Shape Manipulation
Once the array has been created, we can change the shape of it too. The following
command flattens the array.
a.ravel()
The following command reshapes the array into a six-rows-by-two-columns format. Note that when reshaping, the new shape should have the same number of elements as the previous one.
a.shape = (6,2)
The transpose() method swaps the rows and columns of the array.
a.transpose()
Example: Create a NumPy array of dimension 4x3 and then perform following
operations.
import numpy as np
a=np.random.rand(4,3)
print(a)
print("Third Element of Second Row=",a[1,2])
print("First Two Elements of Second Row=",a[1,0:2])
print("Second to Third Row:",a[1:3,:])
print("2x2 Slice of Top-Left Part=",a[0:2,0:2])
print("Fisr Two Columns=",a[:,0:2])
a=a.ravel()
print("1D Array:",a)
a.shape=(3,4)
print("a=",a)
Review of Pandas Data Structures
The pandas library is an open-source Python library, specially designed for data analysis. It has been built on NumPy and makes it easy to handle data. The pandas library brings the richness of R into the world of Python to handle data. It has efficient data structures to process data, perform fast joins, and read data from various sources, to name a few. The pandas library essentially has three data structures: Series, DataFrame, and Panel (Panel has been removed from recent versions of pandas).
Series
Series is a one-dimensional array, which can hold any type of data, such as integers,
floats, strings, and Python objects too. A series can be created by calling the following.
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5))
print(s)
The randn function is part of NumPy's random module, and it generates random numbers. The Series constructor creates a pandas Series that consists of an index, which is the first column, while the second column consists of the random values. At the bottom of the output is the data type of the series. The index of the series can be customized by calling the following:
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
A Series can also be created from a Python dictionary, in which case the dictionary keys become the index:
import pandas as pd
import numpy as np
d = {'A': 10, 'B': 20, 'C': 30}
s=pd.Series(d)
print(s)
DataFrame
DataFrame is a 2D data structure with columns that can be of different data types. It can
be seen as a table. A DataFrame can be formed from the following data structures: A
NumPy array, Lists, Dicts, Series, etc.
import pandas as pd
d = {'c1': pd.Series(['A', 'B', 'C']),'c2': pd.Series([1, 2, 3, 4])}
df = pd.DataFrame(d)
print(df)
import pandas as pd
d = {'c1': ['A', 'B', 'C', 'D'],'c2': [1, 2, 3, 4]}
df = pd.DataFrame(d)
print(df)
import pandas as pd
import numpy as np
a=np.random.randn(3,4)
df=pd.DataFrame(a)
print(df)
To read data from a .csv file, the read_csv function can be used. To write data to a .csv file, the to_csv function can be used.
Example: Write a program that reads the emp.csv file, displays its content, stores Eid, Ename, and Age in another DataFrame, and writes it to the emptest.csv file.
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/emp.csv')
print(emp)
d=emp[["Eid","Ename","Age"]]
print(d)
d.to_csv("/content/drive/My Drive/Data/emptest.csv")
XLS
To read data from an Excel file, the read_excel() function can be used, and the to_excel() function can be used to write an Excel file.
Example: Write a program that reads the Book1.xlsx file, displays its content, stores Sid and Grade in another DataFrame, and writes it to the Book.xlsx file.
import pandas as pd
book = pd.read_excel('/content/drive/My Drive/Data/Book1.xlsx')
print(book)
b=book[["Sid","Grade"]]
print(b)
b.to_excel("/content/drive/My Drive/Data/Book.xlsx")
JSON Data
JSON is a syntax for storing and exchanging data. JSON is text, written with JavaScript
object notation. Python has a built-in package called json, which can be used to work with
JSON data. If we have a JSON string, we can parse it by using the json.loads() method.
The result will be a Python dictionary. If we have a Python object, we can convert it into
a JSON string by using the json.dumps() method.
Example: Represent the Id, Name, and Email of 3 persons in JSON format, load it into a dictionary, and display it. Then represent the Id, Name, and Email of 3 persons in a dictionary, convert it into JSON format, and display it.
import json
#JSON Data
x = """[
{"ID":101,"name":"Ram", "email":"Ram@gmail.com"},
{"ID":102,"name":"Bob", "email":"bob32@gmail.com"},
{"ID":103,"name":"Hari", "email":"hari@gmail.com"}
]"""
# loads method converts x into dictionary
y = json.loads(x)
print(y)
# Displaying Email Id of all persons from dictionary
for r in y:
    print(r["email"])
Database
SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in
wide use, and many alternative NoSQL databases have become quite popular. Loading
data from SQL into a DataFrame is fairly straightforward, and pandas has some functions
to simplify the process. As an example, an in-memory SQLite database using Python’s
built-in sqlite3 driver is presented below.
Example
import sqlite3
query = "CREATE TABLE Student (Sid Varchar(10), Sname VARCHAR(20), GPA
Real, Age Integer);"
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('S1', 'Ram', 3.25, 23), ('S2', 'Hari', 3.4, 24), ('S3', 'Sita', 3.7, 22)]
stmt = "Insert Into Student Values(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from Student where GPA>3.3')
rows = cursor.fetchall()
print(rows)
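pandas can also load the query result directly into a DataFrame; a minimal sketch continuing the example above with pandas' read_sql_query function:
import pandas as pd
# run the same query and load the result into a DataFrame
df = pd.read_sql_query('select * from Student where GPA>3.3', con)
print(df)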
Data Cleansing
Data cleansing, sometimes known as data cleaning or data scrubbing, denotes the
procedure of rectifying inaccurate, unfinished, duplicated, or other flawed data within a
dataset. This task entails detecting data discrepancies and subsequently modifying,
enhancing, or eliminating the data to rectify them. Through data cleansing, data quality
is enhanced, thereby furnishing more precise, uniform, and dependable information
crucial for organizational decision-making.
Generally, most data will have some missing values. There could be various reasons for
this: the source system which collects the data might not have collected the values or the
values may never have existed. Once you have the data loaded, it is essential to check the
missing elements in the data. Depending on the requirements, the missing data needs to
be handled. It can be handled by removing a row or replacing a missing value with an
alternative value. The methods isnull(), notnull(), and dropna() are widely used for checking and handling null values.
isnull(): It examines columns for NULL values and generates a Boolean series where True represents NaN values and False signifies non-null values.
notnull(): It examines columns for NULL values and generates a Boolean series where True represents non-null values and False signifies NULL values.
dropna(): It eliminates all rows containing NULL values from the DataFrame.
Example
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
for c in emp.columns:
    print(emp[c].isnull().value_counts())
emp=emp.dropna()
print ("Cleaned Data")
for c in emp.columns:
    print(emp[c].isnull().value_counts())
To address null values within datasets, we employ functions like fillna(), replace(), and
interpolate(). These functions substitute NaN values with specific values.
fillna(): The fillna() function substitutes NULL values with a designated value. By default,
it generates a new DataFrame object unless the inplace parameter is set to True, in which
case it performs the replacement within the original DataFrame.
Example
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp["First Name"].fillna(value="Unknown",inplace=True)
emp["Gender"].fillna(value="Unknown",inplace=True)
emp["Salary"].fillna(value=emp["Salary"].mean,inplace=True)
emp["Bonus %"].fillna(value=emp["Bonus %"].mean,inplace=True)
emp["Team"].fillna(value="Unknown",inplace=True)
for c in emp.columns:
    print(emp[c].isnull().value_counts())
The Pandas interpolate() function is utilized to populate NaN values within the
DataFrame or Series by employing different interpolation techniques aimed at filling the
missing values in the data.
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
print(*emp["Salary"])
emp["Salary"].interpolate(inplace=True,method="linear")
print(*emp["Salary"])
Merging Data
To combine datasets together, the concat function of pandas can be utilized. We can
concatenate two or more dataframes together.
Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
e1=emp[0:5]
print("First 5 Rows of Dataframe:")
print(e1)
e2=emp[10:15]
print("Rows 10-15 of Dataframe:")
print(e2)
print("Concatenated Dataframe:")
e=pd.concat([e1,e2])
print(e)
Data operations
Once the missing data is handled, various operations, such as aggregation operations and joins, can be performed on the data; a join example is sketched below.
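Besides concatenation, two DataFrames sharing a key column can be joined with pandas' merge function; a minimal sketch with small hypothetical tables (the columns Eid, Ename, and Salary are illustrative):
import pandas as pd
# hypothetical tables sharing the key column 'Eid'
e = pd.DataFrame({'Eid': ['E1', 'E2', 'E3'], 'Ename': ['Ram', 'Hari', 'Sita']})
s = pd.DataFrame({'Eid': ['E1', 'E2', 'E4'], 'Salary': [50000, 60000, 55000]})
# inner join on Eid keeps only rows present in both tables
print(pd.merge(e, s, on='Eid', how='inner'))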
Aggregation Operations
There are a number of aggregation operations, such as average, sum, and so on, which
we would like to perform on a numerical field. These aggregate methods are discussed
below.
groupby Function
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
emp.dropna(inplace=True)
avgsal=emp[["Team","Salary"]].groupby(["Team"]).mean()
print("Average Salary For Each Team")
print(avgsal)
gencount=emp.groupby(["Gender"]).count()
print("#Employees Gender Wise")
print(gencount)
minbonus=emp[["Gender","Bonus %"]].groupby(["Gender"]).min()
print("Minimum Bonus% For Each Gender")
print(minbonus)
For all normal distributions, 68.2% of the observations will appear within plus or minus
one standard deviation of the mean; 95.4% of the observations will fall within +/- two
standard deviations; and 99.7% within +/- three standard deviations. This fact is
sometimes referred to as the "empirical rule," a heuristic that describes where most of the
data in a normal distribution will appear. This means that data falling outside of three
standard deviations ("3-sigma") would signify rare occurrences.
Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
# generate sample heights (assumed parameters; the original generation step is not shown)
heights = np.random.normal(loc=5.5, scale=0.5, size=100)
heights=heights.round(2)
heights=np.sort(heights)
# normal distribution with the sample mean and standard deviation (assumed)
dist = stats.norm(heights.mean(), heights.std())
prob = dist.pdf(x=5.2)
print("Probability of being height 5.2=",prob)
probs = dist.pdf(x=[4.5,5,5.5,6,6.5])
print("Probability of heights=",probs)
probs = dist.pdf(x=heights)
plt.plot(heights, probs)  # bell-shaped curve of the fitted distribution (assumed plotting step)
plt.show()
The method pdf() from the norm class can help us find the probability of some randomly
selected value. It returns the probabilities for specific values from a normal distribution.
PDF stands for probability density function.
The norm method cdf() helps us to calculate the proportion of a normally distributed
population that is less than or equal to a given value. CDF stands for Cumulative
Distribution Function.
Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
# a normal distribution of heights (assumed parameters; the original body of this example is not shown)
dist = stats.norm(5.5, 0.5)
prob = dist.cdf(x=5.5)
print("Proportion of population with height <= 5.5 =", prob)
Z-score
A z-score ("standard score") measures how many standard deviations a value lies from the mean of a distribution. It is calculated as below.

$$z = \frac{x - \mu}{\sigma}$$

Here, x is the value in the distribution, μ is the mean of the distribution, and σ is the standard deviation of the distribution. Conversely, if z is a standard score, the corresponding value x of a normal random variable with mean μ and standard deviation σ is calculated as below.

$$x = \sigma z + \mu$$
Numerical Example
A survey of daily travel time had these results (in minutes): 26, 33, 65, 28, 34, 55, 25, 44,
50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34. Convert the values to z-scores ("standard scores").
Solution
μ = 38.8
σ = 11.4
For example, the first value 26 converts to z = (26 − 38.8)/11.4 ≈ −1.12.
scikit-learn's StandardScaler calculates the z-score for every data point in this way while normalizing data, and the normalized data can be inverse-scaled using x = σz + μ, as mentioned above.
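A short NumPy sketch of the conversion (np.std() computes the population standard deviation by default, matching the solution above):
import numpy as np
times = np.array([26, 33, 65, 28, 34, 55, 25, 44, 50, 36,
                  26, 37, 43, 62, 35, 38, 45, 32, 28, 34])
z = (times - times.mean()) / times.std()  # z-score of every travel time
print(z.round(2))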
Binomial Distribution
Binomial distribution is a probability distribution that summarizes the likelihood that a
variable will take one of two independent values under a given set of parameters. The
distribution is obtained by performing a number of Bernoulli trials. A Bernoulli trial is assumed to meet each of these criteria: each trial has only two possible outcomes (success or failure), the probability of success is the same in every trial, and the trials are independent of each other.
For example, the probability of getting a head or a tail is 50%. If we take the same coin
and flip it n times, the probability of getting a head p times can be computed using
probability mass function (PMF) of the binomial distribution. The binomial distribution formula, for any random variable x, is given by

$$P(x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$$

where n is the number of times the coin is flipped, p is the probability of success, q = 1 − p is the probability of failure, and x is the number of successes desired.
Numerical Example: If a coin is tossed 5 times, find the probability of: (a) Exactly 2
heads and (b) at least 4 heads.
Solution
Number of trials: n = 5; probability of a head in each trial: p = 0.5
For exactly 2 heads: x = 2

$$P(x = 2) = \frac{5!}{2!\times 3!} \times 0.5^2 \times 0.5^3 = 0.3125$$

Again, for at least 4 heads: x ≥ 4

$$P(x \ge 4) = P(x = 4) + P(x = 5) = \frac{5!}{4!\times 1!} \times 0.5^4 \times 0.5^1 + \frac{5!}{5!\times 0!} \times 0.5^5 \times 0.5^0 = 0.1875$$
Example
from scipy import stats
import matplotlib.pyplot as plt
dist=stats.binom(n=5,p=0.5)
prob=dist.pmf(k=2)
print("Probability of Two Heads=",prob)
prob=dist.pmf(k=4)+dist.pmf(k=5)
print("Probability of at least 4 heads=",prob)
Note!!!
A probability mass function is a function that gives the probability that a discrete random
variable is exactly equal to some value. A probability mass function differs from a
probability density function (PDF) in that the latter is associated with continuous rather
than discrete random variables. A PDF must be integrated over an interval to yield a
probability.
Poisson Distribution
Poisson distribution is a discrete distribution. It estimates how many times an event can happen in a specified time, given the mean occurrence of the event in the interval. For example, if someone eats twice a day, what is the probability that he will eat thrice? If lambda (λ) is the mean occurrence of the events per interval, then the probability of having k
occurrences within a given interval is given by the following formula.

$$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where e is Euler's number, k is the number of occurrences for which the probability is to be determined, and λ is the mean number of occurrences.
Numerical Example
In the World Cup, an average of 2.5 goals are scored in each game. Modeling this situation
with a Poisson distribution, what is the probability that 3 goals are scored in a game?
What is the probability that 5 goals are scored in a game?
Solution
Given, λ=2.5
$$P(x = 3) = \frac{2.5^3 \times e^{-2.5}}{3!} = 0.214$$

$$P(x = 5) = \frac{2.5^5 \times e^{-2.5}}{5!} = 0.0668$$
Example
from scipy import stats
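The remainder of this example can be sketched with SciPy's stats.poisson, using λ = 2.5 from the numerical example above (an assumed completion):
dist = stats.poisson(mu=2.5)  # Poisson distribution with mean 2.5 goals per game
print("Probability of 3 goals =", dist.pmf(k=3))  # approximately 0.214
print("Probability of 5 goals =", dist.pmf(k=5))  # approximately 0.0668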
P-value
The P-value is known as the probability value. It is defined as the probability of getting a result that is the same as or more extreme than the actual observations. A p-value is used in statistical testing to determine whether the null hypothesis should be rejected or not. The null hypothesis is a statement that says that there is no difference between two measures. For example, if the hypothesis is that people who clock in 4 hours of study every day score more than 90 marks out of 100, the null hypothesis would be that there is no relation between the number of hours studied and the marks scored. If the p-value is equal to or less than the significance level, then the null hypothesis is rejected.
The P-value table shows the hypothesis interpretations.
P-value          Decision
P-value > 0.05   The result is not statistically significant and hence the null hypothesis is accepted.
P-value < 0.05   The result is statistically significant. Generally, reject the null hypothesis in favor of the alternative hypothesis.
P-value < 0.01   The result is highly statistically significant, and thus the null hypothesis is rejected in favor of the alternative hypothesis.
Suppose null hypothesis is “It is common for students to score 68 marks in mathematics.”
Let's define the significance level at 5%. If the p-value is less than 5%, then the null
hypothesis is rejected and it is not common to score 68 marks in mathematics. First, calculate the z-score of 68 marks (say z68), and then calculate the p-value for that z-score as below.
pv = P(z ≥ z68)
This means pv×100% of the students score above the specified score of 68.
import numpy as np
from scipy import stats
# sample scores (assumed; the original data generation is not shown)
scores = np.random.normal(loc=60, scale=12, size=200)
mean=scores.mean()
SD=scores.std()
z = (68-mean)/SD #z-value of score=68
print("Z-value of score=68:",z)
pv = 1 - stats.norm.cdf(z)  # pv = P(z >= z68)
print("p-value:", pv)
A one-tailed test may be either left-tailed or right-tailed. A left-tailed test is used when
the alternative hypothesis states that the true value of the parameter specified in the null
hypothesis is less than the null hypothesis claims. A right-tailed test is used when the
alternative hypothesis states that the true value of the parameter specified in the null
hypothesis is greater than the null hypothesis claims.
The main difference between one-tailed and two-tailed tests is that one-tailed tests will
only have one critical region whereas two-tailed tests will have two critical regions. If we
require a 100(1-α)% confidence interval we have to make some adjustments when using
a two-tailed test. The confidence interval must remain a constant size, so if we are
performing a two-tailed test, as there are twice as many critical regions then these critical
regions must be half the size. This means that when performing a two-tailed test, we need
to consider α/2 significance level rather than α.
Example: A light bulb manufacturer claims that its energy-saving light bulbs last an average of 60 days. Set up a hypothesis test to check this claim and comment on what sort of test we need to use. Here the null hypothesis is H0: μ = 60 and the alternative hypothesis is H1: μ ≠ 60; because the alternative covers deviations in both directions, a two-tailed test is needed.
The example in the previous section was an instance of a one-tailed test where the null
hypothesis is rejected or accepted based on one direction of the normal distribution. In a
two-tailed test, both the tails of the null hypothesis are used to test the hypothesis. In a
two-tailed test, when a significance level of 5% is used, then it is distributed equally in
the both directions, that is, 2.5% of it in one direction and 2.5% in the other direction.
Let's understand this with an example. The mean score of the mathematics exam at a
national level is 60 marks and the standard deviation is 3 marks. The mean marks of a
class are 53. The null hypothesis is that the mean marks of the class are similar to the
national average.
The z-score of the class mean is (53 − 60)/3 ≈ −2.33, so the p-value is 0.98%. For the null hypothesis to be rejected, the p-value should be less than 2.5% in either direction of the bell curve. Since the p-value is less than 2.5%, we can reject the null hypothesis and clearly state that the average marks of the class are significantly different from the national average.
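The p-value can be reproduced with a short computation (a sketch; the values are taken from the example above):
from scipy import stats
z = (53 - 60) / 3       # z-score of the class mean
pv = stats.norm.cdf(z)  # left-tail probability
print("z =", z, " p-value =", pv)  # approximately -2.33 and 0.0098 (0.98%)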
A type 1 error appears when the null hypothesis of an experiment is true, but still, it is
rejected. A type 1 error is often called a false positive. Consider the following example.
There is a new drug that is being developed and it needs to be tested on whether it is
effective in combating diseases. The null hypothesis is that “it is not effective in
combating diseases.” The significance level is kept at 5% so that the null hypothesis can be accepted confidently 95% of the time. However, 5% of the time the null hypothesis will be rejected even though it is true, which means that even though the drug is ineffective, it is concluded to be effective. The Type 1 error is controlled by
controlling the significance level, which is α. α is the highest probability to have a Type 1
error. The lower the α, the lower will be the Type 1 error.
The Type 2 error is the kind of error that occurs when we do not reject a null hypothesis that is false. A Type 2 error is also known as a false negative. In the drug scenario, this kind of error occurs when the drug is concluded to be ineffective although it is actually effective. The probability of a Type 2 error is β, which depends on the power of the test; the probability of not committing a Type 2 error is equal to 1 − β. There are 3 parameters that can affect the power of a test: sample size (n), significance level of the test (α), and the "true" value of the tested parameter.
Sample size (n): Other things being equal, the greater the sample size, the greater
the power of the test.
Significance level (α): The lower the significance level, the lower the power of the
test. If we reduce the significance level (e.g., from 0.05 to 0.01), the region of
acceptance gets bigger. As a result, we are less likely to reject the null hypothesis.
This means we are less likely to reject the null hypothesis when it is false, so we
are more likely to make a Type II error. In short, the power of the test is reduced
when we reduce the significance level; and vice versa.
The "true" value of the parameter being tested: The greater the difference
between the "true" value of a parameter and the value specified in the null
hypothesis, the greater the power of the test.
These errors can be controlled only one at a time: if one of the errors is lowered, then the other one increases. Which error should be reduced depends on the use case and the problem statement that the analysis is trying to address. In the case of this drug scenario, typically, the Type 1 error should be lowered, because it is better to ship only a drug that is confidently effective.
Confidence Interval
Confidence level = 1 − α
So if we use an alpha value of p < 0.05 for statistical significance, then our confidence level would be 1 − 0.05 = 0.95, or 95%.
The confidence interval for sample data is calculated as below:

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, s is the sample standard deviation, and n is the sample size.
Example. Generate heights of 50 persons randomly such that the heights have normal distribution
with mean=165 and SD=20. Calculate Confidence interval for the dataset for the confidence level 95%.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
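The body of this example is not shown; a minimal sketch continuing from the imports above, using stats.norm.interval (a t-interval via stats.t.interval would also be reasonable):
# 50 heights drawn from a normal distribution with mean 165 and SD 20
heights = np.random.normal(loc=165, scale=20, size=50)
# 95% confidence interval for the mean, based on the standard error of the mean
ci = stats.norm.interval(0.95, loc=heights.mean(), scale=stats.sem(heights))
print("Sample Mean:", heights.mean())
print("95% Confidence Interval:", ci)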
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It’s a common tool for
describing simple relationships without making a statement about cause and effect. The
sample correlation coefficient, r, quantifies the strength and direction of the relationship.
A correlation coefficient quite close to 0, whether positive or negative, implies little or no relationship between the two variables. A correlation coefficient close to +1 means a positive relationship between the two variables, with increases in one of the variables being associated with increases in the other variable. A correlation coefficient close to −1 indicates a negative relationship between two variables, with an increase in one of the variables being associated with a decrease in the other variable. The most common formula is the Pearson correlation coefficient, used for measuring linear dependency between the data sets, and it is given as below.
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2}\sqrt{n \sum y^2 - \left(\sum y\right)^2}}$$
Numerical Example
Calculate the coefficient of correlation for the following two data sets: x = (41, 19, 23, 40,
55, 57, 33) and y = (94, 60, 74, 71, 82, 76, 61).
$$\sum x = 41 + 19 + 23 + 40 + 55 + 57 + 33 = 268$$
$$\sum y = 94 + 60 + 74 + 71 + 82 + 76 + 61 = 518$$
$$\sum x^2 = 41^2 + 19^2 + \dots + 33^2 = 11{,}534$$
$$\sum y^2 = 94^2 + 60^2 + \dots + 61^2 = 39{,}174$$
$$\sum xy = 41 \times 94 + 19 \times 60 + \dots + 33 \times 61 = 20{,}391$$
$$r = \frac{7 \times 20{,}391 - 268 \times 518}{\sqrt{7 \times 11{,}534 - 268^2}\sqrt{7 \times 39{,}174 - 518^2}} = 0.54$$
Example
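The code for this example is not shown in the source; a minimal sketch using SciPy's stats.pearsonr on the data from the numerical example:
from scipy import stats
x = [41, 19, 23, 40, 55, 57, 33]
y = [94, 60, 74, 71, 82, 76, 61]
r, p = stats.pearsonr(x, y)  # correlation coefficient and p-value
print("Correlation Coefficient r =", round(r, 2))  # approximately 0.54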
T-test
A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether two groups are different from one another. A t-test can only be used when comparing the means of two groups; to compare more than two groups, an ANOVA test should be used. When choosing a t-test, we need to consider two things: whether the groups being compared come from a single population or from two different populations, and whether we want to test the difference in a specific direction.
If the groups come from a single population, perform a paired-sample t-test.
If the groups come from two different populations, perform a two-sample t-test.
If one group is being compared against a standard value, perform a one-sample t-test.
Paired T-test
The paired-sample t-test statistic is calculated as below.
$$t = \frac{\bar{d}}{s/\sqrt{n}}$$
where d is the difference between paired samples, $\bar{d}$ is the mean of the differences, s is the standard deviation of the differences, and n is the sample size.
Example
An instructor gives two exams to the students. Scores of both exams are given in the table below. He/she wants to know if the exams are equally difficult.
Student Exam1 Score(x) Exam2 Score(y)
S1 63 69
S2 65 65
S3 56 62
S4 100 91
S5 88 78
S6 83 87
S7 77 79
S8 92 88
S9 90 85
S10 84 92
S11 68 69
S12 74 81
S13 87 84
S14 64 75
S15 71 84
S16 88 82
Solution
Taking d = y − x for each student:

$$\bar{d} = \frac{\sum d}{n} = 1.31$$

$$s = \sqrt{\frac{\sum (d - \bar{d})^2}{n-1}} = 7.00$$

Now,

$$t = \frac{\bar{d}}{s/\sqrt{n}} = \frac{1.31}{7.00/\sqrt{16}} = 0.75$$
Let's assume a significance level (α) = 0.05. The critical t-value for n − 1 = 15 degrees of freedom at α = 0.05 (two-tailed) is 2.131. Because 0.75 < 2.131, we accept the null hypothesis. This means the mean scores of the two exams are similar.
from scipy import stats
# exam scores from the table above (assumed reconstruction; the original setup lines are not shown)
x = [63, 65, 56, 100, 88, 83, 77, 92, 90, 84, 68, 74, 87, 64, 71, 88]
y = [69, 65, 62, 91, 78, 87, 79, 88, 85, 92, 69, 81, 84, 75, 84, 82]
tv, pv = stats.ttest_rel(x, y)  # paired t-test; its sign reflects d = x − y
print('t-statistic:', tv)
print('p-value:', pv)
if(pv>0.05):
    print("Null Hypothesis is Accepted. This means there is no difference between mean scores of two exams")
else:
    print("Null Hypothesis is Rejected. This means there is difference between mean scores of two exams")
Output
t-statistic: -0.7497768853141169
p-value: 0.4649871003972206
Null Hypothesis is Accepted. This means there is no difference between mean scores of two exams
One-Sample T-test
The one-sample t-test examines whether the mean of a sample is statistically different from a known or hypothesized population mean. It is calculated as below.

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
Numerical Example
Imagine a company wants to test the claim that their batteries last more than 40 hours.
Using a simple random sample of 15 batteries yielded a mean of 44.9 hours, with a
standard deviation of 8.9 hours. Test this claim using a significance level of 0.05.
Solution
$$t = \frac{44.9 - 40}{8.9/\sqrt{15}} = 2.13$$

The critical t-value for n − 1 = 14 degrees of freedom at α = 0.05 (one-tailed) is 1.761. Because 2.13 > 1.761, we reject the null hypothesis and conclude that the batteries last more than 40 hours.
from scipy import stats
battery_hour = [40, 50, 55, 38, 48, 62, 44, 52, 46, 44, 37, 42, 46, 38, 45]  # sample battery lifetimes in hours
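The rest of the example is not shown; a minimal sketch completing it with stats.ttest_1samp (the alternative='greater' argument makes the test right-tailed and requires a recent SciPy):
tv, pv = stats.ttest_1samp(battery_hour, popmean=40, alternative='greater')
print('t-statistic:', tv)
print('p-value:', pv)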
Two-Sample T-test
The two-sample t-test (also known as the independent-samples t-test) is a method used to test whether the unknown population means of two groups are equal or not. We can use the test when the data values are independent, are randomly sampled from two normal populations, and the two groups are independent of each other. It is carried out as below.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $n_1$ and $n_2$ are the sample sizes, and the pooled standard deviation $s_p$ is calculated as below.

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Example
Our sample data is from a group of men and women who did workouts at a gym three times a week for a year. Then, their trainer measured the body fat percentage. The data is shown in the code below.
Example
from scipy import stats
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]
#Two Sample t-test
tv, pv = stats.ttest_ind(women,men)
print('t-statistic:', tv)
print('p-value:', pv)
if(pv>0.05):
    print("Null Hypothesis is Accepted. This means Body Fat percentage of Men and Women is similar")
else:
    print("Null Hypothesis is Rejected. This means Body Fat percentage of Men and Women is different")
T-test vs Z-Test
The difference between a t-test and a z-test hinges on the differences in their respective
distributions. As mentioned, a z-test uses the Standard Normal Distribution, which is
defined as having a population mean, 𝜇, of 0 and a population standard deviation, 𝜎 , of
1. The z-statistic is calculated using the following formula.

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
Notice that this distribution uses a known population standard deviation for a data set to
approximate the population mean. However, the population standard deviation is not
always known, and the sample standard deviation, s, is not always a good
approximation. In these instances, it is better to use the T-test.
The T-Distribution looks a lot like a Standard Normal Distribution. In fact, the larger a
sample is, the more it looks like the Standard Normal Distribution - and at sample sizes
larger than 30, they are very, very similar. Like the Standard Normal Distribution, the T-
Distribution is defined as having a mean 𝜇 = 0, but its standard deviation, and thus the
width of its graph, varies according to the sample size of the data set used for the
hypothesis test. The t-statistic is calculated using the following formula:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
The standard normal or z-distribution assumes that you know the population standard deviation. The t-distribution is based on the sample standard deviation. The t-distribution is similar to a normal distribution, and its useful properties are: it has a mean of 0, it is symmetric about the mean, it has heavier tails than the normal distribution, and it approaches the standard normal distribution as the degrees of freedom increase.
The table below presents the key differences between the two statistical methods, Z-test and
T-test.
Z-Test | T-Test
Used for large sample sizes (n ≥ 30). | Used for small to moderate sample sizes (n < 30).
Requires knowledge of the population standard deviation (σ). | Performed when the population standard deviation is unknown.
Does not involve the sample standard deviation. | Involves the sample standard deviation (s).
Assumes a standard normal distribution. | Assumes a t-distribution, which varies with degrees of freedom.
Chi-square Distribution
If we repeatedly take samples and define the chi-square statistics, then we can form a chi-
square distribution. A chi-square (Χ2) distribution is a continuous probability distribution
that is used in many hypothesis tests. The shape of a chi-square distribution is determined
by the parameter k, which represents the degrees of freedom; as k increases, the peak of the distribution shifts to the right and its shape becomes more symmetric.
There are two main types of chi-square tests, namely Chi-Square for the Goodness-of-Fit and Chi-Square for the Test of Independence.
Chi-Square for the Goodness-of-Fit
A chi-square test is a statistical test that is used to compare observed and expected results.
The goal of this test is to identify whether a disparity between actual and predicted data
is due to chance or to a link between the variables under consideration. As a result, the
chi-square test is an ideal choice for aiding in our understanding and interpretation of the
connection between our two categorical variables. Pearson's chi-square test was the first chi-square test to be discovered and is the most widely used. Pearson's chi-square test statistic is given as below.

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is an observed frequency and $E_i$ is the corresponding expected frequency.
Numerical Example: A die is rolled 36 times, and the probability that each face should turn upwards is 1/6. So each face is expected to appear 36 × 1/6 = 6 times.
The null hypothesis in the chi-square test is that the observed values are similar to the expected values. The chi-square test can be performed using the chisquare function in the SciPy package. The function gives the chi-square value and p-value as output. By looking at the p-value, we can reject or accept the null hypothesis.
Example
from scipy import stats
import numpy as np
expected = np.array([6,6,6,6,6,6])
observed = np.array([7,5,3,9,6,6])
cp=stats.chisquare(observed,expected)
print(cp)
Output: P-value = 0.65
Conclusion: Since the p-value > 0.05, the null hypothesis is accepted. Thus, we conclude that the observed distribution of the die is the same as the expected distribution.
Chi-Square for the Test of Independence
A chi-square test of independence checks whether two categorical variables are related. For an example testing whether empathy is related to gender (the data and code for this example are not shown), the output was:
Output: P-value = 0.029
Conclusion: Since the p-value is less than 0.05, our null hypothesis is rejected. Thus, we conclude that empathy is related with gender.
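A test of independence can be performed with SciPy's chi2_contingency function; a sketch with a hypothetical gender-by-empathy contingency table (assumed counts, so it will not reproduce the quoted p-value):
from scipy import stats
import numpy as np
# hypothetical counts: rows = gender (male, female), columns = empathy (low, high)
table = np.array([[25, 15],
                  [14, 26]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print("Chi-square:", chi2, " P-value:", p)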
ANOVA
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups. It is often used to determine whether there are any
statistically significant differences between the means of different groups. ANOVA compares the
variation between group means to the variation within the groups. If the variation between group
means is significantly larger than the variation within groups, it suggests a significant difference
between the means of the groups.
ANOVA calculates an F-statistic by comparing between-group variability to within-group
variability. If the F-statistic exceeds a critical value, it indicates significant differences between
group means. Types of ANOVA include one-way (for comparing means of groups) and two-way
(for examining effects of two independent variables on a dependent variable). To perform the one-
way ANOVA, we can use the f_oneway() function of the SciPy package.
Example
Suppose we want to know whether or not three different exam prep programs lead to different
mean scores on a certain exam. To test this, we recruit 30 students to participate in a study and
split them into three groups. The students in each group are randomly assigned to use one of the
three exam prep programs for the next three weeks to prepare for an exam. At the end of the three
weeks, all of the students take the same exam. The exam scores for each group are shown below.
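The scores themselves are not present here; a minimal sketch with hypothetical scores for the three groups, using SciPy's f_oneway function:
from scipy import stats
# hypothetical exam scores for the three prep programs (10 students each)
g1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
g2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
g3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
fv, pv = stats.f_oneway(g1, g2, g3)
print('F-statistic:', fv)
print('p-value:', pv)
if pv > 0.05:
    print("No significant difference between the mean scores of the three groups")
else:
    print("Significant difference between the mean scores of the three groups")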