
Unit 1

Data Processing and Inferential Statistics


Review of NumPy Arrays
NumPy is a wonderful Python package, which has been created fundamentally for
scientific computing. It helps handle large multidimensional arrays and matrices, along
with a large library of high-level mathematical functions to operate on these arrays. A
NumPy array requires much less memory to store the same amount of data compared to
a Python list, which makes reading from and writing to the array faster.
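A quick way to see this difference is to compare the sizes directly; the sketch below is a minimal illustration (the element count of one million is an arbitrary choice):

import sys
import numpy as np

n = 1_000_000
lst = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Approximate size of the list: the list object plus the int objects it references
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
# Exact size of the array's data buffer
array_bytes = arr.nbytes

print("List (approx.):", list_bytes, "bytes")
print("Array:", array_bytes, "bytes")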
Creating an Array
A list of numbers can be passed to the following array function to create a NumPy
array object.

import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7],[8, 9, 10, 11]])

A NumPy array object has a number of attributes, which help in giving information about
the array. Here are its important attributes.

1 ndim: This gives the number of dimensions of the array. The following shows that
the array that we defined had two dimensions.
a.ndim
2 shape: This gives the size of each dimension of the array.
a.shape
3 size: This gives the number of elements.
a.size
4 dtype: This gives the data type of the elements in the array:
a.dtype.name
Example: Create a NumPy array of dimension 3x3 and display all of its attributes.

import numpy as np
a=np.array([[1,3,6],[2,4,7],[2,5,9]])
print(a)
print("#Dimensions:",a.ndim)
print("Shape:",a.shape)
print("#Elements:",a.size)
print("Data Type:",a.dtype.name)
Mathematical Operations

When we have an array of data, we would like to perform certain mathematical
operations on it. We will now discuss a few of the important ones in this section.

1. Array Subtraction
The following commands subtract the b array from the a array to get the resultant
c array. The subtraction happens element by element:

a = np.array( [11, 12, 13, 14])

b = np.array( [ 1, 2, 3, 4])

c=a-b
2. Squaring an Array
The following command raises each element to the power of 2 to obtain this result:

r=b**2

3. Trigonometric Function Performed on the Array


The following command applies cosine to each of the values in the b array to obtain
the following result:

r=np.cos(b)

4. Conditional Operations
The following command will apply a conditional operation to each of the elements
of the b array, in order to generate the respective Boolean values.

r=b<2

5. Matrix Multiplication
Two matrices can be multiplied element by element or with the matrix (dot) product. The
following commands will perform the element-by-element multiplication; a dot-product
sketch follows after them:

a = np.array([[1, 1], [0, 1]])

b = np.array([[2, 0],[3, 4]])

c=a * b
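For the matrix (dot) product of the same two arrays, np.dot (or the @ operator) can be used; a minimal sketch:

import numpy as np

a = np.array([[1, 1], [0, 1]])
b = np.array([[2, 0], [3, 4]])

# Matrix (dot) product: rows of a combined with columns of b
d = np.dot(a, b)  # equivalently: a @ b
print(d)  # [[5 4]
          #  [3 4]]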

Example: Create two NumPy arrays of dimension 3x4, say a and b, and then perform the
following operations.

a. Find their sum and difference
b. Find the square of elements in the first array and the cube of elements in the second array
c. Find r=sin(a)+cos(b)
d. Create a Boolean array for the first array, where an entry is True if the element is positive and False otherwise
e. Find the transpose of the first array
f. Find the element-wise multiplication of the arrays and the multiplication of both matrices

import numpy as np
a=np.array([[1,-3,4,7],[2,4,-5,9],[0,-1,8,3]])
b=np.array([[2,3,6,7],[2,5,5,8],[3,1,7,3]])
r=a+b
print("a+b=",r)
r=a-b
print("a-b=",r)
r=a**2
print("a^2=",r)
r=b**3
print("b^3=",r)
r=np.sin(a)+np.cos(b)
print("sin(a)+cos(b)=",r)
r=(a>=0)
print("(a>=0)=",r)
r=a*b
print("Element wise Multiplication of a and b=",r)
a=np.transpose(a)
print("Transpose of a=",a)
r=np.dot(a,b)
print("Matrix Multiplication of a and b=",r)

6. Indexing and Slicing


If we want to select a particular element of an array, it can be achieved using
indexes.
a[0,1]

The preceding command will select the first row and then select the second value
in the row. It can also be seen as an intersection of the first row and the second
column of the matrix. If a range of values has to be selected on a row, then we can
use the following command.

a[0 , 0:3 ]
The 0:3 value selects the first three values of the first row. The whole row of values
can be selected with the following command.

a[ 0 , : ]

An entire column of values can be selected using the following command.

a[ : , 1 ]

7. Shape Manipulation
Once the array has been created, we can change the shape of it too. The following
command flattens the array.

a.ravel()

The following command reshapes the array into a six-row, two-column
format. Also, note that when reshaping, the new shape should have the same
number of elements as the previous one.

a.shape = (6,2)

The array can be transposed too:

a.transpose()

Example: Create a NumPy array of dimension 4x3 and then perform the following
operations.

a. Display the 3rd element of the second row
b. Display the first two elements of the second row
c. Display the 2nd to 3rd rows of the array
d. Display a 2x2 slice of the top-left part of the array
e. Display the first two columns of the array
f. Convert the array to 1D
g. Convert the array to dimension 3x4

import numpy as np
a=np.random.rand(4,3)
print(a)
print("Third Element of Second Row=",a[1,2])
print("First Two Elements of Second Row=",a[1,0:2])
print("Second to Third Row:",a[1:3,:])
print("2x2 Slice of Top-Left Part=",a[0:2,0:2])
print("Fisr Two Columns=",a[:,0:2])
a=a.ravel()
print("1D Array:",a)
a.shape=(3,4)
print("a=",a)
Review of Pandas Data Structures
The pandas library is an open source Python library, specially designed for data analysis.
It has been built on NumPy and makes it easy to handle data. The pandas library brings
the richness of R to the world of Python for handling data. It has efficient data structures
to process data, perform fast joins, and read data from various sources, to name a few capabilities.
The pandas library traditionally had three data structures: Series, DataFrame, and Panel
(the Panel has been deprecated and removed in recent pandas versions).
Series
Series is a one-dimensional array, which can hold any type of data, such as integers,
floats, strings, and Python objects too. A series can be created by calling the following.

import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5))
print(s)

The random.randn function is part of the NumPy package and generates random
numbers from a standard normal distribution. The Series function creates a pandas series that consists of an index, which is
the first column, while the second column consists of the random values. At the bottom of the
output is the datatype of the series. The index of the series can be customized by calling
the following:

import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

A series can be derived from a Python dict too as below:

import pandas as pd
import numpy as np
d = {'A': 10, 'B': 20, 'C': 30}
s=pd.Series(d)
print(s)

DataFrame
DataFrame is a 2D data structure with columns that can be of different data types. It can
be seen as a table. A DataFrame can be formed from the following data structures: a
NumPy array, lists, dicts, Series, etc.

A DataFrame can be created from a dictionary of series as below:

import pandas as pd
d = {'c1': pd.Series(['A', 'B', 'C']),'c2': pd.Series([1, 2, 3, 4])}
df = pd.DataFrame(d)
print(df)

import pandas as pd
d = {'c1': ['A', 'B', 'C', 'D'],'c2': [1, 2, 3, 4]}
df = pd.DataFrame(d)
print(df)

We can also convert NumPy arrays into DataFrames.

import pandas as pd
import numpy as np
a=np.array(np.random.randn(3,4))
df=pd.DataFrame(a)
print(df)

Inserting and Exporting Data


The data is stored in various forms, such as CSV, TSV (Tab Separated Values), databases,
and so on. The pandas library makes it convenient to read data from these formats or to
export to these formats.
CSV

To read data from a .csv file, the read_csv function can be used. To write a data to the .csv
file, the to_csv function can be used.

Example: Write a program that reads the emp.csv file, displays its content, stores Eid, Ename,
and Age in another dataframe, and writes it to the emptest.csv file.
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/emp.csv')
print(emp)
d=emp[["Eid","Ename","Age"]]
print(d)
d.to_csv("/content/drive/My Drive/Data/emptest.csv")

XLS

To read data from an Excel file, the read_excel() function can be used, and the to_excel()
function can be used to write an Excel file.

Example: Write a program that reads the Book1.xlsx file, displays its content, stores Sid and
Grade in another dataframe, and writes it to the Book.xlsx file.

import pandas as pd
book = pd.read_excel('/content/drive/My Drive/Data/Book1.xlsx')
print(book)
b=book[["Sid","Grade"]]
print(b)
b.to_excel("/content/drive/My Drive/Data/Book.xlsx")

JSON Data

JSON is a syntax for storing and exchanging data. JSON is text, written with JavaScript
object notation. Python has a built-in package called json, which can be used to work with
JSON data. If we have a JSON string, we can parse it by using the json.loads() method.
The result will be a Python dictionary. If we have a Python object, we can convert it into
a JSON string by using the json.dumps() method.

Example: Represent the Id, Name, and Email of 3 persons in JSON format, load it into a
dictionary, and display it. Then represent the Name, Age, and City of 3 persons in a dictionary,
convert it into a JSON string, and display it.

import json
#JSON Data
x = """[
{"ID":101,"name":"Ram", "email":"Ram@gmail.com"},
{"ID":102,"name":"Bob", "email":"bob32@gmail.com"},
{"ID":103,"name":"Hari", "email":"hari@gmail.com"}
]"""
# The loads method converts x into a dictionary
y = json.loads(x)
print(y)
# Displaying the email of all persons from the dictionary
for r in y:
    print(r["email"])

# A Python object (dict):
x = {"Name": ["Ram","Hari","Sita"],"Age": [30,40,27],"City": ["KTM","PKR","DHN"]}

# The dumps method converts a dictionary into a JSON string
y = json.dumps(x)
print(y)
print(x["Name"])  #No error because x is a dictionary
#print(y["Name"]) #Error because y is a JSON string, not a dictionary

Database

SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in
wide use, and many alternative NoSQL databases have become quite popular. Loading
data from SQL into a DataFrame is fairly straightforward, and pandas has some functions
to simplify the process. As an example, an in-memory SQLite database using Python’s
built-in sqlite3 driver is presented below.
Example
import sqlite3
query = "CREATE TABLE Student (Sid VARCHAR(10), Sname VARCHAR(20), GPA REAL, Age INTEGER);"
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('S1', 'Ram', 3.25, 23), ('S2', 'Hari', 3.4, 24), ('S3', 'Sita', 3.7, 22)]
stmt = "INSERT INTO Student VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('SELECT * FROM Student WHERE GPA > 3.3')
rows = cursor.fetchall()
print(rows)
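As noted above, pandas can load SQL results directly into a DataFrame; a minimal sketch using read_sql_query on the same in-memory connection:

import pandas as pd

# Load the query result straight into a DataFrame
df = pd.read_sql_query('SELECT * FROM Student', con)
print(df)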

Data Cleansing
Data cleansing, sometimes known as data cleaning or data scrubbing, denotes the
procedure of rectifying inaccurate, unfinished, duplicated, or other flawed data within a
dataset. This task entails detecting data discrepancies and subsequently modifying,
enhancing, or eliminating the data to rectify them. Through data cleansing, data quality
is enhanced, thereby furnishing more precise, uniform, and dependable information
crucial for organizational decision-making.

Checking the Missing Data

Generally, most data will have some missing values. There could be various reasons for
this: the source system which collects the data might not have collected the values or the
values may never have existed. Once you have the data loaded, it is essential to check the
missing elements in the data. Depending on the requirements, the missing data needs to
be handled. It can be handled by removing a row or replacing a missing value with an
alternative value. Commands isnull(), notnull(), dropna() are widely used for checking
null values.

 isnull(): It examines columns for NULL values and generates a Boolean series
where True represents NaN values and False signifies non-null values.
 notnull(): It examines columns for NULL values and generates a Boolean series
where True represents non-null values and False signifies NULL values.
 dropna(): It eliminates all rows containing NULL values from the DataFrame.
Example
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
for c in emp.columns:
    print(emp[c].isnull().value_counts())

emp=emp.dropna()
print("Cleaned Data")

for c in emp.columns:
    print(emp[c].isnull().value_counts())

Filling Missing Values

To address null values within datasets, we employ functions like fillna(), replace(), and
interpolate(). These functions substitute NaN values with specific values.
fillna(): The fillna() function substitutes NULL values with a designated value. By default,
it generates a new DataFrame object unless the inplace parameter is set to True, in which
case it performs the replacement within the original DataFrame.
Example
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')

emp["First Name"].fillna(value="Unknown",inplace=True)
emp["Gender"].fillna(value="Unknown",inplace=True)
emp["Salary"].fillna(value=emp["Salary"].mean,inplace=True)
emp["Bonus %"].fillna(value=emp["Bonus %"].mean,inplace=True)
emp["Team"].fillna(value="Unknown",inplace=True)

for c in emp.columns:
print(emp[c].isnull().value_counts())

The Pandas interpolate() function is utilized to populate NaN values within the
DataFrame or Series by employing different interpolation techniques aimed at filling the
missing values in the data.

import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
print(*emp["Salary"])
emp["Salary"].interpolate(inplace=True,method="linear")
print(*emp["Salary"])

Merging Data
To combine datasets together, the concat function of pandas can be utilized. We can
concatenate two or more dataframes together.

Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
e1=emp[0:5]
print("First 5 Rows of Dataframe:")
print(e1)
e2=emp[10:15]
print("Rows 100-15 of Dataframe:")
print(e2)
print("Concatenated Dataframe:")
e=pd.concat([e1,e2])
print(e)
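Besides concatenation, pandas also supports database-style joins through the merge function; a minimal sketch on two small dataframes (the column names here are invented for illustration):

import pandas as pd

emp = pd.DataFrame({"Eid": [1, 2, 3], "Ename": ["Ram", "Hari", "Sita"]})
sal = pd.DataFrame({"Eid": [1, 2, 4], "Salary": [50000, 60000, 55000]})

# Inner join on the common key column Eid
m = pd.merge(emp, sal, on="Eid", how="inner")
print(m)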

Data operations
Once the missing data is handled, various operations such as aggregate operations, joins
etc. can be performed on the data.
Aggregation Operations
There are a number of aggregation operations, such as average, sum, and so on, which
we would like to perform on a numerical field. These aggregate methods are discussed
below.

 Average: The mean() method of a pandas dataframe is used for finding the average of a specified numerical field of the dataframe.
 Sum: The sum() method of a pandas dataframe is used for finding the total of a specified numerical field of the dataframe.
 Max: The max() method of a pandas dataframe is used for finding the maximum value of a specified numerical field of the dataframe.
 Min: The min() method of a pandas dataframe is used for finding the minimum value of a specified numerical field of the dataframe.
 Standard Deviation: The std() method of a pandas dataframe is used for finding the standard deviation of a specified numerical field of the dataframe.
 Count: The count() method of a pandas dataframe is used for finding the total number of values in a specified field of the dataframe.
Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
avgsal=emp["Salary"].mean()
print("Average Salary=",avgsal)
totsal=emp["Salary"].sum()
print("Total Salary=",totsal)
maxsal=emp["Salary"].max()
print("Maximum Salary=",maxsal)
minsal=emp["Salary"].min()
print("Minimum Salary=",minsal)
nemp=emp["First Name"].count()
print("#Employees=",nemp)
teams=emp["Team"].drop_duplicates().count()
print("#Teams=",teams)
std=emp["Bonus %"].std()
print("Stadrard Deviation of Bonus=",std)

groupby Function

A groupby operation involves some combination of splitting the object, applying a
function, and combining the results. This can be used to group large amounts of data and
compute operations on these groups.

import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
emp.dropna(inplace=True)
avgsal=emp[["Team","Salary"]].groupby(["Team"]).mean()
print("Average Salary For Each Team")
print(avgsal)
gencount=emp.groupby(["Gender"]).count()
print("#Employees Gender Wise")
print(gencount)
minbonus=emp[["Gender","Bonus %"]].groupby(["Gender"]).min()
print("Minimum Bonus% For Each Gender")
print(minbonus)

Various Forms of Distribution

There are various kinds of probability distributions, and each distribution shows the
probability of different outcomes for a random experiment.
Normal Distribution

Normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the mean are more
frequent in occurrence than data far from the mean. In graphical form, the normal
distribution appears as a "bell curve" and is completely determined by two parameters:
its mean μ and its standard deviation σ. The mean indicates where the bell is centered,
and the standard deviation how "wide" it is. We say the data is "normally distributed" if the data exhibit the following properties:
 mean = median = mode
 Data symmetric about the center, that is, 50% of values less than the mean and 50% greater than the mean.

The probability density function of the normal distribution is given as below:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

For all normal distributions, 68.2% of the observations will appear within plus or minus
one standard deviation of the mean; 95.4% of the observations will fall within +/- two
standard deviations; and 99.7% within +/- three standard deviations. This fact is
sometimes referred to as the "empirical rule," a heuristic that describes where most of the
data in a normal distribution will appear. This means that data falling outside of three
standard deviations ("3-sigma") would signify rare occurrences.

Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

dist = stats.norm(loc=5.6, scale=1)#Here 5.6 is the mean and 1 is the SD


# Generate a sample of 100 random penguin heights
heights = dist.rvs(size=100)

heights=heights.round(2)
heights=np.sort(heights)

prob = dist.pdf(x=5.2)
print("Density at height 5.2=",prob)
probs = dist.pdf(x=[4.5,5,5.5,6,6.5])
print("Densities at heights=",probs)
probs = dist.pdf(x=heights)

#plotting histogram with density curve

plt.figure(figsize=(6, 4))
plt.hist(heights, bins=20, density=True)
plt.title("Height Histogram and Density Curve")
plt.xlabel("Height")
plt.ylabel("Density")
plt.plot(heights, probs)
plt.show()

The method pdf() from the norm class returns the probability density at specific values of
a normal distribution. For a continuous variable this is a density rather than a probability;
probabilities are obtained by integrating the density over an interval. PDF stands for
probability density function.

The norm method cdf() helps us to calculate the proportion of a normally distributed
population that is less than or equal to a given value. CDF stands for Cumulative
Distribution Function.
Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

dist = stats.norm(loc=5.5, scale=2)


# Generate a sample of 100 random penguin heights
heights = dist.rvs(size=1000)
heights=heights.round(2)
heights=np.sort(heights)
prob = dist.pdf(x=6)
print("Density at Height=6:",prob)
prob = dist.cdf(x=6)
print("Probability of Height<=6:",prob)
Z-score

Z-score is a statistical measurement that describes a value's relationship to the mean of a
group of values. Z-score is measured in terms of standard deviations from the mean. If a
Z-score is 0, it indicates that the data point's score is identical to the mean value. A Z-score
of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores
may be positive or negative, with a positive value indicating the value is larger than the
mean and a negative z-score indicating it is smaller than the mean. It is calculated by
using the formula given below.

z= (x − μ)/σ

Here, x is the value in the distribution, μ is the mean of the distribution, and σ is the
standard deviation of the distribution. Conversely, if x is a normal random variable with
mean μ and standard deviation σ, it is calculated as below.

x = σz + μ

Numerical Example
A survey of daily travel time had these results (in minutes): 26, 33, 65, 28, 34, 55, 25, 44,
50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34. Convert the values to z-scores ("standard scores").
Solution
μ=38.8

σ=11.4

Original Value    Standard Score (z-score)
26                (26-38.8) / 11.4 = −1.12
33                (33-38.8) / 11.4 = −0.51
65                (65-38.8) / 11.4 = 2.30
Example
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

dist = stats.norm(loc=50, scale=10)


scores = dist.rvs(size=100)
scores=scores.round()
print(*scores)
plt.hist(scores, bins=30)
plt.title("Histrogram of Origical Scores")
plt.show()

#Converting scores to z-scores


z=stats.zscore(scores).round(3)
print(*z)
plt.hist(z, bins=30)
plt.title("Histrogram of Z-values of Scores")
plt.show()

#converting z-scores to value in distribution


s=(scores.std()*z+scores.mean()).round()
print(*s)
print(*scores)

Scikit-learn's StandardScaler computes the z-score of every value while normalizing data,
and the normalized data can be inverse-scaled accordingly, as mentioned above; a minimal
sketch follows.
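The sketch below assumes scikit-learn is available:

from sklearn.preprocessing import StandardScaler
import numpy as np

scores = np.array([50., 60., 70., 80., 90.]).reshape(-1, 1)  # 2D input expected
scaler = StandardScaler()
z = scaler.fit_transform(scores)  # z-scores of the data
print(z.ravel())
original = scaler.inverse_transform(z)  # back to the original scale
print(original.ravel())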
Binomial Distribution
Binomial distribution is a probability distribution that summarizes the likelihood that a
variable will take one of two independent values under a given set of parameters. The
distribution is obtained by performing a number of Bernoulli trials. A Bernoulli trial is
assumed to meet each of these criteria.

 There must be only 2 possible outcomes.
 Each outcome has a fixed probability of occurring. A success has the probability of p, and a failure has the probability of 1 – p.
 Each trial is completely independent of all others.

For example, the probability of getting a head or a tail is 50%. If we take the same coin
and flip it n times, the probability of getting a head x times can be computed using the
probability mass function (PMF) of the binomial distribution. The binomial distribution
formula, for any random variable x, is given by

P(x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!(n-x)!} p^x q^{n-x}

where n is the number of times the coin is flipped, p is the probability of success,
q = 1 – p is the probability of failure, and x is the number of successes desired.

Numerical Example: If a coin is tossed 5 times, find the probability of: (a) exactly 2
heads and (b) at least 4 heads.
Solution
Number of trials: n=5

Probability of head: p=1/2 and hence the probability of tail, q=1/2

For exactly two heads: x=2

P(x = 2) = \frac{5!}{2! \times 3!} \times 0.5^2 \times 0.5^3 = 0.3125

Again, for at least 4 heads: x ≥ 4

P(x ≥ 4) = P(x = 4) + P(x = 5)
         = \frac{5!}{4! \times 1!} \times 0.5^4 \times 0.5^1 + \frac{5!}{5! \times 0!} \times 0.5^5 \times 0.5^0
         = 0.15625 + 0.03125 = 0.1875

Example
from scipy import stats
import matplotlib.pyplot as plt

dist=stats.binom(n=5,p=0.5)
prob=dist.pmf(k=2)
print("Proability of Two Heads=",prob)
prob=dist.pmf(k=4)+dist.pmf(k=5)
print("Praobability of at least 4 heads=",prob)

Note!!!

A probability mass function is a function that gives the probability that a discrete random
variable is exactly equal to some value. A probability mass function differs from a
probability density function (PDF) in that the latter is associated with continuous rather
than discrete random variables. A PDF must be integrated over an interval to yield a
probability.
Poisson Distribution

Poisson distribution is a discrete distribution. It estimates how many times an event can
happen in a specified time, provided the mean occurrence of the event in the interval. For
example, if someone eats twice a day, what is the probability he will eat thrice? If lambda (λ)
is the mean occurrence of the events per interval, then the probability of having k
occurrences within a given interval is given by the following formula:

P(k) = \frac{\lambda^k e^{-\lambda}}{k!}

where e is Euler's number, k is the number of occurrences for which the probability
is to be determined, and λ is the mean number of occurrences.
Numerical Example

In the World Cup, an average of 2.5 goals are scored in each game. Modeling this situation
with a Poisson distribution, what is the probability that 3 goals are scored in a game?
What is the probability that 5 goals are scored in a game?
Solution
Given, λ=2.5

P(x = 3) = \frac{2.5^3 \times e^{-2.5}}{3!} = 0.214

P(x = 5) = \frac{2.5^5 \times e^{-2.5}}{5!} = 0.0668

Example
from scipy import stats

dist=stats.poisson(2.5)#2.5 is the average number of goals per game


prob=dist.pmf(k=1)
print("Probability of having 1-goal=",prob)
prob=dist.pmf(k=3)
print("Probability of having 3-goals=",prob)
prob=dist.pmf(k=5)
print("Probability of having 5-goals=",prob)

P-value

The P-value is known as the probability value. It is defined as the probability of getting a
result that is either the same as or more extreme than the actual observations. A p-value is
used as a statistical test to determine whether the null hypothesis is rejected or not. The null
hypothesis is a statement that says that there is no difference between two measures. If
the hypothesis is that people who clock in 4 hours of study every day score more than 90
marks out of 100, the null hypothesis here would be that there is no relation between the
number of hours clocked in and the marks scored. If the p-value is equal to or less than
the significance level, then the null hypothesis is inconsistent with the data and needs to be
rejected. The table below shows the hypothesis interpretations.

P-value          Decision
P-value > 0.05   The result is not statistically significant; accept the null hypothesis.
P-value < 0.05   The result is statistically significant; generally, reject the null hypothesis in favor of the alternative hypothesis.
P-value < 0.01   The result is highly statistically significant; reject the null hypothesis in favor of the alternative hypothesis.

Suppose the null hypothesis is "It is common for students to score 68 marks in mathematics."
Let's define the significance level at 5%. If the p-value is less than 5%, then the null
hypothesis is rejected and it is not common to score 68 marks in mathematics. First we
calculate the z-score of 68 marks (say z68), and then we calculate the p-value for the given
z-score as below.

pv=p(z≥z68)

This means pv*100% of the students score above the specified score of 68.
import numpy as np
from scipy import stats

#Generate 100 random scores with mean=50 and SD=10

dist=stats.norm(loc=50,scale=10)
scores=dist.rvs(size=100)

mean=scores.mean()
SD=scores.std()
z = (68-mean)/SD #z-value of score=68
print("Z-value of score=68:",z)

p=stats.norm.cdf(z) #probability of score<68
pv=1-p #probability of score>=68
print(pv)
pvp=np.round(pv*100,2)
print(f"p-value={pvp}%")
if(pv>0.05):
    print("Null hypothesis is accepted: It is common to score 68 marks in mathematics.")
else:
    print("Null hypothesis is rejected: It is not common to score 68 marks in mathematics.")
One-tailed and Two-tailed Tests

A one-tailed test may be either left-tailed or right-tailed. A left-tailed test is used when
the alternative hypothesis states that the true value of the parameter specified in the null
hypothesis is less than the null hypothesis claims. A right-tailed test is used when the
alternative hypothesis states that the true value of the parameter specified in the null
hypothesis is greater than the null hypothesis claims.

The main difference between one-tailed and two-tailed tests is that one-tailed tests will
only have one critical region whereas two-tailed tests will have two critical regions. If we
require a 100(1-α)% confidence interval we have to make some adjustments when using
a two-tailed test. The confidence interval must remain a constant size, so if we are
performing a two-tailed test, as there are twice as many critical regions then these critical
regions must be half the size. This means that when performing a two-tailed test, we need
to consider α/2 significance level rather than α.

Example: A light bulb manufacturer claims that its energy-saving light bulbs last an
average of 60 days. Set up a hypothesis test to check this claim and comment on what sort
of test we need to use. Since the claim states an exact average, deviations in either direction
matter, so a two-tailed test is appropriate; the null hypothesis is that the mean lifetime is 60 days.
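A minimal sketch of this test with a hypothetical sample of bulb lifetimes (the data is invented for illustration):

from scipy import stats

# Hypothetical lifetimes (days) of 10 sampled bulbs
lifetimes = [58, 62, 59, 61, 57, 63, 60, 56, 64, 59]

# Two-tailed one-sample t-test against the claimed mean of 60 days
tv, pv = stats.ttest_1samp(lifetimes, 60)
print("t-statistic:", tv)
print("p-value:", pv)  # a large p-value gives no evidence against the claim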

The example in the previous section was an instance of a one-tailed test where the null
hypothesis is rejected or accepted based on one direction of the normal distribution. In a
two-tailed test, both the tails of the null hypothesis are used to test the hypothesis. In a
two-tailed test, when a significance level of 5% is used, then it is distributed equally in
the both directions, that is, 2.5% of it in one direction and 2.5% in the other direction.
Let's understand this with an example. The mean score of the mathematics exam at a
national level is 60 marks and the standard deviation is 3 marks. The mean marks of a
class are 53. The null hypothesis is that the mean marks of the class are similar to the
national average.

from scipy import stats


zs = ( 53 - 60 ) / 3.0
print(f"z-score={zs}")
pv= stats.norm.cdf(zs)
print(f"p-value={pv}")
pv=(pv*100).round(2)
print(f"p-value={pv}%")

So, the p-value is 0.98%. For the null hypothesis to be rejected, the p-value should be
less than 2.5% in either direction of the bell curve. Since 0.98% is less than 2.5%, we
can reject the null hypothesis and clearly state that the average marks of the class are
significantly different from the national average.

Type 1 and Type 2 Errors

A type 1 error appears when the null hypothesis of an experiment is true, but still, it is
rejected. A type 1 error is often called a false positive. Consider the following example.
There is a new drug that is being developed and it needs to be tested on whether it is
effective in combating diseases. The null hypothesis is that “it is not effective in
combating diseases." The significance level is kept at 5% so that the null hypothesis can
be accepted confidently 95% of the time. However, 5% of the time the null hypothesis
will be rejected even though it is true, which means that even though the drug is
ineffective, it is concluded to be effective. The Type 1 error is controlled by
controlling the significance level, α. α is the highest probability of a Type 1
error; the lower the α, the lower the Type 1 error.
The Type 2 error is the kind of error that occurs when we do not reject a null hypothesis
that is false. A type 2 error is also known as a false negative. This kind of error occurs in the
drug scenario when the drug is accepted as ineffective but is actually effective. The
probability of a type 2 error is β. Beta depends on the power of the test; the
probability of not committing a type 2 error is equal to 1-β. There are 3 parameters that can
affect the power of a test: sample size (n), significance level of the test (α), and the "true"
value of the tested parameter.

 Sample size (n): Other things being equal, the greater the sample size, the greater
the power of the test.
 Significance level (α): The lower the significance level, the lower the power of the
test. If we reduce the significance level (e.g., from 0.05 to 0.01), the region of
acceptance gets bigger, so we are less likely to reject the null hypothesis even when
it is false, and therefore more likely to make a Type II error. In short, the power of
the test is reduced when we reduce the significance level, and vice versa.
 The "true" value of the parameter being tested: The greater the difference
between the "true" value of a parameter and the value specified in the null
hypothesis, the greater the power of the test.
These errors can be controlled one at a time; if one of the errors is lowered, then the other
one increases. Which error should be reduced depends on the use case and the problem
statement that the analysis is trying to address. In the case of the drug scenario, typically,
the Type 1 error should be lowered, because it is better to ship only a drug that is
confidently effective.

Confidence Interval

When we make an estimate in statistics, whether it is a summary statistic or a test statistic,
there is always uncertainty around that estimate because the number is based on a sample
of the population we are studying. A confidence interval is the mean of our estimate plus
and minus the variation in that estimate. This is the range of values we expect our
estimate to fall between if we experiment again or re-sample the population in the same
way.

The confidence level is the percentage of times we expect to reproduce an estimate
between the upper and lower bounds of the confidence interval. For example, if we
construct a confidence interval with a 95% confidence level, we are confident that 95 out
of 100 times the estimate will fall between the upper and lower values specified by the
confidence interval. Our desired confidence level is usually one minus the alpha (α) value
we used in our statistical test:

Confidence level = 1 − α

So if we use an alpha value of 0.05 for statistical significance, then our confidence
level would be 1 − 0.05 = 0.95, or 95%.
The confidence interval for sample data is calculated as below:

 Find the sample mean: \bar{x} = (x_1 + x_2 + \cdots + x_n)/n
 Calculate the standard deviation: SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}
 Find the standard error: the standard error of the mean is the deviation of the sample mean from the population mean. It is defined using the following formula: SE = \frac{SD}{\sqrt{n}}
 Finally, find the confidence interval: Upper/Lower limit = \bar{x} \pm z \times SE, where z is the z-score of the given confidence level.
Note: The z-score of various confidence levels is given below.

Confidence Level    Z-score
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.575
Numerical Example: Consider the following exam scores of 10 students: {80, 95, 90, 90, 95, 75, 75, 85, 90, 80}. What will be the confidence interval for the confidence level 95%?
Solution
\bar{x} = 855/10 = 85.5, SD = \sqrt{522.5/9} \approx 7.62, SE = 7.62/\sqrt{10} \approx 2.41
Upper/Lower limit = 85.5 \pm 1.96 \times 2.41, so the 95% confidence interval is approximately (80.78, 90.22).
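The same interval can be computed in Python; a minimal sketch:

from scipy import stats
import numpy as np

scores = np.array([80, 95, 90, 90, 95, 75, 75, 85, 90, 80])
mean = scores.mean()
SE = stats.sem(scores)  # standard error of the mean
ll = mean - 1.96 * SE
ul = mean + 1.96 * SE
print(f"Confidence Interval=({ll:.2f},{ul:.2f})")  # approximately (80.78, 90.22)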

Example: Generate heights of 50 persons randomly such that the heights have a normal distribution
with mean=165 and SD=20. Calculate the confidence interval for the dataset for the confidence level 95%.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

#Generate 50 random heights with mean=165 and SD=20


dist = stats.norm(loc=165, scale=20)
heights = dist.rvs(size=50)
mean=heights.mean()
print("Average Heights",mean)
SE=stats.sem(heights)#sem calculates standard error of mean
print("Standard Error=",SE)
ul=mean+1.96*SE
ll=mean-1.96*SE
print(f"Confidence Interval=({ll},{ul})")

Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It’s a common tool for
describing simple relationships without making a statement about cause and effect. The
sample correlation coefficient, r, quantifies the strength and direction of the relationship.
A correlation coefficient quite close to 0, whether positive or negative, implies little or no
relationship between the two variables. A correlation coefficient close to +1 means a
positive relationship between the two variables, with increases in one variable
being associated with increases in the other. A correlation coefficient close to -1
indicates a negative relationship between the two variables, with an increase in one
variable being associated with a decrease in the other. The most common
formula is the Pearson correlation coefficient, used for linear dependency between the
data sets, and is given as below.
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\sqrt{n\sum y^2 - (\sum y)^2}}

Numerical Example
Calculate the coefficient of correlation for the following two data sets: x = (41, 19, 23, 40,
55, 57, 33) and y = (94, 60, 74, 71, 82, 76, 61).

\sum x = 41 + 19 + 23 + 40 + 55 + 57 + 33 = 268

\sum y = 94 + 60 + 74 + 71 + 82 + 76 + 61 = 518

\sum xy = (41 \times 94) + (19 \times 60) + \cdots + (33 \times 61) = 20,391

\sum x^2 = 41^2 + 19^2 + \cdots + 33^2 = 11,534

\sum y^2 = 94^2 + 60^2 + \cdots + 61^2 = 39,174

r = \frac{7 \times 20,391 - 268 \times 518}{\sqrt{7 \times 11,534 - 268^2}\sqrt{7 \times 39,174 - 518^2}} = 0.54
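This result can be checked with SciPy using the same data; a minimal sketch:

from scipy import stats

x = [41, 19, 23, 40, 55, 57, 33]
y = [94, 60, 74, 71, 82, 76, 61]
r, p = stats.pearsonr(x, y)
print("Correlation Coefficient:", round(r, 2))  # approximately 0.54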

Example

from scipy import stats


import numpy as np
x = np.array([2, 4, 3, 9, 7, 6, 5])
y = np.array([5, 7, 7, 18, 15, 11, 10])
r=stats.pearsonr(x,y)#computes pearson correlation coefficient
print("Result:",r)
print("Correlation Coefficient:",r[0])

T-test
A t-test is a statistical test that is used to compare the means of two groups. It is often used in
hypothesis testing to determine whether two groups are different from one another. A t-test can only
be used when comparing the means of two groups; if we want to compare more than two groups, use
an ANOVA test. When choosing a t-test, we will need to consider two things: whether the groups
being compared come from a single population or two different populations, and whether we want
to test the difference in a specific direction.

 If the groups come from a single population perform a paired sample t test.
 If the groups come from two different populations perform a two-sample t-test.
 If there is one group being compared against a standard value, perform a one-sample t test.

Paired Sample T-Test


This hypothesis testing is conducted when two groups belong to the same population.
The groups are studied either at two different times or under two varied conditions. They
could be pre and post test results from the same people. The formula used to obtain the
t-value is:

t = \frac{\bar{d}}{s/\sqrt{n}}

where d is the difference between paired samples, \bar{d} is the mean of the differences,
s is the standard deviation of the differences, and n is the sample size.

Example

An instructor takes two exams of the students. Scores of both exams are given in the table
below. He/she wants to know if the exams are equally difficult.
Student Exam1 Score(x) Exam2 Score(y)
S1 63 69
S2 65 65
S3 56 62
S4 100 91
S5 88 78
S6 83 87
S7 77 79
S8 92 88
S9 90 85
S10 84 92
S11 68 69
S12 74 81
S13 87 84
S14 64 75
S15 71 84
S16 88 82

Solution

Student    Exam1 Score(x)    Exam2 Score(y)    d = y − x    d²


S1 63 69 6 36
S2 65 65 0 0
S3 56 62 6 36
S4 100 91 -9 81
S5 88 78 -10 100
S6 83 87 4 16
S7 77 79 2 4
S8 92 88 -4 16
S9 90 85 -5 25
S10 84 92 8 64
S11 68 69 1 1
S12 74 81 7 49
S13 87 84 -3 9
S14 64 75 11 121
S15 71 84 13 169
S16 88 82 -6 36
Now,

\bar{d} = 21/16 = 1.31

s = \sqrt{\frac{\sum (d - \bar{d})^2}{n-1}} = 7.00

and

t = \frac{\bar{d}}{s/\sqrt{n}} = \frac{1.31}{7.00/\sqrt{16}} = 0.75
Let’s assume, Significance level (α) = 0.05

Degree of freedom (df)= n-1=15

The tabulated t-value with α = 0.05 and 15 degrees of freedom is 2.131.

Because 0.75 < 2.131, we accept the null hypothesis. This means the mean scores of the two
exams are similar.

 Exams are equally difficult.

In Python, we can perform a paired t-test using the scipy.stats.ttest_rel() function. It
performs the t-test on two related samples of scores.
Example
from scipy import stats

# Scores of the two exams

Exam1_Score = [63, 65, 56, 100, 88, 83, 77, 92, 90, 84, 68, 74, 87, 64, 71, 88]
Exam2_Score = [69, 65, 62, 91, 78, 87, 79, 88, 85, 92, 69, 81, 84, 75, 84, 82]

# Perform paired t-test

tv, pv = stats.ttest_rel(Exam1_Score, Exam2_Score)

print('t-statistic:', tv)
print('p-value:', pv)
if(pv>0.05):
    print("Null Hypothesis is Accepted. This means there is no difference between the mean scores of the two exams")
else:
    print("Null Hypothesis is Rejected. This means there is a difference between the mean scores of the two exams")

Output
t-statistic: -0.7497768853141169

p-value: 0.4649871003972206

Null Hypothesis is Accepted. This means there is no difference between the mean scores
of the two exams.

One-Sample T-test

The one-sample t-test examines whether the mean of a sample is statistically different
from a known or hypothesized population mean. It is calculated as below.

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}
Numerical Example

Imagine a company wants to test the claim that their batteries last more than 40 hours.
Using a simple random sample of 15 batteries yielded a mean of 44.9 hours, with a
standard deviation of 8.9 hours. Test this claim using a significance level of 0.05.
Solution
t = \frac{44.9 - 40}{8.9/\sqrt{15}} = 2.13

Given, significance level (α) = 0.05

Degree of freedom (df)= n-1=14

The tabulated t-value with α = 0.05 and 14 degrees of freedom is 1.761.

Because 2.13 > 1.761, we reject the null hypothesis and conclude that batteries last
more than 40 hours.

In Python, we can perform a one-sample t-test using the scipy.stats.ttest_1samp()
function.
Example
from scipy import stats

battery_hours = [40, 50, 55, 38, 48, 62, 44, 52, 46, 44, 37, 42, 46, 38, 45]

#One-sample t-test against the claimed mean of 40 hours

tv, pv = stats.ttest_1samp(battery_hours, 40)
print('t-statistic:', tv)
print('p-value:', pv)
if(pv>0.05):
    print("Null Hypothesis is Accepted. This means batteries last 40 hours")
else:
    print("Null Hypothesis is Rejected. This means batteries last for more than or less than 40 hours")

Two-Sample T-test

The two-sample t-test (also known as the independent samples t-test) is a method used
to test whether the unknown population means of two groups are equal or not. We can
use the test when our data values are independent, are randomly sampled from two
normal populations, and form two independent groups. It is carried out as below.

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where \bar{x}_1 and \bar{x}_2 are the sample means, n_1 and n_2 are the sample sizes,
and s_p is the pooled standard deviation, calculated as below.

s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}

Example
Our sample data is from a group of men and women who did workouts at a gym three
times a week for a year. Then, their trainer measured the body fat. The table below shows
the data.

Group Body Fat Percentage


Men 13.3,6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0,24.0,15.0, 1.0, 15.0
Women 22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0
Determine whether the underlying populations of men and women at the gym have the
same mean body fat.
Solution
We have the ttest_ind() function in Python to perform a two-sample t-test.

Example
from scipy import stats
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]
#Two-sample t-test
tv, pv = stats.ttest_ind(women, men)
print('t-statistic:', tv)
print('p-value:', pv)
if(pv>0.05):
    print("Null Hypothesis is Accepted. This means the body fat percentage of men and women is similar")
else:
    print("Null Hypothesis is Rejected. This means the body fat percentage of men and women is different")
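The same statistic can also be computed directly from the pooled-variance formula above; a minimal sketch:

import numpy as np

men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0])
n1, n2 = len(women), len(men)
s1, s2 = women.std(ddof=1), men.std(ddof=1)  # sample standard deviations
# Pooled standard deviation
sp = np.sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2))
t = (women.mean() - men.mean()) / (sp * np.sqrt(1/n1 + 1/n2))
print("t-statistic:", t)  # matches stats.ttest_ind(women, men)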

T-test vs Z-Test

The difference between a t-test and a z-test hinges on the differences in their respective
distributions. As mentioned, a z-test uses the Standard Normal Distribution, which is
defined as having a population mean, μ, of 0 and a population standard deviation, σ, of
1. It is calculated using the following formula:

z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}

where \bar{x} is a measured sample mean, μ is the hypothesized population mean,
σ is the population standard deviation, and n is the sample size.

Notice that this distribution uses a known population standard deviation for a data set to
approximate the population mean. However, the population standard deviation is not
always known, and the sample standard deviation, s, is not always a good
approximation. In these instances, it is better to use the t-test.

The T-Distribution looks a lot like a Standard Normal Distribution. In fact, the larger a
sample is, the more it looks like the Standard Normal Distribution - and at sample sizes
larger than 30, they are very, very similar. Like the Standard Normal Distribution, the T-Distribution
is defined as having a mean μ = 0, but its standard deviation, and thus the
width of its graph, varies according to the sample size of the data set used for the
hypothesis test. It is calculated using the following formula:

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

The standard normal or z-distribution assumes that you know the population standard
deviation, while the t-distribution is based on the sample standard deviation. The t-distribution
is similar to a normal distribution: it is bell-shaped and symmetric about zero, it has heavier
tails than the normal distribution, and it approaches the standard normal distribution as the
sample size grows.

The table below presents the key differences between the two statistical methods, the Z-test and
the T-test.

Z-Test                                          T-Test
Used for large sample sizes (n >= 30).          Used for small to moderate sample sizes (n < 30).
Requires knowledge of the population            Performed when the population standard
standard deviation (σ).                         deviation (σ) is unknown.
Does not involve the sample standard            Involves the sample standard deviation (s).
deviation.
Assumes a standard normal distribution.         Assumes a t-distribution, which varies with
                                                degrees of freedom.
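As an illustration of the z-test formula (SciPy has no dedicated one-sample z-test function, so the sketch below computes the statistic manually; the sample values and the known σ are invented for illustration):

import numpy as np
from scipy import stats

sample = np.array([52, 48, 55, 60, 51, 49, 58, 53, 47, 56,
                   50, 54, 59, 46, 57, 52, 48, 55, 61, 50,
                   53, 49, 56, 51, 54, 58, 47, 52, 55, 50])  # n = 30
mu = 50      # hypothesized population mean
sigma = 4    # known population standard deviation (assumed)

z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))
p = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed p-value
print("z =", z, "p-value =", p)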

Chi-square Distribution

If we repeatedly take samples and define the chi-square statistics, then we can form a chi-
square distribution. A chi-square (Χ2) distribution is a continuous probability distribution
that is used in many hypothesis tests. The shape of a chi-square distribution is determined
by the parameter k, which represents the degrees of freedom; different values of k produce
differently shaped chi-square distributions.
There are two main types of Chi-Square tests, namely the Chi-Square Goodness-of-Fit test
and the Chi-Square Test of Independence.
Chi-Square for the Goodness-of-Fit
A chi-square test is a statistical test that is used to compare observed and expected results.
The goal of this test is to identify whether a disparity between actual and predicted data
is due to chance or to a link between the variables under consideration. As a result, the
chi-square test is an ideal choice for aiding in our understanding and interpretation of the
connection between our two categorical variables. Pearson’s chi-square test was the first
chi-square test to be discovered and is the most widely used. Pearson’s chi-square test
statistic is given as below:

\chi^2 = \sum \frac{(O - E)^2}{E}

where O is the observed frequency and E is the expected frequency.

Suppose a die is rolled 36 times; the probability that each face turns upwards is 1/6.
So, the expected distribution is 6 occurrences of each of the six faces. Suppose the
observed distribution is 7, 5, 3, 9, 6, and 6 occurrences of the faces 1 through 6,
respectively, as used in the example below.

The null hypothesis in the chi-square test is that the observed value is similar to the
expected value. The chi-square test can be performed using the chisquare function in the
SciPy package. The function gives the chi-square value and p-value as output. By looking at
the p-value, we can reject or accept the null hypothesis.
Example
from scipy import stats
import numpy as np
expected = np.array([6,6,6,6,6,6])
observed = np.array([7,5,3,9,6,6])
cp=stats.chisquare(observed,expected)
print(cp)

Output: P-value=0.65
Conclusion: Since the p-value > 0.05, the null hypothesis is accepted. Thus, we conclude that the
observed distribution of the die is the same as the expected distribution.

Chi-square Test of Independence

The Chi-Square test of independence is used to determine if there is a significant relationship
between two nominal (categorical) variables. For example, say a researcher wants to examine
the relationship between gender (male vs. female) and empathy (high vs. low). The chi-square
test of independence can be used to examine this relationship. The null hypothesis for this test is
that there is no relationship between gender and empathy. The alternative hypothesis is that
there is a relationship between gender and empathy (e.g. there are more high-empathy females
than high-empathy males). The Chi-Square test of independence can be performed using the
chi2_contingency function in the SciPy package.
Example
Suppose a researcher collected data about the empathy of males and females. He/she has collected
data about 300 males and 200 females, as given in the table.

Gender    High Empathy    Low Empathy
Male      180             120
Female    140             60

Null Hypothesis (H0): There is no relationship between gender and empathy.

A Python program to test the above null hypothesis:
from scipy import stats
import numpy as np
male_female = np.array([[180, 120],[140, 60]])
x=stats.chi2_contingency(male_female)
print(x)

Output: P-value=0.029
Conclusion: Since the p-value is less than 0.05, our null hypothesis is rejected. Thus, we conclude
that empathy is related to gender.
ANOVA
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups. It is often used to determine whether there are any
statistically significant differences between the means of different groups. ANOVA compares the
variation between group means to the variation within the groups. If the variation between group
means is significantly larger than the variation within groups, it suggests a significant difference
between the means of the groups.
ANOVA calculates an F-statistic by comparing between-group variability to within-group
variability. If the F-statistic exceeds a critical value, it indicates significant differences between
group means. Types of ANOVA include one-way (for comparing means of groups) and two-way
(for examining effects of two independent variables on a dependent variable). To perform the one-
way ANOVA, we can use the f_oneway() function of the SciPy package.
Example
Suppose we want to know whether or not three different exam prep programs lead to different
mean scores on a certain exam. To test this, we recruit 30 students to participate in a study and
split them into three groups. The students in each group are randomly assigned to use one of the
three exam prep programs for the next three weeks to prepare for an exam. At the end of the three
weeks, all of the students take the same exam. The exam scores for each group are given in the
program below.

A Python program to solve the above problem:


from scipy import stats
import numpy as np
sg1=[85,86,88,75,78,94,98,79,71,80]
sg2=[91,92,93,85,87,84,82,88,95,96]
sg3=[79,78,88,94,92,85,83,85,82,81]
r=stats.f_oneway(sg1,sg2,sg3)
print("P-value=",r.pvalue)
Output: P-value=0.113
Since the p-value > 0.05, the null hypothesis is accepted. Thus, we conclude that the three different
exam prep programs lead to similar mean scores on the exam.
Suppose a scientist is interested in how a person's marital status affects weight. There is only
one factor to examine, so the scientist would use a one-way ANOVA. Now assume that another
scientist is interested in how a person's marital status and income affect their weight. In this case,
there are two factors to consider; therefore, a two-way ANOVA will be performed, as sketched below.
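A minimal sketch of a two-way ANOVA using statsmodels (the data is invented for illustration, and statsmodels is assumed to be available):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: weight by marital status and income level
df = pd.DataFrame({
    "marital": ["single", "single", "married", "married"] * 5,
    "income":  ["low", "high", "low", "high"] * 5,
    "weight":  [65, 70, 72, 78, 66, 71, 74, 77, 64, 69,
                73, 79, 67, 72, 71, 76, 68, 73, 75, 80],
})

# Fit a linear model with both factors and show the ANOVA table
model = ols("weight ~ C(marital) + C(income)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))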
