Predictive Analysis Workbook
Individual Coursework
DECLARATION
I declare that:
This is my own unaided work.
Yes ☒
No ☐
Instructions to candidates:
1. Please complete this cover sheet by entering your Candidate Number,
Module Name, Word Count, and Submission Date.
2. You must NOT use your NAME on this cover sheet or on any part of your
coursework
Abstract
This workbook captures the in-class tasks developed over sessions 1 to 4 of the predictive analytics course during the first semester of 2022. It records the exercises, the process followed, and the experience gained from each task undertaken across the four workshops.
Table of Contents
Abstract
Workshop 1
Tasks
Process
Workshop 2
Tasks
Process
Learning Points
Workshop 3
Tasks
Process
Learning Points
Workshop 4
Process
Learning Points
References
Individual Workbook - Predictive Analysis
Workshop 1
The session started by covering the key concepts of predictive analysis, the methods of prediction built into specific predictive analytics technologies (e.g. similarity based, information based, error based and probability based), and an evaluation of the sources and types of data used for predictive analytics.
Tasks
How can predictive analysis be applied in different contexts of an organization?
What are the sources, structure and types of data for predictive analysis technology?
What are the methods of Prediction?
Process
There was a class discussion and in-depth individual research on the concepts of predictive analysis, covering the following topics:
Introduction to predictive analysis and where it can be applied (healthcare, HR, sports, weather, finance, digital marketing, security, etc.)
Data Mining - finding anomalies, correlations and patterns within big data to explain outcomes
Big Data - Every medical procedure, credit application, fraudulent act, post on Facebook, movie recommendation, and purchase of any type is encoded as data and stored. Big Data grows by an estimated 2.5 quintillion bytes per day (Siegel, 2015).
Structured Data: Structured data is the easiest option to work with in predictive analysis. It is very organized, with set parameters defining its dimensions. It is mostly quantitative data such as contact address, age and billing expenses.
Unstructured Data: This is the unorganized part of data. It is mostly qualitative data, and it is difficult to manipulate, search and analyze using a traditional spreadsheet.
Data Mining: Data mining starts with data, which can range from a few simple numeric observations to thousands of variables with millions of complex observations.
Data mining uses advanced computational methods to identify meaningful and relevant structures in the data.
Types of Analytics
Descriptive analytics answers the question "What happened?"
Diagnostic analytics answers the question "Why did this happen?"
Predictive analytics answers the question "What might happen in the future?"
Prescriptive analytics answers the question "What should we do next?"
Methods of Prediction:
Information Based Learning (Machine Learning Algorithms): In this method an algorithm learns to predict an output from given input data. The two main tasks of machine learning algorithms are regression and classification.
These algorithms are categorized as supervised or unsupervised. Supervised learning algorithms use labeled examples that provide both the input data and the desired output, while unsupervised algorithms work with data that are neither labeled nor classified.
Similarity Based Learning: This method identifies how similar two objects are using algorithmic distance functions. Metric and non-metric distance functions and high-speed indexing are the two critical capabilities required for this approach to operate at the scale and speed of machine learning.
Probability Based Learning: This method measures the degree of certainty in an uncertain event. Probability plays an important role in approximating a prediction.
Types of Analytics Approach
Deductive analysis can assist in sorting data into organizational categories such as participant, time and data type.
Inductive analysis can assist in drawing insights from the data, developing findings and identifying the data that support those findings.
Workshop 2
In this session we covered patterns of analysis and their interpretation. Approaches were taken to test interpretations in which relationships between test scores are used to inform differing diagnoses.
Task
A multinational corporation elicits views on the efficacy of communications in the organization from two groups of employees, one in Boston and the other in London. The two groups will likely not give exactly the same responses. Is the difference something to be bothered about, i.e. is it significant enough?
Process
Task
The session started with an explanation of the following basic techniques:
Mean: It is obtained by adding the numbers in a given data set and dividing by the number of items in that set.
Median: The numbers must first be sorted in ascending or descending order; the median is the middle value. When the number of items in the list is even, take the middle pair, add them together and divide by two to find the median.
Mode: It can be applied to any type of data and it is not affected by extreme values in the data set, so it can give useful insights into any data.
Range: It is the difference between the lowest and highest values. When there are very high or low values the range can sometimes be misleading.
Standard Deviation: It measures how far each value is from the mean. A high standard deviation means that values are far from the mean, while a low standard deviation shows that values are clustered close to the mean.
T-Test: Used to determine whether there is a significant difference between the means of two groups, typically for sample sizes of less than 30.
F-Test: Used to determine whether a group of variables is jointly significant.
Variance: It measures how far the numbers in a set are spread out from the mean.
Correlation: Correlations are useful because they reveal what relationship, if any, two variables have.
Note: Even if the number of people in each group is the same, the means of the groups will not necessarily be the same.
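To make these measures concrete, here is a short Python sketch (not part of the workshop; the list of scores is made up for illustration) that computes each one:

# Illustrative only: basic descriptive statistics on a made-up list of scores.
import statistics

scores = [1, 2, 2, 3, 1, 2, 4, 1, 2, 3]

print("mean:", statistics.mean(scores))                 # sum of the values divided by their count
print("median:", statistics.median(scores))             # middle value of the sorted list
print("mode:", statistics.mode(scores))                 # most frequently occurring value
print("range:", max(scores) - min(scores))              # highest value minus lowest value
print("sample variance:", statistics.variance(scores))  # spread around the mean (n - 1 divisor)
print("sample std dev:", statistics.stdev(scores))      # square root of the sample variance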
Exercise working
The means for both groups are already given for this session: the mean for the Boston group is 1.61 while the mean for the London group is 1.68.
The difference between the two means is 1.68 - 1.61 = 0.07.
To see whether there is a significant difference between the means of the two groups we have to compare the spread of each sample around its mean.
We used standard deviation to find the spread for each sample using the formula
below
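The formula image is not reproduced in this workbook; the usual sample standard deviation formula, which is presumably the one used, is:

s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}

where \bar{x} is the sample mean and n is the sample size.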
Here for the first sample the mean is 1.61 and n = 15; for the second sample the mean is 1.68 and n = 15.
The spread (standard deviation) of the first sample is 1.086 while that of the second sample is 1.174.
Now we will use the t-test formula to compare the means and standard deviations and see whether there is a significant difference between the averages of the two samples, using the formula below.
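The formula image is not reproduced here; the standard two-sample t statistic, which is presumably the form used, is:

t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}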
Here x̄1 is the mean of sample 1 = 1.61
s1 is the standard deviation of sample 1 = 1.086
n1 is the number of individuals in sample 1 = 15
x̄2 is the mean of sample 2 = 1.68
s2 is the standard deviation of sample 2 = 1.174
n2 is the number of individuals in sample 2 = 15
In the next step we calculated the degrees of freedom (d.f.) for the two equal-sized samples.
Degrees of freedom refers to the maximum number of logically independent values, that is, values that have the freedom to vary, in the data sample.
To find the degrees of freedom we add the number of individuals in sample 1 and sample 2 and subtract 2:
d.f. = n1 + n2 - 2 = 15 + 15 - 2 = 28
In the next step we find the critical value of t for the relevant number of degrees of freedom.
A t critical value on a t distribution is a "cut-off point".
We will use a confidence level of 95% (0.05 error margin).
The error margin is divided by 2 because we are running a two-tailed test.
We will use the two-tailed t-distribution table to find the critical value of t.
Now, we look up the d.f. on the left-hand side of the t-distribution table and the error margin of 0.025 along the top row, and find the intersection of the row and column.
From the table, the critical value of t is 2.048.
Since the calculated t statistic is smaller in absolute value than the critical value of 2.048, there is no significant difference between the means of the two samples.
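As a cross-check of this calculation, not part of the original workshop, the same comparison can be reproduced in Python from the summary statistics quoted above (assuming SciPy is installed):

# Two-sample t-test recomputed from the summary statistics quoted above.
from scipy import stats

mean1, sd1, n1 = 1.61, 1.086, 15   # Boston sample
mean2, sd2, n2 = 1.68, 1.174, 15   # London sample

t_stat, p_value = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2, equal_var=True)
df = n1 + n2 - 2                                  # 28 degrees of freedom
t_critical = stats.t.ppf(1 - 0.05 / 2, df)        # two-tailed critical value at 95% confidence

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, critical t = {t_critical:.3f}")
# |t| is far below the critical value of about 2.048, so the difference is not significant.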
Learning Points
1. Perfect correlation: Here two variables change in the same proportion; they either increase or decrease together. It can be either positive or negative (Moore and McCabe (1999) Introduction to the Practice of Statistics).
2. Zero correlation: Here there is no relationship between two variables. When the value of one variable changes it has no effect on the other variable.
Workshop 3
This session covered the difference between prediction and forecasting and different sampling and data distributions, and also gave an adequate understanding of scoping analysis and inferential analysis.
Task:
1. Predict the selling price of a house
2. Estimate the chances of getting an investment given the information below
Task 1
Process
First, we looked at the difference between predicting and forecasting.
The only difference between prediction and forecasting is that the latter considers a temporal dimension.
Prediction is focused on estimating the outcomes for data that have not been seen yet.
Forecasting is a sub-discipline of prediction, and it is made on the basis of time-series data.
We had an open discussion on the factors to consider when predicting the price of a house, such as the style of the house, location, age, etc.
To predict the price of the house in the data set we considered the average selling prices of houses in that region and their various sizes.
We used a preliminary predictive model to determine the price of the house in the
data set.
Given that the average selling price of the houses in that region is $20,000 and the average value per square foot is $40, our prediction for the selling price of a 2,000 square foot house is estimated at 20,000 + 40(2,000) = $100,000.
Because other factors such as the style of the house, age and location were not considered in this model, the deterministic model is not suitable, so we will use a probabilistic model.
Allowing for these other factors in the prediction of the selling price of the house is what we call the error margin; if the error margin is small we consider the prediction a good one (Moore and McCabe (1999) Introduction to the Practice of Statistics).
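As an illustrative sketch only (the error spread used below is a made-up figure, not something given in the workshop), the difference between the deterministic model above and a probabilistic variant can be written in Python as:

# Deterministic vs probabilistic price prediction (illustrative figures only).
import numpy as np

def deterministic_price(sq_feet):
    # Preliminary model from the workshop: base price plus value per square foot.
    return 20_000 + 40 * sq_feet

def probabilistic_price(sq_feet, error_sd=15_000, n_draws=10_000):
    # Same model plus a random error term standing in for unmodelled factors
    # such as style, age and location; error_sd is an assumed value.
    rng = np.random.default_rng(0)
    draws = deterministic_price(sq_feet) + rng.normal(0, error_sd, n_draws)
    return draws.mean(), draws.std()

print(deterministic_price(2000))    # 100000: a single point estimate
print(probabilistic_price(2000))    # a distribution summarised by its mean and spread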
To understand conditional probability we treated another task using Bayes' theorem.
Task 2
Process
We applied the information given to the Bayes' theorem formula.
Let A be the event of the VC giving us the investment and B the event that the VC asks for more information.
Our goal is to find P(A|B), i.e. the probability of A given that B is true: if they ask for more information, will they invest?
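The formula image is not reproduced in this workbook; the standard form of Bayes' theorem, which is presumably what was applied, is:

P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}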
So there is a 65.2% chance that if the VC asks for more information we will get the investment.
Key Learning Points
Task 1
A margin of error helps us to know by how many percentage points our result may differ from the real value for the population.
In a probabilistic model randomness plays a significant role in predicting future events.
While a deterministic model produces a single possible outcome for an event, a probabilistic model gives a probability distribution as its solution.
Eyeballing: This is the process of viewing a set of data and making rough statistical estimates without any specific technology.
Sample: When the population is too large for a test to include all observations, a sample is used.
Standard Error: The standard deviation of a sample mean from the actual population mean is the standard error of the mean.
Sampling Error: This occurs when the selected sample does not represent the population of data.
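For reference (this formula is not stated in the session notes), the usual formula for the standard error of the mean is:

SE = \dfrac{s}{\sqrt{n}}

where s is the sample standard deviation and n is the sample size.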
Workshop 4
We covered the practical side of this analysis with the use of Excel on some data sets.
Task 1
Using regression analysis on the data set.
Doing regression in Excel and interpreting the regression analysis output.
Process
Task 1
We covered the definition of regression analysis and its two most common models, simple linear and multiple regression.
Simple linear: The relationship between a dependent variable and a single independent variable is measured with simple linear regression.
Now we will apply regression to a data set in Microsoft Excel and interpret the insights.
Finding Mean (we clicked on the cell for mean and typed the formula)
Finding Median (we clicked on the cell for the median and typed the formula)
Finding Standard Deviation using STDEV.S (there are separate standard deviation functions for a sample and for the population)
Finding Variance for the sample
Finding T-test (after dragging the first array, I added a comma, then dragged the second array, then chose the two-tailed distribution and the paired t-test)
Finding Covariance (the value is a positive number, meaning that the variables are positively linked)
Finding Correlation (the correlation is 0, meaning that the data are not correlated with each other)
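The Excel screenshots are not reproduced in this workbook. As a rough equivalent sketch (assuming Python with NumPy and SciPy, and two made-up columns x and y standing in for the workshop data set), the same quantities can be computed as follows:

# Illustrative Python equivalent of the Excel steps above, on made-up data.
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 5.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 4.0, 2.0, 6.0, 5.0, 7.0])

print("mean:", x.mean())                                  # Excel AVERAGE
print("median:", np.median(x))                            # Excel MEDIAN
print("sample std dev:", x.std(ddof=1))                   # Excel STDEV.S (use ddof=0 for STDEV.P)
print("sample variance:", x.var(ddof=1))                  # Excel VAR.S
t_stat, p_value = stats.ttest_rel(x, y)                   # paired, two-tailed t-test as in the session
print("paired t-test p-value:", p_value)
print("sample covariance:", np.cov(x, y, ddof=1)[0, 1])   # Excel COVARIANCE.S
print("correlation:", np.corrcoef(x, y)[0, 1])            # Excel CORREL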
Learning Points
Task 1
R-squared is a measure of the goodness of fit of the data to the linear regression; it is the explained variation over the total variation, measuring the quality of the fit. The formula to calculate R squared is:
R squared = 1 - (SSR/TSS)
This means R squared = 1 - Variance(residual)/Variance(y)
Note: a high R squared means a good fit.
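As a small sketch (illustrative only; the actual and predicted values below are made up), R squared can be computed directly from this definition in Python:

# Illustrative R-squared calculation from its definition.
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

ssr = np.sum((y_actual - y_predicted) ** 2)       # sum of squared residuals
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - ssr / tss
print(round(r_squared, 3))                        # close to 1, indicating a good fit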
References
Siegel, E. (2015) Predictive Analysis: The Power to Predict Who Will Click. London: Wiley Blackwell. Available through: ULaw Library website https://library.law.ac.uk (Accessed: 23 April 2022).
James Cook University (2019) How data science is used to solve real-world business problems. Available at: https://online.jcu.edu.au/blog/data-science-solves-business-problems (Accessed: 23 April 2022).
Ledolter, J. (2013) Data Mining and Business Analytics with R. New Jersey: Wiley. Available through: ULaw Library website https://library.law.ac.uk (Accessed: 23 April 2022).
Moore, D.S. and McCabe, G.P. (1999) Introduction to the Practice of Statistics. New York: W.H. Freeman.