Predictive Analysis Workbook


FRONT SHEET

Individual Coursework

CANDIDATE NUMBER C337946

MODULE NAME PREDICTIVE ANALYTICS

WORD COUNT 2710

SUBMISSION DATE 02/05/2022

DECLARATION
I declare that:
• This is my own unaided work.
  Yes ☒ No ☐
• The word count stated by me is correct.
  Yes ☒ No ☐
• I’m happy for my work to be retained on the Elite repository and made available to staff and future students**
  Yes ☒ No ☐

*Please note that all the assignments are submitted to Turnitin.


**Please note personal information (such as names) will be
deleted.

Instructions to candidates:
1. Please complete this cover sheet by entering your Candidate Number,
Module Name, Word Count, and Submission Date.
2. You must NOT use your NAME on this cover sheet or on any part of your
coursework

Abstract
This workbook captures the in-class tasks developed over sessions 1 to 4 of the
predictive analytics course in the first semester of 2022. It records the exercises,
the process and the experiences from each task undertaken during the 6
workshops.

Table of Contents
Abstract
Workshop 1
  Tasks
  Process
  Key Learning Points
Workshop 2
  Tasks
  Process
  Learning Points
Workshop 3
  Tasks
  Process
  Learning Points
Workshop 4
  Process
  Learning Points
References

Individual Workbook - Predictive Analysis
Workshop 1
The session started by covering the key concepts of predictive analysis, the methods of
prediction built into specific predictive analytics technologies (e.g. similarity based,
information based, error based and probability based), and an evaluation of the sources
and types of data used for predictive analytics.
Tasks
How can predictive analysis be applied in different contexts of an organization?
What are the sources, structure and types of data for predictive analysis technology?
What are the methods of prediction?
Process
There was a class discussion and in-depth individual research on the concepts of
predictive analysis, shedding more light on the following topics:
Introduction to predictive analysis and where it can be applied (healthcare, HR,
sports, weather, finance, digital marketing, security, etc.)

Big Data (structured and unstructured)

Data mining: finding anomalies, correlations and patterns within big data to explain
outcomes

Types of analysis (descriptive, diagnostic, predictive and prescriptive)

Methods of prediction (similarity based, information based, error based and
probability based learning)

Key Learning Points


I gained a better understanding of predictive analysis as a trend that has a strong
impact on decision-making, driving decisions through experience guided by data (Siegel, Eric
(2015) Predictive Analysis: the power to predict who will click).

Industries and Their Application of Predictive Analysis


Over one third of predictive analysis applications are used for marketing (Rexer
(2015) Rexer Analytics White Paper).
There it is used for profiling customers and for managing customer lifetime value, customer
churn and customer acquisition.
Another third is applied in the banking and insurance industries, where it helps
with fraud detection and risk analysis.
Lastly, the remaining third of predictive analytics technology is applied in medicine,
academia, manufacturing, information technology, pharmaceuticals and government
(Kotu, Vijay (2018)).

Big Data: Every medical procedure, credit application, fraudulent act, post on Facebook,
movie recommendation and purchase of any type is encoded as data and stored. Big data
grows by an estimated 2.5 quintillion bytes per day (Siegel, Eric (2015) Predictive
Analysis: the power to predict who will click).

Structured Data: Structured data is the easiest option to work with in predictive analysis. It
is highly organized, with set parameters defining its dimensions. It is mostly quantitative
data such as contact address, age and billing expenses.

Unstructured Data: This is the unorganized part of data. It is mostly qualitative, and it is
difficult to manipulate, search and analyze using a traditional spreadsheet.

Data Mining: Data mining starts with data, which can range from a few simple numeric
observations to thousands of variables with millions of complex matrix observations.
Data mining uses advanced computational methods to identify
meaningful and relevant structures in the data.

Types of Analytics
Descriptive: answers the question “What happened?”
Diagnostic: answers the question “Why did this happen?”
Predictive: answers the question “What might happen in the future?”
Prescriptive: answers the question “What should we do next?”

AI and Machine Learning

Artificial intelligence is a technology that allows a machine to imitate and learn human
behavior. Machine learning is a subset of AI that enables a machine to
learn automatically from historical data without being explicitly programmed.

Methods of Prediction:
Information Based Learning (Machine Learning Algorithms): In this method the AI performs
its task by predicting an output from given input data. The two main tasks of machine
learning algorithms are regression and classification.
Algorithms are categorized as supervised or unsupervised. Supervised learning algorithms
use labeled data that provides both the input and the desired output, while
unsupervised algorithms work with data that is neither labeled nor classified.

Similarity Based Learning: This method identifies the similarity of two objects based on
algorithmic distance functions. Metric and non-metric distance functions and high-speed
indexing are the two critical capabilities required for this algorithm to operate at the
scale and speed of machine learning.
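
As a minimal sketch of similarity-based prediction, the Python below gives a new object the label of its nearest stored example under Euclidean distance. The data, labels and function names are invented for illustration, and no indexing is attempted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(query, examples):
    """Return the label of the stored example closest to the query."""
    closest = min(examples, key=lambda item: euclidean(query, item[0]))
    return closest[1]

# Hypothetical customers described by (age, monthly spend), with invented labels.
examples = [((25, 40), "stays"), ((52, 10), "churns"), ((31, 55), "stays")]
print(nearest_label((50, 15), examples))  # churns
```

A different distance function can be swapped in for `euclidean` without changing `nearest_label`.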

Probability Based Learning: This method measures the degree of certainty in an
uncertain event. Probability plays an important role in approximating the predictive analysis.

Error Based Learning: In this type of learning, measuring error is crucial for calculating
how well a linear regression fits a set of data, so an error function is significant in
explaining a regression model (Ajila, Lung (2021) Analysis of
error-based machine learning algorithms in network anomaly detection and categorization).

Types of Analytics Approach
Deductive analysis can assist in sorting data into organizational categories such as
participant, time and data type.

Inductive analysis can assist in gaining insights from data, developing findings and
discovering data that supports those findings.

Workshop 2
This session covered analysis patterns and interpretation. It looked at approaches
to test interpretation in which relationships between test scores are used to
inform differing diagnoses.
Task
A multinational corporation elicits views on the efficacy of communications in the
organization from two groups of employees, one in Boston and the other in London.
The views of the two groups are (obviously/likely) not exactly the same. Is the
difference something to be bothered about, i.e. is it significant enough?
Process
Task
We started with an explanation of the following basic techniques:
Mean: Obtained by adding the numbers in a given data set and dividing the sum by the
number of items in that set.
Median: The numbers must first be sorted in ascending or descending order. The median
is the middle value; when the list contains an even number of values, find the
middle pair, add them together and divide by two.

Mode: The most frequently occurring value. It can be applied to any type of data and
is not affected by extreme values in the dataset, so it can give useful insights into any data.

Range: The difference between the lowest and highest values. When there are very
high or very low values, the range can be misleading.

Standard Deviation: It measures how far values typically are from the mean. A high
standard deviation means that values are far from the mean, and
a low standard deviation shows that values are clustered close to the
mean.
T-Test: Used to determine whether there is a significant difference between the means
of two groups when the sample sizes are less than 30.

Z-Test: Used to determine whether there is a significant difference between the means
of two groups when their sample sizes are greater than 30.

F-Test: Used to determine whether a group of variables is jointly significant.

Variance: Measures how far each number in a set is from the mean.

Covariance: Measures the relationship between two random variables.

Correlation: Correlations are useful because they show what relationship
variables have.

Note: Even if the number of people in each group is the same, the means of the groups
need not be the same.
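
The basic measures defined above can be sketched with Python's standard `statistics` module. The scores are invented for illustration, and `describe` is an illustrative helper name rather than anything used in the workshop.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "stdev": statistics.stdev(ordered),        # sample standard deviation (n - 1)
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
    }

# Hypothetical survey scores, invented for illustration only.
scores = [1, 2, 2, 3, 4, 2, 1, 5]
summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The standard deviation and variance here use the sample (n - 1) form; `statistics.pstdev` and `statistics.pvariance` give the population versions.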

Exercise working
The means for both groups were already given for this session: the mean for the Boston
group is 1.61, while the mean for London is 1.68.
Then find the difference between the two means: 1.68 - 1.61 = 0.07
To see whether there is a significant difference between the two group means, we
have to compare the spread of each sample with the mean.
We used the standard deviation to find the spread of each sample, using the formula
below
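
Assuming the usual sample standard deviation is the formula meant here, it can be written as:

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```

where x_i are the individual values, \bar{x} is the sample mean and n is the number of individuals in the sample.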

Here, for the first sample, the mean is 1.61 and n = 15; for the second sample, the
mean is 1.68 and n = 15.

The spread for the first sample is 1.086, while the spread for the second sample is 1.174.
Now we will use the t-test formula to compare the means and the standard deviations,
to see whether there is a significant difference between the averages of the two samples,
using the formula below.
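
Assuming the unpooled two-sample form, which uses only the means, standard deviations and sample sizes of the two groups, the t statistic can be written as:

```latex
t = \frac{\left| \bar{x}_1 - \bar{x}_2 \right|}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

With equal sample sizes, as in this exercise, the pooled-variance form gives the same value.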

Here x1 is the mean of sample 1 = 1.61
s1 is the standard deviation of sample 1 = 1.086
n1 is the number of individuals in sample 1 = 15
x2 is the mean of sample 2 = 1.68
s2 is the standard deviation of sample 2 = 1.174
n2 is the number of individuals in sample 2 = 15

The value of the t-test is 0.43

In the next step we calculated the degrees of freedom (d.f.) for the two samples.
Degrees of freedom refers to the maximum number of logically independent values,
that is, values that have the freedom to vary, in the data sample.
To find the degrees of freedom we add the number of individuals in sample 1 and
sample 2 and subtract 2:
d.f. = n1 + n2 - 2 = 15 + 15 - 2 = 28

In the next step we find the critical value of t for the relevant number of degrees
of freedom.
A critical value of t on a t distribution is a "cut-off point".
We will use a confidence level of 95% (0.05 error margin).
The error margin is divided by 2 because we are running a two-tailed test.
We will use the t-distribution table to find the critical value of t for the two-tailed test.
Now we look up the d.f. on the left-hand side of the t-distribution table and the
error margin of 0.025 along the top row, and find the intersection of the row and column.

From the table, the critical value of t is 2.048.
Since our calculated t value of 0.43 is below the critical value of 2.048, there is no
significant difference between the means of the two samples.
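
The decision rule can be sketched in Python from the summary figures quoted in the exercise. The unpooled formula is an assumption, since the class formula is not shown here, so the statistic printed need not match the value reported above; `two_sample_t` is an illustrative name.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return abs(mean1 - mean2) / standard_error

# Summary figures quoted in the exercise (Boston, then London).
t_value = two_sample_t(1.61, 1.086, 15, 1.68, 1.174, 15)
degrees_of_freedom = 15 + 15 - 2

# Critical value read from the t table for d.f. = 28 at the 95% level (two-tailed).
T_CRITICAL = 2.048
significant = t_value > T_CRITICAL

print(f"t = {t_value:.3f}, d.f. = {degrees_of_freedom}, significant: {significant}")
```

Under this assumption the statistic is still well below the critical value, so the conclusion of no significant difference is unchanged.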

Key learning point

The coefficient of correlation is used to measure the degree of intensity of the
relationship between two variables. The types of correlation are as follows:

1. Perfect correlation: Here two variables change in the same proportion; they
either increase or decrease together. It can be either positive or negative (Moore, McCabe (1999)
Introduction to the Practice of Statistics).

2. Zero correlation: Here there is no relationship between the two variables. This means
that when the value of one variable changes, it has no effect on the other variable.

Workshop 3
This session covered the difference between prediction and forecasting, the different
sampling distributions and data distributions, and also gave an adequate
understanding of scoping analysis and inferential analysis.
Task:
1. Predict the selling price of a house
2. Estimate the chances of getting an investment given the information below

Task 1
Process
First, we looked at the difference between prediction and forecasting.
The only difference between prediction and forecasting is that the latter considers a
temporal dimension.
Prediction focuses on estimating outcomes for data that has not yet been seen.
Forecasting is a sub-discipline of prediction, and it is made on the basis of
time-series data.
We had an open discussion on factors to consider when predicting the price of a
house: the style of the house, location, age, etc.

To predict the price of the house in the data set, we considered the average selling
prices of houses in that region and their various sizes.
We used a preliminary predictive model to determine the price of the house in the
data set.
The average selling price of houses in that region is given as $20,000 and the
average value per square foot as $40.
Our prediction for the selling price of a house of 2,000 square feet is therefore
20,000 + 40(2000) = $100,000
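
The same preliminary model can be written as a short Python function. The two constants are the figures quoted above; the names `BASE_PRICE`, `PRICE_PER_SQUARE_FOOT` and `predict_price` are illustrative only.

```python
BASE_PRICE = 20_000          # dollars, figure quoted in the workshop
PRICE_PER_SQUARE_FOOT = 40   # dollars, figure quoted in the workshop

def predict_price(square_feet):
    """Deterministic estimate: base price plus a fixed rate per square foot."""
    return BASE_PRICE + PRICE_PER_SQUARE_FOOT * square_feet

print(predict_price(2000))  # 100000
```

Because the model is deterministic, the same size always yields the same price.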

Because other factors such as the style of the house, its age and its location were not
considered in this model, the deterministic model is not suitable, so
we will use a probabilistic model.
The effect of the other factors on the predicted selling price is captured in
what we call the error margin. If the error margin is small, we consider the prediction
a good one (Moore, McCabe (1999) Introduction to the Practice of Statistics).
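
One common way to write such a probabilistic model, assumed here rather than taken from the workshop material, adds a random error term to the deterministic part:

```latex
y = 20000 + 40x + \varepsilon
```

where y is the selling price in dollars, x is the size in square feet and \varepsilon is a random error with mean zero that stands for the factors left out of the model.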

To understand conditional probability, we worked through another task using Bayes'
theorem.

Task 2

Process
We applied the information given to the Bayes' theorem formula.

Let A be the event of the VC giving us the investment and B the event that the VC
asks for more information.
Our goal is to find P(A|B), i.e. the probability of A given that B is true: if they ask
for more information, will they invest?
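
For reference, Bayes' theorem in its standard form, with A and B as defined above, is:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```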

So there is a 65.2% chance that, if the VC asks for more information, we will get
the investment.
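
The calculation can be sketched in Python. The probabilities below are hypothetical placeholders, not the figures given in the workshop, so the result differs from the 65.2% above; `posterior` is an illustrative name.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("P(B) must be in (0, 1]")
    return p_b_given_a * p_a / p_b

# Hypothetical inputs, invented for illustration only.
p_invest = 0.3             # P(A): the VC invests
p_ask_given_invest = 0.8   # P(B|A): the VC asks for more information, given that they invest
p_ask = 0.5                # P(B): the VC asks for more information

p_invest_given_ask = posterior(p_ask_given_invest, p_invest, p_ask)
print(f"P(A|B) = {p_invest_given_ask:.2f}")
```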

Key Learning Points

Task 1

A margin of error helps us to know by how many percentage points our result will
differ from the real value for the population.
In a probabilistic model, randomness plays a significant role in predicting future
events.
While a deterministic model produces a single possible outcome for an event, a
probabilistic model gives a probability distribution as its solution.

Bayes' theorem is used to revise existing predictions.

Conditional probability is the probability of an event happening based on the occurrence
of a previous outcome.

Eyeballing: This is the process of viewing a set of data and making statistical
estimates without any specific technology.

Sample: When a population is too large for a test to include all observations, a
sample is used.

Standard Error: This is the deviation of a sample mean from the actual mean. This
deviation is the standard error of the mean.

Sampling Error: This occurs when the sample selected does not represent the population of
data.

Discrete Data: This type of quantitative data depends on counts.

Continuous Data: Observations of this type are measured on a scale.

Symmetrical Data: The frequency of the values of the variables appears regularly.

Asymmetrical Data: The frequency of the values of the variables appears irregularly.

Workshop 4

In this session we covered quantitative analysis and its different types.

We covered the practical side of this analysis using Excel on some data sets.

Task 1.
Using regression analysis on the data set.
Doing regression in Excel and interpreting the regression analysis output.

Process
Task 1
We covered the definition of regression analysis and its two most common models, linear and
multiple, and their types.
Simple linear: The relationship between a dependent variable and an independent variable is
measured with simple linear regression.

Now we will apply these techniques to a data set in Microsoft Excel, running the analysis and
interpreting the insights.

Finding the mean (we clicked on the cell for the mean and typed the formula)

Finding the median (we clicked on the cell for the median and typed the formula)

Finding the standard deviation (there is a standard deviation for a sample and one for a population)

Finding the variance for a sample

Finding the variance for a population

Finding the t-test (after dragging the first array, I added a comma, then dragged the second
array, then chose the two-tailed distribution and the paired t-test)

Finding the covariance (the value is a positive number, meaning that the variables are positively
linked)

Finding the correlation (the correlation is 0, meaning that the data are not correlated with each other)
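
For comparison, the same quantities can be sketched in Python. The two arrays are invented for illustration and are not the workshop data set; the helper names are illustrative. Note that Excel's T.TEST function reports a probability rather than the t statistic, so the paired t value below is not directly comparable with the Excel cell.

```python
import math
import statistics

def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

def paired_t(xs, ys):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical paired measurements, invented for illustration only.
before = [2, 4, 6, 8, 10]
after = [1, 3, 5, 9, 10]

print("sample variance:", statistics.variance(before))
print("population variance:", statistics.pvariance(before))
print("sample std dev:", statistics.stdev(before))
print("population std dev:", statistics.pstdev(before))
print("covariance:", covariance(before, after))
print("correlation:", correlation(before, after))
print("paired t:", paired_t(before, after))
```

The sample versions divide by n - 1 and the population versions by n, which is why the two variances differ for the same data.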

Learning Points
Task 1
R-squared measures the goodness of fit of the data to the linear regression: it is the
explained variation over the total variation, measuring the quality of fit. The formula to
calculate R squared is:
R squared = 1 - (SSR/TSS)
where SSR is the sum of squared residuals and TSS is the total sum of squares.
This means R squared = 1 - Variance(residual)/Variance(y)
Note: a high R squared means a good fit.
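
A minimal Python sketch of this formula, fitting a simple linear regression by least squares and then computing R squared from the residuals. The data and the names `fit_line` and `r_squared` are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """R squared = 1 - SSR / TSS for the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ssr / tss

# Hypothetical data, invented for illustration only.
sizes = [1, 2, 3, 4, 5]
prices = [2, 4, 5, 4, 5]
print(f"R squared = {r_squared(sizes, prices):.2f}")
```

Data that lie exactly on a straight line leave no residual variation, so R squared is 1 in that case.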

Reference
Siegel, Eric (2015) Predictive Analysis: the power to predict who will click. London:
Wiley-Blackwell. Available through: Ulaw Library website https://library.law.ac.uk
(Accessed: 23 April 2022).

James Cook University (2019) How data science is used to solve real-world business problems.
Available at: https://online.jcu.edu.au/blog/data-science-solves-business-problems
(Accessed: 23 April 2022).

Kotu, Vijay (2018) Data Science. Amsterdam: Morgan Kaufmann. Available
through: https://www.oreilly.com/library/view/data-science-2nd/9780128147627/
(Accessed: 26 April 2022).

Rexer (2015) Rexer Analytics White Paper. Available at: https://www2.cs.uh.edu/~ceick/UDM
Rexer2015 (Accessed: 27 2022).

Johannes Ledolter (2013) Data Mining and Business Analytics with R: Packt
Publishing. New Jersey: Wiley. Available through: Ulaw Library website
https://library.law.ac.uk (Accessed: 23 April 2022).

Ajila, Lung (2021) Analysis of error-based machine learning algorithms in network anomaly
detection and categorization. Available through:
https://www.researchgate.net/publication/350498866_Analysis_of_error-based_machine_learning_algorithms_in_network_anomaly_detection_and_categorization
(Accessed: 23 April 2022).

Moore, McCabe (1999) Introduction to the Practice of Statistics. New York: Freeman
Publishing. London. Available through:
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/margin-of-error
(Accessed: 23 April 2022).

Jafferjee, A. (2020) Building a Modern Analytics Stack. [Online]
Available at: https://towardsdatascience.com/building-a-modern-analytics-stack-966b0525dbc5
(Accessed: 23 2022).
