
INTRODUCTION TO DATA ANALYTICS AND

APPLICATION (UCS551)

GROUP PROJECT (40%)

Name Student ID

Noor Alyani Binti Ahmad Zamani 2022908435

Tuan Nurasyiqin Binti Tuan Aznan 2022972737

Nurin Izzdhinie Binti Nasrudin 2022793445

Muhammad Iman Hafiz bin Mohd Ilham 2022912585

Muhammad Naufal Bin Md.Din 2022923697

PREPARED FOR: NIK MOHAMMAD WAFIY BIN AZMI


TABLE OF CONTENTS

1.0 INTRODUCTION
2.0 CASE STUDY BACKGROUND
3.0 PROBLEM STATEMENT
4.0 OBJECTIVES
5.0 METHODOLOGY
6.0 DATASET AND ANALYTIC APPLICATION
7.0 MODELING
    a. Decision Tree Model Building
    b. Random Forest Model Building
    c. KNN Model Building
    d. SVM Model Building
8.0 RESULT AND DISCUSSION
    a) Process (RapidMiner)
    b) Result
    c) Compare two techniques for producing the result
    d) Graph
9.0 CONCLUSION
10.0 REFERENCES

1.0 INTRODUCTION (NOOR ALYANI)

RapidMiner is data processing software developed by the RapidMiner company and is typically used as a data science platform, providing an integrated environment for data preparation, deep learning, text mining, predictive analysis, and machine learning, with Java as its programming language. Although its initial release dates back to 2001, the software was properly introduced in 2006 after considerable refinement by the company, and its most recent stable release at the time of writing dates from March 2020. Owing to its strengths in data preparation, the software is employed by both business and commercial users.

RapidMiner is commonly used as a platform for data exploration, cleansing, and blending. Its support for automating data mining and machine learning tasks, including data pre-processing, visualization, and predictive analysis, has also proven advantageous for commercial businesses seeking a more systematic approach. The major goal of this project is to apply RapidMiner to real-world data analytics. Heart disease data was chosen for analysis. Heart disease refers to a group of disorders that affect the heart: blood vessel illnesses such as coronary artery disease, heart rhythm issues (arrhythmias), and congenital heart defects, among others, all fall under its umbrella. The goal is to forecast which individuals suffer from heart disease using the variables in the data. We will determine the minimum, maximum, average, and standard deviation of the attributes using RapidMiner.

Charts of a few attributes, such as age, blood pressure, and heart rate, will then be interpreted. The data will go through processing to reach its intended, usable form, based on a preset series of procedures that can be performed automatically or manually. Once processed, the data will be converted into a more understandable format, such as a graph, text, image, table, or any other format needed. The data we used came from secondary sources that were made freely available for academics to use in their own analyses; we chose a heart disease prediction dataset (described further in Section 6.0).

2.0 CASE STUDY BACKGROUND (NOOR ALYANI)

Heart disease is characterized by a blockage in the coronary arteries, the blood vessels that transport blood and oxygen to the heart. The formation of fatty material and plaque within the coronary arteries causes coronary heart disease. Heart disorders, as lethal as they sound, are difficult to detect in their early stages.

Researchers are constantly attempting to identify better detection techniques that could identify a person's heart disease in time, as current techniques fall short on accuracy and computational time when detecting this disease early. Furthermore, when professional health experts and advanced technology are not readily available, it can be extremely difficult to detect any symptoms of heart disease until the person begins experiencing chest pain or complains of breathing difficulties, at which point it may be extremely difficult to cure the disease and save that person's life. A tool that detects such a life-threatening sickness early on is therefore critical for saving lives and assisting clinicians in detecting this disease in its early stages.

The purpose of this study is to create a model that will help doctors forecast whether a person has heart disease before the disease progresses and makes the situation tough to handle. It is centered on four major machine learning techniques: the Decision Tree, K-Nearest Neighbors, Random Forest, and Support Vector classifiers. A dataset containing heart disease-related factors will be preprocessed, evaluated, and used to train each of the four algorithms listed above, whose accuracy scores in predicting whether an individual has heart disease will then be compared.

3.0 PROBLEM STATEMENT (NOOR ALYANI)

In this project, we look at a dataset that contains a variety of health parameters from heart patients, such as age, blood pressure, and heart rate. Our goal is to create a predictive model that can reliably identify people with heart disease. Given the serious consequences of missing a positive diagnosis, our primary focus is on ensuring that the model recognises all prospective patients, so recall for the positive class is an important metric. Heart diseases, also known as cardiovascular diseases, are not easily detected until a person experiences signs or symptoms of a heart attack (such as chest pain, upper back/neck pain, heartburn, nausea, or vomiting), heart failure (such as difficulty breathing, fatigue, or swelling of the feet), or an arrhythmia (such as palpitations). High blood pressure, high cholesterol, and smoking are the leading causes of this dangerous disease, and, shockingly, nearly half of the population (47%) in the United States has at least one of these three conditions. According to 2020 statistics, roughly 697,000 people in the United States died of heart disease; this means heart disease accounted for roughly one in every five deaths in the country in 2020. Unhealthy food, overweight/obesity, diabetes, excessive alcohol intake, and other variables all contribute to the development of heart disease.

According to the American Heart Association's Heart Disease and Stroke Statistics - 2023 Update, cardiovascular disease was the top cause of death in the United States in 2020, with a total of 928,741 deaths reported. Coronary heart disease was the leading contributor to cardiovascular deaths in the United States in 2020, at 41.2%, followed by stroke, which contributed 17.3% of total deaths due to this disease. High blood pressure accounted for 12.9% of these deaths, heart failure for 9.2%, and diseases of the arteries for 2.6%. Based on data collected in 2020, the age-adjusted death rate from heart disease in the United States was 224.4 per 100,000 people, not far behind the global rate of 239.8 per 100,000.

Moving on to the financial aspect of cardiovascular disease, statistics show that direct spending to combat this disease in 2018 and 2019 combined was $251.4 billion, while lost productivity and mortality cost $155.9 billion, bringing the total cost of battling cardiovascular disease to a whopping $407.3 billion for those two years combined. Data science and machine learning have been two of the primary driving forces behind technological innovation in the medical and healthcare sectors. The majority of companies in this sector are increasingly adopting and integrating data science and machine learning into their systems, increasing their chances of discovering patterns for different diseases and providing patients with better feedback on diseases they may develop in the future based on their medical history and data.

Adopting machine learning techniques, according to the Neptune Blog, allows medical organizations to "find patterns, extract knowledge from data, and tackle a diverse set of computationally hard tasks." Data scientists can use machine learning tools to establish the association between "various attributes and features of the patients with the labelled disease." This, in turn, allows clinicians to better understand illness patterns and provide better, preventive therapies for patients. When working with vast amounts of data, it is also critical to examine the risks that surround it.

Hacking into a system and obtaining the data of millions of patients can harm an organization's integrity and reputation. To avoid such problems, developers must build their models in a language whose capabilities provide protection against outside threats. According to a Stack Overflow Developer Survey conducted in 2017, cited in Belitsoft's article "Python in Healthcare", Python is one of the top five popular languages used for designing healthcare systems.

4.0 OBJECTIVES (NURIN IZZDHINIE)

The primary goal of this study is to explore the dataset. Attributes such as age, sex, fasting blood sugar level, cholesterol, resting heart rate, and blood pressure will be interpreted further.

1. To uncover the patterns of heart disease and age.
2. To see the relation of heart disease with cholesterol level.
3. To determine the total number of heart disease patients by gender.
4. To predict the chance of getting heart disease based on resting blood pressure.

5.0 METHODOLOGY (NURIN IZZDHINIE)

CRISP-DM is the acronym for CRoss Industry Standard Process for Data Mining. According to Hotz (2023), CRISP-DM "is a process model that serves as the base for a data science process." He explains it further through its six phases:

I. Business understanding – What does the business need?
II. Data understanding – What data do we have / need? Is it clean?
III. Data preparation – How do we organize the data for modeling?
IV. Modeling – What modeling techniques should we apply?
V. Evaluation – Which model best meets the business objectives?
VI. Deployment – How do stakeholders access the results?

I. Business understanding – What does the business need?

As part of the healthcare industry, specifically the heart disease department, we are concerned with the early detection of heart disease: the earlier the detection, the better the chances of providing solutions and ways of curing this number one killer. In this project our goal is to create a predictive model that correctly identifies people who have heart disease. We can do so by analyzing the data that has already been gathered and organized in the Excel workbook. Our main focus is on making sure the model detects all possible patients; given the serious consequences of missing a positive diagnosis, recall for the positive class is an important statistic.

II. Data understanding – What data do we have?

Dataset description

Variable   Description
age        Age of the patient in years
sex        Gender of the patient (0 = male, 1 = female)
cp         Chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
trestbps   Resting blood pressure in mmHg
chol       Serum cholesterol in mg/dl
fbs        Fasting blood sugar above 120 mg/dl (1 = true, 0 = false)
restecg    Resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy)
thalach    Maximum heart rate achieved during a stress test
exang      Exercise-induced angina (1 = yes, 0 = no)
oldpeak    ST depression induced by exercise relative to rest
slope      Slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
ca         Number of major vessels (0-4) colored by fluoroscopy
thal       Thallium stress test result (0 = normal, 1 = fixed defect, 2 = reversible defect, 3 = not described)
target     Heart disease status (0 = no disease, 1 = presence of disease)

III. Data preparation – How do we organize the data for modeling?

Data preparation is a technique used to improve data quality before mining in order to obtain quality mining results. The steps involved are cleaning, integration, transformation, and reduction.

All features in the dataset appear to be relevant based on our Exploratory Data Analysis (EDA), and no columns seem redundant or irrelevant. We therefore retain all features, ensuring no valuable information is lost, especially given the dataset's small size. Moreover, earlier inspection showed that there are no missing values in the dataset. This is ideal, as it means we do not have to make decisions about imputation or removal, which can introduce bias or further shrink our already limited dataset.
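Outside RapidMiner, these preparation checks can be sketched in Python with pandas. This is only a minimal illustration, assuming the dataset has been saved locally as "Heart Disease.csv":

```python
import pandas as pd

df = pd.read_csv("Heart Disease.csv")  # assumed local path to the dataset

print(df.shape)            # expect (303, 14): 303 patients, 14 attributes
print(df.isnull().sum())   # missing values per column (all zeros for this dataset)
print(df.describe())       # min, max, mean, and standard deviation per attribute
```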

IV. Modeling – What modeling techniques should we apply?

Figure 1 shows the Recall for Positive Class across Models

The SVM model demonstrates a commendable capability in recognizing potential heart patients. With a recall of 0.97 for class 1, almost all patients with heart disease are correctly identified, which is of paramount importance in a medical setting. At the same time, the model's balanced performance ensures that, while aiming for high recall, it does not compromise on precision, thereby not overburdening the system with unnecessary alerts.

V. Evaluation – Which model best meets the business objectives?

In the Evaluation phase, three tasks need to be carried out according to Hotz (2023): (i) assess the outcome: do the models satisfy the requirements for business success, and which one(s) should be approved for the business?; (ii) review the process: examine the completed work, check whether anything was missed and whether all steps were carried out correctly, then compile the findings and make any necessary corrections; and (iii) determine the next steps: based on the results of the previous tasks, decide whether to move on to deployment, continue iterating, or start new projects.

VI. Deployment – How do stakeholders access the results?


This is where data mining finally pays off. The results can be delivered through a deployment plan, monitoring and maintenance, and a final report. All of these also help the stakeholders better understand the results obtained from analyzing the heart disease data.

6.0 DATASET AND ANALYTIC APPLICATION (TUAN NURASYIQIN)

Kaggle is one of the world's largest communities of data scientists and machine learning practitioners, and its platform hosts thousands of datasets spanning a wide range of topics and industries. We use Kaggle's Heart Disease Prediction dataset, which contains a variety of health metrics from heart patients, including age, blood pressure, and heart rate. Our goal is to create a predictive model that can accurately identify individuals with heart disease. Given the serious consequences of missing a positive diagnosis, our primary goal is to ensure that the model identifies all potential patients, so recall for the positive class is an important metric.

The dataset contains 303 entries, indexed 0 to 302, and 14 columns corresponding to patient attributes and test results: the patient's age in years; gender; chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic); resting blood pressure in millimeters of mercury; serum cholesterol (mg/dl); fasting blood sugar, classified as above 120 mg/dl (1 = true, 0 = false); resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy); the maximum heart rate achieved during a stress test; exercise-induced angina (1 = yes, 0 = no); and oldpeak, the ST depression induced by exercise relative to rest. Based on the data types, nine columns (sex, cp, fbs, restecg, exang, slope, ca, thal, and target) are numerical in type but categorical in semantics.
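As a hedged illustration of this last point, the semantically categorical columns can be cast explicitly before any analysis (the file name is an assumption):

```python
import pandas as pd

df = pd.read_csv("Heart Disease.csv")  # assumed local path to the Kaggle file

# Nine columns are stored as integers but carry categorical meaning,
# so we cast them before any modeling or visualization.
categorical_cols = ["sex", "cp", "fbs", "restecg", "exang",
                    "slope", "ca", "thal", "target"]
df[categorical_cols] = df[categorical_cols].astype("category")

print(df.dtypes)  # numeric columns stay numeric; the nine above become category
```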

7.0 MODELING (TUAN NURASYIQIN)
We use four different models: Decision Tree, Random Forest, KNN, and SVM.

a. Decision Tree Model Building


Decision Trees (DTs) are a non-parametric supervised learning technique for classification and regression. The goal is to build a model that predicts the value of a target variable using simple decision rules derived from the data features; a tree can be considered a piecewise constant approximation. The steps are to define the base DT model, configure the hyperparameter grid, use the tune_clf_hyperparameters function to determine the optimal hyperparameters for the DT model, and evaluate the DT model's performance on both the training and test datasets. Because the metric values for the training and test datasets are nearly identical, the model does not appear to be overfitting.
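The report names the tune_clf_hyperparameters helper without showing its code. A minimal scikit-learn stand-in might look like the sketch below, where the grid values, file path, and split settings are illustrative assumptions rather than the report's exact choices:

```python
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("Heart Disease.csv")  # assumed local path
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

def tune_clf_hyperparameters(clf, param_grid, X_train, y_train):
    """Assumed stand-in for the helper named in the report: 5-fold grid
    search optimizing recall, returning the refit best estimator."""
    search = GridSearchCV(clf, param_grid, cv=5, scoring="recall")
    search.fit(X_train, y_train)
    return search.best_estimator_

dt_grid = {"max_depth": [3, 5, 7, None], "min_samples_split": [2, 5, 10]}
best_dt = tune_clf_hyperparameters(DecisionTreeClassifier(random_state=42),
                                   dt_grid, X_train, y_train)

# Comparing train and test reports is how over/underfitting is judged above.
print(classification_report(y_train, best_dt.predict(X_train)))
print(classification_report(y_test, best_dt.predict(X_test)))
```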

b. Random Forest Model Building


The random forest algorithm extends the bagging method by combining bagging with feature randomness to generate an uncorrelated forest of decision trees. The first step is to define the base RF model. Next, we set up the hyperparameter grid and use the tune_clf_hyperparameters function to pinpoint the optimal hyperparameters for our RF model, then evaluate the model's performance on both the training and test datasets. The RF model's similar performance on both sets indicates that it is not overfitting.
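Continuing the sketch above (same data split and assumed helper), the RF step only swaps the estimator and grid; the grid values below are again illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid; the report does not list the exact values searched.
rf_grid = {"n_estimators": [100, 300],
           "max_depth": [5, 10, None],
           "min_samples_leaf": [1, 3]}
best_rf = tune_clf_hyperparameters(RandomForestClassifier(random_state=42),
                                   rf_grid, X_train, y_train)
```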

c. KNN Model Building


K-Nearest Neighbors (KNN) is a simple algorithm that stores all available cases and classifies new data or cases using a similarity measure. It is primarily used to classify a data point according to how its neighbors are classified. In KNN, the parameter 'k' refers to the number of nearest neighbors included in the majority vote, and neighbors are determined by feature similarity. Choosing the right value of k, a process known as parameter tuning, is critical for improving accuracy. The steps are to define the base KNN model and a pipeline with scaling, set up the hyperparameter grid and use the tune_clf_hyperparameters function to find the best hyperparameters for the KNN pipeline, and finally evaluate the model's performance on both the training and test datasets. The KNN model's consistent scores across the training and test sets suggest that there is no overfitting.
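Again continuing the same sketch, a scaled KNN pipeline with an assumed grid of odd k values might read:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# KNN is distance-based, so features are scaled inside the pipeline;
# searching odd k values avoids ties in the majority vote.
knn_pipe = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
knn_grid = {"knn__n_neighbors": list(range(1, 21, 2))}
best_knn = tune_clf_hyperparameters(knn_pipe, knn_grid, X_train, y_train)
```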

d. SVM Model Building

The Support Vector Machine (SVM) algorithm is a simple but powerful supervised machine learning algorithm that can be used to build both regression and classification models. It works well with linearly and non-linearly separable datasets, and despite the limited data it performs admirably. The steps are to define the base SVM model and configure the pipeline with scaling, set up the hyperparameter grid and use the tune_clf_hyperparameters function to find the best hyperparameters for our SVM pipeline, and lastly evaluate the performance of our SVM model on both the training and test datasets.
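A matching SVM pipeline, under the same assumptions (the C and kernel values are illustrative), could be:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling matters for SVMs as well; probability=True would be needed
# if probability scores (e.g., for ROC curves) were required later.
svm_pipe = Pipeline([("scaler", StandardScaler()),
                     ("svm", SVC())])
svm_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}
best_svm = tune_clf_hyperparameters(svm_pipe, svm_grid, X_train, y_train)
```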

In the critical context of diagnosing heart disease, our primary goal is to achieve a high recall for the positive class: every potential heart disease case must be identified, as even a single missed diagnosis could have serious consequences. While aiming for high recall, however, it is important to maintain balanced performance in order to avoid unnecessary medical interventions for healthy people. We now compare our models against these crucial medical benchmarks.
The SVM model performs admirably in identifying potential heart patients. With a recall of 0.97
for class 1, it is clear that almost all patients with heart disease are accurately identified. This is critically
important in a medical setting. However, the model's balanced performance ensures that, while aiming for
high recall, it does not sacrifice precision, avoiding overburdening the system with unnecessary alerts.

8.0 RESULT AND DISCUSSION

a) Process (RapidMiner) (MUHAMMAD NAUFAL)

i) Examples of Data integration:

Step 1: Load the sample dataset “Heart Disease.csv”

- Insert a “Read CSV” operator
- Right-click “Read CSV” and import the data from “Heart Disease.csv”
- Change the type of “Target” from integer to binominal

Step 2: Set the role of the “Target” attribute as label

- Insert a “Set Role” operator
- Select “Target” as the attribute and “label” as the target role

ii) Examples of Data Cleaning:

In this process, we clean all the missing values in the data.

- Select “Cholesterol”, set the condition to “is not missing”, and you will get the result.

iii) Examples of the Data Selection process:

We reduce the number of examples from 303 to 150, which simplifies the conclusions drawn from the data.

Step 1: Insert a “Sample” operator

iv) Use of Split or Cross Validation

- Insert a “Cross Validation” operator into the process

v) Use of machine learning technique operators

In this process, we apply cross validation to two machine learning techniques: Decision Tree and Random Forest.

1. The decision tree is a popular data mining algorithm used for classification and regression tasks. A decision tree is a flowchart-like structure in which an internal node represents a feature or attribute, a branch represents a decision rule, and each leaf node represents the outcome or class label.
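As a rough Python analogue of RapidMiner's Cross Validation operator applied to both techniques (the fold count and file path are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("Heart Disease.csv")  # assumed local path
X, y = df.drop(columns="target"), df["target"]

# 10-fold cross-validated accuracy for each classifier.
for name, clf in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                  ("Random Forest", RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```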

b) Result (MUHAMMAD IMAN HAFIZ)

Result:
The accuracy of the Decision Tree is 58.00%. The total number of True Positive samples is 61, while True Negatives number 26. The algorithm's ability to predict positive values is not good: its ability to classify positive values (precision) is 60.40%, and its ability to predict positive values (recall) is 72.62%.
2. Random Forest is a classifier that builds several decision trees on various subsets of the given dataset and averages them to improve predictive accuracy. The greater the number of trees in a Random Forest, the higher its accuracy and problem-solving ability.
Result:

The accuracy of the Random Forest is 58.67%. The total number of True Positive samples is 59, while True Negatives number 29. The algorithm's ability to predict positive values is good, but not its ability to predict negative values. Its ability to classify positive values (precision) is 61.46%, and its ability to predict positive values (recall) is 70.24%.
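As a consistency check, the reported percentages can be reproduced from the confusion-matrix counts on the 150-example sample. The true positive/negative counts are quoted from the results above, while the false positive/negative counts (40/23 and 37/25) are inferred so that each sample totals 150:

```python
# Reproduce accuracy, precision, and recall from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total, tp / (tp + fp), tp / (tp + fn)

for name, counts in [("Decision Tree", (61, 26, 40, 23)),   # TP, TN, FP, FN
                     ("Random Forest", (59, 29, 37, 25))]:
    acc, prec, rec = metrics(*counts)
    print(f"{name}: accuracy={acc:.2%}, precision={prec:.2%}, recall={rec:.2%}")

# Decision Tree: accuracy=58.00%, precision=60.40%, recall=72.62%
# Random Forest: accuracy=58.67%, precision=61.46%, recall=70.24%
```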

c) Compare two techniques for producing the result (MUHAMMAD IMAN HAFIZ)

Step 1: Replace “Cross Validation” with the “Compare ROCs” operator

Step 2: Insert “Decision Tree” and “Random Forest” and link up the operators as shown below.

Result:

As the graph shows, the blue line (representing the Decision Tree) reaches 1.00 while the other lines stay below it; the red line (Random Forest) is lower than the Decision Tree's. An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability: it predicts 0s as 1s and 1s as 0s. We can therefore conclude that the Decision Tree is the better machine learning model compared to the Random Forest.
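An equivalent comparison outside RapidMiner could be sketched with scikit-learn (1.0 or later) and matplotlib, assuming the fitted models and held-out split from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Plot both models' ROC curves on the held-out test set, mirroring the
# "Compare ROCs" operator; the diagonal marks a chance-level model.
ax = plt.gca()
RocCurveDisplay.from_estimator(best_dt, X_test, y_test,
                               name="Decision Tree", ax=ax)
RocCurveDisplay.from_estimator(best_rf, X_test, y_test,
                               name="Random Forest", ax=ax)
ax.plot([0, 1], [0, 1], "k--", label="Chance (AUC = 0.5)")
ax.legend()
plt.show()
```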

d) Graph (MUHAMMAD IMAN HAFIZ)

The graph shows the number of people with heart disease according to their age. The blue line represents the number of people who have heart disease, and the orange line represents the number of people who do not.

The graph shows that the number of people with heart disease increases with age. For example, at age 20, there are about 10 people with heart disease and 84 people without; at age 80, there are about 130 people with heart disease and 66 people without.

The pie chart shows the number of people with heart disease and cholesterol. It looks like there are
more people with heart disease (84) than people with cholesterol (66). The target number of people with
cholesterol is 150, but the current number is much lower than that.

The pie chart also shows that the majority of people with heart disease do not have cholesterol (NO),
while the majority of people with cholesterol do not have heart disease (NO). This suggests that there is not a
strong correlation between heart disease and cholesterol.

The chart shows the number of people with heart disease, broken down by sex and target
status. There are more women with heart disease than men, regardless of target status. Among
women, there are more who have not reached the target than those who have. Among men, there are
more who have reached the target than those who have not. The difference in the number of people
with heart disease between those who have reached the target and those who have not is smaller for
men than for women.

The box plot suggests a positive correlation between resting blood pressure and the risk of heart disease: as resting blood pressure increases, the risk of heart disease also increases. The blue box represents the people with heart disease, and the light blue box represents the people without heart disease. The number of people with heart disease appears to be much lower than the number without. The target for resting blood pressure appears to be 66, and the current average resting blood pressure appears to be higher than that target.

9.0 CONCLUSION (TUAN NURASYIQIN)

The RapidMiner company developed an integrated environment for data preparation, deep learning, text mining, predictive analysis, and machine learning; this data processing software is commonly used as a data science platform. Heart disease is defined by a blockage in the coronary arteries, the blood vessels that carry oxygen and blood to the heart. The goal of this research is to develop a model that will assist medical professionals in determining whether a patient has heart disease before the condition worsens and becomes more difficult to treat. Achieving a high recall for the positive class is our main objective, and the SVM model performs admirably when it comes to identifying potential heart patients.

The decision tree is a popular data mining algorithm for classification and regression; the Decision Tree's accuracy here is 58.00%, and its ability to predict positive values is not good. A Random Forest's accuracy and problem-solving ability improve as the number of trees increases; its ability to predict positive values is good (61.46% precision, 70.24% recall), although it does not predict negative values well. An excellent model has an AUC close to one, indicating a good measure of separability, and on this basis we conclude that the Decision Tree algorithm outperforms the Random Forest model.

10.0 REFERENCES

1. Hotz, N. (2023). CRISP-DM. Data Science Project Management. https://www.datascience-pm.com/crisp-dm-2/
2. RPubs. (n.d.). CRISPR-DM Case Study. https://rpubs.com/Argaadya/crispr_dm
3. scikit-learn. (n.d.). 1.10. Decision Trees. https://scikit-learn.org/stable/modules/tree.html
4. Subramanian, D. (2021, July 12). A Simple Introduction to K-Nearest Neighbors Algorithm. Medium. https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-algorithm-b3519ed98e
5. Centers for Disease Control and Prevention. (2019). Heart Health Information: About Heart Disease. https://www.cdc.gov/heartdisease/about.htm
6. WebMD. (2002). Heart Disease: Types, Causes, and Symptoms. https://www.webmd.com/heart-disease/heart-disease-types-causes-symptoms

PART A: Group Information

Group Members: Noor Alyani Binti Ahmad Zamani, Tuan Nurasyiqin Binti Tuan Aznan, Nurin Izzdhinie Binti Nasrudin, Muhammad Iman Hafiz bin Mohd Ilham, Muhammad Naufal Bin Md.Din

Project marking criteria:
- Correct format
- Able to give a good introduction
- Case study background
- Clear problem statement
- Objectives are well highlighted
- Methodology: CRISP-DM life cycle
- Good explanation of each attribute in the dataset (explain how the dataset is obtained; explain each attribute in the dataset)
- Dataset and analytic application
- Result and discussion: process
- Result and discussion: result, graph
- Result and discussion: techniques

TOTAL /70
