Cardiovascular Disease Prediction Using Machine Learning
SUBMITTED TO
ARUNURU NAVEEN
Reg.No: L20MC23014
DEPARTMENT OF MCA
(Affiliated to ACHARYA NAGARJUNA UNIVERSITY)
2019-2021
DEPARTMENT OF MCA
BAPATLA ENGINEERING COLLEGE
BAPATLA-522101
CERTIFICATE
This is to certify that this project work entitled “Cardiovascular Disease Prediction Using
Machine Learning” is the bonafide work carried out by ARUNURU NAVEEN, Reg.No:
L20MC23014, submitted in partial fulfillment of the requirements for the award of the degree of
“Master of Computer Applications” during the academic year 2019-2021.
The results submitted in this project have been verified and are found to be satisfactory.
The results embodied in this thesis have not been submitted to any other university for the
award of any other degree/diploma.
ACKNOWLEDGEMENT
I sincerely thank the following distinguished personalities who have given their advice
and support for successful completion of this work.
I extend my sincere thanks to Sri K.N. Prasad, Head of the Department, for
extending cooperation and providing the required resources.
I am deeply indebted to my most respected project guide, Sri N. Kiran Kumar,
Assistant Professor, Dept. of MCA, for his valuable and inspiring guidance, comments,
suggestions and encouragement.
I extend my sincere thanks to all other teaching and non-teaching staff of the
department, who helped directly or indirectly, for their cooperation and encouragement.
ARUNURU NAVEEN
(L20MC23014)
This is to declare that the project “Cardiovascular Disease Prediction Using Machine Learning” at
Bapatla Engineering College has been presented by me during the academic year 2019-2021 in
partial fulfillment of the requirements for the degree of “Master of Computer Applications”.
I also declare that this project is the result of my own efforts and that it has not been
submitted to any other university for the award of a degree or diploma.
ARUNURU NAVEEN
(L20MC23014)
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
CHAPTER-1
INTRODUCTION
1.1 INTRODUCTION
CHAPTER- 2
LITERATURE SURVEY
2.1 LITERATURE REVIEW
CHAPTER-3
THEORETICAL BACKGROUND
3.1 INTRODUCTION
3.2 INTRODUCTION TO PYTHON
3.3 BENEFITS OF PYTHON
CHAPTER-4
SYSTEM ANALYSIS
4.1 EXISTING SYSTEM
4.1.1 DISADVANTAGES OF EXISTING SYSTEM
4.2 PROPOSED SYSTEM
4.2.1 ADVANTAGES OF PROPOSED SYSTEM
CHAPTER- 5
SYSTEM DESIGN
5.1 INTRODUCTION
5.2 MODULES
5.2.1 DATASET
5.2.2 PREPROCESSING
5.2.3 GRAPHS
5.2.4 PREDICTION
5.3 SYSTEM ARCHITECTURE
5.4 UML DIAGRAMS
5.4.1 CONSTRUCTION OF USE CASE DIAGRAMS
5.4.2 SEQUENCE DIAGRAMS
5.4.3 CLASS DIAGRAM
5.4.4 ACTIVITY DIAGRAM
CHAPTER-6
SYSTEM REQUIREMENTS
6.1 SYSTEM REQUIREMENTS
6.1.1 HARDWARE REQUIREMENTS
6.1.2 SOFTWARE REQUIREMENTS
CHAPTER-7
SYSTEM IMPLEMENTATION
7.1 INPUT AND OUTPUT DESIGNS
7.1.1 LOGICAL DESIGN
7.1.2 PHYSICAL DESIGN
7.2 INPUT & OUTPUT REPRESENTATION
7.2.1 INPUT DESIGN
7.2.2 OBJECTIVES
7.2.3 OUTPUT DESIGN
CHAPTER-8
SYSTEM TESTING
8.1 INTRODUCTION
8.2 LEVELS OF TESTING
8.2.1 BLACK BOX TESTING
8.2.2 WHITE BOX TESTING
CHAPTER-9
OUTPUT SCREENS
CONCLUSION
REFERENCES
ABSTRACT
Cardiovascular disease (CVD) impairs the heart and blood vessels and often leads to death or
physical paralysis. Early, automatic detection of CVD can therefore save many lives. Multiple
investigations have pursued this objective, but there is still room for improvement in
performance and reliability. This study is another step in that direction.
In this study, five machine learning techniques (SVM, DT, KNN, GB, and RFC) have been employed for
CVD detection using publicly available data from the University of California Irvine repository. Model
performance is improved by removing outliers and attributes with null values, and the models predict whether a patient
has cardiovascular disease or not.
Problem Statement:
The problem statement for this study revolves around the need for improved early detection of
cardiovascular disease (CVD). Despite existing efforts, there are still challenges in achieving
high performance and reliability in CVD detection using machine learning techniques. This
study aims to address these challenges by investigating the effectiveness of support vector
machines (SVM), decision trees (DT), k-nearest neighbors (KNN), gradient boosting (GB), and
random forest classifiers (RFC) on CVD detection. The specific objectives include optimizing
model performance by removing outliers and handling null attribute values, ultimately aiming to
accurately predict whether a patient has cardiovascular disease or not.
Objective:
The primary objective of this study is to enhance the early detection of cardiovascular disease
(CVD) through the utilization of machine learning techniques. Specifically, the study aims to
employ support vector machines (SVM), decision trees (DT), k-nearest neighbors (KNN),
gradient boosting (GB), and random forest classifiers (RFC) on publicly available data from the
University of California Irvine repository. By optimizing the performance of these models
through outlier removal and handling null attribute values, the study seeks to accurately predict
whether a patient has cardiovascular disease or not.
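The two preprocessing steps named above, handling null attribute values and removing outliers, can be sketched in a few lines. This is a minimal illustration using only the standard library, not the study's actual pipeline: the interquartile-range (IQR) rule, the helper names, and the toy records with their field names are all assumptions made for the example.

```python
from statistics import quantiles


def drop_null_rows(rows):
    """Keep only records in which every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def remove_outliers_iqr(rows, column, k=1.5):
    """Drop records whose `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[column] <= high]


# Hypothetical toy records; the field names are illustrative only.
records = [
    {"age": 52, "chol": 212}, {"age": 61, "chol": 234},
    {"age": 45, "chol": None}, {"age": 58, "chol": 248},
    {"age": 49, "chol": 226}, {"age": 66, "chol": 564},
    {"age": 55, "chol": 240}, {"age": 63, "chol": 219},
]

complete = drop_null_rows(records)            # removes the record with a missing value
clean = remove_outliers_iqr(complete, "chol")  # removes the extreme cholesterol reading
print(len(records), len(complete), len(clean))
```

The IQR rule is only one common choice; a z-score threshold or domain-specific limits would slot into the same place.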
CHAPTER-1
INTRODUCTION
1.1 INTRODUCTION
Cardiovascular disease (CVD) remains a leading cause of mortality and morbidity globally,
posing significant challenges to public health systems worldwide. Despite advancements in
medical technology and treatment modalities, early detection of CVD remains paramount for
effective management and prevention of adverse outcomes. Machine learning (ML) techniques
have emerged as promising tools for improving the accuracy and efficiency of CVD detection,
offering the potential to leverage large datasets to uncover subtle patterns and relationships
within patient data.
The prevalence of CVD underscores the urgent need for reliable and accessible methods for its
early detection. Traditionally, diagnosis has relied on clinical assessments, such as blood tests
and imaging studies, which may be costly, time-consuming, and subject to interpretation
variability. In contrast, ML algorithms can analyze diverse data sources, including patient
demographics, medical history, and biomarkers, to generate predictive models capable of
identifying individuals at risk of developing CVD with high accuracy and specificity.
This project builds upon previous research by exploring the efficacy of several ML algorithms,
including support vector machines (SVM), decision trees (DT), k-nearest neighbors (KNN),
gradient boosting (GB), and random forest classifiers (RFC), in CVD detection. Leveraging
publicly available datasets from the University of California Irvine repository, the study aims to
optimize model performance by addressing common challenges such as outlier detection and
handling missing data. By enhancing the robustness and reliability of CVD detection algorithms,
this research contributes to the ongoing efforts to mitigate the global burden of cardiovascular
disease and improve patient outcomes.
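A comparison of the five classifiers named above might look like the following sketch. It assumes scikit-learn is installed and uses a synthetic dataset as a stand-in for the repository data; the sample count, feature count, train/test split, and default hyperparameters are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 303 rows, 13 numeric features, binary label.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "GB": GradientBoostingClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the held-out rows

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Accuracies on synthetic data say nothing about performance on real patient records; the sketch only shows how the models can be trained and scored side by side.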
CHAPTER- 2
LITERATURE SURVEY
2.1 LITERATURE REVIEW
Many researchers have examined heart disease prediction frameworks built with various
data mining techniques. They report the datasets and algorithms used, together with test
findings and possible future work on these frameworks, with the aim of achieving more
efficient results. Numerous research efforts have sought efficient
techniques and high accuracy in recognizing disorders associated with the heart.
Pattekari [36] created a model using the Naive Bayes data mining technique.
It is a computer program in which the user answers predetermined questions. It extracts hidden
information from a dataset and compares user values against a trained data set. It can provide answers
to difficult questions regarding heart disease diagnosis, allowing healthcare providers to
make more informed clinical decisions than traditional decision support systems. It
also helps reduce treatment expenses by providing effective treatments.
Tran [37] built an intelligent system using the Naive Bayes data mining modeling
technique. It is a web application in which the user answers pre-programmed questions. It
searches a database for hidden information and compares user values to a trained data set. It can
provide answers to difficult questions about cardiac disease diagnosis, allowing healthcare
professionals to make more informed clinical decisions than traditional decision support systems.
It also lowers treatment costs by delivering effective care.
Gnaneswar [38] demonstrates the significance of monitoring the heart rate when cycling.
By monitoring their heart rate while pedaling, cyclists can manage aspects of a cycling session,
such as cadence, to gauge the level of activity. By managing their pedaling exertion, cyclists can
avoid overtraining and cardiac failure. The cyclist’s heart rate can be used to determine the intensity
of an exercise. Heart rate can be measured using a wearable sensor. Unfortunately, the sensor
does not capture all information at regular intervals, such as one second, two seconds, etc.
Consequently, a heart rate prediction model is needed to fill in the gaps.
Gnaneswar’s [38] work aims to use a feedforward neural network to construct a predictive
model for heart rate based on cycling cadence. The heart rate and cadence at a given second are the
inputs, and the output is the predicted heart rate for the following second. Using a feedforward neural
network, the relationship between heart rate and cycling cadence is represented statistically.
Mutijarsa [39] builds on the expansion of healthcare services. Numerous
advances in wireless communication have been made for predicting heart disease.
Utilizing data mining (DM) techniques for the detection and localization of coronary disease is
highly useful. In their study, a comparative analysis of multiple single and hybrid
data mining algorithms is conducted to determine which algorithm most accurately
predicts coronary disease.
Yeshvendra [40] argues that the use of machine learning algorithms in predicting various diseases is
growing. This idea is significant because a machine learning algorithm can
reason in a way comparable to a human, which can improve the accuracy of coronary disease
prediction. Patil [41] notes that a proper diagnosis of cardiac disease is one of the most
fundamental biomedical problems that must be addressed. Three data mining techniques
(support vector machine, naive Bayes, and decision tree) were used to build a
decision support system. Tripoliti [42] argues that the
identification of diseases with high prevalence rates, such as Alzheimer’s, Parkinson’s, diabetes,
breast cancer, and coronary disease, is one of the most fundamental biomedical problems demanding
immediate attention. Gonsalves [43] attempted to predict coronary CVD using machine learning
and historical medical data. Oikonomou [44] provides an overview of the kinds of
data encountered in chronic disease settings. Using multiple machine learning methods,
they elucidated extreme value theory in order to better measure chronic disease severity and
risk.
According to Ibrahim [45], machine learning-based systems can be utilized for predicting and
diagnosing heart disease. Active learning (AL) methods enhance the accuracy of classification
by integrating user-expert system feedback with sparsely labeled data. Furthermore, Pratiyush et
al. [46] explored the role of ensemble classifiers over the XAI framework in predicting heart
disease from CVD datasets. The proposed work employed a dataset comprising 303 instances
and 14 attributes, with categorical, integer, and real type attribute characteristics, and the
classification task was based on classification techniques such as KNN, SVM, naive Bayes,
AdaBoost, bagging and LR.
The literature attempted to create strategies for predicting cardiac disease diagnosis. Because of
the high dimensionality of textual input, many traditional machine learning algorithms fail to
incorporate it into the prediction process at the same time [47,48,49,50,51,52,53]. As a result,
this paper investigates and develops a set of robust machine learning algorithms for improving
the early prediction of CVD development, allowing for prompt intervention and recovery.
CHAPTER-3
THEORETICAL BACKGROUND
3.1 INTRODUCTION:
In the endeavor to construct a machine learning-based platform for predicting PCOS,
understanding the multifaceted factors contributing to the condition's onset and progression is
paramount. These factors span a spectrum of clinical and physiological parameters, from
hormonal imbalances and ovarian morphology to metabolic dysfunction and genetic
predispositions. Harnessing advanced machine learning methodologies, including feature
selection algorithms and predictive modeling techniques, researchers can scrutinize extensive
datasets encompassing these parameters to unveil patterns and correlations indicative of PCOS.
The development of precise prediction models holds the potential to facilitate early identification
of PCOS, enabling prompt interventions and tailored treatment approaches. Furthermore,
integrating such platforms with chatbot functionalities could revolutionize the diagnostic process
by providing personalized guidance and support to individuals concerned about PCOS. By
merging technological innovation with clinical practice, these integrated platforms aim to
streamline diagnostics and improve the overall management of PCOS, ultimately enhancing the
well-being and reproductive health outcomes of affected individuals.
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Python programming language is unusual in that a program is
both compiled and interpreted. First, the compiler translates a program into an
intermediate language called Python bytecode, the platform-independent code interpreted by
the interpreter on the Python platform. The interpreter then parses and runs each Python bytecode
instruction on the computer. Compilation happens just once; interpretation occurs each time the
program is executed. The following figure illustrates how this works.
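As a rough illustration of the two steps described above, the built-in `compile()` function and the standard `dis` module make the bytecode visible. The function `add` and the file label `<example>` are made up for this sketch; the exact instruction names printed vary between Python versions.

```python
import dis

source = "def add(a, b):\n    return a + b\n"

# Step 1: compile the source text into a code object that holds bytecode.
code = compile(source, "<example>", "exec")

# Step 2: the interpreter executes the bytecode; here that defines `add`.
namespace = {}
exec(code, namespace)
add = namespace["add"]

# The raw bytecode of the function and a readable list of its instructions.
raw_bytes = add.__code__.co_code
instructions = [ins.opname for ins in dis.get_instructions(add)]
print(instructions)
print(add(2, 3))
```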
• Machine learning models involve machines learning from data without
human intervention.
• Machine learning is the science of making computers learn and act
like humans by feeding them data and information, without being explicitly programmed.
• There are many machine learning algorithms that can help
predict a patient's disease.
Python
Python is a high-level, interpreted language. It takes an
object-oriented approach, and its main aim is to help programmers write clear, logical
code for small and large-scale projects.
Python is dynamically typed and garbage collected. It supports multiple programming paradigms:
procedural, object-oriented, functional, and structured
programming. It has many built-in functions and supports the filter, map, and reduce
functions (in Python 3, reduce is provided by the functools module). The major machine learning algorithms and libraries are supported by the Python
programming language. Python also supports lists, dicts, sets, and generators. Python code can
be run in different environments such as Anaconda and PyCharm.
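The language features listed above can be seen together in a short sketch. The list of ages and the helper `running_mean` are invented for illustration.

```python
from functools import reduce  # in Python 3, reduce lives in functools

ages = [45, 52, 58, 61, 66]

over_55 = list(filter(lambda a: a > 55, ages))    # keep the matching items
in_months = list(map(lambda a: a * 12, ages))     # transform every item
total = reduce(lambda acc, a: acc + a, ages, 0)   # fold the list into one value

summary = {"min": min(ages), "max": max(ages)}    # dict
decades = {a // 10 * 10 for a in ages}            # set comprehension


def running_mean(values):
    """Generator: yields the mean of the values seen so far."""
    total_so_far = 0
    for count, value in enumerate(values, start=1):
        total_so_far += value
        yield total_so_far / count


means = list(running_mean(ages))
print(over_55, total, summary, decades, means)
```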
• Python is a simple, object-oriented programming language.
• The language and its implementation support software engineering
principles, with ready-made libraries for different machine learning algorithms and
other algorithms that are simple to use.
• Coding is smooth in Python, and data analysis can be done easily.
Machine learning has become so accessible that we now have modules and APIs at our disposal, and you
can engage in machine learning very easily with almost no knowledge of how it works.
With the defaults from Scikit-learn, you can get 90-95% accuracy on many tasks right out of the
gate. Machine learning is a lot like a car: you do not need to know much about how it works in
order to get an incredible amount of utility from it.
Despite the apparent age and maturity of machine learning, I would say there is no better time
than now to learn it, since you can actually use it. Machines are quite powerful; the one you are
working on can probably handle most of this work quickly. Data is also very plentiful lately.
Anaconda
Anaconda is a free and open-source distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data processing,
predictive analytics, etc.) that aims to simplify package management and deployment. It is
developed and maintained by Anaconda, Inc. The distribution includes data-science packages
suitable for Windows, Linux, and macOS. Package versions are managed
by the package management system conda. This package manager was spun out as a separate
open-source package as it ended up being useful on its own and for other things than Python.
There is also a small, bootstrap version of Anaconda called Miniconda, which includes only
conda, Python, the packages they depend on, and a small number of other packages.
Anaconda Console
Jupyter notebook
Jupyter Notebook, formerly known as IPython Notebook, is an interactive, web-based computational
environment for creating Jupyter notebook documents. The term notebook covers both the application
and the documents it produces. A notebook document is a JSON document that follows a versioned
schema and holds the input and output cells. It integrates with many programming languages and
offers flexible choices among them. Notebook files use the “.ipynb” extension. It is an open-source
software package with interactive communication and its own open standards, backed by an
open community that suits budding programmers. It is highly flexible, and its configuration and
integration are simple, so it can be set up without disruption and run efficiently on any
system of choice.
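Because a notebook is a JSON document, a minimal one can be built and read back with the standard `json` module. The sketch below follows the general nbformat 4 layout (top-level `cells`, `metadata`, `nbformat`, `nbformat_minor`); the cell contents are invented, and a real notebook saved by Jupyter usually carries more metadata than shown here.

```python
import json

# Minimal notebook document in the nbformat 4 layout (contents are illustrative).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo notebook"]},
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been run yet
            "metadata": {},
            "outputs": [],            # filled in when the cell is executed
            "source": ["print('hello')"],
        },
    ],
}

text = json.dumps(notebook, indent=1)  # the kind of text a .ipynb file contains
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```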
It is one of the most widely used software packages for designing and developing
products, and it has broad community support. It also provides scalability in the code and
its deployment. Different languages can be selected for a project.
The created notebook files can be shared and stored in various ways
for further use. It supports rich, interactive output and works well for
graphing, plotting, and visualizing data. Its data integration is strong:
it integrates with big data tools and can process chunks of values in a reasonable time, which gives
better performance and higher computational throughput. Data tasks such as
cleaning, transforming, modeling, and visualizing can all be done in it.
Machine learning gives a computer the ability to learn without being explicitly
programmed. The main types of machine learning are described below:
Supervised Learning: Supervised learning is learning from labelled data. It is the type of
machine learning that maps inputs to outputs based on example input-output pairs. In
supervised learning, each training example is a pair of an input and a desired output value.
A supervised learning algorithm analyzes the training data and produces a function which can be
used for mapping new data.
Fig 2.1 Supervised Learning
The steps to solve a supervised learning problem are as follows:
• Determine the type of data: before doing anything else, the user should understand which
type of data set is to be used for training.
• Gather the training data set, either from human experts or from measurements.
• Determine the input features of the learned function: each input is
converted into a feature vector. The number of features should not be too large but should contain enough
information to accurately predict the outputs.
• Choose the structure of the learned function and the learning algorithm, for example support vector
machines or decision trees.
• Analyze the output and verify it against the data sets to confirm that the outputs are accurate.
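The steps above can be sketched end to end with a very small k-nearest-neighbors classifier written in plain Python. The feature vectors (resting heart rate, systolic pressure) and their labels are hypothetical values chosen for illustration, and k-NN is used here only because it is short to write, not because it is the recommended model.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Labelled training pairs (hypothetical): feature vector -> 0 (healthy) or 1 (disease).
train_X = [(62, 115), (65, 120), (70, 118), (88, 150), (92, 160), (95, 155)]
train_y = [0, 0, 0, 1, 1, 1]

# The learned mapping is then applied to new, unlabelled inputs.
prediction_low = knn_predict(train_X, train_y, (66, 119))
prediction_high = knn_predict(train_X, train_y, (90, 152))
print(prediction_low, prediction_high)
```

In practice the features would be scaled first, since Euclidean distance is sensitive to the units of each attribute.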
Unsupervised Learning:
Unsupervised learning is a type of machine learning that helps find previously
unknown patterns in a data set without any known labels. It is also known as self-organization and
allows modelling the probability densities of given inputs.
Fig 2.2 Unsupervised Learning
Some of the algorithms used in unsupervised learning are:
• Clustering
• Anomaly detection
• Neural networks
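Clustering, the first item in the list above, can be illustrated with a bare-bones k-means on one-dimensional data. The readings and the two starting centers are hypothetical; no labels are supplied, and the algorithm groups the values purely by proximity.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalars: assign to the nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabelled cholesterol readings.
readings = [180, 185, 190, 195, 280, 290, 300]
centers, clusters = kmeans_1d(readings, centers=[180.0, 300.0])
print(centers, clusters)
```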
Semi-Supervised Machine Learning: This sits in the middle: some of the data is labeled
and some is unlabeled, so it can be processed by a combination of supervised and unsupervised
learning.
The algorithms have been compared on two parameters: the size of the dataset and the number
of technical indicators used. Accuracy and F-measure values have been computed for each
algorithm. A long-term model has been used to compute the accuracy and F-measure.
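For reference, accuracy and F-measure can be computed from true and predicted labels as in the sketch below. It uses the usual definitions (F-measure as the harmonic mean of precision and recall); the label lists are made up and do not come from this project's results.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F-measure) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = disease, 0 = no disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)
```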
Reinforcement Learning: This type of learning is used to reinforce or strengthen the network
based on critic information. That is, a network being trained under reinforcement learning
receives some feedback from the environment. However, the feedback is evaluative and not
instructive as in the case of supervised learning. Based on this feedback, the network
adjusts its weights to obtain better critic information in future.
This learning process is similar to supervised learning, but much less information may be available.
The following figure gives the block diagram of reinforcement learning:
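Independently of the block diagram, the feedback loop described above can be sketched with a two-action "bandit" agent. This is a deliberately simple stand-in for a network: the agent keeps one value estimate per action (playing the role of the weights) and adjusts it from reward feedback alone. The reward probabilities, step count, and epsilon value are assumptions made for the example.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: learns action values from reward feedback only."""
    rng = random.Random(seed)
    values = [0.0] * len(reward_probs)  # current estimate per action
    counts = [0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(values))                       # explore
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # Evaluative feedback: the environment only says how good the action was.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # running mean
    return values


estimates = run_bandit([0.2, 0.8])
best_action = max(range(len(estimates)), key=estimates.__getitem__)
print(estimates, best_action)
```

The agent is never told which action is correct, only how much reward each attempt produced, which is the evaluative (rather than instructive) feedback the text refers to.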
import numpy as np
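• NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more
efficiently and with less code than is possible using Python’s built-in
sequences.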
• A growing plethora of scientific and mathematical Python-based packages are
using NumPy arrays; though these typically support Python-sequence input,
they convert such input to NumPy arrays prior to processing, and they often
output NumPy arrays. In other words, in order to efficiently use much
(perhaps even most) of today’s scientific/mathematical Python-based
software, just knowing how to use Python’s built-in sequence types is
insufficient - one also needs to know how to use NumPy arrays.
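The difference between built-in sequences and NumPy arrays can be seen in a small standardization example. It assumes NumPy is installed; the cholesterol values are invented for illustration.

```python
import numpy as np

chol = [212, 234, 248, 226, 240]

# Plain Python: explicit loops to standardize the values (population std).
mean = sum(chol) / len(chol)
std = (sum((c - mean) ** 2 for c in chol) / len(chol)) ** 0.5
z_list = [(c - mean) / std for c in chol]

# NumPy: the same computation written as whole-array expressions.
arr = np.asarray(chol, dtype=float)  # sequence input is converted to an array
z_arr = (arr - arr.mean()) / arr.std()

print(z_arr)
```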
import time
This module provides various time-related functions. For related functionality, see also the
datetime and calendar modules.
Although this module is always available, not all functions are available on all platforms. Most
of the functions defined in this module call platform C library functions with the same name. It
may sometimes be helpful to consult the platform documentation, because the semantics of these
functions varies among platforms.
The epoch is the point where the time starts, and is platform dependent. For Unix, the epoch is
January 1, 1970, 00:00:00 (UTC). To find out what the epoch is on a given platform, look at
time.gmtime(0).
The term seconds since the epoch refers to the total number of elapsed seconds since the epoch,
typically excluding leap seconds. Leap seconds are excluded from this total on all POSIX-
compliant platforms.
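A short check of the epoch and of "seconds since the epoch" on the current platform might look like this. The variable names are arbitrary, and the printed date depends on the platform's epoch as described above.

```python
import time

epoch = time.gmtime(0)  # struct_time for the platform's epoch
now = time.time()       # seconds since the epoch, as a float

print(time.strftime("%Y-%m-%d %H:%M:%S", epoch))

# Rough conversion to days; leap seconds are ignored, as on POSIX platforms.
days_since_epoch = now / 86400
print(round(days_since_epoch))
```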
The functions in this module may not handle dates and times before the epoch or far in the
future. The cut-off point in the future is determined by the C library; for 32-bit systems, it is
typically in 2038.
Function strptime() can parse 2-digit years when given %y format code. When 2-digit years are
parsed, they are converted according to the POSIX and ISO C standards: values 69–99 are
mapped to 1969–1999, and values 0–68 are mapped to 2000–2068.
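The two-digit year mapping can be checked directly with `time.strptime()`; the four sample strings below sit on the boundaries of the ranges just described.

```python
import time

# Parse two-digit years with %y and read back the four-digit year.
years = {text: time.strptime(text, "%y").tm_year for text in ("69", "99", "00", "68")}
print(years)
```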
UTC is Coordinated Universal Time (formerly known as Greenwich Mean Time, or GMT). The
acronym UTC is not a mistake but a compromise between English and French.
DST is Daylight Saving Time, an adjustment of the timezone by (usually) one hour during part
of the year. DST rules are magic (determined by local law) and can change from year to year.
The C library has a table containing the local rules (often it is read from a system file for
flexibility) and is the only source of True Wisdom in this respect.
The precision of the various real-time functions may be less than suggested by the units in which
their value or argument is expressed. E.g. on most Unix systems, the clock “ticks” only 50 or 100
times a second.
On the other hand, the precision of time() and sleep() is better than their Unix equivalents: times
are expressed as floating point numbers, time() returns the most accurate time available (using
Unix gettimeofday() where available), and sleep() will accept a time with a nonzero fraction
(Unix select() is used to implement this, where available).
The time value as returned by gmtime(), localtime(), and strptime(), and accepted by asctime(),
mktime() and strftime(), is a sequence of 9 integers. The return values of gmtime(), localtime(),
and strptime() also offer attribute names for individual fields.
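The nine-field time value and its named attributes can be inspected as follows. The timestamp 86400 (one day after the Unix epoch) is an arbitrary choice, and `calendar.timegm()` is used for the reverse conversion because it works in UTC, whereas `mktime()` interprets its argument as local time.

```python
import calendar
import time

t = time.gmtime(86400)  # one day after the epoch, in UTC

as_tuple = tuple(t)                            # the 9 integer fields
year_by_index, year_by_name = t[0], t.tm_year  # same field, two access styles

formatted = time.strftime("%Y-%m-%d", t)       # struct_time -> text
seconds = calendar.timegm(t)                   # UTC struct_time -> seconds since epoch

print(as_tuple, formatted, seconds)
```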
Changed in version 3.3: The struct_time type was extended to provide the tm_gmtoff and
tm_zone attributes when platform supports corresponding struct tm members.
Changed in version 3.6: The struct_time attributes tm_gmtoff and tm_zone are now available on
all platforms.
import os
This module provides a portable way of using operating system dependent functionality. If you
just want to read or write a file see open(), if you want to manipulate paths, see the os.path
module, and if you want to read all the lines in all the files on the command line see the fileinput
module. For creating temporary files and directories see the tempfile module, and for high-level
file and directory handling see the shutil module.
The design of all built-in operating system dependent modules of Python is such that as long as
the same functionality is available, it uses the same interface; for example, the function
os.stat(path) returns stat information about path in the same format (which happens to have
originated with the POSIX interface).
Extensions peculiar to a particular operating system are also available through the os module, but
using them is of course a threat to portability.
All functions accepting path or file names accept both bytes and string objects, and result in an
object of the same type, if a path or file name is returned.
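The bytes-in, bytes-out behaviour can be seen with os.listdir(). This is a minimal sketch that works in a temporary directory; the file name sample.txt is an arbitrary choice made for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create one file so that the directory listing is not empty.
    with open(os.path.join(folder, "sample.txt"), "w") as handle:
        handle.write("demo")

    str_names = os.listdir(folder)                 # str path in, str names out
    bytes_names = os.listdir(os.fsencode(folder))  # bytes path in, bytes names out

print(str_names, bytes_names)
```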
CHAPTER-4
SYSTEM ANALYSIS
4.1 EXISTING SYSTEM:
The existing system for cardiovascular disease (CVD) detection typically relies on traditional
risk assessment methods, which may include clinical assessments, laboratory tests, and medical
imaging. These methods often involve subjective interpretation and may not fully capture the
complex interplay of risk factors contributing to CVD development. Moreover, they may not be
scalable or easily adaptable to changing healthcare needs, particularly in the context of large-
scale population screening or remote monitoring.
Subjectivity: Traditional risk assessment methods may rely on subjective interpretation, leading to
variability in results and potentially inaccurate risk predictions.
Limited Scalability: Manual processes for CVD detection may be time-consuming and resource-intensive,
limiting scalability for population-wide screening initiatives.
Lack of Personalization: Existing systems may provide generalized risk assessments, overlooking
individual variability in risk factors and health status.
Inefficiency: Manual data analysis and interpretation can be inefficient, leading to delays in diagnosis
and treatment initiation.
Resistance to Change: Healthcare systems may be resistant to adopting new technologies and
processes, hindering the implementation of more advanced CVD detection methods.
The proposed system aims to enhance CVD detection through the integration of advanced
machine learning techniques with comprehensive data analysis. By leveraging large datasets and
employing algorithms such as support vector machines (SVM), decision trees (DT), k-nearest
neighbors (KNN), gradient boosting (GB), and random forest classifiers (RFC), the proposed
system can identify subtle patterns and relationships within patient data that may not be
discernible through traditional methods. Moreover, the proposed system incorporates novel
feature selection methods and neural network models for feature refinement and classification,
thereby improving predictive accuracy and reliability. By optimizing model performance through
rigorous data preprocessing and validation, the proposed system offers a more robust and
scalable approach to CVD detection.
The proposed system is intended to improve early CVD detection, with the potential to support
clinical practice by providing more accurate risk assessments and personalized interventions
for individuals at risk of developing cardiovascular disease.
Scalability: Machine learning models can efficiently process large volumes of data, making the proposed
system scalable for population-wide screening and remote monitoring.
Personalized Risk Assessment: By analyzing diverse data sources, the proposed system can provide
personalized risk assessments, enabling targeted interventions and preventive measures for individuals
at higher risk of developing CVD.
Automation: The proposed system automates the CVD detection process, reducing the reliance on
manual interpretation and streamlining clinical workflows.
Adaptability: Machine learning models can adapt to evolving healthcare needs and incorporate new
data sources, ensuring the system remains relevant and effective over time.
CHAPTER-5
SYSTEM DESIGN
5.1 INTRODUCTION
System Design Introduction:
5.2 MODULES
5.2.1 DATA COLLECTION:
The user collects the dataset from the Kaggle website. This dataset has the features and labels that
are used for prediction. The user loads the data through the Flask web framework, enters the
symptoms related to cardiovascular disease, and obtains a prediction of the stage of the disease.
5.2.2 PRE-PROCESSING:
In the pre-processing module, the datasets are converted to training data and then merged into a
single combined dataset. This dataset is used as the input to the next stage, in which the model
is created.
This is the final step, in which we assess how well the model performs on the testing data using
a scoring metric; here the accuracy score is used to evaluate the model. First, we create a
model instance and fit it to the training data using the fit method. We then use the predict
method to make predictions on x_test, the testing data, and store these predictions in a variable
called y_test_hat. For model evaluation, we pass y_test and y_test_hat to the accuracy_score
function and store the result in a variable called test_accuracy, which holds the testing accuracy
of the model. We followed these steps for a variety of classification algorithms and obtained the
corresponding test accuracy scores.
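The evaluation steps can be sketched in pure Python. The classifier below is a deliberately trivial stand-in (it always predicts the most common training label), the data are made up, and the local accuracy_score function is a hand-written stand-in that computes the fraction of correct predictions; none of this is the project's actual model or data.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a real classifier: always predicts the most common training label."""

    def fit(self, x_train, y_train):
        self.label_ = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


def accuracy_score(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical toy data: one feature (age); label 1 means disease present.
x_train, y_train = [[63], [41], [58], [70], [35]], [1, 0, 1, 1, 0]
x_test, y_test = [[66], [39], [61], [45]], [1, 0, 1, 1]

model = MajorityClassifier()      # create a model instance
model.fit(x_train, y_train)       # fit it to the training data
y_test_hat = model.predict(x_test)                  # predictions on the testing data
test_accuracy = accuracy_score(y_test, y_test_hat)  # share of correct predictions
print(test_accuracy)
```

Swapping in another classifier only changes the line that creates the model instance; the fit, predict, and scoring calls keep the same shape.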
In this stage the final dataset is taken as input and the model is created using a random
forest classifier in three steps.
First, the data is divided into testing and training sets, and the features and labels are extracted
from these datasets. Then the model is trained (fitted) on the training data. Finally, a pkl file is
created, which stores the model for this application.
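A minimal sketch of the split-and-save steps, using only the standard library, is shown below. The helper split_rows, the toy records, and the dictionary standing in for a fitted classifier are all assumptions made for illustration; the project's actual model object would be pickled in the same way.

```python
import os
import pickle
import random
import tempfile


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (features, label) with features = [age, cholesterol].
records = [([63, 233], 1), ([37, 250], 0), ([41, 204], 0), ([56, 236], 1),
           ([57, 354], 1), ([44, 263], 0), ([52, 199], 1), ([29, 204], 0)]

train_rows, test_rows = split_rows(records)
x_train = [features for features, _ in train_rows]   # features
y_train = [label for _, label in train_rows]         # labels

# Stand-in for a fitted classifier: any picklable object is saved the same way.
model = {"n_training_rows": len(x_train), "positive_rate": sum(y_train) / len(y_train)}

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.pkl")
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)        # write the pkl file
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)    # load it back, as the application would

print(restored == model)
```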
Algorithms
RANDOM FOREST
The Random Forest classification model is made up of several decision trees. In simple terms, it
combines the results from numerous decision trees to reach a single result. The main difference
between decision trees and random forests is that a decision tree considers all the possible feature
splits, whereas each tree in a random forest considers only a random subset of those features.
RF was developed by Breiman, L. [60]. It is an ensemble learning algorithm made up of several DT
classifiers, and the output category is determined collectively by these individual trees. As the
number of trees in the forest increases, the generalization error of the forest converges. The RF
also has important benefits. For example, it can manage high-dimensional data without feature
selection; the trees are independent of each other during the training process, so implementation is
fairly simple; and training is generally fast while the generalization ability remains good [4]. Each
tree in the forest produces its own prediction, and the RF combines these tree predictions into the
forest prediction [61]. The RF
model is visualized in Figur
reported a study on forecasting the downtime of a printing machine based on real time predictions of
imminent failures. In their study, they utilized unstructured historical machine data to train the ML
classification algorithms including RF, XGBoost, and LR in predicting the machine failures. Various
metrics were analyzed to determine the goodness of fit of the models. These metrics include empirical
cross-entropy, area under the receiver operating characteristic curve (AUC), receiver operating
characteristic curve itself (ROC), precision-recall curve (PRC), number of false positives (FP), true
positives (TP), false negatives (FN), and true negatives (TN) at various decision thresholds, and
calibration curves of the estimated probabilities. Based on the results obtained, in terms of ROC, all the
algorithms performed well and almost identically. In terms of decision thresholds, however, RF and
XGBoost performed better than LR. Given a set of independent variables, linear regression is used to
estimate a continuous dependent variable, whereas logistic regression is used to estimate a categorical
dependent variable [68]. Graphs of the linear regression model and the logistic regression model are
shown in Figure 9.
A Decision Tree (DT) is a tree-like structure composed primarily of nodes and branches, where the nodes
comprise a root node, intermediate nodes, and leaf nodes. The intermediate nodes are used to represent a
feature, and the leaf nodes are used to represent a class label [52]. DT can be used for feature
selection [57]. The DT
algorithm is presented in Figure
DT classifiers have gained considerable popularity in a number of areas, such as character identification,
medical diagnosis, and voice recognition. More notably, the DT model has the potential to decompose a
complicated decision-making mechanism into a series of simplified decisions by recursively splitting
covariate space into subspaces, thereby offering a solution that is amenable to interpretation.
Simple SVM: In case of linearly separable data in two dimensions, as shown in Figure 2.6, a
typical machine learning algorithm tries to find a boundary that divides the data in such a
way that the misclassification error can be minimized. If you closely look at Figure 2.6, there can
be several boundaries that correctly divide the data points. The two dashed lines as well as
one solid line classify the data correctly.
Fig. 2.6 Multiple Decision Boundaries
SVM differs from the other classification algorithms in the way that it chooses the
decision boundary that maximizes the distance from the nearest data points of all the
classes. An SVM does not merely find a decision boundary; it finds the optimal decision
boundary. The optimal decision boundary is the one that has the maximum margin from
the nearest points of all the classes. The points nearest to the decision boundary, which
determine this margin, are called support vectors, as seen in Figure 2.7. The decision
boundary in the case of support vector machines is called the maximum margin classifier, or
the maximum margin hyperplane.
Fig. 2.7 Decision Boundary with Support Vectors
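The maximum-margin idea can be illustrated with a small calculation. The sketch below does not train an SVM: it takes two hand-picked candidate boundaries (both hypothetical) that separate the same toy points, measures each boundary's distance to its nearest point, and keeps the one with the larger margin. The points lying exactly at that distance play the role of support vectors.

```python
import math


def distance_to_boundary(point, w, b):
    """Perpendicular distance from a point to the line w . x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, point)) + b) / math.hypot(*w)


def margin(points, w, b):
    """Distance from the boundary to its nearest point."""
    return min(distance_to_boundary(p, w, b) for p in points)


# Hypothetical, linearly separable toy data (both classes pooled together).
points = [(1, 1), (0, 2), (0, 0), (3, 3), (4, 2), (5, 5)]

# Two hypothetical boundaries that both separate the two classes.
candidates = {
    "x1 + x2 - 3 = 0": ((1.0, 1.0), -3.0),
    "x1 + x2 - 4 = 0": ((1.0, 1.0), -4.0),
}
margins = {name: margin(points, w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)   # the boundary with the larger margin

w, b = candidates[best]
support_vectors = [p for p in points
                   if math.isclose(distance_to_boundary(p, w, b), margins[best])]
print(best, margins[best], support_vectors)
```

A real SVM searches over all possible boundaries rather than a fixed list of candidates; the comparison here only shows the criterion it optimizes.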
Kernel SVM: Figure 2.6 and Figure 2.7 showed how the simple SVM algorithm can be used
to find the decision boundary for linearly separable data.
However, in the case of non-linearly separable data, such as the one shown in Figure 2.8, a
straight line cannot be used as a decision boundary.
In the case of non-linearly separable data, the simple SVM algorithm cannot be used. Rather,
a modified version of SVM, called Kernel SVM, is used. The kernel SVM projects the
non-linearly separable data from a lower-dimensional space into a higher-dimensional space
in such a way that data points belonging to different classes become linearly separable.
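The effect of moving to a higher dimension can be shown with a one-dimensional toy example. In the sketch below, which uses made-up data, no single threshold on x separates the classes, but after adding the feature x squared a single threshold on that new coordinate does. A kernel SVM achieves this kind of mapping implicitly through a kernel function; the explicit mapping here is only an illustration.

```python
def separable_by_threshold(values, labels):
    """True if one threshold on a single coordinate puts each class on its own side.

    Assumes no value is shared by points of both classes.
    """
    ordered = [label for _, label in sorted(zip(values, labels))]
    # Separable exactly when the labels, ordered by value, change at most once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


# Hypothetical 1-D data: class 1 lies on both sides of class 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = [1, 1, 0, 0, 0, 1, 1]

squared = [x * x for x in xs]   # second coordinate of the map x -> (x, x^2)

print(separable_by_threshold(xs, labels))        # original space
print(separable_by_threshold(squared, labels))   # along the added dimension
```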
Random Forest
Random forest is a tree-based algorithm which involves building several trees (decision trees),
then combining their output to improve generalization ability of the model. The method of
combining trees is known as an ensemble method. Ensembling is nothing but a combination
of weak learners (individual trees) to produce a strong learner.
Definition: A random forest is a classifier consisting of a collection of tree-structured
classifiers {h(x, Θk), k = 1, ...}, where the Θk are independent identically distributed (i.i.d.)
random vectors and each tree casts a unit vote for the most popular class at input x [4].
Random Forest Algorithm: The following are the basic steps involved in performing the
random forest algorithm:
• Choose the number of trees you want in your algorithm and repeat steps (i) and (ii).
• In case of a classification problem, each tree in the forest predicts the category to which the
new record belongs. Finally, the new record is assigned to the category that wins the
majority vote.
Figure 2.1 shows different trees labelling the class differently. What ensemble does is
take the mode (maximum occurring class) of the output produced by n different trees to create
a better model. To say it in simple words: Random forest builds multiple decision trees and
merges them together to get a more accurate and stable prediction.
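The voting step can be sketched as follows. The three "trees" are hypothetical one-rule classifiers written by hand, not trees learned from data, and the record and thresholds are invented for illustration; only the majority vote itself reflects the ensemble step described above.

```python
from collections import Counter

# Three hypothetical "trees", each a simple rule on a patient record.
trees = [
    lambda r: "at risk" if r["age"] > 55 else "healthy",
    lambda r: "at risk" if r["cholesterol"] > 240 else "healthy",
    lambda r: "at risk" if r["systolic_bp"] > 140 else "healthy",
]


def forest_predict(trees, record):
    """Collect one vote per tree and return the most frequent class (the mode)."""
    votes = [tree(record) for tree in trees]
    return Counter(votes).most_common(1)[0][0], votes


record = {"age": 61, "cholesterol": 220, "systolic_bp": 150}
prediction, votes = forest_predict(trees, record)
print(votes, "->", prediction)
```

Using an odd number of trees for a two-class problem avoids tied votes.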
Even though decision trees are intuitive and easy to understand, they can be very
noisy. A few changes in the data can lead to different splits and completely different models.
This instability makes a single tree unreliable as a prediction model by itself. A single
decision tree generally overfits the data; that is, it can capture the structure
of the in-sample data very well, but it tends to work poorly out-of-sample. In statistical
terms, decision trees have low bias (they can fit the data well) but high variance (their
predictions are noisy).
Understanding the working principle of decision trees is imperative to understanding
the Random Forest algorithm. A popular algorithm for building decision trees is ID3. It
finds the attribute (feature) that best classifies the target attribute. One of the most
commonly used ways to identify the best attribute is to calculate the Information Gain, which
is, in turn, calculated using another property called Entropy.
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (2.1)$$
Here, c is the total number of classes and p_i is the proportion of examples belonging to
the i-th class. Information gain is simply the expected reduction in entropy caused by
partitioning all our examples according to a given attribute. Mathematically, it is defined as:
$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (2.2)$$
S refers to the entire set of examples that we have. A is the attribute we want to partition
or split. |S| is the number of examples and |Sv| is the number of examples for the current value
of attribute A. The attribute with the highest information gain sits at the root node, and the
tree is first split based on that attribute.
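Entropy and information gain can be computed in a few lines. The sketch below uses a tiny invented dataset in which the attribute smoker fully determines the target cvd, while the attribute sex carries no information about it; the attribute names and rows are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Expected reduction in entropy of `target` after splitting `rows` on `attribute`."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical examples.
rows = [
    {"smoker": "yes", "sex": "m", "cvd": "yes"},
    {"smoker": "yes", "sex": "f", "cvd": "yes"},
    {"smoker": "no", "sex": "m", "cvd": "no"},
    {"smoker": "no", "sex": "f", "cvd": "no"},
]

gains = {name: information_gain(rows, name, "cvd") for name in ("smoker", "sex")}
root_attribute = max(gains, key=gains.get)   # attribute with the highest gain
print(gains, root_attribute)
```

The attribute with the highest gain, here smoker, is the one that would be placed at the root node.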
XGBoost
XGBoost is another ensemble learning method. As it is almost never sufficient to rely upon
the results of just one model, it combines the predictive powers of multiple learners to reach a
conclusion. The base learners are weak learners in which the bias is high and the predictive
power is just slightly better than random guessing. But each of these weak learners adds some vital
information for prediction, and combining them effectively results in a strong learner.
The final strong learner brings down both the bias and the variance.
The tree ensemble model consists of a set of classification and regression trees (CART).
Figure 2.2 shows a simple example of a CART that classifies whether someone will like an
app or not. The original figure from [5] had been modified to paint a better picture of our
dataset.
Suppose the many app categories available are classified into different leaves and
assigned a score on the corresponding leaf. Unlike decision trees, in which the leaf only contains
decision values, in CART a real score is associated with each of the leaves, which gives a
better interpretation.
The task of training the model involves finding the parameters θ that best fit the
training data xi and labels yi. This is done via the objective function, which measures how well
the model fits the training data. Objective functions are composed of two parts, the training loss
and the regularization term, which can be denoted by:
$$\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta)$$
where L is the training loss function, and Ω is the regularization term. The regularization term
controls the complexity of the model, helping to avoid overfitting.
While trees are built in a parallel manner in bagging, boosting builds trees sequentially
such that each subsequent tree aims to reduce the errors of the previous tree. Figure 2.3
perfectly illustrates the concept. Due to each tree learning from its predecessors and updating the
residual errors (difference between an observed y-value and the corresponding predicted y-
value), the tree that grows next in the sequence will always learn from an updated version of
the residuals. This is known as an additive strategy where what has already been learned is
fixed, and a new tree is added one at a time.
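The additive strategy can be sketched with a small gradient boosting loop for squared loss, where the residuals are the quantity each new tree is fitted to. This is plain residual fitting with one-split trees (stumps) on invented one-dimensional data, with an assumed learning rate of 0.5; it is not XGBoost's actual algorithm, which adds regularization and further optimizations.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature, by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then add one stump at a time, each fitted to the residuals."""
    base = sum(ys) / len(ys)
    trees, predictions = [], [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)   # earlier trees stay fixed; only a new one is added
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

errors = [mse(boost(xs, ys, n_trees=n), xs, ys) for n in (0, 1, 20)]
print(errors)   # training error with 0, 1 and 20 trees
```

On this data the training error falls as trees are added, because each stump is fitted to what the previous trees left unexplained.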
The boosting process in its most basic form can be broken down into the following steps
[22]:
Fig. 2.3 Sequential Tree Structure
At each step, the residual would also need to be calculated: hm(x) = y − Fm(x) where hm(x)
can be any model, but in our case, it is a tree-based learner. With this in mind, suppose that
instead of training h0 on the residuals of F0, we train h0 on the gradient of the loss function,
L(y, F0(x)) with respect to the prediction values produced by Fm(x). With samples in hm
grouped into leaves, an average gradient can be calculated and then scaled by some factor, γ,
such that Fm + γhm minimizes the loss function for the samples in each leaf. In practice, a
different factor is chosen for each leaf. For iteration m = 1 to M:
Most of these are true for all gradient boosting algorithms that came before XGBoost, but
what really separates it from the others is [22]:
• Handling sparse data: Missing values or data processing steps like one-hot encoding
can make data sparse. XGBoost incorporates a sparsity-aware split finding algorithm
that can handle different types of sparsity patterns in the data.
• Weighted quantile sketch: Most existing tree-based algorithms can find the split points
when the data points are of equal weights (using a quantile sketch algorithm). However,
they cannot handle weighted data. XGBoost has a distributed weighted quantile sketch
algorithm that can effectively handle weighted data.
• Block structure for parallel learning: For faster computing, XGBoost can utilize
multiple cores on the CPU. Unlike other algorithms, this enables the data layout to be
reused by subsequent iterations, instead of computing it again.
• Out-of-core computing: This feature optimizes the available disk space and maximizes its
usage when handling huge datasets that do not fit into memory.
To improve model performance, add adversity so your model can learn to recognize your target
Train on fewer epochs to cut the processing time
Try a different activation function, like ReLU which only activates certain neurons,
making it more efficient compared to sigmoid or tanh
Try dropout so randomly selected neurons are ignored during training, thus creating
less network computation
Avoid large pixel images as adding more image clarity doesn’t improve learning
much (224 by 224 pixels is standard)
Logistic regression is a statistical method used for binary classification tasks, where the
outcome variable has only two possible classes. It's a type of regression analysis that
predicts the probability of occurrence of an event by fitting data to a logistic curve. Here's
the algorithm for logistic regression:
Given:
Pseudo code and algorithm
Input: Cardiovascular dataset
Initialize libraries
Split dataset
Train model on the training dataset
Save model
Predict result
Figure 5.1 System Architecture
3-Tier Architecture:
The three tier architecture is used when an effective distributed client/server design is
needed that provides (when compared to the two tier) increased performance, flexibility,
maintainability, reusability, and scalability, while hiding the complexity of distributed processing
from the user. These characteristics have made three layer architectures a popular choice for
Internet applications and net-centric information systems.
Advantages of Three-Tier:
Identification of actors:
Actor: Actor represents the role a user plays with respect to the system. An actor interacts with,
but has no control over the use cases.
Graphical representation:
Actor <<Actor name>>
Who is using the system? Or, who is affected by the system? Or, which groups need
help from the system to perform a task?
Who affects the system? Or, which user groups are needed by the system to perform
its functions? These functions can be both main functions and secondary functions
such as administration.
Which external hardware or systems (if any) use the system to perform tasks?
What problems does this application solve (that is, for whom)?
And, finally, how do users use the system (use case)? What are they doing with the
system?
The actors identified in this system are:
a. System Administrator
b. Customer
c. Customer Care
Identification of use cases:
Use case: A use case can be described as a specific way of using the system from a user’s
(actor’s) perspective.
Graphical representation:
Use cases provide a means to:
For each actor, find the tasks and functions that the actor should be able to perform or
that the system needs the actor to perform. The use case should represent a course of
events that leads to a clear goal.
Name the use cases.
Describe the use cases briefly by applying terms with which the user is familiar.
This makes the description less ambiguous.
A flow of events is a sequence of transactions (or events) performed by the system. It
typically contains very detailed information, written in terms of what the system should do, not
how the system accomplishes the task. Flows of events are created as separate files or documents
in a text editor and then attached or linked to a use case using the Files tab of a model
element.
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented as
use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors in
the system can be depicted.
Figure 5.2 Use Case Diagram
5.4.2 SEQUENCE DIAGRAMS:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.
Figure 5.3 Sequence Diagram
5.4.3 CLASS DIAGRAM:
5.4.4 ACTIVITY DIAGRAM:
ER DIAGRAM:
CHAPTER-6
SYSTEM REQUIREMENTS
6.1 SYSTEM REQUIREMENTS
6.1.1 HARDWARE REQUIREMENTS:
• System : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
• Hard Disk : 1 TB.
• Input Devices : Keyboard, Mouse
• RAM : 4 GB.
CHAPTER-7
SYSTEM IMPLEMENTATION
To conduct studies and analyses of an operational and technological nature, and to
promote the exchange and development of methods and tools for operational analysis as applied
to defense problems.
The logical design of a system pertains to an abstract representation of the data flows,
inputs and outputs of the system. This is often conducted via modeling, using an abstract
(and sometimes graphical) model of the actual system. Logical design includes
Entity Relationship (ER) diagrams.
The physical design relates to the actual input and output processes of the system. This is
laid down in terms of how data is input into a system, how it is verified / authenticated, how it is
processed, and how it is displayed as output. In physical design, the following requirements about
the system are decided.
1. Input requirement,
2. Output requirements,
3. Storage requirements,
4. Processing Requirements,
5. System control and backup or recovery.
Put another way, the physical portion of systems design can generally be broken down into three
sub-tasks:
User Interface Design is concerned with how users add information to the system and
with how the system presents information back to them. Data Design is concerned with how the
data is represented and stored within the system. Finally, Process Design is concerned with how
data moves through the system, and with how and where it is validated, secured and/or
transformed as it flows into, through and out of the system. At the end of the systems design
phase, documentation describing the three sub-tasks is produced and made available for use in
the next phase.
Physical design, in this context, does not refer to the tangible physical design of an
information system. To use an analogy, a personal computer's physical design involves input via
a keyboard, processing within the CPU, and output via a monitor, printer, etc. It would not
concern the actual layout of the tangible hardware, which for a PC would be a monitor, CPU,
motherboard, hard drive, modems, video/graphics cards, USB slots, etc. It involves a detailed
design of a user and a product database structure processor and a control processor. The H/S
personal specification is developed for the proposed system.
The input design is the link between the information system and the user. It comprises
developing specifications and procedures for data preparation, that is, the steps necessary to put
transaction data into a usable form for processing. This can be achieved by instructing the
computer to read data from a written or printed document, or by having people key the data
directly into the system. The design of input focuses on controlling the amount of input required,
controlling the errors, avoiding delay, avoiding extra steps and keeping the process simple. The
input is designed in such a way that it provides security and ease of use while retaining
privacy. Input Design considered the following things:
7.2.2 OBJECTIVES
Input Design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process and
show the correct direction to the management for getting correct information from the
computerized system.
It is achieved by creating user-friendly screens for data entry that can handle large volumes
of data. The goal of designing input is to make data entry easier and free from errors. The
data entry screen is designed in such a way that all data manipulations can be performed. It also
provides record viewing facilities.
When the data is entered, it is checked for validity. Data can be entered with the help of
screens. Appropriate messages are provided as and when needed so that the user is not left in a
maze at any instant. Thus the objective of input design is to create an input layout that is easy to follow.
A quality output is one which meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users and to
other systems through outputs. In output design it is determined how the information is to be
displayed for immediate need, and also as hard copy output. It is the most important and direct
source of information to the user. Efficient and intelligent output design improves the system's
relationship with the user and helps user decision-making.
a. Designing computer output should proceed in an organized, well thought out manner; the
right output must be developed while ensuring that each output element is designed so
that people will find the system easy and effective to use. When analysts design
computer output, they should identify the specific output that is needed to meet the
requirements.
b. Select methods for presenting information.
c. Create document, report, or other formats that contain information produced by the
system.
Code
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMAGE_SIZE = [224, 224]
train_path = "Dataset/"

# Rescale pixel values, augment with flips and zoom, and hold out 15% for validation
train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True, zoom_range=0.2,
                                   validation_split=0.15)
training_set = train_datagen.flow_from_directory(
    train_path, target_size=(224, 224),
    batch_size=32, class_mode='categorical',
    subset='training')
validation_set = train_datagen.flow_from_directory(
    train_path, target_size=(224, 224),
    batch_size=32, class_mode='categorical', shuffle=True,
    subset='validation')

# Initialise the input shape with 3 RGB channels and ImageNet weights;
# include_top=False drops the original classifier so that a custom output can be attached
mv = VGG19(input_shape=IMAGE_SIZE + [3], weights='imagenet', include_top=False)

# NOTE: the listing uses `prediction` without defining it. The head below is an
# assumption (a conventional flatten + softmax layer), not confirmed by the source.
x = Flatten()(mv.output)
prediction = Dense(training_set.num_classes, activation='softmax')(x)

model = Model(inputs=mv.input, outputs=prediction)
model.summary()

class myCallback(tf.keras.callbacks.Callback):
    # Stop training once the training loss drops to 0.05 or below
    def on_epoch_end(self, epoch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and loss <= 0.05:
            print("\nEnding training")
            self.model.stop_training = True

# Instantiate the callback
callbacks = myCallback()

# Compile the model with the Adam optimizer, categorical cross-entropy loss,
# and categorical accuracy as the metric
model.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
history = model.fit(training_set,
                    validation_data=validation_set,
                    epochs=50,
                    verbose=1,
                    steps_per_epoch=len(training_set),
                    validation_steps=len(validation_set),
                    callbacks=[callbacks])

# Collect the per-epoch metrics recorded during training
acc = history.history['categorical_accuracy']
val_acc = history.history['val_categorical_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

# Plot training and validation loss and save the figure
plt.plot(epochs, loss)
plt.plot(epochs, val_loss)
plt.title("Training and validation Loss")
plt.savefig('validationaccuracy.png')
model.save("cancer.h5")
#classes = model.predict(x)
#print (classes)
CHAPTER-8
SYSTEM TESTING
8.1 INTRODUCTION:
Testing and debugging the program is one of the most critical aspects of computer
programming; without a program that works, the system would never produce the
output for which it was designed. Testing is best performed when users are asked to
assist in identifying all errors and bugs. Sample data are used for testing. It is not the quantity
but the quality of the data used that matters in testing. Testing is aimed at ensuring that the system
works accurately and efficiently before live operation commences.
Testing objectives:
This examines the logic of the program. For example, the logic for updating various
sample data, together with the sample files and directories, was tested and verified.
Specification Testing:
This executes the specification stating what the program should do and how it should
perform under various conditions. Test cases for various situations and combinations of
conditions are run in all the modules.
Unit testing:
In unit testing we test each module individually and then integrate it with the overall system.
Unit testing focuses verification efforts on the smallest unit of software design, the module.
This is also known as module testing. Each module of the system is tested separately. This testing
is carried out during the programming stage itself. In this testing step each module is found to work
satisfactorily with regard to the expected output from the module. There are also some validation
checks for fields. For example, a validation check verifies the input given by the user
to establish the validity of the data entered. This makes it very easy to find errors and debug the system.
Black box testing is a software testing technique in which the functionality of the software
under test (SUT) is tested without looking at the internal code structure, implementation details
or internal paths of the software. This type of testing is based entirely on the
software requirements and specifications.
In black box testing we focus only on the inputs and outputs of the software system, without
concern for the internal workings of the program.
The black box can be any software system you want to test, for example an
operating system like Windows, a website like Google, a database like Oracle or even your own
custom application. Under black box testing, you can test these applications by focusing
on the inputs and outputs without knowing their internal code implementation.
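The idea of testing only through inputs and outputs can be sketched as follows. The function bmi is a hypothetical system under test chosen for illustration (it is not part of the project code); the test cases are derived purely from the specification "BMI = weight in kg divided by the square of height in m", not from the function body.

```python
def bmi(weight_kg, height_cm):
    """System under test: body mass index, rounded to one decimal place."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)


# Black box test cases: (inputs, expected output) taken from the specification.
CASES = [
    ((80, 200), 20.0),
    ((70, 175), 22.9),
    ((100, 100), 100.0),
]


def run_black_box_tests():
    """Return the list of failed cases; the function body is never inspected."""
    failures = []
    for inputs, expected in CASES:
        actual = bmi(*inputs)
        if actual != expected:
            failures.append((inputs, expected, actual))
    # Invalid input must be rejected, whatever the implementation looks like.
    try:
        bmi(70, 0)
        failures.append(((70, 0), "ValueError", "no error"))
    except ValueError:
        pass
    return failures
```

An empty list returned by run_black_box_tests() means every observed output matched the specification; the tester never needed to read the implementation.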
Black box testing - Steps
Here are the generic steps followed to carry out any type of Black Box Testing.
The main types of black box testing are:
Functional testing: This type of black box testing is related to the functional requirements of a
system; it is done by software testers.
Non-functional testing: This type of black box testing is not related to the testing of a
specific functionality, but to non-functional requirements such as performance, scalability and
usability.
Regression testing: Regression testing is done after code fixes, upgrades or any other
system maintenance to check that the new code has not affected the existing code.
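Regression testing in particular lends itself to automation. The sketch below assumes a hypothetical function normalise that is under maintenance and a baseline of outputs recorded from an earlier accepted version; both are invented for illustration.

```python
import json


def normalise(values):
    """Function under maintenance: scale values linearly into the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Baseline recorded from a previous, accepted version (hypothetical values).
BASELINE = json.loads('{"input": [50, 75, 100], "output": [0.0, 0.5, 1.0]}')


def regression_check():
    """Return True if the current code still reproduces the recorded baseline."""
    return normalise(BASELINE["input"]) == BASELINE["output"]
```

Running regression_check() after every fix or upgrade shows whether the change has altered behaviour that previously worked.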
White box testing is the testing of a software solution's internal coding and
infrastructure. It focuses primarily on strengthening security, the flow of inputs and outputs
through the application, and improving design and usability. White box testing is also known as
clear, open, structural, and glass box testing.
It is one of two parts of the "box testing" approach to software testing. Its counterpart,
black box testing, involves testing from an external or end-user perspective. White box
testing, on the other hand, is based on the inner workings of an application and revolves around
internal testing. The term "white box" is used because of the see-through box concept: the
clear box or white box name symbolizes the ability to see through the software's outer shell (or
"box") into its inner workings. Likewise, the "black box" in "black box testing" symbolizes not
being able to see the inner workings of the software, so that only the end-user experience can be
tested.
White box testing involves the testing of the software code for the following:
Internal security holes
Broken or poorly structured paths in the coding processes
The flow of specific inputs through the code
Expected output
The functionality of conditional loops
Testing of each statement, object and function on an individual basis
The testing can be done at the system, integration and unit levels of software development. One of
the basic goals of white box testing is to verify a working flow for an application. It involves
testing a series of predefined inputs against expected or desired outputs, so that when a specific
input does not result in the expected output, a bug has been found.
To give you a simplified explanation of white box testing, we have divided it into two
basic steps. This is what testers do when testing an application using the white box testing
technique:
The first thing a tester will often do is learn and understand the source code of the
application. Since white box testing involves the testing of the inner workings of an application,
the tester must be very knowledgeable in the programming languages used in the applications
they are testing. Also, the testing person must be highly aware of secure coding practices.
Security is often one of the primary objectives of testing software. The tester should be able to
find security issues and prevent attacks from hackers and naive users who might inject malicious
code into the application either knowingly or unknowingly.
The second basic step in white box testing involves testing the application's source code
for proper flow and structure. One way is to write more code to test the application's source
code. The tester develops small tests for each process or series of processes in the application.
This method requires that the tester have intimate knowledge of the code, and it is often done
by the developer. Other methods include manual testing, trial-and-error testing and the use of
testing tools.
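Writing code to test code, with knowledge of its internal paths, can be sketched as below. The function fill_missing is a hypothetical helper invented for this example (the project's own handling of missing values may differ); the test is written so that every branch of the function is executed at least once.

```python
def fill_missing(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not values:            # branch 1: empty input
        return []
    if not known:             # branch 2: nothing to average
        raise ValueError("no known values to compute a mean from")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]   # branch 3: normal path


def test_every_branch():
    # Branch 1: an empty list is returned unchanged.
    assert fill_missing([]) == []
    # Branch 2: a list with only missing values raises an error.
    try:
        fill_missing([None, None])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    # Branch 3: the missing entry takes the mean of 120 and 140.
    assert fill_missing([120, None, 140]) == [120, 130.0, 140]


test_every_branch()
```

Unlike a black box test, the three cases here were chosen by reading the conditions in the source code, so that no conditional path is left untested.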
Unit testing:
Items being tested: Whether the dataset features and labels are displayed
Remarks: Pass
Sl # Test Case: UTC2
Actual output: Based on the given test size, the data is divided and stored in train
and test sets
Remarks: Pass
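The two unit test cases above can also be expressed as automated checks. The sketch below assumes a hypothetical helper split_dataset that stands in for whatever splitting routine the project uses (for example, scikit-learn's train_test_split with its test_size parameter); the toy rows are made up for illustration and are not the project's data.

```python
import random

# Toy dataset: each row is (features, label). Values are made up.
ROWS = [((age, 120 + age % 7), age % 2) for age in range(30, 50)]


def split_dataset(rows, test_size, seed=0):
    """Shuffle rows and split them into (train, test) by the given fraction."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def test_features_and_labels_present():
    # Every row must expose a feature tuple and a binary label.
    for features, label in ROWS:
        assert len(features) == 2
        assert label in (0, 1)


def test_split_respects_test_size():
    # Corresponds to test case UTC2: data is divided according to the test size.
    train, test = split_dataset(ROWS, test_size=0.25)
    assert len(test) == 5 and len(train) == 15
    # No row is lost or duplicated by the split.
    assert sorted(train + test) == sorted(ROWS)


test_features_and_labels_present()
test_split_respects_test_size()
```

With 20 rows and a test size of 0.25, the check expects 5 test rows and 15 training rows, and confirms that the two sets together still contain every original row.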
Integration Testing:
Integration testing is a level of software testing where individual units are combined and
tested as a group. The purpose of this level of testing is to expose faults in the interaction
between integrated units. Test drivers and test stubs are used to assist in integration testing.
Integration testing is defined as the testing of combined parts of an application to determine if
they function correctly. It occurs after unit testing and before validation testing. Integration
testing can be done in two ways: bottom-up integration testing and top-down integration
testing.
Bottom-up Integration
This testing begins with unit testing, followed by tests of progressively higher-level
combinations of units called modules or builds.
Top-down Integration
In this testing, the highest-level modules are tested first, and progressively lower-level modules
are tested thereafter.
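The role of test drivers and test stubs can be sketched as follows. All names here (preprocess, ModelStub, predict_risk, integration_driver) are hypothetical and chosen for illustration; the stub returns canned answers in place of a trained classifier, which is how a higher-level module can be tested top-down before the lower-level one is ready.

```python
def preprocess(record):
    """Lower-level unit: turn a raw record into a numeric feature list."""
    return [float(record["age"]), float(record["systolic_bp"])]


class ModelStub:
    """Test stub standing in for a trained classifier that is not yet integrated."""

    def predict(self, features):
        # Canned rule, not a real model: flag when the second feature exceeds 140.
        return 1 if features[1] > 140 else 0


def predict_risk(record, model):
    """Higher-level unit: combines preprocessing with a model's prediction."""
    return model.predict(preprocess(record))


def integration_driver():
    """Test driver: feeds records through the integrated units and collects results."""
    records = [
        {"age": "45", "systolic_bp": "120"},
        {"age": "60", "systolic_bp": "165"},
    ]
    return [predict_risk(r, ModelStub()) for r in records]
```

A fault in the interaction between the units, such as preprocess returning the features in a different order than the model expects, would show up in the driver's results even if each unit passed its own unit tests.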
Name of Test: Train Model
Remarks: Pass.
Sl # Test Case : ITC2
Remarks: Pass.
System testing:
System testing is the first level of testing in the Software Development Life Cycle at which the
application is tested as a whole.
The application is tested thoroughly to verify that it meets the functional and technical
specifications.
System testing enables us to test, verify, and validate both the business requirements and
the application architecture.
Remarks: Pass
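A system-level check exercises the whole path from raw data to a prediction in one run. The sketch below is only an illustration under stated assumptions: the CSV content is made up, the function names are hypothetical, and a plain one-nearest-neighbour rule written in standard Python stands in for the project's trained models.

```python
import csv
import io

# Tiny made-up dataset in CSV form; not the project's data.
RAW_CSV = """age,systolic_bp,label
40,110,0
42,115,0
45,118,0
60,160,1
63,170,1
66,165,1
"""


def load_rows(text):
    """Parse CSV text into (features, label) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [([float(r["age"]), float(r["systolic_bp"])], int(r["label"])) for r in reader]


def predict_1nn(train, features):
    """Return the label of the closest training row (squared Euclidean distance)."""
    def distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=distance)[1]


def run_system(text, new_patient):
    """Whole application path: load the data, keep it as the model, predict one case."""
    train = load_rows(text)
    return predict_1nn(train, new_patient)
```

Calling run_system(RAW_CSV, [41.0, 112.0]) passes through loading, model construction and prediction in a single step, so a failure in any stage or in the hand-over between stages is detected.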
CHAPTER-9
OUTPUT SCREENS
9.1 DATASET SCREEN
CONCLUSION
In conclusion, this study demonstrates the potential of machine learning techniques in advancing the
early detection of cardiovascular disease (CVD). By leveraging diverse datasets and employing
algorithms such as support vector machines (SVM), decision trees (DT), k-nearest neighbors (KNN),
gradient boosting (GB), and random forest classifiers (RFC), we have shown significant improvements in
predictive accuracy and reliability.
Through rigorous data preprocessing, including outlier removal and the handling of missing values, we have
optimized the models' performance, enhancing their ability to accurately identify individuals at risk of
developing CVD. These findings underscore the importance of robust data preprocessing techniques in
improving the effectiveness of machine learning algorithms for medical diagnosis and prognosis.
Moving forward, further research is needed to validate the performance of these models in real-world
clinical settings and to explore additional features and data sources that may further enhance predictive
accuracy. Additionally, efforts should be made to ensure the accessibility and interpretability of these
models for healthcare professionals, facilitating their integration into clinical decision-making processes.
Ultimately, by advancing the early detection of CVD, these machine learning approaches have the
potential to save lives, reduce healthcare costs, and alleviate the burden of cardiovascular disease on
individuals and healthcare systems worldwide.