Lung Cancer Project
CHAPTER ONE
INTRODUCTION
The advent of bronchoscopy, which allows direct visualization of the airways, further facilitated the diagnosis and staging of lung cancer (Paradis et al., 2016).
Treatment options for lung cancer have evolved significantly throughout history. In the early
days, surgical resection was the primary approach, with the goal of removing the tumor and
surrounding tissues. However, the effectiveness of surgery was limited, as many cases were
diagnosed at advanced stages when the cancer had spread beyond the lungs. The introduction of
radiation therapy in the 20th century expanded treatment possibilities, offering an alternative or
complementary option to surgery (Gianfaldoni et al., 2017).
Chemotherapy, which involves the use of drugs to kill cancer cells, revolutionized lung cancer
treatment in the 1940s (Bagnyukova et al., 2010). Initially, chemotherapy drugs had limited
efficacy and were associated with significant side effects. However, ongoing research and
clinical trials led to the development of more targeted and effective chemotherapy regimens. In
recent years, the emergence of targeted therapies and immunotherapy has further transformed the
treatment landscape for specific subsets of lung cancer patients.
Targeted therapies are designed to specifically target and inhibit the growth of cancer cells with
specific genetic mutations (Waarts et al., 2022). By identifying the genetic abnormalities driving
the cancer, doctors can tailor treatment to individual patients, improving response rates and
reducing side effects. Immunotherapy, on the other hand, harnesses the body's immune system to
recognize and attack cancer cells (Papaioannou et al., 2016). It has shown remarkable success in
some patients by unleashing the immune system's potential to fight lung cancer.
In addition to these treatment modalities, palliative care has gained prominence in recent years to
improve the quality of life for lung cancer patients. Palliative care focuses on alleviating
symptoms, managing pain, and providing emotional support to patients and their families. It aims
to address the physical, psychological, and social aspects of living with a serious illness,
promoting a holistic approach to patient care.
The understanding and classification of lung cancer have evolved over time. In the mid-20th
century, lung cancer was classified into non-small cell lung cancer (NSCLC) and small cell lung
cancer (SCLC) (Gazdar, 2010). Although this classification system remains widely used, further
subtypes and molecular classifications have been identified, leading to targeted therapies and
personalized medicine approaches. Diagnostic techniques for lung cancer have significantly
improved over the years. Early methods such as chest X-rays and bronchoscopies allowed for the
identification of lung tumors. The development of computed tomography (CT) scans in the
1970s revolutionized lung cancer detection by providing detailed cross-sectional images (Sharma
et al., 2015). Subsequent advancements introduced positron emission tomography (PET) scans
and magnetic resonance imaging (MRI), enhancing diagnostic accuracy (Kure et al., 2021).
Liquid biopsy techniques, such as circulating tumor DNA analysis, have recently shown promise
for non-invasive detection and monitoring of lung cancer (Lone et al., 2022).
Surgery has been a primary treatment for localized lung cancer since the 1930s. Innovations in
surgical techniques, such as video-assisted thoracic surgery (VATS) and robotic-assisted
procedures, have improved patient outcomes (Choe et al., 2020). Radiation therapy, used since
the early 1900s, has also advanced with intensity-modulated radiation therapy (IMRT) and
stereotactic body radiation therapy (SBRT) (Yu et al., 2014). Chemotherapy and targeted
therapies have significantly improved survival rates for both NSCLC and SCLC patients (Musika
et al., 2021). Immunotherapy, particularly immune checkpoint inhibitors, has emerged as a
groundbreaking treatment option, leveraging the body's immune system to combat cancer
cells. Recognition of the smoking-lung cancer link has led to widespread public health
campaigns to reduce tobacco consumption. Smoke-free policies, tobacco taxation, and anti-
smoking campaigns have contributed to declining smoking rates (Levy et al., 2016). Screening
programs utilizing low-dose CT scans have been implemented for high-risk populations,
enabling early detection (Yip et al., 2021). Efforts to improve air quality, reduce occupational
exposures, and promote healthy lifestyles continue to be crucial in preventing lung cancer.
The history of lung cancer reveals a complex journey marked by significant advancements in
understanding, diagnosis, treatment, and prevention. From ancient observations to modern
molecular classifications and personalized therapies, our knowledge of lung cancer has expanded
exponentially. While challenges remain, such as late-stage diagnoses and limited treatment
options for advanced cases, ongoing research and public health initiatives offer hope for
continued progress. Through a comprehensive approach encompassing prevention, early
detection, and innovative treatments, we strive to reduce the burden of lung cancer and improve
patient outcomes in the future.
Lung cancer is a deadly disease with challenges in diagnosis and treatment. Machine learning, a
branch of artificial intelligence, has shown promise in various aspects of lung cancer care. It has
been used to analyze medical imaging data for early detection and classification of lung nodules.
Machine learning algorithms can assist in diagnosis by considering patient data and predicting
tumor characteristics. Treatment planning can be improved with machine learning models that
generate personalized treatment recommendations. Furthermore, machine learning can predict
patient outcomes and survival rates by analyzing diverse patient data. However, challenges such
as data quality, privacy, and validation need to be addressed. Overall, machine learning has the
potential to revolutionize lung cancer care and improve patient outcomes.
This research will adopt a quantitative research design to develop and evaluate predictive models
for lung cancer detection. The study will involve the collection and analysis of a dataset
containing patient symptoms obtained from Kaggle. Two algorithms, Support Vector Machines
(SVM) and Convolutional Neural Networks (CNN), will be utilized for developing predictive
models. The dataset will be checked to ensure that it includes relevant information such as
demographic details, clinical histories, and medical imaging data. Preprocessing will involve
cleaning the data, handling missing values, normalizing or standardizing the features, and
ensuring overall data quality. Categorical variables will be encoded appropriately, and feature
selection techniques may be employed to reduce dimensionality if necessary.
Two predictive models, SVM and CNN, will be developed for lung cancer detection using the
preprocessed dataset. SVM, a supervised learning algorithm, will be trained on the dataset to
learn patterns and classify instances into lung cancer or non-lung cancer classes. CNN, a deep
learning algorithm, will be implemented to extract meaningful features from the medical imaging
data and classify lung nodules. The dataset will be divided into training and validation sets. The
SVM model will be trained on the training set, with its hyperparameters tuned through
techniques such as cross-validation. Similarly, the CNN model will be trained on the training set
and fine-tuned to optimize its performance. The trained models will then be evaluated on the
validation set to assess their predictive performance.
Evaluation metrics such as accuracy, sensitivity, specificity, and F1 score will be calculated to
measure the models' effectiveness in detecting lung cancer. Comparative analysis between the
SVM and CNN models will be performed to understand their respective strengths and
limitations. The results obtained from the evaluation of the predictive models will be analyzed
and interpreted. The performance of the SVM and CNN models will be compared to determine
their effectiveness in lung cancer detection. Python, a popular programming language for
artificial intelligence and machine learning, will be used to develop the predictive models. The
models will be developed in Jupyter Notebook, with a user interface built using HTML, CSS,
and JavaScript, and the system will store its data in a MySQL database accessible to patients and
the general public.
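To make the evaluation plan concrete, the following is a minimal sketch of how these metrics can be computed with scikit-learn. The label arrays shown are hypothetical stand-ins for validation-set outputs; sensitivity and specificity are derived from the confusion matrix, since scikit-learn does not expose them directly under those names.

    # Hedged sketch: accuracy, sensitivity, specificity, and F1 score for a
    # binary lung cancer classifier; y_true and y_pred are hypothetical 0/1 labels.
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground-truth labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = accuracy_score(y_true, y_pred)
    sensitivity = tp / (tp + fn)        # true positive rate (recall for the cancer class)
    specificity = tn / (tn + fp)        # true negative rate
    f1 = f1_score(y_true, y_pred)
    print(accuracy, sensitivity, specificity, f1)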
Further analysis will focus on identifying the strengths and weaknesses of each model and
potential areas for improvement. The findings of the research will be discussed in the context of
existing literature and previous studies on lung cancer detection. The limitations of the study and
possible future research directions will be outlined. The research will conclude with a summary
of the study's contributions, implications, and recommendations for the implementation of
predictive models for lung cancer detection in clinical settings.
1.5 Significance of the Study
The study on predictive models for lung cancer detection is important for multiple reasons. Lung
cancer is a widespread and deadly disease, causing numerous deaths annually. Early detection is
crucial for improving patient outcomes. Predictive models can help healthcare professionals
identify individuals at higher risk, enabling timely interventions. Using a symptom dataset in
predictive modeling offers advantages over traditional diagnostic methods. It reduces the need
for expensive, time-consuming, and invasive tests, lowering costs and saving time. Predictive
models can improve detection accuracy by analyzing large amounts of patient data and
identifying patterns and correlations. This aids in identifying high-risk individuals who may need
further testing. Additionally, predictive models contribute to precision medicine by tailoring
screening and diagnosis based on individual risk factors and symptoms. The study has the
potential to revolutionize lung cancer screening and diagnosis, improving accuracy, reducing
costs, and enabling early interventions for better patient outcomes and survival rates.
The models will be developed using Support Vector Machines (SVM) and Convolutional Neural
Networks (CNN). Additionally, the study will include a thorough evaluation of the performance
of the developed predictive models. This evaluation process will employ appropriate metrics to
assess accuracy, sensitivity, specificity, and overall predictive power. To validate their
effectiveness, the models will be compared against existing diagnostic methods or medical
expert opinions. By utilizing predictive models for lung cancer detection, this study aims to
provide a more efficient and cost-effective alternative to traditional diagnostic approaches. The
potential impact of this research lies in its ability to aid healthcare professionals in identifying
individuals at high risk of lung cancer, enabling early detection and timely intervention,
ultimately leading to improved patient outcomes.
Lung cancer: A malignant tumor that originates in the lungs. It is the leading cause of
cancer death in the world.
Surgeons: Doctors who specialize in surgery. They are responsible for performing
operations to diagnose and treat diseases.
Targeted therapies: Drugs that are designed to specifically target and inhibit the growth
of cancer cells with specific genetic mutations.
Immunotherapy: Treatment that harnesses the body's immune system to recognize and
attack cancer cells.
Palliative care: Care that focuses on alleviating symptoms, managing pain, and
providing emotional support to patients and their families.
Machine learning: A type of artificial intelligence that allows computers to learn from
data and make predictions.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
Lung cancer is a devastating disease that has affected millions of lives worldwide, making it one
of the leading causes of cancer-related deaths (Thandra et al., 2021). It is believed to have
originated in ancient times, although it was not until the late 19th and early 20th centuries that its
connection to smoking became more apparent (Sanchez-Ramos, 2020). The sharp rise in tobacco
consumption during the 20th century significantly contributed to the lung cancer epidemic,
leading to its prevalence becoming a public health concern (Proctor, 2012). With growing
awareness of the link between smoking and lung cancer, efforts were initiated to mitigate
smoking-related risks. However, the disease's complexity and lack of early-stage diagnostic
methods continued to challenge medical professionals.
While smoking remains the most significant risk factor for lung cancer, exposure to secondhand
smoke, occupational hazards such as asbestos and radon, air pollution, and genetic predisposition
also play pivotal roles (Cheng et al., 2021). Smoking introduces harmful carcinogens into the
lungs, causing chronic inflammation and DNA damage, which can lead to uncontrolled cell
growth and the formation of tumors (Hussain et al., 2019). Non-smokers can also develop lung
cancer due to these other risk factors.
Lung cancer can manifest in various forms, primarily non-small cell lung cancer (NSCLC) and
small cell lung cancer (SCLC), each having distinct growth patterns and treatment responses
(Rudin et al., 2021). Symptoms may not appear until the disease has progressed to advanced
stages, leading to reduced treatment success rates and increased mortality. Common symptoms
include persistent cough, shortness of breath, chest pain, unexplained weight loss, and recurrent
respiratory infections. Lung cancer not only affects physical health but also has profound
psychological and emotional impacts on patients and their families.
Historically, lung cancer diagnosis has relied on traditional methods such as chest X-rays,
computed tomography (CT) scans, bronchoscopy, and biopsy (Folch et al., 2015). While these
approaches have been effective to some extent, they often lack the sensitivity and specificity
required for early detection. Early-stage tumors can be challenging to identify accurately, leading
to delayed diagnosis and reduced chances of successful treatment.
Furthermore, machine learning (ML) models can be trained to analyze patient health records, lifestyle habits, and
genetic information to assess an individual's lung cancer risk and recommend personalized
screening and prevention strategies (Dritsas & Trigka, 2022). This approach not only aids in
early detection but also empowers patients to make informed decisions about their health. While
the integration of machine learning in lung cancer detection holds great promise, several
challenges must be addressed. One major challenge is the need for large and diverse datasets to
train accurate and generalizable ML models. Additionally, the ethical use of patient data and the
establishment of robust data privacy measures are crucial considerations.
Moreover, the integration of ML into the clinical workflow requires collaboration between
medical experts and data scientists. This interdisciplinary approach ensures that ML algorithms
are validated, interpretable, and clinically relevant. Furthermore, regulatory approvals and
adoption by healthcare institutions are essential steps in transitioning from research to practical
clinical implementation.
The quest for early detection and improved outcomes in lung cancer has been ongoing for
decades. The emergence of machine learning offers an unprecedented opportunity to transform
lung cancer detection and diagnosis. By harnessing the power of data and advanced algorithms,
this project aims to pave the way for a future where early detection becomes a reality,
significantly improving the prognosis and survival rates for lung cancer patients worldwide.
With continued research, collaboration, and innovation, we can bring this vision to life and make
substantial strides in the battle against lung cancer.
Lung cancer is a complex disease with multifactorial origins and etiology. It arises when
abnormal cells grow uncontrollably in the lungs, forming tumors that can interfere with lung
function and, if left untreated, can spread to other parts of the body. The origins and etiology of
lung cancer involve a combination of genetic, environmental, and lifestyle factors.
i. Smoking and Tobacco Use: The most significant risk factor for developing lung cancer
is smoking tobacco, including cigarettes, cigars, and pipes. Smoking is responsible for
about 85% of all lung cancer cases (Warren & Cummings, 2013). Tobacco smoke
contains numerous carcinogens that damage lung cells and lead to genetic mutations,
promoting the development of cancerous cells.
ii. Secondhand Smoke: Exposure to secondhand smoke, also known as passive smoking,
can increase the risk of lung cancer, particularly in non-smokers who live or work with
smokers. Secondhand smoke contains many of the same harmful substances as direct
smoke and can contribute to lung cancer development (Kim et al., 2018).
iii. Radon Gas: Radon is a naturally occurring radioactive gas that can seep into homes and
buildings from the ground. Prolonged exposure to high levels of radon is a significant
risk factor for lung cancer, especially for individuals who smoke (Riudavets et al., 2022).
iv. Occupational and Environmental Exposures: Certain workplace exposures to
carcinogens like asbestos, arsenic, chromium, nickel, and diesel exhaust can increase the
risk of lung cancer (Spyratos et al., 2013). Additionally, exposure to air pollution,
industrial emissions, and certain chemicals may contribute to lung cancer development.
v. Genetic Predisposition: While smoking is the primary cause of lung cancer, some
individuals may have a genetic predisposition that increases their susceptibility to the
disease (Hjelmborg et al., 2016). Certain inherited genetic mutations can elevate the risk
of developing lung cancer, even in individuals who have never smoked.
vi. Pre-existing Lung Conditions: People with pre-existing lung conditions, such as
chronic obstructive pulmonary disease (COPD) or pulmonary fibrosis, have a higher risk
of developing lung cancer (Schwartz et al., 2016).
vii. Radiation Exposure: Exposure to high doses of ionizing radiation, such as during
certain medical procedures or radiation therapy for other cancers, can increase the risk of
developing lung cancer (Ali et al., 2020).
ii. Small Cell Lung Cancer (SCLC): SCLC, also known as oat cell carcinoma, accounts
for about 15% of all lung cancers. It is more aggressive and fast-growing than NSCLC
and is more likely to have already spread to other parts of the body at the time of
diagnosis. SCLC is strongly associated with smoking, and it typically responds well to
chemotherapy and radiation therapy in the initial stages.
It is essential to diagnose the specific type of lung cancer accurately, as treatment decisions are
often based on this classification. Other rare types of lung cancer, such as carcinoid tumors, may
also occur, but they are less common than NSCLC and SCLC.
Effects:
i. Decreased Lung Function: As lung cancer grows, it can block air passages in the lungs,
reducing the organ's capacity to exchange oxygen and carbon dioxide efficiently. This
leads to shortness of breath and decreased lung function (Han & Mallampalli, 2015).
ii. Metastasis: Lung cancer can spread to other parts of the body, such as the bones, brain,
liver, and other organs, through the lymphatic system or bloodstream. This advanced
stage is known as metastatic or stage IV lung cancer and can have severe effects on
various organ functions (Popper, 2016).
iii. Cachexia: Advanced lung cancer can cause a condition called cachexia, characterized by
severe weight loss, muscle wasting, weakness, and fatigue (Ni & Zhang, 2020). Cachexia
is a result of the body's response to cancer, and it can further weaken the patient.
iv. Paraneoplastic Syndromes: Lung cancer can trigger the production of hormones or
immune system responses that affect other organs, leading to various paraneoplastic
syndromes (Fletcher, 2021). These syndromes can cause neurological symptoms,
hormonal imbalances, and other systemic effects.
Symptoms:
i. Persistent Cough: A persistent or chronic cough is one of the most common symptoms of
lung cancer. The cough may be dry or produce phlegm and might worsen over time.
ii. Shortness of Breath: As the tumor grows or blocks air passages, patients may experience
increasing shortness of breath, especially during physical activity.
iii. Chest Pain: Lung cancer can cause chest pain that may be dull, constant, or intermittent.
The pain may worsen with deep breathing, coughing, or laughing.
iv. Coughing up Blood: Hemoptysis, or coughing up blood or blood-streaked mucus, is
another common symptom of lung cancer.
v. Hoarseness: If the cancer affects the recurrent laryngeal nerve, it can lead to hoarseness
or changes in the voice.
vi. Unexplained Weight Loss: Significant and unexplained weight loss can occur due to the
effects of cancer on metabolism and appetite.
vii. Fatigue: Lung cancer patients often experience severe fatigue, which can be a result of
the disease itself, treatments, or other related factors.
viii. Recurrent Respiratory Infections: Some lung cancer patients may be more susceptible to
respiratory infections due to a weakened immune system.
Diagnosis:
i. Medical History and Physical Examination: The process begins with a comprehensive
medical history, where the doctor will ask about the patient's symptoms, risk factors
(such as smoking history or exposure to certain chemicals), and any relevant family
history of cancer. A physical examination is also conducted to assess the patient's overall
health.
ii. Imaging Tests: Several imaging tests are commonly used to visualize the lungs and
detect any abnormalities (Gargani & Volpicelli, 2014). These may include:
X-rays: Simple chest X-rays can reveal the presence of masses or abnormal
nodules in the lungs (Khan et al., 2011). However, they may not be sufficient to
confirm the diagnosis or provide detailed information about the cancer.
CT Scan (Computed Tomography): CT scans are more detailed than X-rays and
can provide a clearer view of the lungs, helping to identify smaller nodules or
tumors and evaluate whether the cancer has spread to nearby lymph nodes or
other organs.
MRI (Magnetic Resonance Imaging): In some cases, an MRI may be used to get a
more precise image of the lungs and surrounding areas (Kumar et al., 2016).
PET Scan (Positron Emission Tomography): PET scans help determine the
metabolic activity of lung nodules or masses (Lai et al., 2022). This information is
valuable in distinguishing between benign and malignant growths and in staging
the cancer.
iii. Biopsy: A biopsy is the definitive way to diagnose lung cancer. It involves obtaining a
small sample of tissue from the lung for examination under a microscope (Holland,
2021). There are different types of biopsies:
Needle Biopsy: A thin, hollow needle is inserted into the lung tissue to obtain a
sample. This can be done through the skin (percutaneous biopsy) or with the
guidance of imaging techniques such as CT or ultrasound.
Bronchoscopy: A thin, flexible tube with a camera on the end (bronchoscope) is
inserted through the mouth or nose into the airways. The doctor can then collect
small tissue samples from the lungs for analysis.
Thoracoscopy or Mediastinoscopy: In some cases, a surgical procedure may be
necessary to access and biopsy hard-to-reach areas in the chest cavity.
iv. Laboratory Tests: After obtaining a tissue sample, it is sent to a pathology laboratory
for examination. A pathologist will analyze the sample to determine whether cancer is
present and, if so, the type and subtype of lung cancer.
v. Staging: Once lung cancer is confirmed, the next step is staging. Staging involves
determining the extent of the cancer, including its size, location, and whether it has
spread to nearby lymph nodes or other parts of the body (Mirsadraee et al., 2012). This
information is crucial for determining the most appropriate treatment plan.
vi. Molecular Testing: For some types of lung cancer, molecular testing may be performed
on the tissue sample to identify specific genetic mutations or biomarkers (Shim et al.,
2017). These results can help guide targeted therapies or immunotherapies.
vii. Additional Tests: Depending on the stage and type of lung cancer, additional tests may
be conducted to assess the patient's overall health and their ability to undergo specific
treatments. These tests may include lung function tests, blood tests, and other imaging
studies.
The diagnosis of lung cancer can be a complex process, and it is important to have a
multidisciplinary team of healthcare professionals, including oncologists, pathologists,
radiologists, and pulmonologists, working together to provide an accurate diagnosis and develop
an appropriate treatment plan for the patient. As always, early detection and timely intervention
play a critical role in improving the prognosis for individuals with lung cancer.
Treatment:
i. Surgery: Surgery is often the preferred treatment for early-stage non-small cell lung
cancer (NSCLC) and some limited-stage small cell lung cancer (SCLC). During surgery,
the tumor and surrounding lymph nodes are removed to prevent further spread of cancer
(Doerr et al., 2022). The type of surgery performed depends on the tumor's size, location,
and extent.
ii. Radiation Therapy: Radiation therapy uses high-energy X-rays or other particles to
target and destroy cancer cells (Baskar et al., 2014). It can be used as the primary
treatment for early-stage lung cancer in patients who are not candidates for surgery or as
an adjuvant therapy after surgery to reduce the risk of cancer recurrence. In advanced
cases, radiation therapy can help relieve symptoms and improve the quality of life.
iii. Chemotherapy: Chemotherapy is a systemic treatment that uses drugs to kill cancer cells
or prevent them from dividing and growing (Krans, 2021). It is commonly used for both
NSCLC and SCLC, either as the main treatment or in combination with surgery or
radiation. Chemotherapy can be given before or after surgery (neoadjuvant or adjuvant
chemotherapy) or in advanced cases to slow tumor growth and alleviate symptoms.
iv. Targeted Therapy: Targeted therapies are drugs that specifically target certain genetic
mutations or proteins that promote the growth of cancer cells (Wilkes, 2018). These
therapies are mainly used in NSCLC when specific genetic mutations are present. They
can be more effective and have fewer side effects compared to traditional chemotherapy.
v. Immunotherapy: Immunotherapy is a type of treatment that helps the body's immune
system recognize and attack cancer cells. It has shown significant promise in treating
both NSCLC and SCLC, especially in cases where other treatments have been
unsuccessful (Easton, 2017). Immune checkpoint inhibitors are the most common form of
immunotherapy used for lung cancer.
vi. Palliative Care: Palliative care focuses on providing relief from symptoms and
improving the quality of life for patients with advanced lung cancer (Tan &
Ramchandran, 2020). It is not curative but can help manage pain, shortness of breath,
fatigue, and other symptoms associated with the disease.
vii. Clinical Trials: Clinical trials are research studies that test new treatments or
combinations of treatments to evaluate their safety and effectiveness. Patients with lung
cancer may consider participating in clinical trials as a potential option when other
treatments have not been successful or to access cutting-edge therapies (Nielsen et al.,
2020).
The choice of treatment depends on the specific characteristics of the lung cancer, the stage of
the disease, the patient's overall health, and their preferences. A multidisciplinary team, including
oncologists, surgeons, radiation oncologists, and other specialists, works together to develop a
personalized treatment plan for each patient. It's essential for patients to discuss their treatment
options thoroughly with their healthcare team and make informed decisions based on their
individual circumstances.
Machine learning refers to the ability of computers to learn from data how to carry out tasks, as
opposed to being manually programmed with a specific set of instructions to carry out a certain
operation (Copeland, 2021). The method through which computers can take an algorithm (also
known as rules-based programming) and improve it as they interact with more data is called
machine learning. With machine learning, computers can be supplied with terabytes and
petabytes of data so that they can discover distinctions and develop their own algorithms, built
on the underlying human-driven programming, to achieve the desired result; it is widely used,
for example, to detect hate speech. Machine learning draws on artificial intelligence (AI) to
enable systems to learn automatically from experience and improve over time without being
explicitly designed. It is the process of creating computer programs that can access data and
utilize it to learn for themselves. The learning process begins with observations or data, such as
examples, firsthand experience, or instruction, in order to uncover patterns in the data and
improve future decisions based on the examples provided. The basic objective is to enable
computers to learn independently of humans and modify their behavior accordingly.
i. Image and speech recognition: Machine learning algorithms are used in image and
speech recognition software to identify patterns and features in images and audio signals.
For example, facial recognition technology can be used in security and surveillance
systems, while speech recognition technology can be used in personal assistants and
voice-controlled devices.
ii. Natural language processing (NLP): Machine learning algorithms are used in NLP to
analyze and understand human language, enabling computers to perform tasks such as
language translation, sentiment analysis, and chatbot interactions.
iii. Recommender systems: Machine learning algorithms can be used to analyze user
behavior and preferences to recommend products, services, and content that users are
likely to be interested in. Examples of such systems include movie and music
recommendation systems, as well as personalized product recommendations on e-
commerce websites.
iv. Fraud detection: Machine learning algorithms can be used to detect fraudulent behavior
in financial transactions and other contexts, helping to prevent fraud and improve
security.
v. Healthcare: Machine learning can be used in healthcare to analyze medical data and help
diagnose diseases, identify risk factors, and develop treatment plans.
vi. Predictive maintenance: Machine learning algorithms can be used to analyze data from
sensors and other sources to predict when maintenance will be required on machines and
equipment, helping to prevent breakdowns and reduce downtime.
vii. Autonomous vehicles: Machine learning algorithms are used in self-driving cars to
identify obstacles, detect traffic patterns, and make real-time driving decisions.
viii. Financial analysis: Machine learning algorithms can be used in financial analysis to
identify patterns and trends in data, analyze market behavior, and predict future
outcomes.
Overall, machine learning has a wide range of applications across industries and fields, enabling
businesses and organizations to improve efficiency, accuracy, and decision-making capabilities.
Figure: Categories of machine learning tasks, comprising classification and regression (supervised learning) and clustering and association (unsupervised learning).
The goal of supervised learning is to identify the rules that translate a set of inputs to a set of
outputs. Supervised machine learning algorithms can use what they have learned in the past to
predict future events from labeled examples and fresh data. Starting from the analysis of a
known training dataset, the learning algorithm constructs an inferred function to predict the
output values. After receiving enough training, the system is capable of producing targets for
any new input. The learning algorithm can also compare its results to the desired, accurate
results in order to find faults and make any necessary model corrections. The learning process is
supervised because labeled example data from prior input-output pairs are given to the model to
teach it how to behave. Supervised learning therefore describes the machine learning task of
learning a function that maps an input to an output based on sample input-output pairs; the
function is inferred from labeled training data consisting of a collection of training instances.
There are two types of supervised learning: classification and regression.
Classification
Classification divides similar data points into distinct sections. Machine learning is used to find
the rules that specify how to separate the different data points. Simply put, classification
approaches look for the optimal way to draw a line that divides the data points. The distinctions
between classes are marked by decision boundaries, and the entire area selected to designate a
class is referred to as the decision surface. A data point is classified into a certain class if it falls
within the borders of that class's decision surface.
Examples of Classification Algorithms
i. Naïve Bayes
ii. Support Vector Machine
iii. Stochastic Gradient Descent
iv. Logistic Regression
v. Random Forest
Regression
Regression algorithms are a family of supervised machine learning algorithms. A fundamental
feature of supervised learning algorithms is their capacity to forecast the value of new data by
modeling the dependencies and interactions between the intended output and the input
properties. Regression algorithms forecast the output values using the input attributes of the data
fed into the system: the algorithm builds a model from the characteristics of the training data and
uses that model to predict the value of new data.
Regression produces a number rather than a class, which is how it differs from classification.
Regression is the process of determining how strongly independent and dependent variables are
correlated. It assists in making predictions about continuous variables like market trends,
home prices, and so forth. The method predicts a single output value using training data. On the
basis of training data, regression can be used to forecast house prices, for instance. The input
variables will be location, home size, and so forth.
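As a simple illustration of the house-price example above, the following sketch fits a linear regression with scikit-learn; the sizes and prices are invented for demonstration only.

    # Minimal regression sketch: predicting house price from size.
    from sklearn.linear_model import LinearRegression

    X = [[50], [80], [100], [120], [150]]               # house size in square metres
    y = [100_000, 155_000, 195_000, 230_000, 290_000]   # hypothetical prices

    model = LinearRegression().fit(X, y)
    print(model.predict([[110]]))   # estimated price for a 110 m^2 house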
iv. After the training is finished, you don't necessarily need to keep the training data in your
memory. The mathematical expression for the decision boundary can still be used.
Disadvantages of Supervised Learning
i. For a variety of reasons, supervised learning cannot handle some of the more difficult
machine learning tasks.
ii. Supervised learning is unable to extract unknown information from training data, in
contrast to unsupervised learning.
iii. In contrast to unsupervised learning, it is unable to classify or organize data by figuring
out its features on its own.
iv. The output can receive the wrong class label if we present a classification input that does
not correspond to any of the classes in the training data. Consider training an image
classifier on data of cats and dogs: if you then enter an image of a giraffe, the output will
be either a cat or a dog, which is incorrect.
Unsupervised learning is the process of teaching a computer to act on unlabeled data without
human supervision. The machine's job in this scenario is to categorize unsorted data based on
similarities, patterns, and differences, without any prior training on the data. In contrast to
supervised learning, no teacher trains the machine beforehand; it must therefore locate the
hidden structure in unlabeled data on its own. The model works independently to uncover
previously unnoticed patterns, relying mostly on unlabeled data. Unsupervised machine learning
methods are applied when the training data is neither categorized nor labeled. Unsupervised
learning typically comes in two forms: clustering and association.
Clustering
Clustering is the principal use of unsupervised learning. Grouping data points so that members
of a group are more similar to one another than to those in other clusters is known as clustering.
Many different clustering methods are available; they typically use a similarity measure based
on specific metrics, such as Euclidean or probabilistic distance. Genetic clustering,
bioinformatics sequence analysis, pattern mining, and object recognition are a few clustering
problems that can be solved using the unsupervised learning approach. Clustering is the process
of forming groups with recognizable traits and is used to look for subgroups within a dataset. In
unsupervised learning, we are not limited by any set of labels in terms of how many clusters we
can build.
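The following is a small sketch of clustering with the k-means algorithm in scikit-learn; the two-dimensional points are toy values chosen so that two groups emerge.

    # Clustering sketch: k-means groups unlabeled points by similarity.
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 2], [1, 4], [1, 0],
                       [10, 2], [10, 4], [10, 0]])   # toy 2-D data
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)            # cluster assignment for each point
    print(kmeans.cluster_centers_)   # learned cluster centres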
Association
Association learning is a rule-based machine learning technique for finding pertinent
associations between variables in huge databases. By using some indicators of interest, it aims to
find reliable rules in databases. In association learning, you aim to identify the rules that best
describe your data. As an illustration, it is probable that someone who watches video A will also
watch video B. Association rules are ideal in cases like this, when you want to find similar
products.
learning with modified and altered versions. This technique is used in areas like content
classification and speech analysis.
Feature Extraction: In the initial step, relevant features are extracted from the medical
images. These features may include texture descriptors, shape characteristics, or other
quantitative measurements related to the appearance of lung nodules or abnormalities.
Training: A labeled dataset is used to train the SVM model. The dataset contains medical
images with corresponding labels indicating whether they represent lung cancer or not.
Hyperplane Optimization: SVM aims to find the hyperplane that maximizes the margin
between the two classes while minimizing classification errors. This hyperplane acts as
the decision boundary that separates cancerous and non-cancerous instances.
Testing: Once the SVM model is trained, it can be used to classify new, unseen medical
images as either cancerous or non-cancerous based on their extracted features.
Figure 2.6 Separating the hyperplane for Support vector machines (Javatpoint, 2021)
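The workflow above can be sketched in scikit-learn as follows. Synthetic data stands in for features already extracted from medical images, and the scaler, kernel, and C value are illustrative defaults rather than the tuned settings of this project.

    # Sketch of the SVM workflow: train on extracted features, then classify.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic stand-in for extracted image features (X) and labels (y).
    X, y = make_classification(n_samples=200, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Feature scaling matters for SVMs; the RBF kernel is a common default.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X_train, y_train)          # training: hyperplane optimization
    print(clf.score(X_test, y_test))   # testing on unseen instances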
Convolution and Pooling: CNNs use convolutional layers to extract meaningful features
from input images. These layers apply a set of learnable filters (kernels) to the image,
capturing important patterns and features. Pooling layers downsample the feature maps to
reduce the model's complexity and computational requirements.
Feature Hierarchy: CNNs learn hierarchies of features, starting from basic patterns (e.g.,
edges) in the early layers to more complex features in deeper layers. These hierarchical
representations enable CNNs to capture intricate patterns present in medical images.
Classification: The extracted features are then passed through fully connected layers to
perform the final classification. The last layer often uses a sigmoid or softmax activation
function to output the probability of the image belonging to different classes (e.g.,
cancerous or non-cancerous).
Training: CNNs require a large labeled dataset for training. During the training process,
the model adjusts its internal parameters to minimize the difference between predicted
and actual labels.
Testing: Once the CNN is trained, it can be used to predict the presence of lung cancer in
new medical images.
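A minimal CNN of this kind can be sketched in Keras as below; the input size (64x64 grayscale) and the layer widths are assumptions for illustration, not the architecture used in this study.

    # Sketch of a small CNN for binary lung image classification.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(64, 64, 1)),            # assumed 64x64 grayscale input
        layers.Conv2D(16, 3, activation="relu"),   # convolution extracts local patterns
        layers.MaxPooling2D(),                     # pooling downsamples feature maps
        layers.Conv2D(32, 3, activation="relu"),   # deeper layers learn richer features
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),     # probability of the cancerous class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, validation_split=0.2, epochs=10)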
Both SVM and CNN approaches have their strengths and weaknesses. SVM is often preferred
for small datasets or when feature engineering is crucial, as it allows better control over the
feature selection process. On the other hand, CNNs excel in learning intricate features
automatically from raw images, making them highly effective when large labeled datasets are
available. In practice, researchers and medical professionals may choose the approach that suits
their specific dataset and requirements best.
2.6 General Architecture of Machine Learning
Machine learning is a subfield of Artificial Intelligence that deals with building computer
systems that can automatically learn from data without being explicitly programmed. The
general architecture of machine learning typically involves several stages or components that
work together to train a model that can make predictions or decisions based on input data. The
general architecture of machine learning refers to the overall structure and flow of the machine
learning process, which includes stages such as data collection, data pre-processing, feature
engineering, data modelling, evaluation, and deployment.
and other text-based sources. Once the data sources have been identified, the next step is to
extract the relevant data from them. This involves filtering out irrelevant data, such as duplicates
or noise, and selecting only the data that is relevant to the problem being solved. In some cases,
the data may need to be transformed to a common format to ensure consistency and
compatibility. After the relevant data has been extracted, it needs to be stored in a suitable format
for further processing. This may involve converting the data into a standard format such as CSV,
JSON, or XML. The data may also need to be organized into a database or data warehouse for
easy access and retrieval.
Data collection is a critical component of machine learning, as it sets the foundation for the rest
of the machine learning process. The quality and relevance of the data collected during this stage
have a significant impact on the accuracy and performance of the machine learning model. As
such, it is essential to ensure that the data collection process is thorough, efficient, and reliable.
This can be achieved through careful planning, data cleaning, and the use of appropriate tools
and techniques.
Data Cleaning: Data cleaning involves removing irrelevant data and dealing with missing data.
This is important because the quality of the data used to train the machine learning model has a
significant impact on the accuracy and performance of the model. In data cleaning, data points
with incorrect, incomplete, or inconsistent information are removed or corrected.
Data Transformation: Data transformation involves converting data from one format to
another. This can involve converting categorical data into numerical data or changing the scale
of numerical data. This step is essential because different machine learning algorithms require
data in different formats.
Data Normalization: Data normalization is the process of scaling the data to fit within a specific
range. This is important because some machine learning algorithms are sensitive to the scale of
the data. Normalization ensures that the data is on a common scale, making it easier to compare
and analyse.
Data pre-processing is essential in machine learning because it ensures that the data used to train
the machine learning model is of high quality and in the right format. A well-processed dataset
can significantly improve the accuracy and performance of the machine learning model.
Therefore, it is crucial to pay close attention to data pre-processing when building machine
learning models.
Imputation: Imputation is a technique used to fill missing values in the data. Missing data can
affect the performance of a machine learning model. Imputation techniques such as mean
imputation, median imputation, and regression imputation can be used to fill in the missing
values.
Scaling: Scaling is a technique used to normalize the data to a specific range. Normalizing the
data can help to reduce the impact of outliers and improve the performance of the machine
learning model. Techniques such as min-max scaling and standard scaling can be used for
scaling.
Encoding: Encoding is a technique used to convert categorical data into numerical data that can
be used by machine learning algorithms. Techniques such as one-hot encoding and label
encoding can be used for encoding.
Feature Selection: Feature selection is a technique used to select the most relevant features from
the data. Selecting relevant features can help to reduce the dimensionality of the data and
improve the performance of the machine learning model. Techniques such as correlation analysis
and feature importance can be used for feature selection.
Feature Extraction: Feature extraction is a technique used to create new features from the
existing features. Creating new features can help to capture more information from the data and
improve the performance of the machine learning model. Techniques such as principal
component analysis and wavelet transformation can be used for feature extraction.
The success of a machine learning model depends on the quality of the features used, and feature
engineering plays a critical role in creating relevant and informative features that can help to
improve the performance of the machine learning model.
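The sketch below illustrates several of these techniques on a tiny, made-up symptom dataframe (the column names and values are hypothetical): mean imputation, min-max scaling, one-hot encoding, and a correlation matrix that can inform feature selection.

    # Feature engineering sketch on a hypothetical symptom dataframe.
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({
        "age":    [65, 54, None, 71],           # contains a missing value
        "smoker": ["yes", "no", "yes", "no"],   # categorical column
    })

    df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()  # imputation
    df["age"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()                  # scaling
    df = pd.get_dummies(df, columns=["smoker"], dtype=float)                       # one-hot encoding
    print(df.corr())   # correlation analysis can guide feature selection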
2.6.4 Data Modelling
Data modelling involves several steps, including selecting an appropriate model,
defining the model parameters, training the model on a dataset, and testing the model's
performance on a new dataset. The selection of the appropriate model is critical since the model
should be able to capture the underlying patterns and relationships within the data. The most
common types of models used in data modelling include linear regression, decision trees,
random forests, and neural networks.
Once the appropriate model is selected, the next step is to define the model parameters. The
model parameters are the values that the model needs to learn during the training process. The
process of learning these parameters is known as model training. The training process involves
feeding the model with a dataset and adjusting the model parameters until the model can
accurately predict the output for the given input. The model training process involves several
techniques, such as optimization algorithms, regularization techniques, and hyperparameter
tuning. Optimization algorithms such as gradient descent are used to update the model
parameters iteratively until the model can accurately predict the output. Regularization
techniques such as L1 and L2 regularization are used to prevent overfitting of the model.
Hyperparameter tuning involves selecting the optimal values for the model's hyperparameters,
such as learning rate and batch size. After the model is trained, the next step is to evaluate the
model's performance on a new dataset.
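As an illustration of these training steps, the sketch below tunes the regularization strength of an L2-regularized logistic regression with grid search and cross-validation; the synthetic data and the parameter grid are assumptions for demonstration.

    # Sketch: model training with regularization and hyperparameter tuning.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    grid = GridSearchCV(
        LogisticRegression(penalty="l2", max_iter=1000),   # L2 guards against overfitting
        param_grid={"C": [0.01, 0.1, 1, 10]},              # C is the tuned hyperparameter
        cv=5,                                              # 5-fold cross-validation
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)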
2.6.5 Evaluation
Evaluation is a component of machine learning that involves assessing the performance of the
trained model. It is typically the last step before deployment and is used to determine how
well the model can generalize to new, unseen data. The evaluation process involves measuring
the accuracy, precision, recall, F1 score, and other metrics to assess the performance of the
model. The primary purpose of evaluation is to validate the performance of the model and
determine whether it is suitable for the intended task. It helps to identify any issues or errors that
may have occurred during the training process and provides insights into how the model can be
improved.
The evaluation process involves splitting the dataset into two parts, namely the training set and
the testing set. The training set is used to train the model, while the testing set is used to evaluate
its performance. The testing set should be representative of the real-world data that the model is
intended to work with. There are various evaluation metrics used in machine learning, and the
choice of metrics depends on the nature of the problem being solved. Some of the commonly
used evaluation metrics include:
Accuracy: It is the most basic evaluation metric and measures the percentage of correctly
classified samples.
Precision: It is the ratio of correctly predicted positive samples to the total number of positive
samples. Precision is used when the goal is to minimize false positives.
Recall: It is the ratio of correctly predicted positive samples to the total number of actual
positive samples. Recall is used when the goal is to minimize false negatives.
F1 Score: It is a weighted average of precision and recall and provides a balance between the
two metrics.
ROC Curve: It is a graphical representation of the true positive rate against the false positive
rate at different classification thresholds.
In addition to these metrics, there are other advanced evaluation techniques such as cross-
validation, which involves splitting the dataset into several subsets and training the model on
different subsets to improve its performance. In summary, evaluation is a critical component of
machine learning that involves assessing the performance of the model. It helps to determine
whether the model is suitable for the intended task and provides insights into how it can be
improved. The choice of evaluation metrics depends on the nature of the problem being solved,
and there are various advanced techniques that can be used to improve the performance of the
model.
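A minimal sketch of this evaluation setup follows, combining a hold-out test split with 5-fold cross-validation on the training portion; the classifier and the synthetic data are placeholders for illustration.

    # Sketch: hold-out evaluation plus k-fold cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=300, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = RandomForestClassifier(random_state=0)
    scores = cross_val_score(clf, X_train, y_train, cv=5)   # 5-fold cross-validation
    print(scores.mean())                                    # mean cross-validated accuracy
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))                        # accuracy on the held-out test set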
2.6.6 Deployment
Deployment is an essential component of machine learning that refers to the process of making a
machine learning model available for use in the real world. In other words, it involves deploying
the trained machine learning model to a production environment where it can be used to make
predictions on new data. The deployment phase is the final step in the machine learning pipeline,
after the model has been trained, tested, and evaluated. The deployment phase of machine
learning involves taking the trained model and integrating it into a system that can receive input
data, process it, and output the predicted results. This can involve building an application that
integrates the machine learning model with other components, such as a web server or mobile
app. The deployment process can be challenging and requires careful consideration of factors
such as scalability, reliability, security, and performance. There are several different methods of
deploying machine learning models, including:
Hosting on cloud platforms: One popular method is to host the machine learning model
on cloud platforms such as AWS, Google Cloud, or Microsoft Azure. These platforms
provide scalable and reliable infrastructure for hosting machine learning models and can
also provide additional services such as auto-scaling, load balancing, and security.
Containerization: Another method of deployment is containerization using technologies
such as Docker or Kubernetes. Containerization allows the machine learning model to be
packaged with all its dependencies and run consistently across different environments.
Microservices: Machine learning models can also be deployed as microservices, which
are small, modular components that can be independently deployed and scaled. This
approach can make it easier to integrate machine learning into existing applications or
systems.
Mobile deployment: Machine learning models can also be deployed on mobile devices,
such as smartphones or tablets, using technologies such as TensorFlow Lite or Core ML.
This approach can enable real-time predictions on device, without the need for a network
connection.
The deployment phase is a critical component of machine learning, as it enables the machine
learning model to be used in the real world to solve real-world problems. Effective deployment
requires careful planning and consideration of various factors, including the choice of
deployment method, infrastructure requirements, and operational considerations such as
monitoring, logging, and maintenance.
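To make the idea concrete, the following is a minimal deployment sketch that serves a previously trained, serialized model behind a small Flask endpoint; the file name model.pkl and the JSON request shape are assumptions, and a production deployment would add input validation, logging, and security.

    # Deployment sketch: a tiny prediction API around a saved model.
    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.pkl")   # hypothetical trained model file

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]    # e.g. a list of symptom values
        prediction = model.predict([features])[0]
        return jsonify({"lung_cancer": int(prediction)})

    if __name__ == "__main__":
        app.run(port=5000)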
Researchers have studied this subject using diverse datasets, which vary in quality and features.
By carefully selecting and extracting relevant information from these datasets, authors have been
able to draw meaningful conclusions from their research. Several significant studies have
contributed to the advancement of machine learning-based lung cancer detection, offering hope
for improved diagnosis and outcomes for patients. The following research works shed more light
on the subject topic:
Abbas et al. (2023) proposed a study titled "Fused Weighted Federated Deep Extreme Machine
Learning Based on Intelligent Lung Cancer Disease Prediction Model for Healthcare 5.0." In this
study, the authors address the challenge of accurate disease prediction in the era of smart
healthcare industry 5.0 and information technology advancement. Particularly, the accurate
prediction of deadly cancer diseases, such as lung cancer, is crucial for human well-being in this
context.
Alsheikhy et al. (2023) proposed "A CAD System for Lung Cancer Detection Using Hybrid
Deep Learning Techniques," a study that introduces a fully automated and practical system for
identifying and classifying lung cancer. The primary objective of this system is to detect cancer
at an early stage, potentially saving lives or reducing mortality rates. The authors employ a
combination of deep learning techniques, including a deep convolutional neural network
(DCNN) called VGG-19, and long short-term memory networks (LSTMs). These techniques are
customized and integrated to detect and classify lung cancers, with image segmentation
techniques also applied. The system falls under the category of computer-aided diagnosis (CAD).
Gupta et al. (2022) proposed a study titled "A Study On Prediction Of Lung Cancer Using
Machine Learning Algorithms." In this paper, the authors conducted image classification and
applied various machine learning algorithms to a lung cancer disease dataset in order to calculate
measures such as accuracy and sensitivity. The study employed K-NN, Random Forest, and
SVM algorithms to analyze the initial stage of lung cancer by applying them to the lung cancer
dataset.
Kumar et al. (2022) developed a study titled "Lung Cancer Prediction from Text Datasets Using
Machine Learning". The aim of their research was to optimize the process of lung cancer
detection using a machine learning model based on support vector machines (SVMs). The study
utilized an SVM classifier to classify lung cancer patients based on their symptoms, while
implementing the model using the Python programming language.
Li et al. (2022) developed "Machine learning for lung cancer diagnosis, treatment, and
prognosis" in which they conducted a review aiming to provide an overview of machine
learning-based approaches that enhance different aspects of lung cancer diagnosis and therapy.
The authors specifically focused on early detection, auxiliary diagnosis, prognosis prediction,
and immunotherapy practice. They discussed the various applications of machine learning
techniques in these areas and highlighted both the challenges and opportunities for future
advancements in the field of machine learning for lung cancer.
Nageswaran et al. (2022) proposed a study titled "Lung cancer classification and prediction using
machine learning and image processing." This research demonstrates the accurate classification
and prediction of lung cancer through the utilization of machine learning and image processing
technologies. The study involved the collection of photos, specifically 83 CT scans obtained
from 70 distinct patients, which served as the dataset for the experimental investigation. To
enhance image quality, a geometric mean filter was applied during picture preprocessing.
Nemlander et al. (2022) proposed a study titled "Lung cancer prediction using machine learning
on data from a symptom e-questionnaire for never smokers, former smokers, and current
smokers." The aim of their study was to examine the predictive capability of symptoms reported
in an adaptive e-questionnaire for lung cancer, separately for individuals who have never
smoked, former smokers, and current smokers.
Shimazaki et al. (2022) developed a study titled "Deep learning-based algorithm for lung cancer
detection on chest radiographs using the segmentation method." In this research, the authors
aimed to develop and validate a deep learning (DL)-based model for detecting lung cancer on
chest radiographs. They utilized the segmentation method and evaluated the model's performance
in terms of sensitivity and mean false positive indications per image (mFPI).
Chaturvedi et al. (2021) proposed a study titled "Prediction and classification of lung cancer
using machine learning techniques." In this paper, the authors focus on listing, discussing,
comparing, and analyzing several methods related to image segmentation, feature extraction, and
various techniques for the early detection and classification of lung cancer. The goal is to explore
the effectiveness of these approaches in improving the accuracy and efficiency of lung cancer
detection, ultimately leading to better patient outcomes.
Madan et al. (2019) developed a study titled "Lung cancer detection using deep learning." The
purpose of their work was to enhance the accuracy of the initial stage detection of lung cancer by
leveraging image processing techniques. To conduct their experiments, the researchers obtained
CT images from the NIH/NCI Lung Image Database Consortium (LIDC) dataset, which provided them with the opportunity to analyze and evaluate their proposed approach.
[Table: Comparison of related lung cancer prediction techniques (fragment)]
3. Gupta et al. (2022) | K-NN, Random Forest, and SVM | The dataset used does not include a comprehensive set of relevant features for predicting lung cancer. | 84.2%
CHAPTER THREE
METHODOLOGY
[Figure: Methodology workflow, covering processing, data balancing, feature extraction, feature selection, classification, and evaluation measure]
The following stages are taken in the development:
Step 1: Load the dataset into the development environment.
Step 2: Verify the dataset's dimensions, including its rows, columns, shape, statistical summary, etc.
Step 3: Clean the data to remove gaps and outliers, check for duplicates, deal with missing data, and handle null values.
Step 4: Scale and normalize - our dataset most likely consists of features with different scales; scaling standardizes the independent variables of a dataset into a particular range.
Step 5: Split the dataset into training and testing sets at a ratio of 80:20.
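As a concrete illustration, the sketch below walks through Steps 1-5 with pandas and scikit-learn; the file name dataset.csv, the LUNG_CANCER target column, and the assumption that all features are already numeric are ours, not fixed by the project.

# A minimal sketch of Steps 1-5 (file and column names are illustrative).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('dataset.csv')   # Step 1: load the dataset
print(df.shape)                   # Step 2: rows and columns
print(df.describe())              # Step 2: statistical summary

df = df.drop_duplicates()         # Step 3: remove duplicate records
df = df.dropna()                  # Step 3: drop rows with missing values

X = df.drop('LUNG_CANCER', axis=1)   # independent variables (assumed numeric)
y = df['LUNG_CANCER']                # target variable

X_scaled = MinMaxScaler().fit_transform(X)   # Step 4: scale into a fixed range

# Step 5: 80:20 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)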
Data balancing, in the context of machine learning and data analysis, refers to the process of
adjusting the class distribution of a dataset to ensure that each class is represented fairly. This is
particularly important in situations where one class significantly outweighs the others, leading to
biased model predictions. Techniques for data balancing include oversampling the minority
class, undersampling the majority class, or using more advanced methods like SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data points.
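As an illustration, the short sketch below applies SMOTE via the imbalanced-learn package; the package choice and the variable names X_train and y_train are assumptions for this example, not part of the project's own pipeline.

# A hedged sketch of SMOTE oversampling, assuming a numeric feature matrix
# X_train and labels y_train from an earlier split.
from collections import Counter
from imblearn.over_sampling import SMOTE

print('Class counts before balancing:', Counter(y_train))
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print('Class counts after balancing: ', Counter(y_bal))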
Here's a step-by-step outline of our approach:
Step 1: We start by importing a Python library that is essential for feature evaluation.
Step 2: Among various methods available for numerical variables, we opt for the widely used
Pearson correlation method. It assigns values between −1 and 1, where 0 indicates no correlation,
1 signifies complete positive correlation, and −1 represents complete negative correlation. For example, a correlation value of 0.7 between two variables indicates a strong positive relationship between them. A positive correlation implies that when variable A increases, variable B also tends to increase, while a negative correlation suggests that as A increases, B decreases.
Step 3: The correlation coefficient helps us understand whether the characteristics are positively
or negatively related to the target variable. To identify which characteristics are most associated
with the target variable in this project, we utilize a heat map.
Step 4: By combining the insights from both feature evaluation methods, we select the common
features that will be used for modeling purposes.
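The sketch below illustrates Steps 1-4 under our own assumptions (a numerically encoded LUNG_CANCER target column and an illustrative 0.2 threshold); it is not the project's exact code.

# Pearson correlation of every numeric feature, visualized as a heat map.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('dataset.csv')
corr = df.select_dtypes(include='number').corr(method='pearson')  # values in [-1, 1]

sns.heatmap(corr, annot=True, cmap='coolwarm')   # Step 3: heat map of correlations
plt.show()

# Step 4: keep features whose absolute correlation with the target exceeds
# an illustrative threshold of 0.2.
target_corr = corr['LUNG_CANCER'].drop('LUNG_CANCER').abs()
selected = target_corr[target_corr > 0.2].index.tolist()
print(selected)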
of both algorithms is then compared in terms of accuracy, and the algorithm that produces
the results with the highest accuracy will be used to develop the prediction system.
i. Linear SVM: This is the standard SVM for linearly separable data. It finds a hyperplane
that best separates two classes.
ii. Non-Linear SVM: SVM can handle non-linear data using kernel functions. Examples of kernel functions include the Polynomial kernel, Gaussian Radial Basis Function (RBF) kernel, and Sigmoid kernel.
iii. Support Vector Regression (SVR): instead of classification, SVR is used for regression tasks. It tries to fit a "tube" around the data points, aiming to have as many points within this tube as possible.
iv. One-class SVM: used for outlier detection, this algorithm identifies observations that deviate significantly from the majority of the data.
v. Nu-SVM: this variant of SVM uses a parameter called "nu" to control the trade-off between achieving a broader margin and allowing some misclassification.
vi. Multi-class SVM: SVMs can be extended to handle multiple classes using strategies like
One-vs-One (OVO) and One-vs-All (OVA) approaches.
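For reference, the hedged sketch below shows one way the variants listed above map onto scikit-learn classes; the hyperparameter values are illustrative defaults rather than tuned choices.

from sklearn.svm import SVC, SVR, NuSVC, OneClassSVM

linear_svm = SVC(kernel='linear')            # i.   Linear SVM
rbf_svm = SVC(kernel='rbf', gamma='scale')   # ii.  Non-linear SVM (RBF kernel)
poly_svm = SVC(kernel='poly', degree=3)      # ii.  Non-linear SVM (polynomial kernel)
svr = SVR(kernel='rbf')                      # iii. Support Vector Regression
outlier_detector = OneClassSVM(nu=0.05)      # iv.  One-class SVM
nu_svm = NuSVC(nu=0.5)                       # v.   Nu-SVM
# vi. Multi-class: SVC already handles multiple classes via a one-vs-one scheme.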
In the realm of supervised machine learning, the Support Vector Machine (SVM) stands out as a
robust and expeditious classification model designed to tackle classification problems. Its
efficiency is particularly pronounced when dealing with limited datasets. The SVM represents
instances as points in space and skillfully delineates diverse categories by maximizing the gap or
width between them using hyperplanes. An exemplary application of this approach lies in
medical datasets, where the SVM is often employed to categorize symptom parameters into
groups exhibiting similar patterns. By utilizing hyperplanes, the SVM classifier efficiently
segregates these parameters into distinct groups. What sets the SVM apart is its ability not only
to classify known data but also to predict the class of previously unseen data. The ultimate goal
is to identify the optimal hyperplane, which boasts the most extensive separator margin.
In the context of SVM, we have an input vector represented by x, a weight vector denoted as w, a
bias term indicated by b, and a decision function, f(x), responsible for classifying the input vector
into a specific class. The primary objective of SVM is to identify the most suitable values for w
and b, which effectively enhance the margin between the hyperplane and the nearest data points
belonging to different classes.
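In its simplest linear form, this decision function can be written as:
f(x) = sign(w · x + b)
so that points with w · x + b ≥ 0 are assigned to one class and all remaining points to the other; maximizing the margin then corresponds to minimizing ||w|| subject to every training point being classified correctly.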
i. LeNet-5: one of the earliest CNNs, developed by Yann LeCun and used for handwritten digit recognition.
ii. AlexNet: a breakthrough architecture in the ImageNet Large Scale Visual Recognition Challenge in 2012, designed by Alex Krizhevsky and Ilya Sutskever.
iii. VGGNet: known for its simplicity and depth, VGGNet has various versions, such as VGG16 and VGG19, with many layers.
iv. MobileNet: designed for mobile and embedded vision applications, MobileNet uses depthwise separable convolutions for efficiency.
A CNN (Convolutional Neural Network) is a type of deep learning architecture that is primarily
used for image recognition and computer vision tasks. It is designed to automatically and
adaptively learn spatial hierarchies of features from input images, allowing it to identify patterns,
objects, and structures within the images. The key components of a CNN are convolutional
layers, pooling layers, and fully connected layers. Here's a brief overview of each:
Convolutional Layer: The convolutional layer is the core building block of a CNN. It
consists of a set of learnable filters (also called kernels) that slide over the input image.
Each filter performs a convolution operation, which involves element-wise multiplication
of the filter with a local region of the input image, followed by summation. The result is a
feature map that highlights certain patterns or features found in the input image. The
formula for the convolution operation in 2D can be represented as follows:
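For an input image I and a filter K, this operation can be written as:
S(i, j) = (I * K)(i, j) = Σm Σn I(i + m, j + n) · K(m, n)
where the sums run over the rows m and columns n of the filter, and S is the resulting feature map.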
Pooling Layer: The pooling layer is used to reduce the spatial dimensions of the feature
maps obtained from the convolutional layers. It helps in reducing the computational
complexity and making the network more robust to small variations in the input. The
most common type of pooling is max-pooling, which takes the maximum value from a
local region of the feature map and retains only the most significant information. The
formula for max-pooling in 2D can be represented as follows:
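For a feature map F and a local pooling window R(i, j) (for example, a 2×2 region), max-pooling can be written as:
P(i, j) = max over (m, n) in R(i, j) of F(m, n)
that is, each output value is simply the largest value in the corresponding local region of the feature map.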
Fully Connected Layer: After several convolutional and pooling layers, the final feature
maps are flattened into a 1D vector and passed through one or more fully connected
layers. These layers are similar to those in a traditional neural network, connecting all
neurons from the previous layer to all neurons in the current layer. They help in learning
complex non-linear relationships between the extracted features and the output classes.
The formula for a fully connected layer is standard and involves a matrix multiplication:
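y = f(W · x + b)
where x is the flattened input vector, W the weight matrix, b the bias vector, and f a non-linear activation function such as ReLU or, at the output layer, softmax.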
3.2.7 Evaluation
The model's performance is measured in terms of recall, accuracy, precision, and F1-score.
i. Accuracy: quantifies the percentage of instances that are classified correctly among all
instances. It is computed by dividing the number of accurate predictions by the total
number of predictions. Essentially, it represents the proportion of correct predictions
made by our model, reflecting its overall correctness.
ii. Precision: evaluates the accuracy of identifying true positive instances among the
predicted positives. It quantifies the proportion of true positives out of all positive
predictions. This assessment reflects the model's effectiveness in predicting a particular
category and is employed to gauge its capability in correctly classifying positive values.
iii. Recall: revolves around accurately identifying positive instances among the actual
positives. Mathematically, it represents the true positives divided by the total count of
actual positive instances. This metric provides insights into how effectively the model
detects a particular category and assesses its capacity to predict true positive values.
iv. F1-Score: serves as a balanced metric, taking both precision and recall into account simultaneously. When both precision and recall need to be considered, the F1-score comes in handy, as it is the harmonic mean of these two metrics.
The application of the confusion matrix technique aids in obtaining the essential parameters for
evaluating model performance. Primarily employed for assessing classification models, this two-
dimensional table arranges the model's predicted labels in columns and the true class labels in
rows. The confusion matrix enables the derivation of crucial metrics such as True Positive (TP),
True Negative (TN), False Positive (FP), and False Negative (FN). Figure 1.9 visually illustrates
the extraction of these values from the table, which subsequently serve as the foundation for
calculating the model's performance metrics.
[Figure 1.9: Confusion matrix, with the model's predicted class labels in columns and the true class labels in rows]
True positives (TP) occur when the model accurately predicts the positive class, correctly
recognizing an observation as part of the positive class.
False positives (FP) happen when the model predicts the positive class incorrectly, wrongly
identifying an observation as belonging to the positive class when it actually does not.
True negatives (TN) are instances where the model correctly predicts the negative class,
accurately identifying an observation as not belonging to the positive class.
False negatives (FN) are situations where the model predicts the negative class incorrectly,
mistakenly classifying an observation as not belonging to the positive class when it actually
does.
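The short sketch below shows how these four values can be extracted from a scikit-learn confusion matrix for a binary problem and turned into the metrics defined earlier; the variable names y_test and y_pred are assumptions carried over from the evaluation step.

# Extract TP, TN, FP, FN from a binary confusion matrix and derive the metrics.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                                 # also called sensitivity
f1 = 2 * precision * recall / (precision + recall)      # harmonic mean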
A good model must also generalize well to unfamiliar, previously unseen data. In other words, the model should demonstrate proficiency in predicting target values for data it has never encountered during training.
3.4 Software Component Requirements
Software refers to a set of instructions and programs that direct the operation of a machine.
These are created by programmers and software teams. In this study, the following software
components are used, each serving a specific purpose:
Operating system: To ensure optimal utilization of computing resources, Windows 7 and newer
versions are employed.
Integrated development environment (IDE): Visual Studio Code is utilized for writing the
system's source codes.
Web browsers: Google Chrome and Safari are the chosen web browsers.
Python 3.9: This is a versatile programming language used for various coding purposes.
Streamlit: An open-source Python library used for constructing interactive web applications
focused on data science and machine learning. It simplifies the process of creating and deploying
data-driven applications without extensive HTML, CSS, or JavaScript coding.
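As a flavour of how little code this takes, the hedged sketch below builds a minimal prediction page; the file model.joblib and the three input fields are illustrative stand-ins, not the full set of parameters used by the system.

# A minimal, illustrative Streamlit page for a saved prediction model.
import joblib
import streamlit as st

model = joblib.load('model.joblib')          # assumed pre-trained and saved

st.title('Lung Cancer Prediction')
age = st.number_input('Age', min_value=1, max_value=120, value=40)
anxiety = st.selectbox('Anxiety', [0, 1])
peer_pressure = st.selectbox('Peer pressure', [0, 1])

if st.button('Predict'):
    result = model.predict([[age, anxiety, peer_pressure]])[0]
    st.write('Prediction:', 'Positive' if result == 1 else 'Negative')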
GitHub: This web-based platform serves as a centralized location for version control and collaboration among developers. It is built on the Git version control system, which tracks changes in code over time and allows multiple developers to work on the same codebase concurrently.
In addition, the following hardware components are required:
Hard Disk: A minimum of 500GB for permanent storage of codes, datasets, and trained models.
Mouse: A 3D mouse.
CHAPTER FOUR
RESULTS AND DISCUSSIONS
4.1 Introduction
The machine learning model is deployed by designing a web-based application through which
the user can get the predicted results by inputting the values for the required parameters such as age, gender, anxiety, peer pressure and so on. As these parameters are entered into the web application, they are sent to the backend where the machine learning model is stored. When the backend receives the data, the model makes a prediction on the lung cancer outcome and sends the response back to the frontend so the result can be displayed. The degree of accuracy of this system depends on the number of records contained in the dataset used to carry out the operation. The application is a medium through which data passes to the machine learning model; the model checks for resemblance to the dataset used in its construction and learns from the data already available in that dataset. Finally, it delivers the appropriate outcome back to the user. The web application designed in this study was tested on localhost and can be deployed online as a fully working application, i.e., it can be accessed at any time and from anywhere without restrictions. Users of this system can run tests whenever they feel the need to, which in turn improves the model's ability to learn from the different data values presented to it.
This system runs on a web environment and is used with the following procedures on the
localhost:
Step 1: Start the XAMPP control panel and start the Apache server and MySQL database.
Step 2: On your preferred browser, type "localhost/lung". Once the web application loads, it displays a user interface (frontend) that initially shows information about lung cancer, its symptoms, and how it can be controlled. A login page and a registration page are provided for users to access the system and to register if they are not already on it. When users access the system, they are provided with its full functionality.
In the system architecture, the presentation layer is designed using HTML/CSS and jQuery (a JavaScript framework) as the user interface, the application layer is designed using PHP running on the server, and the data layer uses a MySQL database server installed on the computer system. Python was used on the backend layer for the machine learning predictions. The system captures users' details and allows information input from the user into the system's database, where users can easily access and view database information.
Figure 4.1 Home page
The image above shows the interface through which users interact with the software. The prediction model is used to predict lung cancer in patients. The basic structure of the prediction system is as follows: data acquisition, data preprocessing, feature extraction, splitting into training and testing datasets, and classification using the web-based application.
The dataset used for this project was collected from Kaggle and contains 3000 lung cancer images: 500 images for training adenocarcinoma lung cancer, 200 images for testing adenocarcinoma lung cancer, 550 images for training large cell carcinoma lung cancer, 250 images for testing large cell carcinoma lung cancer, 475 images of normal lungs for training, 175 images of normal lungs for testing, 575 images for training squamous cell carcinoma lung cancer, and 275 images for testing squamous cell carcinoma lung cancer. Tables 4.1 and 4.2 show how the data was distributed for a suitable machine learning task. The training dataset made up 70% of the data and the testing dataset made up 30%.
Table 4.1: Training dataset distribution
S/N | Class | Number of training images
1 | Adenocarcinoma | 500
2 | Large cell carcinoma | 550
3 | Normal | 475
4 | Squamous cell carcinoma | 575
500 images are for training adenocarcinoma lung cancer, 550 images for training large cell carcinoma, 475 images for training normal lungs, and 575 images for training squamous cell carcinoma, which sum up to 2100 images. All 2100 samples were used for training the model. The training dataset made up 70% of the data.
Table 4.2: Testing dataset distribution
S/N | Class | Number of testing images
1 | Adenocarcinoma | 200
2 | Large cell carcinoma | 250
3 | Normal | 175
4 | Squamous cell carcinoma | 275
200 images are for testing adenocarcinoma lung cancer, 250 images for testing large cell carcinoma, 175 images for testing normal lungs, and 275 images for testing squamous cell carcinoma, which sum up to 900 images. All 900 samples were used for testing the model. The testing dataset made up 30% of the data.
Data preprocessing is one of the most important parts of machine learning: to build an accurate machine learning application, data must be processed before being fed into the system. Data preprocessing is a technique to prepare the raw data so that the model can learn from it. The preprocessing step deals with the missing values in the dataset, converts numeric values into string type so that we can perform one-hot encoding, and also handles the undersampling of the target attribute.
Step 1: In order to perform data preprocessing using Python, we need to import predefined Python libraries. These libraries are used to perform specific jobs such as removing gaps, noise, and outliers. Four specific libraries are used for data preprocessing: pandas, numpy, matplotlib, and seaborn.
Step 2: Import the dataset into a pandas DataFrame so that it can be inspected and cleaned.
Step 3: The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning model.
Hence it is necessary to handle missing values present in the dataset.
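The sketch below illustrates Step 3 on the DataFrame df from Step 2; the AGE column named in the imputation example is an assumption.

# Inspect and handle missing values.
print(df.isnull().sum())      # count of missing values per column

df = df.dropna()              # option 1: drop incomplete rows
# df['AGE'] = df['AGE'].fillna(df['AGE'].median())   # option 2: impute instead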
Figure 4.7 Dataset after dealing with categorical data
This step involves extracting and selecting features to get the best results from the models. Selecting the best features helps to reveal the underlying data structures. The presence of irrelevant features in the dataset makes it difficult to differentiate the impact of the relevant ones, which significantly reduces the performance of the machine learning model. To curb this and identify the essential symptoms, three methods were applied: the univariate feature selection method, the correlation coefficient, and feature importance. The symptoms common to the feature selection methods are picked and used to train the model.
i. Correlation coefficient
A correlation matrix is a table used to define the correlation coefficients among variables or features, and it is a tool for the feature selection and extraction process. Each cell in the matrix represents the relationship between two variables. It is used to summarize a large dataset as well as to identify the most highly correlated features in the data. A correlation coefficient near 1 indicates that the two features involved are highly correlated, while a coefficient near 0 indicates that they are only weakly correlated.
Figure 4.8 Heatmap of selected features
ii. Univariate Feature Selection
Statistical tests can be used to identify the features with the strongest relationship to the output variable. The scikit-learn library includes the SelectKBest class, which can be used in conjunction with a variety of statistical tests to select a specific number of features. The figure below selects 15 of the best features using the chi2 statistical test for non-negative features.
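A sketch of this selection step is given below, assuming a non-negative feature DataFrame X and target y prepared earlier.

# Select the 15 highest-scoring features with the chi2 test.
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=15)
X_best = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])   # names of the 15 selected features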
iii. Feature Importance
The importance of each feature of the dataset can be obtained by using the feature importance property of the model. Feature importance gives a score for each feature of the data; the higher the score, the more important or relevant the feature is. Feature importance is an inbuilt property of tree-based classifiers; we use an Extra Trees classifier to extract the top 15 features of the dataset, as sketched below.
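# Rank features by importance with an Extra Trees classifier, again assuming
# the feature DataFrame X and target y from the preprocessing step.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.nlargest(15))            # the 15 most relevant features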
With these feature selection techniques, the features common to all three methods were obtained: a list of 12 features was selected, and every other feature with no relevance was dropped.
In machine learning data preprocessing, we split the selected features into a training set and a testing set. This is one of the crucial steps of data preprocessing, as it can enhance the performance of the machine learning model. Suppose we train our machine learning model on one dataset and then test it on a totally different dataset: it will be difficult for the model to capture the true correlations between the features and the target. Likewise, if we train our model very well so that its training accuracy is very high, but then provide it with a new dataset, its performance will decrease. So we always try to build a machine learning model which performs well on the training set and also on the test dataset.
The dataset was then further divided into two sets (a training set and a testing set) using the train-test split method. The system was trained using the training set, and the accuracy of the ML classifier was then tested using the testing set. Finally, the model was used to predict the likelihood of the disease, in terms of positive or negative outcomes, using new patient data.
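The sketch below illustrates the 70:30 split and a single train-and-evaluate pass; SVC stands in for whichever classifier is being compared, and X and y are the selected features and labels from the previous step.

# 70:30 split, training, and accuracy evaluation.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = SVC().fit(X_train, y_train)      # train on the 70% portion
y_pred = clf.predict(X_test)           # predict on the unseen 30%
print('Testing accuracy:', accuracy_score(y_test, y_pred))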
i. Training dataset
A subset of the dataset used to train the machine learning model, for which we already know the output. 70% of the dataset was fed to the machine learning model for training with the aid of the Random Forest and Support Vector Machine classifiers. This teaches the machine learning model to learn and make the desired predictions.
Figure 4.9 Training Dataset
ii. Testing dataset
A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output. 30% of the dataset is used to evaluate the accuracy of the developed model and to make predictions, in order to validate the efficacy and accuracy of the developed model.
4.3.5 User output
This typically refers to the results or outcomes generated by users interacting with a system, software or service. It represents the information, actions, or changes produced as a result of user input or engagement.
The image above shows a prediction result of large cell carcinoma for an uploaded image.
The image above shows a prediction result of normal lungs for an uploaded image.
Figure 4.1.4 Predictions
The image above shows a prediction result of adenocarcinoma for an uploaded image.
The following formulas can be used to obtain the accuracy score, precision, sensitivity, specificity and F1-score:
Accuracy = (TP + TN)/(TP + FP + TN + FN)
Precision = TP/(TP + FP)
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
F1-Score = 2 × (Precision × Sensitivity)/(Precision + Sensitivity)
Figure 4.1.5 Performance metrics for SVM
According to Figure 4.1.5, the Support Vector Machine model provided a 52% training accuracy. With such a result, any machine learning model developed from it is likely to be inaccurate.
The confusion matrix is obtained from the machine learning code, after training the SVM model.
It will be used to describe the performance of a classification model on the training data.
To evaluate the precision, sensitivity and specificity of the Support Vector Machine model, we
draw conclusions from the confusion matrix in Figure 4.1.6
Precision: Precision is the percentage of accurately identified positive values. With TP = 226 and FP = 33, it can be derived from the above confusion matrix using the following formula:
Precision = TP/(TP + FP) = 226/(226 + 33) = 226/259 = 0.87
Sensitivity: Sensitivity, also known as recall, is the percentage of true positive cases that are accurately identified. With TP = 226 and FN = 49, it can be derived from the above confusion matrix using the following formula:
Sensitivity = TP/(TP + FN) = 226/(226 + 49) = 226/275 = 0.82
Specificity: Specificity is the percentage of truly negative cases that are accurately identified. With TN = 192 and FP = 33, it can be derived from the above confusion matrix using the following formula:
Specificity = TN/(TN + FP) = 192/(192 + 33) = 192/225 = 0.85
Figure 4.1.7 Performance metrics for CNN
The confusion matrix is obtained from the machine learning code, after training the CNN model.
It will be used to describe the performance of a classification model on the training data.
To evaluate the precision, sensitivity and specificity of the CNN model, we draw conclusions from the corresponding confusion matrix.
Precision: Precision is the percentage of accurately identified positive values. With TP = 208 and FP = 36, it can be derived from the above confusion matrix using the following formula:
Precision = TP/(TP + FP) = 208/(208 + 36) = 208/244 ≈ 0.85
Sensitivity: Sensitivity, also known as recall, is the percentage of true positive cases that are accurately identified. With TP = 208 and FN = 43, it can be derived from the above confusion matrix using the following formula:
Sensitivity = TP/(TP + FN) = 208/(208 + 43) = 208/251 ≈ 0.83
Specificity: Specificity is the percentage of truly negative cases that are accurately identified. With TN = 213 and FP = 36, it can be derived from the above confusion matrix using the following formula:
Specificity = TN/(TN + FP) = 213/(213 + 36) = 213/249 ≈ 0.86
Two machine learning techniques were trained and tested: 70 percent of the dataset was used for training the models while 30 percent was used for testing. The developed models achieved an accuracy of 84% for SVM and an accuracy of 96% for CNN.
Table 4.5: Comparison of Lung Cancer Techniques
2. Alsheikhy et al. (2023) | VGG-19 and long short-term memory networks (LSTMs) | The research does not discuss the interpretability of the developed model. | 98.8%
Ontology is a data model that represents knowledge as a set of concepts within a domain and the relationships between these concepts. Ontology is a form of knowledge management: it captures the knowledge within an organization as a model, and this model can then be queried by users to answer complex questions and display relationships. The two standards that govern the construction of ontologies are the Resource Description Framework (RDF) and the Web Ontology Language (OWL). In accordance with RDF and OWL, ontologies are made up of two components, classes and relationships: classes are represented by ovals and relationships by arrows, and together they can be used to represent real-life relationships. Such a relationship is called a triple. A triple consists of a subject, a predicate, and an object. Triples are the heart of ontologies; they can be merged to represent a comprehensive view of the real world. Ontology was used in this project work to show the relationships between the processes involved in lung cancer prediction on this system.
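To make the idea of a triple concrete, the hedged sketch below builds two subject-predicate-object triples with the rdflib package; the namespace and terms are illustrative, not taken from the project's own ontology.

# Two illustrative RDF triples about a hypothetical patient.
from rdflib import Graph, Literal, Namespace

EX = Namespace('http://example.org/lung#')
g = Graph()
g.add((EX.Patient_001, EX.hasDiagnosis, EX.Adenocarcinoma))   # subject, predicate, object
g.add((EX.Patient_001, EX.hasAge, Literal(54)))

for s, p, o in g:                    # iterate over the stored triples
    print(s, p, o)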
For lung cancer predictions to be made, the Patient class was necessary. The Patient class has a Patient ID subclass and a Diagnosis subclass: the Patient ID subclass holds the identification number for the patients, and the Diagnosis subclass holds the diagnostic results of patients. The Diagnosis subclass contained the Classification subclass, which represented the classification model used to predict lung cancer abnormalities. The Lung cancer abnormalities subclass consisted of the attributes needed for lung cancer prediction. These attributes were in the form of subclasses such as Age, Sex, LOH, LOS, alcohol CPD, salt CID, CKD, ATD, Smoking, Pregnancy, GPC, PA, BMI, etc. The Classification class gives results to the Diagnosis class, which are further passed to the Disease class, which gives the lung cancer result to the Patient class.
Overall, CNNs are a useful tool in machine learning and can provide valuable insights into the relationships between features and the target variable. However, they should be used with caution and in combination with other algorithms to mitigate their limitations.
Prior to the actual implementation of the system, it had to be tested comprehensively and every
possible error discovered. Since the system cannot be tested exhaustively, the black box testing
method was used for system testing. The black box testing usually demonstrates that software
functions are operational; that the input is properly accepted and the output is correctly produced;
and that integrity of external information (database) is maintained.
It is pertinent to note that though all the program modules have been debugged, this does not mean that they are completely error-free, as logical errors might develop at any time later during the usage of the system. System testing can be divided into the following:
Unit testing was carried out on the individual modules of the system to ensure that they are fully functional units. We did this by examining each unit; for example, the login page was checked to ensure that it functions as required, that it adds data and other details, and that this data is sent to the database. The success of each individual unit gave us the go-ahead to carry out integration testing. All identified errors were dealt with.
We carried out integration testing after the different modules had been put together to make a complete system. Integration testing was aimed at ensuring that the modules are compatible and can be integrated to form a complete working system. For example, we tested that when a user is logged in, he/she is linked to the appropriate module and can also make predictions and access the database.
iii. Direct Changeover: this is the changeover method in which the old (current or existing) system is completely replaced by the newly introduced system in one move. This method is quick and carries the smallest workload, but it is very risky and may be unworkable when the two systems differ substantially.
iv. Phase Changeover: this requires that the existing system is gradually phased out as the new system is gradually phased in.
The process of modifying a system to meet changing needs is known as system maintenance. System maintenance is a primary obligation any computerized organization must take up in order to ensure the efficiency and continuity of the developed system. It is a routine activity, which is to say that maintenance is essential to the smooth running of the system. The following practices and measures must be taken to ensure that the new system does not break down and that it achieves its proposed aims and objectives:
i. Password Management: Each user is required to enter an individual username and password
when accessing the software; this keeps the data in the database secured. For maximum security,
each user must protect their password.
ii. Regular Database Backup: This involves creating duplicates of data that serve as an insurance copy in case the active copy is damaged or destroyed. The backup is usually stored on an external storage device. Recovery involves the use of specialized utility programs to rebuild or replace damaged files. The best way to recover a file or program is to restore it from a backup copy.
iii. Virus Protection: This requires the use of a program that protects a system from malicious
software called a virus. A virus is a program that infects a computer and could damage a system
depending on its nature. Because new viruses must be analyzed as they appear, the antivirus
must be updated regularly to be effective.
iv. Proper use of the system: These include starting (booting) and shutting down the system in
the right manner to prevent the system from hanging or data corruption and file loss.
CHAPTER FIVE
CONCLUSION AND RECOMMENDATION
5.1 Conclusion
Lung cancer is a disease with multifactorial origins and etiology. It arises when abnormal cells grow uncontrollably in the lungs, forming tumors that can interfere with lung function and, if left untreated, can spread to other parts of the body. This shows how crucial the demand for a system that diagnoses lung cancer quickly and efficiently is. The machine learning model is deployed through a web-based application: the user inputs the values for the required parameters such as age, gender, anxiety, peer pressure and so on; these parameters are sent to the backend where the model is stored; and the model's prediction of the lung cancer outcome is returned to the frontend for display. The degree of accuracy of this system depends on the number of records contained in the dataset used to carry out the operation.
We proposed this ontology-based lung cancer prediction system for the sole aim of predicting
lung cancer. Testing and evaluating the machine learning models on a clinical dataset through analysis by means of two different algorithms reveals accuracies of 83.6%, 84.2%, 75% and 71% for the SVM and CNN classifiers, respectively. We can conclude that, using the CNN classifier, we can reach a significant accuracy level of 90% on training and 84.2% on testing.
An ontology was made for this system using Protégé; this served in our project work as a knowledge base for the operations of our system. In our future work we intend to apply the algorithms in this project, together with other machine learning algorithms, to larger datasets with more disease attributes to obtain higher accuracy, to build ontologies that can be queried using the SPARQL query language, and to review gaps in more related works to broaden the scope of the system.
5.2 Recommendation
To ensure proper and efficient adoption of the findings of this project work, it is recommended that there be consistency in the use of computerized prediction and management software. It is also recommended that two or more algorithms be added to the current ones and run on a larger dataset to find out whether better accuracy can be achieved. Additionally, continuous maintenance of the software is very important to keep up with the needs of new technologies. It should be noted that predictive models for lung cancer detection suffer from limited generalizability due to training on specific datasets, leading to potential bias and inaccuracies when applied to diverse populations or settings.
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
APPENDIX F
[Figure: Bar chart comparing the SVM and CNN models on precision, recall, and F1-score (values in %, y-axis from 75% to 100%)]
APPENDIX G
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt   # required by the commented plotting calls below
import seaborn as sns             # required by the commented seaborn plots below

# Load the lung cancer dataset
df = pd.read_csv('dataset.csv')
#### Visualization (optional exploratory plots)
##sns.distplot(df['AGE'], kde=False, color='Blue')
##sns.countplot(x='GENDER', data=df, hue='LUNG_CANCER')
##sns.countplot(x='AGE', data=df, hue='LUNG_CANCER')
##sns.distplot(df['ANXIETY'], kde=False, color='Blue')
##sns.distplot(df['ALCOHOL_CONSUMING'], kde=False)
##plt.show()
# The metrics below assume y_test and y_pred from the train/test split and
# model fitting shown earlier in this appendix.
from sklearn.metrics import (accuracy_score, confusion_matrix, recall_score,
                             precision_score, f1_score, fbeta_score,
                             classification_report)

print(f'Testing Accuracy :{accuracy_score(y_test,y_pred)}')
print(f'Confusion matrix:\n {confusion_matrix(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'Precision: {precision_score(y_test,y_pred)}')
print(f'F1-score: {f1_score(y_test,y_pred)}')
print(f'Fbeta-score: {fbeta_score(y_test,y_pred,beta=0.5)}')
print(classification_report(y_test,y_pred))
print('-'*33)
## Save the trained model to disk so the web application can load it
##import joblib
##joblib.dump(model, 'model.joblib')