Lung Cancer Project


CHAPTER ONE

INTRODUCTION

1.1 Background of the Study


Lung cancer, a malignant tumor originating in the lungs, is a highly prevalent and deadly form of
cancer worldwide. This discussion aims to delve into the history of lung cancer, highlighting its
early recognition, advancements in diagnosis, treatment, and prevention. By understanding its
historical context, we can appreciate the challenges faced and the progress made in combatting
this devastating disease. The history of lung cancer can be traced back to ancient civilizations. In
ancient Egyptian manuscripts from 3000 BC, respiratory symptoms similar to those associated
with lung cancer were described (Falzone et al., 2018). However, it was during the Industrial
Revolution in the 18th century that lung cancer cases surged due to increased exposure to coal
dust and occupational hazards (Mackenbach, 2020). Surgeons in the 19th century began
documenting lung tumors during autopsies, recognizing the disease as a distinct entity. In the
early 20th century, researchers began investigating the association between lung cancer and
tobacco smoking (Ruegg, 2015). German physician Franz Reuter's 1909 study suggested a
connection, laying the foundation for subsequent research (Grunert, 2013). In the 1950s, Richard
Doll and Bradford Hill in the UK, and Ernst Wynder and Evarts Graham in the US, conducted
epidemiological studies that confirmed the strong link between smoking and lung cancer (Di
Cicco et al., 2016). This established lung cancer as a preventable disease with a major behavioral
risk factor.
As awareness of the link between smoking and lung cancer grew, efforts to prevent and control
the disease intensified. Public health campaigns were launched to educate the public about the
dangers of smoking and to encourage smoking cessation. These campaigns included warning
labels on cigarette packages, restrictions on tobacco advertising, and the implementation of
smoke-free policies in public spaces. In parallel with preventive measures, advancements in the
diagnosis of lung cancer played a crucial role in improving patient outcomes. The first
radiographic evidence of lung cancer was reported in 1918, and over time, imaging techniques
such as X-rays and computed tomography (CT) scans became instrumental in detecting lung
tumors (Purandare & Rangarajan, 2015). The development of bronchoscopy, a procedure that allows direct visualization of the airways, further facilitated the diagnosis and staging of lung cancer (Paradis et al., 2016).

Treatment options for lung cancer have evolved significantly throughout history. In the early
days, surgical resection was the primary approach, with the goal of removing the tumor and
surrounding tissues. However, the effectiveness of surgery was limited, as many cases were
diagnosed at advanced stages when the cancer had spread beyond the lungs. The introduction of
radiation therapy in the 20th century expanded treatment possibilities, offering an alternative or
complementary option to surgery (Gianfaldoni et al., 2017).

Chemotherapy, which involves the use of drugs to kill cancer cells, revolutionized lung cancer
treatment in the 1940s (Bagnyukova et al., 2010). Initially, chemotherapy drugs had limited
efficacy and were associated with significant side effects. However, ongoing research and
clinical trials led to the development of more targeted and effective chemotherapy regimens. In
recent years, the emergence of targeted therapies and immunotherapy has further transformed the
treatment landscape for specific subsets of lung cancer patients.

Targeted therapies are designed to specifically target and inhibit the growth of cancer cells with
specific genetic mutations (Waarts et al., 2022). By identifying the genetic abnormalities driving
the cancer, doctors can tailor treatment to individual patients, improving response rates and
reducing side effects. Immunotherapy, on the other hand, harnesses the body's immune system to
recognize and attack cancer cells (Papaioannou et al., 2016). It has shown remarkable success in
some patients by unleashing the immune system's potential to fight lung cancer.

In addition to these treatment modalities, palliative care has gained prominence in recent years to
improve the quality of life for lung cancer patients. Palliative care focuses on alleviating
symptoms, managing pain, and providing emotional support to patients and their families. It aims
to address the physical, psychological, and social aspects of living with a serious illness,
promoting a holistic approach to patient care.

The understanding and classification of lung cancer have evolved over time. In the mid-20th century, lung cancer was classified into non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC) (Gazdar, 2010). Although this classification system remains widely used, further subtypes and molecular classifications have been identified, leading to targeted therapies and personalized medicine approaches. Diagnostic techniques for lung cancer have significantly
improved over the years. Early methods such as chest X-rays and bronchoscopies allowed for the
identification of lung tumors. The development of computed tomography (CT) scans in the
1970s revolutionized lung cancer detection by providing detailed cross-sectional images (Sharma
et al., 2015). Subsequent advancements introduced positron emission tomography (PET) scans
and magnetic resonance imaging (MRI), enhancing diagnostic accuracy (Kure et al., 2021).
Liquid biopsy techniques, such as circulating tumor DNA analysis, have recently shown promise
for non-invasive detection and monitoring of lung cancer (Lone et al., 2022).

Surgery has been a primary treatment for localized lung cancer since the 1930s. Innovations in
surgical techniques, such as video-assisted thoracic surgery (VATS) and robotic-assisted
procedures, have improved patient outcomes (Choe et al., 2020). Radiation therapy, used since
the early 1900s, has also advanced with intensity-modulated radiation therapy (IMRT) and
stereotactic body radiation therapy (SBRT) (Yu et al., 2014). Chemotherapy and targeted
therapies have significantly improved survival rates for both NSCLC and SCLC patients (Musika
et al., 2021). Immunotherapy, particularly immune checkpoint inhibitors, has emerged as a
groundbreaking treatment option, leveraging the body's immune system to combat cancer
cells. Recognition of the smoking-lung cancer link has led to widespread public health
campaigns to reduce tobacco consumption. Smoke-free policies, tobacco taxation, and anti-
smoking campaigns have contributed to declining smoking rates (Levy et al., 2016). Screening
programs utilizing low-dose CT scans have been implemented for high-risk populations,
enabling early detection (Yip et al., 2021). Efforts to improve air quality, reduce occupational
exposures, and promote healthy lifestyles continue to be crucial in preventing lung cancer.

The history of lung cancer reveals a complex journey marked by significant advancements in
understanding, diagnosis, treatment, and prevention. From ancient observations to modern
molecular classifications and personalized therapies, our knowledge of lung cancer has expanded
exponentially. While challenges remain, such as late-stage diagnoses and limited treatment
options for advanced cases, ongoing research and public health initiatives offer hope for
continued progress. Through a comprehensive approach encompassing prevention, early
detection, and innovative treatments, we strive to reduce the burden of lung cancer and improve
patient outcomes in the future.

Lung cancer is a deadly disease with challenges in diagnosis and treatment. Machine learning, a
branch of artificial intelligence, has shown promise in various aspects of lung cancer care. It has
been used to analyze medical imaging data for early detection and classification of lung nodules.
Machine learning algorithms can assist in diagnosis by considering patient data and predicting
tumor characteristics. Treatment planning can be improved with machine learning models that
generate personalized treatment recommendations. Furthermore, machine learning can predict
patient outcomes and survival rates by analyzing diverse patient data. However, challenges such
as data quality, privacy, and validation need to be addressed. Overall, machine learning has the
potential to revolutionize lung cancer care and improve patient outcomes.

1.2 Statement of the Problem


Lung cancer is the leading cause of cancer death in the world, accounting for 1 in 4 cancer
deaths. Early detection is critical for improving survival rates, but current methods of lung cancer
detection are limited and expensive.
1.3 Aim and Objectives of the Study
The aim of this study is to develop a predictive lung cancer model using Support Vector Machine (SVM) and Convolutional Neural Network (CNN) algorithms. The specific objectives are to:

i. develop a predictive model using Channel Squeeze-and-Excitation (SE) techniques;

ii. develop predictive models for lung cancer detection using SVM and CNN;

iii. evaluate the developed system using the following metrics: accuracy, precision, recall, confusion matrix, and F1 score.

1.4 Research Methodology

This research will adopt a quantitative research design to develop and evaluate predictive models
for lung cancer detection. The study will involve the collection and analysis of a dataset
containing patient symptoms obtained from Kaggle. Two algorithms, Support Vector Machines
(SVM) and Convolutional Neural Networks (CNN), will be utilized for developing predictive
models. The dataset containing patients' symptoms will be obtained from Kaggle, ensuring that it
includes relevant information such as demographic details, clinical histories, and medical
imaging data. The dataset will be preprocessed to ensure data quality: cleaning the data, handling missing values, and normalizing or standardizing the features. Categorical variables will be encoded appropriately, and feature selection techniques may be employed to reduce dimensionality if necessary.
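
As a concrete illustration of these preprocessing steps, a minimal Python sketch using pandas and scikit-learn is given below. The file name and column names (e.g., LUNG_CANCER as the class label) are placeholders assumed for a typical Kaggle symptom dataset, not confirmed fields of the dataset used in this study.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    # Load the symptom dataset (file name is a placeholder).
    df = pd.read_csv("lung_cancer_survey.csv")

    # Remove duplicates and fill missing numeric values with the median.
    df = df.drop_duplicates()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

    # Encode categorical variables (e.g., GENDER, LUNG_CANCER) as integers.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])

    # Standardize the features so they share a common scale.
    X = StandardScaler().fit_transform(df.drop(columns=["LUNG_CANCER"]))
    y = df["LUNG_CANCER"].values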

Two predictive models, SVM and CNN, will be developed for lung cancer detection using the
preprocessed dataset. SVM, a supervised learning algorithm, will be trained on the dataset to
learn patterns and classify instances into lung cancer or non-lung cancer classes. CNN, a deep
learning algorithm, will be implemented to extract meaningful features from the medical imaging
data and classify lung nodules. The dataset will be divided into training and validation sets. The
SVM model will be trained using the training set and tuned using appropriate hyperparameters
through techniques like cross-validation. Similarly, the CNN model will be trained using the
training set and fine-tuned to optimize its performance. The trained models will be evaluated
using the validation set to assess their predictive performance.
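
A minimal sketch of this split-and-tune step is shown below, assuming the preprocessed feature matrix X and label vector y from the preceding sketch; the split ratio and parameter grid are illustrative choices, not the study's final settings.

    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    # Hold out 20% of the data for validation (ratio is illustrative).
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Tune the SVM's kernel and regularization strength by 5-fold
    # cross-validation on the training set.
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
    svm = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
    svm.fit(X_train, y_train)
    print("Best SVM parameters:", svm.best_params_)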

Evaluation metrics such as accuracy, sensitivity, specificity, and F1 score will be calculated to
measure the models' effectiveness in detecting lung cancer. Comparative analysis between the
SVM and CNN models will be performed to understand their respective strengths and
limitations. The results obtained from the evaluation of the predictive models will be analyzed
and interpreted. The performance of the SVM and CNN models will be compared to determine
their effectiveness in lung cancer detection. Python is a popular programming language for
artificial intelligence and machine learning. This research will use Python to develop the
predictive models for lung cancer. The model will be developed in Jupyter Notebook, and it will
use HTML, CSS, and JavaScript for its user interface. The model will store its data in a MySQL
database, which will be accessible to patients and the general public.
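
The evaluation metrics named above can be computed with scikit-learn as sketched below, assuming the tuned model and validation split from the preceding sketch; sensitivity is the recall of the positive (cancer) class, while specificity is derived from the confusion matrix.

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 f1_score, recall_score)

    y_pred = svm.predict(X_val)

    # Unpack the binary confusion matrix: true/false negatives/positives.
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    print("Accuracy   :", accuracy_score(y_val, y_pred))
    print("Sensitivity:", recall_score(y_val, y_pred))
    print("Specificity:", tn / (tn + fp))
    print("F1 score   :", f1_score(y_val, y_pred))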

Further analysis will focus on identifying the strengths and weaknesses of each model and
potential areas for improvement. The findings of the research will be discussed in the context of
existing literature and previous studies on lung cancer detection. The limitations of the study and
possible future research directions will be outlined. The research will conclude with a summary
of the study's contributions, implications, and recommendations for the implementation of
predictive models for lung cancer detection in clinical settings.

1.5 Significance of the Study
The study on predictive models for lung cancer detection is important for multiple reasons. Lung
cancer is a widespread and deadly disease, causing numerous deaths annually. Early detection is
crucial for improving patient outcomes. Predictive models can help healthcare professionals
identify individuals at higher risk, enabling timely interventions. Using a symptom dataset in
predictive modeling offers advantages over traditional diagnostic methods. It reduces the need
for expensive, time-consuming, and invasive tests, lowering costs and saving time. Predictive
models can improve detection accuracy by analyzing large amounts of patient data and
identifying patterns and correlations. This aids in identifying high-risk individuals who may need
further testing. Additionally, predictive models contribute to precision medicine by tailoring
screening and diagnosis based on individual risk factors and symptoms. The study has the
potential to revolutionize lung cancer screening and diagnosis, improving accuracy, reducing
costs, and enabling early interventions for better patient outcomes and survival rates.

1.6 Scope of the Study


The scope of this study is to develop and assess machine learning algorithms or predictive
models that can efficiently detect lung cancer using a dataset of patient symptoms. The ultimate
goal is to utilize these models to reduce the time and cost associated with diagnosing lung
cancer. To achieve this, we will collect a comprehensive dataset of patient symptoms including
coughing, shortness of breath, chest pain, weight loss, and other relevant clinical information.
This dataset will serve as the foundation for training and testing the predictive models. The
objective of the models will be to analyze the collected patient symptom data and identify strong
associations or patterns indicative of lung cancer. By learning from the available data, the models
will make predictions or classifications regarding the likelihood of a patient having lung cancer.

The models will be developed using Support Vector Machines (SVM) and Convolutional Neural
Networks (CNN). Additionally, the study will include a thorough evaluation of the performance
of the developed predictive models. This evaluation process will employ appropriate metrics to
assess accuracy, sensitivity, specificity, and overall predictive power. To validate their effectiveness, the models will be compared against existing diagnostic methods or medical
expert opinions. By utilizing predictive models for lung cancer detection, this study aims to
provide a more efficient and cost-effective alternative to traditional diagnostic approaches. The
potential impact of this research lies in its ability to aid healthcare professionals in identifying
individuals at high risk of lung cancer, enabling early detection and timely intervention,
ultimately leading to improved patient outcomes.

1.7 Definition of Terms

Lung cancer: A malignant tumor that originates in the lungs. It is the leading cause of
cancer death in the world.

Surgeons: Doctors who specialize in surgery. They are responsible for performing
operations to diagnose and treat diseases.

Epidemiological studies: Studies that investigate the distribution and determinants of diseases in populations.

Radiographic evidence: Evidence of a disease that is seen on a radiograph, a type of medical image.

Bronchoscopy: A procedure that allows a doctor to look inside the airways.

Chemotherapy: The use of drugs to kill cancer cells.

Targeted therapies: Drugs that are designed to specifically target and inhibit the growth
of cancer cells with specific genetic mutations.

Immunotherapy: Treatment that harnesses the body's immune system to recognize and
attack cancer cells.

Palliative care: Care that focuses on alleviating symptoms, managing pain, and
providing emotional support to patients and their families.

Machine learning: A type of artificial intelligence that allows computers to learn from
data and make predictions.

CHAPTER TWO

LITERATURE REVIEW

2.1 Introduction

Lung cancer is a devastating disease that has affected millions of lives worldwide, making it one
of the leading causes of cancer-related deaths (Thandra et al., 2021). It is believed to have
originated in ancient times, although it was not until the late 19th and early 20th centuries that its
connection to smoking became more apparent (Sanchez-Ramos, 2020). The sharp rise in tobacco
consumption during the 20th century significantly contributed to the lung cancer epidemic,
leading to its prevalence becoming a public health concern (Proctor, 2012). With growing
awareness of the link between smoking and lung cancer, efforts were initiated to mitigate
smoking-related risks. However, the disease's complexity and lack of early-stage diagnostic
methods continued to challenge medical professionals.

While smoking remains the most significant risk factor for lung cancer, exposure to secondhand
smoke, occupational hazards such as asbestos and radon, air pollution, and genetic predisposition
also play pivotal roles (Cheng et al., 2021). Smoking introduces harmful carcinogens into the
lungs, causing chronic inflammation and DNA damage, which can lead to uncontrolled cell
growth and the formation of tumors (Hussain et al., 2019). Non-smokers can also develop lung
cancer due to these other risk factors.

Lung cancer can manifest in various forms, primarily non-small cell lung cancer (NSCLC) and
small cell lung cancer (SCLC), each having distinct growth patterns and treatment responses
(Rudin et al., 2021). Symptoms may not appear until the disease has progressed to advanced
stages, leading to reduced treatment success rates and increased mortality. Common symptoms
include persistent cough, shortness of breath, chest pain, unexplained weight loss, and recurrent
respiratory infections. Lung cancer not only affects physical health but also has profound
psychological and emotional impacts on patients and their families.

Historically, lung cancer diagnosis has relied on traditional methods such as chest X-rays,
computed tomography (CT) scans, bronchoscopy, and biopsy (Folch et al., 2015). While these
approaches have been effective to some extent, they often lack the sensitivity and specificity
required for early detection. Early-stage tumors can be challenging to identify accurately, leading
to delayed diagnosis and reduced chances of successful treatment.

Machine learning, a subset of artificial intelligence, has emerged as a transformative tool in various industries, including healthcare (Javaid et al., 2022). Its ability to analyze vast amounts
of data, recognize patterns, and make data-driven predictions has brought newfound hope to lung
cancer detection. By leveraging machine learning algorithms, researchers and clinicians can
extract meaningful insights from medical imaging, patient records, and genomics data to aid in
early diagnosis and personalized treatment strategies. Machine learning algorithms can process
high-dimensional datasets from different sources, enabling the identification of subtle patterns
indicative of early-stage lung cancer. For instance, deep learning, a subset of ML, employs
artificial neural networks inspired by the human brain to analyze medical images with incredible
accuracy. By feeding these algorithms with annotated datasets of lung cancer images, they can
learn to detect and classify tumors more precisely than traditional methods.

Furthermore, ML models can be trained to analyze patient health records, lifestyle habits, and
genetic information to assess an individual's lung cancer risk and recommend personalized
screening and prevention strategies (Dritsas & Trigka, 2022). This approach not only aids in
early detection but also empowers patients to make informed decisions about their health. While
the integration of machine learning in lung cancer detection holds great promise, several
challenges must be addressed. One major challenge is the need for large and diverse datasets to
train accurate and generalizable ML models. Additionally, the ethical use of patient data and the
establishment of robust data privacy measures are crucial considerations.

Moreover, the integration of ML into the clinical workflow requires collaboration between
medical experts and data scientists. This interdisciplinary approach ensures that ML algorithms
are validated, interpretable, and clinically relevant. Furthermore, regulatory approvals and
adoption by healthcare institutions are essential steps in transitioning from research to practical
clinical implementation.

The quest for early detection and improved outcomes in lung cancer has been ongoing for
decades. The emergence of machine learning offers an unprecedented opportunity to transform
lung cancer detection and diagnosis. By harnessing the power of data and advanced algorithms,
this project aims to pave the way for a future where early detection becomes a reality,
significantly improving the prognosis and survival rates for lung cancer patients worldwide.
With continued research, collaboration, and innovation, we can bring this vision to life and make
substantial strides in the battle against lung cancer.

2.1.1 The Origins and Etiology of Lung Cancer

Lung cancer is a complex disease with multifactorial origins and etiology. It arises when
abnormal cells grow uncontrollably in the lungs, forming tumors that can interfere with lung
function and, if left untreated, can spread to other parts of the body. The origins and etiology of
lung cancer involve a combination of genetic, environmental, and lifestyle factors.
i. Smoking and Tobacco Use: The most significant risk factor for developing lung cancer
is smoking tobacco, including cigarettes, cigars, and pipes. Smoking is responsible for
about 85% of all lung cancer cases (Warren & Cummings, 2013). Tobacco smoke
contains numerous carcinogens that damage lung cells and lead to genetic mutations,
promoting the development of cancerous cells.
ii. Secondhand Smoke: Exposure to secondhand smoke, also known as passive smoking,
can increase the risk of lung cancer, particularly in non-smokers who live or work with
smokers. Secondhand smoke contains many of the same harmful substances as direct
smoke and can contribute to lung cancer development (Kim et al., 2018).
iii. Radon Gas: Radon is a naturally occurring radioactive gas that can seep into homes and
buildings from the ground. Prolonged exposure to high levels of radon is a significant
risk factor for lung cancer, especially for individuals who smoke (Riudavets et al., 2022).
iv. Occupational and Environmental Exposures: Certain workplace exposures to
carcinogens like asbestos, arsenic, chromium, nickel, and diesel exhaust can increase the
risk of lung cancer (Spyratos et al., 2013). Additionally, exposure to air pollution,
industrial emissions, and certain chemicals may contribute to lung cancer development.
v. Genetic Predisposition: While smoking is the primary cause of lung cancer, some
individuals may have a genetic predisposition that increases their susceptibility to the disease (Hjelmborg et al., 2016). Certain inherited genetic mutations can elevate the risk
of developing lung cancer, even in individuals who have never smoked.
vi. Pre-existing Lung Conditions: People with pre-existing lung conditions, such as
chronic obstructive pulmonary disease (COPD) or pulmonary fibrosis, have a higher risk
of developing lung cancer (Schwartz et al., 2016).

vii. Radiation Exposure: Exposure to high doses of ionizing radiation, such as during
certain medical procedures or radiation therapy for other cancers, can increase the risk of
developing lung cancer (Ali et al., 2020).

2.1.2 Types of Lung Cancer


Lung cancer is a type of cancer that begins in the lungs and is one of the leading causes of
cancer-related deaths worldwide. There are two main types of lung cancer: non-small cell lung
cancer (NSCLC) and small cell lung cancer (SCLC). These two types are classified based on the
appearance of cancer cells under a microscope and have distinct characteristics, treatments, and
prognoses.
i. Non-Small Cell Lung Cancer (NSCLC): NSCLC is the most common type of lung
cancer, accounting for approximately 85% of all cases. It tends to grow and spread more
slowly than small cell lung cancer, allowing for a wider range of treatment options.
NSCLC is further divided into subtypes:
a. Adenocarcinoma: This subtype is the most prevalent and often occurs in the
outer areas of the lung. It is more common in non-smokers and former smokers.
Adenocarcinoma is often associated with certain genetic mutations and responds
well to targeted therapies (Jin et al., 2019).
b. Squamous Cell Carcinoma: Formerly known as epidermoid carcinoma, this
subtype usually develops in the lining of the bronchial tubes (Johnson et al.,
2020). It is linked to smoking and can cause symptoms such as coughing and
breathing difficulties.
c. Large Cell Carcinoma: This is a less common subtype of NSCLC and is named
for its appearance under a microscope (Raso et al., 2021). It tends to grow and
spread quickly and often requires aggressive treatment.

ii. Small Cell Lung Cancer (SCLC): SCLC, also known as oat cell carcinoma, accounts
for about 15% of all lung cancers. It is more aggressive and fast-growing than NSCLC
and is more likely to have already spread to other parts of the body at the time of
diagnosis. SCLC is strongly associated with smoking, and it typically responds well to
chemotherapy and radiation therapy in the initial stages.

It is essential to diagnose the specific type of lung cancer accurately, as treatment decisions are
often based on this classification. Other rare types of lung cancer, such as carcinoid tumors, may
also occur, but they are less common than NSCLC and SCLC.

2.1.3 Effects and Symptoms of Lung Cancer


The effects and symptoms of lung cancer can vary depending on the stage of the disease and the
type of lung cancer involved. Below is a discussion on the effects and symptoms of lung cancer:

Effects:

i. Decreased Lung Function: As lung cancer grows, it can block air passages in the lungs,
reducing the organ's capacity to exchange oxygen and carbon dioxide efficiently. This
leads to shortness of breath and decreased lung function (Han & Mallampalli, 2015).
ii. Metastasis: Lung cancer can spread to other parts of the body, such as the bones, brain,
liver, and other organs, through the lymphatic system or bloodstream. This advanced
stage is known as metastatic or stage IV lung cancer and can have severe effects on
various organ functions (Popper, 2016).
iii. Cachexia: Advanced lung cancer can cause a condition called cachexia, characterized by
severe weight loss, muscle wasting, weakness, and fatigue (Ni & Zhang, 2020). Cachexia
is a result of the body's response to cancer, and it can further weaken the patient.
iv. Paraneoplastic Syndromes: Lung cancer can trigger the production of hormones or
immune system responses that affect other organs, leading to various paraneoplastic
syndromes (Fletcher, 2021). These syndromes can cause neurological symptoms,
hormonal imbalances, and other systemic effects.

Symptoms:

i. Persistent Cough: A persistent or chronic cough is one of the most common symptoms of
lung cancer. The cough may be dry or produce phlegm and might worsen over time.
ii. Shortness of Breath: As the tumor grows or blocks air passages, patients may experience
increasing shortness of breath, especially during physical activity.
iii. Chest Pain: Lung cancer can cause chest pain that may be dull, constant, or intermittent.
The pain may worsen with deep breathing, coughing, or laughing.
iv. Coughing up Blood: Hemoptysis, or coughing up blood or blood-streaked mucus, is
another common symptom of lung cancer.
v. Hoarseness: If the cancer affects the recurrent laryngeal nerve, it can lead to hoarseness
or changes in the voice.
vi. Unexplained Weight Loss: Significant and unexplained weight loss can occur due to the
effects of cancer on metabolism and appetite.
vii. Fatigue: Lung cancer patients often experience severe fatigue, which can be a result of
the disease itself, treatments, or other related factors.
viii. Recurrent Respiratory Infections: Some lung cancer patients may be more susceptible to
respiratory infections due to a weakened immune system.

2.1.4 Diagnosis of Lung Cancer


Diagnosis of lung cancer involves a series of tests and procedures aimed at detecting the
presence of cancerous cells in the lungs and determining the extent and stage of the disease.
Early diagnosis is crucial for improving the chances of successful treatment and better outcomes.
Here's an overview of the typical steps involved in the diagnosis of lung cancer:

i. Medical History and Physical Examination: The process begins with a comprehensive
medical history, where the doctor will ask about the patient's symptoms, risk factors
(such as smoking history or exposure to certain chemicals), and any relevant family
history of cancer. A physical examination is also conducted to assess the patient's overall
health.
ii. Imaging Tests: Several imaging tests are commonly used to visualize the lungs and
detect any abnormalities (Gargani & Volpicelli, 2014). These may include:

 X-rays: Simple chest X-rays can reveal the presence of masses or abnormal
nodules in the lungs (Khan et al., 2011). However, they may not be sufficient to
confirm the diagnosis or provide detailed information about the cancer.
 CT Scan (Computed Tomography): CT scans are more detailed than X-rays and
can provide a clearer view of the lungs, helping to identify smaller nodules or
tumors and evaluate whether the cancer has spread to nearby lymph nodes or
other organs.
 MRI (Magnetic Resonance Imaging): In some cases, an MRI may be used to get a
more precise image of the lungs and surrounding areas (Kumar et al., 2016).
 PET Scan (Positron Emission Tomography): PET scans help determine the
metabolic activity of lung nodules or masses (Lai et al., 2022). This information is
valuable in distinguishing between benign and malignant growths and in staging
the cancer.
iii. Biopsy: A biopsy is the definitive way to diagnose lung cancer. It involves obtaining a
small sample of tissue from the lung for examination under a microscope (Holland,
2021). There are different types of biopsies:
 Needle Biopsy: A thin, hollow needle is inserted into the lung tissue to obtain a
sample. This can be done through the skin (percutaneous biopsy) or with the
guidance of imaging techniques such as CT or ultrasound.
 Bronchoscopy: A thin, flexible tube with a camera on the end (bronchoscope) is
inserted through the mouth or nose into the airways. The doctor can then collect
small tissue samples from the lungs for analysis.
 Thoracoscopy or Mediastinoscopy: In some cases, a surgical procedure may be
necessary to access and biopsy hard-to-reach areas in the chest cavity.
iv. Laboratory Tests: After obtaining a tissue sample, it is sent to a pathology laboratory
for examination. A pathologist will analyze the sample to determine whether cancer is
present and, if so, the type and subtype of lung cancer.
v. Staging: Once lung cancer is confirmed, the next step is staging. Staging involves
determining the extent of the cancer, including its size, location, and whether it has
spread to nearby lymph nodes or other parts of the body (Mirsadraee et al., 2012). This
information is crucial for determining the most appropriate treatment plan.

vi. Molecular Testing: For some types of lung cancer, molecular testing may be performed
on the tissue sample to identify specific genetic mutations or biomarkers (Shim et al.,
2017). These results can help guide targeted therapies or immunotherapies.
vii. Additional Tests: Depending on the stage and type of lung cancer, additional tests may
be conducted to assess the patient's overall health and their ability to undergo specific
treatments. These tests may include lung function tests, blood tests, and other imaging
studies.

The diagnosis of lung cancer can be a complex process, and it is important to have a
multidisciplinary team of healthcare professionals, including oncologists, pathologists,
radiologists, and pulmonologists, working together to provide an accurate diagnosis and develop
an appropriate treatment plan for the patient. As always, early detection and timely intervention
play a critical role in improving the prognosis for individuals with lung cancer.

2.1.5 Treatment Options for Lung Cancer


The treatment options for lung cancer depend on several factors, including the type and stage of
the cancer, the patient's overall health, and their preferences. Lung cancer is generally treated
using one or a combination of the following approaches:

i. Surgery: Surgery is often the preferred treatment for early-stage non-small cell lung
cancer (NSCLC) and some limited-stage small cell lung cancer (SCLC). During surgery,
the tumor and surrounding lymph nodes are removed to prevent further spread of cancer
(Doerr et al., 2022). The type of surgery performed depends on the tumor's size, location,
and extent.
ii. Radiation Therapy: Radiation therapy uses high-energy X-rays or other particles to
target and destroy cancer cells (Baskar et al., 2014). It can be used as the primary
treatment for early-stage lung cancer in patients who are not candidates for surgery or as
an adjuvant therapy after surgery to reduce the risk of cancer recurrence. In advanced
cases, radiation therapy can help relieve symptoms and improve the quality of life.
iii. Chemotherapy: Chemotherapy is a systemic treatment that uses drugs to kill cancer cells
or prevent them from dividing and growing (Krans, 2021). It is commonly used for both
NSCLC and SCLC, either as the main treatment or in combination with surgery or radiation. Chemotherapy can be given before or after surgery (neoadjuvant or adjuvant
chemotherapy) or in advanced cases to slow tumor growth and alleviate symptoms.
iv. Targeted Therapy: Targeted therapies are drugs that specifically target certain genetic
mutations or proteins that promote the growth of cancer cells (Wilkes, 2018). These
therapies are mainly used in NSCLC when specific genetic mutations are present. They
can be more effective and have fewer side effects compared to traditional chemotherapy.
v. Immunotherapy: Immunotherapy is a type of treatment that helps the body's immune
system recognize and attack cancer cells. It has shown significant promise in treating
both NSCLC and SCLC, especially in cases where other treatments have been
unsuccessful (Easton, 2017). Immune checkpoint inhibitors are the most common form of
immunotherapy used for lung cancer.
vi. Palliative Care: Palliative care focuses on providing relief from symptoms and
improving the quality of life for patients with advanced lung cancer (Tan &
Ramchandran, 2020). It is not curative but can help manage pain, shortness of breath,
fatigue, and other symptoms associated with the disease.
vii. Clinical Trials: Clinical trials are research studies that test new treatments or
combinations of treatments to evaluate their safety and effectiveness. Patients with lung
cancer may consider participating in clinical trials as a potential option when other
treatments have not been successful or to access cutting-edge therapies (Nielsen et al.,
2020).

The choice of treatment depends on the specific characteristics of the lung cancer, the stage of
the disease, the patient's overall health, and their preferences. A multidisciplinary team, including
oncologists, surgeons, radiation oncologists, and other specialists, works together to develop a
personalized treatment plan for each patient. It's essential for patients to discuss their treatment
options thoroughly with their healthcare team and make informed decisions based on their
individual circumstances.

2.2 Machine Learning


Nvidia defines machine learning as the process of using algorithms to analyze data, learn from it, and then determine or predict something about the outside world. The machine is thus "trained" with vast amounts of data and algorithms that enable it to learn how to complete a task, rather than being manually programmed with a specific set of instructions to carry out a certain operation (Copeland, 2021). In other words, machine learning is the method by which computers take an algorithm (as opposed to rules-based programming) and improve it as they interact with more data. With machine learning, computers can be supplied with terabytes or petabytes of data from which they discover distinctions and refine their own models, guided by the underlying human-written programming, to achieve a desired result; it is applied to tasks ranging from hate-speech detection to medical diagnosis. Machine learning is a branch of artificial intelligence (AI) that enables systems to learn automatically from experience and improve over time without being explicitly programmed. It is the process of creating computer programs that can access data and use it to learn for themselves. The learning process begins with observations or data, such as examples, first-hand experience, or instruction, in order to uncover patterns in the data and make better decisions in the future based on the examples provided. The basic objective is to enable computers to learn independently of humans and adjust their behavior accordingly.

2.3 Machine Learning Applications


Machine learning has a wide range of applications across various industries and fields. Here are
some examples:

i. Image and speech recognition: Machine learning algorithms are used in image and
speech recognition software to identify patterns and features in images and audio signals.
For example, facial recognition technology can be used in security and surveillance
systems, while speech recognition technology can be used in personal assistants and
voice-controlled devices.
ii. Natural language processing (NLP): Machine learning algorithms are used in NLP to
analyze and understand human language, enabling computers to perform tasks such as
language translation, sentiment analysis, and chatbot interactions.
iii. Recommender systems: Machine learning algorithms can be used to analyze user
behavior and preferences to recommend products, services, and content that users are
likely to be interested in. Examples of such systems include movie and music
recommendation systems, as well as personalized product recommendations on e-
commerce websites.

iv. Fraud detection: Machine learning algorithms can be used to detect fraudulent behavior
in financial transactions and other contexts, helping to prevent fraud and improve
security.
v. Healthcare: Machine learning can be used in healthcare to analyze medical data and help
diagnose diseases, identify risk factors, and develop treatment plans.
vi. Predictive maintenance: Machine learning algorithms can be used to analyze data from
sensors and other sources to predict when maintenance will be required on machines and
equipment, helping to prevent breakdowns and reduce downtime.
vii. Autonomous vehicles: Machine learning algorithms are used in self-driving cars to
identify obstacles, detect traffic patterns, and make real-time driving decisions.
viii. Financial analysis: Machine learning algorithms can be used in financial analysis to
identify patterns and trends in data, analyze market behavior, and predict future
outcomes.

Overall, machine learning has a wide range of applications across industries and fields, enabling
businesses and organizations to improve efficiency, accuracy, and decision-making capabilities.

2.4 Machine Learning Approaches


There are various strategies that can be applied when undertaking machine learning. They are conventionally grouped into supervised, unsupervised, semi-supervised, and reinforcement learning, as summarized in Figure 2.1.
Figure 2.1 Machine learning approaches: supervised learning (classification, regression), unsupervised learning (clustering, association), semi-supervised learning, and reinforcement learning (Gazdar, 2010)

2.4.1 Supervised Learning

The goal of supervised learning is to identify the rules that map a set of inputs to a set of outputs. Supervised machine learning algorithms use what they have learned from labelled examples in the past to predict future events on fresh data. Starting from the analysis of a known training dataset, the learning algorithm constructs an inferred function to predict the output values. After sufficient training, the system can produce targets for any new input. The learning algorithm can also compare its results against the desired, correct results in order to find errors and correct the model accordingly. The process is called supervised because labelled examples of prior input-output pairs are given to the model to teach it how to behave. Supervised learning is thus the machine learning task of learning a function that maps an input to an output based on sample input-output pairs; from labeled training data, which consists of a collection of training instances, it infers a function. Two types of supervised learning exist: classification and regression.

Classification
Classification divides similar data points into different sections or categories. Machine learning is used to find the rules that specify how to separate the different data points. Simply put, classification approaches look for the optimal way to draw a boundary between data points; these decision boundaries mark the distinctions between the classes. The entire region designated for a class is referred to as the decision surface, and a data point is assigned to a class if it falls inside that decision surface's borders.

Figure 2.2 Classification algorithm in machine learning (Jacint, 2016)

Types of Classification Problems


i. Binary Classification: This type of problem has exactly two possible classes. For instance, to categorize an email as spam or not spam, we can assign label 0 for spam and 1 for not spam, and the model outputs a probability over these two labels, making it a binary classification task (a minimal sketch appears after this list).
ii. Multi-Class Classification: In this case, classification involves more than two classes. Assume we have historical weather data for a location where each day is classified as sunny, cloudy, or wet. We can then train a model and apply it to forecast tomorrow's weather, which may fall into any of the aforementioned categories; this is multi-class classification.
iii. Multi-Label Classification: In semantic tagging, the goal is to examine some text data and predict the content categories the text belongs to. A multi-label classification problem therefore arises when a single input can be assigned to one, two, or more categories at once.
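
A minimal sketch of binary classification in Python follows; the synthetic two-feature dataset is merely a stand-in for real data such as the spam example above.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy two-class dataset standing in for, e.g., spam vs. not spam.
    X, y = make_classification(n_samples=200, n_features=2,
                               n_informative=2, n_redundant=0,
                               random_state=0)

    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X[:5]))        # predicted class labels (0 or 1)
    print(clf.predict_proba(X[:5]))  # probability of each label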

Examples of Classification Algorithms
i. Naïve Bayes
ii. Support Vector Machine
iii. Stochastic Gradient Descent
iv. Logistic Regression
v. Random Forest

Applications of Classification Algorithms


i. Malware classification
ii. Web text classification
iii. Anomaly detection
iv. Customer churn prediction etc.

Regression
Regression algorithms are a subset of supervised machine learning algorithms. One of their fundamental features is the capacity to forecast the value of new data by modeling the dependencies and interactions between the target output and the input features. Regression algorithms predict output values from the input attributes of the data fed into the system: the algorithm builds a model from the characteristics of the training data and uses that model to predict the value of new data.

Figure 2.3 Regression algorithm in machine learning (Tieu, 2016)

Regression differs from classification in that it produces a number rather than a class. Regression is the process of determining how strongly the independent and dependent variables are correlated. It assists in making predictions about continuous variables such as market trends and house prices. The method predicts a single output value from training data. For instance, regression can be used to forecast house prices on the basis of training data whose input variables are location, home size, and so forth, as the sketch below illustrates.
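
A minimal sketch of such a house-price regression follows; the training numbers are invented purely for illustration.

    from sklearn.linear_model import LinearRegression

    # Toy training data: [location score, home size in square metres].
    X_train = [[3, 70], [5, 120], [8, 95], [9, 150]]
    y_train = [120_000, 250_000, 230_000, 410_000]  # sale prices

    model = LinearRegression().fit(X_train, y_train)
    print(model.predict([[7, 110]]))  # forecast price of a new house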

Types of Regression Algorithms


i. Simple Linear Regression: A mapping function is used in simple linear regression to model the linear relationship between one independent variable and the dependent variable to be predicted. Consider, for illustration, that a locality's housing costs are influenced only by its location. Then, with a model trained on historical data, one may forecast home costs for any new area.
ii. Multiple Linear Regression: Regression analysis with several independent variables and one dependent variable is known as multiple linear regression. For instance, a restaurant's ratings are influenced by the quality of the food as well as the ambiance, service, and location. In this situation, several independent variables affect the dependent variable linearly.
iii. Polynomial Regression: Polynomial regression maps a non-linear relationship between the independent and dependent variables. The equation is not linear, since the mapping consists of several powers of an independent variable.

Applications of Regression Algorithms


i. Predicting the behavior of customers
ii. Prediction of stock rates
iii. Performing analysis on survey data

Advantages of Supervised Learning


i. You will have a comprehensive understanding of the classes of the training data.
ii. You can define the classes in great detail by training the classifier so that it has a flawless
decision boundary that accurately distinguishes between various classes.
iii. The specific number of classes can be determined before supplying the data for training.

iv. After the training is finished, you don't necessarily need to keep the training data in your
memory. The mathematical expression for the decision boundary can still be used.
Disadvantages of Supervised Learning
i. For a variety of reasons, supervised learning cannot handle some of the more difficult
machine learning tasks.
ii. Supervised learning is unable to extract unknown information from training data, in
contrast to unsupervised learning.
iii. In contrast to unsupervised learning, it is unable to classify or organize data by figuring
out its features on its own.
iv. The output can receive a wrong class label if we provide an input that does not correspond to any of the classes in the training data. Consider, for example, an image classifier trained on data of cats and dogs: if you enter an image of a giraffe, the outcome will still be a cat or a dog, which is incorrect.

2.4.2 Unsupervised Learning

Unsupervised learning is the process of teaching a computer to act on unlabeled data without human supervision. The machine's job in this scenario is to categorize unsorted data based on similarities, patterns, and differences, without any prior training on the data. In contrast to supervised learning, no teacher provides correct answers, so the machine must locate the hidden structure in unlabeled data on its own. The model works independently to uncover previously unnoticed patterns, making use mostly of unlabeled data. Unsupervised machine learning methods are applied when the training data is neither categorized nor labeled. Unsupervised learning typically comes in two forms:

Clustering
Clustering is the main use of unsupervised learning. Clustering is the process of grouping data points such that individuals in a group are more similar to one another than to those in other clusters. Many different clustering methods are available; they typically use a similarity measure based on specific metrics, such as Euclidean or probabilistic distance. Genetic clustering, bioinformatics sequence analysis, pattern mining, and object recognition are a few problems that can be solved using this unsupervised learning approach. Clustering forms groups with recognizable traits and is used to look for subgroups in a dataset. In unsupervised learning, we are not limited by any set of labels in terms of how many clusters we can build.

Figure 2.4 Clustered data points (Waart, 2022)
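
A minimal k-means clustering sketch with scikit-learn is given below; the synthetic blob data stands in for real unlabeled observations, and the choice of three clusters is illustrative.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Unlabeled data points containing three hidden groups.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # K-means partitions the points into k clusters by distance.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_[:10])      # cluster index assigned to each point
    print(kmeans.cluster_centers_)  # the learned cluster centroids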

Types of Clustering Algorithms


i. Grid-based Clustering
ii. Distribution-based Clustering
iii. Density-based Clustering
iv. Hierarchical Clustering
v. K-Means Clustering

vi. Centroid Clustering

Applications of Clustering Algorithms


i. Pattern recognition
ii. Image processing
iii. Market research
iv. Data analysis

Association
Association learning is a rule-based machine learning technique for finding pertinent associations between variables in large databases. Using some measures of interestingness, it aims to find reliable rules in the data. In association learning, you aim to identify the rules that best describe your data. As an illustration, it is probable that someone who watches video A will also watch video B. Association rules are ideal in cases like this, when you want to find related items, as the sketch below shows.
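
A minimal sketch of association rule mining follows, assuming the third-party mlxtend library is installed; the tiny one-hot watch-history table is invented for illustration.

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded watch history: each row is one user's session.
    sessions = pd.DataFrame({
        "video_A": [1, 1, 1, 0, 1],
        "video_B": [1, 1, 0, 0, 1],
        "video_C": [0, 1, 0, 1, 0],
    }).astype(bool)

    # Find frequently co-watched videos, then derive rules such as
    # "watched video_A -> will watch video_B".
    frequent = apriori(sessions, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])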

Figure 2.5 Association rule learning algorithm (Henschke, 2021)

Types of Association Algorithms


i. F-P Growth Algorithm
ii. Apriori Algorithm
iii. Eclat Algorithm
Applications of Association Algorithms
i. Protein sequence
ii. Basket data analysis
iii. Web usage mining
iv. Medical Diagnosis

2.4.3 Semi-supervised Learning


Semi-supervised learning is a machine learning technique in which a small amount of labeled data is combined with a large amount of unlabeled data during training. Semi-supervised learning lies in the middle between supervised and unsupervised learning. It often starts by clustering the unlabeled data; the labeled data are then used to label the clustered unlabeled data, and finally a large volume of now-labeled data is used to train machine learning models. Semi-supervised learning models have the potential to be very successful because they can use a lot of data. They are often created by combining existing supervised and unsupervised machine learning algorithms with modified and altered versions of them. This technique is used in areas like content classification and speech analysis.
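
A minimal sketch of this idea follows, using scikit-learn's LabelSpreading (one of several semi-supervised strategies) on synthetic data; by convention, a label of -1 marks an unlabeled point.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import LabelSpreading

    X, y = make_classification(n_samples=300, random_state=0)

    # Pretend only about 10% of the labels are known; -1 = unlabeled.
    rng = np.random.RandomState(0)
    y_partial = np.copy(y)
    y_partial[rng.rand(len(y)) < 0.9] = -1

    # Propagate the few known labels through the data's structure.
    model = LabelSpreading().fit(X, y_partial)
    print("Accuracy on all points:", model.score(X, y))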

2.4.4 Reinforcement Learning


The final type of machine learning is the most difficult and unusual, yet it has generated remarkable results. It teaches through rewards rather than labels in the conventional sense. Reinforcement learning is the machine learning discipline concerned with the actions intelligent agents should take in a specific environment to maximize the notion of cumulative reward. This is very similar to how people learn: positive and negative signals constantly surround us, and we take them in and apply them to our lives. These signals reach us in a variety of ways, including through brain chemistry. When something positive happens, the neurons in our brains release dopamine and other elevating neurotransmitters, making us feel good and increasing our inclination to repeat the positive occurrence. In contrast to supervised learning, we do not always require supervision; even when reinforcement signals are only seldom supplied, we can still learn quite well.

2.5 Theoretical Frameworks


Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) are two different
theoretical frameworks used in machine learning and deep learning, respectively, for lung cancer
detection and classification. Both approaches have been applied successfully to medical imaging
tasks, including lung cancer detection.

2.5.1 Support Vector Machines


SVM is a supervised machine learning algorithm used for classification and regression tasks. In
the context of lung cancer detection, SVM can be used to classify medical images (such as chest
X-rays or CT scans) into different categories, such as "cancerous" or "non-cancerous." The main
idea behind SVM is to find the optimal hyperplane that best separates the different classes in the
feature space.

Here's how SVM works for lung cancer detection:

 Feature Extraction: In the initial step, relevant features are extracted from the medical
images. These features may include texture descriptors, shape characteristics, or other
quantitative measurements related to the appearance of lung nodules or abnormalities.
 Training: A labeled dataset is used to train the SVM model. The dataset contains medical
images with corresponding labels indicating whether they represent lung cancer or not.
 Hyperplane Optimization: SVM aims to find the hyperplane that maximizes the margin
between the two classes while minimizing classification errors. This hyperplane acts as
the decision boundary that separates cancerous and non-cancerous instances.
 Testing: Once the SVM model is trained, it can be used to classify new, unseen medical
images as either cancerous or non-cancerous based on their extracted features.

Figure 2.6 Separating the hyperplane for Support vector machines (Javatpoint, 2021)
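
A minimal sketch of this workflow is shown below; because the exact texture and shape descriptors are not specified here, a synthetic feature matrix stands in for the features extracted from the scans.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Stand-in for descriptors extracted from scans, with labels
    # 1 = cancerous and 0 = non-cancerous.
    features, labels = make_classification(n_samples=400, n_features=20,
                                           random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, random_state=0)

    # A maximum-margin classifier over the extracted feature space.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X_train, y_train)
    print("Held-out accuracy:", model.score(X_test, y_test))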

2.5.2 Convolutional Neural Networks


CNN is a type of deep learning architecture specifically designed for image recognition and
computer vision tasks. It has achieved remarkable success in various medical imaging
applications, including lung cancer detection.

Here's how CNN works for lung cancer detection:

 Convolution and Pooling: CNNs use convolutional layers to extract meaningful features
from input images. These layers apply a set of learnable filters (kernels) to the image, capturing important patterns and features. Pooling layers downsample the feature maps to
reduce the model's complexity and computational requirements.
 Feature Hierarchy: CNNs learn hierarchies of features, starting from basic patterns (e.g.,
edges) in the early layers to more complex features in deeper layers. These hierarchical
representations enable CNNs to capture intricate patterns present in medical images.
 Classification: The extracted features are then passed through fully connected layers to
perform the final classification. The last layer often uses a sigmoid or softmax activation
function to output the probability of the image belonging to different classes (e.g.,
cancerous or non-cancerous).
 Training: CNNs require a large labeled dataset for training. During the training process,
the model adjusts its internal parameters to minimize the difference between predicted
and actual labels.
 Testing: Once the CNN is trained, it can be used to predict the presence of lung cancer in
new medical images.
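
A minimal sketch of such a network in Keras is given below; the input size, layer widths, and (commented) training call are illustrative assumptions, not the architecture used in this study.

    from tensorflow import keras
    from tensorflow.keras import layers

    # A small CNN that classifies 64x64 grayscale scan patches as
    # cancerous (1) or non-cancerous (0).
    model = keras.Sequential([
        keras.Input(shape=(64, 64, 1)),
        layers.Conv2D(16, 3, activation="relu"),  # learnable filters
        layers.MaxPooling2D(),                    # downsample feature maps
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),      # fully connected layer
        layers.Dense(1, activation="sigmoid"),    # class probability
    ])

    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=10,
    #           validation_split=0.2)  # placeholders for real data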

Both SVM and CNN approaches have their strengths and weaknesses. SVM is often preferred
for small datasets or when feature engineering is crucial, as it allows better control over the
feature selection process. On the other hand, CNNs excel in learning intricate features
automatically from raw images, making them highly effective when large labeled datasets are
available. In practice, researchers and medical professionals may choose the approach that suits
their specific dataset and requirements best.

Figure 2.7 How CNN algorithm works (Ashley, 2023)

2.6 General Architecture of Machine Learning
Machine learning is a subfield of Artificial Intelligence that deals with building computer systems that can automatically learn from data without being explicitly programmed. The general architecture of machine learning typically involves several stages or components that work together to train a model that can make predictions or decisions based on input data. It refers to the overall structure and flow of the machine learning process, which includes stages such as data collection, data pre-processing, feature engineering, data modelling, evaluation, and deployment.

Figure 2.8 General architecture of machine learning (Tripathi et al., 2021)

2.6.1 Data Collection


Data collection is an important component of machine learning that involves the process of
gathering and acquiring data from various sources such as databases, sensors, and web pages.
The quality and quantity of the data collected during this stage have a significant impact on the
accuracy and performance of the machine learning model. In machine learning, data is the most
critical component, as the model's ability to make accurate predictions depends on the quality
and relevance of the data. Data collection involves several steps, including identifying the data
sources, extracting the relevant data, and storing it in a suitable format for further processing.
The data sources can be structured or unstructured, depending on the nature of the problem being
solved by the machine learning system. For instance, structured data can be obtained from
databases, while unstructured data can be obtained from social media platforms, online reviews,
and other text-based sources. Once the data sources have been identified, the next step is to
extract the relevant data from them. This involves filtering out irrelevant data, such as duplicates
or noise, and selecting only the data that is relevant to the problem being solved. In some cases,
the data may need to be transformed to a common format to ensure consistency and
compatibility. After the relevant data has been extracted, it needs to be stored in a suitable format
for further processing. This may involve converting the data into a standard format such as CSV,
JSON, or XML. The data may also need to be organized into a database or data warehouse for
easy access and retrieval.

Data collection is a critical component of machine learning, as it sets the foundation for the rest
of the machine learning process. The quality and relevance of the data collected during this stage
have a significant impact on the accuracy and performance of the machine learning model. As
such, it is essential to ensure that the data collection process is thorough, efficient, and reliable.
This can be achieved through careful planning, data cleaning, and the use of appropriate tools
and techniques.

2.6.2 Data Pre-processing


Data pre-processing is a component of machine learning that involves cleaning, transforming,
and preparing data for analysis. In the context of machine learning, data pre-processing is the
second stage of the machine learning process, which comes after data collection and before
feature engineering. The goal of data pre-processing in machine learning is to prepare the
collected data in a format that can be easily analysed and processed by the machine learning
algorithms. This involves cleaning the data to remove irrelevant data, dealing with missing data,
and transforming the data into a suitable format for the machine learning algorithms. Data pre-
processing involves several steps, including data cleaning, data transformation, and data
normalization.

Data Cleaning: Data cleaning involves removing irrelevant data and dealing with missing data.
This is important because the quality of the data used to train the machine learning model has a
significant impact on the accuracy and performance of the model. In data cleaning, data points
with incorrect, incomplete, or inconsistent information are removed or corrected.

Data Transformation: Data transformation involves converting data from one format to
another. This can involve converting categorical data into numerical data or changing the scale
of numerical data. This step is essential because different machine learning algorithms require
data in different formats.

Data Normalization: Data normalization is the process of scaling the data to fit within a specific
range. This is important because some machine learning algorithms are sensitive to the scale of
the data. Normalization ensures that the data is on a common scale, making it easier to compare
and analyse.
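The short sketch below walks through cleaning, transformation, and normalization on a toy pandas DataFrame; the column names are hypothetical, not fields from any dataset used in this work:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with a missing value and a categorical column (hypothetical fields).
df = pd.DataFrame({
    "age": [54.0, 61.0, None, 47.0],
    "smoker": ["yes", "no", "yes", "no"],
})

# Data cleaning: drop duplicate rows and fill the missing value with the median.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: convert the categorical data into numerical data.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# Data normalization: scale both features into the [0, 1] range.
df[["age", "smoker"]] = MinMaxScaler().fit_transform(df[["age", "smoker"]])
print(df)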

Data pre-processing is essential in machine learning because it ensures that the data used to train
the machine learning model is of high quality and in the right format. A well-processed dataset
can significantly improve the accuracy and performance of the machine learning model.
Therefore, it is crucial to pay close attention to data pre-processing when building machine
learning models.

2.6.3 Feature Engineering


Feature engineering is a component of machine learning that involves the selection, extraction,
and transformation of relevant features from raw data to improve the performance of a machine
learning model. In simple terms, feature engineering is the process of creating new input
variables or features from the raw data to help the machine learning algorithm to learn better and
make accurate predictions. The importance of feature engineering in machine learning cannot be
overemphasized, as it has a significant impact on the accuracy, speed, and scalability of the
machine learning models. In many real-world problems, the raw data may be complex and
contain irrelevant, noisy, or missing information, making it difficult for the machine learning
algorithm to learn and make accurate predictions. Feature engineering can help to address these
issues by reducing the dimensionality of the data, identifying relevant patterns, and transforming
the data into a more suitable format for the machine learning algorithm. There are several types
of feature engineering techniques that can be used to extract and transform features from the raw
data. Some of the popular techniques are:

Imputation: Imputation is a technique used to fill missing values in the data. Missing data can
affect the performance of a machine learning model. Imputation techniques such as mean
imputation, median imputation, and regression imputation can be used to fill in the missing
values.

Scaling: Scaling is a technique used to normalize the data to a specific range. Normalizing the
data can help to reduce the impact of outliers and improve the performance of the machine
learning model. Techniques such as min-max scaling and standard scaling can be used for
scaling.

Encoding: Encoding is a technique used to convert categorical data into numerical data that can
be used by machine learning algorithms. Techniques such as one-hot encoding and label
encoding can be used for encoding.

Feature Selection: Feature selection is a technique used to select the most relevant features from
the data. Selecting relevant features can help to reduce the dimensionality of the data and
improve the performance of the machine learning model. Techniques such as correlation analysis
and feature importance can be used for feature selection.

Feature Extraction: Feature extraction is a technique used to create new features from the
existing features. Creating new features can help to capture more information from the data and
improve the performance of the machine learning model. Techniques such as principal
component analysis and wavelet transformation can be used for feature extraction.
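A brief sketch tying several of these techniques together with scikit-learn and pandas is shown below; the data is synthetic and the column names are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.DataFrame({
    "cough_weeks": [2.0, np.nan, 6.0, 1.0, 8.0],                 # numeric, with a gap
    "chest_pain": ["mild", "severe", "mild", "none", "severe"],  # categorical
})
y = np.array([0, 1, 1, 0, 1])  # synthetic labels

# Imputation: fill the missing value with the column mean.
num = SimpleImputer(strategy="mean").fit_transform(df[["cough_weeks"]])

# Scaling: standardize the numeric feature.
num = StandardScaler().fit_transform(num)

# Encoding: one-hot encode the categorical feature.
cat = pd.get_dummies(df["chest_pain"]).to_numpy(dtype=float)

# Feature selection: keep the 2 features most associated with the labels.
X = np.hstack([num, cat])
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)
print(X_best.shape)   # (5, 2)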

The success of a machine learning model depends on the quality of the features used, and feature
engineering plays a critical role in creating relevant and informative features that can help to
improve the performance of the machine learning model.

2.6.4 Data Modelling


Data modelling is a component of machine learning that involves creating a mathematical
representation of the underlying structure and relationships within the data. It is a process of
creating a model that can predict or describe the behaviour of the data. The goal of data
modelling is to create a model that can generalize well to new data and make accurate
predictions. Data modelling involves several steps, including selecting an appropriate model,
defining the model parameters, training the model on a dataset, and testing the model's
performance on a new dataset. The selection of the appropriate model is critical since the model
should be able to capture the underlying patterns and relationships within the data. The most
common types of models used in data modelling include linear regression, decision trees,
random forests, and neural networks.

Once the appropriate model is selected, the next step is to define the model parameters. The
model parameters are the values that the model needs to learn during the training process. The
process of learning these parameters is known as model training. The training process involves
feeding the model with a dataset and adjusting the model parameters until the model can
accurately predict the output for the given input. The model training process involves several
techniques, such as optimization algorithms, regularization techniques, and hyperparameter
tuning. Optimization algorithms such as gradient descent are used to update the model
parameters iteratively until the model can accurately predict the output. Regularization
techniques such as L1 and L2 regularization are used to prevent overfitting of the model.
Hyperparameter tuning involves selecting the optimal values for the model's hyperparameters,
such as learning rate and batch size. After the model is trained, the next step is to evaluate the
model's performance on a new dataset.

2.6.5 Evaluation
Evaluation is a component of machine learning that involves assessing the performance of the
trained model. It is the final step in the machine learning process and is used to determine how
well the model can generalize to new, unseen data. The evaluation process involves measuring
the accuracy, precision, recall, F1 score, and other metrics to assess the performance of the
model. The primary purpose of evaluation is to validate the performance of the model and
determine whether it is suitable for the intended task. It helps to identify any issues or errors that
may have occurred during the training process and provides insights into how the model can be
improved.

The evaluation process involves splitting the dataset into two parts, namely the training set and
the testing set. The training set is used to train the model, while the testing set is used to evaluate
its performance. The testing set should be representative of the real-world data that the model is
intended to work with. There are various evaluation metrics used in machine learning, and the
choice of metrics depends on the nature of the problem being solved. Some of the commonly
used evaluation metrics include:

Accuracy: It is the most basic evaluation metric and measures the percentage of correctly
classified samples.

Precision: It is the ratio of correctly predicted positive samples to the total number of positive
samples. Precision is used when the goal is to minimize false positives.

Recall: It is the ratio of correctly predicted positive samples to the total number of actual
positive samples. Recall is used when the goal is to minimize false negatives.

F1 Score: It is a weighted average of precision and recall and provides a balance between the
two metrics.

ROC Curve: It is a graphical representation of the true positive rate against the false positive
rate at different classification thresholds.

Confusion Matrix: It is a table that summarizes the performance of a classification model by comparing the actual and predicted values.
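The snippet below computes these metrics for a toy set of predictions using scikit-learn; the label vectors are invented purely for illustration:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (synthetic)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (synthetic)

print("Accuracy :", accuracy_score(y_true, y_pred))    # fraction classified correctly
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of the two
print(confusion_matrix(y_true, y_pred))                # actual vs. predicted counts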

In addition to these metrics, there are other advanced evaluation techniques such as cross-
validation, which involves splitting the dataset into several subsets and training the model on
different subsets to improve its performance. In summary, evaluation is a critical component of
machine learning that involves assessing the performance of the model. It helps to determine
whether the model is suitable for the intended task and provides insights into how it can be
improved. The choice of evaluation metrics depends on the nature of the problem being solved,
and there are various advanced techniques that can be used to improve the performance of the
model.

2.6.6 Deployment

Deployment is an essential component of machine learning that refers to the process of making a
machine learning model available for use in the real world. In other words, it involves deploying
the trained machine learning model to a production environment where it can be used to make
predictions on new data. The deployment phase is the final step in the machine learning pipeline,
after the model has been trained, tested, and evaluated. The deployment phase of machine
learning involves taking the trained model and integrating it into a system that can receive input
data, process it, and output the predicted results. This can involve building an application that
integrates the machine learning model with other components, such as a web server or mobile
app. The deployment process can be challenging and requires careful consideration of factors
such as scalability, reliability, security, and performance. There are several different methods of
deploying machine learning models, including:

 Hosting on cloud platforms: One popular method is to host the machine learning model
on cloud platforms such as AWS, Google Cloud, or Microsoft Azure. These platforms
provide scalable and reliable infrastructure for hosting machine learning models and can
also provide additional services such as auto-scaling, load balancing, and security.
 Containerization: Another method of deployment is containerization using technologies
such as Docker or Kubernetes. Containerization allows the machine learning model to be
packaged with all its dependencies and run consistently across different environments.
 Microservices: Machine learning models can also be deployed as microservices, which
are small, modular components that can be independently deployed and scaled. This
approach can make it easier to integrate machine learning into existing applications or
systems.
 Mobile deployment: Machine learning models can also be deployed on mobile devices,
such as smartphones or tablets, using technologies such as TensorFlow Lite or Core ML.
This approach can enable real-time predictions on device, without the need for a network
connection.

The deployment phase is a critical component of machine learning, as it enables the machine
learning model to be used in the real world to solve real-world problems. Effective deployment
requires careful planning and consideration of various factors, including the choice of
deployment method, infrastructure requirements, and operational considerations such as
monitoring, logging, and maintenance.

2.7 Related works


Lung cancer is a critical health concern, and researchers have been actively exploring methods to
improve its detection and early diagnosis. The use of machine learning algorithms for symptom-
based lung cancer detection has gained considerable attention in recent years. With the
prevalence of lung cancer and the importance of timely diagnosis and treatment, scientists have
been investigating various approaches to accurately identify individuals who may have lung
cancer. One area of focus in this domain is the development of symptom-based lung cancer
detection systems using machine learning algorithms. These systems leverage patient symptom
data, such as persistent cough, chest pain, and weight loss, to identify individuals at higher risk
of having lung cancer. Machine learning algorithms, including Random Forests and Support
Vector Machines, are employed to analyze this data and predict the likelihood of lung cancer
presence in a patient.

Researchers have studied this subject using diverse datasets, which vary in quality and features.
By carefully selecting and extracting relevant information from these datasets, authors have been
able to draw meaningful conclusions from their research. Several significant studies have
contributed to the advancement of machine learning-based lung cancer detection, offering hope
for improved diagnosis and outcomes for patients. The following research works shed more light on the subject.

Abbas et al. (2023) proposed a study titled "Fused Weighted Federated Deep Extreme Machine
Learning Based on Intelligent Lung Cancer Disease Prediction Model for Healthcare 5.0." In this
study, the authors address the challenge of accurate disease prediction in the era of smart
healthcare industry 5.0 and information technology advancement. Particularly, the accurate
prediction of deadly cancer diseases, such as lung cancer, is crucial for human well-being in this
context.

Alsheikhy et al. (2023) proposed "A CAD System for Lung Cancer Detection Using Hybrid
Deep Learning Techniques," a study that introduces a fully automated and practical system for
identifying and classifying lung cancer. The primary objective of this system is to detect cancer
at an early stage, potentially saving lives or reducing mortality rates. The authors employ a
combination of deep learning techniques, including a deep convolutional neural network
(DCNN) called VGG-19, and long short-term memory networks (LSTMs). These techniques are
customized and integrated to detect and classify lung cancers, with image segmentation
techniques also applied. The system falls under the category of computer-aided diagnosis (CAD).

Gupta et al. (2022) proposed a study titled "A Study On Prediction Of Lung Cancer Using
Machine Learning Algorithms." In this paper, the authors conducted image classification and
applied various machine learning algorithms to a lung cancer disease dataset in order to calculate
measures such as accuracy and sensitivity. The study employed K-NN, Random Forest, and
SVM algorithms to analyze the initial stage of lung cancer by applying them to the lung cancer
dataset.

Kumar et al. (2022) developed a study titled "Lung Cancer Prediction from Text Datasets Using
Machine Learning". The aim of their research was to optimize the process of lung cancer
detection using a machine learning model based on support vector machines (SVMs). The study
utilized an SVM classifier to classify lung cancer patients based on their symptoms, while
implementing the model using the Python programming language.

Li et al. (2022) developed "Machine learning for lung cancer diagnosis, treatment, and
prognosis" in which they conducted a review aiming to provide an overview of machine
learning-based approaches that enhance different aspects of lung cancer diagnosis and therapy.
The authors specifically focused on early detection, auxiliary diagnosis, prognosis prediction,
and immunotherapy practice. They discussed the various applications of machine learning
techniques in these areas and highlighted both the challenges and opportunities for future
advancements in the field of machine learning for lung cancer.

Nageswaran et al. (2022) proposed a study titled "Lung cancer classification and prediction using
machine learning and image processing." This research demonstrates the accurate classification
and prediction of lung cancer through the utilization of machine learning and image processing
technologies. The study involved the collection of photos, specifically 83 CT scans obtained
from 70 distinct patients, which served as the dataset for the experimental investigation. To
enhance image quality, a geometric mean filter was applied during picture preprocessing.

Nemlander et al. (2022) proposed a study titled "Lung cancer prediction using machine learning
on data from a symptom e-questionnaire for never smokers, former smokers, and current
smokers." The aim of their study was to examine the predictive capability of symptoms reported
in an adaptive e-questionnaire for lung cancer, separately for individuals who have never
smoked, former smokers, and current smokers.

Shimazaki et al. (2022) developed a study titled "Deep learning-based algorithm for lung cancer
detection on chest radiographs using the segmentation method." In this research, the authors
aimed to develop and validate a deep learning (DL)-based model for detecting lung cancer on
chest radiographs. They utilized the segmentation method and evaluated the model's performance
in terms of sensitivity and mean false positive indications per image (mFPI).

Chaturvedi et al. (2021) proposed a study titled "Prediction and classification of lung cancer
using machine learning techniques." In this paper, the authors focus on listing, discussing,
comparing, and analyzing several methods related to image segmentation, feature extraction, and
various techniques for the early detection and classification of lung cancer. The goal is to explore
the effectiveness of these approaches in improving the accuracy and efficiency of lung cancer
detection, ultimately leading to better patient outcomes.

Madan et al. (2019) developed a study titled "Lung cancer detection using deep learning." The
purpose of their work was to enhance the accuracy of the initial stage detection of lung cancer by
leveraging image processing techniques. To conduct their experiments, the researchers obtained
CT images from the NIH/NCI Lung Image Database Consortium (LIDC) dataset, which
provided them with the opportunity to analyze and evaluate their proposed approach.

2.8 Comparison of related works


Over the years, several methods have been proposed for the detection of lung cancer. These
methods vary in their accuracy, cost, and invasiveness. A table of the different techniques
involved in the diagnosis of lung cancer is shown below in Table 2.1.

Table 2.1: Comparison of Lung Cancer Techniques


S/N | Author(s) | Strategy | Limitations | Performance
1 | Abbas et al. (2023) | Fused weighted federated deep extreme machine learning model | The proposed fused weighted model is limited to the lung cancer disease dataset. | 97.2%
2 | Alsheikhy et al. (2023) | VGG-19 and long short-term memory networks (LSTMs) | The research does not discuss the interpretability of the developed model. | 98.8%
3 | Gupta et al. (2022) | K-NN, Random Forest, and SVM | The dataset used does not include a comprehensive set of relevant features for predicting lung cancer. | 84.2%
4 | Kumar et al. (2022) | SVM, SMOTE | |

2.8.1 Research Gap


A review of the literature found that many researchers have used a variety of techniques to create
different types of prediction models using machine learning algorithms such as SVM, CNN,
ANN, K-NN, Random Forest, VGG-19, and LSTMs. However, these models have some
drawbacks, such as limited datasets, overfitting of the data during pre-processing, and models
that are not relevant to the real world due to their predefined sizes. These limitations can lead to
inefficient or inaccurate model performance. To address these issues, this model will use a real-
time dataset, a large dataset size, and only the necessary pre-processing and feature extraction to
prevent overfitting and underfitting. The model will then be evaluated using several metrics to
ensure that it is accurate.

CHAPTER THREE

METHODOLOGY

3.1 General Overview


This chapter focuses on designing a machine learning model to predict lung cancer in potential
patients. It outlines the steps involved, including data collection, preprocessing, feature
extraction, and the use of Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) as classifiers. The chapter also mentions evaluating the model's performance using
metrics like accuracy, recall, precision, and the confusion matrix.

3.2 Description of the Proposed System


Figure 3.1 illustrates the proposed prediction system, which aims to address the challenge of
predicting lung cancer in patients. The focus is on developing a comprehensive and consistent
system by effectively integrating machine learning approaches using Support Vector Machines
(SVM) and Convolutional Neural Network (CNN). The primary objective of this research is to
enhance the reliability of identifying and predicting lung cancer in patients.

(Pipeline stages: processing, data balancing, feature extraction, feature selection, classification, and evaluation measure)

Figure 3.1 Proposed Lung Cancer Prediction Model

3.2.1 Data Collection


This research work utilized the Survey Lung Cancer dataset, acquired from Kaggle and provided in CSV format. The dataset comprises essential information regarding the presence of various
symptoms and the outcomes, specifically whether an individual has lung cancer. It consists of
data from 310 patients, encompassing 15 attributes, including physical characteristics and
symptoms like Age of the patient, Smoking, Yellow fingers, Anxiety, Peer pressure, Chronic
Disease, Fatigue, Allergy, Wheezing, Alcohol, Coughing, Shortness of Breath, Swallowing
Difficulty, Chest pain, and Lung Cancer. These attributes facilitate training the model to adapt to
multiple potential symptom variations. The entire dataset was employed for training the
prediction system's model.
3.2.2 Data Pre-processing
The pre-processing stage receives the gathered data as input. To make the acquired raw data useful in the subsequent phases, it undergoes a variety of initial processing steps. Data pre-processing is used to convert unclean and unusable raw data into a tidy and useful structure for creating models.

The following stages are taken in the development:

Step 1: Import the required Python libraries.

Step 2: Verify the dataset's dimensions, including its rows, columns, form, statistical summary,
etc.

Step 3: Clean data to remove gaps and outliers, look for duplicates, deal with missing data, and
handle null values.

Step 4: Scale and normalize - Our dataset most likely consists of features with different scales.
Scaling standardizes the independent variables of a dataset into a particular range.

Step 5: Split the dataset into training and testing on a ratio of 80:20
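A minimal sketch of Steps 4 and 5 is given below, assuming the categorical columns have already been encoded to numbers in Step 3; the file and target column names follow the Kaggle survey dataset but should be checked against the actual file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("survey_lung_cancer.csv")    # assumed file name

X = df.drop(columns=["LUNG_CANCER"])          # assumed target column name
y = df["LUNG_CANCER"]

# Step 5: split the dataset into training and testing on a ratio of 80:20.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 4: fit the scaler on the training data only, to avoid information leakage.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)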

3.2.3 Data Balancing

Data balancing, in the context of machine learning and data analysis, refers to the process of
adjusting the class distribution of a dataset to ensure that each class is represented fairly. This is
particularly important in situations where one class significantly outweighs the others, leading to
biased model predictions. Techniques for data balancing include oversampling the minority class, undersampling the majority class, or using more advanced methods like SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data points.
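A hedged sketch of SMOTE using the imbalanced-learn library follows, assuming the library is installed and that X_train and y_train are the encoded training arrays from the previous step:

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before balancing:", Counter(y_train))   # e.g. skewed class counts

# Generate synthetic minority-class samples until the classes are balanced.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("After balancing :", Counter(y_train_bal))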

3.2.4 Feature Extraction


The feature extraction process aims to reduce the number of features or input variables in a dataset by selectively removing or enhancing them based on the raw data. Dealing with a large number of characteristics makes it more complex to analyze the training information and build an accurate prediction model. To tackle this challenge, we combine the Pearson correlation method with a correlation heat map for this particular project.

Here's a step-by-step outline of our approach:

Step 1: We start by importing a Python library that is essential for feature evaluation.

Step 2: Among various methods available for numerical variables, we opt for the widely used
Pearson correlation method. It assigns values between −1 and 1, where 0 indicates no correlation,
1 signifies complete positive correlation, and −1 represents complete negative correlation. A
correlation value of 0.7 between two variables indicates a significant and positive relationship
between them. A positive correlation implies that when variable A increases, variable B also
tends to increase, while a negative correlation suggests that as A increases, B decreases.

Step 3: The correlation coefficient helps us understand whether the characteristics are positively
or negatively related to the target variable. To identify which characteristics are most associated
with the target variable in this project, we utilize a heat map.

Step 4: By combining the insights from both feature evaluation methods, we select the common
features that will be used for modeling purposes.
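A short sketch of Steps 2 and 3 follows, assuming df is the encoded DataFrame from pre-processing and that the target column is named LUNG_CANCER (an assumption to verify against the dataset):

import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Pearson correlation matrix over all numeric features.
corr = df.corr(method="pearson")

# Step 3: heat map used to spot the features most associated with the target.
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlation heat map")
plt.show()

# Correlations with the target, strongest positive first.
print(corr["LUNG_CANCER"].sort_values(ascending=False))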

3.2.5 Feature Selection


Feature selection is the process of choosing a subset of relevant features (variables, predictors)
from the original set of features to use in model training. It aims to reduce the dimensionality of
the dataset while retaining the most important information. By selecting only the most relevant
features, feature selection can improve model performance, reduce overfitting, and speed up
training. Techniques for feature selection include filter methods (e.g., correlation-based ranking), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO regression).

3.2.6 Classification Algorithms


The current phase is the building of the prediction model, which utilizes the support vector machine and convolutional neural network methods. These algorithms are selected largely for their capacity to cut down on training time and boost prediction precision. The two algorithms are applied sequentially, with the results from the feature engineering stages fed into the support vector machine and the convolutional neural network. The output of both algorithms is then compared in terms of accuracy, and the algorithm that produces the results with the highest accuracy will be used to develop the prediction system.

Support Vector Machine

Here are a few examples of Support Vector Machine (SVM) variants:

i. Linear SVM: This is the standard SVM for linearly separable data. It finds a hyperplane
that best separates two classes.

ii. Non-Linear SVM: SVM can handle non-linear data using kernel functions. Examples of kernel functions include the Polynomial kernel, Gaussian Radial Basis Function (RBF) kernel, and Sigmoid kernel.

iii. Support Vector Regression (SVR): Instead of classification, SVR is used for regression tasks. It tries to fit a "tube" around the data points, aiming to have as many points within this tube as possible.

iv. One-class SVM: used for outlier detection, this algorithm identifies observations that
deviate significantly from the majority of the data.

v. Nu-SVM: This variant of SVM uses a parameter called nu (ν) to control the trade-off between achieving a broader margin and allowing some misclassification.

vi. Multi-class SVM: SVMs can be extended to handle multiple classes using strategies like
One-vs-One (OVO) and One-vs-All (OVA) approaches.

In the realm of supervised machine learning, the Support Vector Machine (SVM) stands out as a
robust and expeditious classification model designed to tackle classification problems. Its
efficiency is particularly pronounced when dealing with limited datasets. The SVM represents
instances as points in space and skillfully delineates diverse categories by maximizing the gap or
width between them using hyperplanes. An exemplary application of this approach lies in
medical datasets, where the SVM is often employed to categorize symptom parameters into
groups exhibiting similar patterns. By utilizing hyperplanes, the SVM classifier efficiently
segregates these parameters into distinct groups. What sets the SVM apart is its ability not only
to classify known data but also to predict the class of previously unseen data. The ultimate goal
is to identify the optimal hyperplane, which boasts the most extensive separator margin.

One way to represent the linear SVM equation is as follows:

f(x) = w^T x + b (1)

In the context of SVM, we have an input vector represented by x, a weight vector denoted as w, a
bias term indicated by b, and a decision function, f(x), responsible for classifying the input vector
into a specific class. The primary objective of SVM is to identify the most suitable values for w
and b, which effectively enhance the margin between the hyperplane and the nearest data points
belonging to different classes.
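As a small worked example with made-up numbers (not values from any trained model), the decision function in Equation (1) classifies a point by the sign of f(x):

import numpy as np

w = np.array([0.8, -0.4])    # hypothetical learned weight vector
b = -0.2                     # hypothetical bias term
x = np.array([1.5, 0.5])     # input feature vector

f = w @ x + b                # f(x) = w^T x + b = 0.8*1.5 - 0.4*0.5 - 0.2 = 0.8
label = 1 if f >= 0 else -1  # the sign of f(x) gives the predicted class
print(f, label)              # 0.8  1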

Convolutional Neural Network

Here are some examples of CNN architectures:

i. LeNet-5: One of the earliest CNNs, developed by Yann LeCun and used for handwritten digit recognition.

ii. AlexNet: A breakthrough architecture in the ImageNet Large Scale Visual Recognition Challenge in 2012, designed by Alex Krizhevsky and Ilya Sutskever.

iii. VGGNet: known for its simplicity and depth, VGGNet has various versions, such as
VGG16 and VGG19, with many layers.

iv. GoogLeNet (Inception): Developed by Google, this architecture introduced the concept of inception modules for efficient deep learning.

v. ResNet: Residual Networks introduced residual connections to address the vanishing gradient problem, allowing for very deep networks.

vi. MobileNet: Designed for mobile and embedded vision applications, MobileNet uses
depthwise separable convolutions for efficiency.

A CNN (Convolutional Neural Network) is a type of deep learning architecture that is primarily
used for image recognition and computer vision tasks. It is designed to automatically and
adaptively learn spatial hierarchies of features from input images, allowing it to identify patterns,
objects, and structures within the images. The key components of a CNN are convolutional
layers, pooling layers, and fully connected layers. Here's a brief overview of each:

 Convolutional Layer: The convolutional layer is the core building block of a CNN. It
consists of a set of learnable filters (also called kernels) that slide over the input image.
Each filter performs a convolution operation, which involves element-wise multiplication
of the filter with a local region of the input image, followed by summation. The result is a
feature map that highlights certain patterns or features found in the input image. The
formula for the convolution operation in 2D can be represented as follows:

F(i,j) = (I ∗ K)(i,j) = Σ_m Σ_n I(i+m, j+n) · K(m,n) (2)

 Pooling Layer: The pooling layer is used to reduce the spatial dimensions of the feature
maps obtained from the convolutional layers. It helps in reducing the computational
complexity and making the network more robust to small variations in the input. The
most common type of pooling is max-pooling, which takes the maximum value from a
local region of the feature map and retains only the most significant information. The
formula for max-pooling in 2D can be represented as follows:

O(i,j) = max_(m,n) F(i·s + m, j·s + n) (3)

where the maximum is taken over the local pooling window and s is the stride.

 Fully Connected Layer: After several convolutional and pooling layers, the final feature
maps are flattened into a 1D vector and passed through one or more fully connected
layers. These layers are similar to those in a traditional neural network, connecting all
neurons from the previous layer to all neurons in the current layer. They help in learning
complex non-linear relationships between the extracted features and the output classes.
The formula for a fully connected layer is standard and involves a matrix multiplication:

Output = Activation (Input ⋅ Weights + Bias) (4)

3.2.7 Evaluation
The model's performance is measured in terms of recall, accuracy, precision, and F1-score.

i. Accuracy: quantifies the percentage of instances that are classified correctly among all
instances. It is computed by dividing the number of accurate predictions by the total
number of predictions. Essentially, it represents the proportion of correct predictions
made by our model, reflecting its overall correctness.

Accuracy = Number of correct predictions / Total number of predictions (5)

ii. Precision: evaluates the accuracy of identifying true positive instances among the
predicted positives. It quantifies the proportion of true positives out of all positive
predictions. This assessment reflects the model's effectiveness in predicting a particular
category and is employed to gauge its capability in correctly classifying positive values.

Precision = True positives / (True positives + False positives) (6)

iii. Recall: revolves around accurately identifying positive instances among the actual
positives. Mathematically, it represents the true positives divided by the total count of
actual positive instances. This metric provides insights into how effectively the model
detects a particular category and assesses its capacity to predict true positive values.

Recall = True positives / (True positives + False negatives) (7)

iv. F1-Score: serves as a balanced metric, taking into account both precision and recall
simultaneously. When there is a need to consider both precision and recall, the F1 Score
comes in handy, as it embodies the harmonic mean of these two metrics.

F1-Score = (2 × Precision × Recall) / (Precision + Recall) (8)

The application of the confusion matrix technique aids in obtaining the essential parameters for
evaluating model performance. Primarily employed for assessing classification models, this two-
dimensional table arranges the model's predicted labels in columns and the true class labels in
rows. The confusion matrix enables the derivation of crucial metrics such as True Positive (TP),
True Negative (TN), False Positive (FP), and False Negative (FN). Table 3.1 visually illustrates the extraction of these values from the table, which subsequently serves as the foundation for calculating the model's performance metrics.

Table 3.1: Sample of confusion matrix (Waart, 2022)

True class \ Predicted class | Positive | Negative
Positive | True Positive (TP) | False Negative (FN)
Negative | False Positive (FP) | True Negative (TN)

True positives (TP) occur when the model accurately predicts the positive class, correctly
recognizing an observation as part of the positive class.

False positives (FP) happen when the model predicts the positive class incorrectly, wrongly
identifying an observation as belonging to the positive class when it actually does not.

True negatives (TN) are instances where the model correctly predicts the negative class,
accurately identifying an observation as not belonging to the positive class.

False negatives (FN) are situations where the model predicts the negative class incorrectly,
mistakenly classifying an observation as not belonging to the positive class when it actually
does.

3.2.8 Model Training


During the training phase, the algorithm is furnished with a collection of training data
comprising input features along with their corresponding target values. It then fine-tunes its
parameters iteratively, striving to make precise predictions for new input data. Throughout this
iterative process, the algorithm refines its parameters by continuously minimizing the disparity
between its predictions and the actual values in the training data. The paramount objective of
model training lies in crafting a generalized model capable of making accurate predictions even
on unfamiliar, previously unseen data. In other words, the model should demonstrate proficiency
in predicting target values for data it has never encountered during training.

3.2.9 Model Testing


Within the focus of machine learning, model testing involves assessing how well a trained
machine learning model performs on a distinct dataset, one that differs from the data used for
training. This critical process gauges the model's ability to generalize effectively when presented
with fresh, unseen data. Our unique method of model testing incorporates inputs sourced directly
from real users, which enables us to estimate the model's performance accurately on these novel
and previously unseen data points.

3.3 User Interface Design


The main focus of the user interface design is to anticipate user needs, leading to the
development of a straightforward, coherent, user-friendly, and responsive interface for user
engagement. This interface will be a single webpage built using Streamlit, a popular Python
open-source framework for data-driven web applications. Streamlit is especially advantageous
for creating interfaces for machine learning models, as it enables developers to easily generate
interactive visualizations and widgets for presenting and interacting with data and model outputs.
Through the interface, users can submit input requests to the server's classification model, and
the results and responses from the server will be displayed. The user interface will encompass
both input and output design elements.
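A minimal sketch of such a Streamlit page is shown below; the widget labels, feature encoding, and saved model file are hypothetical placeholders rather than the exact interface built for this system:

import pickle
import streamlit as st

st.title("Lung Cancer Prediction")

# Input design: collect symptom values from the user.
age = st.number_input("Age", min_value=1, max_value=120, value=50)
smoking = st.selectbox("Smoking", ["No", "Yes"])
coughing = st.selectbox("Coughing", ["No", "Yes"])

if st.button("Predict"):
    # Load a previously trained classifier (hypothetical file name).
    with open("lung_cancer_model.pkl", "rb") as f:
        model = pickle.load(f)

    features = [[age, int(smoking == "Yes"), int(coughing == "Yes")]]
    result = model.predict(features)[0]

    # Output design: display the prediction outcome on the same page.
    st.success("Prediction: " + ("Positive" if result == 1 else "Negative"))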

3.3.1 Input Design


Input design involves collecting user input and queries, and converting them into instructions
that the system can comprehend. It is crucial to create an efficient design to ensure users provide
accurate information. On the system's input page, users are prompted to input their information.
Once they've completed the form, they can click the predict button and wait for the results.

3.3.2 Output Design


This pertains to the diverse methods of showing or presenting information to the user. The design
will encompass comprehensive details about the prediction outcome, all of which will be visible
on the same page.

3.4 Software Component Requirements
Software refers to a set of instructions and programs that direct the operation of a machine.
These are created by programmers and software teams. In this study, the following software
components are used, each serving a specific purpose:

Operating system: To ensure optimal utilization of computing resources, Windows 7 and newer
versions are employed.

Integrated development environment (IDE): Visual Studio Code is utilized for writing the
system's source codes.

Web browsers: Google Chrome and Safari are the chosen web browsers.

Python 3.9: This is a versatile programming language used for various coding purposes.

Google Colaboratory (Google Colab): It is a cloud-based Jupyter notebook environment offered by Google. It enables users to write and run Python code directly in a web browser.
Google Colab provides free access to computing resources like CPU, GPU, and TPU, facilitating
collaboration on notebooks.

Streamlit: An open-source Python library used for constructing interactive web applications
focused on data science and machine learning. It simplifies the process of creating and deploying
data-driven applications without extensive HTML, CSS, or JavaScript coding.

GitHub: This web-based platform serves as a centralized location for version control and
collaboration among developers. It is built on the Git version control system, which tracks
changes in code over time and allows multiple developers to work on the same codebase
concurrently.

3.5 Hardware Component Requirements


The hardware components of the computer system are the physical parts that can be seen,
touched, and felt. These include the keyboard, monitor, hard drive, CPU, memory, and
mousepad. Below is a list of the specific hardware elements used in this investigation and their
respective purposes:

Processor: At least a 2GHz Core i5 8th generation processor.


Random Access Memory (RAM): At least 4GB of RAM.

Hard Disk: A minimum of 500GB for permanent storage of codes, datasets, and trained models.

Keyboard: A 101-key US traditional keyboard.

Mouse: A 3D mouse.

CHAPTER FOUR
RESULTS AND DISCUSSIONS

4.1 Introduction

The machine learning model is deployed by designing a web-based application through which
the user can get the predicted results by inputting the values for the required parameters such as
the result for age, gender, anxiety, peer pressure and so on. As these parameters are inputted into
the web application, they are sent to the backend where the machine learning model is stored. When the data is received by the backend, the model makes predictions on the outcome of the lung
cancer results and sends the response back to the frontend so the result can be displayed. The
degree of accuracy for this system is based on the number of records contained in the dataset
used to carry out the operation. The application is a medium by which data passes through to the
machine learning model; the model checks whether there is some resemblance to the dataset used in its construction, and it then learns from the data already available in the dataset
utilized. Finally, it delivers the appropriate outcome back to the user. The web application
designed in this study was tested with localhost and can be deployed online as a fully working
application i.e., it can be accessed at any time and from anywhere without restrictions. Users of
this system can test whenever they feel the need to, which in turn improves the model’s ability to
learn new things about the different data values presented to it.

This system runs on a web environment and is used with the following procedures on the
localhost:

Step 1: Start the XAMPP control panel and start the Apache server and MySQL database.

Step 2: On your preferred browser type “localhost/lung”. Once the web application loads up, it
displays a user interface (frontend) that initially shows information about lung cancer, its symptoms, and how it can be controlled. A login and registration page are provided for users to
access the system and register if not already on the system. When the user accesses the system
they are provided with the full functionalities of the system.

4.2 System Implementation and System Requirements

In the system architecture, the presentation layer is designed using HTML/CSS and jQuery (a JavaScript framework) as the user interface, the application layer is designed using PHP running on the server, and the data layer uses a MySQL database server installed on the computer system. Python was used in the backend layer for the machine learning predictions. The system captures users' details and allows information input from the user into the system's database, where users can easily access and view database information.

Figure 4.1 Home page

The image above shows the interface through which users interact with the software.

4.3 Developed Lung Cancer Prediction System

The prediction model is used to predict lung cancer in patients. The basic structure of the
prediction system is as follows: data acquisition, data preprocessing, feature extraction, training
and testing dataset and classification using the web-based application.

4.3.1 Data acquisition

The dataset used for this project was collected from Kaggle and contains 3,000 lung cancer CT images: 500 training and 200 testing images for adenocarcinoma, 550 training and 250 testing images for large cell carcinoma, 475 training and 175 testing images of normal lungs, and 575 training and 275 testing images for squamous cell carcinoma. Tables 4.1 and 4.2 show that the data was evenly distributed for a suitable machine learning task. The training dataset made up 70% of the data and the testing dataset made up 30%.

Figure 4.2 Data acquisition process

Table 4.1: Training Dataset (70%)

S/N | Types of lung cancer | Number of images | Percentage (%)
1 | Adenocarcinoma | 500 | 24
2 | Large cell carcinoma | 550 | 26
3 | Normal | 475 | 22
4 | Squamous cell carcinoma | 575 | 28
5 | Total | 2100 | 100

500 images for adenocarcinoma, 550 for large cell carcinoma, 475 for normal lungs, and 575 for squamous cell carcinoma were used for training, which sums up to 2,100 images. All 2,100 samples were used for training the model; the training dataset made up 70% of the data.

Table 4.2: Testing Dataset (30%)

S/N | Types of lung cancer | Number of images | Percentage (%)
1 | Adenocarcinoma | 200 | 22
2 | Large cell carcinoma | 250 | 28
3 | Normal | 175 | 19
4 | Squamous cell carcinoma | 275 | 31
5 | Total | 900 | 100

200 images for adenocarcinoma, 250 for large cell carcinoma, 175 for normal lungs, and 275 for squamous cell carcinoma were used for testing, which sums up to 900 images. All 900 samples were used for testing the model; the testing dataset made up 30% of the data.

4.3.2 Data preprocessing

Data preprocessing is one of the most important parts of machine learning: to prepare an accurate machine learning application, the data must be processed before being inputted into the system. Data preprocessing is a technique to prepare the raw data so that the model can learn from it. The data pre-processing step deals with the missing values in the dataset, converts the numeric values into string type so that we can perform one-hot encoding, and also handles the undersampling of the target attribute.

Step 1: In order to perform data preprocessing using Python, we need to import predefined Python libraries. These libraries are used to perform specific jobs such as removing gaps, noise, and outliers. Four specific libraries are used for data preprocessing: pandas, numpy, matplotlib, and seaborn.

Figure 4.3 Libraries for preprocessing

Step 2: Check the dataset's dimensions, such as its columns, shape, etc.

Figure 4.4 The shape and dimensions of the dataset

Step 3: The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning model.
Hence it is necessary to handle missing values present in the dataset.

Figure 4.5 Total number of null values

Step 4: Deal with categorical variables (text) by converting them to numerical variables (numbers) that machine learning understands. A machine learning model works entirely on arithmetic and numbers, so a categorical variable in the dataset may cause trouble while building the model. It is therefore necessary to encode these categorical variables into numbers.

Figure 4.6 Dataset with categorical data

Figure 4.7 Dataset after dealing with categorical data

4.3.3 Feature Extraction and selection

This step involves extracting and selecting features to get the best results from the models. Good features help to reveal the underlying data structures, while the presence of irrelevant features in the dataset makes it difficult to differentiate the impact of the relevant ones, significantly reducing the performance of the machine learning model. To curb this and identify essential symptoms, three methods were applied: the univariate feature selection method, the correlation coefficient, and feature importance. Symptoms common to all the feature selection methods are picked and used to train the model.

i. Correlation coefficient
Correlation matrix is a table used to define correlation coefficients among variables or
features, and it is a tool for the feature selection and extraction process. Each cell in the
matrix represents a relationship between two variables. It is used to summarize a large
dataset as well as to identify the most highly correlated features in the data. The
correlation coefficients value near 1 indicates that the features participating in the
correlation are highly correlated to each other; on the other hand, the correlation
coefficients value near 0 indicates that the features participating in the correlation are less
correlated to each other.

Figure 4.8 Heatmap of selected features
ii. Univariate Feature Selection
Statistical tests can be used to identify the features with the strongest relationship to the output variable. The scikit-learn library includes the SelectKBest class, which can be used in conjunction with a variety of statistical tests to select a specific number of features. Here, the 15 best features are selected using the chi2 statistical test for non-negative features (see the sketch after this list).
iii. Feature Importance
The importance of each feature of the dataset can be obtained using the feature importance property of the model. Feature importance gives a score for each feature of the data; the higher the score, the more important or relevant the feature is. Feature importance is an inbuilt class that comes with tree-based classifiers; here, Extra Trees classifiers are used to extract the top 15 features of the dataset (see the sketch after this list).

With these feature selection techniques, the features common to all three methods were obtained; a list of 12 features was selected, and every other feature with no relevance was dropped.
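A hedged sketch of the univariate and tree-based selection steps is shown below, assuming X is the encoded, non-negative feature DataFrame (as with 0/1 symptom flags, which chi2 requires) and y is the target column:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier

# Univariate selection: chi2 scores for non-negative features, top 15 kept.
selector = SelectKBest(score_func=chi2, k=15).fit(X, y)
chi2_scores = pd.Series(selector.scores_, index=X.columns)
print(chi2_scores.sort_values(ascending=False).head(15))

# Tree-based feature importance with an Extra Trees classifier.
trees = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(trees.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))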

4.3.4 Model Development

In machine learning, we split the selected features into a training set and a testing set. This is one of the crucial steps of data preprocessing, as it can greatly enhance the performance of the machine learning model. Suppose we train our machine learning model on one dataset and test it on a totally different dataset; the model will then struggle to understand the correlations in the data. Likewise, even if we train our model very well and its training accuracy is very high, its performance can still decrease when it is given a new dataset. So we always try to build a machine learning model that performs well with the training set and also with the test dataset.

The dataset was then further divided into two sets (training set and testing set) using the test-train
split method. The system was trained using the training set data, and the accuracy of the ML classifier was then tested using the testing set. Finally, the model was used to predict the likelihood of disease infection in terms of positive or negative outcomes using new patient data.

i. Training dataset
A subset of the dataset used to train the machine learning model, for which the output is already known. 70% of the dataset was fed to the machine learning model with the aid of the support vector machine and the convolutional neural network. This teaches the machine learning model to learn and make the desired predictions.

Figure 4.9 Training Dataset
ii. Testing dataset
A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output. 30% of the dataset is used to make predictions and evaluate the accuracy of the developed model, validating its efficacy and accuracy.

Figure 4.1.1 Testing dataset

4.3.5 User output
This typically refers to the results or outcomes generated by users interacting with a system, software, or service. It represents the information, actions, or changes produced as a result of user input or engagement.

Figure 4.1.2 predictions

The image above shows the predicted result to be large cell carcinoma for an uploaded image.

Figure 4.1.3 predictions

The image above shows the predicted result to be normal for an uploaded image.

Figure 4.1.4 predictions

The image above shows the predicted result to be adenocarcinoma for an uploaded image.

4.3.6 Evaluation Of Machine Learning Techniques


To evaluate the effectiveness of different ML techniques, we employed two common machine learning algorithms to predict lung cancer in the dataset: Support Vector Machine (SVM) and Convolutional Neural Network (CNN). This gives us the option to choose the best ML method for predicting lung cancer based on the factors we've provided. The classification procedure was evaluated using measures such as precision, accuracy, confusion matrix, recall, and F1-score. Precision is the percentage of positive predictions that are truly positive; accuracy measures how often the model is correct; sensitivity (sometimes called recall) measures how good the model is at predicting positives, meaning it looks at true positives and false negatives (positives that have been incorrectly predicted as negative); and the F-score is the "harmonic mean" of precision and sensitivity. Lastly, the confusion matrix is a table that is used to define the performance of a classification algorithm; it visualizes and summarizes that performance and is shown in four parts. These are:

i. True positive (TP): Observation is predicted positive and is actually positive.


ii. False positive (FP): Observation is predicted positive and is actually negative.
iii. True negative (TN): Observation is predicted negative and is actually negative.
iv. False negative (FN): Observation is predicted negative and is actually positive.
Table 4.3: Confusion matrix identifiers

The following formulas can be used to obtain the accuracy score, precision, sensitivity, specificity, and F1-score:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

F1-Score = (2 × Precision × Sensitivity) / (Precision + Sensitivity)

4.3.6.1 Performance Metrics for Support Vector Machine (SVM)

Figure 4.1.5 Performance metrics for SVM

According to Figure 4.1.5, the Support Vector Machine model provided a training accuracy of 52%. With such a result, the resulting machine learning model could be inaccurate on its own.

The confusion matrix is obtained from the machine learning code, after training the SVM model.
It will be used to describe the performance of a classification model on the training data.

Figure 4.1.6 confusion matrix for SVM

To evaluate the precision, sensitivity and specificity of the Support Vector Machine model, we
draw conclusions from the confusion matrix in Figure 4.1.6.

Precision: Precision is the percentage of accurately identified positive values. This can be
derived from the above confusion matrix using the following formula:

Precision = TP/(TP + FP) = 226/(226 + 33) = 226/259 = 0.87, where TP = 226 and FP = 33

Sensitivity: Sensitivity, also known as recall, is the percentage of true positive cases that are
accurately identified. This can be derived from the above confusion matrix using the following
formula:

Sensitivity = TP/(TP + FN) = 226/(226 + 49) = 226/275 = 0.82, where TP = 226 and FN = 49

Specificity: Specificity is the percentage of truly negative cases that are accurately identified.
This can be derived from the above confusion matrix using the following formula:

Specificity = TN/(TN + FP) = 192/(192 + 33) = 192/225 = 0.85, where TN = 192 and FP = 33

F1-Score = (2 × Precision × Sensitivity)/(Precision + Sensitivity) = (2 × 0.87 × 0.82)/(0.87 + 0.82) = 0.84 = 84%

4.3.6.2 Performance metrics for Convolutional Neural Network (CNN)

Figure 4.1.7 Performance metrics for CNN

The confusion matrix is obtained from the machine learning code, after training the CNN model.
It will be used to describe the performance of a classification model on the training data.

Figure 4.1.8 confusion matrix for CNN

To evaluate the precision, sensitivity and specificity of the CNN model, we draw conclusions
from the confusion matrix in Figure 4.1.8.

Precision: Precision is the percentage of accurately identified positive values. This can be
derived from the above confusion matrix using the following formula:

Precision = TP/(TP + FP) = 208/(208 + 36) = 208/244 = 0.88, where TP = 208 and FP = 36

Sensitivity: Sensitivity, also known as recall, is the percentage of true positive cases that are
accurately identified. This can be derived from the above confusion matrix using the following
formula:

Sensitivity = TP/(TP + FN) = 208/(208 + 43) = 208/251 = 0.85, where TP = 208 and FN = 43

Specificity: Specificity is the percentage of truly negative cases that are accurately identified.
This can be derived from the above confusion matrix using the following formula:

Specificity = TN/(TN + FP) = 213/(213 + 36) = 213/249 = 0.87, where TN = 213 and FP = 36

F1-Score = (2 × Precision × Sensitivity)/(Precision + Sensitivity) = (2 × 0.88 × 0.85)/(0.88 + 0.85) = 96%

Table 4.4: Comparison of SVM and CNN

Algorithm Precision Sensitivity Specificity F1-score

SVM 0.87 0.82 0.85 84%

CNN 0.88 0.85 0.87 96%

Two machine learning techniques were trained and tested: 70 percent of the dataset was used for
training the models while 30 percent was used for testing. The developed models achieved an
accuracy of 84% for SVM and 96% for CNN.
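The SVM implementation appears in the Appendix G source code, but the CNN does not; the following is a minimal sketch of the kind of image-based CNN classifier described in this chapter, assuming Keras, 150 x 150 grayscale inputs and the three prediction classes shown in Figures 4.1.2 to 4.1.4. The layer sizes are illustrative assumptions, not the project's exact architecture.

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 3  # large cell carcinoma, adenocarcinoma, normal (assumed from the figures above)

model = keras.Sequential([
    layers.Input(shape=(150, 150, 1)),        # assumed input size: 150x150 grayscale scans
    layers.Conv2D(32, 3, activation='relu'),  # learn local image features
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(num_classes, activation='softmax'),  # one probability per class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10)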

Table 4.5: Comparison of Lung Cancer Techniques

1. Abbas et al. (2023). Strategy: fused weighted federated deep extreme machine learning model. Limitation: the proposed fused weighted model is limited to the lung cancer disease dataset. Performance: 97.2%.

2. Alsheikhy et al. (2023). Strategy: VGG-19 and long short-term memory networks (LSTMs). Limitation: the research does not discuss the interpretability of the developed model. Performance: 98.8%.

3. Lamina, Felix (2024). Strategy: CNN, SVM, channel squeeze and excitation. Limitation: the proposed model suffers from limited generalizability due to training on specific datasets. Performance: 96%.

4.4 Discussion of Results


Two machine learning techniques were trained and tested. 70 percent of the data was used for
training the machine learning models while 30 percent was used for testing. Both the Convolutional
Neural Network and the Support Vector Machine produced accurate results, with the Convolutional
Neural Network ranking highest.
4.4.1 Ontology on Lung Cancer

Ontology is a data model that represents knowledge as a set of concepts within a domain and the
relationships between these concepts. Ontology is a form of knowledge management: it captures
the knowledge within an organization as a model, which can then be queried by users to
answer complex questions and display relationships. The two standards that govern the
construction of ontologies are the Resource Description Framework (RDF) and the Web Ontology
Language (OWL). In accordance with RDF and OWL, ontologies are made up of two
components, classes and relationships; classes are represented by ovals and relationships by
arrows, which can be used to represent real-life relationships. These relationships are called
triples. A triple consists of a subject, a predicate and an object. Triples are the heart of
ontologies; they can be merged together to represent a comprehensive view of the real world.

Ontology was used in this project work to show the relationship between processes involved in
Lung cancer prediction on this system.

For lung cancer predictions to be made, the patient class was necessary. The patient class has a
patient ID and a diagnosis subclass; the patient ID subclass is the identification number for the
patients, and the diagnosis subclass holds the diagnostic results of patients. The diagnosis subclass
contained the classification subclass, which represented the classification model used to predict
lung cancer abnormalities. The lung cancer abnormalities subclass consisted of the attributes needed
for lung cancer prediction. These attributes were in the form of subclasses such as Age, Sex,
LOH, LOS, alcohol CPD, salt CID, CKD, ATD, Smoking, Pregnancy, GPC, PA, BMI, etc. The
classification class gives results to the diagnosis class, which are further passed to the disease
class, which returns the lung cancer result to the patient class.
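For illustration, such patient-to-diagnosis triples can be expressed in Python with the rdflib library; the namespace, class and property names below are illustrative stand-ins, not the exact identifiers of the Protégé ontology:

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/lungcancer#")  # hypothetical namespace
g = Graph()

# Each statement is a (subject, predicate, object) triple.
g.add((EX.Patient001, EX.hasDiagnosis, EX.Diagnosis001))
g.add((EX.Diagnosis001, EX.classifiedAs, EX.Adenocarcinoma))

# Triples can be queried with SPARQL, e.g. "which diagnosis does each patient have?"
results = g.query("""
    SELECT ?patient ?diagnosis WHERE {
        ?patient <http://example.org/lungcancer#hasDiagnosis> ?diagnosis .
    }""")
for patient, diagnosis in results:
    print(patient, diagnosis)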

Overall, CNNs are a useful tool in machine learning and can provide valuable insights into the
relationships between features and the target variable. However, they should be used with
caution and in combination with other algorithms to mitigate their limitations.

4.5 Software Testing

Prior to the actual implementation of the system, it had to be tested comprehensively and every
discoverable error corrected. Since the system cannot be tested exhaustively, the black box testing
method was used for system testing. Black box testing demonstrates that software
functions are operational, that input is properly accepted and output correctly produced,
and that the integrity of external information (the database) is maintained.

It is pertinent to note that although all the program modules have been debugged, this does not
mean that they are completely error free, as logical errors might surface at any time during later
use of the system. System testing can be divided into:

4.5.1 Unit Testing

Unit testing was carried out on the individual modules of the system to ensure that they are fully
functional units. We did this by examining each unit; for example, the login page was checked
to ensure that it functions as required, that it captures data and other details, and that this data
is sent to the database. The success of each individual unit gave us the go-ahead to carry out
integration testing. All identified errors were dealt with. An illustrative unit test is sketched below.
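For illustration, a unit test of the kind described above could be written with Python's built-in unittest module; validate_login is a hypothetical stand-in, not the project's actual function:

import unittest

def validate_login(username, password):
    # Hypothetical stand-in for the login logic under test.
    return bool(username) and len(password) >= 8

class TestLoginPage(unittest.TestCase):
    def test_accepts_valid_credentials(self):
        self.assertTrue(validate_login("user1", "s3curePass"))

    def test_rejects_short_password(self):
        self.assertFalse(validate_login("user1", "abc"))

if __name__ == "__main__":
    unittest.main()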

4.5.2 Integration Testing

We carried out integration testing after the different modules had been put together to make a
complete system. Integration testing was aimed at ensuring that modules are compatible and can be
integrated to form a complete working system. For example, we tested that when a user is logged
in, he or she is linked to the appropriate module and can make predictions and access the
database.

4.6 Changeover Method


There are various methods used for changeover from an existing system to a new system. These
methods include parallel changeover, pilot changeover, direct changeover and phased changeover.
i. Parallel changeover: parallel changeover involves implementing both the old and new systems
simultaneously. It features minimal risk, making it a popular changeover technique, but it
has the disadvantage of higher cost, especially when compared with other methods.
ii. Pilot changeover: this involves selecting a part of the organization to operate the new
system in parallel with the existing system. If the section is satisfied with the new system, its users
cease using the old system and continue with the new one, and piloting is then transferred to another
part of the organization.

iii. Direct changeover: this is the changeover method in which the old (current or existing)
system is completely replaced by the newly introduced system in one move. This method is
quick and involves minimal workload, but it is very risky and may be unsuitable when the two
systems differ substantially.

iv. Phased changeover: this requires that the existing system is gradually phased out as the new
system is gradually phased in.

4.6.1 Changeover Method to Be Employed


After the System Development Life Cycle (SDLC) stages of this application have been evaluated
and tested, it is best to employ the parallel changeover technique. This method involves running
both the old and the new system side by side on live data, so that the efficiency and reliability
of the new system can be compared with that of the old system. Once the user is satisfied with
the new system, the old system is retired and the new system becomes fully active and utilized
across the organization.

4.7 System Maintenance

The process of modifying a system to meet changing needs is known as system maintenance.
System maintenance is a primary obligation that any computerized organization must take up
in order to ensure the efficiency and continuity of the developed system. It is a routine activity,
and it is essential to the smooth running of the system. The following practices and measures
must be taken to ensure that the new system does not break down and achieves its proposed
aims and objectives:
i. Password Management: each user is required to enter an individual username and password
when accessing the software; this keeps the data in the database secure. For maximum security,
each user must protect their password, and passwords should be stored as salted hashes rather
than plain text (a sketch follows this list).

ii. Regular Database Backup: this involves creating duplicates of data which act as an
insurance copy in case the active copy is damaged or destroyed. The backup is usually
stored on an external storage device. Recovery involves the use of specialized utility programs to
rebuild or replace damaged files. The best way to recover a file or program is to restore it from a
backup copy.

iii. Virus Protection: this requires the use of a program that protects a system from malicious
software known as viruses. A virus is a program that infects a computer and can damage a system
depending on its nature. Because new viruses must be analyzed as they appear, antivirus
software must be updated regularly to remain effective.

iv. Proper Use of the System: this includes starting (booting) and shutting down the system in
the right manner to prevent the system from hanging, and to avoid data corruption and file loss.
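As an illustration of point (i), the sketch below stores passwords as salted hashes using only Python's standard library; the function names are illustrative, not part of the developed system:

import hashlib
import hmac
import os

def hash_password(password, salt=None):
    # A random salt ensures identical passwords yield different hashes.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    # Recompute the hash with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)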

CHAPTER FIVE

CONCLUSION AND RECOMMENDATION

5.1 Conclusion

Lung cancer is a disease with multifactorial origins and etiology. It arises when abnormal cells
grow uncontrollably in the lungs, forming tumors that can interfere with lung function and, if left
untreated, can spread to other parts of the body. This underscores the crucial need for a
system that diagnoses lung cancer quickly and efficiently. The machine learning model is
deployed through a web-based application in which the user obtains predicted results by
inputting the values of the required parameters, such as age, gender, anxiety, peer pressure and
so on. As these parameters are entered into the web application, they are sent to the backend
where the machine learning model is stored. When the data is received by the backend, the
model makes a prediction on the lung cancer outcome and sends the response back to the
frontend so the result can be displayed. The degree of accuracy of this system depends on the
number of records contained in the dataset used to carry out the operation.
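A minimal sketch of this frontend-to-backend flow, assuming Flask and a model saved with joblib as shown at the end of the Appendix G source code; the route and field names are illustrative:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('model.joblib')  # model saved as in Appendix G

@app.route('/predict', methods=['POST'])
def predict():
    # The frontend posts the required parameters (age, gender, anxiety, ...).
    features = request.get_json()['features']
    prediction = model.predict([features])[0]
    # The prediction is returned to the frontend for display.
    return jsonify({'result': int(prediction)})

if __name__ == '__main__':
    app.run()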
We proposed this ontology-based lung cancer prediction system with the sole aim of predicting
lung cancer. Testing and evaluating the machine learning models on a clinical dataset by means
of two different algorithms reveals accuracies of 83.6%, 84.2%, 75% and 71% across the SVM
and CNN classifiers. We can conclude that using the CNN classifier we can reach a significant
accuracy level of 90% on training and 84.2% on testing.

An ontology was built for this system using Protégé and was used in our project work as a
knowledge base for the operations of our system. In future work we intend to apply the
algorithm in this project, together with other machine learning algorithms, to larger datasets with
more disease attributes to obtain higher accuracy, to build ontologies that can be queried using the
SPARQL query language, and to review gaps in more related works to broaden the scope of the
system.

5.2 Recommendation

To ensure proper and efficient adoption of the findings of this project work, it is recommended
that computerized prediction and management software be used consistently. It is
also recommended that two or more algorithms be added to the current one on a larger
dataset to find out whether a better result on accuracy can be achieved. Additionally, continuous
maintenance of the software is very important to keep up with the needs of new technologies.

5.3 Limitation of the study

Predictive models for lung cancer detection suffer from limited generalizability due to training
on specific datasets, leading to potential bias and inaccuracies when applied to diverse
populations or settings.

APPENDIX A

Graphical user interface of the developed lung cancer prediction system

The input screen of the developed lung cancer prediction system

APPENDIX B

The screen shows the various dataset collected for training

APPENDIX C

Prediction result for large cell carcinoma

APPENDIX D

Prediction result for adenocarcinoma lung cancer

APPENDIX E

Prediction result for a normal (non-cancerous) lung

APPENDIX F

SVM and CNN Chart

Bar chart comparing SVM and CNN (values in %). SVM: precision 0.87, recall 0.82, F1-score 0.84. CNN: precision 0.88, recall 0.85, F1-score 0.96.

APPENDIX G

Program source code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score, f1_score, fbeta_score,
                             classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

df = pd.read_csv('dataset.csv')

"""change text value in target data to number"""


e=LabelEncoder()
df['LUNG_CANCER']=e.fit_transform(df['LUNG_CANCER'])

##Analysis and Processing


print(df.head())
print(df.columns)
print(df.describe())
print(df.info())
print(df.isnull().any())

####visualization
##
##sns.distplot(df['AGE'], kde=False, color='Blue')
##sns.countplot(x='GENDER', data=df, hue='LUNG_CANCER')
##sns.countplot(x='AGE', data=df, hue='LUNG_CANCER')

##sns.distplot(df['ANXIETY'], kde=False, color='Blue')
##sns.distplot(df['ALCOHOL_CONSUMING'], kde=False)
##plt.show()

"""data and target variable"""


x=df.drop(['LUNG_CANCER'],axis=1)
y=df['LUNG_CANCER']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

"""Making the data balanced"""


from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)

###Machine Learning Model


model = RandomForestClassifier()
print(f'using RandomForestClassifier: ')
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
print(f'Training Accuracy :{accuracy_score(y_train,model.predict(x_train))}')
print(f'Testing Accuracy :{accuracy_score(y_test,y_pred)}')
print(f'Confusion matrix:\n {confusion_matrix(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'precision: {precision_score(y_test,y_pred)}')
print(f'F1-score: {f1_score(y_test,y_pred)}')
print(f'Fbeta-score: {fbeta_score(y_test,y_pred,beta=0.5)}')
print(classification_report(y_test,y_pred))
print('-'*33)

###Machine Learning Model


model = LogisticRegression()
print(f'using LogisticRegression(): ')
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
print(f'Training Accuracy :{accuracy_score(y_train,model.predict(x_train))}')

print(f'Testing Accuracy :{accuracy_score(y_test,y_pred)}')
print(f'Confusion matrix:\n {confusion_matrix(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'precision: {precision_score(y_test,y_pred)}')
print(f'F1-score: {f1_score(y_test,y_pred)}')
print(f'Fbeta-score: {fbeta_score(y_test,y_pred,beta=0.5)}')
print(classification_report(y_test,y_pred))
print('-'*33)

###Machine Learning Model


model = SVC()
print(f'using SVC(): ')
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
print(f'Training Accuracy :{accuracy_score(y_train,model.predict(x_train))}')
print(f'Testing Accuracy :{accuracy_score(y_test,y_pred)}')
print(f'Confusion matrix:\n {confusion_matrix(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'precision: {precision_score(y_test,y_pred)}')
print(f'F1-score: {f1_score(y_test,y_pred)}')
print(f'Fbeta-score: {fbeta_score(y_test,y_pred,beta=0.5)}')
print(classification_report(y_test,y_pred))
print('-'*33)

###Machine Learning Model


model = DecisionTreeClassifier()
print(f'using DecisionTreeClassifier(): ')
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
print(f'Training Accuracy :{accuracy_score(y_train,model.predict(x_train))}')
print(f'Testing Accuracy :{accuracy_score(y_test,y_pred)}')
print(f'Confusion matrix:\n {confusion_matrix(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'precision: {precision_score(y_test,y_pred)}')
print(f'F1-score: {f1_score(y_test,y_pred)}')
print(f'Fbeta-score: {fbeta_score(y_test,y_pred,beta=0.5)}')
print(classification_report(y_test,y_pred))
print('-'*33)

##save model
##import joblib
##joblib.dump(model,'model.joblib')

