Improving Lung and Colon Cancer Detection Using Ensemble Method Approach
2024 11th International Conference on “Computing for Sustainable Global Development”, 28th Feb. – 1st Mar., 2024
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
Vanita Jain
Department of Electronic Science
University of Delhi
New Delhi, India
vjain@electronics.du.ac.in
low-performing model by using the averaging method; we combined the predictions of multiple individual models into a single, more robust predictive model. This ensemble approach helped mitigate the limitations of the individual models and leveraged their collective knowledge to achieve higher accuracy. The resulting ensemble model demonstrated improved performance and enhanced the overall accuracy of our classification system for cancer detection.

The study comprises a meticulous and comprehensive approach to data preprocessing, involving the fine and complex task of refinement as well as the organization of vast amounts of data to extract the pertinent features by deploying appropriate feature engineering techniques. This foundational phase lays the base for the development of the subsequent stages, while at the same time ensuring the quality, accuracy, and relevance of the information which is fed into the system. This research dives into the intricate depths of model training, where a wide array of sophisticated algorithms and models are employed to allow the system to discern and differentiate between the complex patterns and relationships within the data.

Model validation is an indispensable part of the process of development. It involves putting the trained model through a rigorous and well-rounded assessment against different datasets. This is a crucial step that provides the necessary guarantee related to the accuracy and efficiency of the system. The evaluation metrics employed in the research ensure precision and provide the model with a nuanced understanding of the different features of the disease in the human body.

The integration of Deep Learning (DL) and Machine Learning (ML) [7] methods and the careful examination and evaluation of pre-trained models are two of this study's most unique aspects. This methodology allows us to identify and evade the loopholes and cons of an existing model and only include the positives of a given model under consideration. The project involves rigorous data preprocessing, model training, validation, and testing, with evaluation based on accuracy, precision, and loss curves. By leveraging DL and pre-trained models, this research aims to enhance cancer diagnosis [8] and treatment. This model successfully revolutionizes the detection of lung and colon cancer, assisting in the process of early diagnosis, identification, and subsequent treatment. Our work establishes a solid foundation for the use of DL and ML [9] models in the treatment of lung and colon cancer in the healthcare sector.

II. RELATED WORK

Studies on malicious Cancer detection by making use of the signature sets have been put under thorough investigation and utilized for a lengthy time in the past. Most of this research often uses lists of recognized malicious Cancers. As soon as the module encounters a new cancer, a database query is initiated. If the cancer is found to be blacklisted, it is then regarded as malicious, and a warning is generated.

Sirinukunwattana et al. [10] proposed a spatially constrained neural network achieving 97.1% accuracy for colon cancer nucleus detection. Shein et al. [11] developed an architecture based on ML without segmentation or feature extraction methods, achieving 87.14% accuracy for lung nodule classification. Three deep structured algorithms, including CNN, achieved 89% accuracy for lung nodule feature extraction [12]. Selvanambi et al. [13] achieved 98% accuracy for lung cancer detection using RNN with DLS and glowworm swarm optimization. Filho et al. [14] made use of segmented imaging and CNN for processing the lung nodule CT scans and achieved 92.6% accuracy for classification. Yuan et al. [15] applied CNN with preprocessing techniques such as edge detection and intensity adjustments, obtaining 91.4% accuracy for polyp detection in colonoscopy videos. In CT images, benign and malignant lung nodules were classified with 92.8% accuracy using ResNet50 and SVM with RBF kernel [16].

The DFCNet model, which is based on deep CNN, was proposed by Masood et al. [17] and achieved 84.5% accuracy in pulmonary nodule classification. Faster R-CNN was employed by Mo et al. [18] to detect polyps in colonoscopy recordings, with an average accuracy of 98.5% across four datasets. With 96.4% accuracy, Urban et al. [19] created deep CNNs for polyp identification in colonoscopy pictures. Binarized weights were applied to reduce network size [20], achieving 90.28% accuracy for colonoscopy frame classification. A combined learning comprehensive neural network optimized with AdaBoost obtained 98.42% accuracy in recognising normal and abnormal lung shapes after wolf heuristic features [21] were chosen to minimize dimensionality.

An eight-layer CNN architecture was suggested by Suresh and Mohan [22] to categorize CT images of lung lesions into three categories. They used generative adversarial networks for both data augmentation and picture segmentation. With 93.9% classification accuracy, the model performed well. Masud et al. [23] introduced a lightweight CNN approach for pulmonary nodule detection. Their model, consisting of four convolutional layers, demonstrated a high accuracy of 97.9% and was suitable for real-time CT image analysis. A pre-processing technique [24] preserving image brightness and reducing noise was used for lung cancer CT scans. An improved neural network performed region segmentation and feature extraction, followed by an ensemble classifier for classification, achieving an accuracy of 96.2%. Pre-trained CNNs (ResNet-50, ResNet-34, and ResNet-18) were used by Bukhari et al. [25] to assess colonic cancer histopathology pictures. Their ensemble approach achieved an accuracy of 96.4%. With a 97.8% classification accuracy, Mangal et al. [26] used a shallow neural network to classify digital pathology pictures of lung and colon squamous cell carcinoma and adenocarcinoma. In order to classify lung cancer [27] histopathology pictures, Hatuwal and Thapa [28] used a CNN [29], achieving accuracy of 96.11% in training and 97.2% in validation.

Yamini et al. [30] created a unique machine learning model that analyzes lung cancer datasets with the use of Gradient Boosting, KNN, LR, DT, RF, SVM, and XGB classifier models. An automated system-based classifier was explored
and developed by Bishnoi et al. [31], which further elaborated upon a different approach to malignancy detection. In a recent study conducted by Gayap et al. [32], Deep Learning was applied to diagnose lung cancer, demonstrating high potential and precision. Numerous approaches have been investigated by researchers, such as convolutional neural networks with preprocessing techniques, recurrent neural networks with optimization algorithms, and spatially restricted neural networks. Impressive classification accuracies have been achieved by these methods for identifying malignant tumors in medical imaging data, including CT scans and pathology pictures.

In summary, the combined findings demonstrate the critical role that DL and ML play in the development of cancer detection methods. By harnessing the power of computer algorithms, researchers are making enormous strides towards more accurate, efficient, and readily available diagnostic tools for the battle against cancer.

III. PROPOSED SYSTEM

The primary objective is the seamless integration and assimilation of pre-trained architectural frameworks, combined with an efficient and meticulous data pre-processing pipeline, specialized model training, validation, and exhaustive protocols to perform testing. The augmentation that acts as the pivot to this comprehensive framework lies in the strategic incorporation of an ensemble model, which is an advanced amalgamation of different individual models, designed particularly to increase the accuracy and reliability of the cancer detection system. The core of this endeavor revolves around the prediction of malignant cells in the lungs and colon, distinguishing between them with a high level of accuracy. The progression through the following structured steps is what forms the backbone of the approach followed in this research:

Data Preprocessing
Model Selection and Training
Validation and Hyper-parameter Tuning
Performance Evaluation
Ensemble Model
Testing and Evaluation

The methodological approach used to create the ensemble model for this research project is depicted in Fig. 1.

Fig. 1. Flowchart of deployed Methodology

A. Data Preprocessing

1) Data cleaning, normalization, and augmentation: Inconsistencies in data, including noise, outliers, etc., were handled efficiently to ensure a well-rounded data cleaning process, which formed the backbone of our model. Data normalization was performed to ensure consistency in the data, which then led to improvement in the performance of the model. A meticulous procedure was employed to conduct the refinement and structuring of the raw data to retrieve the essential features. This process ensured the integrity of the dataset for model training being conducted in subsequent models. Augmentation techniques were applied to increase the dataset size and enhance model generalization. A sample of the dataset used to train the model is shown in Table I, which lists the main characteristics of the tissue under analysis and its classification into five distinct categories.

TABLE I. DESCRIPTION OF THE EMPLOYED DATASET
Image Type                     Class ID   Class Title   Total Images
Colon Adenocarcinoma           0          Colon_aca     5000
Colon Benign                   1          Colon_n       5000
Lung Adenocarcinoma            2          Lung_aca      5000
Lung Benign                    3          Lung_n        5000
Lung Squamous Cell Carcinoma   4          Lung_scc      5000

The histopathological pictures of the distinct tissue types that the model analyzes to classify the cancer are shown in Fig. 2.

Fig. 2. Histopathological images [33] from dataset: (a) Lung adenocarcinoma. (b) Lung benign. (c) Lung squamous cell. (d) Colon adenocarcinoma. (e) Colon benign
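The normalization and augmentation steps described in Section III-A can be illustrated with a minimal sketch. The paper does not state which normalization scheme or which augmentations were used, so the choices here (scaling 8-bit intensities to [0, 1], horizontal and vertical flips) and the function names are assumptions for illustration only. The class IDs and per-class counts are taken from Table I.

```python
# Class IDs and titles as listed in Table I; each class has 5000 images.
CLASS_TITLES = {0: "Colon_aca", 1: "Colon_n", 2: "Lung_aca", 3: "Lung_n", 4: "Lung_scc"}
IMAGES_PER_CLASS = 5000


def normalize(image):
    """Scale 8-bit pixel intensities to [0, 1] (assumed normalization scheme)."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror every row left to right."""
    return [list(reversed(row)) for row in image]


def vertical_flip(image):
    """Reverse the order of the rows (top to bottom)."""
    return [list(row) for row in reversed(image)]


def augment(image):
    """Return the original image plus two flipped copies (assumed augmentations)."""
    return [image, horizontal_flip(image), vertical_flip(image)]


# Toy 2x3 grayscale "image" standing in for a histopathology tile.
toy_image = [[0, 51, 255], [102, 204, 153]]

# Augment first, then normalize every resulting image.
batch = [normalize(img) for img in augment(toy_image)]

# Dataset size implied by Table I: five classes of 5000 images each.
total_images = IMAGES_PER_CLASS * len(CLASS_TITLES)
```

In this sketch the augmented batch is three times the size of the input, and `total_images` matches the 25,000 images reported for the dataset. A real pipeline would operate on RGB arrays with an image library rather than nested lists.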
Table II shows the ensemble model and other pre-trained architectures' comparative performance metrics. Each model's efficacy is displayed in the table using its accuracy, recall, and F-1 score. It is noteworthy that the ensemble model performs better than the individual architectures in every category, with improved metrics.

Fig. 5 represents the Comparative Validation Accuracy curve representing the higher accuracy of the Ensemble Model which was developed in this study, as compared to the other pre-existing models.

Fig. 6 is a graph representing the validation loss curve of all the models considered in the study along with the curve of the developed Ensemble Model. The loss curve depicts the evolution of the validation loss incurred over successive iterations of the model.

These are the respective comparative validation accuracy and loss curves which help assess the ability of the model to generalize and make accurate predictions on unseen data; and to provide insights into the effectiveness of the model's learning and optimization.
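The averaging ensemble and the metrics used to compare it with the individual models (accuracy, recall, F-1 score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the probability values are invented toy numbers, three classes are used instead of five for brevity, and macro-averaging of recall and F-1 across classes is an assumption, since the paper does not state how per-class scores were aggregated.

```python
def average_ensemble(model_probs):
    """Average class probabilities across models.

    model_probs[m][i][c] is model m's probability of class c for sample i.
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(model_probs[m][i][c] for m in range(n_models)) / n_models
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


def predict(probs):
    """Pick the class with the highest probability for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class recall and F-1 (assumed aggregation)."""
    recalls, f1_scores = [], []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1_scores.append(f1)
    return sum(recalls) / n_classes, sum(f1_scores) / n_classes


# Toy example: three hypothetical models, four samples, three classes.
y_true = [0, 1, 2, 1]
model_probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.5, 0.4, 0.1]],
    [[0.3, 0.5, 0.2], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.2, 0.7, 0.1]],
    [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.4, 0.3, 0.3], [0.1, 0.8, 0.1]],
]

ensemble_pred = predict(average_ensemble(model_probs))
individual_acc = [accuracy(y_true, predict(p)) for p in model_probs]
ensemble_acc = accuracy(y_true, ensemble_pred)
ensemble_recall, ensemble_f1 = macro_recall_f1(y_true, ensemble_pred, 3)
```

In this toy data each model misclassifies a different sample, so each reaches an accuracy of 0.75, while the averaged probabilities classify all four samples correctly. This shows the mechanism by which averaging can offset a single model's error; it says nothing about the magnitude of the improvement reported in Table II.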
This empirical progression highlights the ensemble model's clear advantage. Outlining the experimental method that served as the foundation for the examination of certain model performance indicators and the augmentation tactics that followed is crucial. Every component model was carefully assessed, examining its innate advantages and disadvantages. Making use of the collective knowledge gained from the various models, the final ensemble model was carefully constructed, combining the advantageous aspects of its predecessors. Consequently, a thorough process of empirical research, iterative refinement, and strategic combination of model features supported the ensemble model's developmental trajectory in this work and resulted in a strong and superior model paradigm.

V. CONCLUSION

The main aim of this study was to detect lung and colon cancer accurately and efficiently using a reliable and robust model that has been trained using a vast dataset and provides precise results on real-world data. Transfer learning was employed to fulfill this purpose and detect cancer in the patients by training the model on a dataset of 25,000 different histopathology images of lung and colon cancer tissues. The project involved a comprehensive comparative analysis of the various cancer detection models using an ensemble approach and pre-trained architectures, namely EfficientNetB7, PretrainedModel2, InceptionV3, DenseNet201, and ResNet50. The objective was to enhance the overall performance and improve the metrics of model testing which were deployed to ensure a sophisticated and highly precise system of cancer detection. We leveraged the qualities and collective knowledge of multiple models that have been developed in the past for the same or similar purposes. After doing thorough testing and analysis, we found that the ensemble model performed better than the individual models. The ensemble approach allowed us to leverage the strengths of each pre-trained architecture and mitigate their limitations, resulting in enhanced accuracy and robust predictions. The developed Ensemble Model had a high accuracy of 0.98, which was superior to the other models assessed. A highly precise result was obtained for the real-time data fed into the model, with a recall value of 0.96. Our comparative analysis involved various metrics, including accuracy and loss curves, to evaluate the performance of the models. These metrics provided valuable insights into the effectiveness of each model and helped us make informed decisions during the ensemble model construction.

The conclusions and discoveries made in this research initiative enhance machine-learning approaches in pathology and have the potential to improve illness diagnostics and intelligent prescribing programs. The research route in question presents significant potential for enhancing the model's outlier tolerance and delving deeper into the Convolutional Neural Network (CNN) as a potential solution.

REFERENCES
[1] International Agency for Research on Cancer, Oct. 2021, [online] Available: https://gco.iarc.fr/today/data/factsheets/populations/900-world-fact-sheets.pdf.
[2] Sloan, A. Frank and Hellen Gelband, "The cancer burden in low-and middle-income countries and how it is measured," in Cancer control opportunities in low-and middle-income countries. National Academies Press (US), 2007.
[3] S. Das, S. Biswas, A. Paul and A. Dey, "AI Doctor: An intelligent approach for medical diagnosis," in Industry Interactive Innovations in Science Engineering and Technology, Singapore: Springer, pp. 173-183, 2018.
[4] M. Dildar, S. Akram, M. Irfan et al., "Skin cancer detection: a review using deep learning techniques," International journal of environmental research and public health 18, no. 10 (2021): 5479.
[5] S. Garg and S. Garg, "Prediction of lung and colon cancer through analysis of histopathological images by utilizing Pre-trained CNN models with visualization of class activation and saliency maps," in Proceedings of the 3rd Artificial Intelligence and Cloud Computing Conference, 2020, pp. 38-45.
[6] O. Singh and K. K. Singh, "An approach to classify lung and colon cancer of histopathology images using deep feature extraction and an ensemble method," International Journal of Information Technology 15, no. 8 (2023): 4149-4160.
[7] R. S. Yadav, "Data analysis of COVID-2019 epidemic using machine learning methods: a case study of India," International Journal of Information Technology 12, no. 4, pp. 1321-1330, 2020.
[8] T. Babu, D. Gupta, T. Singh and S. Hameed, "Colon cancer prediction on different magnified colon biopsy images," Proc. 10th Int. Conf. Adv. Comput. (ICoAC), pp. 277-280, Dec. 2018.
[9] M. Masud, N. Sikder, A.-A. Nahid, A. K. Bairagi and M. A. AlZain, "A machine learning approach to diagnosing lung and colon cancer using a deep learning-based classification framework," Sensors, vol. 21, no. 3, pp. 748, Jan. 2021.
[10] K. Sirinukunwattana, S. E. A. Raza, Y.-W. Tsang, D. R. J. Snead, I. A. Cree and N. M. Rajpoot, "Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images," IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1196-1206, May 2016.
[11] S. Mehmood, T. M. Ghazal, M. A. Khan, M. Zubair, M. T. Naseem, T. Faiz and Munir Ahmad, "Malignancy detection in lung and colon histopathology images using transfer learning with class selective image processing," IEEE Access, pp. 25657-25668, 2020.
[12] Y. Su, D. Li and X. Chen, "Lung nodule detection based on faster R-CNN framework," Computer Methods and Programs in Biomedicine, 2021.
[13] S. Ramani, J. Natarajan, M. Karuppiah, S. K. H. Islam, M. M. Hassan, and G. Fortino, "Lung cancer prediction using higher-order recurrent neural network based on glowworm swarm optimization," Neural Computing and Applications, vol. 32, pp. 4373-4386, 2020.
[14] N. Da, R. V. Medeiros, S. A. Peixoto, S. P. P. da Silva and P. P. R. Filho, "Lung nodule classification via deep transfer learning in CT lung images," in 2018 IEEE 31st International Symposium on Computer-based Medical Systems (CBMS), 2018, pp. 244-249.
[15] S.-B. Zhao, W. Yang, S.-L. Wang et al., "Establishment and validation of a computer-assisted colonic polyp localization system based on deep learning," World Journal of Gastroenterology, vol. 27, no. 31, 2021.
[16] G. Zhang, Z. Yang, L. Gong, S. Jiang and L. Wang, "Classification of benign and malignant lung nodules from CT images based on hybrid features," Physics in Medicine & Biology, vol. 64, no. 12, 2019.
[17] A. Masood, B. Sheng, P. Li, X. Hou, X. Wei, J. Qin and D. Feng, "Computer-assisted decision support system in pulmonary cancer detection and stage classification on CT images," Journal of biomedical informatics, vol. 79, pp. 117-128, 2018.
[18] M. Xi, K. Tao, Q. Wang and G. Wang, "An efficient approach for polyps detection in endoscopic videos based on faster R-CNN," in 2018 24th international conference on pattern recognition (ICPR), 2018, pp. 3929-3934.
[19] U. Gregor, P. Tripathi, T. Alkayali, M. Mittal, F. Jalali, W. Karnes and P. Baldi, "Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy," Gastroenterology, vol. 155, no. 4, pp. 1069-1078, 2018.