Breast Cancer Prediction Using Machine Learning


Atlantis Highlights in Computer Sciences, volume 4

Proceedings of the 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021)

Breast Cancer Prediction Using Machine Learning Techniques
Apoorva V 1,*, Yogish H K 2, Chayadevi M L 3
1,2 Dept. of IS & E, MSRIT, Bengaluru
3 Dept. of IS & E, JSSATE, Bengaluru
* Corresponding author. Email: apoorvav133@gmail.com

ABSTRACT
Breast cancer is one of the most common cancers affecting women worldwide, and it is the second most common cause of cancer-related death among women. However, if the cancer is detected early and treated properly, it can be cured. Early detection of breast cancer can dramatically improve the prognosis and chances of survival by allowing patients to receive timely clinical therapy. Furthermore, precise classification of benign tumours can help patients avoid unnecessary treatment. This study uses Convolutional Neural Networks (CNN) for an image dataset, and K-Nearest Neighbour (KNN), Decision Tree (CART), Support Vector Machine (SVM), and Naïve Bayes for a numerical dataset whose features are obtained from digitised images of a breast mass, in order to analyse cancer datasets and improve prediction accuracy. As part of the process, the datasets are analysed and evaluated, and the models are trained. Finally, both image and numerical test data are used for prediction.

Keywords: IDC (Invasive Ductal Carcinoma), FNA (Fine Needle Aspirate), Breast cancer prediction, Classifier algorithms, CNN (Convolutional neural network).

1. INTRODUCTION

Breast cancer is among the most serious diseases in India and currently causes many deaths. Owing to changes in diet and lifestyle, the number of cancer cases in women is increasing day by day. It is the second most common cause of cancer-related death in women worldwide [1]. This work uses concepts from deep learning (DL) and machine learning (ML) to predict breast cancer from the data obtained. The cancer is produced by abnormal growth of fatty and fibrous tissue, and its different stages result from cancer cells spreading throughout the tissue [4]. It is one of the most common cancers affecting women and, according to a government survey, other types of cancer can be treated more effectively than breast cancer. The various stages of breast cancer are identified via proper treatment and detailing. Without proper therapy, the disease can be fatal. A number of methods for establishing an accurate diagnosis of breast cancer have been presented. Because the dataset contains a variety of distinct report attributes, machine learning can readily be applied to it for prediction [22-25]. Even so, the existing technology is not designed to produce its output fully automatically. Hence we propose fully automatic classification and prediction of breast cancer from the dataset using a deep learning technique. This learning technique is recognised as the best method for prediction and classification on image datasets.

Earlier methods for classifying data were used, despite their lower accuracy, because they could be used for proper categorization and prediction. Deep learning algorithms, and machine learning techniques for numerical datasets, are used to extract features and hidden features. The convolution output is computed by sliding a filter across the image with a given stride, which allows features to be extracted from images of different sizes. CNN is the model that gives proper output for the dataset we used in this work [13-17].

Typical cancer screening procedures are grounded on the "gold standard", which consists of three tests: clinical evaluation, radiological imaging, and pathology testing [18]. This traditional technique, which is based on regression, detects the existence of cancer, whereas new ML techniques and algorithms are built on model creation. In its training and testing stages, the model is meant to forecast unseen data and offer a satisfactory predicted outcome [19]. Pre-processing, feature selection or extraction, and classification are the three major steps used in machine learning [20]. The feature extraction part of the machine learning method is crucial for cancer diagnosis and prediction. This process can help differentiate between benign and malignant tumours [21].

Copyright © 2021 The Authors. Published by Atlantis Press International B.V.


This is an open access article distributed under the CC BY-NC 4.0 license - http://creativecommons.org/licenses/by-nc/4.0/.

The "gold-standard" method for detecting cancer previously consisted of three parts: clinical evaluation, radiological imaging, and pathology testing [18]. This technique indicates the presence of cancer based on regression, whereas newer algorithms are now available. A model designed for predicting new data should give good results in both its training and testing phases [19]. There are three main steps: pre-processing, feature extraction, and classification. Figure 1 shows the types of breast cancer; in this paper we consider IDC.

Figure 1 The various kinds of breast cancer.

2. RELATED WORKS

This section gives an ordered description of several approaches to breast cancer analysis, in order to understand the concerns and challenges addressed by previous research studies. [1] noted that, among the different cancers, breast cancer is possibly the most common type of malignancy. Breast cancer is a well-known disease and an active research topic with enormous potential [2].

Healthcare companies are providing new kinds of support for clinical experts in specific situations by using data science and machine learning algorithms. Detecting a case of breast cancer remains a considerable challenge, as presentations vary in shape, texture, and other clinical aspects. As a result, the healthcare sector is concentrating more on developing practical machine learning applications [3]. Previously, researchers concentrated on image analysis to discriminate between breast diseases, identifying malignant tumours that have spread beyond the breast to other organs and nearby lymph nodes [3-5], and on cell-level analysis [6-8], using distinct but small datasets derived from computational challenge evaluations [9-12].

Radiology professionals frequently struggle with labelling mass lesions in mammography, which can lead to unnecessary and costly breast biopsies. The implementation in [1] was evaluated using publicly available benchmark datasets: the DDSM, INbreast, and BCDR databases for training and testing, and the MIAS dataset for testing only. The results showed that pairing PCNN with CNN outperforms other approaches on the same publicly available datasets [1].

If the mammographic breast tissue is dense, federal law requires patient notification, because increased density is a sign of breast cancer risk and can impair the sensitivity of mammography. The goal in [7] was to externally validate a deep learning model using radiologist breast density evaluations in a community breast imaging practice.

3. METHODOLOGY

3.1 Data Set Description

The WBCD repository provided the numerical dataset. The features are computed from a fine needle aspirate (FNA) of a breast mass. Thirty features were extracted to describe characteristics of the cell nuclei present in the scanned images. The dataset covers 569 patients, of whom 212 have a malignant outcome and 357 are benign. The class label takes the value 2 or 4, with 2 corresponding to the benign case and 4 corresponding to the malignant case.

The IDC dataset consists of 162 whole mount slide images, from which 277,524 patches were extracted; 198,738 are non-IDC and 78,786 are IDC. They are labelled 1 and 0, where 1 indicates cells with IDC characteristics and 0 indicates cells with non-IDC characteristics.

3.2 Data Pre-processing

The term "data pre-processing" refers to the process of converting unstructured data to structured data, as well as resizing and removing undesirable data from a dataset. Missing values in the dataset are replaced by the mean value. The data is then randomly sampled from the dataset to ensure that it is distributed properly [27].

3.3 Training and Testing Phase

The training phase extracts the features from the dataset, and the testing phase supplies new data to examine how well our algorithm works and behaves when it comes to prediction. As previously stated, the dataset is divided into two parts. To avoid overfitting, cross-validation is performed. Each iteration of our approach uses ten folds to partition the data, with nine folds used for training and the remaining fold for testing [28].
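
The mean-value replacement of Section 3.2 and the ten-fold partitioning of Section 3.3 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names `impute_column_means` and `k_fold_indices`, the use of `None` to mark a missing value, the random seed, and the toy records are all assumptions made for the example.

```python
import random
from statistics import mean


def impute_column_means(rows):
    """Replace missing entries (None) with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    col_means = [mean([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and yield one (train, test) pair of index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy records with one missing value (not taken from the paper's data).
records = [[1.0, 10.0], [3.0, None], [5.0, 30.0]]
clean = impute_column_means(records)

# 569 is the number of patients reported for the numerical dataset.
splits = list(k_fold_indices(569, k=10, seed=0))
fold_sizes = sorted(len(test) for _, test in splits)
```

Each sample appears in exactly one test fold, so every record is used nine times for training and once for testing. A column with no observed values would make `mean` raise an error, which a real pipeline would need to handle.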

3.4 Proposed convolution neural network for image dataset analysis

The entire procedure is separated into three phases: data generation, analysis, and prediction, as shown in Figure 2.

Data from the patient, such as X-rays, MRI, and so on, are generated using IoT devices. The information gathered may come from images or from sensors that collect numerical data, and it is recorded in a database. These records can be viewed by authorised individuals at any time and from any location. Medical specialists analyse the data and use various algorithms to predict cancer classifications.

We use the CNN algorithm for analysis and prediction in the proposed model. It is important to understand the layers and the transformations applied in each layer in order to follow the rest of the process. For example, as seen in Figure 2, the input data should be in pixel format, which is converted to a 3 x 3 image matrix in the Conv2D layer and then to a single linear vector for classification.

With appropriate filters, the CNN can capture spatial and temporal dependencies in an image. Training on the images allows the network to better comprehend the full visual information. A 4 x 4 x 3 image is shown in Figure 3.

Figure 2 Data_Matrix_Representation

Figure 3 Image Representation

Pooled images are flattened into a long vector, which is fed to a fully connected ANN. An artificial neural network's core design consists of a large number of interconnected neurons grouped in three layers: input, hidden, and output. Given enough examples, this type of network usually learns how to perform a task. The following is a representation of forward propagation for a single neuron:

(1)

The weight to the input layer is w_ij, the bias value is b_i, and the input value is x_i. The fully connected layer uses ReLU activation functions and a softmax classifier. Multiple Conv filters are stacked on top of each other and max pooling is performed, using depthwise separable convolution.

(2)

(3)

(4)

After the input layer, there is a hidden layer with several sub layers, and after propagation, the loss value
is derived from the predicted and actual values.

Figure 4 The system architecture of ANN

The proposed CNN model achieved the highest accuracy for the dataset, greater than that of the other current algorithms. The proposed CNN architecture is therefore well suited to analysing the breast cancer data, as the explanation above indicates.

3.5 Proposed Method for Numerical Dataset

For the numerical dataset we use four different algorithms and identify which one is most suitable for the dataset and gives the most accurate output. The algorithms used are SVM, KNN, Decision Tree, and Naïve Bayes.

3.5.1. SVM

Support vector machines are supervised machine learning models. When classifying items, a support vector machine creates a hyperplane. The two classes are separated by the hyperplane, which is a line in a two-dimensional plane. An SVM is a non-probabilistic binary linear classifier: given a set of training examples, each marked as belonging to one or the other of two categories, the SVM training algorithm builds a model that assigns new examples to one category or the other [29].

3.5.2. K-Nearest Neighbour

KNN is a supervised machine learning algorithm used to classify data. It calculates the distances between a new point and the points in the data, and the class is then decided by a vote among the nearest neighbours [30-32].

3.5.3. Decision Tree

A decision tree is a supervised machine learning algorithm. It is simply a tree in which each node serves as either a leaf node or a decision node. Decision tree procedures are quick and well suited to making decisions. Internal and external nodes are linked together in a decision tree [33-35].

3.5.4. Naïve Bayes

The Naïve Bayes (NB) learning algorithm is a probabilistic method. Conditional probability is used to establish the class of a new feature vector. NB uses the training dataset to obtain the conditional probability of the vectors for a particular class. After the conditional probability of each vector has been computed, the class of the new vector is determined from its conditional probability. NB is often used for text-related problems [36-40].

4. RESULTS AND DISCUSSION

Following the proposed methodology described above, the numerical dataset was analysed using four different algorithms, namely SVM, KNN, Decision Tree (CART), and Naïve Bayes, with the following results.

First, the baseline algorithm analysis is done after pipelining the dataset. Figure 5 shows the accuracy obtained by these classifier algorithms.

Figure 5 Initial output using baseline algorithm

Then we use a standard scaler to scale and centre the data to boost accuracy, and the algorithms are applied to the scaled data. Figure 6 depicts the accuracy obtained after using the standard scaler.
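
Why scaling and centring can matter for a distance-based classifier such as KNN can be seen in a small sketch. The code below is an illustration under stated assumptions, not the authors' pipeline: it uses only the Python standard library, the helper names (`fit_standardizer`, `standardize`, `knn_predict`) are hypothetical, and the six two-feature records are invented toy values in which the second feature has a much larger numeric range than the first.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Return the (mean, standard deviation) of each column of the training rows."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def standardize(rows, params):
    """Centre each column to zero mean and scale it to unit variance."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


def knn_predict(train_x, train_y, query, k=3):
    """Vote among the k training points closest to `query` (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data (hypothetical): feature 0 separates the classes, feature 1 has a large scale.
train_x = [[0.10, 690.0], [0.11, 700.0], [0.12, 710.0],
           [0.20, 400.0], [0.21, 1000.0], [0.22, 950.0]]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
query = [0.21, 705.0]

# Without scaling, the large-valued feature dominates the distance.
raw_prediction = knn_predict(train_x, train_y, query, k=3)

# With standardization, both features contribute on a comparable scale.
params = fit_standardizer(train_x)
scaled_x = standardize(train_x, params)
scaled_query = standardize([query], params)[0]
scaled_prediction = knn_predict(scaled_x, train_y, scaled_query, k=3)
```

In this toy case the unscaled vote is decided almost entirely by the second feature, while the standardized vote follows the first feature, which is the one that separates the classes. The scaling parameters are fitted on the training rows only and then reused for the query, which avoids leaking information from test data.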


Figure 6 Accuracy boost after standard scaler

In Figure 7, it can be observed that SVM shows high accuracy. Along with accuracy, the mean and standard deviation values are also calculated.

Figure 7 Plot of accuracy comparison

Thus, SVM is considered for further analysis and prediction. The training data is used to fit the model, and the test data is passed as input for prediction. Figure 8 provides a screenshot of the accuracy obtained using SVM.

Figure 8 Prediction accuracy using SVM for test data

The proposed method also covers the prediction of breast cancer using a CNN applied to the IDC image dataset. Figure 9 depicts the accuracy and other parameters obtained.

Figure 9 Results obtained using CNN on test data

These results are obtained by a model built using the sequential API, which forms a network that stacks Conv filters, applies max pooling, and uses SeparableConv2D depthwise separable convolutions. This model is then used to predict the results of test data. Figure 10 provides a plot of the loss function, which shows a significant decrease.

Figure 10 Plot of training and validation loss

To make the interface easy to use, a GUI is developed using the Flask framework to connect the front end to the back-end model, which processes the input and provides the prediction.

Medical practitioners can enter input values manually from patient records, and on submission the record is classified as malignant or benign. An image can also be uploaded, which is then processed by the trained model to make the prediction.

Figure 11 Image data prediction: Malignant
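
The pooling, flattening, and fully connected steps of the CNN described in Section 3.4 can be illustrated with a small standard-library sketch. It is not the authors' network: the depthwise separable convolution layers are omitted, the single-neuron rule z_i = sum_j w_ij * x_j + b_i is the common textbook form and is assumed here, and the 4 x 4 input, weights, and biases are invented values chosen so that the arithmetic is easy to follow.

```python
import math


def max_pool_2x2(image):
    """Take the maximum of each non-overlapping 2x2 window of a 2-D list."""
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, len(image[0]) - 1, 2)]
        for r in range(0, len(image) - 1, 2)
    ]


def flatten(feature_map):
    """Turn a 2-D feature map into one long vector."""
    return [value for row in feature_map for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: z_i = sum_j w_ij * x_j + b_i for each neuron i."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    """ReLU activation: negative values become zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert scores to probabilities that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical single-channel 4x4 feature map.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 2],
               [2, 2, 3, 6]]

pooled = max_pool_2x2(feature_map)   # 2x2 map
vector = flatten(pooled)             # length-4 vector

# Hypothetical weights: one hidden layer of two neurons, then two output classes.
hidden = relu(dense(vector, [[0.5, -1.0, 0.0, 0.25], [-1.0, 0.0, 0.0, 0.0]], [0.0, 1.0]))
probs = softmax(dense(hidden, [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]))
```

Here `probs` holds two class probabilities, which a real model would map to the IDC and non-IDC labels. During training, a loss computed from these probabilities and the true labels would drive the weight updates.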


Figure 12 Image data prediction: Malignant

Figures 11 and 12 show the screenshots obtained for the image prediction. These images are uploaded as input on the webpage.

Figure 13 Numerical data record prediction: Benign

Figure 14 Numerical data record prediction: Malignant

Table 1: Comparison of accuracies obtained by different authors

Author                           Dataset                                     Method                                      Accuracy
1. Proposed methodology          Numerical dataset                           SVM                                         96.48%
                                                                             CART                                        91.88%
                                                                             NB                                          93.19%
                                                                             KNN                                         95.8%
                                 IDC dataset                                 CNN                                         98.13%
2. Wang et al. [3]               Electronic health records                   Logistic regression                         96.4%
3. Akbugday [2]                  Breast Cancer Wisconsin dataset             KNN                                         96.85%
                                                                             SVM                                         96.85%
4. Keles, M. Kaya [26]           Wisconsin Diagnostic Breast Cancer dataset  SVM vs KNN, decision trees and Naïve Bayes  up to 96.91%
5. Meteb M. Altaf [1]            DDSM dataset, INbreast dataset              PCNN                                        97%
6. Brian N. Dontchos et al. [7]  Image dataset                               Deep learning model                         94.9%

Table 1 above shows the results of existing systems and the proposed system. It can be observed that the chosen CNN attains the highest accuracy in the table (98.13%), while the SVM result on the numerical dataset (96.48%) is comparable to the previously reported values.

5. CONCLUSION

The primary purpose of this study is to design and implement a method for analysing and classifying breast cancer data obtained from mammography and pathology results of patient scans and from the UCI repository. A convolutional neural network model, implemented in Python, is used to accomplish this, and the results are validated. According to the findings, the proposed CNN outperforms the other algorithms in detecting and classifying breast cancer on image datasets, and SVM outperforms CART, NB, and KNN in the analysis and prediction of cancer on the numerical dataset.

AUTHORS' CONTRIBUTIONS

All authors have contributed equally to the work.

ACKNOWLEDGMENTS

We thank all the faculty members and friends for their expertise and assistance throughout all aspects of our study and for their help in writing the manuscript.

REFERENCES

[1] Meteb M. Altaf. A hybrid deep learning model for breast cancer diagnosis based on transfer learning and pulse-coupled neural networks.
[2] B. Akbugday, "Classification of Breast Cancer Data Using Machine Learning Algorithms," 2019 Medical Technologies Congress (TIPTEKNO), Izmir, Turkey, 2019, pp. 1-4.
[3] Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718 (2016).


[4] Nazeri, K., Aminpour, A. & Ebrahimi, M. Two-stage convolutional neural network for breast cancer histology image classification. In International Conference Image Analysis and Recognition, 717–726 (Springer, 2018).
[5] Golatkar, A., Anand, D. & Sethi, A. Classification of breast cancer histology using deep learning. In International Conference Image Analysis and Recognition, 837–844 (Springer, 2018).
[6] Albarqouni, S. et al. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE transactions on medical imaging 35, 1313–1321 (2016).
[7] B. N. Dontchos, A. Yala, R. Barzilay, J. Xiang, C. D. Lehman, External validation of a deep learning model for predicting mammographic breast density in routine clinical practice, Acad. Radiol., 28 (2020), 475-480.
[8] Rao, S. Mitos-rcnn: A novel approach to mitotic figure detection in breast cancer histopathology images using region based convolutional neural networks. arXiv preprint arXiv:1807.01788 (2018).
[9] Bejnordi, B. E. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama 318, 2199–2210 (2017).
[10] Bándi, P. et al. From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE Transactions on Med. Imaging (2018).
[11] Litjens, G. et al. 1399 h&e-stained sentinel lymph node sections of breast cancer patients: the camelyon dataset. GigaScience 7, giy065 (2018).
[12] Aresta, G. et al. Bach: Grand challenge on breast cancer histology images. arXiv preprint arXiv:1808.04277 (2018).
[13] Gouda I Salama, M Abdelhalim, and Magdy Abd-elghany Zeid. 2012. Breast cancer diagnosis on three different datasets using multiclassifiers. Breast Cancer (WDBC) 32, 569 (2012), 2.
[14] William H Wolberg, W Nick Street, and Olvi L Mangasarian. 1992. Breast cancer Wisconsin (diagnostic) data set. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/] (1992).
[15] Abien Fred Agarap. 2017. A Neural Network Architecture Combining Gated Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection in Network Traffic Data. arXiv preprint arXiv:1709.03082 (2017).
[16] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, (2014), "Learning phrase representations using RNN encoder-decoder for statistical machine translation", arXiv preprint arXiv:1406.1078 (2014).
[17] C. Cortes and V. Vapnik, (1995), "Support-vector Networks", Machine Learning 20.3 (1995), 273–297. https://doi.org/10.1007/BF00994018.
[18] Gönen, M.; Alpaydın, E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011, 12, 2211–2268.
[19] Ferroni, P.; Zanzotto, F.M.; Scarpato, N.; Riondino, S.; Nanni, U.; Roselli, M.; Guadagni, F. Risk assessment for venous thromboembolism in chemotherapy treated ambulatory cancer patients: A precision medicine approach. Med. Dec. Mak. 2017, 37, 234–242.
[20] Ferroni, P.; Roselli, M.; Zanzotto, F.M.; Guadagni, F. Artificial Intelligence for cancer-associated thrombosis risk assessment. Lancet Haematol. 2018, 5, e391.
[21] Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and other kernel-based learning methods. Ai Magazine 2000, 22, 190.
[22] Matyas, J. Random optimization. Automat. Rem. Control 1965, 26, 246–253.
[23] Jain A, Levy D. 2016. Breast mass classification using deep convolutional neural networks. In: 30th conference on neural information processing systems (NIPS 2016). Barcelona, Spain. 1-6.
[24] Jiang F. 2017. Breast mass lesion classification in mammograms by transfer learning. In: ICBCB '17. Hong Kong, 59-62. DOI 10.1145/3035012.3035022.
[25] Ragab DA, Sharkas M, Marshall S, Ren J. 2019. Breast cancer detection using deep convolutional neural networks and support vector machines. PeerJ 7:e6201 http://doi.org/10.7717/peerj.6201.
[26] Keles, M. Kaya, "Breast Cancer Prediction and Detection Using Data Mining Classification Algorithms: A Comparative Study." Tehnicki Vjesnik - Technical Gazette, vol. 26, no. 1, 2019, p. 149+.


[27] Z. Guo, L. Tang, T. Guo, K. Yu, M. Alazab, A. Shalaginov, "Deep Graph Neural Network-based Spammer Detection Under the Perspective of Heterogeneous Cyberspace", Future Generation Computer Systems, https://doi.org/10.1016/j.future.2020.11.028.
[28] Y. Sun, J. Liu, K. Yu, M. Alazab, K. Lin, "PMRSS: Privacy-preserving Medical Record Searching Scheme for Intelligent Diagnosis in IoT Healthcare", IEEE Transactions on Industrial Informatics, doi: 10.1109/TII.2021.3070544.
[29] K. Yu, L. Tan, L. Lin, X. Cheng, Z. Yi and T. Sato, "Deep-Learning-Empowered Breast Cancer Auxiliary Diagnosis for 5GB Remote E-Health," IEEE Wireless Communications, vol. 28, no. 3, pp. 54-61, June 2021, doi: 10.1109/MWC.001.2000374.
[30] K. Yu, L. Tan, S. Mumtaz, S. Al-Rubaye, A. Al-Dulaimi, A. K. Bashir, F. A. Khan, "Securing Critical Infrastructures: Deep Learning-based Threat Detection in the IIoT", IEEE Communications Magazine, 2021.
[31] K. Yu, Z. Guo, Y. Shen, W. Wang, J. C. Lin, T. Sato, "Secure Artificial Intelligence of Things for Implicit Group Recommendations", IEEE Internet of Things Journal, 2021, doi: 10.1109/JIOT.2021.3079574.
[32] H. Li, K. Yu, B. Liu, C. Feng, Z. Qin and G. Srivastava, "An Efficient Ciphertext-Policy Weighted Attribute-Based Encryption for the Internet of Health Things," IEEE Journal of Biomedical and Health Informatics, 2021, doi: 10.1109/JBHI.2021.3075995.
[33] Puttamadappa, C., and B. D. Parameshachari. "Demand side management of small scale loads in a smart grid using glow-worm swarm optimization technique." Microprocessors and Microsystems 71 (2019): 102886.
[34] Rajendran, Ganesh B., Uma M. Kumarasamy, Chiara Zarro, Parameshachari B. Divakarachari, and Silvia L. Ullo. "Land-use and land-cover classification using a human group-based particle swarm optimization algorithm with an LSTM Classifier on hybrid pre-processing remote-sensing images." Remote Sensing 12, no. 24 (2020): 4135.
[35] Subramani, Prabu, Ganesh Babu Rajendran, Jewel Sengupta, Rocío Pérez de Prado, and Parameshachari Bidare Divakarachari. "A block bi-diagonalization-based pre-coding for indoor multiple-input-multiple-output-visible light communication system." Energies 13, no. 13 (2020): 3466.
[36] Hu, Liwen, Ngoc-Tu Nguyen, Wenjin Tao, Ming C. Leu, Xiaoqing Frank Liu, Md Rakib Shahriar, and SM Nahian Al Sunny. "Modeling of cloud-based digital twins for smart manufacturing with MT connect." Procedia manufacturing 26 (2018): 1193-1203.
[37] Seyhan, Kübra, Tu N. Nguyen, Sedat Akleylek, Korhan Cengiz, and SK Hafızul Islam. "Bi-GISIS KE: Modified key exchange protocol with reusable keys for IoT security." Journal of Information Security and Applications 58 (2021): 102788.
[38] Pham, Dung V., Giang L. Nguyen, Tu N. Nguyen, Canh V. Pham, and Anh V. Nguyen. "Multi-topic misinformation blocking with budget constraint on online social networks." IEEE Access 8 (2020): 78879-78889.
[39] Arun, M., E. Baraneetharan, A. Kanchana, and S. Prabu. "Detection and monitoring of the asymptotic COVID-19 patients using IoT devices and sensors." International Journal of Pervasive Computing and Communications (2020).
[40] Kumar, M. Keerthi, B. D. Parameshachari, S. Prabu, and Silvia liberata Ullo. "Comparative Analysis to Identify Efficient Technique for Interfacing BCI System." In IOP Conference Series: Materials Science and Engineering, vol. 925, no. 1, p. 012062. IOP Publishing, 2020.
