A Project Report
on
BREAST CANCER DETECTION USING MACHINE LEARNING
TECHNIQUE
Submitted in partial fulfillment of the requirement for the award of the degree of Master of Computer Applications
Submitted By
Rishav Raj (20scse2030072)
Rana Omanshu Goutam (20scse2030004)
CANDIDATE’S DECLARATION
We hereby certify that the work which is being presented in the project, entitled “BREAST CANCER DETECTION USING MACHINE LEARNING”, in partial fulfillment of the requirements for the award of the Master of Computer Applications submitted in the School of Computing Science and Engineering of Galgotias University, Greater Noida, is an original work carried out during the period of September 2021 to December 2021 under the supervision of Dr. E. Rajesh, Associate Professor, Department of Computer Science and Engineering/Computer Application and Information and Science, School of Computing Science and Engineering, Galgotias University, Greater Noida.
The matter presented in this project has not been submitted by us for the award of any other degree of this or any other university.
Rishav Raj (20scse2030072)
Rana Omanshu Goutam (20scse2030004)
This is to certify that the above statement made by the candidates is correct to the best of my
knowledge.
Name of Supervisor: Dr.E.Rajesh
Designation : Associate Professor
CERTIFICATE
The Project Viva-Voce examination of Rishav Raj (20scse2030072) and Rana Omanshu Goutam (20scse2030004) has been held on _________________ and their work is recommended for the award of the degree of Master of Computer Applications.
ABSTRACT
Breast cancer is one of the most widespread diseases, affecting women all around the world. The National Cancer Institute says that breast cancer is the second most common cancer for women in the United States. There are more than 2,000 new cases of breast cancer in men each year, and about 2,30,000 new cases in women every year. Diagnosis of this disease is crucial so that women can be treated faster, and a correct and early diagnosis is an important step in rehabilitation and treatment. Breast
cancer detection is done with the help of mammograms, which are basically X-rays of the breasts.
It’s a tool which is used to detect and help diagnose breast cancer. But detection is not easy due to
different kinds of uncertainties in using these mammograms. Machine Learning (ML) techniques
can help in the detection of breast cancer. We can use these techniques to make tools for doctors
that can be used as an effective mechanism for early detection and diagnosis of breast cancer
which could greatly enhance the survival rate of patients.
Contents

Candidates Declaration
Acknowledgment
Abstract
Contents
List of Tables
List of Figures
Acronyms
Chapter 1  Introduction
    1.1 Introduction
    1.2 Formulation of Problem
        1.2.1 Tools and Technology Used
Chapter 2  Literature Survey/Project Design
Chapter 3  Functionality/Working of Project
Chapter 4  Results and Discussion
Chapter 5  Conclusion and Future Scope
    5.1 Conclusion
    5.2 Future Scope
References
Publication/Copyright/Product
List of Figures

2.1 ML Architecture
2.2 Places ML is used
2.3 Digitized images of FNA, Benign, Malignant
2.4 Decision Trees
2.5 K-NN
2.6 Random Forests
4.1 System Architecture
4.2 Data Flow Diagram
4.3 Expected Output
5.1 Heat Map
5.2 Feature Plotting
5.3 Box Plots
5.4 KNN Actual Data
5.5 KNN Predicted Data
5.6 Previous and Predicted Data
5.7 Confusion Matrix and Accuracy
List of Tables

2.1 Description of features used in the dataset
CHAPTER 1
INTRODUCTION
Cancer is a disease that occurs when there are changes or mutations that take place
in genes that help in cell growth. These mutations allow the cells to divide and
multiply in a very uncontrolled and chaotic manner. These cells keep increasing and
start making replicas which end up becoming more and more abnormal. These
abnormal cells later on form a tumor. Tumors, unlike other cells, don’t die even
though the body doesn’t need them.
The cancer that develops in the breast cells is called breast cancer. This type of
cancer can be seen in the breast ducts or the lobules. Cancer can also occur in the
fatty tissue or the fibrous connective tissue within the breast. These cancer cells
become uncontrollable and end up invading other healthy breast tissues and can
travel to the lymph nodes under the arms.
Tumors are of two types: malignant and benign. Malignant tumors are cancerous. These cells keep dividing uncontrollably and start affecting other cells and tissues in the body. They spread to other parts of the body, and this type of cancer is hard to cure. Chemotherapy, radiation therapy and immunotherapy are treatments that can be given for these tumors. Benign tumors are non-cancerous. Unlike malignant tumors, they do not spread to other parts of the body and hence are much less risky than malignant ones. In many cases, such tumors don't really require any treatment.
Breast cancer is most commonly diagnosed in women above the age of 40, but the disease can affect men and women of any age. It can also occur when there is a family history of breast cancer. Breast cancer has always had a high mortality rate: according to statistics, it alone accounts for about 25% of all new cancer diagnoses and 15% of all cancer deaths among women worldwide. Scientists have known about its dangers from very early on, and hence a lot of research has gone into finding the right treatment for it.
Breast cancer detection is done with the help of mammograms, which are basically
X-rays of the breasts. It’s a tool which can help detect and diagnose breast cancer.
But, detection is not easy due to different kinds of uncertainties in using these
mammograms. The result of a mammogram is a set of images that can show any calcifications, or deposits of calcium, in the breasts. These do not always have to be cancerous. These tests can also find cysts, which are fluid-filled sacs that are normal during some women's menstrual cycles, as well as any cancerous or noncancerous lumps.
Mammograms can cost around ₹15,000 - ₹40,000 or more based on the hospital, location, and the area of the body to be covered. This is very expensive, and not many can afford it.
An early diagnosis is always best so that the treatment process can also be started early.
Breast cancer is one of the most widespread diseases. There are more than 2,000 new cases of breast cancer in men each year, and about 2,30,000 new cases in women every year. Diagnosis of this disease is crucial so that women can be treated faster; a correct and early diagnosis is essential.
The main objective of this project is to help doctors analyze huge datasets of cancer data and find patterns between a patient's data and the cancer data available. With this analysis we can predict whether the patient might have breast cancer or not.
Machine learning algorithms will help with this analysis of the datasets. These
techniques will be used to predict the outcome. The outcome can be either that the
cancer is benign or malignant. Benign cancer is the cancer which doesn’t spread
whereas malignant cancer cells spread across the body making it very dangerous.
This prediction can help doctors prescribe different medical examinations for the
patients based on the cancer type. This helps save a lot of time as well as money for
the patient.
1.3 PROBLEM DEFINITION
Over the years, cancer research has evolved continuously. Scientists used various methods, like early-stage screening, so that they could find different types of cancer before they could do any damage. With this research, they were able to develop new strategies to help predict early cancer treatment outcomes.
With the arrival of new technology in the medical field, huge amounts of data related to cancer have been collected and are available for medical research. But physicians find the accurate prediction of the cancer outcome to be the most interesting yet challenging part.
For this reason, machine learning techniques have become popular among
researchers. These tools can help discover and identify patterns and relationships
between the cancer data, from huge datasets, while they are able to effectively
predict future outcomes of a cancer type. Patients have to spend a lot of money on
different tests and treatments to check whether they have breast cancer or not.
These tests can take a long time and the results can be delayed. Also, after
confirmation that the patient has cancer, more tests need to be done to check
whether the cancer is benign or malignant.
In this project, we will be using different machine learning techniques to analyze the data given in the datasets. This analysis will help us predict whether the cancer is benign or malignant. Benign cancer does not spread, whereas malignant cancer cells spread across the body, making them very dangerous.
1.4 PROJECT FEATURES
This project scheme was developed to reduce some of the work of physicians and other doctors so that they do not have to conduct as many tests on patients. It also helps minimize the time and money patients spend undergoing these tests. As everything is digitized and based on data analysis, it takes less time to get results, and based on the results, further action can be taken. It also helps researchers in the medical as well as IT sectors to understand how different algorithms can predict different outcomes.
This scheme normally requires a huge amount of data about patient histories and cancer details. This data has been collected by many doctors over a long period of time and will be used to do the analysis, which reduces the time required to gather all the necessary data.
Company Profile
Elite Techno Group was created with a mission to create skilled software engineers for our country and the world. It aims to bridge the gap between the quality of skills demanded by industry and the quality of skills imparted by conventional institutes, with assessments, learning paths and courses authored by industry experts. Elite Techno Group helps businesses and individuals benchmark expertise across roles, speed up release cycles and build reliable, secure products. Elite Techno Group emphasizes the intellectual development of students by providing them with a practical mode of learning and thereby channeling their technical knowledge towards innovative real-world applications.
Overview
We are one of the fastest-growing EdTech companies in the space of college education. Our
mission is to make engineers learn real engineering by working on practical problems and
projects. And we are open to ideas, suggestions, collaborations, and experiments.
Website
http://elitetechnogroups.com
Industry
Higher Education
Company size
11-50 employees
Headquarters
Bangalore, Karnataka
Type
Privately Held
Founded
2013
Specialties
Mechanical Engineering, Automotive Engineering, Engineering Design, Machine
Learning, Artificial Intelligence, Industry4.0, Internet of Things, Data Science,
Computer Science, Programming, and Robotics
Mail Us
info@elitetechnogroups.com
Call Us
+91-9513023665
+91-7742633665
Our Location
418, Jaipur Electronics Mall, Riddhi Siddhi circle, Goplapura Bypass, Jaipur
(302018), Rajasthan, India
Regional Office
Office No. 22, 12th B Main Rd, HAL 2nd Stage, Indiranagar, Bengaluru, Karnataka
560008
CHAPTER 2
Machine Learning
The procedures used in machine learning are similar to those of data mining and predictive modeling. Both require scanning through huge amounts of data to search for patterns and then adjusting the program accordingly. Most people have encountered machine learning while shopping on the internet, when they are shown ads based on what they searched for earlier. This happens because many of these websites use machine learning to customize the ads based on user searches, and this is done in real time. Machine learning is also used in various other areas such as fraud detection, spam filtering, network security threat detection, predictive maintenance and building news feeds.
• Supervised learning – Here both the input and the output are known. The training dataset also contains the answer the algorithm should come up with on its own. So, a labeled dataset of fruit images would tell the model which photos were of apples, bananas and oranges. When a new image is given to the model, it compares it to the training set to predict the correct outcome.
• Unsupervised learning – Here the input dataset is known but the output is not. A deep learning model is given a dataset without any instructions on what to do with it. The training data contains information without any correct result, and the network tries to automatically understand the structure of the data.
• Semi-supervised learning – This type falls somewhere between supervised and unsupervised learning. It uses both labelled and unlabelled data.
• Reinforcement learning – In this type, AI agents try to find the best way to accomplish a particular goal. The agent tries to predict the next step that could give it the best result in the end.
Machine learning is used in many different sectors such as business, medicine and sports. A short illustration of the supervised/unsupervised distinction follows.
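To make the distinction concrete, the sketch below is an addition (not part of the original report): it trains a supervised classifier on labelled toy data and then clusters the same data without labels using scikit-learn. The dataset and model choices are illustrative assumptions only.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data: 2-D points in three groups; y holds the "correct answers".
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: the labels y are shown to the model during training.
clf = KNeighborsClassifier().fit(X, y)
print("supervised predictions:", clf.predict(X[:5]))

# Unsupervised: only X is given; the model must find structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])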
Traditionally, the diagnosis of breast cancer and the classification of the cancer as
malignant or benign was done by various medical procedures like:
• Breast exam – The doctor would check the breasts and lymph nodes in the armpits
to check if there are any lumps or abnormalities.
• Mammogram – These are X-rays of the breast. They are used to check whether there is breast cancer or not. If any issues are found, the doctor may ask the patient to take a diagnostic mammogram to check for further abnormalities.
• Removing a sample of breast cells for testing (biopsy) – This is probably the only
definite way of checking if a patient has breast cancer. The doctor uses a specialized
needle device guided by the X-ray or any other test to take samples of tissues from
the area to be checked.
• MRI of the breasts - An MRI machine uses a magnet and radio waves to create
pictures to see the interiors of the breast tissues.
Blood tests, CT scans and PET scans are also done to check for breast cancer.
Disadvantages:
• Time consuming.
• Very expensive.
2.3 PROPOSED SYSTEM
In the proposed system we plan on using existing data of breast cancer patients, which has been collected over a number of years, and running different machine learning algorithms on it. These algorithms will analyze the data from the datasets to predict whether the patient has breast cancer or not, and will also tell us whether the cancer is malignant or benign.
This is done by taking the patient's data, mapping it against the dataset, and checking whether any patterns are found in the data. If a patient has breast cancer, then instead of taking more tests to check whether the cancer is malignant or benign, ML can be used to predict the case based on the huge amount of data on breast cancer. This proposed system helps patients as it reduces the amount of money they need to spend just for the diagnosis.
Also, if the tumor is benign, then it is not cancerous, and the patient does not need to go through any of the other tests. This saves a lot of time as well.
Advantages:
• Accurate.
• DATASET
The Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which can be found in the University of California, Irvine's machine learning repository [1], contains the following ten real-valued features computed for each cell nucleus:
• radius
• texture
• perimeter
• area
• smoothness
• compactness
• concavity
• concave points
• symmetry
• fractal dimension
Each of these features has its mean, standard error, and “worst” (mean of the three largest values) computed. Hence, the dataset has a total of 30 features.
Table 2.1 Description of features used in the dataset
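The same 569-sample, 30-feature dataset also ships with scikit-learn. The short sketch below is an addition (the report itself reads the data from a CSV file) and simply shows how the built-in copy can be inspected.

from sklearn.datasets import load_breast_cancer

# Load the bundled copy of the WDBC data and inspect its shape and labels.
wdbc = load_breast_cancer()
print(wdbc.data.shape)         # (569, 30): 569 samples, 30 features
print(wdbc.feature_names[:3])  # 'mean radius', 'mean texture', 'mean perimeter'
print(wdbc.target_names)       # ['malignant' 'benign']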
• ARTIFICIAL NEURAL NETWORKS
Artificial neural networks are a very important tool in machine learning. This technique, as the name suggests, is inspired by the brain and its activities, and was designed to replicate the way humans learn. Neural networks normally consist of an input and an output layer, and in most cases a hidden layer consisting of units that transform the input into something that the output layer can use. These tools help in finding patterns which are too hard for a human to find by hand; instead of being explicitly programmed with the pattern, the machine is trained to recognize it.
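As a purely illustrative sketch, not taken from the report, the snippet below fits a small multi-layer perceptron with one hidden layer to the breast cancer data bundled with scikit-learn; the layer size and other parameters are assumptions chosen only for demonstration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer -> one hidden layer of 30 units -> output layer.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(30,), max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
print("Test accuracy: {:.2%}".format(mlp.score(X_test, y_test)))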
• ML ALGORITHMS
1. DECISION TREES
This algorithm is used to predict the value of an output or target variable based on many input variables. Decision trees are a collection of divide-and-conquer problem-solving strategies and take the shape of a tree-like structure. The tree starts with a root node, which splits into sub-nodes or child nodes, and these branches keep splitting until an outcome is reached. It is used mainly for classification problems.
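To make the root-node/child-node idea concrete, the hedged sketch below (an illustration added here, not the report's code) fits a shallow decision tree to the WDBC data and prints its splits; the depth limit is an assumption chosen for readability.

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints each split as "feature <= threshold" branches.
print(export_text(tree, feature_names=list(load_breast_cancer().feature_names)))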
2. K-NEAREST NEIGHBORS (K-NN)
K-nearest neighbors classifies a new data point based on the classes of its K closest points in the training data. Both for classification and regression, a useful technique is to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor.
When KNN is used for classification, the output can be calculated as the class
with the highest frequency from the K-most similar instances. Each instance in
essence votes for their class and the class with the most votes is taken as the
prediction.
If you have an even number of classes (e.g. 2), it is a good idea to choose an odd value of K to avoid ties; conversely, use an even value of K when you have an odd number of classes.
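The following sketch is an added illustration (not the report's code) of the uniform-voting and 1/d distance-weighting options in scikit-learn's K-NN classifier, run on the bundled WDBC data; the choice of K = 5 and the train/test split are assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "uniform": every neighbor votes equally; "distance": votes weighted by 1/d.
for weights in ("uniform", "distance"):
    clf = KNeighborsClassifier(n_neighbors=5, weights=weights).fit(X_train, y_train)
    print(weights, "accuracy: {:.2%}".format(clf.score(X_test, y_test)))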
Advantages:
The algorithm is simple and easy to implement.
There’s no need to build a model, tune several parameters, or make additional
assumptions.
The algorithm is versatile. It can be used for classification, regression, and search.
Disadvantages:
The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
3. NAÏVE BAYES
The Naive Bayes classifier is based on Bayes' theorem and is one of the oldest approaches to classification problems. The formula is:
P(c|x) = P(x|c) * P(c) / P(x)
where P(c|x) is the posterior probability of class c given the features x, P(x|c) is the likelihood, P(c) is the prior probability of the class, and P(x) is the prior probability of the features. The classifier is called "naive" because it assumes the features are independent of each other given the class.
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Real-time prediction: Naive Bayes is a fast, eager-learning classifier, so it can be used for making predictions in real time.
Multi-class prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the probabilities of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation systems: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Scikit-learn (a Python library) can be used to build a Naive Bayes model in Python. There are three types of Naive Bayes models in the scikit-learn library:
Gaussian: It is used in classification and it assumes that features follow a normal
distribution.
Multinomial: It is used for discrete counts. For example, in a text classification problem, this goes one step further than Bernoulli trials: instead of "word occurs in the document" we count how often the word occurs in the document, which you can think of as "the number of times outcome x_i is observed over the n trials".
Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros
and ones). One application would be text classification with ‘bag of words’ model
where the 1s & 0s are “word occurs in the document” and “word does not occur in
the document” respectively.
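As an added, hedged illustration (not from the report), the snippet below applies each scikit-learn Naive Bayes variant to the kind of data it is meant for; the tiny arrays are made-up examples.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian: continuous features assumed to be normally distributed.
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 3.8]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 2.0]]))

# Multinomial: discrete counts, e.g. word counts per document.
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [0, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# Bernoulli: binary features, e.g. word present / absent.
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))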
4. RANDOM FOREST
• The same random forest algorithm, or random forest classifier, can be used for both classification and regression tasks.
• The random forest classifier can handle missing values.
• With more trees in the forest, the random forest classifier will not overfit the model.
• The random forest classifier can also be used for categorical values.
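A minimal added sketch (not the report's code) of a random forest on the WDBC data, comparing a small and a larger number of trees; the parameter values are assumptions chosen for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# More trees usually stabilizes the estimate rather than overfitting.
for n in (10, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    print(n, "trees, 5-fold CV accuracy: {:.2%}".format(cross_val_score(rf, X, y, cv=5).mean()))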
• PYTHON
• R PROGRAMMING
One of R's strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where required. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
• SCIKIT-LEARN
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. Of the various scikits, scikit-learn as well as scikit-image were described as "well-maintained and popular" in November 2012. As of 2018, scikit-learn is under active development.
CHAPTER 3
FUNCTIONALITY/WORKING OF PROJECT
1. Collection of datasets.
6. Improving results
FUNCTIONAL:
A functional requirement defines the system or a component of the system. A function is basically described by its inputs, behavior and outputs. Examples of functional requirements are calculations, technical details, and data manipulation and processing. Functional requirements tell us what a system is supposed to do. The functional requirements of this project are:
• Understand all the features as well as the data provided in the dataset.
• Map the data in the dataset with the given input data. Find patterns, if any, with
both the dataset as well as input data.
• Check whether the input data of a patient will result in the diagnosis of breast cancer
or not.
• If breast cancer is diagnosed, provide information on the type of breast cancer, i.e., benign or malignant.
• Provide the percentage accuracy of the proposed prediction.
NON-FUNCTIONAL:
The non-functional requirements are as follows:
3.2.1 ACCESSIBILITY:
It is easy to access as the dataset is open source and can be found in the University of California, Irvine's ML dataset repository. Unlike breast cancer diagnosis tests in hospitals, which cost a lot, anyone can access this dataset for free.
3.2.2 MAINTAINABILITY:
Maintainability tells us how easily a software or tool or system can be modified in order
to:
• Correct defects
• Meet new requirements
Different programming languages can be used to make the predictive model based
on the programmer’s wishes. The datasets can also be modified and new data can
be added as and when the data is updated by doctors. Different ML algorithms can
also be used to check which algorithm will give the best result.
As python and R are both programming languages that can adapt to new changes
easily, it is easy to maintain this type of system.
3.2.3 SCALABILITY:
The system can work normally under conditions such as low bandwidth and huge datasets. RStudio as well as Excel can handle this data and run the algorithms with ease.
3.2.4 PORTABILITY:
Portability tells us how easily we can reuse an existing piece of code when we move from one location or environment to another.
This system uses the Python and R programming languages, which can be executed under different operating conditions provided the system meets its minimum configuration. Only system files and dependent assemblies would have to be configured in such a case.
HARDWARE REQUIREMENTS:
RAM: 512 MB
Hard disk: 10 GB
Input devices: standard keyboard and mouse
Output device: VGA and high-resolution monitor
SOFTWARE REQUIREMENTS:
• Rattle
• Spyder
• Jupyter Notebook
• Scikit-learn library
CHAPTER 4
DESIGN
Under our model, the goal of our project is to create a design to achieve the following:
4.1.1 ACCURACY
Only accurate outcomes can help make this model a good one. It can be reliable
only when all the outcomes are correct and can be trusted. As this data is required
for healthcare purposes, it is important that no errors occur.
4.1.2 EFFICIENCY
The model should be efficient as there is no requirement of manual data entry work
or any work by doctors. It takes less time to predict outcomes after all the ML
algorithms have been used on the data.
As this project does not have any UI, the architecture is basically the dataset and its features. The focus is on understanding the dataset and keeping the system as simple as possible.
The dataset is first split into a training set and a testing set. The training set is first exposed to the machine learning algorithms so that the system learns what data gives what type of outcome.
After the system is trained, the testing data is used to check whether the system can correctly predict the class of the data. This gives the percentage accuracy of the model.
[Figure 4.1 System Architecture and Figure 4.2 Data Flow Diagram, both built around the breast cancer dataset]
The dataflow diagram shows the way in which the data from the dataset moves.
The outcome of this model is to correctly check and predict whether a patient has breast cancer or not. If yes, the model should also be able to tell whether the cancer is malignant or benign.
CHAPTER 5
IMPLEMENTATION AND OUTPUT
Step 1: The first step in the machine learning process is to prepare the data. This
includes importing all the packages that will help us organize and visualize the data.
The packages used are as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: After importing all the necessary packages, we need to load the dataset. We
use the help of Pandas to load the data set.
data = pd.read_csv('../input/data.csv')
Step 3: We need to drop the ID column (and the empty last column) of the dataset, as these fields will not help us in the classification process. This is done as follows:
data.drop(data.columns[[-1, 0]], axis=1, inplace=True)
Step 4: Check how many data points are malignant and how many are benign.
diagnosis_all = list(data.shape)[0]
diagnosis_categories = list(data['diagnosis'].value_counts())
The data has 569 diagnoses: 357 benign and 212 malignant.
5.2 VISUALIZING THE DATA
We need to build visualizations of the data in order to decide how to proceed with the
machine learning tools. The Seaborn and the Matplotlib packages will be used for this
purpose. We use the mean values of the features, so first we will separate those features into a list to make the work easier and the code more readable.
features_mean = list(data.columns[1:11])
The first method that can be used for visualization is heat map. A heat map is a two-
dimensional representation of data in which values are represented by colors. A
simple heat map provides an immediate visual summary of information. More
elaborate heat maps allow the viewer to understand complex data sets.
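A sketch of such a correlation heat map is shown below. It assumes the data frame data and the list features_mean from the earlier steps; it mirrors the idea of the report's Figure 5.1 but is not the report's exact plotting code.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation between the mean-value features, drawn as a heat map.
plt.figure(figsize=(10, 8))
sns.heatmap(data[features_mean].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between mean features')
plt.show()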
1. K-NEAREST NEIGHBORS
The data after using the algorithm on the testing set is shown in Figures 5.4 and 5.5 (KNN actual data and KNN predicted data).
Using all the mean-value features from the dataset and the scikit-learn libraries, we can run the code to find the accuracy of the breast cancer prediction. The important attributes that we must consider from the dataset are 'target_names' (the meaning of the labels), 'target' (the classification labels), 'feature_names' (the meaning of the features) and 'data' (the data to learn).
For testing the accuracy of our classifier, we must test the model on unseen data. So, before building the model, we will split our data into two sets: a training set and a test set. We will use the training set to train and evaluate the model and then use the trained model to make predictions on the unseen test set. The sklearn module has a built-in function called train_test_split(), which automatically divides the data into these sets. We will be using this function to split the data.
The train_test_split() function randomly splits the data using the parameter test_size. Here, 33% of the original data is split off as test data (test), and the remaining data (train) is the training data. We also have the respective labels for both the train and test variables, i.e. train_labels and test_labels.
There are many machine learning models to choose from, and all of them have their own advantages and disadvantages. For this model, we will be using the Naive Bayes algorithm, which usually performs well in binary classification tasks. First, import the GaussianNB module and initialize it using the GaussianNB() function. Then train the model by fitting it to the data in the dataset using the fit() method.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import time

# Decision tree classifier; accuracy_all and cvs_all are assumed to be
# result lists defined earlier in the report's full code.
start = time.time()
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
scores = cross_val_score(clf, X, y, cv=5)
end = time.time()

accuracy_all.append(accuracy_score(prediction, y_test))
cvs_all.append(np.mean(scores))

print("Decision Tree Accuracy: {0:.2%}".format(accuracy_score(prediction, y_test)))
print("Execution time: {0:.5} seconds \n".format(end - start))
from sklearn.model_selection import train_test_split

# features_selection is assumed to be a list of column names chosen earlier in the report.
X = data.loc[:, features_selection]
y = data.loc[:, 'diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

accuracy_selection = []
cvs_selection = []
1. K-NEAREST NEIGHBORS
from sklearn.neighbors import KNeighborsClassifier

start = time.time()
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
scores = cross_val_score(clf, X, y, cv=5)
end = time.time()

accuracy_selection.append(accuracy_score(prediction, y_test))
cvs_selection.append(np.mean(scores))

print("Accuracy: {0:.2%}".format(accuracy_score(prediction, y_test)))
print("Execution time: %s seconds \n" % "{0:.5}".format(end - start))

Accuracy: 92.11%
Execution time: 0.020711 seconds
2. NAÏVE BAYES
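The report's own code and output for the Naive Bayes model are not reproduced in this copy. The following is a minimal sketch, assuming the same X, y, X_train/X_test split and the accuracy_selection / cvs_selection bookkeeping lists used for K-NN above, of how the GaussianNB classifier described earlier could be evaluated in the same way; it is an illustration, not the report's code or results.

from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes, evaluated with the same pattern as the K-NN block above.
start = time.time()
clf = GaussianNB()
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
scores = cross_val_score(clf, X, y, cv=5)
end = time.time()

accuracy_selection.append(accuracy_score(prediction, y_test))
cvs_selection.append(np.mean(scores))

print("Naive Bayes Accuracy: {0:.2%}".format(accuracy_score(prediction, y_test)))
print("Execution time: {0:.5} seconds \n".format(end - start))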
CONCLUSION AND FUTURE SCOPE
In this project we have worked to collect a suitable dataset to help in this predictive analysis. This dataset is then processed to remove all the junk data. The predictive analysis method is being used in many different fields and is slowly picking up pace; it helps us solve or predict a problem's outcome in smarter ways. Our scheme was developed to reduce the time and cost burden on patients as well as to minimize the work of a doctor, and we have tried to use a very simple and understandable model to do this job. Next, machine learning algorithms should be used on the training data, and the testing data should be used to check whether the outcomes are accurate enough.
In the future, we can also use a dataset to predict the recurrence of breast cancer after a surgery or chemotherapy session. Artificial neural networks can be applied to make the prediction better and smarter. Accuracy can be increased by selecting better features.
REFERENCES
[1] Wolberg, Street and Mangasarian, “Wisconsin Diagnostic Breast Cancer Dataset”
http://archive.ics.uci.edu/ml
[2] Mengjie Yu, “Breast Cancer Prediction Using Machine Learning Algorithm”, The
University of Texas at Austin, 2017.
[3] Wenbin Yue, Zidong Wang, Hongwei Chen, Annette Payne and Xiaohui Liu,
“Machine Learning with Applications in Breast Cancer Diagnosis and Prognosis”, 2018.
[4] S. Palaniappan and T. Pushparaj, “A Novel Prediction on Breast Cancer from the
Basis of Association rules and Neural Network”, 2013.
[5] Joseph A. Cruz, David S. Wishart, “Applications of Machine Learning in Cancer
Prediction and Prognosis”
[6] Yoichi Murakami, Kenji Mizuguchi: Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics 26(15): 1841-1848 (2010).
[7] George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In P. Besnard and S. Hanks, editors, Eleventh Annual Conference on Uncertainty in Artificial Intelligence, pages 338-345, San Francisco, 1995. Morgan Kaufmann Publishers.
[8] Wilbert Sibanda and Philip Pretorius. Article: Novel Application of Multi-Layer Perceptrons (MLP) Neural Networks to Model HIV in South Africa using Seroprevalence Data from Antenatal Clinics. International Journal of Computer Applications 35(5): 26-31, December 2011. Published by Foundation of Computer Science, New York, USA.
[11] J. Park, I.W. Sandberg: Approximation and radial basis function networks. Neural Computation, 5 (1993), pp. 305-316.
[12] Domingos, P. A few useful things to know about machine learning. Commun. ACM 55(10): 78-87 (2012).
[14] Mousa, R., Munib, Q., Moussa, A., 2005. Breast cancer diagnosis system based on wavelet analysis and fuzzy-neural. Expert Syst. Appl. 28, 713-723.
[15] Peña-Reyes, C.A., Sipper, M., 2000. A fuzzy-genetic approach to breast cancer diagnosis. Artificial Intell. Med. 17, 131-155.
[16] I. Kononenko, Machine learning for medical diagnosis: history, state of the art and perspective, Artif. Intell. Med. 23 (2001) 89-109.
[17] Gardner, M. W., and Dorling, S. R. (1998). "Artificial neural networks (the multilayer perceptron) - a review of applications in the atmospheric sciences." Atmospheric Environment, 32(14/15), 2627-2636.