1 Introduction
The modern health care system generates huge volumes of data every day. There is
a need to mine and analyze these data to extract useful information and to reveal
hidden patterns. Data mining is the process of discovering new patterns from data
collected from varying sources. A number of machine learning algorithms have been
used successfully to make predictions in various domains such as healthcare, weather
forecasting, stock price prediction, and product recommendation. An important aspect
of medical science research is the prediction of various diseases and the factors that
cause them. In the medical domain, healthcare data are being used to predict epidemics,
to detect disease, to improve quality of life, and to avoid early deaths [1]. In this
work, we investigate three different classification algorithms for the prediction of
anemia.
Anemia is defined as a decrease in the amount of red blood cells (RBCs)
or hemoglobin in the blood [2]. It has significant adverse health consequences, as
well as adverse impacts on economic and social development. Although the most
reliable indicator of anemia is blood hemoglobin concentration, a number of factors
can cause anemia, such as iron deficiency; chronic infections such as HIV, malaria,
and tuberculosis; vitamin deficiencies, e.g. vitamins B12 and A; cancer; and
acquired disorders that affect red blood cell production and hemoglobin.
Anemia causes fatigue and low productivity [3, 4, 5] and, when it occurs in
pregnancy, may be associated with an increased risk of maternal and perinatal
mortality [6, 7]. According to the World Health Organization (WHO), maternal and
neonatal mortality were responsible for 3.0 million deaths in 2013 in developing
countries.
Anemia prediction plays an important role in detecting other associated diseases.
Anemia is classified on the basis of morphology or on the basis of its underlying
cause (Figure 1).
Based on morphology, anemia is divided into three types: normocytic, microcytic,
and macrocytic. Based on cause, anemia is classified into three types, namely blood
loss, inadequate production of normal blood cells, and excessive destruction of
blood cells.
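The morphological classes above are conventionally distinguished by mean corpuscular volume (MCV). The following is a minimal sketch of that mapping, not part of the method evaluated in this paper. It assumes the commonly cited adult cut-offs of 80 fL and 100 fL, which are not stated in this paper; the function name `morphology_from_mcv` and its parameters are hypothetical.

```python
def morphology_from_mcv(mcv_fl: float, low: float = 80.0, high: float = 100.0) -> str:
    """Map an MCV value (in femtolitres) to a morphological anemia class.

    The default cut-offs (80 and 100 fL) are an assumption for illustration;
    laboratories may use slightly different reference ranges.
    """
    if mcv_fl < low:
        return "microcytic"   # red cells smaller than the reference range
    if mcv_fl > high:
        return "macrocytic"   # red cells larger than the reference range
    return "normocytic"       # red cell size within the reference range


# Example with three illustrative MCV values.
labels = [morphology_from_mcv(v) for v in (72.0, 90.0, 110.0)]
```

The mapping only describes cell size; it does not by itself establish whether a patient is anemic, which depends on hemoglobin concentration.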
In the last decade, numerous data mining and machine learning techniques have been
applied to anemia prediction. The most notable ones are the following:
In [8], the SMO support vector machine and the C4.5 decision tree algorithm have
been used for the prediction of anemia, and the performance of the two algorithms is
compared.
In [11], WEKA is used to find a suitable classifier for developing a mobile app
that can predict and diagnose hematological data comments. The authors compared
neural network classification algorithms with the J48 and Naive Bayes classifiers.
The results show that the J48 classifier exhibits maximum accuracy.
Dogan & Turkoglu [12] developed a decision support system for detecting
iron-deficiency anemia using a decision tree algorithm. The algorithm uses three
hematology parameters: serum iron, serum iron-binding capacity, and ferritin. The
evaluation was done on data from 96 patients, and the results matched the
physician's decision.
Abdullah and Al-asmari [13] experimented with the WEKA algorithms Naive Bayes,
Multilayer Perceptron, J48, and SMO in an attempt to predict anemia types using CBC
reports. The evaluation was done on real data constructed from the CBC reports of 41
anemic persons. Similar to [11], the J48 decision tree algorithm, along with SMO, was
the best performer, with an accuracy of 93.75%.
Unlike the work in [11] and [13], we have chosen a different set of classifiers and
local data in our work.
There are four main tests that are ordered to diagnose anemia: complete blood
count (CBC), ferritin, PCR (polymerase chain reaction), and hemoglobin
electrophoresis.
The CBC test is the most frequently ordered blood test. It is used to assess overall
health and to detect a wide range of diseases [8], including anemia, infection, and
leukemia. A complete blood count reports almost 15 measurements, including
hemoglobin (Hb), red blood cells (RBC), hematocrit (HCT), mean corpuscular
hemoglobin (MCH), mean corpuscular volume (MCV), and so on [8].
A ferritin test measures the amount of iron stored in the body. High levels
of ferritin indicate an iron storage disorder, such as hemochromatosis. Low levels
of ferritin indicate iron deficiency, which causes anemia.
The PCR test is a molecular test used to diagnose genetic disorders.
A hemoglobin electrophoresis test is a blood test used to measure and identify the
different types of hemoglobin in the bloodstream.
4 Methodology
We have used three classifiers, namely random forest, Naive Bayes, and the C4.5
decision tree algorithm. Figure 2 depicts the flowchart of the proposed method.
The random forest (RF) algorithm is derived from the decision tree classifier. It is
a combination of tree predictors that aggregates the results of all the trees in the
collection and uses majority voting for prediction.
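The aggregation step of a random forest can be sketched as follows. This is a minimal illustration of majority voting only, not the implementation used in our experiments: the three "trees" are hand-written stand-ins for trained trees, and their thresholds, the class labels, and the names `majority_vote` and `forest_predict` are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, List

Sample = Dict[str, float]          # one CBC record, keyed by attribute name
Tree = Callable[[Sample], str]     # a tree maps a sample to a class label


def majority_vote(votes: List[str]) -> str:
    """Return the label that received the most votes.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees: List[Tree], sample: Sample) -> str:
    """Collect one vote per tree and aggregate the votes by majority."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-ins for trained trees; thresholds are illustrative only.
toy_trees: List[Tree] = [
    lambda s: "anemic" if s["HGB"] < 12.0 else "non-anemic",
    lambda s: "anemic" if s["HCT"] < 36.0 else "non-anemic",
    lambda s: "anemic" if s["MCV"] < 80.0 else "non-anemic",
]

sample = {"HGB": 10.5, "HCT": 33.0, "MCV": 85.0}
prediction = forest_predict(toy_trees, sample)  # two of three trees vote "anemic"
```

In an actual random forest each tree is learned from a bootstrap sample of the training data with a random subset of attributes considered at each split; only the voting step is shown here.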
[Figure 2: Flowchart of the proposed method: Data Collection, Classifier Learning, Performance Evaluation]
5.1 Dataset
We collected data from different pathology centers and laboratory test centers in
the nearby area. The collected dataset consists of 200 test samples. These are CBC
test data. The dataset contains 18 attributes, out of which we have selected only
those that are required for anemia detection. These are Age, Gender, MCV, HCT, HGB,
The proposed method uses CBC test values. First, the data are pre-processed to extract
the seven attributes mentioned in Section 5.1. Then, we apply the random forest,
decision tree, and Naive Bayes (NB) classifiers to them. The performance evaluation is
done in terms of accuracy and mean absolute error (MAE). The MAE measures how close
the predictions are to the eventual outcomes. Table 1 shows the results of the three
classifiers. Ten-fold cross-validation has been used to obtain the accuracy.
[Table 1: Results of the three classifiers; columns: Random Forest, Naive Bayes, C4.5]
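The evaluation measures and the ten-fold split can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the reported results: it assumes a binary labelling (1 for anemic, 0 for non-anemic), computes the MAE between the true labels and predicted class probabilities, and builds folds by simple striding without shuffling or stratification. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_prob: Sequence[float]) -> float:
    """Average absolute difference between true values and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_prob)) / len(y_true)


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs.

    Sample i is assigned to fold i % k, so every sample is used for
    testing exactly once and for training k - 1 times.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# With 200 samples and k = 10, each fold holds 20 test and 180 training samples.
splits = k_fold_indices(200, k=10)

# Toy labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 1.0]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
acc = accuracy(y_true, y_pred)
mae = mean_absolute_error(y_true, y_prob)
```

In a full run, a classifier would be trained on each training split and scored on the matching test split, and the ten scores would be averaged.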
In this paper, we have compared the performance of three different classifiers in the
prediction of anemia. The experimental results on a sample dataset suggest that the
Naive Bayes classification algorithm provides the best performance in terms of
accuracy as compared to C4.5 and random forest. Automatic prediction can reduce the
manual effort involved in diagnosis. In the future, automated tools can be developed
that use the prediction results to suggest further diagnosis. Such automated tools
can prove valuable in the timely detection of more serious diseases. Furthermore,
such a disease prediction system can be extended to recommend a treatment plan.
