Bellaachia PDF
accuracy. After a careful analysis of the breast cancer data used in [9], we noticed that the number of not-survived patients does not match the number of not-alive patients (field VSR) in the first 60 months of survival time. In fact, the proportion of not-survived patients is expected to be around 20%, given the breast cancer survival statistic of 80% [1]. In our discussion with the authors of [9], we found that the pre-classification process was not accurate in determining the records of the not-survived class. They took into consideration neither the Vital Status Recode (VSR) nor the Cause of Death (COD). They assume that every patient who died had died of cancer, which is not always true.

In our study, we have used a newer version of the SEER database (period of 1973-2002 with 482,052 records) and, unlike [9], we have included two other fields (VSR and COD) in the pre-classification process, which therefore uses:

    Survival Time Recode (STR),
    Vital Status Recode (VSR),
    Cause of Death (COD)

The next section presents our pre-classification process.

3. Methodology

In this paper, we have investigated three data mining techniques: the Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree algorithms. We used these algorithms to predict the survivability rate for the SEER breast cancer data set. We selected these three classification techniques to find the most suitable one for predicting the cancer survivability rate.

The Naïve Bayes technique rests on the well-known Bayesian approach and yields a simple, clear, and fast classifier [10]. It is called naïve because it assumes mutually independent attributes. In practice, this is almost never true, but it can be approximated by preprocessing the data to remove the dependent categories [10]. This method has been used in many areas to represent, utilize, and learn probabilistic knowledge, and significant results have been achieved in machine learning [10].

The second technique uses artificial neural networks. In this study, a multi-layer network with back-propagation (also known as a multi-layer perceptron) [10] is used.

The third technique is the C4.5 decision-tree generating algorithm [11]. C4.5 is based on the ID3 algorithm.

It has been shown that the last two techniques have better performance [7, 8, 9]. Therefore, we have included them in our analysis.

We have used the Weka toolkit to experiment with these three data mining algorithms [12]. Weka is a collection of tools for data classification, regression, clustering, association rules, and visualization. The toolkit is developed in Java and is open source software issued under the GNU General Public License [10].

Preprocessing the input data set for a knowledge discovery goal using a data mining approach usually consumes the largest portion of the effort devoted to the entire work [10]. We have developed a set of tools to extract and clean up the raw SEER data.

A simple analysis shows that the SEER data has missing information in the Extent of Disease (EOD) and Site Specific Surgery (SSS) fields for almost half of the records. Most of the missing information is in the records gathered prior to 1988. Since we wanted to use all the available fields in the SEER database, we removed these records from the test data set. These records have Coding System for EOD coded as 4. The usage of the SSS field changed after 1998: instead of the regular field, the information is split into five other fields. A mapping scheme from new SSS to old SSS was developed to fill the missing SSS fields. After this step, the records with missing information were removed from the data set.

The EOD field is composed of five fields including the EOD code. These fields (size of tumor, number of positive nodes, number of nodes, and number of primaries) contain missing information coded as 999, 99, or 9 to represent unknown information. Please note that the statistics in Table 1 do not include fields with unknown values. The table also shows the fields used in our analysis.

Nominal variable name          Number of distinct values
Race                           19
Marital status                 6
Primary site code              9
Histologic type                48
Behavior code                  2
Grade                          5
Extension of tumor             23
Lymph node involvement         10
Site specific surgery code     19
Radiation                      9
Stage of cancer                5

Numeric variable name          Mean    Std. Dev.    Range
Age                            58      13           10-110
Tumor size                     20      16           0-200
No of positive nodes           1.5     3.7          0-50
Number of nodes                15      6.8          0-95
Number of primaries            1.25    0.5          1-8

Table 1: Survivability Attributes
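
The removal of records with unknown values, described above for the EOD-related fields, can be illustrated with a short sketch. It is not the extraction tooling developed for this study. The field names, the assignment of each unknown code to a particular field, and the sample records are assumptions made for the example; the text states only that unknown values are coded as 999, 99, or 9.

```python
from statistics import mean, stdev

# Hypothetical field names and code assignments. The text says only that
# unknown values are coded as 999, 99, or 9; which code belongs to which
# field is an assumption made for this illustration.
UNKNOWN_CODES = {
    "tumor_size": 999,
    "positive_nodes": 99,
    "nodes_examined": 99,
    "primaries": 9,
}


def has_unknown(record):
    """Return True if any listed field holds its 'unknown' code."""
    return any(record.get(field) == code for field, code in UNKNOWN_CODES.items())


def drop_unknown(records):
    """Keep only the records in which every listed field is known."""
    return [r for r in records if not has_unknown(r)]


def summarize(records, field):
    """Mean, sample standard deviation, and (min, max) range of one field."""
    values = [r[field] for r in records]
    return mean(values), stdev(values), (min(values), max(values))


# Made-up records for illustration only; these are not SEER data.
records = [
    {"tumor_size": 20, "positive_nodes": 1, "nodes_examined": 15, "primaries": 1},
    {"tumor_size": 999, "positive_nodes": 0, "nodes_examined": 12, "primaries": 1},
    {"tumor_size": 35, "positive_nodes": 99, "nodes_examined": 99, "primaries": 2},
    {"tumor_size": 10, "positive_nodes": 3, "nodes_examined": 20, "primaries": 1},
]

clean = drop_unknown(records)
avg, sd, value_range = summarize(clean, "tumor_size")
print(len(clean), avg, round(sd, 2), value_range)
```

Filtering before computing the summary statistics matters because a code such as 999 would otherwise be treated as a real tumor size and distort the mean and range.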
Figure 1 shows the ranked survivability attributes of our data as calculated by the Weka toolkit. It clearly shows that Extension of Tumor has a higher rank than Tumor Size.

[Figure 1: ranked survivability attributes. Labels recovered from the plot: Number of primaries, Primary site, Lymph node involv (EOD), Grade, Site Specific Surgery, Radiation, Age, Behavior code, No of pos nodes (EOD), Stage of cancer, Marital status, Histologic type, Race. Recovered axis tick: 0.25.]

As stated in the previous section, we have adopted a different approach in the pre-classification process. Unlike [9], we have included three fields: STR, VSR, and COD. The STR field ranges from 0 to 180 months in the SEER database. The pre-
end if
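
The pseudocode of the pre-classification rule is not legible in this text, so what follows is only a sketch of how a rule based on STR, VSR, and COD could look, not the authors' exact procedure. It assumes a 60-month threshold (the survival window mentioned earlier in the paper). It further assumes that a patient alive at or beyond the threshold is labeled "survived", that a patient who died of breast cancer before the threshold is labeled "not survived", and that every other record is left out. The `Record` type, its field names, and the boolean encoding of VSR and COD are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: 60-month window, taken from the survival window mentioned
# earlier in the paper; the exact rule is not shown in this text.
SURVIVAL_THRESHOLD_MONTHS = 60


@dataclass
class Record:
    """Hypothetical, simplified view of one patient record."""
    str_months: int          # Survival Time Recode (STR), in months
    vsr_alive: bool          # Vital Status Recode (VSR): True if alive
    cod_breast_cancer: bool  # Cause of Death (COD): True if breast cancer


def pre_classify(rec: Record) -> Optional[str]:
    """Assumed rule: return a class label, or None to exclude the record."""
    if rec.str_months >= SURVIVAL_THRESHOLD_MONTHS and rec.vsr_alive:
        return "survived"
    if rec.str_months < SURVIVAL_THRESHOLD_MONTHS and rec.cod_breast_cancer:
        return "not survived"
    # Anything else (e.g. death from another cause, or alive with a short
    # follow-up) is ambiguous under these assumptions and is left out.
    return None


# Made-up examples for illustration only; these are not SEER data.
examples = [
    Record(str_months=72, vsr_alive=True, cod_breast_cancer=False),
    Record(str_months=24, vsr_alive=False, cod_breast_cancer=True),
    Record(str_months=24, vsr_alive=False, cod_breast_cancer=False),
]
labels = [pre_classify(r) for r in examples]
print(labels)
```

Under these assumptions, a patient who died of a different cause within the window is excluded rather than counted as not survived, which is the distinction the text draws against the pre-classification used in [9].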
matrix and can be easily converted to true-positive (TP) and false-positive (FP) metrics [10].

The experimental results of our approach are presented in Table 4.

Classification Technique    Accuracy (%)    Class    Precision    Recall
Naïve Bayes                 84.5            0        0.70         0.57
                                            1        0.88         0.93
Artificial Neural Net       86.5            0        0.83         0.52
                                            1        0.87         0.97
C4.5                        86.7            0        0.80         0.56
                                            1        0.88         0.96

Table 4: Combined Results (our study)

Classification Technique    Accuracy (%)    Class    Precision    Recall
C4.5                        81.3            0        0.86         0.81
                                            1        0.76         0.81

Table 5: Results for C4.5 (dataset as in Table 3)

As can be seen in Table 4, the neural net and the decision tree have comparable performance.

Table 5 shows the experimental results using the pre-classification approach used in [9] and the same dataset used in our approach. The results clearly show that the classification rate (81%) is much lower than the classification rate of our approach (~87%).

It may be worth noting that the computation times of the Naïve Bayes, neural net, and C4.5 algorithms (on an AMD Athlon 64 4000+ machine) were in the range of 1 minute, 12 hours, and 1 hour, respectively.

The results obtained in this work differ from the study of Delen et al. [9] because we used a newer database (2000 vs. 2002), a different pre-classification (109,659 and 93,273 vs. 35,148 and 116,738), and different toolkits (industrial grade tools vs. Weka); in each pair, the value for [9] is listed first.

5. Conclusions and Future Work

This paper has outlined, discussed, and resolved the issues, algorithms, and techniques for the problem of breast cancer survivability prediction in the SEER database. Unlike the pre-classification process used in [9], our approach takes into consideration, besides the Survival Time Recode (STR), the Vital Status Recode (VSR) and the Cause of Death (COD). The experimental results show that our approach outperforms the approach used in [9].

This study clearly shows that the preliminary results are promising for the application of data mining methods to the survivability prediction problem in medical databases.

Our analysis does not include records with missing data; future work will include the missing data in the EOD field from the old EOD fields prior to 1988. This might improve the performance, as the size of the data set will increase considerably.

Finally, we would like to try survival time prediction for certain cancer data, such as respiratory cancer, where the survivability is very low. We plan to discretize the survival time into one-year intervals and then classify using the aforementioned data mining algorithms.

References

[1] American Cancer Society. Breast Cancer Facts & Figures 2005-2006. Atlanta: American Cancer Society, Inc. (http://www.cancer.org/).
[2] Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Public-Use Data (1973-2002), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2005, based on the November 2004 submission.
[3] Cox DR. Analysis of survival data. London: Chapman & Hall; 1984.
[4] Benjamin F. Hankey, et al. The Surveillance, Epidemiology, and End Results Program: A National Resource. Cancer Epidemiology Biomarkers & Prevention 1999; 8:1117-1121.
[5] Houston, Andrea L. and Chen, et al. Medical Data Mining on the Internet: Research on a Cancer Information System. Artificial Intelligence Review 1999; 13:437-466.
[6] Cios KJ, Moore GW. Uniqueness of medical data mining. Artificial Intelligence in Medicine 2002; 26:1-24.
[7] Zhou ZH, Jiang Y. Medical diagnosis with C4.5 Rule preceded by artificial neural network ensemble. IEEE Trans Inf Technol Biomed. 2003 Mar; 7(1):37-42.
[8] Lundin M, Lundin J, Burke HB, Toikkanen S, Pylkkanen L, Joensuu H. Artificial neural networks applied to survival prediction in breast cancer. Oncology 1999; 57:281-6.
[9] Delen D, Walker G, Kadam A. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine. 2005 Jun; 34(2):113-27.
[10] Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques, 2nd Edition. San Francisco: Morgan Kaufmann; 2005.
[11] J. R. Quinlan. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann; 1993.
[12] Weka: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/