
Three Dimensional Model For Diagnostic Prediction: A Data Mining Approach

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 4, July – August 2013 ISSN 2278-6856, Impact Factor 2.524 ISRA:JIF
Copyright
© Attribution Non-Commercial (BY-NC)

Pooja Mittal¹ and Nasib Singh Gill²
Department of Computer Science & Applications, Maharshi Dayanand University, Rohtak-124001, Haryana, India

Abstract: Data mining is widely deployed to make high-level decisions using hefty datasets. Prediction, or estimation analysis, is a very popular application of data mining. In medical science this kind of analysis is immensely important compared with other approaches, because highly accurate and efficient prediction is required. The proposed work is in this direction: it combines three data mining aspects to retrieve enhanced results. One dimension of the analysis is the time series, where we divide the complete dataset into a cube-based time series. In the second dimension, we cluster the dataset by the frequency of disease occurrence with respect to the symptoms. In the final stage, a Markov model approach is used along with association to perform the prediction on the clustered dataset.

Keywords: Mining, Clustered, Medical, Time Series, Markov, Prediction

1. INTRODUCTION
One of the major approaches provided by data mining is the prediction algorithm. A prediction system observes statistical and historical data and identifies the chances of some event occurring in the near future. Health care is one of the major research areas for data mining prediction systems, and prediction is involved in health care applications in many ways. The reliability of such a system depends on two major factors: a reliable dataset, and the suggestions and involvement of experts. The dataset required for such a system includes patient data, symptoms and the related treatment data. These datasets are mainly maintained by the organizations themselves. The trends and patterns identified from these datasets help in making reliable decisions. In this work, we present a prediction model to identify the patient's disease. The present work is a parametric approach to this kind of decision making. However, a single indicator or factor cannot decide the disease or the diagnostic. For enhanced results, we need a large dataset with many attributes, which can represent a patient completely, along with his or her behavior and physical characteristics. Another factor to decide is the disease for which the whole work will be carried out. It is not feasible to study all diseases, related symptoms and diagnostics.

There are a number of health care applications where the mining process has provided an automated solution. Such applications, however, always suffer from a lack of standardization. For example, in the case of hypertension, different experts may conceive it differently. The effect of a medicine or treatment may also vary from patient to patient. A number of such issues must be considered while performing mining in the medical area. To resolve them, such applications must be developed under expert observation. In this work, we define the model of an application that is used to predict the patient's disease. Section 3 of this paper gives the model and its detailed description. Before that, some key terms used in this work are defined below.

1.1 Classification of Dataset
A medical dataset always consists of a large number of attributes and their related data, and it is neither feasible nor efficient to work on the entire dataset every time. For this reason, some classification approach is required to reduce the size of the raw dataset. In the proposed work, different classes will be generated based on the patient's age or gender. The initial step is to identify the classes; for the age attribute, for example, the possible classes can be kid, young, middle-age, old, very old, etc. The next step is to assign age groups to these classes. Finally, based on this class generation, we will select the particular class on which the complete work will be carried out. The classes are principally required to reduce the size of the actual dataset on which the work will be performed.

1.2 Associativity between Data-Items
Another important concept of data mining is to identify the relationships among different data fields. To enhance performance, association is identified in this work at two levels: first, between the patient's symptoms and the related disease, and second, between the disease and the associated diagnostics. To perform this kind of analysis, the support vector will be identified. The data fields that are highly associated will be retained in the dataset, for
further work, and the remaining fields, which are less associated, will be removed from the dataset.

1.3 History Based Analysis
The next vital step of this analysis is pattern identification based on historical data. In the medical area, the treatment given to other patients in the past is also considered while making a decision. In this work, the history analysis is based on pattern identification relative to symptoms, diseases and diagnostics. The analysis also includes identifying the frequency of a pattern in the dataset. In the present work, the Markov model is suggested for this kind of pattern analysis.

2. LITERATURE SURVEY
For efficient and time-effective analysis, feature subset selection is very important. Asha Gowda and M.A. Jayaram [1] developed a two-stage model for Indian diabetic databases. As their results show, the model filters the input data considerably, reducing its size for further analysis with back-propagation neural networks. Their work demonstrated that feature subset selection enhances classification accuracy. Another major contribution, a methodology for cluster validation, was made by Linda J. Moniz, Brian H. Feighner and Joseph S. Lombardo [2]. They described a narrative approach for generating Electronic Medical Records, focusing on the data mining steps, and verified the approach on a large, varied dataset.

In 2011, Hnin [3] worked on disease prediction based on dataset fragmentation. Working on a heart disease dataset, the author applied k-means clustering to generate the fragments and then analyzed sequential patterns to predict the disease. A decision tree approach was used to present the result. In 2010, Wei Wang [4] applied association mining to a medical dataset to generate rules that decide the disease support from symptoms. That work considered the case of heart disease and generated a class hierarchy to extract association-based rules. In 2011, Chen-Guang Zhao [5] worked on a rough medical dataset and filtered it to improve fault tolerance. The proposed system defined decision support rules to filter the dataset into a reduced dataset, on which a decision rule is applied to predict the disease. An ACO (Ant Colony Optimization) influenced fuzzy approach was used by Mostafa Fathi Ganji [6] in 2010 to classify a medical dataset: a fuzzy rule set was generated on the available dataset, and an ACO-based classification approach was implemented on that rule set.

Li, Fu and Williams discussed the problem of finding risk patterns in medical datasets. They presented an efficient algorithm, based on the anti-monotone property, to mine optimal risk pattern sets. They also applied an epidemiological metric and relative risk factors for measuring interestingness. By applying these methods to real medical and pharmaceutical datasets, they revealed potentially useful patterns [7]. In a medical case study, a domain ontology (DO) driven approach is reported for a medical database of patients undergoing treatment for chronic kidney disease. The authors described an approach in which the DO can be used to categorize the attributes in preparation for mining association rules [8], and concluded that it is more fruitful than naive mining.

In 2011, a remarkable contribution toward reducing the need for extra-operative electrocorticography in mTLE patients was presented, using classification techniques. Electrocorticography is required when the focal points of seizure in the brain are not clearly identified by preliminary examinations. The authors also compared the performance of six popular classifiers and concluded that AUC is not a sufficient measure in critical domains like medicine [9]. A 2010 study emphasized that accurate medical coding is very important, as hospitals use it for a number of reasons. The authors viewed the coding problem as a multi-label, large classification problem and pioneered a multi-label margin classifier, which not only grasps the underlying structure but also uses this information to enhance performance [10].

Gracia Jacob and Geetha Ramani designed a data mining framework capable of generating a classifier on any clinical dataset. This was used to formulate the decision making to classify the test data. They considered large-domain medical datasets including mammography masses, heart, orthopaedic ailments, thyroid and many others. They concluded that different classifiers perform best depending on the type of dataset; for example, binary logistic regression gives 100% classifier accuracy on the heart dataset [11]. Ilango and Ramaraj aimed to improve diagnostic accuracy on diabetes data, using the Pima Indians Diabetes Dataset. They proposed a hybrid prediction model that incorporates F-score feature selection to find the optimal feature subset from the raw input data by minimizing clustering errors. They used several measures, such as AUC, sensitivity and specificity, to demonstrate the efficiency of their proposed work, and also emphasized dimensionality reduction [12].

A temporal abstraction approach was presented to mine knowledge from a hepatitis database, supervised by background knowledge and analysis. The authors proposed new notions and methods for finding short-term and long-term changed tests [13]. Harpaz, Haerian, Chase and Friedman in 2010 noted that adverse drug events (ADEs) are significant for healthcare. They applied regression techniques to demonstrate the feasibility of a method designed for automated large-scale mining. Electronic Health Record (EHR) data is combined with regression-based methods to identify potential ADEs. The technique was designed to address the challenge of confounding [14]. Wang and Desai developed an effective mining function for Chinese
medicine study by considering a number of sensitive issues. As a result, a fuzzy-based Hyper Knowledge Discovery System was developed, useful for information retrieval and association rule mining [15].

3. RESEARCH METHODOLOGY
In this work, we aggregate the strengths of three different data mining aspects to predict the patient's disease as well as the diagnostic. One aspect is statistical analysis, for which we need a large amount of data over a large attribute set. Statistical analysis is one of the traditional approaches used in mining to draw conclusions from the given dataset. This work will include improved association mining along with the fuzzy concept. The association mining will perform two important tasks: first, correlating the related attributes, and second, pruning the unwanted data and attributes from the dataset.

Figure 1: Three Dimensions of Proposed Work

Another aspect is the use of a cubic model in terms of time series. Medical applications have large datasets. Here we use the concept of a deductive database, in which the dataset is itself reduced according to the user query. Most disease symptoms give different criteria for male and female datasets, so the first level of classification can be based on gender. This classification can be followed by patient age, as different age groups require different treatment. For this categorization, we need to perform data clustering. The number of clusters will depend on the number of age groups as well as gender.

The final aspect is based on the prediction approach. In this work, the Markov model is suggested to take the decision about the disease and the diagnostic.

Algorithm(P, D[], S[][], Di[])
/* Here P is the information taken as input from the patient; it includes the basic details as well as the symptoms of the disease. D is the database of possible Diseases, S is the dataset of Symptoms and Di is the dataset of related Diagnostics */
{
i) Accept the patient Age, Gender and Symptoms as the raw input.
ii) Divide the Symptoms Database according to the person's gender and age. Now we have a deducted database for the current query, where Size(Deductive S) <<< Size(S). As a result, the new database is much smaller than the existing database.
iii) Find the association between the patient symptoms and the Symptom database.
iv) Uncover a list of the most associated symptoms and the respective diseases and diagnostics.
v) Create a collected database of all three databases D[], S[], Di[], such that D1, S1, S2, S3, ..., Sn, Di1, Di2, Di3, ..., Dim.
vi) Perform the Markov model on the dataset with groups of 2 and 3 items collectively and predict the most frequent combinations of Disease, Symptom and Diagnostic.
vii) Find the frequent and least frequent occurrences of pairs in the table as groups of disease, symptom and diagnostic.
viii) Remove all the entries having a frequency less than the average. This is a fuzzy criterion that can be changed according to the requirement.
ix) Find the occurrence ratio of each disease with respect to the symptoms; the disease with the maximum ratio is elected. In the same way, find the occurrence of each Diagnostic with respect to the disease and elect the peak ratio.
x) Present the disease and the diagnostic as the result.
}
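Steps vi) to ix) of the algorithm can be illustrated with a minimal Python sketch. The paper does not specify the states or transitions of its Markov model, so plain co-occurrence counting is used here as a stand-in; the function name `predict`, the tuple layout of the records and the sample data are all assumptions made for illustration only.

```python
from collections import Counter


def predict(records, patient_symptoms):
    """Elect a disease and a diagnostic by frequency.

    records: list of (disease, symptom, diagnostic) tuples taken from the
    already reduced dataset (hypothetical layout, not from the paper).
    """
    # Steps vi/vii: count how often each (symptom, disease) pair occurs
    # for the symptoms reported by the patient.
    pair_counts = Counter(
        (symptom, disease)
        for disease, symptom, _diagnostic in records
        if symptom in patient_symptoms
    )
    if not pair_counts:
        return None, None

    # Step viii: remove entries whose frequency is below the average.
    average = sum(pair_counts.values()) / len(pair_counts)
    kept = {pair: n for pair, n in pair_counts.items() if n >= average}

    # Step ix: the disease with the largest share of the kept counts wins
    # (the largest count is also the largest occurrence ratio).
    disease_totals = Counter()
    for (_symptom, disease), n in kept.items():
        disease_totals[disease] += n
    disease = disease_totals.most_common(1)[0][0]

    # Step ix, second half: most frequent diagnostic recorded for that disease.
    diagnostic_counts = Counter(
        diagnostic for d, _symptom, diagnostic in records if d == disease
    )
    diagnostic = diagnostic_counts.most_common(1)[0][0]
    return disease, diagnostic


# Made-up sample data, for illustration only.
RECORDS = [
    ("flu", "fever", "blood test"),
    ("flu", "fever", "blood test"),
    ("flu", "cough", "blood test"),
    ("flu", "cough", "throat swab"),
    ("malaria", "fever", "blood smear"),
    ("cold", "cough", "throat swab"),
]

result = predict(RECORDS, {"fever", "cough"})
```

The below-average cut-off is kept as a simple comparison so that it can be replaced by another threshold, in line with the remark in step viii) that the criterion is adjustable.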

Figure 2: Proposed Architecture

In Figure 2, the gray boxes represent processes, and the white boxes represent the input and output of each process. With each step, the size of the database is reduced, as some deduction on the database is performed to increase process efficiency. As the processes are performed in sequence, the final results are obtained.

To implement the proposed model, we need a large dataset of patients with related information. The information must include the patient's personal information, behavioral information and disease-specific information. Some basic parameters include age, sex, eating habits, drinking habits, work environment, etc. The more specific the dataset, the better the results that can be derived from it. Such data will be collected from a hospital, an organization, or a researcher who has already performed related work on it.

Once the data is collected, the next task is to perform classification over the dataset. The classification process will divide the dataset into n sub-datasets under some criteria. In this model, we have suggested age-driven and gender-driven classification. Once the dataset is classified, a new patient will be processed on a classified sub-dataset instead of the complete dataset.

The classification process will be followed by statistical analysis over the dataset. This analysis will be based on symptom analysis under different criteria. It will be performed to remove dataset impurities and to further reduce the dataset. The impurities can take the form of invalid or incomplete information. Data-level analysis will be performed to remove such inappropriate data elements. After eliminating these impurities, we obtain an error-free, valid dataset.

Just after the removal of these impurities, an associativity check is performed with respect to the input query. The association match is mainly carried out to identify the most related attributes in the dataset and to remove the unnecessary attributes. To perform this kind of filtration, fuzzy-based association mining is suggested in this model.
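The classification and impurity-removal stages described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the age boundaries in `AGE_CLASSES`, the record fields (`age`, `gender`, `symptoms`) and the validity rules in `is_valid` are hypothetical, since the paper names the age classes but not their ranges or the exact cleaning criteria.

```python
from collections import defaultdict

# Hypothetical upper bounds (exclusive) for the age classes named in the
# paper; the actual ranges are not given there.
AGE_CLASSES = [(13, "kid"), (30, "young"), (50, "middle-age"), (70, "old")]


def age_class(age):
    """Map an age in years to one of the age classes."""
    for upper, label in AGE_CLASSES:
        if age < upper:
            return label
    return "very old"


def is_valid(record):
    """Treat a record as an impurity if it is incomplete or implausible."""
    age = record.get("age")
    return (
        isinstance(age, int)
        and 0 <= age <= 120
        and record.get("gender") in {"male", "female"}
        and bool(record.get("symptoms"))
    )


def classify(records):
    """Drop impure records and split the rest by (gender, age class)."""
    partitions = defaultdict(list)
    for record in records:
        if is_valid(record):
            key = (record["gender"], age_class(record["age"]))
            partitions[key].append(record)
    return dict(partitions)


# Made-up sample data, for illustration only.
patients = [
    {"age": 8, "gender": "female", "symptoms": ["fever"]},
    {"age": 45, "gender": "male", "symptoms": ["cough"]},
    {"age": 47, "gender": "male", "symptoms": ["fever", "cough"]},
    {"age": None, "gender": "male", "symptoms": ["fever"]},   # incomplete
    {"age": 30, "gender": "female", "symptoms": []},          # incomplete
]

parts = classify(patients)
```

A new patient would then be matched only against the partition for his or her own gender and age class, which is the size reduction the model relies on.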
The fuzzy rule set will be defined on expert advice to decide the valid criteria, so that the relevant information is kept in the dataset. The fuzzy rules will be applied to symptom-dataset attributes such as eating habits, weight, working environment, etc. The rule set will first be generated on individual attributes and then used in composition on multiple attributes under different fuzzy operators. The least associated attributes or data elements will be eliminated from the dataset. This stage of the model will return the most relevant dataset for further processing.

To predict the patient's disease, a Markov model based prediction algorithm is suggested. The Markov model will work on patient history. In this work, a multilevel Markov model is suggested that carries out the decision making with respect to the frequency of a disease pattern. The Markov model will initially analyze the frequency of each symptom individually, and with each level a new symptom will be included while performing the pattern match. The more levels there are, the better the accuracy that can be drawn from the system. The system will return the pattern match along with a frequency count. From this frequency analysis, the most frequent pattern will be identified. The symptom with the maximum occurrence ratio is then identified in association with the disease that occurs most often with that symptom. Finally, from this association analysis, the most associated symptoms and the disease will be identified. Due to the sensitivity of the topic, the results obtained still require a second opinion from an expert to verify the outcome of this model. The present model is expected to yield effective results in terms of efficiency as well as reliability.
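The level-by-level pattern match described above can be approximated by the following Python sketch. It is an assumption-laden illustration, not the authors' implementation: the paper does not define the model's states or transition probabilities, so each level simply counts how often every combination of that many patient symptoms appears in the history of a disease. The names `multilevel_match`, `most_frequent` and the sample history are hypothetical.

```python
from collections import Counter
from itertools import combinations


def multilevel_match(history, symptoms, max_level=3):
    """Count disease frequencies per level.

    history: list of (set_of_symptoms, disease) pairs from past patients.
    Level k matches every combination of k patient symptoms against the
    history, so each level includes one more symptom in the pattern.
    """
    ordered = sorted(symptoms)
    results = {}
    for level in range(1, min(max_level, len(ordered)) + 1):
        counts = Counter()
        for pattern in combinations(ordered, level):
            wanted = set(pattern)
            for past_symptoms, disease in history:
                if wanted <= past_symptoms:  # pattern fully present
                    counts[disease] += 1
        results[level] = counts
    return results


def most_frequent(results):
    """Return (level, disease, count) from the deepest level with a match."""
    for level in sorted(results, reverse=True):
        if results[level]:
            disease, count = results[level].most_common(1)[0]
            return level, disease, count
    return None


# Made-up patient history, for illustration only.
HISTORY = [
    ({"fever", "cough", "headache"}, "flu"),
    ({"fever", "cough"}, "flu"),
    ({"fever", "chills"}, "malaria"),
    ({"cough"}, "cold"),
]

levels = multilevel_match(HISTORY, {"fever", "cough"})
best = most_frequent(levels)
```

Preferring the deepest level that still has a match reflects the statement that more levels give better accuracy; whether ties or sparse deep levels should fall back to a shallower level is left open here, as it is in the text.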

4. CONCLUSION
The proposed system is an architecture-based system in which three data mining approaches are combined in sequence to predict the patient's disease as well as the diagnostic. We start with a rough (larger) dataset, and the data is reduced at each level so that the work is performed efficiently. In the initial stage, data classification and clustering techniques are used to predict the disease; later, association mining along with the Markov model is suggested to predict the diagnostic. The proposed model is a generic model that can work on any raw medical dataset.

REFERENCES
[1] Asha Gowda Karegowda, M.A. Jayaram, "Cascading GA & CFS for Feature Subset Selection in Medical Data Mining", 2009 IEEE International Advance Computing Conference (IACC 2009).
[2] Linda J. Moniz, Brian H. Feighner, and Joseph S. Lombardo, "Mining Electronic Medical Records for Patient Care Patterns", 978-1-4244-27659/09/$25.00, 2009 IEEE.
[3] Hnin Wint Khaing, "Data Mining based Fragmentation and Prediction of Medical Data", 9781-61284-840-2/11/$26.00, 2011 IEEE.
[4] Wei Wang, Yaohua Wu, "Mining Association Rules in Medical Data Based on Concept Lattice", Proceedings of the 8th World Congress on Intelligent Control and Automation, July 6-9, 2010, Jinan, China.
[5] Chen-Guang Zhao, WenGe, "Research of Electronic Medical Record Data Mining Method Based on Variable Precision Rough Set", 2011 International Conference on Electronics and Optoelectronics (ICEOE 2011).
[6] Mostafa Fathi Ganji and Mohammad Saniee Abadeh, "Parallel Fuzzy Rule Learning Using an ACO-Based Algorithm for Medical Data Mining", 978-1-4244-6439-5/10/$26.00, 2010 IEEE.
[7] Jiuyong Li, Ada Wai-Chee Fu, Hongxing He, Jie Chen, "Mining Risk Patterns in Medical Data", KDD'05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 159593135X/05/0008 ...$5.00.
[8] Yen-Ting Kuo, Andrew Lonie, Liz Sonenberg, "Domain Ontology Driven Data Mining", 2007 ACM SIGKDD Workshop on Domain Driven Data Mining (DDDM2007), August 12, 2007, San Jose, California, USA.
[9] Shobeir Fakhraei, Hamid Soltanian-Zadeh, Farshad Fotouhi, "Confidence in Medical Decision Making: Application in Temporal Lobe Epilepsy Data Mining", KDD-DMH'11, August 21, 2011, San Diego, California, USA. Copyright 2011 ACM 978-14503-0843-4/11/08 ...$10.00.
[10] Yan Yan, Glenn Fung, Jennifer G. Dy, "Medical Coding Classification by Leveraging Inter-Code Relationships", KDD'10, July 25-28, 2010, Washington, DC, USA. Copyright 2010 ACM 978-14503-0055-1/10/07 ...$10.00.
[11] Shomona Gracia Jacob and R. Geetha Ramani, "Mining of Classification Patterns in Clinical Data through Data Mining Algorithms", ICACCI '12, August 03-05, 2012, Chennai, India. Copyright 2012 ACM 978-1-4503-1196.
[12] B. Sarojini Ilango and N. Ramaraj, "A Hybrid Prediction Model with F-score Feature Selection for Type II Diabetes Databases", A2CWiC 2010, September 16-17, 2010, India. Copyright 2010 978-1-4503-0194-7/10/0009 $10.00.
[13] Tu Bao Ho, Trong Dung Nguyen, Hideto Yokoi, "Mining Hepatitis Data with Temporal Abstraction", SIGKDD'03, August 24-27, 2003, Washington, DC, USA. Copyright 2003 ACM 1-58113-7370/03/0008 $5.00.
[14] Rave Harpaz, Krystl Haerian, H.S. Chase, "Mining Electronic Health Records for Adverse Drug Effects Using Regression Based Methods", IHI'10, November 11-12, 2010, Arlington, Virginia, USA. Copyright 2010 ACM 978-1-4503-00308/10/11 ...$10.00.
[15] Tongyuan Wang, Bipin C. Desai, Huzhan Zheng, Yanjiang Qiao, "Knowledge Discovery in Chinese Medicine", C3S2E-08, May 12-13, 2008, Montreal, QC, Canada. Copyright (c) 2008 ACM 978-160558-101-9/08/05 $5.00.
