Cba 8 Clinical Decision Support System: Capstone Project
Project
CBA 8
Ravinderpal Wasu
Shailesh Vishwakarma
Sidharth Gupta
THE INDIAN SCHOOL OF BUSINESS (ISB) evolved from the need for a world-
class business school in Asia. The founders, some of the best minds from the corporate and
academic worlds, anticipated the leadership needs of the emerging Asian economies.
They recognized that the rapidly changing business landscape would require young leaders
who not only have an understanding of the developing economies but who also present a global
perspective. The ISB is committed to creating such leaders through its innovative programs,
outstanding faculty and thought leadership. Funded entirely by private corporations, foundations
and individuals from around the world who believe in its vision, the ISB is a not-for-profit
organization.
CBA is a rigorous and challenging program. The schedule will include full
days of teaching and evenings will be used for guest lectures, projects, and
group work. Participants will be required to stay on campus during those
classroom days.
SPONSORS
A healthcare ecosystem:
• Where access to good health is as easy as shopping online.
• That brings the healthcare provider right to the doorstep.
• Where you are never out of medicines and basic medical supplies.
• Which provides care for loved ones so personal that you are never away from home.
• Where reports/findings are not mere facts but guides to overall good health.
• Where you aren’t just a patient but an inclusive partner in the success of the ecosystem.
TEAM
Student Name PGID About
Fig 1
Healthcare facilities implement electronic health records (EHRs) for many reasons: to recommend
the best-suited diagnosis and medicine to practitioners, to avoid penalties, to participate in
value-based reimbursement, to provide better care, and to fulfil a requirement for quality
recognition.
The healthcare sector incurs substantial expenditure on customer service for enquiries, appointment
booking, consultation, etc. As the need to provide better customer service grows, so does the
cost.
To deliver high-quality, cost-efficient care, technology can step in and save healthcare
organizations time and money through medical chat-bots powered by data and analytics.
The healthcare sector has been investing considerable time and money in technological development,
building AI and preventive solutions. Chat-bots could save organizations $8 billion annually
worldwide by 2022, up from $20 million this year, according to a Juniper Research forecast [2].
Fig 2
Fig 3
Clinical Decision Support is a sophisticated health IT component capable of putting various
data pieces together and analyzing them to generate valuable insights. It requires computable
biomedical knowledge, person-specific data, and a reasoning or inferencing mechanism that
combines knowledge and data to generate and present helpful information to clinicians as care is
being delivered. This information must be filtered, organized and presented in a way that supports
the current workflow, allowing the user to make an informed decision quickly and take action.
The project is to build a Clinical Decision Support System for CallHealth with two interfaces,
one for patients and one for the internal team. The Clinical Decision Support System provides
real-time responses based on symptoms and other parameters provided by the patient in an
interactive manner.
The project will provide an interactive portal for customers to share their symptoms and other
medical details, along with contact details. These details can also be used by the CallHealth
team to devise a customized healthcare recommendation for the customer.
The project will be used by the CallHealth team to grow the customer base, which will increase
revenue for the organization, and to enhance the efficiency and quality of service, thus
reducing the cost to the organization.
The Interactive Customer Portal (ICP) will try to understand the intent of the user. With the use
of NLP techniques and algorithms, the ICP will do the following:
Once the conversation is accepted, the ICP will present the appropriate response based on the
symptoms or medical conditions mentioned by the user. These details are analyzed in real time
before responding to the user. Once all the relevant details are captured, the information will
be passed on to the CallHealth team, who will get in touch with the user with relevant and
customized medical care.
2. Project Descriptions
The objective of the project is to develop a solution that helps reduce errors in consultation
with the help of data and analytics. The solution will offer both preventive and corrective
recommendations. The CDS system is divided into 2 parts:
Interactive Customer Portal (ICP)
The ICP deals with patients, who can log in and share their age, gender and list of symptoms
based on their medical condition in an interactive manner on the portal. The CDS will suggest
relevant questions and symptoms during the conversation/consultation based on the symptoms,
demographics and medical history.
Interactive Customer Portal
An Interactive Customer Portal that helps provide for all the healthcare needs of customers. The ICP will be an
automated chatbot that asks dynamic questions and gives dynamic responses to customers based on their symptoms.
The idea is to provide a portal to customers to share their symptoms and other medical details in an interactive
manner along with contact details. These details can also be used by CallHealth team to devise a customized
healthcare recommendation for the customer.
Fig 4
The idea is to provide a portal for doctors, physicians and clinicians that will support them in giving the correct
treatment to patients.
Fig 5
The solution framework, as depicted in the image below, is connected to the data tables of
diseases, symptoms, demographic details and the medical history or records of the patient.
The live interaction will feed the symptoms and other relevant details into the ICP. The
information will be analyzed by a set of algorithms, and the output will be displayed
based on the access rights of the user.
Fig 6
The aforementioned services can be explained with the example below:
Key functionalities – For Patient, e.g. Fever and Cough
Understanding the questions and symptoms
✓ Ask the patient for personal details – demographics, if provided
✓ Suppose the patient enters the first symptoms as “I have Fever and Cough”
✓ From the 5 tokens, decompose the symptoms as a) Fever b) Cough using NLP
✓ Identify the intent using SVM, SoftMax classifier, etc.
Generate hypothesis for each option using list of responses
✓ Get all the corresponding responses, possible matching combinations, keywords and intent of the query for
the tokens from the Answers source. Here we could get 1000s of responses
✓ Create the hypothesis for each response. At this point quantity is more important than accuracy
✓ Here we will get all the next questions to be asked of the patient for Fever and Cough
✓ We filter out responses based on the combination of keywords and previous inputs. E.g. if cough is a response
for fever but already exists in the combination of keywords in the previous input, we filter it out
Analyze and rank the hypothesis using evidences
✓ Based on the various hypotheses generated, we evaluate and test each hypothesis against the evidence
available in the Evidence database for patients
✓ We rank the hypotheses based on the p-value and weightages of the evidence, given the previous inputs
– the Fever and Cough combination and the demographics provided by the patient
✓ Here we filter out the hypotheses with low weightage. The highest-ranked answers appear first
Provide Responses
✓ Train the model with the available inputs. Here the model will keep asking the next question until all the
possible symptoms are provided by the patient, based on the top 5 or 10 correlated symptoms/diseases with a
threshold confidence level
✓ Provide the response – next question OR probable diseases – to the patient
✓ For our example, we will get the top 5 or 10 responses in order of their ranks as the next questions or the
probable diseases to be shown to the patient
✓ Ask for feedback and use this data to further improve the model
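The first two stages of the example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's code: the vocabulary `KNOWN_SYMPTOMS`, the helper names and the candidate list are hypothetical, and simple vocabulary matching stands in for the NLP step.

```python
import re

# Hypothetical vocabulary of known symptom names (assumption, not project data).
KNOWN_SYMPTOMS = {"fever", "cough", "headache", "sore throat"}


def tokenize(text):
    """Lower-case the input and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_symptoms(text, vocabulary=KNOWN_SYMPTOMS):
    """Return the known symptoms mentioned in the text, in order of appearance."""
    padded = " " + " ".join(tokenize(text)) + " "
    found = [s for s in vocabulary if f" {s} " in padded]
    return sorted(found, key=lambda s: padded.index(f" {s} "))


def filter_candidates(candidates, already_given):
    """Drop candidate next-questions that the patient has already provided."""
    given = {s.lower() for s in already_given}
    return [c for c in candidates if c.lower() not in given]


tokens = tokenize("I have Fever and Cough")          # the 5 tokens of the example
symptoms = extract_symptoms("I have Fever and Cough")
print(tokens)
print(symptoms)
print(filter_candidates(["Cough", "Headache", "Sore throat"], symptoms))
```

Matching against a fixed vocabulary keeps the sketch dependency-free; a stemmer or a trained entity recognizer would be needed to handle spelling variants and multi-word phrasing reliably.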
Fig 7
Impacted Stakeholders:
The project is to develop not just support system for CallHealth’s IT team but to build a complete
solution that would help their marketing team, team of doctors and clinical experts, patient support
team and sales team.
a. Marketing Team: The marketing team will be able to collect analyze more data about the
prospective customers that will help to develop and market more customized products and
services
b. Team of Clinical experts: The clinical experts would have access not just to the
symptoms and reports but also the medical history of the patients such as allergies, etc.
Aided with the data and analytics for patients with similar symptoms and medical
requirements, the clinical experts would be able to help attend and consult the patient in
a more effective and efficient manner.
c. Patient support team: Support team will save time as they would already be aware of
the medical condition of the patient and can directly attend the requirements
d. Sales team: The sales team will be able to make more targeted sales of the customized
products and services to the customers with specific medical requirements
3. Data collection, visual exploration, transformation
Part1
Step 1 - Data Collection
Searched for data on diseases and symptoms from various sites.
SNOMED data collected from http://www.nature.com/articles/ncomms5212#supplementary-information → Supplementary Data 7:
SNOMED-CT symptom-disease relationships. File name: ncomms5212-s8.xls
The data file has six sheets:
Sr. No.  Sheet Name (No. of Rows)                   Column Name                  Column Description
1        disease terms (1623 rows)                  rid                          Row ID
                                                    disease_cui                  Unique Disease Id
                                                    snomed_code                  Unique SNOMED Disease Id
                                                    Terms                        Disease names
2        disease list (1623 rows)                   rid                          Row ID
                                                    disease_cui                  Unique Disease Id
                                                    Number of symptoms           Number of Symptoms associated with the Disease
3        symptom terms (817 rows)                   rid                          Row ID
                                                    symptom_cui                  Unique Symptom Id
                                                    snomed_code                  Unique SNOMED Symptom Id
                                                    Terms                        Symptom names
4        symptom list (817 rows)                    rid                          Row ID
                                                    symptom_cui                  Unique Symptom Id
                                                    Number of diseases           Number of Diseases associated with the Symptom
5        SNOMED semantic types (131 rows)           rid                          Row ID
                                                    Semantic type code           Category code
                                                    Semantic type                Category Name
                                                    Number of distinct concepts  Number of distinct concepts
6        disease-symptom relationships (2340 rows)  rid                          Row ID
                                                    disease_cui                  Unique Disease Id
                                                    symptom_cui                  Unique Symptom Id
If it is a medical query, the code will ask the patient for the symptoms.
Here the patient will select (or type in) the symptoms from the UI.
The patient can select from the given symptoms or enter a symptom after selecting Other.
Module 2 – Initialized the unique stemmed words to classify. Scored the words and tokenized the
input symptoms.
Module 3 – Query the dataset and fetch the next top 5 symptoms.
Here we create a dataset to read the data from the CSV file and create the table dfSymptomDisease.
For the first symptoms, we create a table of distinct diseases, dfDistinctDiseaseName.
Then, for the distinct diseases, we select all distinct symptoms and create the table dfDistinctSymptom.
From dfDistinctSymptom, we create another table, filtering out those symptoms that were
already presented to / selected by the patient.
Finally, we select the top 5 most frequent symptoms and display them to the patient to select the
next possible symptoms.
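The Module 3 query can be sketched as follows. The project ran it through the PySpark SQL context; this sketch swaps in plain Python so that it runs without a Spark session. The rows in `SYMPTOM_DISEASE` are invented stand-ins for the dfSymptomDisease table, and the function name `next_symptoms` is hypothetical.

```python
from collections import Counter

# Toy (disease, symptom) rows standing in for the dfSymptomDisease table.
# The rows are invented for illustration only.
SYMPTOM_DISEASE = [
    ("Flu", "Fever"), ("Flu", "Cough"), ("Flu", "Headache"),
    ("Cold", "Cough"), ("Cold", "Sneezing"), ("Cold", "Headache"),
    ("Malaria", "Fever"), ("Malaria", "Chills"), ("Malaria", "Headache"),
]


def next_symptoms(rows, selected, top_n=5):
    """Suggest the most frequent not-yet-selected symptoms among candidate diseases."""
    selected = set(selected)
    # Distinct diseases linked to any selected symptom (cf. dfDistinctDiseaseName).
    diseases = {d for d, s in rows if s in selected}
    # Symptoms of those diseases, minus the ones already selected (cf. dfDistinctSymptom).
    counts = Counter(s for d, s in rows if d in diseases and s not in selected)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [s for s, _ in ranked[:top_n]]


print(next_symptoms(SYMPTOM_DISEASE, ["Fever"]))
```

Calling the function again with the enlarged selection after each answer reproduces the question loop described above.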
Part 2
Step 1 - Data Collection - Data from CallHealth
Here we got data providing relationship between Symptoms and Diseases and their relationship
to Age and Gender.
• Symptoms along with gender and age: A list of symptoms along with gender and age
categories was provided to analyze the symptoms that will be provided by the users. The
data fields are listed below:
Field                              Description
SYMPTOM                            Symptom Name
SYMP_MALE                          Yes, if present in Males
SYMP_FEMALE                        Yes, if present in Females
SYMP_INFANTS_LESS_THAN_'N1'YR      Yes, if present in infants
SYMP_CHILDREN_N1_TO_N2YRS          Yes, if present in children of certain age group
SYMP_TEENAGE_N2_TO_N3YRS           Yes, if present in teenagers of certain age group
SYMP_ADULT_N3_TO_N4YRS             Yes, if present in adults of certain age group
SYMP_ELDERLY_GREATER_THAN_N5YRS    Yes, if present in elders of certain age group
• Diseases along with gender and age: A list of diseases along with gender and age
categories was provided to analyze the symptoms that will be provided by the users. Each
category has the probability of having the disease rated as below:
o No
o Rare
o Uncommon
o Common
o Most common
The data fields are listed below. In every field the probability takes one of the values No, Rare,
Uncommon, Common or Most Common.
Field                                        Description
DISEASE                                      Disease Name
Male_INFANTS_LESS_THAN_'N1'YR_DISEASE        Probability in Male Infants
Female_INFANTS_LESS_THAN_N1YR_DISEASE        Probability in Female Infants
Male_CHILDREN_N1_TO_N2YRS_DISEASE            Probability in Male Children
Female_CHILDREN_N1_TO_N2YRS_DISEASE          Probability in Female Children
Male_TEENAGE_N2_TO_N3YRS_DISEASE             Probability in Male Teenagers
Female_TEENAGE_N2_TO_N3YRS_DISEASE           Probability in Female Teenagers
Male_ADULTS_N3_TO_N4YRS_DISEASE              Probability in Male Adults
Female_ADULTS_N3_TO_N4YRS_DISEASE            Probability in Female Adults
Male_ELDERLY_GREATER_THAN_N5YRS_DISEASE      Probability in Male Elders
Female_ELDERLY_GREATER_THAN_N5YRS_DISEASE    Probability in Female Elders
• Diseases to Symptoms and their frequency/count
Step 4 – Python code to provide outputs with CallHealth data and factoring Age and
Gender (Interpretation of Output/Visualization)
Libraries used – Pandas, NLTK for NLP, PySpark for SQL Context
Module 1 – Intent Analysis – trained the model on medical and non-medical inputs.
If the patient enters anything other than a medical query, the code captures
the intent.
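A minimal sketch of such an intent classifier is given here, using a multinomial Naive Bayes written from scratch. The project used NLTK and its own training sentences, which are not shown in this report, so the `TRAINING` examples and all function names below are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny invented training set; the project's own training sentences are not shown.
TRAINING = [
    ("i have fever and cough", "medical"),
    ("my head hurts and i feel dizzy", "medical"),
    ("i have a pain in my chest", "medical"),
    ("what is the weather today", "non-medical"),
    ("tell me a joke", "non-medical"),
    ("who won the cricket match", "non-medical"),
]


def tokenize(text):
    """Lower-case the input and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("I have a cough", model))
print(classify("Tell me the weather", model))
```

With only six training sentences the classifier is a toy; the point is the flow of training on labelled medical and non-medical inputs and then routing each new message by its predicted intent.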
If it is a medical query, the code will ask the patient for the symptoms.
Here the patient will select (or type in) the symptoms from the UI.
The patient can select from the given symptoms or enter a symptom after selecting Other.
Module 2 – Initialized the unique stemmed words to classify. Scored the words and tokenized the
input symptoms.
Module 3 – Query the dataset and fetch the next top 5 symptoms.
Here we create a dataset to read the data from the CSV file and create the table dfSymptomDisease.
For the first symptoms, we create a table of distinct diseases, dfDistinctDiseaseName.
Then, for the distinct diseases, we select all distinct symptoms and create the table dfDistinctSymptom.
From dfDistinctSymptom, we create another table, filtering the symptoms by age and
gender. We further filter out those symptoms that were already presented to / selected by the patient.
Finally, we select the top 5 most frequent symptoms and display them to the patient to select the
next possible symptoms.
Module 4 – Correlation for Symptoms
Here we calculate the correlation between the selected symptoms and check whether it is
below 15%, in which case we stop asking for more symptoms.
For the distinct diseases matching the age and gender, we take the symptoms selected /
entered by the patient and transform them into a table as below:
We then enter the frequency of each symptom for that disease from the master data table.
As the correlation is higher than 15%, we will ask for more symptoms.
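One possible reading of this stopping rule is sketched below. The report does not state which correlation measure was used or how several symptoms were combined, so this sketch assumes Pearson correlation between the per-disease frequency columns and takes the strongest pairwise value; the function names and the frequencies in `table` are invented.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def should_ask_more(frequency_table, threshold=0.15):
    """Keep asking while any pair of selected symptoms correlates above the threshold."""
    columns = list(frequency_table.values())
    strongest = max(
        (abs(pearson(a, b)) for a, b in combinations(columns, 2)),
        default=0.0,  # fewer than two symptoms: nothing to correlate
    )
    return strongest > threshold


# Invented frequencies: one list per selected symptom, one entry per candidate disease.
table = {
    "Fever": [10, 8, 2, 1],
    "Cough": [9, 7, 3, 1],
}
print(should_ask_more(table))
```

The 15% threshold is the value quoted in the text; whether the strongest, the weakest or the average pairwise correlation should be compared against it is a design choice that the report leaves open.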
Module 5 – This module is for doctors to get the probable diseases, with the probability of
occurrence of each disease, for the provided symptoms, age and gender.
Here we were given 2 scenarios by CallHealth
Scenario 1:
Age: 60 yrs
Gender: Male
Symptoms: Chest Pain, Excessive Sweating, Chest discomfort, SOB (Shortness Of Breath)
Ran the Python code for this scenario.
Below is the result for Scenario 1, for all probabilities of occurrence
We can further filter it on the basis of the occurrence, where we can enter the occurrence
choice as below:
Here we will now display only Common and Most Common diseases
Below is the result for Scenario 2, for all probabilities of occurrence
We can further filter it on the basis of the occurrence, where we can enter the occurrence
choice as below:
Here we will now display only Uncommon, Common and Most Common diseases
Scenario 3:
Age: 18 yrs
Gender: Female
Symptoms: Confusion, Disorientation, Abdominal pain, Itching
Ran the Python code for the same symptoms for this scenario.
Below is the result for Scenario 3, for all probabilities of occurrence
We can further filter it on the basis of the occurrence, where we can enter the occurrence
choice as below:
Here we will now display only Most Common diseases
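The occurrence filter used in these scenarios can be sketched as below. The five levels come from the CallHealth data description; the candidate rows use placeholder disease names and invented values, and `filter_by_occurrence` is a hypothetical helper, not the project's function.

```python
# Ordered occurrence levels as listed in the CallHealth data description.
LEVELS = ["No", "Rare", "Uncommon", "Common", "Most Common"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Invented rows: (disease, occurrence for the patient's age/gender band,
# number of the patient's symptoms that match the disease).
CANDIDATES = [
    ("Disease A", "Most Common", 4),
    ("Disease B", "Common", 3),
    ("Disease C", "Uncommon", 2),
    ("Disease D", "Rare", 1),
    ("Disease E", "No", 1),
]


def filter_by_occurrence(candidates, minimum="Rare"):
    """Keep diseases at or above the chosen occurrence level, most likely first."""
    floor = RANK[minimum]
    kept = [c for c in candidates if RANK[c[1]] >= floor]
    # Higher occurrence first, then more matched symptoms, then name.
    return sorted(kept, key=lambda c: (-RANK[c[1]], -c[2], c[0]))


for name, level, matches in filter_by_occurrence(CANDIDATES, "Common"):
    print(name, level, matches)
```

Passing "Common" keeps only Common and Most Common diseases, "Uncommon" adds the Uncommon ones, and "Most Common" keeps only the top level, mirroring the three filter choices described above.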
4. Model Comparison
While working on the framework design for the solutions we studied many models and
classification technologies. We found 2 methodologies that could provide intelligent insights for
this solution.
i) Multi-Class Classifier (One-Vs-All) (Youtube link for the tutorial)
ii) IBM Watson for Healthcare (Youtube link for the tutorial)
In this example we can consider each Class (Class 1, Class 2,…) = Disease, and each feature
(F11, F12, F13, F21…..) as Symptoms
For training, in Case 1 we mark a +1 (+ve class symbol) for Class 1 and -1 for other Classes. In
Case 2, we mark a +1 for Class 2 and -1 for other Classes.
As seen here, M3 gives the highest +ve score and hence is the suitable class to assign the
dataset to.
In our business problem, when we get a number of candidate diseases for the given symptoms
(features), we can apply this one-vs-all classification technique to find the probability score
for each disease (class).
Here we also see that there are 2 classes with a +ve class score, M1 and M3. So, we can assign
the dataset to all possible classes.
Similarly, given symptoms (features) can map to more than one disease (class).
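The scoring step of one-vs-all classification can be illustrated with a small sketch. The three linear models and their weights are made up for the illustration (no model was trained on project data); the names M1 to M3 follow the example above.

```python
# Hypothetical per-class linear models: one weight per symptom feature plus a bias.
# In one-vs-all training each model is fitted with its own class labelled +1
# and every other class labelled -1; here the weights are simply made up.
MODELS = {
    "M1": ([0.8, -0.2, 0.1], -0.3),
    "M2": ([-0.5, 0.4, -0.6], -0.1),
    "M3": ([0.6, 0.7, 0.2], -0.2),
}


def class_scores(features, models=MODELS):
    """Score the feature vector with every per-class model."""
    return {
        name: sum(w * x for w, x in zip(weights, features)) + bias
        for name, (weights, bias) in models.items()
    }


def best_class(features):
    """Single-label decision: the class whose model gives the highest score."""
    scores = class_scores(features)
    return max(scores, key=scores.get)


def positive_classes(features):
    """Multi-label decision: every class whose model gives a positive score."""
    scores = class_scores(features)
    return sorted(name for name, value in scores.items() if value > 0)


patient = [1, 1, 0]  # symptom 1 and symptom 2 present, symptom 3 absent
print(class_scores(patient))
print(best_class(patient))
print(positive_classes(patient))
```

For a symptom-to-disease mapping, the multi-label decision is the relevant one, since the same set of symptoms can legitimately point to several diseases.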
ii) IBM Watson DeepQA Technology for Healthcare
Step 1 – The question comes in on the left. QUESTION ANALYSIS decides whether to split the
question into parts, sub-clues or sub-questions. Each question / sub-question is handled in
parallel.
Step 2 – It decides on the question type and the question objective.
a) Type of question – numeric, puzzle, exclude, rhyming
b) Objective of the question / what the question is asking for – place, person, number.
Primary search – a keyword search, run in parallel for each question / sub-question, that
returns the top-ranking documents.
Candidate Answer generation – this looks for keywords and then looks at what the candidate
answers from the top-ranking documents can be. This is light-weight processing.
Hypothesis Generation – We then restrict the candidates to a smaller set by applying
hypotheses, so that more detailed evidence processing can follow. Here, from the top-ranking
documents, we try to find the right answers in those documents.
Hypothesis and Evidence scoring – this evaluates the different types of evidence. Here we
put the candidate answers back together with the keywords in the question, do another search,
come back with passages that contain all of this information, and then do a detailed evaluation of
those passages. For example, keyword overlap between a passage and the original
question is one source of evidence.
Synthesis – this module combines the above pieces. If we have split the question into
different parts, it combines them in this module to come up with a single answer.
Final Confidence merging & Ranking – Finally, once we have all candidate answers and have computed
the different evidence scores, the merging and ranking happens. Duplicates are removed here.
Main things – Things are done in parallel: candidate answers, evidence, ranking, merging.
In our business (medical) problem, the symptoms can be the questions and the diseases
the answers. For every symptom, in the first iteration, there can be many other symptoms as
keywords, and many diseases as answers. We then create the hypotheses and apply
evidence scoring to restrict and filter the results. Synthesis then happens on
the basis of age, gender, past medical history and other demographics. Finally, confidence merging
and ranking provide the results with the highest scores and a high level of confidence.
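The mapping of this pipeline onto symptoms and diseases can be sketched roughly as follows. This is not Watson's DeepQA implementation: it only mimics the candidate generation, merging and ranking steps, assumes that evidence scores for the same disease are merged by summing, and uses invented scores and placeholder disease names.

```python
from collections import defaultdict

# Invented evidence weights: symptom -> {candidate disease: evidence score}.
EVIDENCE = {
    "Fever": {"Disease A": 0.6, "Disease B": 0.3},
    "Cough": {"Disease A": 0.5, "Disease C": 0.4},
}


def rank_candidates(symptoms, evidence=EVIDENCE, top_n=5):
    """Generate candidates per symptom, merge duplicates, and rank by total score."""
    merged = defaultdict(float)
    for symptom in symptoms:  # each symptom acts as a sub-question
        for disease, score in evidence.get(symptom, {}).items():
            merged[disease] += score  # duplicate candidates are merged by summing
    # Highest merged score first, then name for a stable order.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


for disease, score in rank_candidates(["Fever", "Cough"]):
    print(disease, round(score, 2))
```

A disease supported by several symptoms accumulates evidence and rises in the ranking, which is the intuition behind the final confidence merging step.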
• Final recommendation
Based on the data provided, the above models could not be used. We used
Naive Bayes for intent analysis and queried the data using Python with the PySpark SQL context.
• Additional Insights
We can improve the model with additional data points such as the patient’s past medical history,
location, genetic history and more.
More data on various symptom combinations, and evidence data for confirming the
hypotheses, would be very helpful in applying the above model techniques.
• Integration
This would be provided as a service that CallHealth can incorporate into their company
portal.
5. Challenges
• Limited availability of data: The data used for the project can be divided into training and
test data. The training data was downloaded from SNOMED and had limited data fields
with respect to the patient. The only data available and approved by CallHealth contained just
disease and symptom names. The training data for the intent analysis was also
created by the project team. The test data provided by CallHealth differed from the
training data set and also included age and gender details of the patient.
• Changes in the model and algorithms: Because the test data provided by CallHealth differed
from the training data set and also included age and gender details of the patient,
the algorithms and scripts were modified.
6. Conclusion
CallHealth plans, firstly, to use the model as a service for its internal team, through which
further data can be collected. Secondly, CallHealth intends to use the model as a
customer portal that gathers data from patients and lets them book an appointment
with a doctor/health consultant.
CallHealth wanted to predict the disease based on the symptoms, age and gender without
compromising on accuracy, as even a slight error in prediction may have a significant impact
on the patient. We agreed to display all the diseases matching the symptoms, age and gender, each
with its associated probability and ranked on the basis of that probability. With more data and
details about the patient, diagnostic test results, etc., the list of possible diseases can shrink
and the associated probabilities can increase.
The model will have to be re-calibrated whenever a new data field or data table is
introduced to the model. However, extending the existing model with additional diseases
and symptoms with age and gender would not require any re-calibration.
7. References
Call Health : https://www.callhealth.com/
Reference 1 : Deloitte Global Healthcare Outlook 2018
https://www2.deloitte.com/content/dam/Deloitte/global/Documents/Life-Sciences-Health-Care/gx-lshc-hc-outlook-2018.pdf
Multi-Class Classifier (One-Vs-All) (Youtube link for the tutorial)
IBM Watson for Healthcare (Youtube link for the tutorial)
8. Appendix: