(Deemed to be university)
School of Technology, Hyderabad
Department of Computer Science and Engineering

Semantic Understanding and Contextual Explanation of Medical Documents

Project Batch No: P4-06

Name of Student(s) and Full Reg No(s)
K Sai Varun Reddy HU21CSEN0101250
M Uday Sree Chand HU21CSEN0101659
T N Sai Ram HU21CSEN0101740
T Ajith Karthikeya HU21CSEN0101761

Guide Name: CH Mahender Reddy

⮚ Introduction
⮚ Abstract

⮚ Review of literature Survey

⮚ Problem statement

⮚ Objectives

⮚ Requirement analysis/Dataset

⮚ Identification of Tools/Technologies

⮚ Design/Flowchart of complete project

⮚ References
Introduction
Semantic understanding and contextual explanation of
medical reports bridge the gap between complex
medical jargon and everyday language. This process is
invaluable in real-life healthcare settings, empowering
patients, caregivers, and healthcare professionals to
comprehend medical information more effectively. By
translating technical terms into clear and concise
language, semantic understanding enhances health
literacy, improves communication, and ultimately leads
to better patient outcomes.
Abstract
This project aims to develop an AI-powered system that utilizes
advanced image recognition and natural language processing (NLP)
techniques to simplify medical reports, such as CT scans, MRI scans,
and other diagnostic reports. The system captures images of these
reports, accurately extracts and digitizes the text, and then generates
concise summaries of the findings. Additionally, it provides detailed,
patient-friendly explanations of the results and any medical
conditions identified in the report, along with the implications of
those findings. By translating complex medical terms into accessible
language, this project seeks to improve patient understanding, assist
in informed decision-making, enhance communication between
patients and healthcare providers, and promote better overall health.
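The stages described in the abstract (capture the report image, extract the text, summarize, explain) can be sketched as a small pipeline. This is only an illustrative outline, not the project's implementation: the names `explain_report`, `ReportExplanation`, `fake_ocr`, `first_sentence`, and `add_prefix` are hypothetical, and the three stage functions are placeholders standing in for a real OCR engine and real NLP models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReportExplanation:
    extracted_text: str  # digitized text of the report
    summary: str         # concise summary of the findings
    explanation: str     # patient-friendly explanation


def explain_report(
    image_bytes: bytes,
    ocr: Callable[[bytes], str],
    summarize: Callable[[str], str],
    simplify: Callable[[str], str],
) -> ReportExplanation:
    """Run the stages in order: text extraction, summary, explanation."""
    text = ocr(image_bytes).strip()
    if not text:
        raise ValueError("no text could be extracted from the report image")
    summary = summarize(text)
    return ReportExplanation(text, summary, simplify(summary))


# Placeholder stages (assumptions, for illustration only).
def fake_ocr(image_bytes: bytes) -> str:
    # Pretends the "image" already contains UTF-8 text.
    return image_bytes.decode("utf-8")


def first_sentence(text: str) -> str:
    # Stand-in summarizer: keeps only the first sentence.
    return text.split(". ")[0].rstrip(".") + "."


def add_prefix(summary: str) -> str:
    # Stand-in explainer: a real one would rewrite the medical terms.
    return "In plain terms: " + summary


result = explain_report(
    b"Mild cardiomegaly is noted. No pleural effusion.",
    ocr=fake_ocr,
    summarize=first_sentence,
    simplify=add_prefix,
)
print(result.summary)
print(result.explanation)
```

Passing the stages in as functions keeps the pipeline independent of any particular OCR or language model, so each stage can be replaced and evaluated separately.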
Review of Literature Survey

[1] Neel Kanwal, Year (2022)
• Description: Attention-based models summarize clinical notes, highlighting essential patient information like diagnoses, treatments, and outcomes, improving information accessibility for healthcare providers.
• Dataset: The dataset consists of electronic health records (EHRs) and clinical notes, containing patient history, diagnosis, treatments, medications, and outcomes, sourced from hospitals or medical databases.
• Method: Attention-based models, such as transformers, prioritize key information in clinical notes by assigning importance weights to different sections for concise summaries.
• Results: Summaries reduce clinical note length by 40-50%, retaining crucial details with high accuracy, significantly improving healthcare providers' ability to review patient information.
• Observations: Attention models accurately prioritize critical medical details but sometimes overlook less obvious yet important contextual data, requiring additional refinement.
• Research gap: Gaps include improving model precision for less prominent but essential details, addressing data privacy, and ensuring contextually accurate summarization across diverse medical cases.

[2] Mizuho Nishio, Year (2024)
• Description: The study explores the development and evaluation of large language models for the automatic summarization of radiology reports, aiming to enhance workflow efficiency and improve the clarity of medical information for healthcare providers.
• Dataset: The research utilized two extensive datasets: the MIMIC Chest X-ray database, comprising 128,032 chest radiograph reports, and the Japan Medical Image Database (JMID), which included 1,101,271 computed tomography and magnetic resonance imaging reports from 10 academic medical centers in Japan.
• Method: The study implemented four Text-to-Text Transfer Transformer (T5) models as the algorithm for automatically summarizing the radiology reports.
• Results: The results indicated that the best T5 models achieved ROUGE scores suggesting effective summarization, with 86% of the summaries from the MIMIC-CXR dataset and 85% from the JMID deemed clinically useful by radiologists.
• Observations: The observations highlighted that the automatically generated summaries from the T5 models were not only quantitatively effective but also received high ratings for clinical utility, demonstrating their potential to aid radiologists in improving report comprehension.
• Research gap: The research gap identified in the paper is the need for further exploration of advanced NLP techniques to enhance the summarization quality of radiology reports, as well as to evaluate their applicability across diverse medical imaging modalities beyond the datasets

[3] Chong Ma, Year (2024)
• Description: The research paper proposes an iterative optimization framework for automatically summarizing radiology reports using ChatGPT, aiming to improve the accuracy and coherence of generated summaries.
• Dataset: The research paper utilized the Medical Information Mart for Intensive Care - Chest X-ray database (MIMIC-CXR) and the Open Access Biomedical Image Search Engine (OpenI) datasets for training and evaluation.
• Method: The research paper leverages a large language model, ChatGPT, combined with an iterative optimization framework and a dynamic prompt generation technique to automatically summarize radiology reports.
• Results: The research found that the proposed framework significantly improved the quality of automatically generated radiology report summaries, as evaluated by human experts.
• Observations: The research paper observed that leveraging large language models and iterative optimization can significantly enhance the quality and efficiency of automatic radiology report summarization.
• Research gap: The research paper identified a gap in the field of radiology report summarization, specifically the need for more accurate and efficient methods that can leverage the capabilities of large language models while addressing domain-specific challenges.

[4] Shashank Patel, Year (2022)
• Description: The research paper proposes a system that automatically summarizes complex medical articles and simplifies their language using natural language processing techniques like extractive summarization and named entity recognition.
• Dataset: The research paper utilized a dataset of medical articles to train and evaluate the proposed summarization and simplification model.
• Method: The research paper employed a two-step approach: first, extractive summarization to identify key sentences, followed by named entity recognition to identify and replace complex medical terms with simpler explanations.
• Results: The research paper demonstrated promising results, with the ALBERT model achieving a ROUGE-1 score of 0.3789 and ROUGE-L of 0.2084, indicating improved readability and comprehension of the simplified medical summaries.
• Observations: The research paper observed that using natural language processing techniques to summarize and simplify medical articles can significantly improve health literacy and enhance patient understanding of complex medical information.
• Research gap: The research paper identified a gap in the field of medical communication, where complex medical articles often hinder patient understanding and decision-making, necessitating the development of automated tools to simplify and summarize such information.

• These literature reviews highlight the increasing interest in using natural
language processing (NLP) techniques to automate the summarization and
simplification of medical documents. Studies by Denecke (2008) and
Nishio (2024) explored the use of NLP to extract structured information
from medical documents, while Ma (2024) and Patel (2022) focused on
summarizing complex medical articles and radiology reports. These
studies demonstrate the potential of NLP to improve the efficiency and
accessibility of medical information. However, there remains a need for
further research to develop more robust and accurate methods, especially
in extracting complex information and handling diverse medical documents.
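The surveyed papers report summary quality as ROUGE-1 and ROUGE-L scores. As a rough illustration of what these metrics measure, the sketch below computes the F1 form of both from whitespace-separated, lowercased tokens. This is a simplified assumption: the cited papers may use a different tokenizer, stemming, or weighting, so these functions (`rouge_1`, `rouge_l`, and the helper `_f1` are names chosen here) will not reproduce their published numbers.

```python
from collections import Counter


def _f1(overlap: int, ref_len: int, cand_len: int) -> float:
    """Harmonic mean of precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: overlap of single words, counted with multiplicity."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return _f1(overlap, len(ref), len(cand))


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1: based on the longest common subsequence (LCS) of words."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    prev = [0] * (len(cand) + 1)  # LCS lengths for the previous reference word
    for r in ref:
        curr = [0]
        for j, c in enumerate(cand, start=1):
            curr.append(prev[j - 1] + 1 if r == c else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(ref), len(cand))


reference = "no evidence of pneumonia in the lungs"
candidate = "the lungs show no pneumonia"
print(round(rouge_1(reference, candidate), 4))
print(round(rouge_l(reference, candidate), 4))
```

In this example four of the five candidate words appear in the reference, but their order differs, so ROUGE-L (which rewards matching word order) comes out lower than ROUGE-1.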

Problem statement
Medical reports, such as MRI scans and lab results, often contain complex
terminology that can be challenging for patients to understand, leading to
confusion and anxiety. Simplifying these reports into concise, patient-
friendly explanations can improve comprehension, promote adherence to
treatment, and enhance communication with healthcare providers. For
example, if an MRI report states, "There is a disc protrusion at L5-S1
causing moderate compression of the spinal cord," a simplified version
could read, "Your lower back MRI shows a disc pressing on your spinal
cord, which may be causing your pain." This clearer explanation helps
patients grasp the issue and take informed actions.
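One simple way to approach this kind of rewriting is a glossary lookup that swaps technical phrases for plain-language equivalents. The sketch below is a minimal illustration under stated assumptions: the `GLOSSARY` entries and the function name `simplify_terms` are hypothetical, written only for this example and not clinically reviewed. A real system would draw its terms from a curated medical vocabulary and would also need to rephrase whole sentences.

```python
import re

# Hypothetical glossary for illustration only (not clinically reviewed).
GLOSSARY = {
    "disc protrusion": "bulging disc",
    "compression of": "pressure on",
    "L5-S1": "the lowest level of the lower back (L5-S1)",
}


def simplify_terms(text: str, glossary: dict = GLOSSARY) -> str:
    """Replace each glossary term found in the text with its plain-language form."""
    # Try longer terms first so multi-word phrases win over shorter ones.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(term) + r"(?!\w)" for term in terms),
        re.IGNORECASE,
    )
    lookup = {term.lower(): plain for term, plain in glossary.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


report = (
    "There is a disc protrusion at L5-S1 causing moderate "
    "compression of the spinal cord."
)
print(simplify_terms(report))
```

Term substitution alone keeps the original sentence structure, which is why the project pairs it with summarization and explanation steps rather than relying on a glossary by itself.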

Objectives
The objective of this project is to develop an AI-driven system that enhances patient comprehension of
medical reports by converting complex medical terminology into simplified, concise explanations.
Specifically, the system will extract key details from diagnostic reports such as MRI scans, CT scans, and
lab results, and generate easily understandable summaries in 2-3 sentences using natural language
processing techniques. This will empower patients to make informed decisions about their health,
improve adherence to treatment plans, reduce miscommunication between patients and healthcare
providers, and ultimately lead to improved health outcomes.
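To make the 2-3 sentence target concrete, here is a minimal extractive summarizer that scores each sentence by how frequent its words are in the report and keeps the top few in their original order. This is a simple baseline chosen for illustration, not the project's actual method: the stopword list, the scoring rule, and the names `split_sentences` and `summarize` are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "of", "in", "is", "are", "and", "or",
    "to", "with", "there", "at", "on", "for", "was", "were",
}


def split_sentences(text: str) -> list:
    """Split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def _content_words(sentence: str) -> list:
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


def summarize(text: str, max_sentences: int = 3) -> str:
    """Keep the highest-scoring sentences, preserving their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in _content_words(s))

    def score(sentence: str) -> float:
        words = _content_words(sentence)
        # Average word frequency, so long sentences are not favored by length.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


report = (
    "The lungs are clear. There is a small nodule in the right upper lobe. "
    "The nodule measures 6 mm. The heart size is normal. "
    "Follow-up CT is recommended for the nodule."
)
print(summarize(report, max_sentences=2))
```

Because the sentences are copied verbatim from the report, the output still contains medical terminology; producing patient-friendly wording is a separate simplification step.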

Datasets Used in the Project
1. MIMIC-CXR (Medical Information Mart for Intensive Care - Chest X-ray)
• Large public dataset of de-identified medical records, including chest X-ray images and radiology reports.
• Provides a diverse range of chest X-ray findings, from normal to various pathologies (e.g., pneumonia, lung
cancer, etc.).
• Used for training and evaluating the model's ability to extract relevant information from radiology reports and
associate it with corresponding X-ray images.
2. OpenI (Open Access Biomedical Image Search Engine)
• Open-source platform providing access to a vast collection of biomedical images, including CT scans, MRI scans,
and X-rays.
• Offers a variety of medical conditions and anatomical structures, enabling the model to learn from a wide range
of visual and textual data.
• Used to expand the dataset and improve the model's generalization capabilities.
• Can be used to improve the model's accuracy and relevance to the target clinical setting.

3. RSNA Pneumonia Detection Challenge Dataset
• Curated dataset specifically designed for pneumonia detection tasks.
• Contains chest X-ray images and corresponding labels indicating the presence or absence of pneumonia.
• Used to fine-tune the model's ability to identify specific medical conditions from radiological images and
4. The Cancer Imaging Archive (TCIA)
• Repository of medical images and associated clinical data, including CT scans, MRI scans, and other
• Provides a valuable resource for training and evaluating the model's performance on various cancer types
and imaging techniques.
• Used to enhance the model's ability to extract and interpret information from diverse medical imaging
5. Internal Hospital Dataset (Optional)
• If available, a private dataset of de-identified medical reports and images from a specific hospital or
healthcare system.
• Provides valuable real-world data to train and test the model on specific clinical contexts and workflows.
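Free-text radiology reports, such as those in the datasets above, are commonly organized under upper-case section headers like FINDINGS and IMPRESSION. The sketch below splits a report into those sections so that later steps can work on the relevant part. It assumes every header is written in capitals at the start of a line and followed by a colon, which will not hold for every report; the function name `split_sections` and the sample report are made up for illustration.

```python
import re
from typing import Dict

# Assumption: a header is upper-case text at the start of a line, ending in ":".
SECTION_RE = re.compile(r"^[ \t]*([A-Z][A-Z ]+):", re.MULTILINE)


def split_sections(report: str) -> Dict[str, str]:
    """Map each lower-cased section header to its whitespace-normalized body."""
    sections: Dict[str, str] = {}
    matches = list(SECTION_RE.finditer(report))
    for i, match in enumerate(matches):
        # A section body runs until the next header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        sections[match.group(1).strip().lower()] = " ".join(body.split())
    return sections


sample_report = """EXAMINATION: Chest radiograph
FINDINGS: The lungs are clear. No pleural effusion
or pneumothorax.
IMPRESSION: No acute cardiopulmonary process.
"""

sections = split_sections(sample_report)
print(sections["findings"])
print(sections["impression"])
```

The impression section is usually the radiologist's own short conclusion, so it is a natural reference when training or evaluating a summarizer on findings text.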

Identification of Tools/Technologies
Natural Language Processing (NLP) Tools:
• NLTK (Natural Language Toolkit): A versatile Python library for various NLP tasks, including tokenization, stemming, lemmatization,
part-of-speech tagging, and named entity recognition.
• spaCy: A fast and efficient NLP library for tasks such as dependency parsing and named entity recognition.
• Transformers: A family of deep learning models that have revolutionized NLP, including models like BERT, GPT-3, and RoBERTa, which
excel at understanding the context and meaning of text.
Machine Learning Libraries:
• TensorFlow and PyTorch: Popular frameworks for building and training machine learning models, including those for NLP tasks like
text classification, sentiment analysis, and text generation.
• Scikit-learn: A versatile machine learning library for tasks like feature extraction, model selection, and evaluation.
Knowledge Bases and Ontologies:
• UMLS (Unified Medical Language System): A comprehensive knowledge source for biomedical information, including concepts, terms,
and relationships between them.
• SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms): A clinical terminology system used to code and classify
medical concepts.
By leveraging these tools and technologies, we can develop sophisticated systems that can accurately extract information from
medical reports, understand the context, and generate clear, concise explanations that are tailored to the needs of different users.

Design/Flowchart of complete project

1. Literature survey: sources from Google Scholar, IEEE, and ResearchGate
1.1 Attention-based clinical note summarization
1.2 Fully automatic summarization of radiology reports using natural
language processing with large language models
1.3 Summarization and Simplification of Medical Articles using Natural Language
2. Dataset collection from Drlogy



