Surgical-VQA: Visual Question Answering in Surgical Scenes Using Transformer
1 Dept. of Biomedical Engineering, National University of Singapore, Singapore.
2 Biomedical Image Analysis Group, Imperial College London, UK.
3 Dept. of ECE, National Institute of Technology, Tiruchirappalli, India.
4 Dept. of Electronic Engineering, Chinese University of Hong Kong.
5 Shun Hing Institute of Advanced Engineering, Chinese University of Hong Kong.
lalithkumar_s@u.nus.edu, m.islam20@imperial.ac.uk, 108118004@nitt.edu, ren@nus.edu.sg / hlren@ee.cuhk.edu.hk
* Lalithkumar Seenivasan and Mobarakol Islam are co-first authors.
** Corresponding author.
1 Introduction
Lack of medical domain-specific knowledge has left many patients, medical students and junior residents with questions lingering in their minds about medical diagnosis and surgical procedures. Many of these questions are left unanswered
either because they assume their questions to be trivial, or because students and junior residents refrain from raising too many questions to avoid disrupting lectures. Their chances of finding a medical expert to clarify their doubts are also slim, since the few available medical experts are often overloaded with clinical and academic work [6]. To assist students in sharpening their skills in surgical procedures, many computer-assisted techniques [2,17] and simulators [14,18] have been proposed. Although these systems help improve students' skills and reduce the workload of academic professionals, they do not attempt to answer the students' doubts. While students are also known to learn by watching recorded surgical procedures, the task of answering their questions still falls upon medical experts. In such cases, a computer-assisted system that can process both the medical data and the questions and provide reliable answers would greatly benefit the students and reduce the medical experts' workload [20]. Surgical scenes are rich in information that such a system can exploit to answer questions related to the defective tissue, surgical tool interactions and the surgical procedure.
Because diverse information can be extracted from a single visual feature simply by varying the question, the computer vision domain has seen a recent influx of combined vision and natural language processing models for visual question answering (VQA) tasks [15,23,27]. These models are built on either long short-term memory (LSTM) [5,21] or attention modules [20,22,29]. In comparison to the computer vision domain, which is often complemented with massive
annotated datasets, the medical domain suffers from the lack of annotated data,
limiting the exploration of medical VQA. The presence of domain-specific medi-
cal terms also limits the use of transfer learning techniques to adapt pre-trained
computer-vision VQA models for medical applications. While a few works on medical VQA for medical diagnosis have recently been reported [20], VQA for surgical scenes remains largely unexplored.
In this work, (i) we design a Surgical-VQA task to generate answers for
questions related to surgical tools, their interaction with tissue and surgical pro-
cedures (Fig. 1). (ii) We exploit the surgical scene segmentation dataset from
the MICCAI endoscopic vision challenge 2018 (EndoVis-18) [3] and workflow
recognition challenge dataset (Cholec80) [25], and extend them further to intro-
duce two novel datasets for Surgical-VQA tasks. (iii) We employ two vision-text
attention-based transformer models to perform classification-based and sentence-based answering for the Surgical-VQA.

[Fig. 1 example question-answer pairs: Q1: what organ is being operated? A1: kidney. Q2: what is the state of bipolar forceps? A2: retraction. Q3: what is the state of suction? A3: tissue manipulation. Q4: where is bipolar forceps located? A4: bipolar forceps is located at left-top. Q5: what tools are operating the organ? A5: the tools operating are bipolar forceps, monopolar curved scissors, and suction.]
Fig. 1. Surgical-VQA: Given a surgical scene, the model predicts answers related to surgical tools, their interactions and the surgical procedure based on the input questions.

(iv) We also introduce a residual MLP
(ResMLP) based VisualBERT ResMLP encoder model that outperforms Visu-
alBERT [15] in classification-based VQA. Inspired by ResMLP [24], cross-token
and cross-channel sub-modules are introduced into the VisualBERT ResMLP
model to enforce interaction among the input visual and text tokens. (v) Finally, we also study how the model's performance is affected by the number of input image patches and by the inclusion of temporal visual features.
2 Proposed Method
2.1 Preliminaries
Following ResMLP [24], the self-attention output is refined by cross-token and cross-channel sub-modules with residual connections:

X_{CT} = X_{SA} + A\big(\mathrm{NORM}(X_{SA})^{T}\big)^{T}, \qquad X_{CC} = X_{CT} + C\,\mathrm{GeLU}\big(B\,\mathrm{NORM}(X_{CT})\big),

where X_{SA} is the self-attention module output, A, B and C are learnable linear layers, GeLU is the GeLU activation function [12] and NORM is layer-normalization.
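For concreteness, a minimal PyTorch sketch of the cross-token and cross-channel sub-modules acting on the self-attention output is given below; the module name and hidden dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossTokenCrossChannel(nn.Module):
    """ResMLP-style sub-modules applied to the self-attention output X_SA.

    A mixes information across tokens (cross-token); B and C mix information
    across feature channels (cross-channel), with a GeLU in between.
    """
    def __init__(self, num_tokens: int, dim: int, hidden_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.A = nn.Linear(num_tokens, num_tokens)   # cross-token linear layer
        self.B = nn.Linear(dim, hidden_dim)          # cross-channel expansion
        self.C = nn.Linear(hidden_dim, dim)          # cross-channel projection
        self.gelu = nn.GELU()

    def forward(self, x_sa: torch.Tensor) -> torch.Tensor:
        # x_sa: (batch, num_tokens, dim) output of the self-attention module.
        # Cross-token: transpose so linear layer A acts along the token axis.
        x_ct = x_sa + self.A(self.norm1(x_sa).transpose(1, 2)).transpose(1, 2)
        # Cross-channel: MLP with GeLU acting along the channel axis.
        x_cc = x_ct + self.C(self.gelu(self.B(self.norm2(x_ct))))
        return x_cc
```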
[Fig. 2 diagram: a feature extractor and BERT tokenizer produce visual and word embeddings; the VisualBERT ResMLP encoder (6 layers; self-attention, cross-token and cross-channel sub-modules) feeds either a linear classification head or a 6-layer transformer decoder (masked multi-head attention over right-shifted target tokens) for sentence prediction.]
Fig. 2. Architecture: Given an input surgical scene and questions, its text and visual
features are propagated through the vision-text encoder (VisualBERT ResMLP). (i)
classification-based answer: The encoder output is propagated through a prediction
layer for answer classification. (ii) Sentence-based answer: The encoder is combined
with a transformer decoder to predict the answer sentence word-by-word (auto-regressively).
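To make the two answering heads in Fig. 2 concrete, the sketch below shows, under assumed module names and dimensions (hidden_dim, num_answer_classes, vocab_size and so on are placeholders), how a pooled encoder output could drive a linear classification layer and how a 6-layer transformer decoder could generate a sentence answer word-by-word.

```python
import torch
import torch.nn as nn

# Assumed components; names and dimensions are illustrative, not the released code.
hidden_dim, num_answer_classes, vocab_size, max_len = 768, 18, 3000, 30
classifier = nn.Linear(hidden_dim, num_answer_classes)
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
token_embed = nn.Embedding(vocab_size, hidden_dim)
word_head = nn.Linear(hidden_dim, vocab_size)

def classify_answer(encoder_out: torch.Tensor) -> torch.Tensor:
    """(i) Classification-based answer: pool encoder tokens, predict one class."""
    pooled = encoder_out.mean(dim=1)          # simple pooling over vision+text tokens
    return classifier(pooled).argmax(dim=-1)  # answer-class index

def generate_answer(encoder_out: torch.Tensor, start_id: int, end_id: int) -> list:
    """(ii) Sentence-based answer: decode one word at a time, feeding back predictions.
    (Causal masking is omitted for brevity in this greedy inference sketch.)"""
    tokens = [start_id]
    for _ in range(max_len):
        tgt = token_embed(torch.tensor([tokens]))          # (1, t, hidden_dim)
        out = decoder(tgt, memory=encoder_out)             # cross-attend to encoder output
        next_id = word_head(out[:, -1]).argmax(-1).item()  # most likely next word
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens
```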
3 Experiment
3.1 Dataset
Med-VQA: A public dataset from the ImageCLEF 2019 Med-VQA Chal-
lenge [1]. Three categories (C1: modality, C2: plane and C3: organ) of medical
question-answer pairs from the dataset are used in this work. C1, C2 and C3 pose classification tasks (single-word answers) over 3825 images, each paired with a question.
The C1, C2 and C3 consist of 45, 16 and 10 answer classes, respectively. The
train and test set split follows the original implementation [1].
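As an illustration of how such single-word answers become classification targets, the sketch below maps answers to class indices; the helper names and the (image, question, answer) tuple layout are assumptions, not the challenge's official loader.

```python
def build_answer_vocab(qa_pairs):
    """qa_pairs: iterable of (image_id, question, answer) with single-word answers."""
    answers = sorted({ans for _, _, ans in qa_pairs})
    # e.g. 45 / 16 / 10 classes for C1 / C2 / C3 in the Med-VQA subset used here.
    return {ans: idx for idx, ans in enumerate(answers)}

def to_classification_samples(qa_pairs, answer_to_idx):
    # Each sample pairs one image and one question with a single answer-class label.
    return [(img, q, answer_to_idx[a]) for img, q, a in qa_pairs]
```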
4 Results
The performance of classification-based answering on the EndoVis-18-VQA (C), Cholec80-VQA (C) and Med-VQA (C1, C2 and C3) datasets is quantified by accuracy (Acc), recall and F-score in Table 1. Our proposed encoder (VisualBERT ResMLP) based model outperforms the current medical VQA state-of-the-art MedFuse [20] model and marginally outperforms the base encoder (VisualBERT [15]) based model on almost all datasets. While the improvement over the base model is marginal, a k-fold study (Table 2) on the EndoVis-18-VQA (C) dataset shows that the improvement is consistent. Furthermore, our model (159.0M parameters) requires 13.64% fewer parameters than the base model (184.2M).
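For reference, a minimal sketch of how the reported classification metrics could be computed with scikit-learn; the macro averaging mode and helper name are assumptions, not a copy of the paper's evaluation script.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

def classification_metrics(y_true, y_pred):
    """Accuracy, recall and F-score for classification-based answering."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F-score": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Parameter-count comparison quoted above: (184.2M - 159.0M) / 184.2M ≈ 0.137,
# i.e. roughly a 13-14% reduction with these rounded counts, in line with the
# 13.64% figure reported from the exact parameter counts.
```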
For the sentence-based answering, BLEU score [16], CIDEr [26] and ME-
TEOR [4] are used for quantitative analysis. Compared to the LSTM-based
MedFuse [20] model, the two variants of our proposed Transformer-based model
1 github.com/lalithjets/Surgical_VQA.git
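Sentence-level BLEU-4 [16] for the generated answers can be computed with NLTK as sketched below; the tokenization and smoothing choices are assumptions, and METEOR [4] and CIDEr [26] have analogous reference implementations.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference_answer: str, predicted_answer: str) -> float:
    """Sentence-level BLEU-4 between a ground-truth and a generated answer sentence."""
    ref_tokens = [reference_answer.lower().split()]  # list of reference token lists
    hyp_tokens = predicted_answer.lower().split()
    smooth = SmoothingFunction().method1             # avoids zero scores on short answers
    return sentence_bleu(ref_tokens, hyp_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

# Example:
# bleu4("action done by monopolar curved scissors is cutting",
#       "action done by monopolar curved scissors is idle")
```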
[Fig. 3: Qualitative comparison of sentence-based answers predicted by MedFuse, VisualBERT + TD and VisualBERT ResMLP + TD against the ground truth, for questions on the state of the monopolar curved scissors (EndoVis-18-VQA) and the surgical phase (Cholec80-VQA).]
[Fig. 4: (a) VisualBERT vs VisualBERT ResMLP (classification-based answering); (b) accuracy vs. patch size study (classification-based answering) on EndoVis-18-VQA (C) and Cholec80-VQA (C); (c) BLEU-4 vs. patch size study (sentence-based answering); (d) single-frame vs. temporal-frame feature study (sentence-based answering) on EndoVis-18-VQA (S) and Cholec80-VQA (S). Number of patches = 1, 4, 9, 16 and 25.]
The temporal features are then used as visual tokens for sentence-based answering. Fig. 4 (d) shows the models' performance for a varied number of input patches (patch size = 1, 4, 9, 16 and 25) computed from single frames vs. temporal frames on EndoVis-18-VQA (S) and Cholec80-VQA (S). It is observed that the performance of both transformer-based models degrades when the temporal features are used.
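One plausible way to form such temporal visual tokens, shown as an illustrative sketch under assumed tensor shapes rather than the paper's exact pipeline, is to average each frame's patch features with those of the preceding frames in a short window.

```python
import torch

def temporal_visual_tokens(frame_features: torch.Tensor, window: int = 3) -> torch.Tensor:
    """frame_features: (num_frames, num_patches, feat_dim) per-frame patch features.

    For each frame, average its patch features with those of the previous
    `window - 1` frames, yielding one temporally smoothed token set per frame.
    """
    tokens = []
    for t in range(frame_features.shape[0]):
        start = max(0, t - window + 1)
        tokens.append(frame_features[start:t + 1].mean(dim=0))  # (num_patches, feat_dim)
    return torch.stack(tokens)  # (num_frames, num_patches, feat_dim)
```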
6 Acknowledgement
We thank Ms. Xu Mengya, Dr. Preethiya Seenivasan and Dr. Sampoornam
Thamizharasan for their valuable inputs during project discussions. This work
is supported by Shun Hing Institute of Advanced Engineering (SHIAE project
BME-p1-21, 8115064) at the Chinese University of Hong Kong (CUHK), Hong
Kong Research Grants Council (RGC) Collaborative Research Fund (CRF C4026-
21GF and CRF C4063-18G) and (GRS)#3110167.
References
1. Abacha, A.B., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: Vqa-med: Overview of the medical visual question answering task at imageclef 2019. In: CLEF 2019 Working Notes, CEUR Workshop Proceedings. CEUR-WS.org, 9–12 September (2019)
2. Adams, L., Krybus, W., Meyer-Ebrecht, D., Rueger, R., Gilsbach, J.M., Moesges,
R., Schloendorff, G.: Computer-assisted surgery. IEEE Computer graphics and
applications 10(3), 43–51 (1990)
3. Allan, M., Kondo, S., Bodenstedt, S., Leger, S., Kadkhodamohammadi, R., Luengo,
I., Fuentes, F., Flouty, E., Mohammed, A., Pedersen, M., et al.: 2018 robotic scene
segmentation challenge. arXiv preprint arXiv:2001.11190 (2020)
4. Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with im-
proved correlation with human judgments. In: Proceedings of the acl workshop on
intrinsic and extrinsic evaluation measures for machine translation and/or summa-
rization. pp. 65–72 (2005)
5. Barra, S., Bisogni, C., De Marsico, M., Ricciardi, S.: Visual question answering:
which investigated applications? Pattern Recognition Letters 151, 325–331 (2021)
6. Bates, D.W., Gawande, A.A.: Error in medicine: what have we learned? (2000)
7. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer
for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 10578–10587 (2020)
8. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: 2009 IEEE conference on computer vision
and pattern recognition. pp. 248–255. Ieee (2009)
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
10. Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3d resid-
ual networks for action recognition. In: Proceedings of the IEEE International
Conference on Computer Vision Workshops. pp. 3154–3160 (2017)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
12. Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415 (2016)
13. Islam, M., Seenivasan, L., Ming, L.C., Ren, H.: Learning and reasoning with the
graph structure representation in robotic surgery. In: International Conference
on Medical Image Computing and Computer-Assisted Intervention. pp. 627–636.
Springer (2020)
14. Kneebone, R.: Simulation in surgical training: educational issues and practical
implications. Medical education 37(3), 267–277 (2003)
15. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: A simple
and performant baseline for vision and language. arXiv preprint arXiv:1908.03557
(2019)
16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th annual meeting of
the Association for Computational Linguistics. pp. 311–318 (2002)
17. Rogers, D.A., Yeh, K.A., Howdieshell, T.R.: Computer-assisted learning versus
a lecture and feedback seminar for teaching a basic surgical technical skill. The
American Journal of Surgery 175(6), 508–510 (1998)
18. Sarker, S., Patel, B.: Simulation and surgical training. International journal of
clinical practice 61(12), 2120–2125 (2007)
19. Seenivasan, L., Mitheran, S., Islam, M., Ren, H.: Global-reasoned multi-task learn-
ing model for surgical scene understanding. IEEE Robotics and Automation Letters
(2022)
20. Sharma, D., Purushotham, S., Reddy, C.K.: Medfusenet: An attention-based mul-
timodal deep learning model for visual question answering in the medical domain.
Scientific Reports 11(1), 1–18 (2021)
21. Sharma, H., Jalal, A.S.: Image captioning improved visual question answering.
Multimedia Tools and Applications pp. 1–22 (2021)
22. Sharma, H., Jalal, A.S.: Visual question answering model based on graph neu-
ral network and contextual attention. Image and Vision Computing 110, 104165
(2021)
23. Sheng, S., Singh, A., Goswami, V., Magana, J., Thrush, T., Galuba, W., Parikh,
D., Kiela, D.: Human-adversarial visual question answering. Advances in Neural
Information Processing Systems 34 (2021)
24. Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E.,
Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al.: Resmlp: Feedfor-
ward networks for image classification with data-efficient training. arXiv preprint
arXiv:2105.03404 (2021)
25. Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy,
N.: Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE
transactions on medical imaging 36(1), 86–97 (2016)
26. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image
description evaluation. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 4566–4575 (2015)
27. Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: Simvlm: Sim-
ple visual language model pretraining with weak supervision. arXiv preprint
arXiv:2108.10904 (2021)
28. Wiseman, S., Rush, A.M.: Sequence-to-sequence learning as beam-search optimiza-
tion. arXiv preprint arXiv:1606.02960 (2016)
29. Zhang, S., Chen, M., Chen, J., Zou, F., Li, Y.F., Lu, P.: Multimodal feature-wise
co-attention method for visual question answering. Information Fusion 73, 1–10
(2021)