Cross-Attention Mechanism for Medical Visual Question Answering

Nada Fadhil Mohammed, College of Information Technology, University of Babylon, Babylon, Iraq
Israa H. Ali, College of Information Technology, University of Babylon, Babylon, Iraq

Abstract

Visual Question Answering (VQA) is a machine learning task that aims to create systems capable of answering natural language questions based on given images. Medical VQA systems, a domain-specific application of VQA, assist in understanding clinically relevant information from medical images. These systems leverage deep neural techniques to generate accurate answers to questions, which can be closed-ended or open-ended. This paper proposes a cross-attention-based Medical VQA system. The proposed system achieves good performance by integrating three key components, each addressing a critical challenge in medical visual question answering. First, the biomedical domain-specific pretraining of BioBERT enables the extraction of powerful contextual features from questions; compared to generic language models, it is highly effective at handling rare medical terminology and abbreviations and is well suited to processing grammatically complex medical questions. Second, a Denoising Autoencoder (DAE) model for visual feature extraction produces strong visual features and focuses on small objects by slicing the image into overlapping patches, thereby improving the localization of abnormalities within the image. Finally, a cross-attention mechanism is proposed to model the hidden relationship between the medical image and the question and to enhance the fused feature vector. The attention mechanism comprises intramodal (within-modality) and intermodal (cross-modality) components, enabling the model to focus on the relevant parts of the image and question for answer generation. Experiments conducted on the VQA-RAD and Med-VQA 2019 datasets demonstrate that the proposed system achieves accuracies of 76.5% and 78.3%, respectively, outperforming baseline models that use traditional attention mechanisms such as Bilinear Attention Networks (BAN) or Stacked Attention Networks (SAN).
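
The intramodal and intermodal attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention without learned projections, random arrays stand in for the DAE patch features and BioBERT token features, and the names `attention` and `fuse`, the feature dimension of 16, and the mean-pool-and-concatenate fusion are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention (no learned projections in this sketch).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def fuse(image_feats, question_feats):
    # Intramodal step: each modality attends to itself.
    img_self, _ = attention(image_feats, image_feats, image_feats)
    q_self, _ = attention(question_feats, question_feats, question_feats)
    # Intermodal step: question tokens attend to image patches and vice versa.
    q_to_img, w_qi = attention(q_self, img_self, img_self)
    img_to_q, _ = attention(img_self, q_self, q_self)
    # Hypothetical fusion: mean-pool each attended sequence and concatenate.
    fused = np.concatenate([q_to_img.mean(axis=0), img_to_q.mean(axis=0)])
    return fused, w_qi

# Stand-in features: 9 image patches and 5 question tokens, dimension 16.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(9, 16))
question_feats = rng.normal(size=(5, 16))
fused, w_qi = fuse(image_feats, question_feats)
```

Each row of `w_qi` is a distribution over image patches for one question token, which is the sense in which the model can focus on the relevant parts of the image for a given question. A trained system would add learned query, key, and value projections and feed the fused vector to an answer classifier or decoder.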

Keywords

BAN, BioBERT, Computer vision, Cross-attention mechanism, DAE, Medical VQA, Natural language processing

Subject Area

Computer Science

Article Type

Article

First Page

1270

Last Page

1281

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite this Article

Mohammed, Nada Fadhil and Ali, Israa H. (2026) "Cross-Attention Mechanism for Medical Visual Question Answering," Baghdad Science Journal: Vol. 23: Iss. 4, Article 11.
DOI: https://doi.org/10.21123/2411-7986.5267
