Abstract
Visual Question Answering (VQA) is a machine learning task that aims to build systems capable of answering natural language questions about given images. Medical VQA, a domain-specific application of VQA, helps extract clinically relevant information from medical images. These systems use deep neural networks to answer questions that may be closed-ended or open-ended. This paper proposes a cross-attention-based Medical VQA system that integrates three key components, each addressing a critical challenge in medical visual question answering. First, the biomedical domain-specific pretraining of BioBERT enables the extraction of powerful contextual features from questions; compared to generic language models, it is highly effective at handling rare medical terminology and abbreviations and is well suited to grammatically complex medical questions. Second, a Denoising Autoencoder (DAE) extracts strong visual features and focuses on small objects by slicing the image into overlapping patches, thereby improving the localization of abnormalities within the image. Finally, a cross-attention mechanism is applied to model the hidden relationship between the medical image and the question and to enhance the fused feature vector. The attention mechanism comprises intramodal (within-modality) and intermodal (cross-modality) components, enabling the model to focus on the parts of the image and question that are relevant for answer generation. Experiments conducted on the VQA-RAD and Med-VQA 2019 datasets show that the proposed system achieves accuracies of 76.5% and 78.3%, respectively, outperforming baseline models that use traditional attention mechanisms such as Bilinear Attention Networks (BAN) or Stacked Attention Networks (SAN).
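The two mechanisms named in the abstract, overlapping patch slicing and intermodal attention, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `overlapping_patches` and `cross_attention`, the toy feature values, and the use of plain scaled dot-product attention without learned projections are all assumptions made for illustration.

```python
import math


def overlapping_patches(image, size, stride):
    """Slice a 2D list into size x size patches; stride < size gives overlap."""
    height, width = len(image), len(image[0])
    return [
        [row[x:x + size] for row in image[y:y + size]]
        for y in range(0, height - size + 1, stride)
        for x in range(0, width - size + 1, stride)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows.

    Intermodal use: queries are question-token features, keys/values are
    image-patch features. Intramodal use: pass the same modality for all three.
    Assumption: no learned projection matrices, unlike a trained model.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy example (hypothetical values): a 4x4 "image" and 2-dimensional features.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = overlapping_patches(image, size=2, stride=1)

question_feats = [[1.0, 0.0], [0.0, 1.0]]           # two question tokens
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three image patches
fused, attn = cross_attention(question_feats, patch_feats, patch_feats)
```

Each row of `attn` is a probability distribution over image patches for one question token, which is the sense in which the model "focuses" on relevant image regions; in a real system the features would come from BioBERT and the DAE encoder rather than hand-written lists.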
Keywords
BAN, BioBERT, Computer vision, Cross-attention mechanism, DAE, Medical VQA, Natural language processing
Subject Area
Computer Science
Article Type
Article
First Page
1270
Last Page
1281
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
How to Cite this Article
Mohammed, Nada Fadhil and Ali, Israa H. (2026) "Cross-Attention Mechanism for Medical Visual Question Answering," Baghdad Science Journal: Vol. 23: Iss. 4, Article 11.
DOI: https://doi.org/10.21123/2411-7986.5267
