Abstract
Visual impairment greatly restricts an individual's ability to perceive and interact with the environment, making everyday activities challenging without appropriate assistive technologies. One major barrier is the inability to perceive visual scenes, which are essential for spatial awareness, wayfinding, and situational knowledge. This study addresses this gap by proposing an intelligent assistive image captioning system designed specifically for visually impaired individuals. The system employs state-of-the-art computer vision and natural language processing techniques to generate context-oriented text captions for images, combined with interactive voice responses. Four convolutional neural network encoders, namely InceptionV3, InceptionResNetV2, Xception, and DenseNet201, each paired with an LSTM-based decoder, generate initial captions. Additional captions are produced by a transformer-based ViT-GPT2 architecture. An ensemble method then selects the best caption using BLEU scores. Google Text-to-Speech provides audio output, and YOLOv8n detects humans to support real-time voice queries. We tested the system on the Flickr8k dataset, and the results show that the ensemble method outperforms both the CNN-LSTM and ViT-GPT2 architectures. Specifically, the ensemble method achieved a BLEU-1 score of 0.7363, a BLEU-4 score of 0.2642, a METEOR score of 0.4545, and a ROUGE-L score of 0.5107. These results show that combining multiple models yields better captions and greater interactivity, a significant step towards real-time, accessible, and intelligent assistance for visually impaired individuals.
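The abstract states that an ensemble step selects the best of the five candidate captions using BLEU scores. The sketch below illustrates one plausible interpretation, a consensus-style selection in which each model's caption is scored by BLEU-1 (modified unigram precision, simplified here without a brevity penalty) against the other models' captions, and the highest-scoring caption wins. The function names, the scoring-against-peers scheme, and the example captions are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    # Simplified BLEU-1: clipped unigram precision of the candidate
    # against a single reference (no brevity penalty) -- an assumption,
    # not necessarily the paper's exact BLEU implementation.
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    return overlap / max(len(cand), 1)

def select_caption(candidates: list[str]) -> str:
    # Consensus ensemble (hypothetical scheme): score each caption by its
    # mean BLEU-1 against the remaining models' captions; return the best.
    best, best_score = candidates[0], -1.0
    for i, cap in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(bleu1(cap, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cap, score
    return best

# Example: three similar captions and one outlier; the caption that agrees
# most with its peers is selected.
captions = [
    "a dog runs through the grass",
    "a brown dog running in grass",
    "a dog is running on green grass",
    "two people sit on a bench",
]
print(select_caption(captions))  # prints "a brown dog running in grass"
```

The consensus scheme is useful at inference time, when no ground-truth reference captions are available to score against.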
Keywords
CNN-LSTM models, Ensemble learning, Image captioning, Visually impaired, ViT-GPT2 transformer
Subject Area
Computer Science
Article Type
Article
First Page
1299
Last Page
1315
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
How to Cite this Article
Mazher, Alaa Noori and AL-Khafaji, Ghadah K. (2026) "A Multi-Model Ensemble Framework for Assistive Image Captioning with Voice Interaction for the Visually Impaired Users," Baghdad Science Journal: Vol. 23: Iss. 4, Article 13.
DOI: https://doi.org/10.21123/2411-7986.5269
