Abstract
Visual impairment greatly restricts an individual's ability to perceive and interact with the environment, making everyday activities challenging without appropriate assistive technologies. One major barrier is the inability to perceive visual scenes, which are essential for spatial awareness, wayfinding, and situational knowledge. This study addresses this gap by proposing an intelligent assistive image captioning system designed specifically for visually impaired individuals. The system employs state-of-the-art computer vision and natural language processing techniques to generate context-oriented text captions for images, combined with interactive voice responses. Four convolutional neural network encoders, namely InceptionV3, InceptionResNetV2, Xception, and DenseNet201, each paired with an LSTM-based decoder, generate initial captions. Additional captions are produced by a transformer-based ViT-GPT2 architecture. An ensemble method then selects the best caption using BLEU scores. Google Text-to-Speech provides audio output, and YOLOv8n detects humans to support real-time voice queries. We tested the system on the Flickr8k dataset, and the results show that the ensemble method outperforms both the CNN-LSTM and ViT-GPT2 architectures. Specifically, the ensemble method achieved a BLEU-1 score of 0.7363, a BLEU-4 score of 0.2642, a METEOR score of 0.4545, and a ROUGE-L score of 0.5107. These results show that combining multiple models yields better captions and greater interactivity, a significant step towards real-time, accessible, and intelligent assistance for visually impaired individuals.
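The abstract states that an ensemble step selects the best of the five candidate captions using BLEU scores. The sketch below illustrates one plausible interpretation, a consensus-style selection in which each model's caption is scored by BLEU-1 (modified unigram precision, simplified here without a brevity penalty) against the other models' captions, and the highest-scoring caption wins. The function names, the scoring-against-peers scheme, and the example captions are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    # Simplified BLEU-1: clipped unigram precision of the candidate
    # against a single reference (no brevity penalty) -- an assumption,
    # not necessarily the paper's exact BLEU implementation.
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    return overlap / max(len(cand), 1)

def select_caption(candidates: list[str]) -> str:
    # Consensus ensemble (hypothetical scheme): score each caption by its
    # mean BLEU-1 against the remaining models' captions; return the best.
    best, best_score = candidates[0], -1.0
    for i, cap in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(bleu1(cap, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cap, score
    return best

# Example: three similar captions and one outlier; the caption that agrees
# most with its peers is selected.
captions = [
    "a dog runs through the grass",
    "a brown dog running in grass",
    "a dog is running on green grass",
    "two people sit on a bench",
]
print(select_caption(captions))  # prints "a brown dog running in grass"
```

The consensus scheme is useful at inference time, when no ground-truth reference captions are available to score against.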
Keywords
CNN-LSTM models, Ensemble learning, Image captioning, Visually impaired, ViT-GPT2 transformer
Subject Area
Computer Science
Article Type
Article
First Page
1299
Last Page
1315
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
How to Cite this Article
Mazher, Alaa Noori and AL-Khafaji, Ghadah K. (2026) "A Multi-Model Ensemble Framework for Assistive Image Captioning with Voice Interaction for the Visually Impaired Users," Baghdad Science Journal: Vol. 23: Iss. 4, Article 13.
DOI: https://doi.org/10.21123/2411-7986.5269
