Abstract

Visual impairment greatly restricts an individual's ability to perceive and interact with the environment, making everyday activities challenging without appropriate assistive technologies. One major barrier is the inability to perceive visual scenes, which are essential for spatial awareness, wayfinding, and situational knowledge. This study addresses this gap by proposing an intelligent assistive image captioning system designed specifically for visually impaired individuals. The system employs state-of-the-art computer vision and natural language processing techniques to generate context-oriented text captions for images, delivered through interactive voice responses. Four Convolutional Neural Network models, namely InceptionV3, InceptionResNetV2, Xception, and DenseNet201, each paired with an LSTM-based decoder, generate candidate captions, and additional captions are generated by a transformer-based ViT-GPT2 architecture. An ensemble method then selects the best caption based on BLEU scores. Google Text-to-Speech converts captions to audio, and YOLOv8n detects humans to support real-time voice queries. We evaluated the system on the Flickr8k dataset, and the results show that the ensemble method outperforms the individual CNN-LSTM and ViT-GPT2 architectures, achieving a BLEU-1 score of 0.7363, a BLEU-4 score of 0.2642, a METEOR score of 0.4545, and a ROUGE-L score of 0.5107. These results demonstrate that combining multiple models yields better captions and richer interactivity, a significant step toward real-time, accessible, and intelligent assistance for visually impaired individuals.
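The BLEU-based ensemble selection described above can be sketched as follows: each model proposes a caption, and the caption scoring highest against the reference captions is kept. This is a minimal illustration, not the paper's actual code; the simplified BLEU-1 here (clipped unigram precision with a brevity penalty) and all function names are assumptions for the sketch.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Simplified BLEU-1: clipped unigram precision with a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    # Clip each candidate unigram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in refs:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty uses the reference length closest to the candidate's.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision

def select_best_caption(candidates, references):
    """Ensemble step: keep the candidate caption with the highest BLEU-1."""
    return max(candidates, key=lambda c: bleu1(c, references))

# Example: two hypothetical model outputs scored against one reference.
candidates = ["a dog runs across the grass",
              "a brown dog is running through a field"]
references = ["a brown dog running through the grassy field"]
print(select_best_caption(candidates, references))
# -> a brown dog is running through a field
```

In the full system, the candidates would come from the four CNN-LSTM decoders and the ViT-GPT2 model, and a production implementation would typically use an established scorer (e.g. NLTK's `sentence_bleu`) with higher-order n-grams and smoothing rather than this bare BLEU-1.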

Keywords

CNN-LSTM models, Ensemble learning, Image captioning, Visually impaired, ViT-GPT2 transformer

Subject Area

Computer Science

Article Type

Article

First Page

1299

Last Page

1315

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
