Abstract

The main goals of this study are to prepare a collection of Punjabi-English social media text and to build a next-word prediction system for Punjabi users. Because the Gurmukhi script has a large character set, typing in Punjabi is time-consuming, which is why many Punjabi users communicate in the Roman script on social media platforms such as WhatsApp, Facebook, and Twitter. For these users, we propose a sequential CNN-BiLSTM architecture that suggests the next word, improving typing speed and making communication more convenient. The proposed model uses a 1D convolutional layer with 128 filters and a kernel size of 5, followed by a bidirectional LSTM layer with 150 units. The model was trained with a batch size of 256, and training took 38.22 minutes. We also describe the key challenges in collecting and preprocessing social media text. To train the model, we collected 311,271 WhatsApp sequences ranging from 2 to 33 words. Experimental evaluation shows that the proposed sequential CNN-BiLSTM model achieves higher accuracy than other LSTM-based approaches, namely LSTM, BiLSTM, and CNN-LSTM. The proposed model efficiently learns the common linguistic patterns in the Punjabi-English bilingual social media text produced by Punjabi users. The results highlight the important role of deep learning techniques in solving complex linguistic challenges and improving language modeling for multilingual text.
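The architecture described above can be sketched as follows. This is a minimal, hedged reconstruction in Keras: only the 1D convolution (128 filters, kernel size 5), the bidirectional LSTM (150 units), and the batch size (256) come from the abstract; the vocabulary size, embedding dimension, and context length are illustrative assumptions, not the authors' actual settings.

```python
# Sketch of a sequential CNN-BiLSTM next-word prediction model in Keras.
# Hyperparameters marked "assumed" are placeholders, not values from the paper.
from tensorflow.keras import layers, models

vocab_size = 20000   # assumed vocabulary size (not stated in the abstract)
embed_dim = 100      # assumed embedding dimension
max_len = 32         # assumed context length; sequences span 2 to 33 words

model = models.Sequential([
    layers.Input(shape=(max_len,)),                  # token-id input sequence
    layers.Embedding(vocab_size, embed_dim),         # learned word embeddings
    layers.Conv1D(filters=128, kernel_size=5,        # conv settings from the abstract
                  activation="relu", padding="same"),
    layers.Bidirectional(layers.LSTM(150)),          # BiLSTM units from the abstract
    layers.Dense(vocab_size, activation="softmax"),  # next-word distribution
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training would then use the batch size reported in the abstract, e.g.:
# model.fit(X_train, y_train, batch_size=256, epochs=...)
```

At inference time, the model outputs a probability distribution over the vocabulary; taking the top-k entries yields the word suggestions shown to the user.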

Keywords

BiLSTM, CNN, LSTM, Next-word prediction, Social media text

Subject Area

Computer Science

Article Type

Article

First Page

18190

Last Page

18200

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.
