Sentiment Analysis on Roman Urdu Students’ Feedback Using Enhanced Word Embedding Technique

Students’ feedback is crucial for educational institutions to assess the performance of their teachers, yet most opinions are expressed in the students’ native language, particularly in South Asian regions. In Pakistan, people use Roman Urdu to express their reviews, and this extends to the education domain, where students write their feedback in Roman Urdu. Handling qualitative opinions manually is a time-consuming and labor-intensive process. Additionally, it can be difficult to determine sentence semantics in text written in a colloquial style like Roman Urdu. This study proposes an enhanced word embedding technique and investigates neural word embeddings (Word2Vec and GloVe) to determine which performs better for Roman Urdu sentiment analysis. The proposed model employs a BiLSTM network to maintain context in both directions, and results for ternary classification are obtained from a final softmax output layer. The model was evaluated on a manually labeled dataset collected from higher education institutions (HEIs) of Pakistan. It was empirically evaluated on two Roman Urdu datasets: the newly developed students’ feedback dataset and the publicly available RUSA-19 dataset. The model performs effectively using the enhanced word embedding and BiLSTM layer, and is compared with baseline CNN, RNN, GRU, and classic LSTM models. The experimental findings demonstrate the proposed model’s efficacy with an F1-score of 90%.


Introduction
In Pakistan and the subcontinent, Roman Urdu is a highly popular language on social media. It is primarily used to post reviews and other feedback on social media and other venues. Roman Urdu is a dialect of Urdu, the world's third-largest language. Numerous languages, most notably Hindi and Roman Urdu, are considered resource-poor in comparison to English because far fewer resources are available for conducting sentiment analysis on them 2. The term "Roman Urdu" refers to the Urdu language written in Roman script using the English alphabet. Roman Urdu is regarded as an example of a morphologically difficult but academically rich language 3. It faces multiple challenges because the English and Latin alphabets offer little morphological support. The informal style of Roman Urdu writing is the main obstacle to interpreting the language, and the uneven representation of text has posed a significant barrier to researchers. In other words, there are no norms for Roman Urdu due to spelling irregularities and variations 4. As the official language of Pakistan, Urdu is used by over 200 million people daily, while another 100 million people worldwide use it to communicate in writing 5.
Fewer linguistic resources, such as stop-word lists, lemmatization tools, and stemmers, are available for processing Urdu text.
Urdu can be transliterated into Roman script using the English alphabet. For example, the term "Acha" in Roman Urdu means "Good"; as a result, the term has many different spellings, such as Asha, Achha, Achaa, and Ascha.
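The spelling-variation problem described above is commonly handled with a variant-to-canonical lookup. A minimal sketch is shown below; the variant lists are illustrative examples, not the paper's actual lexicon.

```python
# Dictionary-based spelling normalization for Roman Urdu (illustrative only).
# Each canonical form maps to a list of spelling variants observed in text.
VARIANT_MAP = {
    "acha": ["asha", "achha", "achaa", "ascha"],   # "good"
}

# Invert the map so each variant points directly to its canonical form.
CANONICAL = {v: k for k, variants in VARIANT_MAP.items() for v in variants}

def normalize(word):
    """Return the canonical spelling of a Roman Urdu word, if one is known."""
    w = word.lower()
    return CANONICAL.get(w, w)
```

Unknown words pass through unchanged, so the lookup can be applied safely to an entire corpus.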
Due to the informal nature of Roman Urdu text, sentiment analysis has always been a difficult task in natural language processing. Most studies have focused on customer reviews of products, services 5, music, mobile devices, accommodations, etc., and limited work has investigated the education domain, especially in the Roman Urdu language 6.
In existing studies, machine and deep learning techniques combined with word embeddings have primarily been used to conduct sentiment analysis. For Roman Urdu, no standard word embedding is available because of the colloquial nature of the text, and standard feature extraction models are insufficient and must be revised to handle it. To overcome these limitations, this paper presents a first attempt at sentiment analysis on a Roman Urdu corpus of faculty teaching evaluations using enhanced word embedding and a BiLSTM model.
The primary contributions of this paper are as follows:
1. A new students' feedback dataset (Roman Urdu) has been developed and annotated into three groups: positive, negative, and neutral.
2. A Roman Urdu sentiment analysis model has been developed using BiLSTM and an enhanced word embedding technique.
3. The suggested approach is evaluated against various deep neural network configurations (CNN, RNN, GRU, and classic LSTM) to determine which produces the most accurate results with the enhanced word embedding for Roman Urdu data. The best method for the Roman Urdu sentiment analysis task has been identified by comparing the effectiveness of these configurations.
The rest of the paper is structured as follows: deep learning and word embedding algorithms for Roman Urdu sentiment analysis are discussed in Section II; the proposed approach is the subject of Section III; results and conclusions from the experiments are presented in Sections IV and V.

Related Work
Sentiment analysis is a branch of text classification and one of the most important research fields in natural language processing (NLP) 7. The term refers to collecting and analyzing people's opinions, feelings, and perspectives on various issues, such as services and products 8. Sentiment analysis has achieved great success on product, restaurant, service, movie, and hotel review datasets 5,9. Recently, researchers in the field of education have also given sentiment analysis considerable attention 10. Since the COVID-19 epidemic, when the majority of educational institutions switched from traditional face-to-face instruction to an online format, student feedback has grown in popularity and importance.
Countries are trying to improve their educational institutions to grow further in society 10. Higher education institutions are also trying to enhance the quality of education by evaluating instructors' teaching, student progress, and courses using feedback 11. Deep learning (DL) models combined with word embedding techniques have significantly enhanced the performance of NLP applications, including sentiment analysis and question answering 12-15. DL models in NLP use word embeddings to automatically extract and recognize high-level features from textual input 16.
In light of the limitations of traditional machine learning approaches, various DNNs have been used for sentiment classification 17; these include CNNs, RNNs, RAEs, and DBNs. The RNN is the preferred option for sequential modelling tasks due to its ability to handle sequences of any length 18.
RNN variants such as the LSTM have shown remarkable performance while keeping contextual information for classifying texts. After training with enough data and computational power, RNN variants including LSTM, GRU 19, and BiLSTM 20 were utilized for sentiment analysis with promising results 21,22. One study addressed sentiment analysis on Roman Urdu data by comparing deep learning-based approaches against machine learning classifiers; the suggested deep learning model, LSTM, was found to be more effective than the machine learning approaches 23. TF-IDF and GloVe embeddings were used with a BiLSTM classifier and achieved better results 24. Although the LSTM has proven highly successful on several NLP problems, it still has room for improvement. According to the literature, BiLSTM performs better mostly for resource-rich languages (English), and only limited work has been conducted on Roman Urdu. There is therefore a need to design a sentiment analysis model based on enhanced embedding and BiLSTM for a Roman Urdu corpus.

Methodology
In this study, an enhanced word embedding method and BiLSTM were used to create a deep learning model for Roman Urdu sentiment analysis. Fig. 1 shows the operational research design of the proposed model. The details of the framework, the deep learning model, and the findings are covered in the following sections.

Dataset:
A dataset of students' feedback reviews was collected for faculty performance evaluation in order to create an academic-domain dataset. Data were collected from higher education institutions (HEIs) of Pakistan. A sample of comments from the Roman Urdu dataset with English translations is shown in Table 2. Three specialists with a strong grasp of Urdu script manually annotated the reviews into three groups (positive, negative, and neutral).

Data Pre-processing:
Basic pre-processing steps such as stop-word removal, normalization, and stemming 30 are required before applying classification to the data. A rule-based lexical normalization approach was used to normalize the text. Creating lexical normalization standards has been extremely difficult due to the inconsistent and uneven style of Roman Urdu text.

Lexical Normalization & Stemming:
Stemming primarily aims to reduce the variety of expressions by combining related words into a single stem word. Before the data is fed in, a stemming procedure is performed using a human-annotated lexical dictionary. This lexicon-based dictionary was developed to standardize textual data containing an extensive variety of spellings; the stemming technique groups related words with similar meanings based on the dictionary. Stemming is applied to the data after the rule-based lexical normalization approach. Multiple stemmers are available for English, including the Porter and Snowball stemmers, but no stemmer is available for Roman Urdu, and stemming Roman Urdu is significantly more difficult and challenging than stemming other languages. A dictionary-based stemming function has therefore been proposed (examples are given in Table 3). Our stemming method is built on the mapping function f: W → S, which connects a word W to a stem word S. Here, W is a set of words against which a possible stem word from S is mapped. If the method cannot match a particular lexicon entry to its stem word, it returns the actual word.
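The mapping function f: W → S with its fall-through behavior can be sketched as follows; the stem dictionary here is a hypothetical example, not the paper's annotated lexicon.

```python
# Dictionary-based stemming f: W -> S (illustrative entries only).
# Related spellings map to a single stem; unknown words are returned as-is.
STEM_DICT = {
    "acha": "acha", "achaa": "acha", "achha": "acha",   # "good"
    "madadgar": "madad", "madad-gar": "madad",          # "helping"
}

def stem(word):
    """Map a word W to its stem S; return the word itself if no stem is known."""
    return STEM_DICT.get(word.lower(), word)

def stem_corpus(tokens):
    """Apply the mapping over a tokenized corpus."""
    return [stem(t) for t in tokens]
```

Because a Python dict is hash-based, each lookup retrieves the stem directly from the word's index, matching the hashing strategy described next.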
Further, as shown in Fig. 2, each word's index is handled separately using a hashing technique, so that the stem word can be retrieved immediately; the anomalies in the data have been reduced by stemming the entire corpus using this mapping strategy. The first layer of the model performs word embedding. The literature review indicates that neural word embeddings perform better on text classification problems; word embedding is primarily used to generate dense vectors of reduced dimensions for the subsequent layers of a neural network. The colloquial nature of Roman Urdu text is the primary obstacle in processing it.
This informal text style has presented researchers with a formidable challenge. Word embeddings have achieved significant success in sentiment analysis tasks, especially in resource-rich languages (English); however, limited studies have investigated word embeddings for Roman Urdu because of the informal style of the text. Our study therefore proposes an enhanced word embedding technique trained on annotated data, adding domain-specific knowledge to enhance the quality of the word vectors. Pre-trained neural embeddings (Word2Vec and GloVe) were employed to investigate which performs better for Roman Urdu sentiment analysis. Finally, the output of the enhanced word embedding, in the form of vectors, is passed to the BiLSTM network layer.
In the second layer, the proposed model uses a bidirectional LSTM; since it operates with two hidden layers, it processes data in both directions and can capture both preceding and subsequent contexts. Our dataset includes both short and long reviews, and the BiLSTM can also deal with the long-term dependency problem; details are given in the next section (Recurrent Neural Network).
Further, the proposed model consists of a text-sequence input layer, an embedding layer, bidirectional LSTM layers with specific regularizations, a dropout regularization layer, the RMSprop optimizer, and a softmax activation function in a final dense output layer, which is used to obtain the results of the ternary classification.

Recurrent Neural Network:
An RNN is capable of retaining previous information, but simple RNNs can only recall short time spans, not long ones. This issue is known as the long-term dependency problem, and it can be addressed with the LSTM. The LSTM's recurrent architecture can solve the exploding and vanishing gradient problem in addition to maintaining contextual information; its gating units allow it to regulate the flow of information and determine what to ignore and what to update. Given an input x_t, a hidden output state h_t, and a previous hidden output state h_(t-1), a simple RNN can be expressed as follows:

h_t = tanh(W_x x_t + W_h h_(t-1) + b_h)    (1)

Here, h_t represents the hidden output, tanh is a squashing function, W is the weight matrix, and b represents a bias.
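The recurrence in Eq. 1 can be sketched in NumPy as follows; the dimensions and random weights are illustrative assumptions, not the model's trained parameters.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b_h):
    """One simple-RNN step: h_t = tanh(W_x x_t + W_h h_(t-1) + b_h)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b_h)

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3                      # toy input/hidden sizes
W_x = rng.normal(size=(d_hid, d_in))    # input-to-hidden weights
W_h = rng.normal(size=(d_hid, d_hid))   # hidden-to-hidden weights
b_h = np.zeros(d_hid)

h = np.zeros(d_hid)                     # initial hidden state
for x_t in rng.normal(size=(5, d_in)):  # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b_h)
```

Because tanh squashes each component into (-1, 1), the hidden state stays bounded no matter how long the sequence is.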
The core idea behind the LSTM is that cell states serve as conveyor belts. With little linear interaction, this conveyor belt transfers data in sequence through multiple cell states. During this operation, information is added to or removed from the cell states using gates. The traditional LSTM has four memory blocks: three multiplicative components known as gates, and one cell state c (Fig. 4). As Eq. 6 states, the previous cell state c_(t-1) is changed to the new cell state c_t. To identify which part of the cell state contributes to the output o_t, a sigmoid layer [0, 1] is executed; its result is element-wise multiplied with a tanh layer applied to the cell state. The tanh function only produces values between -1 and 1. This yields the presented output h_t. To find the best weights for the suggested model, learning optimizers were empirically evaluated across an appropriate number of epochs. The model was trained for 27 epochs.

Figure 5. Training and Validation accuracy for ternary classification on the Roman Urdu Student's feedback Dataset
Most deep learning models suffer from overfitting on the training data; to prevent overfitting, a dropout layer and a batch normalization layer were added. Initially, the dropout ratio was set to 0.1 with return sequences set to true, and the second dropout layer was set to 0.2. When a dropout rate above 0.2 was applied, the model was observed to lose important features, which negatively impacted performance. Fig. 5 clearly shows that accuracy improves up to the 27th epoch and generally declines thereafter. The experimental evaluation showed that continuing to train the model would incur more loss, so the final weights learned by the model were those after 27 epochs.

Training and Validation Loss

Experiments:
Model performance was evaluated on our own academic-domain students' feedback Roman Urdu dataset (2,000 reviews). The model was also tested on the well-known, publicly available RUSA-19 31 customer reviews dataset (10,021 reviews) to see whether the proposed model generalizes. This section describes the experimental design and the specific baseline techniques, with discussion.

Experimental Setup:
The TensorFlow Keras deep learning library was employed in our test environment. Additionally, the data were split into training and testing sets using scikit-learn. In our empirical configuration, the neural word embeddings Word2Vec and GloVe were assessed due to their promising performance on several NLP tasks. The Keras preprocessing library was used to generate tokens and apply padding to balance the sequence lengths. Our newly developed academic-domain students' feedback dataset was annotated for the Roman Urdu sentiment analysis task and categorized into three groups: positive, negative, and neutral. Unfortunately, there is no publicly accessible benchmark dataset containing student reviews with polarity values recorded in the ground truth.
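The tokenize-and-pad step can be sketched in pure Python as below; the Keras `Tokenizer` and `pad_sequences` utilities behave similarly, and the sample reviews are illustrative, not drawn from the dataset.

```python
# Minimal sketch of tokenization and padding; index 0 is reserved for padding.
def build_vocab(texts):
    """Assign each distinct token an integer id, starting at 1."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def texts_to_padded(texts, vocab, maxlen):
    """Convert texts to id sequences, left-padded with zeros to maxlen."""
    seqs = [[vocab.get(t, 0) for t in s.lower().split()] for s in texts]
    return [[0] * (maxlen - len(s)) + s[:maxlen] for s in seqs]

reviews = ["teacher buht acha hy", "teacher bakwass hy"]
vocab = build_vocab(reviews)
padded = texts_to_padded(reviews, vocab, maxlen=5)
```

Equal-length sequences are what allow the embedding layer to process a whole batch as one tensor.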
Consequently, a dataset of 2,000 student reviews was used, which includes 1,064 positive, 514 negative, and 422 neutral reviews, as illustrated in Table 5. The dataset is split in an 80%:20% ratio into training and test sets, respectively. The shortest reviews have 4 to 5 words, and the longest review has 42 words; the distribution shows that most reviews are between 1 and 12 words.

RUSA-19 Dataset:
The RUSA-19 dataset consists of 10,021 Roman Urdu customer reviews on a variety of subjects, including drama, technology, food, software, sports, and blogs. The data were acquired from different social media platforms. The RUSA-19 corpus includes 3,302 neutral, 3,778 positive, and 2,941 negative reviews.

Annotation Process & Guidelines:
This section discusses the entire procedure used to annotate the student reviews manually. The procedure entails creating manual annotation rules, estimating inter-annotator agreement, and the manual annotation itself. Three native speakers of Urdu with a thorough understanding of Roman Urdu were chosen to annotate the dataset, and they were given a set of instructions to follow while annotating.

Guidelines for Positive Class:
If every aspect of the assigned text is positive, the review is considered positive. A student review is a collection of one or more sentences that express feelings toward a teacher. The positive class is also selected if a student review mixes a neutral and a positive attitude. All reviews were marked positive if they directly use words such as "good," "great," "wonderful," "brilliant," or "fantastic." Positive words such as "acha" ("good"), "madad-gar" ("helping"), "azeem" ("great"), "shandar" ("wonderful"), and "laajawab" ("outclass") should appear without negation terms like "Na," "Nahi," and "mat," as these terms flip the polarity 32.

Guidelines for Negative Class:
A sentence is considered negative if a certain word expresses a negative feeling or if there is a strong sense of apparent conflict. A text review is identified as negative when negative sentiment predominates over other sentiment categories. Sentences that express unpleasantness, gloom, or sadness, or that include negative keywords such as "bura" (bad), "sakht" (harsh), "bad-ikhlaaq" (ill-mannered), "bakwass" (rubbish), or "badtameez" (rude) without additional modifiers like "Na," "Nahi," or "mat," are perceived as negative 32. All unhappy, angry, or violent reviews are likewise labeled negative.

Guidelines for Neutral Class:
Factual sentences, in which no opinion is expressed, are considered neutral. Statements containing "shayad" (perhaps) and "ho saktaa hey" (maybe) are examples of neutral statements since they convey a low degree of certainty and liability. A phrase expressing both positive and negative opinions on specific characteristics and entities is also considered neutral 32.

Data Annotation Process:
A manual annotation process was followed in which the annotators manually labeled the student feedback dataset. Three annotators (X, Y, and Z) with a solid understanding of Roman Urdu and Urdu were chosen to tag the corpus and create the new students' feedback dataset. Following the provided annotation rules, each annotator categorized each assigned review into one of three groups (positive, negative, or neutral). To evaluate the efficacy of the annotation procedure, a sample of 300 reviews annotated by X and Y was checked for contradicting pairs; in the case of a disagreement, annotator Z was assigned as a third annotator to label the disputed review. Additionally, inter-annotator agreement (IAA) was computed for the entire corpus.
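Inter-annotator agreement between two annotators is often quantified with Cohen's kappa, which corrects raw agreement for chance. A self-contained sketch is below; the label sequences are made-up examples, not the paper's annotations.

```python
# Cohen's kappa for two annotators over the same set of items.
def cohen_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from annotators X and Y on six reviews.
x = ["pos", "pos", "neg", "neu", "pos", "neg"]
y = ["pos", "neg", "neg", "neu", "pos", "neg"]
kappa = cohen_kappa(x, y)
```

Values near 1 indicate strong agreement; disputed items (here, the second review) are the ones a third annotator would adjudicate.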

Evaluation Metrics:
Precision, recall, F1-measure, and accuracy were used as evaluation metrics to assess the performance of the suggested model. To increase the model's effectiveness in the experimental setup, the fully connected layer and dropout layer were tuned with a specific number of neurons at each layer. Using the maximum token length of 67 as the input neuron size was observed to have a positive effect on model performance, rather than using power-of-two neuron sizes such as 128, 64, or 32.

Findings and Discussion:
This section discusses the findings of our experiments on two Roman Urdu datasets: our newly created students' feedback dataset and the RUSA-19 31 customer reviews dataset. Under two alternative neural word embedding techniques, our suggested model significantly beats the benchmark deep learning models.

Comparison with Baseline Method:
The classic LSTM also demonstrated performance comparable to CNN, RNN, and GRU, obtaining F1-scores of 0.85 with Word2Vec and 0.86 with GloVe. Our suggested approach performed better than the existing models and greatly enhanced performance across both embedding techniques and all four evaluation measures. The proposed model performed best with the Word2Vec embedding scheme, with precision of 0.90, recall of 0.89, F1-score of 0.90, and accuracy of 0.90. The model was also evaluated on the publicly available RUSA-19 31 customer reviews dataset to test generalizability; on this dataset, Word2Vec embedding again performed slightly better than GloVe embedding, as shown in Table 6.

Since Roman Urdu is a very casual form of written communication, people often employ multiple spellings for the same terms, leading to a wide variety of word forms. Some people make unnecessary spelling changes when attempting to convey their emotions in writing. For example, the English sentence "Teacher is so good" is written in Roman Urdu as "Teacher buht acha hy," but some people might write "Teacher booooot achaaa hy." This could easily be managed by a robust stemmer, which Roman Urdu unfortunately lacks. As discussed in the pre-processing section, dictionary-based stemming and normalization were applied to reduce the complexity and anomalies of the data, and the experiments show that model performance improved due to the stemming process. In contrast, for the RUSA-19 dataset, general pre-processing techniques were applied instead of the manual stemmer and normalization; this is the main reason for the performance difference between the two datasets.
Finally, sentiment polarity (positive, negative, or neutral) is predicted using the final softmax output layer.

Limitations and Future Work:
Our system performed the sentiment polarity identification task successfully; however, a few comments were misclassified. The model could be improved by fusing another model with the BiLSTM. Additionally, the current system only handles comments given in Roman Urdu. It was noticed that most students give feedback in both English and Roman Urdu, so the system could be further expanded to process bilingual (English and Roman Urdu) reviews.

Figure 2. Stemmer mapping function for Roman Urdu

Figure 3. Proposed Roman Urdu Sentiment Analysis Model

RMSprop and ADAM were used as optimizer functions with the BiLSTM model; ADAM was slightly better than RMSprop given this dataset's features. Finally, categorical cross-entropy is used with the softmax layer to predict values close to the three groups: positive, negative, and neutral.
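The softmax output and categorical cross-entropy loss described above can be sketched in NumPy as follows; the logits are an illustrative example, not model outputs.

```python
import numpy as np

def softmax(z):
    """Convert raw scores to a probability distribution over classes."""
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def categorical_cross_entropy(y_true, probs):
    """Loss for a one-hot target: -sum(y * log(p))."""
    return -np.sum(y_true * np.log(probs + 1e-12))

logits = np.array([2.0, 0.5, -1.0])   # scores for [positive, negative, neutral]
probs = softmax(logits)
y_true = np.array([1.0, 0.0, 0.0])    # one-hot label: positive
loss = categorical_cross_entropy(y_true, probs)
```

The predicted class is simply the index of the largest probability, which is how the ternary label is read off the output layer.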

Table 1. Existing work on Roman Urdu sentiment analysis (columns: S.no, Title, Year, Dataset, Algorithm, Feature Extraction)
Some techniques perform well on non-sequential data but have not yet shown adequate results when dealing with sequential data 29. In one existing study, the author suggested a multi-channel hybrid technique and concluded that most machine learning techniques for sentiment analysis work better with TF-IDF, whereas customized deep-learning techniques work better with Word2Vec; experiments showed that the proposed LSTM model with Word2Vec performs better, with an accuracy of 75.5%. The existing literature indicates that LSTM and BiLSTM perform better than other machine learning and deep learning classifiers for Roman Urdu sentiment classification, especially when combined with word embedding techniques.

Table 3. Lexical normalization of Roman Urdu data
Fig. 4 depicts the gates: the input gate i_t, the forget gate f_t, and the output gate o_t. A single LSTM memory cell can be expressed mathematically as follows:

f_t = sigmoid(W_f x_t + U_f h_(t-1) + b_f)    (2)
i_t = sigmoid(W_i x_t + U_i h_(t-1) + b_i)    (3)
g_t = tanh(W_g x_t + U_g h_(t-1) + b_g)    (4)
o_t = sigmoid(W_o x_t + U_o h_(t-1) + b_o)    (5)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t    (6)
h_t = o_t ⊙ tanh(c_t)    (7)

Here, c_t and h_t represent the cell state and hidden state, respectively; W is the weight matrix and b the bias for each individual layer. The first step in the LSTM process is to find the data that can be eliminated from the cell state as unnecessary; this decision is taken by the forget layer, also known as the sigmoid layer. Second, two layers are used to decide what new information will be added.
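A single LSTM cell step over these gates can be sketched in NumPy as below; the stacked-parameter layout and random weights are illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step; W, U, b stack the four layers (f, i, g, o)."""
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # the three gates
    c = f * c_prev + i * np.tanh(g)                # update the cell state
    h = o * np.tanh(c)                             # expose the hidden state
    return h, c

rng = np.random.default_rng(1)
d_in, d_hid = 4, 3                       # toy input/hidden sizes
W = rng.normal(size=(4 * d_hid, d_in))   # input weights for all four layers
U = rng.normal(size=(4 * d_hid, d_hid))  # recurrent weights for all four layers
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):   # a sequence of 5 input vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The forget gate f scales down the old cell state, the input gate i admits the candidate tanh values, and the output gate o decides how much of the cell state reaches h, matching the three-step description above.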

The classification results obtained from the model on the RUSA-19 dataset, shown in Table 6, look reasonable, with an accuracy of 0.73 using Word2Vec embedding. The main reason for the performance difference between the two datasets is the colloquial nature of the text: due to the uneven and informal representation of Roman Urdu, the stemming and normalization process discussed in the pre-processing section was performed on the students' feedback dataset to reduce inconsistency and complexity, whereas for the RUSA-19 dataset no such stemmer was developed because of the huge amount of data. The experiments thus show that the stemming process has a positive effect on model performance.

Conclusion
It is very complex to perform sentiment analysis on colloquial data; the most critical issues are feature extraction and classifier design. Pattern recognition is another ability that helps a deep learning model capture semantics, yet most standard deep learning models cannot do this well. Traditional models resolved this issue to some extent, but there is still a need for improvement. The proposed BiLSTM model with enhanced Word2Vec embedding performs better on the Roman Urdu corpus; the enhanced word embedding gives the model more semantic power and improves its capacity for extracting patterns. The study also compared several deep network configurations (CNN, RNN, GRU, and classic LSTM) under pre-trained neural word embeddings to determine which deep learning technique best handles the sentiment analysis task for Roman Urdu. The experiments clearly show that BiLSTM with Word2Vec performs best.

Baghdad Science Journal, 2024, 21(2 Special Issue): 0725-0739. https://doi.org/10.21123/bsj.2024.9822. P-ISSN: 2078-8665, E-ISSN: 2411-7986.