The Effect of Optimizers on the Generalizability of Additive Neural Attention for the Customer Support Twitter Dataset in a Chatbot Application

Abstract
When optimizing the performance of neural network-based chatbots, the choice of optimizer is one of the most important aspects. Optimizers control the adjustment of model parameters, such as weights and biases, to minimize a loss function during training. Adaptive optimizers such as ADAM have become a standard choice and are widely used because the magnitudes of their parameter updates are invariant to rescaling of the gradients, but they often pose generalization problems. Alternatively, Stochastic Gradient Descent (SGD) with Momentum and ADAMW, an extension of ADAM, offer several advantages. This study aims to compare and examine the effects of these optimizers on the chatbot Customer Support on Twitter (CST) dataset. The effectiveness of each optimizer is evaluated based on its sparse categorical loss during training and its BLEU scores in the inference phase, using a neural generative attention model with an additive scoring function. Despite memory constraints that limited ADAMW to ten epochs, this optimizer showed promising results compared to configurations using early stopping. SGD provided higher BLEU scores for generalization but was very time-consuming. The results highlight the importance of finding a balance between optimization performance and computational efficiency, positioning ADAMW as a promising alternative when training efficiency and generalization are primary concerns.


Introduction
Integrating artificial intelligence (AI) through the use of neural networks is a widely used approach in various fields such as object and speech recognition, healthcare, and business, including chatbots. Chatbots based on neural networks typically aim to find the best function approximation by finding the network parameters that minimize an error function on the training data 1. An error function measures how accurate the output of a model is compared to the actual output (target values). To improve the output (response), these parameters (weights) have to be optimized using optimization functions. Such parameters can be learned by training on labeled data (target values); the error is then measured by comparing each prediction y with the actual output (target values). The measurement of this error is associated with a loss or cost function 1. To find an optimal weighting that minimizes the loss function, the backpropagation algorithm can be used, adjusting the parameters along the gradients of the loss function.

Backpropagation is an algorithm for computing the gradients of the loss with respect to the model parameters from the output using the chain rule 1, and it underlies gradient-based techniques for training neural models. However, plain gradient-based training is limited in its ability to find solutions that generalize well. This limitation has led to the investigation of other optimization algorithms, such as ADAM and its variant ADAMW with decoupled weight-decay regularization, which are known for their strong performance. The efficiency of these chatbots in simulating human dialogue largely depends on the optimal tuning of the neural network weights, which is usually achieved by gradient-based algorithms built on backpropagation.
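As a brief illustration of how the gradients computed by backpropagation drive a parameter update (a generic TensorFlow sketch with made-up data, not the training code of this study), a single optimization step might look as follows:

```python
import tensorflow as tf

# Toy model and data; layer sizes, shapes, and values are illustrative only.
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    y_pred = model(x, training=True)   # forward pass
    loss = loss_fn(y, y_pred)          # scalar loss value

# Backpropagation: gradients of the loss w.r.t. every trainable weight,
# obtained by applying the chain rule from the output backwards.
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```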
The optimizer determines how the network is updated based on the loss function. An optimizer links the loss function and the model parameters by updating the parameters in response to the output of the loss function; in other words, optimizers help minimize the loss function. There are two broad types of optimizers: gradient descent-based and adaptive optimizers. The distinction is operational: the learning rate is adjusted manually in gradient descent algorithms such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, whereas it is adapted automatically in adaptive algorithms such as Adagrad, Adadelta, RMSprop, ADAM, ADAMW, and ADAMAX, to name a few, as shown in Fig. 1.

Figure 1. Optimizer Categorization.
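To make the categorization concrete, the sketch below instantiates one optimizer from each family in Keras (assuming TensorFlow 2.11 or later, where AdamW is built in; the learning-rate and weight-decay values are illustrative only, not the settings used in this study):

```python
import tensorflow as tf

# Gradient-descent family: the learning rate (and momentum) are chosen manually.
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Adaptive family: per-parameter step sizes are adjusted automatically from
# running estimates of the first and second moments of the gradients.
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
adamw = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)
```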
Among commonly used optimizers, adaptive gradient-based methods such as ADAM have shown potential for performance improvements over SGD in some scenarios and have become the default choice in most studies 2,3. However, recent studies show that ADAM, which is known for its scale-invariant parameter updates, is often criticized over concerns about its generalization performance compared to SGD in image classification 2,4.
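For reference, the standard ADAM update and the decoupled ADAMW update can be stated as follows (a simplified restatement of the usual formulations, not equations reproduced from this paper); the only difference is where the weight-decay term λθ enters:

```latex
% Shared moment estimates, with g_t the gradient used by the optimizer:
%   m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,   v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
%   \hat{m}_t = m_t / (1-\beta_1^t),           \hat{v}_t = v_t / (1-\beta_2^t)
\begin{aligned}
&\text{ADAM, with L2 decay folded into the gradient } \bigl(g_t = \nabla L(\theta_{t-1}) + \lambda\theta_{t-1}\bigr):
&& \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \\
&\text{ADAMW, with decay applied directly to the weights } \bigl(g_t = \nabla L(\theta_{t-1})\bigr):
&& \theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} + \lambda\,\theta_{t-1}\right)
\end{aligned}
```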
Although ADAMW, a variant in which the weight decay is decoupled from the parameter-wise step size, presents an interesting alternative, there are few comparative studies between these optimizers. Therefore, this study aims to compare and investigate the effects of the optimizers SGD with Momentum, ADAM, and ADAMW on the text chatbot CST dataset. The objective is to evaluate their performance based on the training and validation losses and on the BLEU scores for different search strategies, to gain insight into the balance between optimization performance and computational efficiency. By revealing the performance nuances of these optimizers, this study seeks to guide the choice of optimization techniques in the development of neural network-based chatbots and so improve their conversational quality and practicality. The structure of this paper is outlined as follows: Section 2 presents the methodology of our experiments, the results obtained from the experiments are reported and discussed in Section 3, and finally, Section 4 summarizes the research findings and suggests directions for future studies.

Materials and Methods
This section provides an overview of the methodological approach used to study the neural generative attention mechanism of the seq2seq model. The seq2seq learning model is generally based on an encoder-decoder architecture consisting of three parts: an encoder, a context vector (the final hidden/internal state vector), and a decoder. To improve the performance of this structure, an additional attention layer and a bi-LSTM encoder are adopted. Before this model is trained, several preprocessing steps are required for the current experimental study. The first step is splitting the initial dataset into a training set and a test set: 75% of the data is used for training and the remaining 25% for the validation/test set. In this study, the publicly available dataset "Customer Support on Twitter (CST)" from Kaggle was used to train and evaluate the models. The dataset is then prepared for modeling, a process that includes preprocessing and feature extraction. For feature extraction, a transfer learning approach was adopted by using FastText pre-trained word embeddings to speed up training and increase model performance 5; this approach transfers knowledge between networks trained on different datasets. The result of this step is fed into the neural generative attention model, which is trained on the training set so that the predicted responses match the ground-truth answers. The training process can be represented as minimizing a loss function L(θ), where θ represents the model parameters. The objective is to find the optimal θ that minimizes the difference between the predicted response and the ground truth, which can be defined mathematically by Eq. 1:

L(θ) = (1/N) Σ_{i=1}^{N} L(y_i, ŷ_i)    (Eq. 1)

where L(θ) is the average loss over the training set, N is the number of examples in the training set, y_i is the ground truth for example i, ŷ_i is the predicted response for example i generated by the model, and L(y_i, ŷ_i) is the loss for example i, calculated using a loss function suitable for the problem at hand, in this case the sparse categorical cross-entropy loss.
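As a small, self-contained sketch of Eq. 1 with the sparse categorical cross-entropy used here (token IDs as targets, a probability distribution over the vocabulary as prediction; the array values below are invented for illustration, not taken from the CST dataset):

```python
import numpy as np

def sparse_categorical_ce(y_true_ids, y_pred_probs, eps=1e-12):
    """Per-example loss: negative log-probability assigned to the true token ID."""
    rows = np.arange(len(y_true_ids))
    return -np.log(y_pred_probs[rows, y_true_ids] + eps)

# Toy batch: N = 3 target token IDs and predicted distributions over a 4-word vocabulary.
y_true = np.array([2, 0, 3])
y_pred = np.array([[0.1, 0.2, 0.6, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.1, 0.5]])

# Eq. 1: L(theta) = (1/N) * sum_i L(y_i, y_hat_i)
per_example = sparse_categorical_ce(y_true, y_pred)
average_loss = per_example.mean()
print(average_loss)
```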
The optimization process to minimize L(θ) can be performed using a gradient-based optimizer such as SGD or adaptive methods such as ADAM and ADAMW. These methods iteratively update the parameters θ based on the gradient of the loss function with respect to θ 6. The iterations continue until a stopping criterion is met, e.g., a predefined number of epochs or until the change in L(θ) falls below a certain threshold. The final result is an optimized set of parameters that can be used to make predictions that are very close to the ground truth. Finally, the validation or test dataset is prepared accordingly and used to evaluate the models. Fig. 2 illustrates the methodology steps.

Figure 2. Illustration of the Methodology Steps.

For the optimization, the configurations were compared with a learning rate of 0.003 8. The hyperparameter learning rate feeds into the optimization function. In the case of the SGD optimizer, only the momentum that accelerates gradient descent was varied, with momentum ∈ {0, 0.9}, where 0 represents vanilla gradient descent and 0.9 represents the conventional setting 9. Gradient clipping at 50.0 was also added to counteract the 'exploding gradient' problem; in this way, the gradients are prevented from growing exponentially and either overflowing (producing undefined values) or overshooting cliffs in the cost function. All weights and biases are initialized using the Xavier uniform distribution of Glorot and Bengio (2010) 10. 300-dimensional FastText pre-trained word embeddings were used. An early stopping technique with a patience of 5 was also employed to prevent overfitting. However, memory constraints prevented the ADAMW configuration from being trained under the early stopping setup; therefore, ADAMW was trained for ten epochs without early stopping in this study. The hyperparameters used for training the models are listed in Table 1.
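A condensed sketch of how the training setup described above could be expressed in Keras follows (the paper does not provide its implementation, so the model builder, the weight-decay value, and the number of epochs for the early-stopping configurations are placeholders; assumes TensorFlow 2.11 or later):

```python
import tensorflow as tf

CLIP_NORM = 50.0       # gradient clipping to counter exploding gradients
LEARNING_RATE = 0.003  # learning-rate value referred to in the text (illustrative here)
EMBED_DIM = 300        # FastText pre-trained embedding size

optimizers = {
    "sgd_momentum": tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE,
                                            momentum=0.9, clipnorm=CLIP_NORM),
    "adam": tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, clipnorm=CLIP_NORM),
    "adamw": tf.keras.optimizers.AdamW(learning_rate=LEARNING_RATE,
                                       weight_decay=1e-4,  # illustrative value
                                       clipnorm=CLIP_NORM),
}

# Xavier/Glorot uniform initialization for weights and biases, as stated in the text.
initializer = tf.keras.initializers.GlorotUniform()

# Early stopping with a patience of 5 (used for the SGD and ADAM configurations).
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                  restore_best_weights=True)

# model = build_seq2seq_with_additive_attention(...)  # hypothetical builder, not from the paper
# model.compile(optimizer=optimizers["adam"], loss="sparse_categorical_crossentropy")
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stopping])
# For ADAMW, the text states that ten epochs were run without early stopping:
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```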

Results and Discussion
In this section, the experimental results of the model on the aforementioned dataset are presented. The experiment evaluated the performance of the different optimizers on the neural additive attention model with the pre-trained FastText embeddings as input features. Table 2 and Fig. 3 show the performance results of the different optimizers based on the sparse categorical cross-entropy loss during training and the BLEU score metric in the inference phase. Due to memory issues, only ten epochs were run for Config 2, while an early stopping technique was used for the other configurations during the training phase. The results show that ADAM is the most effective optimizer during the training process, achieving the lowest training loss of 1.004115, which means that it converges the fastest during the training phase. On the other hand, SGD recorded the highest training loss of 1.557569, indicating a slower and less effective learning process. However, ADAMW had the lowest validation loss of 1.138623, indicating that it is the most effective at generalizing and performing well on unseen data despite running for only 10 epochs. In addition, ADAMW achieved the highest BLEU score in the beam search scenario, showing that it was able to learn efficiently in a minimal number of epochs. In the inference phase, the BLEU score analysis revealed nuances in the performance characteristics of the different optimizers. The highest BLEU score in greedy search was obtained by the SGD optimizer, indicating better response quality with this search strategy. However, SGD is very time-consuming (almost a week to train a single model), which makes it less practical. This emphasizes the importance of considering multiple aspects when selecting an optimizer, including not only training efficiency but also generalization to unseen data and performance under different inference techniques. Taken together, the results highlight the importance of finding a balance between optimization performance and computational efficiency, positioning ADAMW as a promising alternative when training efficiency and generalization performance are primary concerns.

Figure 3. BLEU Score for Optimizers During the Inference Phase.
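To illustrate how the inference-phase BLEU comparison between greedy and beam-search outputs can be computed (using NLTK; the token sequences below are invented placeholders, not examples from the CST dataset):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical decoded responses for one test query under the two search strategies.
reference = [["please", "dm", "us", "your", "order", "number"]]  # ground-truth tokens
greedy_hyp = ["please", "send", "us", "your", "order", "number"]
beam_hyp = ["please", "dm", "us", "your", "order", "number"]

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences

greedy_bleu = corpus_bleu([reference], [greedy_hyp], smoothing_function=smooth)
beam_bleu = corpus_bleu([reference], [beam_hyp], smoothing_function=smooth)

print(f"greedy BLEU: {greedy_bleu:.3f}  beam BLEU: {beam_bleu:.3f}")
```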

Conclusion
The study aimed to investigate the performance of different optimizers (SGD, ADAM, and ADAMW) on the neural additive attention model with FastText pre-trained embeddings, based on their sparse categorical loss during training and BLEU scores in the inference phase. During the training phase, ADAM proved to be the most efficient optimizer in minimizing the loss. However, this did not directly translate into superior performance in all aspects of the inference phase: ADAMW showed robust generalization and performed well on unseen data despite running for only 10 epochs, especially in beam search, while SGD was competitive in BLEU scores but very time-consuming. These results highlight the need to balance training efficiency with various aspects of validation and search strategies when selecting an optimizer. According to our results, ADAMW is a promising alternative when training efficiency and generalization performance are the main concerns, as it achieved comparable results across all evaluation aspects even though only 10 epochs were used to train the model, without an early stopping technique, due to memory constraints. It can be inferred that training for more epochs would yield a further improvement in model performance.

