Using VGG Models with Intermediate Layer Feature Maps for Static Hand Gesture Recognition

Abstract: A hand gesture recognition system provides a robust and innovative solution to nonverbal communication through human–computer interaction. Deep learning models have excellent potential for use in recognition applications. To address the related challenges, most previous studies have proposed new model architectures or have fine-tuned pre-trained models. Furthermore, these studies relied on one standard dataset for both training and testing. Thus, the accuracy reported in these studies is reasonable. Unlike these works, the current study investigates two deep learning models with intermediate layers to recognize static hand gesture images. Both models were tested on different datasets, adjusted to suit the dataset, and then trained under different methods. First, the models were initialized with random weights and trained from scratch. Afterward, the pre-trained models were examined as feature extractors. Finally, the pre-trained models were fine-tuned with intermediate layers. Fine-tuning was conducted on three levels: the fifth, fourth, and third blocks, respectively. The models were evaluated through recognition experiments using hand gesture images in the Arabic sign language acquired under different conditions. This study also provides a new hand gesture image dataset used in these experiments, plus two other datasets. The experimental results indicated that the proposed models can be used with intermediate layers to recognize hand gesture images. Furthermore, the analysis of the results showed that fine-tuning the fifth and fourth blocks of these two models achieved the best accuracy results. In particular, for the first model, the testing accuracies on the three datasets were 96.51%, 72.65%, and 55.62% when fine-tuning the fourth block and 96.50%, 67.03%, and 61.09% when fine-tuning the fifth block. The testing accuracy for the second model showed approximately similar results.


Introduction:
People use many ways to express meaning; they may speak, sign, or write to convey their ideas to others. However, deaf people cannot communicate with others using spoken language. Therefore, they depend on sign language, which incorporates different body and hand movements. Hand gestures represent a primary part of sign language in which different postures of hands have varying meanings. A specific hand posture may mean a single alphabetical letter, a word, or a sentence [1].
Static [2-4] or dynamic [5] hand gestures are commonly used in recognition applications. In static hand gestures, meaning is expressed by hand postures. In contrast, in dynamic hand gestures, hand movements are also involved in conveying meaning. This study is concerned with static hand gestures, wherein each hand gesture represents the meaning of a single alphabetical letter. Hand gesture recognition has many real-life applications, such as facilitating communication with deaf people and enabling new forms of interaction in virtual environments, gaming, and appliance control.
Consequently, many methods and techniques dealing with the hand gesture recognition problem are available [6]. Hand gesture recognition systems can be categorized into two types: sensor-based (also known as glove-based) and image-based (also known as vision-based) systems. There are some limitations and drawbacks to sensor-based systems. In particular, the signer must wear gloves attached to sensors to convey gesture information, and these wearable gloves may be cumbersome for the user. Furthermore, sensors may be expensive. These limitations have led to the use of image-based systems as an alternative to sensor-based systems [6-8].
Similar to other recognition systems, a hand gesture recognition system consists of two parts: feature extraction and classification. Researchers have developed different methods to extract features from digital images and different classifiers. The extracted features are then fed to a classifier to recognize gestures. Traditional methods include artificial neural networks (ANNs), hidden Markov models (HMMs), support vector machines (SVMs), and transform-based models [7, 8].
Recently, deep learning models, including convolutional neural networks (CNNs), have been successfully applied to digital image and speech [9] tasks, prompting researchers to investigate such models in exploring recognition problems. Unlike machine learning approaches that require extracting handcrafted features [10] from data, CNNs automate the process of feature extraction. These models have hierarchical architectures and learn features with various levels of abstraction at each layer [2]. Some researchers [2, 4, 11] have conducted recognition experiments using CNN models that they trained from scratch on hand gesture image datasets. Others [12] have started from pre-trained models and trained only part of the model on the new dataset rather than the whole model. This second method is known as "transfer learning." The benefits of fine-tuning models are that less time is required to train the model, and it can be used even when the dataset size is not large enough to train the model from scratch [12, 13].
Inspired by LeNet-5, a study [11] proposed a CNN model, which the authors trained on a dataset of 39 classes of hand gesture images representing alphanumeric data. In addition, they reported improved recognition results over the traditional methods of k-nearest neighbor and SVM. A model comprising three convolutional layers was introduced [2] to recognize hand gestures with complex backgrounds. The authors evaluated the model on two public datasets: one consisting of 10 different classes and another comprising 24 letter classes acquired under similar lighting conditions. Their results showed that using this model eliminated the need for hand segmentation, which is a challenging task for images with cluttered backgrounds. Overall, they showed promising results for their proposed model. Some authors [3] applied deep learning methods to recognize 24 classes of hand gesture images. They used CNNs and stacked denoising autoencoders to solve the problem. They demonstrated that their models could recognize similar hand gestures with higher recognition rates compared with other methods, such as ANN.
Meanwhile, some studies have utilized existing models to address the problem of hand gesture recognition. Others [14] modified two network architectures based on AlexNet and VGGNet, respectively, to recognize hand gestures. Using a combination of three components (hand detection, hand tracking, and hand recognition), they found that this approach is feasible in practical applications. However, they implemented the model using only six classes of hand gestures. Other researchers [15] used the Inception model for hand gesture recognition, in which they fed the model with depth image data in addition to the color image data acquired by an acquisition device known as Kinect. Using transfer learning to train the last layer of the model on a target dataset of 10 classes, they reported highly accurate results for their model. Some authors [16] proposed a CNN model and fine-tuned pre-trained Visual Geometry Group (VGG) models. They conducted different experiments on a dataset with 33 classes of static hand gestures to evaluate these models. In addition, they reported high accuracy for the proposed model and a further increase in performance with these fine-tuned VGG models. A study [12] used pre-trained VGG-16 and ResNet152 models to recognize 32 different classes of hand gestures. Their reported results revealed that these fine-tuned models had high recognition accuracy during the experiments. Another study [13] examined pre-trained VGG models as feature extractors and fine-tuned them on ear images. Using their methodology of training and recognition, they reported the superior performance of the fine-tuned models compared to other learning methods.
Although recent studies have introduced various pre-trained deep learning models to recognize static hand gestures, these studies have not stated whether one can select an intermediate layer within these models to extract features from hand gesture images. Therefore, this paper aims to fill this research gap. The contributions of this study are as follows:
• This study adopted two VGG models [17] as efficient deep learning models in image classification to conduct recognition experiments on hand gesture images. The models were adjusted to suit the image datasets and then trained under different learning strategies, including fine-tuning intermediate layers of the models. To the best of our knowledge, this strategy has not been previously tackled in similar studies.
• This study introduced a new hand gesture image dataset to evaluate the performance of models under challenging conditions.
• The models are trained on one dataset and tested on various datasets, unlike most previous studies [12, 18-21] in this field, which divided the same dataset [22] into training and testing sets.
The remainder of this paper is organized as follows. The next section explains the datasets used and the adapted VGG models, and describes the training methods that were applied. The results are then discussed in the following section. The details of the models are presented in Table 1, and a comparison with other similar studies is shown in Table 2. The training and validation accuracies are shown in graphs. The final section presents our conclusions.

Materials and Methods:
This study used VGG models with intermediate layer feature maps to recognize static hand gesture images in the Arabic sign language.

Datasets
This study focused on three datasets. The first, known as ArASL [22], is a large, public dataset prepared for image recognition and classification tasks. Fig. 1(a) displays sample images from this dataset. These images are grayscale, 64 × 64 pixels in size, and fall into 32 classes corresponding to the Arabic alphabet letters. However, six classes were excluded from this dataset to match the number of classes in the other test datasets. Furthermore, the number of samples was evenly distributed among the classes, leaving 1293 samples in each class. This dataset was divided into two sets: training (70%) and testing (30%). The second and third datasets were used to evaluate the performance of the models under challenging conditions. This means that all the images of these datasets are used for testing purposes. The second dataset was collected to test the models on different samples, where 30 individuals signed the same gestures as in the first dataset. It contains 840 grayscale images of diverse sizes, of which 50% are of the left hand and the other 50% are of the right hand.
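The 70/30 division of the first dataset can be illustrated with a minimal sketch. It assumes the split is applied per class after shuffling and that the training count is rounded to the nearest integer; neither detail is stated in the text, and `split_class_samples` and the integer sample IDs are hypothetical placeholders.

```python
import random

def split_class_samples(samples, train_fraction=0.7, seed=0):
    """Shuffle one class's samples and split them into train/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# 1293 samples per class, as stated in the text; the IDs are placeholders.
train, test = split_class_samples(range(1293))
print(len(train), len(test))  # 905 388
```

Under these assumptions, each class contributes 905 training and 388 testing images.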

Proposed Model
This study considered two VGG models, known as VGG-16 and VGG-19. These models were created by the Visual Geometry Group (VGG) and have proven to be effective in many computer vision and image classification tasks. Each model takes an image as an input and produces a classification output. The structure of each model consists of five convolutional blocks, followed by fully connected (FC) layers. The convolutional blocks extract features from the image, and the FC layers classify an image according to those features.
The VGG models were adjusted to make them more suitable for handling the image dataset. Minor changes were made between these models, as shown in Table 1.

Training Methods
This study follows the training steps described in a previous work [23]. As shown in Table 1, each model starts with a series of convolutional blocks and ends with a fully connected classifier. First, each model was given random weights and trained from scratch for comparison purposes. Next, the pre-trained convolutional blocks of the model were employed to extract the features. A new classifier was added on top of these blocks and trained from scratch. Then, the top layers of the model were jointly trained with the added classifier. This final training method is called "fine-tuning." Due to the hierarchical learning structure of CNNs, only the top layers of the model are fine-tuned. In this study, the convolutional blocks of the models were fine-tuned under three scenarios: starting from the fifth block, then from the fourth block, and finally from the third block. The steps of fine-tuning are as follows [23]:
1. Add a new layer on top of the pre-trained base network.
2. Make the base network untrainable.
3. Train the new layer.
4. Make some layers in the base network trainable.
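Steps 2 and 4 amount to toggling a per-layer trainable flag. The sketch below shows that logic for the three scenarios (fine-tuning from block five, four, or three). It is an illustration under assumptions, not the authors' code: `unfreeze_from_block` is a hypothetical helper, and the stand-in layers only mimic the `block<N>_conv<M>` / `block<N>_pool` naming used by the Keras VGG implementations.

```python
from types import SimpleNamespace

def unfreeze_from_block(layers, first_trainable_block):
    """Freeze every layer below `first_trainable_block`; unfreeze the rest.

    `layers` is any ordered sequence of objects with `.name` and `.trainable`
    attributes (for example, the `layers` list of a Keras model).
    Returns the names of the layers left trainable.
    """
    prefix = f"block{first_trainable_block}_"
    trainable = False
    for layer in layers:
        if layer.name.startswith(prefix):
            trainable = True  # from the first layer of this block onward
        layer.trainable = trainable
    return [layer.name for layer in layers if layer.trainable]

# Stand-in layers that mimic the VGG-16 convolutional base (assumption).
convs_per_block = {1: 2, 2: 2, 3: 3, 4: 3, 5: 3}
base = []
for block, n_convs in convs_per_block.items():
    for i in range(1, n_convs + 1):
        base.append(SimpleNamespace(name=f"block{block}_conv{i}", trainable=True))
    base.append(SimpleNamespace(name=f"block{block}_pool", trainable=True))

print(unfreeze_from_block(base, 5))
# ['block5_conv1', 'block5_conv2', 'block5_conv3', 'block5_pool']
print(len(unfreeze_from_block(base, 4)))  # 8
```

With a real Keras model, the model would typically need to be recompiled after the flags change, and the added classifier (steps 1 and 3) would be trained before any base layers are unfrozen.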

Training Details
A widespread problem with machine learning models is overfitting. Thus, to alleviate overfitting, some transformation operations were applied randomly to the training images before they were fed to the models. As a result, the model learns various kinds of patterns from the augmented dataset. The following augmentation operations were applied to the images:
1. Rescaling with a 1/255 factor.
2. Random rotation within the range of 40 degrees.
3. Random horizontal and vertical translation within the range of 0.2 of total width or height.
4. Random shearing transformations with a factor of 0.2.
5. Random zooming with a factor of 0.2.
6. Random horizontal flipping.
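As a rough illustration of the operations listed above, the sketch below draws one random set of augmentation parameters per image. It makes several assumptions that the text does not state: ranges are taken as symmetric around zero, zooming is interpreted as a scale factor in [0.8, 1.2], and flipping happens with probability 0.5. The function and key names are hypothetical, and applying the transform to pixels is left out.

```python
import random

def sample_augmentation(rng):
    """Draw one random set of augmentation parameters from the listed ranges."""
    return {
        "rescale": 1.0 / 255,                   # fixed scaling, not random
        "rotation_deg": rng.uniform(-40, 40),   # rotation within 40 degrees
        "shift_x": rng.uniform(-0.2, 0.2),      # fraction of image width
        "shift_y": rng.uniform(-0.2, 0.2),      # fraction of image height
        "shear": rng.uniform(-0.2, 0.2),        # shear factor
        "zoom": rng.uniform(0.8, 1.2),          # assumed meaning of "factor of 0.2"
        "flip_horizontal": rng.random() < 0.5,  # assumed 50% flip probability
    }

rng = random.Random(42)  # seeded only to make the example repeatable
params = sample_augmentation(rng)
print(sorted(params))
```

Because a fresh set of parameters is drawn every time an image is served, the model rarely sees exactly the same input twice, which is what helps against overfitting.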
An early-stopping condition was used to interrupt training when the validation loss was no longer improving. A batch size of 20 and a learning rate of 10⁻⁵ with the RMSProp optimizer were applied. The loss function was categorical cross-entropy, which is suitable for multiclass, single-label classification. All experiments were conducted in the Google Colaboratory environment using the TensorFlow [24] and Keras libraries.
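The stopping rule can be sketched as follows. The text does not give the number of epochs the authors waited before stopping, so `patience` and `min_delta` below are assumed, illustrative parameters, and `early_stopping_epoch` is a hypothetical helper rather than a library function.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training stops, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
print(early_stopping_epoch(losses, patience=3))  # 6
```

In this made-up loss curve, the best value is reached at epoch 3 and training stops at epoch 6, so the later improvement at epoch 7 is never seen; a larger patience trades training time for robustness to such plateaus.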

Results and Discussion
The main purpose of this study was to test the recognition performance of the pre-trained VGG models with different intermediate layers on hand gesture images. These models were tested on the abovementioned datasets, and two popular metrics, accuracy and top-5 accuracy, were evaluated on the validation set. The accuracy measure shows the percentage of correctly classified samples, while the top-5 accuracy indicates the percentage of classifications in which the correct label appears among the five classes with the highest scores. The obtained results are shown in Table 2.
Regarding the training methods, it can be seen from Table 2 that the models achieved the best performance on all datasets when fine-tuning the fourth and fifth blocks. Focusing on these two training cases, the performance on the first and second datasets indicated that block five contained information that was unnecessary for recognizing the images. These results confirmed the experimental results in a previous work [13], in which the authors demonstrated the superior performance of fine-tuning over training from scratch and feature extraction. However, the results also revealed that the performance dropped, particularly with VGG-19, when fine-tuning the third block. This case indicated that the fourth block contained the necessary feature information for the classifier to recognize the images. When the models were trained from scratch, the performance was lower than in the two best cases of fine-tuning. Moreover, this came at the expense of training time because more training iterations were required before training stopped. Similarly, the performance of the models as feature extractors revealed that training the classifier alone was not sufficient to extract feature information.
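The two metrics can be made concrete with a small sketch. The scores and labels below are invented for illustration and use three classes with k = 1 and k = 2 to keep the example short; top-5 accuracy is the same function with k = 5. The name `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(scores, labels, k=5):
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

scores = [
    [0.1, 0.6, 0.3],  # true class 1: highest score
    [0.5, 0.2, 0.3],  # true class 2: second-highest score
    [0.2, 0.7, 0.1],  # true class 2: lowest score
    [0.8, 0.1, 0.1],  # true class 0: highest score
]
labels = [1, 2, 2, 0]
print(top_k_accuracy(scores, labels, k=1))  # 50.0
print(top_k_accuracy(scores, labels, k=2))  # 75.0
```

Plain accuracy corresponds to k = 1, so top-k accuracy can never be lower than accuracy on the same predictions.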

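Table 2. A comparison of accuracy and top-5 accuracy for the two VGG-based models on the three datasets. The results are given in percentages.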
Turning now to the datasets, the models showed variations in their performance, as shown in Table 2. The models were trained on a subset of sample images from the first dataset, and these images showed a high degree of similarity. As expected, these models achieved the highest recognition accuracy when tested on sample images from the first dataset. The second and third datasets created a challenge for the models, as the conditions of their sample images varied. A comparison of the performance with other published studies on the same dataset is shown in Table 3.
Table 4 shows the number of trainable parameters and the training time for each VGG model. The original VGG models were proposed for the ImageNet classification task, which has 1000 classes. However, the datasets in this study have only 28 classes, resulting in fewer parameters in the fully connected layers. As a result, there are fewer trainable parameters in this table than in the original VGG models.
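The effect of the fine-tuning level and the class count on the number of trainable parameters can be estimated with a short calculation. This sketch assumes the standard VGG-16 convolutional configuration (3 × 3 kernels, three input channels) and, for the output layer only, the 4096-unit width of the original classifier; the widths of the adapted FC layers are not given in the text, so the figures below are illustrative and are not the values of Table 4.

```python
# (in_channels, out_channels) of each 3x3 convolution in the standard VGG-16 base
VGG16_BLOCKS = {
    1: [(3, 64), (64, 64)],
    2: [(64, 128), (128, 128)],
    3: [(128, 256), (256, 256), (256, 256)],
    4: [(256, 512), (512, 512), (512, 512)],
    5: [(512, 512), (512, 512), (512, 512)],
}

def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out

def trainable_conv_params(first_block):
    """Convolutional parameters trained when fine-tuning from `first_block` up."""
    return sum(
        conv_params(c_in, c_out)
        for block, convs in VGG16_BLOCKS.items()
        if block >= first_block
        for c_in, c_out in convs
    )

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return (n_in + 1) * n_out

def flattened_features(side, channels=512, n_pools=5):
    """Feature count after five 2x2 poolings of a square input."""
    for _ in range(n_pools):
        side //= 2
    return side * side * channels

for block in (5, 4, 3, 1):
    print(block, trainable_conv_params(block))
# 5 7079424
# 4 12979200
# 3 14454528
# 1 14714688
print(dense_params(4096, 1000), dense_params(4096, 28))      # 4097000 114716
print(flattened_features(224), flattened_features(64))        # 25088 2048
```

Two effects shrink the classifier in this setting: the output layer scales with the number of classes, and a 64 × 64 input leaves far fewer features to flatten than the 224 × 224 ImageNet input.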
Finally, there are some limitations to these experiments that should be mentioned. First, the models were trained on a dataset of images with a high degree of similarity, resulting in low recognition accuracy when testing on the other two datasets. Therefore, there is a need for a hand gesture dataset comprising varied images. Second, the experiments were conducted in an environment that depended upon the speed of the internet connection, which affected the training time. Thus, if the training were performed on a local machine that satisfied the requirements of the experiments, the training time would decrease.

Conclusion:
The hand gesture recognition system aims to forge an interaction between humans and computers. Most previous studies addressed this problem by proposing new model architectures or fine-tuning pre-trained models. In this study, two versions of VGG models were trained on hand gesture images using different strategies. First, the models were trained from scratch, then they were used as feature extractors, and the pre-trained models were finally fine-tuned with intermediate layers. The models were slightly modified and tested on different datasets to create challenging conditions. The first dataset, which was publicly available, was divided into two subsets: one for training and the other for testing. The other two datasets were used for testing purposes only. One of these testing datasets was created as part of this study. The third dataset consisted of colored images, unlike the first and second ones, which consisted of grayscale images.
The experimental results revealed that the best recognition accuracy of these models could be obtained when the models were fine-tuned at the fourth and fifth blocks. Moreover, significantly reducing the number of trainable parameters leads to a reduction in the training time. Furthermore, the accuracies of the models decreased on the second and third datasets because the images in these datasets differed more from the images in the first dataset. Therefore, the proposed models could be used in practical applications if trained on datasets with more varied images. This study differs from other studies in its field because a similar strategy applied to hand gesture images was not found during our investigation. Therefore, future works could extend these experiments to investigate other deep learning models, such as ResNet. Furthermore, there is a need to understand what these layers have learned to determine the most relevant regions of the hand gesture that led the model to make its decision during the recognition process.

These models were changed to receive 64 × 64 images to match the size of the training images, and two fully connected layers were stacked instead of the three layers in the original model. This is because two layers are sufficient for the current recognition task. See Fig. 2.

Figure 2. VGG-based models with configured settings. VGG-16 on the left, and VGG-19 on the right.
Table 1. Details of the VGG models with configured settings.

Figures 3 and 4 show the training and validation accuracies as well as the training and validation losses of the configured VGG models during scratch training, respectively. As shown in Fig. 4, the validation losses were lower than their corresponding training losses. Thus, it can be concluded that the scratch training of these models did not result in overfitting.

Figure 3. Training and validation accuracies for (a) the VGG-16 and (b) the VGG-19 models under scratch training.
Figure 4. Training and validation losses for (a) the VGG-16 and (b) the VGG-19 models under scratch training.