Human Pose Estimation Algorithm Using Optimized Symmetric Spatial Transformation Network



Introduction
Human pose estimation (HPE) research involves detecting the location of key joints in two or three dimensions as a basis for identifying and reconstructing parts of the human body 1,2. Benefiting from a great deal of research in artificial intelligence in the last decade, human pose estimation has gradually become a topical problem in areas such as motion judgment, pose capture, and robotic interaction 3,4.
Reliance on manual labeling is the main feature of traditional human pose estimation algorithms, but their accuracy is poor because they regress directly to the coordinates 5. Driven by the development of Convolutional Neural Networks (CNNs) and publicly available pose datasets, CNN-based pose estimation research has made significant progress since 2013, and both precision and speed have been significantly enhanced compared with traditional algorithms 6. Compared with traditional human posture estimation methods, CNN-based methods can learn more adequate representations from the data, so how to use CNNs to enhance the performance of pose estimation has become a popular research direction 7.

Related Works
HPE is a complicated task that incorporates both two- and three-dimensional estimation algorithms 8. 3D pose estimation builds on 2D methods with the addition of relative depth data, and the same two-dimensional pose might correspond to several three-dimensional ones, which is an inherent ambiguity that must be solved when estimating 3D poses from images or videos 9.
Posture detection is categorized into two types: single-person and multi-person joint point detection 10. Nevertheless, the core of both top-down and bottom-up approaches starts with the detection of single keypoints 11.

Single-Person HPE Methods
Deep learning extracts abstract features of the data layer by layer through a multi-layer neural network structure, which adapts better to different data types and application scenarios 12. In addition, deep learning can be robust to noise, occlusion, and other disturbing factors, which greatly improves the accuracy of joint detection.
Single-person HPE detects the key point information of one person in the scene using three kinds of approaches: the first is based on coordinate regression, the second on heat-map detection, and the third is a hybrid regression-and-detection model (Fig. 1) 13.

Figure 1. Single human posture estimation method summary
According to the technical approach, pose estimation using convolutional neural networks is categorized into two types: coordinate-based regression and heatmap-based regression 14,15.
Coordinate-based regression methods directly regress the coordinates of the key points, whereas heatmap-based methods compute a probability value for each pixel in the image 16. In 2014, DeepPose, proposed by Toshev et al. 17, was the first convolutional-neural-network-based pose estimation model; it learns the locations of key points in an image directly, without a body model or part detector. Predicting key point locations directly from the scene is extremely hard and requires more powerful and robust network models. In 2017, Sun et al. introduced a neural-network regression method built on ResNet-50, which employs a reparameterized pose representation that uses the joint-connection structure to define a compositional loss function encoding long-range interaction information 18. Using only sparse keypoint information lacks robustness, whereas converting heatmap supervision to keypoint supervision retains the benefits of both approaches. Luvizon et al. proposed a pose regression method that uses the Soft-argmax function to transform heat maps into key point locations in a differentiable manner, resulting in an end-to-end training framework 19.
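To make the idea concrete, the following is a minimal sketch of a soft-argmax operation in PyTorch (the framework used later in this paper); it illustrates the general technique only and is not Luvizon et al.'s exact implementation.

```python
import torch
import torch.nn.functional as F

def soft_argmax(heatmaps):
    """Differentiable conversion of heatmaps to (x, y) coordinates: a softmax
    over each map followed by the expectation of the pixel grid. A minimal
    sketch of the soft-argmax idea, not the authors' exact implementation.
    heatmaps: tensor of shape (batch, joints, H, W)."""
    b, j, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, j, -1), dim=-1).view(b, j, h, w)
    ys = torch.linspace(0, h - 1, h, device=heatmaps.device)
    xs = torch.linspace(0, w - 1, w, device=heatmaps.device)
    # Expected coordinates under the per-joint probability maps.
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)   # (batch, joints)
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)   # (batch, joints)
    return torch.stack([exp_x, exp_y], dim=-1)   # (batch, joints, 2)
```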
In single-person pose estimation, heatmap-based regression uses heat maps to represent the locations of joint points and then trains a neural network to obtain a detection probability for each pixel. This approach achieves higher accuracy than traditional hand-crafted feature methods, so heatmap-regression-based methods are more popular.
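As an illustration of how heatmap supervision is usually set up, the sketch below builds a Gaussian target map for a single joint; the sigma value and encoding are generic assumptions, not the specific settings of any cited method.

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=2.0):
    """Build a ground-truth heatmap for one joint: a 2-D Gaussian centred on
    the annotated joint location (center = (x, y) in pixels). A generic sketch
    of how heatmap-regression targets are commonly constructed."""
    x = np.arange(width)[None, :]
    y = np.arange(height)[:, None]
    cx, cy = center
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
```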
In 2016, Wei et al. proposed the serialized Convolutional Pose Machine (CPM), which uses multiple VGGNet-based sub-networks to form a cascade that enlarges the receptive field across the network and models the dependencies between key points. The CPM uses the feature maps and confidence maps generated in the previous stage as inputs to the next stage. The method can learn rich implicit spatial models, employs multi-stage outputs, and uses intermediate supervision to resolve the gradient-vanishing problem caused by the depth of the network.

Multi-Person HPE methods
Unlike single-person HPE methods, multi-person methods face greater challenges, since they must not only determine how many people are in the scene and where each person is located but also group the joints belonging to different human bodies. Depending on the solution strategy, multi-person methods are categorized as top-down and bottom-up.

Top-Down Evaluation Method
This kind of methodology first recognizes each individual against the image background with a target detection algorithm and then performs skeletal joint point detection; it therefore consists of two processes: human detection and single-person joint point detection.
Because of this characteristic of the top-down approach, duplicate detections and false estimates are more likely when human bodies occlude each other in the image.

Sun et al. introduced the High-Resolution Network (HRNet), which iteratively fuses features produced by multi-scale sub-networks to produce robust high-resolution representations 21. For more precise key point positioning, Cai et al. 22 introduced a method that effectively fuses features of the same size to obtain fine localized features, retaining rich low-level spatial information conducive to the precise localization of key points.

Bottom-Up Evaluation Methods
This kind of methodology has two constituent processes: key point detection and key point clustering, i.e., it first detects all skeletal joints and then clusters these joints into the corresponding individuals. The DeepCut network proposed by Pishchulin et al. was the first HPE method using the bottom-up strategy 23. The method first locates all human key points, labels each keypoint, and then solves the key point association problem using an Integer Linear Program (ILP). Cao et al. introduced OpenPose, a more efficient multi-person approach that uses heat maps to predict key point coordinates and associates key points with each human body via Part Affinity Fields (PAFs) 24.
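The following is a minimal sketch of the PAF association idea: a candidate limb between two detected keypoints is scored by sampling the affinity field along the connecting segment and measuring its alignment with the limb direction. The function name and sampling scheme are illustrative assumptions, not OpenPose's exact implementation.

```python
import numpy as np

def paf_limb_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Score a candidate limb between keypoints p1 and p2 (x, y tuples) by
    sampling the Part Affinity Field along the segment and measuring the
    alignment with the limb direction. Assumes the sampled points lie inside
    the field; a simplified sketch of the association step."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    v = v / (np.linalg.norm(v) + 1e-8)            # unit limb direction
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * (p2 - p1)).astype(int)
        score += paf_x[y, x] * v[0] + paf_y[y, x] * v[1]   # dot product with limb direction
    return score / num_samples
```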

Newell et al. presented the Stacked Hourglass Network (SHN) 25. This architecture allows each hourglass module to learn information from the previous hourglass module and refine body-related features across scales, so every module generates a complete heat map, thereby improving joint prediction accuracy.
As a CNN-based target detection method, Faster R-CNN 26 integrates feature extraction, proposal generation, bounding-box regression (rectangle refinement), and classification, significantly improving detection performance.
Transformers 27 are currently extremely popular, and a large number of Transformer-based algorithms, such as ViTPose 28, have appeared in HPE; they can capture global information and improve accuracy. However, the disadvantage of these algorithms is that they must perform massive computations on huge human posture datasets, which is highly demanding in terms of computational power. They are therefore not suitable for speed-demanding scenarios, such as 2D video surveillance and motion detection. How to optimize CNN algorithms to extract human body frames quickly while reducing computational cost is the goal of this research.

Methods
The first step of this method introduces parametric pose non-maximum suppression to eliminate redundant pose estimates, the second step uses the SHN to fit the single-person pose, and the third step eliminates similar poses to obtain a unique human pose estimation result; SSD is used as the target detection algorithm for detecting human targets.
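A rough sketch of this flow is given below; all of the callables (detector, stn, shn, pose_nms) are hypothetical placeholders standing in for the SSD detector, the symmetric spatial transformer, the Stacked Hourglass estimator, and parametric pose NMS described in this paper, not an API from the paper's code.

```python
def estimate_poses(image, detector, stn, shn, pose_nms):
    """End-to-end flow sketched from the description above.
    detector: SSD-style person detector, stn: symmetric spatial transformer,
    shn: Stacked Hourglass single-person estimator, pose_nms: parametric
    pose non-maximum suppression. All are hypothetical placeholders."""
    poses = []
    for box in detector(image):          # 1. detect candidate person boxes
        crop, theta = stn(image, box)    # 2. extract a high-quality person region
        heatmaps = shn(crop)             # 3. single-person pose estimation
        poses.append((heatmaps, theta))  # keep the transform to map back
    return pose_nms(poses)               # 4. remove redundant / similar poses
```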
The model takes images of arbitrary size as input and uses an attention-based module to find possible points of interest. It performs multilayer convolutional computation, generating a sliding window over the image to be convolved, and then feeds the result into a fully connected layer for regression. The number of convolutional layers that can be shared in this module is set to seven to balance accuracy and efficiency.
N candidate boxes are extracted from each inference score map S_j. However, when estimating each candidate box, incorrect estimates can occur because of blur or a similar background color. Therefore, integer linear programming is introduced to connect each joint to the correct human body and eliminate erroneous contours. The pose of an individual in a given image is defined as χ = {X_j}, j = 1, ..., J, where J denotes the number of joints (here 14) and X_j denotes the location of the j-th joint within the scene.
The background information from the previous stage is utilized in each following stage to generate a new confidence score (following the convolutional pose machine formulation):

S^t = g_t(F(X), ψ(X, S^{t-1}))

where S^t ∈ R^{w×h×(J+1)} denotes the set of confidence scores for all joints in stage t, ψ(X, S^{t-1}) denotes the feature mapping from the confidence map S^{t-1} to position X, F(X) denotes the image features at position X, and g_t is the predictor of stage t.
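The sketch below shows one way such a refinement stage can be realized in PyTorch: the image features and the previous stage's confidence maps are concatenated and passed through a small convolutional predictor. The layer sizes are illustrative assumptions, not the network used in this paper.

```python
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    """One refinement stage: predicts confidence maps for J joints (+1
    background channel) from image features concatenated with the previous
    stage's confidence maps. A simplified sketch, not the paper's network."""
    def __init__(self, feat_channels, num_joints):
        super().__init__()
        in_ch = feat_channels + num_joints + 1
        self.pred = nn.Sequential(
            nn.Conv2d(in_ch, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_joints + 1, 1),
        )

    def forward(self, features, prev_scores):
        # psi(X, S^{t-1}) is approximated here by channel-wise concatenation.
        return self.pred(torch.cat([features, prev_scores], dim=1))
```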

Data set creation
In this study, MPII Human Pose, a public and widely used benchmark dataset for evaluating human pose estimation, is used to test and refine the system. This dataset contains images obtained from a large number of human action videos, annotated with joint positions and actions. 10,000 images from MPII Human Pose are selected for training the network, and 1,000 images are selected for testing the trained network model.
In addition, data augmentation is performed on the selected datasets to enhance the generalization ability of the network model. In this paper, random horizontal and vertical flipping, random rotation within ±45°, and random scaling in the range [0.7, 1.35] are applied to the images used for training the network model. These data augmentation methods increase the diversity of the samples in the dataset and help avoid overfitting of the model.
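A minimal sketch of this augmentation step is shown below, assuming keypoints are given as an array of (x, y) pixel coordinates; the geometric transform must be applied to the keypoints as well as the image. The OpenCV-based implementation is an illustration of the technique, not the paper's code (vertical flipping would mirror the y-coordinates analogously).

```python
import numpy as np
import cv2

def augment(image, keypoints, max_rot=45.0, scale_range=(0.7, 1.35), flip_prob=0.5):
    """Random flip, rotation (within ±45°) and scaling (0.7-1.35) applied
    jointly to an image and its (J, 2) keypoint array. Illustrative sketch."""
    h, w = image.shape[:2]
    # Random horizontal flip (keypoint x-coordinates are mirrored).
    if np.random.rand() < flip_prob:
        image = image[:, ::-1].copy()
        keypoints = keypoints.copy()
        keypoints[:, 0] = w - 1 - keypoints[:, 0]
    # Random rotation and scaling about the image centre.
    angle = np.random.uniform(-max_rot, max_rot)
    scale = np.random.uniform(*scale_range)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    image = cv2.warpAffine(image, M, (w, h))
    ones = np.ones((keypoints.shape[0], 1))
    keypoints = np.hstack([keypoints, ones]) @ M.T   # apply the same affine map
    return image, keypoints
```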

Experimental platform and performance evaluation index
The experimental environment is set up on an Ubuntu 16.04 system with an Intel(R) Xeon Silver 4110 CPU, an NVIDIA GeForce RTX 3080 Ti GPU, and the PyTorch deep learning framework.
The Object Keypoint Similarity (OKS) is chosen to measure the similarity between the predicted keypoints and the ground-truth keypoints. It is calculated by the following formula:

OKS_p = Σ_i exp(−d_pi² / (2 S_p² σ_i²)) δ(v_pi = 1) / Σ_i δ(v_pi = 1)

where p denotes the ID of the human body; i denotes the ID of the key point; d_pi represents the Euclidean distance between the predicted i-th key point of the p-th human body and the corresponding labeled key point; S_p denotes the scale of the p-th body, derived from the area occupied by its target boundary; σ_i is the key point normalization factor, which is a constant; v_pi denotes whether the i-th key point of the p-th human body is visible; and δ is the selection function.
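The sketch below computes this OKS value for one person, directly following the formula above; the array shapes and parameter names are illustrative.

```python
import numpy as np

def oks(pred, gt, visible, scale, sigmas):
    """Object Keypoint Similarity for one person. pred, gt: (J, 2) arrays of
    keypoint coordinates; visible: (J,) 0/1 visibility flags (v_pi);
    scale: S_p; sigmas: (J,) per-keypoint normalization constants (sigma_i)."""
    d2 = np.sum((pred - gt) ** 2, axis=1)                    # d_pi^2
    e = np.exp(-d2 / (2.0 * scale ** 2 * sigmas ** 2))       # per-keypoint similarity
    mask = visible > 0                                       # delta(v_pi = 1)
    return e[mask].sum() / max(mask.sum(), 1)
```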

Training and Testing
During training, Adam was chosen as the optimizer, the initial learning rate was set to 0.0001 and reduced by a factor of 0.5 every 20 iterations, and, given the relatively small dataset, the batch size was set to 8 for 80 rounds of training iterations.
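A minimal PyTorch sketch of this schedule is shown below; the model is a placeholder, and the decay policy (multiplying the learning rate by 0.5 every 20 epochs) is our reading of the description above rather than a detail confirmed by the paper.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder standing in for the pose network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                 # initial lr 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # x0.5 every 20 epochs

for epoch in range(80):              # 80 training epochs, batch size 8 in the DataLoader
    # ... iterate over the DataLoader, compute the loss, backpropagate, optimizer.step() ...
    scheduler.step()
```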
In the testing phase, the Stacked Hourglass Network (SHN) is selected as the single-person HPE network and the SSD algorithm is selected to detect human targets. To ensure that the target frame obtained by the object detection algorithm covers the entire body area, the detected human target frame is extended by 15% in both the height and width directions.
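The 15% padding can be implemented as a simple box expansion, as sketched below; whether the 15% is split across both sides or applied per side is an assumption of this sketch.

```python
def expand_box(x1, y1, x2, y2, ratio=0.15):
    """Expand a detected person box by `ratio` of its width and height (half on
    each side) so it is more likely to cover the whole body."""
    w, h = x2 - x1, y2 - y1
    dx, dy = w * ratio / 2.0, h * ratio / 2.0
    return x1 - dx, y1 - dy, x2 + dx, y2 + dy
```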

Results and Discussion
On the MPII 30 dataset, the precision of the proposed strategy improved by 0.8% over DeepCut, and the computation time decreased from 57,995 s to 10 s. To assess the impact of the detector's recognition precision on HPE, the accuracy improved significantly from 49.3% to 76.9% when the true position of the detected human body was given, showing that a superior human detector would further improve the human pose estimation results.
Another set of human detection algorithms and single-person HPE networks is included to demonstrate the robustness of the algorithm presented in this paper.

Conclusion
Although pose estimation research has made substantial advances in recent years thanks to the persistent work of numerous researchers, the existing algorithms still face significant challenges. This paper focuses on how to optimize the pose estimation algorithm to solve the problem of key point localization being invalidated by the performance of the detection algorithm, and how to increase precision and accuracy in crowded scenes. An optimized symmetric spatial transformation network algorithm is proposed for the case of inaccurate and redundant detections from human target detection algorithms; it integrates a symmetric spatial transformation module with a parallel pose estimation network to extract a high-quality human region proposal from an inaccurate human bounding box, and introduces parametric pose non-maximum suppression to eliminate redundant pose estimates. The experimental results indicate that the methods presented in this paper effectively enhance the accuracy of human body pose estimation. In the future, HPE will occupy an increasingly important position in real life as an important part of computer vision.

Figure 2. Performance of the proposed algorithm

Faster R-CNN and the CPN pose estimation algorithm are chosen to compare the performance of the original algorithms and of the same algorithms after adding the framework proposed in this paper, and the results are shown in Fig. 2, with "*" denoting the addition of the algorithmic framework presented in this paper. AP means Average Precision and mAP means mean Average Precision (the average of the precision over all categories). AP@0.5 (0.75) represents the average precision with an IOU (Intersection-Over-Union) greater than 0.5 (0.75). In pose estimation, AP is defined using OKS; for example, AP0.5 denotes AP at OKS = 0.50.

Figure 3. Relationship of the experimental schemes designed to test each group of experimental schemes

Comparing experimental schemes a, c, and e, it can be seen that the accuracy of the algorithm decreases by 2.2% when the symmetric spatial transformation network and the parallel single-person pose estimation network are missing.

The detection loss is defined as

L({p_i}, {t_i}) = Σ_i L_c(p_i, p_i*) + α Σ_i p_i* L_r(t_i, t_i*) + β

where i is the index of an anchor in each bounding box, p_i is the predicted probability that the box with index i contains an object, p_i* is the ground-truth label of that anchor, t_i is a vector representing the parameterized coordinates of the predicted bounding box, and t_i* is the parameterized coordinates of the ground-truth box associated with a positive anchor. The term α is a balancing parameter that weights the result L, the term β is an optional bias hyperparameter, L_c is the logarithmic (classification) loss, and L_r is the regression loss.
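A hedged PyTorch sketch of this two-term loss is shown below; the choice of binary cross-entropy for L_c, smooth L1 for L_r, and the normalization are assumptions consistent with common detection losses rather than details given by the paper.

```python
import torch
import torch.nn.functional as F

def detection_loss(p, p_star, t, t_star, alpha=1.0, beta=0.0):
    """Sketch of the two-term detection loss described above.
    p: (N,) predicted object probabilities; p_star: (N,) 0/1 labels (float);
    t, t_star: (N, 4) predicted / target box parameterisations.
    The exact forms of L_c, L_r and the normalisation are assumptions."""
    l_c = F.binary_cross_entropy(p, p_star)                       # L_c, logarithmic loss
    # The regression term is only applied to positive anchors (p_i* = 1).
    l_r = (p_star.unsqueeze(1) * F.smooth_l1_loss(t, t_star, reduction='none')).mean()
    return l_c + alpha * l_r + beta
```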
In the second stage, a fully convolutional network is used to predict the heat map and offsets of the body in each body bounding box, and the precise locations of the joint points are obtained by fusing the heat map and the offset maps. The offset maps represent the offset from each pixel position of the heat map to the correct joint position (the offsets occupy one channel each for the x-coordinate and the y-coordinate). The three channels are fused so that the predicted offsets vote for the true joint position, as described below.
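The sketch below illustrates one way such a heat map/offset voting step can be implemented for a single joint; the rounding-and-accumulate scheme is a simplification of the fusion described above, not the paper's exact formula.

```python
import numpy as np

def fuse_heatmap_offsets(heatmap, offset_x, offset_y):
    """Vote for a joint location by letting every pixel cast its heatmap score
    at the position it points to through the offset maps, then taking the
    argmax of the accumulated votes. heatmap, offset_x, offset_y: (H, W)
    arrays for one joint. A simplified sketch of the fusion step."""
    h, w = heatmap.shape
    votes = np.zeros_like(heatmap)
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.clip(np.round(xs + offset_x).astype(int), 0, w - 1)
    ty = np.clip(np.round(ys + offset_y).astype(int), 0, h - 1)
    np.add.at(votes, (ty, tx), heatmap)              # accumulate the weighted votes
    y, x = np.unravel_index(np.argmax(votes), votes.shape)
    return x, y
```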