Variable Selection Using a Modified Gibbs Sampler Algorithm with Application to a Rock Strength Dataset

Variable selection is an essential task in statistical modeling. Several studies have tried to develop and standardize the variable selection process, but doing so remains difficult. The first question a researcher must ask is which variables are most significant for describing the response of a given dataset. In this paper, a new method for variable selection based on the Gibbs sampler is developed. First, the model is defined and the posterior distributions of all the parameters are derived. The new variable selection method is then tested on four simulated datasets and compared with existing techniques: Ordinary Least Squares (OLS), the Least Absolute Shrinkage and Selection Operator (Lasso), and Tikhonov regularization (Ridge). The simulation studies show that the proposed method outperforms the others in terms of error and time complexity. The methods are also applied to a real dataset, the Rock Strength Dataset, where the new Gibbs-sampler-based approach again proves more powerful and effective than the alternatives. All statistical computations for this paper were carried out in R version 4.0.3 on a single-processor computer.


Introduction:
The relationship between a set of variables, $x_1, x_2, \ldots, x_p$, and a response, $y$, can be expressed by the following linear regression model,
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \quad \forall\, i = 1, \ldots, n. \qquad (1)$$
The system of equations (Eq. 1) can be expressed in matrix notation as
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0_n, \Sigma_{n \times n}), \qquad (2)$$
where $y$ is an $n \times 1$ vector, $X$ is an $n \times (p+1)$ matrix, $\beta$ is a $(p+1) \times 1$ vector, $\varepsilon$ is an $n \times 1$ vector, $0_n$ is an $n \times 1$ vector, and $\Sigma$ is an $n \times n$ matrix. Selecting the explanatory variables that should be used to describe the response variable is generally called feature selection (shrinkage). Feature selection is used to i) remove unimportant variables that do not add any information; ii) reduce computation time by shrinking the data size; and iii) avoid overfitting. Deciding which variables are irrelevant is hard for high-dimensional datasets, while, on the other hand, a model that uses all the explanatory variables is difficult to build and interpret. In such cases, variable selection techniques play an important role. The set of coefficients, $\beta$, expresses whether the explanatory variables are important to the model or not: when a coefficient is zero or very close to zero, the corresponding variable is not significant enough to be included in the model.
Variable selection can be performed using several traditional approaches. For example, Chi-square 1 , ANOVA 2 , and Pearson correlation can be used to measure each variable's impact, and the resulting coefficient values determine whether a variable is important or not. Moreover, forward and backward selection methods choose the best subsets of variables by following a sequence of steps 3 . These methods are slow on large datasets 4 . In this paper, variables are selected based on the influence of their coefficients on the model.
In general, the set of parameters, $\{\beta_i : i = 1, \ldots, p\}$, can be estimated from the $n$ observations using the Ordinary Least Squares (OLS) criterion.
The set of estimated parameters is denoted by $\hat{\beta}$ and defined as follows,
$$\hat{\beta} = \arg\min_{\beta}\; (y - X\beta)^{T}(y - X\beta). \qquad (3)$$
From Eq. 3, $\hat{\beta}$ is the value of $\beta$ that minimizes the squared norm of the error between the observed and estimated values. To derive $\hat{\beta}$, let $h = (y - X\beta)^{T}(y - X\beta)$; expanding $h$ yields
$$h = y^{T}y - y^{T}X\beta - \beta^{T}X^{T}y + \beta^{T}X^{T}X\beta = y^{T}y - 2\beta^{T}X^{T}y + \beta^{T}X^{T}X\beta.$$
Taking the derivative with respect to $\beta$,
$$\frac{\partial h}{\partial \beta} = -2X^{T}y + 2X^{T}X\beta.$$
Setting $\frac{\partial h}{\partial \beta} = 0$ gives $X^{T}X\beta = X^{T}y$, and therefore
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y. \qquad (4)$$
OLS gives unbiased estimates when $X$ and $y$ have an approximately linear relationship, and the estimates have low variance when $n \gg p$. However, in many real datasets, such as health, business, and economic data, the number of explanatory variables can be much larger than the number of samples, $p \gg n$. In that case the OLS solution is not unique, and the estimators can show high variability, which causes poor prediction and overfitting. For these types of datasets, researchers usually use the Ridge and Lasso models to select variables.
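To make the closed-form estimator in Eq. 4 concrete, the sketch below computes it via the normal equations and checks it against R's built-in lm() fit. The simulated data (sample size, covariates, and true coefficients) are illustrative assumptions, not values from the paper.

```r
# Minimal sketch: OLS via the normal equations (Eq. 4), checked against lm().
set.seed(1)
n <- 100; p <- 3                              # illustrative sample size and number of covariates
X <- cbind(1, matrix(rnorm(n * p), n, p))     # design matrix with an intercept column
beta_true <- c(2, -1, 0.5, 0)                 # assumed true coefficients
y <- drop(X %*% beta_true + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y
fit <- lm(y ~ X - 1)                          # same model via lm(); intercept is already in X

cbind(normal_equations = drop(beta_hat), lm = coef(fit))
```

Both columns of the printed comparison should agree up to numerical precision, since lm() solves the same least-squares problem.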
This paper is organized as follows: Section 2 provides background on the Ridge and Lasso models. In Section 3, Bayesian inference is discussed. Markov chain Monte Carlo and the Gibbs sampler are discussed in Sections 4 and 5, respectively. In Section 6, the new variable selection method is applied to simulated datasets. Real data analysis is introduced in Section 7. Section 8 presents results and discussion of the variable selection on the real dataset for the new method and some commonly used methods. In the end, the conclusion is given in Section 9.

Ridge and Lasso
This section reviews two variable selection (shrinkage) methods, Ridge 5 and Lasso. Shrinkage methods can be used under some constraints depending on the size of the dataset. The regression model can be fitted using all $p$ variables, but shrinkage improves accuracy and stability by reducing the number of variables. The Ridge and Lasso models aim to estimate the coefficients of some variables as zero, or close to zero, so that those variables can be excluded from the model. The solution to the OLS fitting procedure is the vector $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ that minimizes the Residual Sum of Squares (RSS),
$$\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}.$$
Similarly, Ridge regression seeks the vector $\hat{\beta}$ that minimizes the penalized RSS,
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2},$$
which is equivalent to minimizing the RSS subject to $\sum_{j=1}^{p}\beta_j^{2} \le t$, where $t$ is an upper bound on the size of the coefficients. The complexity parameter $\lambda$ is greater than or equal to 0. If $\lambda = 0$, then $\hat{\beta}_{\mathrm{ridge}} = \hat{\beta}_{\mathrm{OLS}}$; as $\lambda \to \infty$, $\hat{\beta}_{\mathrm{ridge}} \to 0$; and $0 < \lambda < \infty$ balances fitting the linear regression model against shrinking the coefficients. The shrinkage penalty is small when $\beta_1, \beta_2, \ldots, \beta_p$ are close to zero 6 .
Unlike OLS, the ridge solutions are not equivariant under scaling of the inputs. As a result, the inputs should be standardized before estimation. First, $\beta_0$ is estimated separately as $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, and the remaining parameters can be estimated using the centered data matrix. Different kinds of penalization can be considered to constrain the size of the OLS estimates. For example, when the $\ell_1$ norm is used as the penalty, the resulting Lasso estimator can exclude variables by setting their coefficients exactly to zero 7 . The reduced model becomes more efficient, especially when the number of variables is much larger than the number of samples, $p \gg n$.
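As an illustration of how Ridge and Lasso shrink coefficients in practice, the sketch below fits both models with the glmnet package (alpha = 0 gives Ridge, alpha = 1 gives Lasso). The simulated data and the cross-validated choice of lambda are assumptions for the example, not part of the paper's procedure.

```r
# Minimal sketch: Ridge (alpha = 0) vs. Lasso (alpha = 1) with glmnet.
library(glmnet)

set.seed(2)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, 0, -2, 0, 1.5, 0, 0, 0)     # several true coefficients are exactly zero
y <- drop(X %*% beta_true + rnorm(n))

ridge_cv <- cv.glmnet(X, y, alpha = 0)        # cross-validated Ridge
lasso_cv <- cv.glmnet(X, y, alpha = 1)        # cross-validated Lasso

# Lasso sets some coefficients exactly to zero; Ridge only shrinks them toward zero.
cbind(ridge = as.numeric(coef(ridge_cv, s = "lambda.min")),
      lasso = as.numeric(coef(lasso_cv, s = "lambda.min")))
```

The printed comparison shows the key difference exploited for variable selection: the Lasso column contains exact zeros, while the Ridge column does not.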

Bayesian Inference of a Multivariate Linear Regression
Eq. 2 is used to apply the Gibbs sampler to the Multivariate Linear Regression (MLR) model. From Eq. 2 it follows that $y \sim N(X\beta, \sigma^{2} I_n)$. Therefore, the likelihood function, denoted by $L(\beta, \sigma^{2})$, can be expressed as
$$L(\beta, \sigma^{2}) \propto (\sigma^{2})^{-n/2}\exp\!\Big(-\frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}}\Big).$$
The prior for $\beta$ is chosen to be a multivariate normal distribution with mean $0$ and covariance matrix $\tau_0 I_{p+1}$, i.e., $\beta \sim N(0, \tau_0 I_{p+1})$, where $\tau_0$ is usually chosen to be a large positive value that leads to a large (diffuse) prior variance 8 . The prior for $\sigma^{2}$ is chosen to be inverse gamma, i.e., $\sigma^{2} \sim IG(a_0, b_0)$, where $a_0$ and $b_0$ are initial values that can be any positive numbers. The conditional posterior distribution for $\beta$ can be written as
$$p(\beta \mid \sigma^{2}, y) \propto \exp\!\Big(-\frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}} - \frac{\beta^{T}\beta}{2\tau_0}\Big).$$
Consider $b = X^{T}y/\sigma^{2}$ and $A = X^{T}X/\sigma^{2} + I_{p+1}/\tau_0$, so the posterior distribution for $\beta$ can be written as
$$\beta \mid \sigma^{2}, y \sim N(A^{-1}b,\, A^{-1}). \qquad (7)$$
Eq. 7 is an MVN density with $\mu = A^{-1}b$ and $\Sigma = A^{-1}$. Similarly, the full conditional posterior for $\sigma^{2}$ is
$$p(\sigma^{2} \mid \beta, y) \propto (\sigma^{2})^{-(n/2 + a_0 + 1)}\exp\!\Big(-\frac{1}{\sigma^{2}}\Big(\frac{(y - X\beta)^{T}(y - X\beta)}{2} + b_0\Big)\Big).$$
The above function is the density of the inverse gamma distribution with shape $n/2 + a_0$ and rate $(y - X\beta)^{T}(y - X\beta)/2 + b_0$, i.e.,
$$\sigma^{2} \mid \beta, y \sim IG\Big(\frac{n}{2} + a_0,\; \frac{(y - X\beta)^{T}(y - X\beta)}{2} + b_0\Big). \qquad (8)$$
So far, the posterior distributions for $\beta$ (Eq. 7) and $\sigma^{2}$ (Eq. 8) have been derived. Hence, the estimated values of $\beta$ and $\sigma^{2}$ can be found by calculating their posterior sample means. This can be done using Markov chain Monte Carlo (MCMC) without calculating the marginal likelihood of $\beta$ and $\sigma^{2}$. In the following section, a brief discussion of MCMC is given, and then a particular case of MCMC, the Gibbs sampler, is explained in detail.
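A small sketch of the two full-conditional draws follows. It assumes the priors and notation above; the hyperparameter values passed to the functions (tau0, a0, b0) are placeholders, and MASS::mvrnorm is used for the multivariate normal draw.

```r
# Minimal sketch: one draw from each full conditional (Eq. 7 and Eq. 8).
library(MASS)   # for mvrnorm()

draw_beta <- function(y, X, sigma2, tau0) {
  A <- crossprod(X) / sigma2 + diag(ncol(X)) / tau0   # A = X'X / sigma^2 + I / tau0
  b <- crossprod(X, y) / sigma2                       # b = X'y / sigma^2
  A_inv <- solve(A)
  mvrnorm(1, mu = A_inv %*% b, Sigma = A_inv)         # beta | sigma^2, y ~ N(A^{-1} b, A^{-1})
}

draw_sigma2 <- function(y, X, beta, a0, b0) {
  resid <- y - X %*% beta
  shape <- length(y) / 2 + a0
  rate  <- sum(resid^2) / 2 + b0
  1 / rgamma(1, shape = shape, rate = rate)           # sigma^2 | beta, y ~ IG(shape, rate)
}
```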

Markov Chain Monte Carlo
MCMC is an essential technique that is used frequently in many statistical applications. In many cases it is challenging to sample directly from a target posterior density, and MCMC is then used 9 . Popular MCMC sampling techniques include Metropolis-Hastings, slice sampling 10 , and Gibbs sampling 11 . MCMC methods are derived from the Monte Carlo (MC) method 12 . In MCMC, a Markov chain is used to approximate samples from the desired distribution, and the approximation generally improves as more steps are taken 13 .

Gibbs Sampler
In Bayesian inference, Gibbs sampling is commonly used to carry out statistical inference without calculating the marginal likelihood function. A high-dimensional problem can be broken down into a number of low-dimensional problems when Gibbs sampling is used: starting from initial values, each parameter (or block of parameters) is sampled in turn from its full conditional distribution given the current values of all the other parameters, and after a burn-in period the draws approximate samples from the joint posterior.
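Putting the two conditional draws together gives the following Gibbs sampler sketch for the regression model of Eqs. 1 and 2. The hyperparameters (tau0, a0, b0), the number of iterations, and the burn-in length are illustrative assumptions rather than the paper's exact settings.

```r
# Minimal sketch: Gibbs sampler for Bayesian linear regression using Eq. 7 and Eq. 8.
library(MASS)   # for mvrnorm()

gibbs_lm <- function(y, X, n_iter = 1000, burn_in = 200,
                     tau0 = 100, a0 = 0.01, b0 = 0.01) {
  n <- nrow(X); k <- ncol(X)
  beta   <- rep(0, k)                      # starting values
  sigma2 <- 1
  beta_draws   <- matrix(NA_real_, n_iter, k)
  sigma2_draws <- numeric(n_iter)

  for (t in seq_len(n_iter)) {
    # beta | sigma^2, y ~ N(A^{-1} b, A^{-1})
    A <- crossprod(X) / sigma2 + diag(k) / tau0
    b <- crossprod(X, y) / sigma2
    A_inv <- solve(A)
    beta <- mvrnorm(1, mu = A_inv %*% b, Sigma = A_inv)

    # sigma^2 | beta, y ~ IG(n/2 + a0, RSS/2 + b0)
    resid  <- y - X %*% beta
    sigma2 <- 1 / rgamma(1, shape = n / 2 + a0, rate = sum(resid^2) / 2 + b0)

    beta_draws[t, ] <- beta
    sigma2_draws[t] <- sigma2
  }
  list(beta   = beta_draws[-seq_len(burn_in), , drop = FALSE],
       sigma2 = sigma2_draws[-seq_len(burn_in)])
}
```

As a usage sketch, `post <- gibbs_lm(y, cbind(1, X))` returns the retained draws; column means of `post$beta` estimate the coefficients and column-wise quantiles give credible intervals.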

Simulation Studies
Four datasets with different numbers of covariates (8, 20, 40, and 60) were generated, each with 1000 cases. The covariates were simulated independently from a normal distribution, and Eq. 1 was then used to generate the response for each dataset. The smallest dataset, with 8 covariates, is discussed in detail, and the results for the other datasets are summarized. The Mean Squared Error (MSE) and time complexity are presented in Fig. 1: Figure 1a shows that Gibbs gives the lowest MSE in all four simulated datasets, and the time consumption is also checked for all datasets. The covariates associated with $\beta_0$, $\beta_2$, $\beta_4$, $\beta_6$, and $\beta_7$ were selected as the most significant because their estimates were close to the true model coefficients, as shown in Table 1. In the Lasso and Ridge methods, however, all the covariates were selected as important variables. Computationally, selecting all the variables as important is inefficient because both the error and the running time increase for large datasets.
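The sketch below shows one way datasets of this kind could be generated and the competing methods timed; the true coefficient values, the use of cross-validated glmnet fits, and the timing loop are assumptions for illustration, not the paper's exact simulation design.

```r
# Minimal sketch: simulating the covariate settings (8, 20, 40, 60) and timing OLS/Ridge/Lasso.
library(glmnet)

simulate_dataset <- function(n = 1000, p = 8, sd = 1) {
  X <- matrix(rnorm(n * p), n, p)              # covariates drawn independently from N(0, 1)
  beta <- rnorm(p)                             # illustrative true coefficients
  y <- drop(X %*% beta + rnorm(n, sd = sd))    # responses generated as in Eq. 1
  list(X = X, y = y, beta = beta)
}

for (p in c(8, 20, 40, 60)) {
  d <- simulate_dataset(p = p)
  t_ols   <- system.time(lm(d$y ~ d$X))["elapsed"]
  t_ridge <- system.time(cv.glmnet(d$X, d$y, alpha = 0))["elapsed"]
  t_lasso <- system.time(cv.glmnet(d$X, d$y, alpha = 1))["elapsed"]
  cat(sprintf("p = %2d  OLS %.2fs  Ridge %.2fs  Lasso %.2fs\n",
              p, t_ols, t_ridge, t_lasso))
}
# gibbs_lm() from the Gibbs sampler sketch above can be added to the same comparison.
```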
Moreover, Table 1 compares the parameters' true values with their corresponding posterior sample means, and 95 percent credible intervals (CIs) are calculated for all the parameters. The parameters that are not considered significant lie in wide credible intervals, whereas all the significant parameters are centered in narrow credible intervals. This indicates that the estimation is reasonable and practical.
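Posterior means and 95 percent credible intervals of this kind can be summarized directly from the retained draws; the small sketch below assumes a matrix of posterior draws such as the `post$beta` output of the `gibbs_lm()` sketch above.

```r
# Minimal sketch: posterior means and 95 percent credible intervals from posterior draws.
summarize_posterior <- function(beta_draws) {
  data.frame(
    mean  = colMeans(beta_draws),
    lower = apply(beta_draws, 2, quantile, probs = 0.025),
    upper = apply(beta_draws, 2, quantile, probs = 0.975)
  )
}
# Example: summarize_posterior(post$beta) for the output of gibbs_lm().
```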

Real Data Analysis
In this section, the Rock Strength Dataset (RSD) is analyzed. RSD contains information on the relationship between 8 predictors, namely %Quartz, %Plagioclase, %K-feldspar (kfds), %Hornblende (hb), grain size, grain area, shape factor (sf), and aspect ratio, and the response, Uniaxial Compressive Strength (UCS), for 30 rock specimens. The dataset was collected from the UCI Machine Learning Repository. Figure 3a shows the heatmap for the 8 predictors: the level of correlation is represented across all samples, with orange indicating high correlation and yellow indicating low correlation. The heatmap alone did not give sufficient information about the data. The dataset was normalized, and the predictors' boxplots were then plotted. In Fig. 3b, some outliers are identified, and they are removed before running the Gibbs and Lasso variable selection methods. The correlation matrix for the 8 predictors in the real dataset (RSD) is given in Fig. 4.
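A preprocessing sketch along these lines is given below. The file name, the column names, and the outlier rule (dropping rows with values beyond the boxplot whiskers) are assumptions for illustration, since the paper does not list its exact preprocessing code.

```r
# Minimal sketch: normalizing the RSD predictors, inspecting correlations, and dropping outliers.
rsd <- read.csv("rock_strength.csv")               # hypothetical local copy of the dataset
X   <- scale(rsd[, setdiff(names(rsd), "UCS")])    # normalize the 8 predictors (assumed response column "UCS")
y   <- rsd$UCS

heatmap(cor(X), symm = TRUE)                       # correlation structure across predictors (cf. Fig. 3a, Fig. 4)
boxplot(as.data.frame(X), las = 2)                 # boxplots of the normalized predictors (cf. Fig. 3b)

# Flag rows containing any value beyond the boxplot whiskers and remove them.
is_outlier <- apply(X, 2, function(v) v %in% boxplot.stats(v)$out)
keep <- rowSums(is_outlier) == 0
X <- X[keep, ]; y <- y[keep]
```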

Results and Discussion of the Variable Selection in RSD
The Gibbs sampler has been used to select the essential variables in RSD. If all the variables are used, the general multivariate model is
$$\mathrm{UCS} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2}).$$
In the beginning, the dataset is normalized. The Gibbs sampler is then run for 1000 iterations to create the posterior samples of the parameters. The prediction performance is checked using leave-one-out cross-validation (LOOCV) (10). In LOOCV, one of the observations is left out and the model's coefficients are estimated with the remaining observations; since the real data have only 30 observations, the procedure is repeated 30 times. It was found that 5 out of the 8 posterior means lie inside the 95 percent credible intervals, and the corresponding variables were selected as important predictors. Therefore, using the Gibbs method, the new model becomes
$$\mathrm{UCS} = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \beta_5 x_5 + \beta_7 x_7 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2}).$$
The OLS, Ridge, Lasso, and Gibbs selection methods were applied to the RSD, and the outputs are shown in Table 2. The Gibbs method gives the smallest MSE and takes less time than the other methods. Table 3 reports the posterior means of the parameters together with their 95 percent credible intervals. Gibbs selects 4 variables as essential, including sf, while Lasso selects six of the variables and Ridge selects all the variables.
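A LOOCV sketch for this 30-observation setting is shown below. It assumes the `gibbs_lm()` function from the earlier sketch and uses the posterior mean coefficients as the point predictor, which is one reasonable reading of the procedure rather than the paper's exact implementation.

```r
# Minimal sketch: leave-one-out cross-validation (LOOCV) MSE for a fitting routine.
loocv_mse <- function(X, y, fit_predict) {
  n <- length(y)
  pred <- numeric(n)
  for (i in seq_len(n)) {
    # Fit on all observations except i, then predict the held-out case.
    pred[i] <- fit_predict(X[-i, , drop = FALSE], y[-i], X[i, , drop = FALSE])
  }
  mean((y - pred)^2)
}

# Example with the Gibbs sketch above: predict with the posterior mean coefficients.
gibbs_predict <- function(X_train, y_train, X_new) {
  post <- gibbs_lm(y_train, cbind(1, X_train))
  drop(cbind(1, X_new) %*% colMeans(post$beta))
}
# loocv_mse(X, y, gibbs_predict)   # repeated 30 times, once per RSD observation
```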

Conclusions:
A new variable selection approach using the Gibbs sampler has been discussed in this article. The posterior distributions for $\beta$ and $\sigma^{2}$ have been derived, and the Gibbs sampler algorithm is used to sample from the corresponding distributions. The simulated datasets show that the Gibbs sampler performs better than the other existing methods (OLS, Lasso, and Ridge). In both the simulated and real datasets, variable selection using Gibbs is faster and gives a smaller error. As shown for the RSD, the new method performs better than the other strategies by selecting only 50 percent of the variables; in contrast, Lasso and Ridge selected 75 percent and 100 percent of the variables, respectively, with lower accuracy and more time consumed.