A Modified Support Vector Machine Classifier Using Stochastic Gradient Descent with Application to a Leukemia Cancer Type Dataset

Abstract: Support vector machines (SVMs) are supervised learning models that analyze data for classification or regression. For classification, an SVM is widely used by selecting an optimal hyperplane that separates two classes. The SVM offers very good accuracy and is extremely robust compared with other classification methods such as logistic regression, random forest, k-nearest neighbor, and naïve Bayes. However, working with large datasets can cause problems such as long computation times and inefficient results. In this paper, the SVM is modified by using a stochastic gradient descent process. The modified method, stochastic gradient descent SVM (SGD-SVM), is checked using two simulated datasets. Since the classification of different cancer types is important for cancer diagnosis and drug discovery, SGD-SVM is applied to classify the most common leukemia cancer type dataset. The results obtained with SGD-SVM are more accurate than the results of many studies that used the same leukemia dataset.


Introduction:
For every subject in a given dataset, suppose information on a $p$-dimensional covariate vector $\mathbf{x} \in \mathbb{R}^{p \times 1}$ is available, together with a response that has two possible categories. In statistics, besides SVMs there are many classifier methods such as ANN (1), LDA (2), PCA (3), random forest (4), naïve Bayes (5), and NN (6). SVM is a supervised learning technique: a classifier is built from a training dataset and then used to classify future observations. The goal of the SVM is to create an algorithm so that, given the covariate information for a new observation, the category of the response can be predicted.
For example, define $y = +1$ if the response is in the first category and $y = -1$ if the response is in the second category. The aim is to design a classifier rule $f(\mathbf{x})$ such that $y = +1$ if $f(\mathbf{x}) > 0$ and $y = -1$ if $f(\mathbf{x}) < 0$; this rule can then be used to determine the response category from the covariate information. The SVM uses a geometric procedure that finds the classifier according to an optimization criterion, unlike LDA, which assumes a distribution for $\mathbf{x}$ given its category (7). A linear SVM (hard margin SVM), i.e., $f(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}_1^{\top}\mathbf{x}$ where $\beta_0$ and $\boldsymbol{\beta}_1$ are unknown parameters, creates a hyperplane in the covariate space that acts as a separator between the two response categories. Linearity is a simplifying assumption, and in some cases it works sufficiently well, so the non-linear case does not need to be considered (8). However, if the data are not linearly separable, a non-linear SVM (soft margin SVM) should be assumed (3). The soft margin SVM is a tool for many real-world applications. This paper is organized as follows: first, both the hard and soft margin SVMs are discussed; for the non-linear SVM, several types of kernels are introduced and used; then the modified method, SGD-SVM, is explained and tested on two simulated datasets with 50 and 100 observations; finally, SGD-SVM is applied to a real dataset, the most common cancer type (the leukemia dataset) (9), and compared with some existing methods, namely k-nearest neighbor, random forest, and naïve Bayes.
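To make the decision rule concrete, the following minimal Python sketch (illustrative only, not from the paper; the parameter values are hypothetical) classifies an observation by the sign of the linear score $f(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}_1^{\top}\mathbf{x}$.

```python
import numpy as np

def linear_classifier(x, beta1, beta0):
    """Return +1 or -1 according to the sign of f(x) = beta0 + beta1' x."""
    score = beta0 + np.dot(beta1, x)
    return 1 if score > 0 else -1

# Illustrative values only (not estimates from the paper).
beta1 = np.array([0.8, -0.5])
beta0 = 0.1
print(linear_classifier(np.array([1.0, 2.0]), beta1, beta0))  # -> -1
```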

Hard Margin SVM
The hard margin SVM is the case when the two classes are linearly separable. If $y_i \in \{-1, +1\}$ and $\mathbf{x}_i \in \mathbb{R}^{p}$, then $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is the set of data points. Figure 1a shows the case in which the labels of $D$ are linearly separable. Several hyperplanes (e.g., $H_1$, $H_2$, $H_3$, and $H_4$) can be defined that separate the data points of the two classes (10).
Our goal is to find the best separating hyperplane; i.e., the hyperplane should lie in the middle of the two classes, so that the distance from the closest point on either side to the hyperplane is the same (11). Assume $\mathbf{x}_1$ and $\mathbf{x}_2$ are two points that lie on the optimal hyperplane $H_0: \mathbf{w}^{\top}\mathbf{x} + w_0 = 0$; then $\mathbf{w}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) = 0$, so $\mathbf{w}$ is orthogonal to the hyperplane. Let $\mathbf{x}$ be any point on the hyperplane; then $\mathbf{w}^{\top}\mathbf{x} + w_0 = 0$ implies that $w_0 = -\mathbf{w}^{\top}\mathbf{x}$. To find the distance of a point to the hyperplane, a point can be chosen, say $\mathbf{x}_0$, whose distance to $H_0$ is $|\mathbf{w}^{\top}\mathbf{x}_0 + w_0| / \|\mathbf{w}\|$. Since the data point can be on either side, and $y$ can take a positive or negative sign, the distance of $\mathbf{x}_0$ to the hyperplane is $y_0(\mathbf{w}^{\top}\mathbf{x}_0 + w_0)/\|\mathbf{w}\|$. Therefore, the margin can be defined as $\min_i \, y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0)/\|\mathbf{w}\|$. For any point that is not on the hyperplane, $y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) \geq c$ for some $c > 0$. This implies that $y_i(\mathbf{w}'^{\top}\mathbf{x}_i + w_0') \geq 1$ for $\mathbf{w}' = \mathbf{w}/c$ and $w_0' = w_0/c$.
In Fig. 1b, the points lying on either of the two boundary hyperplanes $H_1$ and $H_2$ are called support vectors, and they correspond to positive Lagrange multipliers $\alpha_i > 0$. All points that are far away from $H_1$ and $H_2$ are not important, so the trained model depends only on the support vectors. For a support vector on either $H_1$ or $H_2$, the constraining condition is
$$y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) = 1, \quad i \in S, \qquad (5)$$
where $S$ is the set of all indices of support vectors, i.e., those corresponding to $\alpha_i > 0$.
Substituting the optimality condition $\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{x}_i$ into the constraint in Eq. (5), and using $\sum_{i \in S} \alpha_i y_i = 0$, the optimal values of $\mathbf{w}$ and $w_0$ satisfy $\|\mathbf{w}\|^2 = \sum_{i \in S} \alpha_i$. So the margin, the distance between $H_1$ (or $H_2$) and the optimal decision hyperplane $H_0$, is $1/\|\mathbf{w}\| = \left(\sum_{i \in S} \alpha_i\right)^{-1/2}$.
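As an illustration (not part of the paper), scikit-learn's `SVC` with a very large value of $C$ behaves like a hard margin SVM on linearly separable toy data, and it exposes the fitted $\mathbf{w}$, $w_0$, and the support vectors directly:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (toy data, not from the paper).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("number of support vectors:", len(clf.support_vectors_))
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
```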

Soft Margin SVM
When the response classes are not linearly separable, the condition of the optimal hyperplane can be relaxed by including an extra slack term $\xi_i \geq 0$ as follows: $y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) \geq 1 - \xi_i$. To get a minimum error, the $\xi_i \geq 0$ should be minimized as well as $\|\mathbf{w}\|$, so the objective function becomes
$$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i^{k} \qquad (9)$$
subject to $y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \dots, n$. Here $C$ is a regularization parameter that controls the trade-off between minimizing the training error and maximizing the margin. A small $C$ emphasizes the margin while ignoring the outliers in the training dataset $D$, while a large $C$ may tend to overfit the training dataset $D$. When $k = 1$, Eq. (9) is called the first norm soft margin, and when $k = 2$, it is called the second norm soft margin. The algorithm based on the first norm soft margin is less sensitive to outliers in the training dataset because it ignores them. In the next sections, the first and second norm soft margins are discussed.
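As a minimal sketch (illustrative code, not the paper's implementation), the soft margin objective in Eq. (9) can be evaluated by computing the slacks as hinge values $\xi_i = \max\{0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0)\}$:

```python
import numpy as np

def soft_margin_objective(w, w0, X, y, C, k=1):
    """Evaluate (1/2)||w||^2 + C * sum(xi^k), with xi = max(0, 1 - y*(w'x + w0))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + w0))
    return 0.5 * np.dot(w, w) + C * np.sum(xi ** k)
```

Setting `k=1` or `k=2` corresponds to the first and second norm soft margins discussed next.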

First Norm Soft Margin
By plugging $k = 1$ into Eq. (9), the primal Lagrangian is formed, and substituting its stationarity conditions into Eq. (11) yields the following dual problem:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^{\top}\mathbf{x}_j \quad \text{subject to } 0 \leq \alpha_i \leq C, \; \sum_{i=1}^{n}\alpha_i y_i = 0 .$$
Solving this problem for $\boldsymbol{\alpha}$ gives the optimal decision hyperplane $\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{x}_i$ and $w_0$, and the margin becomes $1/\|\mathbf{w}\|$.

Second Norm Soft Margin
By plugging $k = 2$ into Eq. (9), the problem becomes
$$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i^{2} \quad \text{subject to } y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) \geq 1 - \xi_i, \; \forall i = 1, \dots, n .$$
Notice that the condition $\xi_i \geq 0$ is dropped, and $\xi_i$ is set to 0 if it is less than or equal to 0; hence the objective function is further reduced. In this case, the primal Lagrangian $L_P(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\alpha})$ is formed as in Eq. (15). Substituting the conditions given in Eq. (12) into Eq. (15), the dual problem $L_D(\boldsymbol{\alpha})$ in Eq. (16) is obtained. Solving Eq. (16) for $\boldsymbol{\alpha}$ gives the optimal values of $\mathbf{w}$ and $w_0$, together with the margin.

Karush-Kuhn-Tucker (KKT) Conditions
In the previous sections, the cases where the datasets are linearly separable were discussed, and the solution was found by solving the dual form of the Lagrangian. This amounts to minimizing a quadratic function subject to a set of constraints. To derive the dual objective function, the following conditions (KKT conditions) should be satisfied: i. stationarity, ii. dual feasibility, iii. complementary slackness, and iv. primal feasibility (13). To minimize $f(\mathbf{w})$ subject to the constraints $g_i(\mathbf{w}) \geq 0 \; \forall i$, the Lagrangian function becomes $L(\mathbf{w}, \boldsymbol{\alpha}) = f(\mathbf{w}) - \sum_i \alpha_i g_i(\mathbf{w})$. If $\mathbf{w}^{\star}$ is optimal for the cost function, the necessary KKT conditions for $\mathbf{w}^{\star}$ to be a local minimum are i. stationarity: $\nabla_{\mathbf{w}} L(\mathbf{w}^{\star}, \boldsymbol{\alpha}) = 0$, ii. dual feasibility: $\alpha_i \geq 0$, iii. complementary slackness: $\alpha_i g_i(\mathbf{w}^{\star}) = 0$, and iv. primal feasibility: $g_i(\mathbf{w}^{\star}) \geq 0$. The primal formulation is not convenient if any of the KKT conditions is not satisfied. In general, if the dataset has $n$ variables, the computational complexity is $O(n^3)$ (14).
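For concreteness, and only as a standard reconstruction (the paper's own numbered equations are not reproduced here), the KKT conditions for the first norm soft margin problem of Eq. (9), with multipliers $\alpha_i$ for the margin constraints and $\mu_i$ for $\xi_i \geq 0$, read:
$$\begin{aligned}
&\text{Stationarity:} && \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad C = \alpha_i + \mu_i,\\
&\text{Dual feasibility:} && \alpha_i \geq 0, \quad \mu_i \geq 0,\\
&\text{Complementary slackness:} && \alpha_i\bigl[y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) - 1 + \xi_i\bigr] = 0, \quad \mu_i \xi_i = 0,\\
&\text{Primal feasibility:} && y_i(\mathbf{w}^{\top}\mathbf{x}_i + w_0) \geq 1 - \xi_i, \quad \xi_i \geq 0 .
\end{aligned}$$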

Non-linear SVM
When a dataset is not linearly separable, the techniques introduced in the previous sections do not converge. However, a hyperplane (decision surface) can be found by transforming the data to a higher-dimensional space using an appropriate kernel. Even though the original dataset is not linearly separable, a hyperplane can be found that separates the mapped datasets (12). Fig. 2 shows an example of transforming data points from a two-dimensional (2D) space to a three-dimensional (3D) space. In 2D, the data points are not linearly separable, but they can easily be separated by a surface after being transformed into 3D.
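To make the kernel idea concrete, the sketch below (illustrative Python, not from the paper) defines the RBF and polynomial kernels and an explicit 2D-to-3D feature map like the one in Fig. 2, and checks that the inner product in 3D equals a degree-2 polynomial kernel evaluated in 2D:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def poly_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x'z + coef0)^degree."""
    return (np.dot(x, z) + coef0) ** degree

def quadratic_map(x):
    """Explicit 2D -> 3D feature map (x1^2, sqrt(2)*x1*x2, x2^2); its inner
    product equals the homogeneous degree-2 polynomial kernel (x'z)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# Kernel trick check: <phi(x), phi(z)> == (x'z)^2
print(np.dot(quadratic_map(x), quadratic_map(z)), np.dot(x, z) ** 2)
```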

Gradient Descent
The goal is to minimize the objective function in Eq. (19), which is convex in $\mathbf{w}$, so the problem is a quadratic optimization problem. In the previous methods, a quadratic programming (QP) technique is used, but it is very slow. Whenever there are no constraints, gradient descent (GD) can be used (19). In general, the gradient is the direction of steepest increase of the function, so the algorithm moves in the opposite direction to reach the minimum, as shown in Fig. 3.
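Eq. (19) itself is not reproduced in this text. A common unconstrained form of the soft margin problem, which the gradient steps sketched below assume, is the regularized hinge loss together with its (sub)gradient:
$$f(\mathbf{w}) = \frac{\lambda}{2}\,\|\mathbf{w}\|^{2} + \frac{1}{n}\sum_{i=1}^{n} \max\bigl\{0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\bigr\},
\qquad
\nabla f(\mathbf{w}) = \lambda\,\mathbf{w} - \frac{1}{n}\sum_{i:\, y_i \mathbf{w}^{\top}\mathbf{x}_i < 1} y_i\,\mathbf{x}_i .$$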

Figure 3. The general strategy for minimizing a function $f(\mathbf{w})$
The general GD strategy for minimizing Eq. (19) starts with an initial value for $\mathbf{w}$, say $\mathbf{w}_0$, and then iterates the update $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t)$ until convergence, where $\eta$ is the step size. GD-SVM is faster than QP, but it is still slow. The GD-SVM procedure is summarized as Algorithm 1.
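Algorithm 1 is not reproduced in this text. The following is a minimal Python sketch of one full-batch GD update scheme, assuming the regularized hinge-loss form of Eq. (19) given above and omitting the intercept for brevity:

```python
import numpy as np

def gd_svm(X, y, lam=0.01, eta=0.1, n_iters=1000):
    """Batch gradient descent on the regularized hinge-loss objective.
    A sketch only; not the paper's Algorithm 1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        margins = y * (X @ w)                    # y_i * w'x_i for every example
        active = margins < 1                     # examples violating the margin
        grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        w -= eta * grad                          # full-batch update
    return w
```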

Stochastic Gradient Descent
Computing $\nabla f(\mathbf{w})$ takes $O(n)$ time, where $n$ is the size of the training dataset, so GD-SVM is slow when $n$ is large. In the GD-SVM algorithm, the value of the objective function is improved at every step. It takes fewer steps to converge, as shown in Fig. 4, but each step takes much longer to compute (20). To speed up the GD-SVM algorithm, the gradient is evaluated for a single training example at a time instead of being evaluated over all examples.
This process is called stochastic gradient descent SVM (SGD-SVM). It improves the value of the objective function noisily; it takes many more updates than gradient descent, but each update is much cheaper to compute, and hence SGD-SVM is faster than GD-SVM. The SGD-SVM algorithm (Algorithm 2) is guaranteed to converge to the minimum of $f$ if the step size $\eta$ is small enough (21).
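Algorithm 2 is likewise not reproduced here. A minimal Python sketch of per-example (stochastic) updates on the same assumed objective is given below; scikit-learn's `SGDClassifier(loss="hinge")` implements a comparable scheme:

```python
import numpy as np

def sgd_svm(X, y, lam=0.01, eta=0.01, n_epochs=20, seed=0):
    """Per-example (stochastic) updates on the regularized hinge loss.
    A sketch in the spirit of SGD-SVM; not the paper's Algorithm 2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_epochs):
        for i in rng.permutation(n):             # visit examples in random order
            if y[i] * (X[i] @ w) < 1:            # margin violated: hinge term is active
                w -= eta * (lam * w - y[i] * X[i])
            else:                                # only the regularizer contributes
                w -= eta * lam * w
    return w
```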

Simulation Studies
To test the GD-SVM and SGD-SVM methods, two simulated datasets are generated with 50 and 100 observations. Each dataset has a set of variables and a response with two classes. In real life, it is unknown whether a dataset is linearly or nonlinearly separable, so, as shown in Fig. 5 and Fig. 6, complex (nonlinearly separable) datasets are generated. GD-SVM and SGD-SVM are applied to the two datasets using different types of kernels; in both methods, the RBF kernel gives the best accuracy on the two datasets. In Tables 1 and 2, GD-SVM and SGD-SVM are compared with respect to the best value of $C$, the number of support vectors, the sensitivity (also called the true positive rate), and the specificity (also called the true negative rate).
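As an illustration of this experimental setup (the toy data and library calls are illustrative assumptions, not the paper's simulation code), a nonlinearly separable dataset can be generated and a hinge-loss SGD classifier on RBF random features, which approximates an RBF-kernel SVM, can be evaluated with sensitivity and specificity as follows:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Nonlinearly separable toy data (a stand-in for the paper's simulated sets).
X, y = make_circles(n_samples=100, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hinge-loss SGD on RBF random features approximates an RBF-kernel SVM.
model = make_pipeline(RBFSampler(gamma=1.0, random_state=0),
                      SGDClassifier(loss="hinge", alpha=0.01, random_state=0))
model.fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```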

Real Dataset
One of the most common cancer types is leukemia, and its diagnosis and classification are complex. For experimental evaluation, the leukemia dataset is used in this section. The dataset was published by Golub et al. in 1999 (9). It comes from a proof-of-concept study showing how gene expression monitoring (via a DNA microarray) can classify new cases of cancer, providing a common approach for assigning tumors to known classes (11). Using this type of dataset, patients can be classified into acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) (13)(23). The complete Golub-Merge dataset is available in the golubEsets package. It has 3051 genes and 72 observations. Working with such large datasets brings many difficulties, such as long computation times and inefficient results (24).

Analyzing Golub Datasets
To analyze the leukemia dataset, the most significant genes for cancer type need to be selected; that is, the genes that are differentially expressed across classes should be considered. Since the goal is to find genes differentially expressed between the two groups, a t-test seems like the common choice. However, the t-test requires a normality assumption, which might not be justified here, and it is not practical to plot a histogram for each of the 3051 genes in each of the two classes to assess normality. The Mann-Whitney U test looks more adequate in this case (25), and it is nearly as efficient as the t-test.
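The Golub-Merge data are distributed as an R/Bioconductor object; purely as an illustrative sketch (the variable names and data layout are assumptions), the per-gene Mann-Whitney U tests with Benjamini-Hochberg adjustment described here could be written as:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def select_genes(expr, labels, alpha=0.05):
    """expr: (n_samples, n_genes) expression matrix; labels: array of 'ALL'/'AML'.
    Returns indices of genes significant after Benjamini-Hochberg adjustment."""
    all_idx, aml_idx = labels == "ALL", labels == "AML"
    pvals = np.array([
        mannwhitneyu(expr[all_idx, g], expr[aml_idx, g],
                     alternative="two-sided").pvalue
        for g in range(expr.shape[1])
    ])
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.where(reject)[0]
```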
After running the test and adjusting the p-values according to the Benjamini-Hochberg method, only 329 significant genes remain. The majority of genes do not seem to have different mean values across classes, and the same result is obtained for the medians: the median differences between classes are clustered around zero (Fig. 8). Only a few genes seem interesting, and they can be easily identified. Fig. 7 provides a quick summary of the selected genes (26). Some important genes are plotted in Fig. 9. As can be seen, the most significant gene IDs influencing the model are G4847, G3252, G1882, G6855, and G2288.

Results and Discussion:
SGD-SVM was applied to the leukemia dataset. A comparison between different kernel functions was carried out; the most common kernels, namely the linear, polynomial, RBF, and MLP kernels, were used. Our method was also compared with some existing methods that used the same dataset for leukemia classification, namely k-nearest neighbor, random forest, and naïve Bayes (13). In Fig. 10, the SGD-SVM models for the leukemia dataset classification are plotted. Only the two most important genes, whose IDs are G3252 and G4847, are used; the expression of these two genes is sufficient to judge whether a new sample belongs to the ALL or the AML class. As shown in the plot (Fig. 10) and the table (Table 3), the best classification accuracy is obtained with the RBF kernel.
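As a hedged sketch of this comparison (the feature matrix `X2` holding the G3252 and G4847 expressions, the labels `y`, and all model settings are illustrative assumptions, and the printed numbers are not the paper's results), a kernel-approximated SGD-SVM can be compared with k-nearest neighbor, random forest, and naïve Bayes by cross-validation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X2: expression of the two selected genes (columns assumed to hold G3252, G4847);
# y: 0/1 labels for ALL/AML. Both are placeholders for the prepared Golub data.
def compare_models(X2, y, cv=5):
    models = {
        "SGD-SVM (RBF approx.)": make_pipeline(StandardScaler(),
                                               RBFSampler(gamma=1.0, random_state=0),
                                               SGDClassifier(loss="hinge", random_state=0)),
        "k-nearest neighbor": KNeighborsClassifier(n_neighbors=5),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "naive Bayes": GaussianNB(),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X2, y, cv=cv, scoring="accuracy").mean()
        print(f"{name}: mean CV accuracy = {acc:.3f}")
```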

Conclusion:
In this paper, the SGD-SVM method was proposed. The method was developed using a stochastic gradient descent process. Two simulated datasets were used to test the performance of the method, and the results showed that SGD-SVM has a higher accuracy rate compared with the regular SVM method. By applying SGD-SVM to the leukemia dataset, it was found that the best accuracy for classifying the two types of leukemia cancer is obtained when the RBF kernel is applied.