K-Nearest Neighbor Method with Principal Component Analysis for Functional Nonparametric Regression

Abstract: This paper proposes a new method for functional nonparametric regression of the conditional expectation when the covariates are functional and Principal Component Analysis (PCA) is used to de-correlate the multivariate response variables. Prediction uses the Nadaraya-Watson estimator in its K-Nearest Neighbour (KNN) form, with two types of semi-metric, based on the second derivative and on Functional Principal Component Analysis (FPCA), to measure the closeness between curves. The Root Mean Square Error (RMSE) is used to assess the model, which is then compared with the independent response method. The R program is used to analyse the data. When the covariates are functional and PCA is used to de-correlate the multivariate response variables, the results are preferable to those of the independent response method. The models are demonstrated on both simulated data and real data.


Introduction
In recent years, nonparametric functional regression has become a topic of growing interest, owing to technological advances in collecting and storing data as curves. Functional data now arise in a growing number of fields, such as biology, engineering, medical science, meteorology, psychology and statistics, among others. Ramsay and Silverman 1 pioneered the area of functional data analysis, which has since become popular, while case studies and applied problems with linear regression and multiple regression (parametric models) are discussed in 2 . For linear regression methods with a functional response and scalar input, see 3,4 . The situation in which both the output and the input are functions, with functional multivariate data estimated by functional multivariate linear regression, was examined by 5-7. The story of nonparametric functional data began with 1 , and nonparametric functional regression models have since been the object of much research.
The existing literature contains a considerable number of theoretical and experimental studies of different models for nonparametric functional data in which the response is an independent response variable and the covariate is functional. For instance, the functional Nadaraya-Watson (NW) estimator, the functional k-nearest neighbour estimator, the functional local linear estimator and the distance-based local linear estimator have been studied [8][9][10][11], together with nonparametric models and the related theory for the situation in which the output and the functional covariates are both functions.
In recent years, multivariate nonparametric functional regression models have also been presented for functional data, for example by 5,[12][13][14], which examined the relationship between multiple scalar responses and functional predictors by using a Gaussian basis function model. Chaouch and Laïb 15 addressed the multivariate response model with functional covariates based on the $L_1$-median regression estimation approach. Wang and Chen 16 used Gaussian process regression with multivariate output and applied principal component analysis to de-correlate the multivariate response, with functional and multivariate covariates. For a nonparametric functional regression model for multivariate longitudinal data with multiple responses, illustrated on different types of data, see 17 .
Omar and Wang 18 extended the independent response method to a multivariate response method with a functional covariate in nonparametric functional regression. Applied to real and simulated data, the multivariate response model gave results preferable to those of the independent response method. The present paper proposes a new model for multivariate response variables with a functional covariate. It uses the K-Nearest Neighbour (K-NN) model with principal component analysis to de-correlate the multivariate responses, and also uses the K-NN method for independent response prediction. In the K-NN model, semi-metrics are used as the measure of closeness between the functional covariates, as detailed in the methodology.
The purpose of this study is to add new results on nonparametric regression of the conditional expectation when the covariate is functional and the response is multivariate. In the literature, multivariate responses de-correlated by principal component analysis with a functional covariate have not been studied before. The performance of the presented method is compared with that of the independent response approach with a functional covariate in the nonparametric setting.
The article is organized as follows. The Methodology section presents the estimation model. The Simulation Study section assesses the performance of the presented method on a simulated example. Real data examples are presented in the Real Application section. Finally, a general conclusion is given in the Conclusion section.

Methodology
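Let $(X_i, Y_i)$, $i = 1, \dots, n$, be $n$ pairs in which $X_i$ is a functional covariate and $Y_i = (Y_{i1}, \dots, Y_{im})$ is an $m$-dimensional response, that is, for $i = 1, \dots, n$ and $j = 1, \dots, m$. The functional Nadaraya-Watson estimator of the regression operator for the $j$th component at a curve $x$ is
$$\hat{r}_j(x) = \frac{\sum_{i=1}^{n} Y_{ij}\, K\!\left(h^{-1} d(x, X_i)\right)}{\sum_{i=1}^{n} K\!\left(h^{-1} d(x, X_i)\right)},$$
where $K$ is a kernel and $h$ is a bandwidth (depending on $n$). The KNN estimator, which determines the bandwidth from the number of neighbours $k$, is defined by
$$\hat{r}_{j,\mathrm{kNN}}(x) = \frac{\sum_{i=1}^{n} Y_{ij}\, K\!\left(h_k(x)^{-1} d(x, X_i)\right)}{\sum_{i=1}^{n} K\!\left(h_k(x)^{-1} d(x, X_i)\right)},$$
where $h_k(x)$ is the bandwidth for which exactly $k$ curves satisfy $d(x, X_i) < h_k(x)$. In this work, we fixed the semi-metric ($d$) as the measure of closeness and the kernel function ($K$).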
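The KNN kernel estimator described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the R code used in the paper. It assumes the semi-metric distances to the training curves are already computed, and two choices are assumptions made here: the quadratic kernel, and a local bandwidth placed halfway between the $k$th and $(k+1)$th smallest distances. The function names are hypothetical.

```python
import numpy as np


def quadratic_kernel(u):
    """Asymmetric quadratic kernel: positive on [0, 1), zero elsewhere (assumed choice)."""
    u = np.asarray(u, dtype=float)
    return np.where((u >= 0) & (u < 1), 1.0 - u**2, 0.0)


def knn_kernel_predict(dist_to_train, y_train, k, kernel=quadratic_kernel):
    """Nadaraya-Watson prediction with a k-nearest-neighbour bandwidth.

    dist_to_train : semi-metric distances d(x, X_i), shape (n,)
    y_train       : scalar responses Y_i, shape (n,)
    k             : number of neighbours, 1 <= k < n
    """
    d = np.asarray(dist_to_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    if not 1 <= k < d.size:
        raise ValueError("k must satisfy 1 <= k < n")
    d_sorted = np.sort(d)
    # Assumed bandwidth rule: midpoint between the k-th and (k+1)-th distance,
    # so that (ties aside) only the k nearest curves receive positive weight.
    h = 0.5 * (d_sorted[k - 1] + d_sorted[k])
    if h > 0:
        w = kernel(d / h)
        if w.sum() > 0:
            return float(np.sum(w * y) / w.sum())
    # Degenerate ties (all weights zero): plain average of the curves within h.
    return float(y[d <= h].mean())
```

For example, `knn_kernel_predict([0.1, 0.2, 0.9, 1.5], [1.0, 3.0, 10.0, 20.0], k=2)` averages only the two nearest responses, giving more weight to the closer one.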
Using the same number of neighbours $k$ for every curve provides a global choice: the bandwidth $h_k(X_i)$ depends on $X_i$ (it is such that only the $k$ nearest neighbours of $X_i$ are taken into account), but $k$ is the same for every curve $X_i$; for more details see 8 , 11 . In the literature, different types of semi-metric have been introduced. In our numerical examples, we used the semi-metrics based on the second derivative and on Functional Principal Component Analysis (FPCA); for more details see 9,19 .
Let $X_1, \dots, X_n$ be $n$ curves and $X = \{X(t);\ t \in T\}$. The semi-metric based on FPCA is defined as
$$d_q^{\mathrm{PCA}}(X_i, X_j) = \sqrt{\sum_{l=1}^{q} \left( \int [X_i(t) - X_j(t)]\, v_l(t)\, dt \right)^2},$$
where $v_1, \dots, v_q$ are the orthonormal eigenfunctions of the covariance function $\Gamma(s, t) = E(X(s) X(t))$ associated with the $q$ largest eigenvalues; for more details see 9 . This kind of semi-metric is suitable for rough curves. The semi-metric built on derivatives is defined as (see 8,11 for more details)
$$d_q^{\mathrm{deriv}}(X_i, X_j) = \sqrt{\int \left( X_i^{(q)}(t) - X_j^{(q)}(t) \right)^2 dt},$$
where $X^{(q)}$ is the $q$th derivative of $X$ with respect to $t$, which is computed using a B-spline approximation of the curves; see 8 for more details.
Suppose $X^\star$ is a test point and $Y^\star$ the corresponding response point. The mean predictions and variances of the scores can then be obtained by nonparametric functional regression and are denoted by $\hat{z}_j^\star$ and $\hat{\sigma}_j^{\star 2}$ for $j = 1, \dots, m$. The predictive mean and variance of the $m$-dimensional response $Y^\star$ are then obtained by transforming these score predictions back through the principal component loadings.
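The two semi-metrics can be illustrated numerically. The sketch below is in Python with NumPy and rests on stated assumptions: the curves are observed on a common, equally spaced grid; derivatives are taken by finite differences (`np.gradient`) as a stand-in for the B-spline approximation mentioned in the text; integrals are approximated by a rectangle rule; and the covariance function is estimated from the first set of curves. The function names are hypothetical.

```python
import numpy as np


def semimetric_deriv(A, B, t, q=2):
    """Derivative-based semi-metric between rows of A (n_a, p) and B (n_b, p).

    Assumption: finite differences replace the B-spline derivative.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    t = np.asarray(t, dtype=float)
    for _ in range(q):
        A = np.gradient(A, t, axis=1)
        B = np.gradient(B, t, axis=1)
    dt = t[1] - t[0]  # equally spaced grid assumed
    diff = A[:, None, :] - B[None, :, :]
    # Rectangle-rule approximation of the integral of the squared difference.
    return np.sqrt(dt * np.sum(diff**2, axis=2))


def semimetric_fpca(A, B, t, q=2):
    """FPCA-based semi-metric using the q leading eigenfunctions estimated from A."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    t = np.asarray(t, dtype=float)
    dt = t[1] - t[0]
    # Empirical version of Gamma(s, t) = E(X(s) X(t)) on the grid.
    gamma = A.T @ A / A.shape[0]
    _, vecs = np.linalg.eigh(gamma)      # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :q]           # q leading eigenvectors
    # v_l = u_l / sqrt(dt) has unit L2 norm, so the integral of X * v_l
    # is approximated by sqrt(dt) * (X @ u_l).
    sa = np.sqrt(dt) * (A @ top)
    sb = np.sqrt(dt) * (B @ top)
    diff = sa[:, None, :] - sb[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=2))
```

With `q=2`, the derivative-based semi-metric ignores any vertical shift or linear trend between two curves, which is one reason it suits smooth curves whose shape matters more than their level.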

Simulation Study
The goal of this section is to verify the theoretical results on simulated data with a sample of size n=215. The results obtained from the new model are compared with those of the independent response method. The R program is used to analyse the data.
The data follow the nonparametric functional regression model $Y_i = r(X_i) + \varepsilon_i$, $i = 1, \dots, n = 215$. The curves are generated first, on equally spaced points $0 = t_1 < t_2 < \dots < t_{100} = 1$: each curve $X_i(t_j)$, $j = 1, \dots, 100$, includes a quadratic component $h_i (t_j - 0.5)^2$, and the random coefficients of the curves, $h_i$ among them, are drawn independently from the normal distribution $N(0, 1^2)$. Figure 1 shows the 215 curves from one replication. The performance of the proposed method (M-P) is compared with that of the I-R method, in which the two responses are predicted independently, without taking into account the correlation between responses. The average of the RMSEs over 20 replications is presented in Table 1. Table 1 shows that, in both situations, the multivariate response model considerably improves the results compared with each Independent Response (I-R) model. It is also clear from Table 1 that, even when there is no correlation between the components of the response variables, the multivariate method is more appropriate for prediction than the independent model.
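The distinction between the two strategies can be sketched in code. In the I-R approach each response column is predicted on its own; in the M-P approach the responses are first de-correlated by PCA, each score is predicted separately, and the predictions are mapped back to the original scale. The sketch below is a hedged illustration in Python with NumPy, assuming at least two response columns; `predict_scalar` is a hypothetical placeholder for any scalar-response predictor (such as a KNN kernel estimator), and the predictive variance is not covered.

```python
import numpy as np


def pca_fit(Y):
    """Mean and orthonormal loadings (columns) of a response matrix Y of shape (n, m), m >= 2."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False)
    _, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    return mean, vecs[:, ::-1]           # loadings sorted by decreasing variance


def predict_via_scores(Y_train, predict_scalar):
    """M-P style prediction: predict each PCA score, then transform back.

    predict_scalar(z_train) is a hypothetical callable that takes one column of
    training scores, shape (n,), and returns predictions at the test curves.
    """
    Y_train = np.asarray(Y_train, dtype=float)
    mean, V = pca_fit(Y_train)
    Z = (Y_train - mean) @ V             # de-correlated scores
    Z_hat = np.column_stack(
        [predict_scalar(Z[:, j]) for j in range(Z.shape[1])]
    )
    return mean + Z_hat @ V.T            # back to the original responses
```

Because the loadings are orthonormal, the back-transformation is exact: a predictor that returned the training scores unchanged would reproduce the training responses.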

Real Application:
In this part of the article, the presented method is tested on two different kinds of real data sets, the Tecator data and the Soil data. Two types of real data are used to confirm that the proposed model gives better results than the models in the literature. The R program is used to analyse the data in our study.

Tecator Data
This data set is very widely used among nonparametric statisticians, and various models have been applied to it 9,10,11 . The spectrometric data come from a quality control problem and can be found at http://lib.stat.cmu.edu/datasets/tecator.
The objective with the Tecator data is to determine the proportion of specific chemical contents, because analysis by chemical procedures would take more time and be more costly. This example was worked out in 8,20 for the case in which the response variable is scalar and the covariate is functional. The correlation coefficients between the three contents (Fat, Water and Protein) show that these variables are strongly correlated, so it is more appropriate to estimate them together rather than each one individually.
We divide the original sample into two subsamples: the first 160 sample units form the training sample and the last 55 form the testing sample. As in the simulation example, the RMSE is then computed for the methods as
$$\mathrm{RMSE} = \left( \frac{1}{55} \sum_{i=1}^{55} \left( Y_i - \hat{Y}_i \right)^2 \right)^{1/2}.$$
The independent response model is estimated with the funopare.knn.gcv function in R, which is available on the Nonparametric Functional Data Analysis (NFDA) website. The semi-metric based on the second derivative (q=2) is used for both models (independent response, and multivariate response vectors de-correlated by principal component analysis). Table 2 compares the performance of the methods: 55 testing curves are drawn at random 10 times, and the RMSE is averaged over the 10 draws.
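The evaluation procedure, RMSE on 55 randomly chosen test curves averaged over 10 draws, can be sketched as follows. This is an illustration in Python with NumPy under stated assumptions: `fit_predict` is a hypothetical callable that trains on the given training indices and returns predictions for the test indices, and the seed is arbitrary.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted responses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_rmse_over_random_splits(y, fit_predict, n_test=55, n_repeats=10, seed=0):
    """Average test RMSE over repeated random train/test splits.

    fit_predict(train_idx, test_idx) is a hypothetical callable returning
    predictions for the test indices, shape (n_test,).
    """
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(y.size)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        y_hat = fit_predict(train_idx, test_idx)
        scores.append(rmse(y[test_idx], y_hat))
    return float(np.mean(scores))
```

The same routine can be called once per response (Fat, Water, Protein) and once per method, so that the I-R and M-P models are compared on identical splits when the same seed is used.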
Table 2 shows that the M-P method notably improves the estimation accuracy for Fat, Water and Protein compared with the I-R method.

Soil Data
Rinnan and Rinnan 21 originally analysed this data set; later, 16 took a sample of these data and applied Gaussian process regression with a multivariate response to two components, soil organic matter (SOM) and ergosterol concentration (EC). The soil samples were obtained from a long-term field experiment at a subarctic fell in Abisko, northern Sweden. There are 108 samples, and the wavelength interval of 400-2500 nm (visible and near-infrared spectrum) was scanned at 2 nm intervals with an NIR spectrophotometer; for more detail see 20 . Of the two component variables, Soil Organic Matter (SOM) was measured as loss on ignition at 550 °C, and Ergosterol Concentration (EC) was determined by HPLC. As the functional covariates were smooth, the semi-metric built on the second derivative was adopted in our example. To assess the performance of the method, leave-one-out cross-validation was carried out, that is, each of the 108 samples was left out in turn as test data while the remaining data were used for model training. As in the previous example, the Root Mean Square Error is computed as the measure of efficiency for comparing the two approaches (M-P model and I-R model), and the results are presented in Table 3.
The proposed M-P model significantly improves the efficiency of the prediction for both SOM and EC in comparison with the I-R model.
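Leave-one-out cross-validation of this kind can be sketched in a few lines. The example below is in Python with NumPy and assumes a precomputed matrix of semi-metric distances between all curves. For brevity it uses a plain (unweighted) k-nearest-neighbour average as a stand-in predictor, which is a simplification and not the kernel-weighted estimator of the paper; the function names are hypothetical.

```python
import numpy as np


def knn_mean(dist_to_train, y_train, k):
    """Stand-in predictor: unweighted average of the k nearest responses."""
    nearest = np.argsort(dist_to_train)[:k]
    return float(np.mean(y_train[nearest]))


def loo_rmse(dist, y, k, predict=knn_mean):
    """Leave-one-out RMSE from a full pairwise distance matrix.

    dist : pairwise semi-metric distances between curves, shape (n, n)
    y    : scalar responses, shape (n,)
    """
    dist = np.asarray(dist, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i         # leave sample i out of the training set
        preds[i] = predict(dist[i, keep], y[keep], k)
    return float(np.sqrt(np.mean((y - preds) ** 2)))
```

Because the distance matrix is computed once and only indexed inside the loop, each sample can be held out in turn without recomputing any semi-metric.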

Conclusion:
This study presented a new model for nonparametric regression analysis in which the covariate is functional and Principal Component Analysis is used to de-correlate the multivariate response variables. It used the Nadaraya-Watson estimator in its K-Nearest Neighbour (KNN) form for prediction. The results show that the new model gives better estimates when compared with the outcomes from the independent