Application of Data Mining Techniques on Tourist Expenses in Malaysia

Tourism plays an important role in Malaysia’s economic development because it boosts business opportunities in the surrounding economy. Applying data mining to tourism data is a promising way to predict areas of business opportunity. Data mining is a process that takes data as input and produces knowledge as output. Travel in Asian countries has grown in recent years, and many entrepreneurs have started their own businesses, but problems such as investing in the wrong business fields and poor service quality have reduced their income. The objective of this paper is to use data mining technology to meet the business needs and customer needs of tourism enterprises and to find the most effective data mining technique. To that end, four data mining classification techniques were applied to extract important insights from a tourism data set. The aim was to identify the best-performing algorithm among those compared, so that the results can improve business opportunities in tourism-related fields. The percentages of instances correctly classified by the four classifiers were JRIP (84.09%), Random Tree (83.66%), J48 (85.50%), and REP Tree (82.47%). All the results are analyzed and discussed in this paper.


Introduction:
Tourism is an important economic source for Malaysia, which was once ranked 9th in the world for tourist arrivals (1). Tourism has become Malaysia's third largest source of foreign exchange income (2). This means that there are many entrepreneurial opportunities and problems to be solved in the tourism industry in Malaysia.
Based on business needs and customer needs, this paper addresses two problems: investors' investment decisions, which depend on the income levels of different tourist destinations, and managers' judgment of the regions that the tourists at different destinations come from. Knowing which region tourists belong to helps the manager of a destination make relevant adjustments to attract more tourists and obtain the maximum benefit. For example, if the managers judge that the majority of tourists in a hotel are European, then European languages can be added to the menus, signs and other places with text in the hotel, along with some European customs and elements.
This paper uses data mining technology, which can not only reduce costs but also increase business opportunities (3). Data mining is the process of using data as input and generating knowledge as output. For example, customer and tourist destination data serve as the input, and the output is a recommendation of tourist destinations. Business managers can use data mining technology to obtain the maximum benefit while reducing the cost of customer research, thereby encouraging more people to start a business. This research applied data mining to simulated income data of various tourist destinations in Malaysia and simulated tourist location data to determine the region of tourists, so as to help real merchants in Malaysia use data mining to make correct judgments.
The following sections address the key data mining task: using WEKA to run experiments with four data mining classification techniques and extract important information from the travel data set. The goal is to find the best-performing algorithm among those compared in order to improve business opportunities in travel-related fields. The percentages of correctly classified instances for the four classifiers are: JRIP (84.09%), Random Tree (76.62%), J48 (85.39%) and REP Tree (83.44%). This paper analyzes and discusses all the results. Finally, this paper provides a list of data mining resources and tools for readers who want more information on this topic.

Problem Statement
This paper found that some enterprises in Malaysia's tourism industry still do not use big data mining (4). Data mining can find typical patterns and influencing factors in data that are difficult for managers to find on their own (3).
From the perspective of business needs, the lack of big data applications, such as investors' lack of business income data from different tourist destinations, may lead them to make inaccurate investments in Malaysian tourist destinations. If the business managers of tourist destinations lack information about where the tourists come from, they cannot judge the source of tourists, so the service quality provided to tourists cannot be improved in a targeted manner, resulting in a loss of passenger traffic. Therefore, in both of these respects, the lack of data affects the income of the industry.
From the perspective of customer needs, if the service level of a tourist destination fails to satisfy tourists, they may not recommend the destination to friends on social media, resulting in a decrease in passenger flow at the destination. This decrease likewise affects the income of the industry.

Objective
In response to these two problems, this paper applies data mining technology to analyze the simulated business income data and the simulated data on tourists' home regions, in order to identify the tourist destinations with the highest or lowest income and the region that accounts for the largest proportion of tourists at each destination.
From the perspective of business needs, this can help investors make accurate investments in different tourist destinations and allow business managers to improve service levels in a targeted manner. In addition, the experiments in this paper identify the most effective data mining classifier, which is convenient for tourism enterprises.
From the perspective of customer needs, after the service quality of a tourist destination has been improved in a targeted manner, tourists may recommend the destination to friends around them, increasing its passenger flow. Therefore, the use of data mining technology can meet both business needs and customer needs.

Related Work
This section presents several related studies on applying data mining in tourism. The related works used different classification techniques, and the method that gave the best result in each study is noted.
Algur et al. used the number of travelers from 2002 to 2013 to classify historical monument places. The location data set was preprocessed and assigned class labels such as low, medium, and high according to the number of visitors each year. Several decision tree classification methods were used with 10-fold cross-validation, namely the Random Tree, REP Tree, Random Forest, and J48 algorithms. Analysis of the performance metrics showed that Random Forest was the best of these classifiers (5).
Irawan et al. selected places that can be developed based on how easily the public and tourists can access the tourist site. Their experiments using C4.5 showed that the Nature Tourism object in Simalungun district can be developed, with a recall of 83.33% and an accuracy of 90%. C4.5 provided better results on tourist location compared to other methods. Irawan et al. used the C4.5 algorithm with 10 rules as a reference in the design and development of the GUI of an application that classifies and recommends tourist attractions (6).
Srivihok et al. noted that market segmentation is an important tool for dividing markets into smaller groups of individuals, and they proposed a market segmentation method for travelers who visit Thailand for business. Their technique evaluates unsupervised learning techniques such as the SOM neural network, K-means and hierarchical clustering using the average Silhouette index, and compares the performance of supervised machine learning techniques such as J48, One R, Decision Table, MLP and Naïve Bayes. The classes of data (segments of tourists) used by the supervised learning methods are provided by the unsupervised learning method. The results indicated that Naïve Bayes performed better than the other methods at forecasting the segments of new business tourists produced by the clustering method (7).
Urgessa studied the selection of attributes. The research was framed around classification models constructed before and after attribute selection based on information gain, in order to compare the performance in each situation. The methods selected were decision trees (J48, Random Forest, PART) and the Support Vector Machine (SMO) (8). The models were constructed on tourism data to find the noise-tolerant classification algorithm in the domain, with the aim of recognizing user behavior and improving service and business chances. The best-performing algorithms were identified, and the results showed that Random Forest and SVM are more noise-tolerant, as they showed better performance (8).
Wang et al. noted that travel agencies cannot identify valuable travelers or a tourist's next destination. Their study used the RFM model to describe valuable travelers. A C4.5 decision tree was used to segment the valuable travelers so that a travel company can propose effective promotion strategies, forecasting destinations and cross-selling package tours to increase profits (9). Their research used Taiwanese travelers as mining samples and applied the decision tree to find valuable tourists, their decision-making behaviors, and their demographics. The research focused on using the mining process to segment valuable travelers and analyze travel destination correlations, creating a mining procedure that helps a travel company do better database marketing (9).

Methodology
This section describes in detail the methodology applied in this paper. The steps of the methodology are shown in Fig. 1. The research model consists of several components: dataset and preprocessing, the classification process, result analysis, and the KDD process.

Figure 1. Research model
Data mining is a process that uses data as input and produces knowledge as output. Here the input is the tourism data set, and the output is the rules, the performance metrics, and the accuracy of the results on the tourist data. Data mining follows the algorithmic steps of the Knowledge Discovery in Databases (KDD) process (10). It often involves potentially large and diverse data sets, which may need preprocessing to remove missing data and data or attributes irrelevant to tourism, and to transform the data into a representation suitable for a data mining algorithm. The data mining software Waikato Environment for Knowledge Analysis (Weka) is used in this research as the tool for classification and for analyzing the results.
The tourist dataset covers the years 2011 to 2012 and was collected from an online dataset. The data file was converted from Excel format (.csv) to the Weka file format (.arff). The dataset contains eight numeric attributes giving business income in USD (art galleries, dance clubs, juice bars, restaurants, museums, resorts, parks or picnic spots, and beaches), a period attribute, and one nominal class attribute, the region of the tourist (Africa, America, Asia, Europe, Oceania, and unstated). There are 924 instances in this dataset. The tourist dataset is shown in Fig. 2.
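The conversion from .csv to .arff described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names and the two sample rows are hypothetical, since the paper does not list the exact headers, and the period attribute is left out here for brevity.

```python
import csv
import io

# Hypothetical column names; the paper does not give the exact headers.
NUMERIC_ATTRIBUTES = [
    "art_galleries", "dance_clubs", "juice_bars", "restaurants",
    "museums", "resorts", "parks_picnic_spots", "beaches",
]
# Class values as listed in the dataset description.
REGIONS = ["Africa", "America", "Asia", "Europe", "Oceania", "unstated"]


def csv_to_arff(csv_text, relation="tourist"):
    """Convert CSV text (header row plus data rows) into ARFF text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = [f"@relation {relation}", ""]
    for name in NUMERIC_ATTRIBUTES:
        lines.append(f"@attribute {name} numeric")
    # A nominal attribute enumerates its allowed values in braces.
    lines.append("@attribute region {" + ",".join(REGIONS) + "}")
    lines += ["", "@data"]
    for row in rows:
        values = [row[name] for name in NUMERIC_ATTRIBUTES] + [row["region"]]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Two invented rows, used only to exercise the function.
sample_csv = (
    "art_galleries,dance_clubs,juice_bars,restaurants,museums,resorts,"
    "parks_picnic_spots,beaches,region\n"
    "10,20,120,40,5,60,15,30,Asia\n"
    "12,80,30,55,9,70,11,25,Europe\n"
)
arff_text = csv_to_arff(sample_csv)
print(arff_text)
```

Listing the region values explicitly in the header is what makes the class attribute nominal rather than numeric when the file is loaded.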

Figure 2. Tourist's Dataset
In data mining, preprocessing is very important because it determines the quality of the results of the predictive data mining algorithms used in the knowledge discovery process (11). Effective preprocessing is needed to make the dataset clean and consistent before it is used in the classification process. The tourist dataset does not contain any missing values in the numeric attributes. The data are not converted into ranges by discretization, because the concept hierarchy used in this research is binary: it consists only of tests of whether the amount of income is greater than or equal to a value (income >= a) or less than it (income < a). The period attribute is removed, and the region is set as the class attribute in Weka, since the next process is classification. Figure 3 shows the concept hierarchy. The classification process learns a function that maps or classifies a data object into one of the predefined classes (12). For example, a tourist who spends more than 100 at juice bars is classified as Asian. In this research, four classification models are used: JRIP, Random Tree, J48, and REP Tree. Each model is described below.
JRIP is a propositional rule learner that implements Repeated Incremental Pruning to Produce Error Reduction (RIPPER). In this research, the JRIP model is built in WEKA and expressed as classification rules. It starts with an empty rule set and handles the classes from the least prevalent to the most frequent. JRIP consists of a building stage and an optimization stage. During the building stage, it repeatedly grows rules (adding conditions) and prunes them (incrementally pruning each rule) until a stopping criterion based on the description length is met or the error rate is >= 50%. The optimization stage computes variants of each original rule and selects the final representative for the rule set. If residual positive instances remain, more rules are generated for them by repeating the building stage.
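A rule set of the kind JRIP produces is applied as an ordered list: the first rule whose conditions all hold decides the class, and a default class covers everything else. The sketch below shows only this application step, not the learning procedure. The rules, thresholds, attribute names and default class are invented for illustration and are not the rules learned in this research.

```python
# Each rule is (conditions, predicted class); a condition is
# (attribute, operator, threshold). All values here are hypothetical.
RULES = [
    ([("juice_bars", ">=", 100)], "Asia"),
    ([("dance_clubs", ">=", 70), ("museums", "<", 20)], "Europe"),
]
DEFAULT_CLASS = "America"  # hypothetical default rule


def matches(instance, condition):
    """Check one binary test (>= or <) against an instance."""
    attribute, operator, threshold = condition
    value = instance[attribute]
    return value >= threshold if operator == ">=" else value < threshold


def classify(instance):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in RULES:
        if all(matches(instance, c) for c in conditions):
            return label
    return DEFAULT_CLASS


first = classify({"juice_bars": 120, "dance_clubs": 10, "museums": 5})
second = classify({"juice_bars": 30, "dance_clubs": 80, "museums": 9})
third = classify({"juice_bars": 30, "dance_clubs": 10, "museums": 9})
print(first, second, third)
```

Because the rules are tried in order, an instance that satisfies several rules receives the class of the earliest one.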
Random Tree constructs a decision tree randomly. In this research, the Random Tree is built in WEKA and the tree is represented by classification rules. When constructing the tree, the algorithm picks a feature at random at each node without any purity function check. A categorical feature such as "Asia" is chosen only if it has not been chosen before on the path from the root of the tree to the present node; choosing it again on the same decision path is useless because all examples on that path share the same value. A continuous feature such as "juice bar" can be picked more than once on the same decision path. The tree stops growing when the examples in the current node cannot be split further or the tree becomes too deep.
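The random construction described above can be sketched for numeric features as follows. This follows the description in this section rather than WEKA's implementation, and it makes several assumptions: features are numeric only, the split threshold is also chosen at random, and `max_depth`, the feature names and the sample rows are all hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common class label among the given labels."""
    return Counter(labels).most_common(1)[0][0]


def build_random_tree(rows, labels, features, rng, depth=0, max_depth=4):
    """Return a leaf label or a (feature, threshold, left, right) node."""
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority(labels)
    feature = rng.choice(features)  # picked at random, no purity check
    distinct = sorted({row[feature] for row in rows})
    if len(distinct) < 2:  # the examples cannot be split on this feature
        return majority(labels)
    threshold = rng.choice(distinct[1:])  # guarantees two non-empty sides
    left = [i for i, row in enumerate(rows) if row[feature] < threshold]
    right = [i for i, row in enumerate(rows) if row[feature] >= threshold]
    return (
        feature,
        threshold,
        build_random_tree([rows[i] for i in left], [labels[i] for i in left],
                          features, rng, depth + 1, max_depth),
        build_random_tree([rows[i] for i in right], [labels[i] for i in right],
                          features, rng, depth + 1, max_depth),
    )


def predict(tree, row):
    """Follow the binary tests down to a leaf label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] < threshold else right
    return tree


# Invented rows, used only to exercise the functions.
rows = [
    {"juice_bars": 120, "resorts": 60},
    {"juice_bars": 150, "resorts": 20},
    {"juice_bars": 30, "resorts": 70},
    {"juice_bars": 20, "resorts": 90},
]
labels = ["Asia", "Asia", "Europe", "Europe"]
tree = build_random_tree(rows, labels, ["juice_bars", "resorts"], random.Random(0))
predictions = [predict(tree, row) for row in rows]
print(tree, predictions)
```

A numeric feature may be chosen again deeper on the same path, which mirrors the remark above that continuous features can be reused.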
J48 is WEKA's implementation of the C4.5 classification algorithm. J48 produces a classification decision tree for the tourist dataset by recursively partitioning the tuples. The J48 tree classifier is built in WEKA, and the resulting tree is represented by the classification rules shown in Table 1. A depth-first strategy is used to build the decision tree. J48 considers all the possible tests for splitting the tourist dataset and selects the one with the best information gain. For a numeric attribute, the information gain of each binary partition point is computed from the attribute's distinct values, and subtrees are built accordingly. This process is repeated for all attributes.
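The information gain of a binary partition point can be illustrated with a short sketch. It assumes the split form used throughout this paper (value < a versus value >= a) and uses invented income values; note that C4.5 also normalizes the gain into a gain ratio, which this sketch leaves out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting into value < a and value >= a."""
    left = [lab for v, lab in zip(values, labels) if v < threshold]
    right = [lab for v, lab in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try each distinct value as a partition point and keep the best."""
    candidates = sorted(set(values))[1:]  # the smallest leaves one side empty
    if not candidates:
        return None
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Invented juice bar incomes and regions, for illustration only.
juice_bar_income = [20, 30, 110, 150]
region = ["Europe", "Europe", "Asia", "Asia"]
chosen = best_threshold(juice_bar_income, region)
gain = information_gain(juice_bar_income, region, chosen)
print(chosen, gain)
```

In this toy data the split at 110 separates the two regions completely, so it recovers the full one bit of entropy in the class labels.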
The REP Tree classification model is also called a fast decision tree learner. The REP Tree is built in WEKA, and the decision tree is represented by the classification rules shown in Table 2. REP Tree builds a decision tree using information gain and prunes it using reduced-error pruning. It sorts the values of numeric attributes only once.
The results of the four models are evaluated using the performance evaluation metrics provided by Weka, which include incorrectly classified instances, correctly classified instances, FP rate, TP rate, Precision, Recall and others. All the results are compared for knowledge discovery in the discussion section.
Table 1
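The evaluation metrics named in this section can all be derived from a confusion matrix. The sketch below uses the standard definitions and a small invented 3 x 3 matrix; it does not reproduce the matrices obtained in this research, and the function names are chosen only for this example.

```python
def accuracy(matrix):
    """Share of correctly classified instances (the diagonal)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def per_class_metrics(matrix, k):
    """Metrics for class k; matrix[i][j] counts actual i predicted as j."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # rest of the row
    fp = sum(row[k] for row in matrix) - tp   # rest of the column
    tn = total - tp - fn - fp
    recall = tp / (tp + fn) if tp + fn else 0.0  # same as the TP rate
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"tp_rate": recall, "fp_rate": fp_rate, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def kappa(matrix):
    """Kappa statistic: agreement beyond what chance would give.

    Assumes the chance agreement is below 1.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Invented matrix for three classes, for illustration only.
toy = [
    [5, 1, 0],
    [2, 6, 2],
    [0, 1, 3],
]
toy_accuracy = accuracy(toy)
toy_class0 = per_class_metrics(toy, 0)
toy_kappa = kappa(toy)
print(toy_accuracy, toy_class0, toy_kappa)
```

The TP rate and Recall are the same quantity, which is why they share one computation here.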

Discussion:
The data set used in this paper describes tourists from different regions who visit Malaysia and the income in USD of different places. Models were built on this data set with four classification methods, JRIP, Random Tree, J48, and REP Tree, in WEKA using 10-fold cross-validation on the 924 tourist instances. The classification produced by each of the four classifiers is expressed as "If…then" rules, which are shown in Table 3 with their explanation. There are 20 rules in JRIP.
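The partitioning behind 10-fold cross-validation on 924 instances can be sketched as follows. This is a plain shuffled split written for illustration; WEKA's own procedure may differ (for example by stratifying the folds by class), and the seed is an arbitrary assumption.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(924, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each instance is used for testing exactly once and for training in the other nine folds, so the reported accuracy is based on predictions for all 924 instances.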
The Random Tree has a tree size of 227. The J48 tree has 57 leaves and a tree size of 113. The REP Tree has a tree size of 67. Comparing the rules generated by the four classifiers, the J48 rules fit the dataset best, so its results are more accurate than those of the other three classifiers. Although Random Tree produces more rules, it splits the dataset too deeply.
The decision trees built by WEKA from the tourist dataset are shown in Table 4 for J48 and REP Tree; JRIP has no hierarchy tree because it is rule-based. All the decision trees are binary, as they contain only two kinds of test: greater than or equal to (>=) and less than (<). A decision tree can be explained by converting it into rules, such as the if…then rules in Table 4. In Table 4, the "oval" shapes represent attributes and the "square" shapes represent classes. The values inside each "square" shape give the number of classified objects followed by the number of incorrectly classified objects. The most complicated tree is the Random Tree, followed by J48.
WEKA also shows the classification errors in a scatter plot, where a square marks an incorrectly classified instance and an x marks a correctly classified one. Six different colors are used to represent the different classes. The scatter plot clearly shows that most American tourists are wrongly classified and most Asian tourists are correctly classified. Table 5 shows the classifier errors generated by WEKA.
The results describe the performance evaluation metrics over all the tourist instances: the percentage correctly classified, the percentage incorrectly classified, Kappa statistic, mean absolute error, TP rate, FP rate, Precision, Recall, and F-Measure. All the measurement results are shown in Table 6. The TP (True Positive) rate and FP (False Positive) rate of the four classifiers are examined in more depth using 6 x 6 confusion matrices. Table 8 shows the confusion matrices of J48 and REP Tree.
In the confusion matrices, the letters in the rows and columns denote the region the tourists come from: "a" is Africa, "b" is America, "c" is Asia, "d" is Europe, "e" is Oceania, and "f" is unstated. The green dotted line in the confusion matrix in Table 8 marks the correctly classified instances.
A total of 324 instances belong to class "c". Using J48, 301 of them are correctly classified and 23 are incorrectly classified. These 23 instances should have been classified as "c", but 3 were incorrectly classified as "a", 4 as "b", 16 as "d", and 2 as "e". A total of 312 instances belong to class "d". Using REP Tree, 259 of them are correctly classified and 53 are incorrectly classified. These 53 instances should have been classified as "d", but 6 were incorrectly classified as "a", 26 as "b", 13 as "c", 6 as "e", and 2 as "f".
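The counts reported above translate directly into per-class TP rates. The short calculation below uses only the totals given in this section; the helper name `tp_rate` is introduced for this example.

```python
def tp_rate(correct, total):
    """Share of a class's instances that were classified correctly."""
    return correct / total


# Class "c" (Asia) under J48: 301 of 324 correctly classified.
j48_asia = tp_rate(301, 324)

# Class "d" (Europe) under REP Tree: 259 of 312 correctly classified.
rep_europe = tp_rate(259, 312)

# The REP Tree errors for class "d", listed per predicted class,
# add up to the 53 misclassified instances.
rep_europe_errors = {"a": 6, "b": 26, "c": 13, "e": 6, "f": 2}
rep_europe_error_total = sum(rep_europe_errors.values())

print(f"J48, class c: {j48_asia:.1%}")         # 92.9%
print(f"REP Tree, class d: {rep_europe:.1%}")  # 83.0%
print(rep_europe_error_total)                  # 53
```

Half of the REP Tree errors for class "d" (26 of 53) go to class "b", which shows where the confusion between European and American tourists is concentrated.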

Conclusion:
This paper conducted experiments using WEKA to apply four data mining classification techniques to data from the Malaysian tourism industry: JRIP (84.09%), Random Tree (76.62%), J48 (85.39%) and REP Tree (83.44%). By extracting important information from the data set about the income of tourist destinations and the regions that tourists come from, this paper identifies the best-performing algorithm among those compared in order to improve business opportunities in travel-related fields. It provides information that helps investors make accurate investments in different tourist destinations, and it also helps managers accurately judge the regions that the tourists at different destinations come from. The most effective method is to use the J48 classifier for analysis, and the least effective is to use REP Tree for data mining analysis.
However, it should be noted that the performance of the data mining process depends directly on the number of available cases (instances). Its use does not guarantee the best business results, but it can greatly reduce the risk of making wrong decisions. The results show that no single algorithm outperforms the others in all cases (3). Finally, this paper also provides a list of data mining resources and tools for those who wish to obtain more information on this topic.