Using Graph Mining Method in Analyzing Turkish Loanwords Derived from Arabic Language

Abstract: Loanwords are words transferred from one language to another that become an essential part of the borrowing language. Words move from the source language to the recipient language for many reasons. Detecting loanwords is a complicated task because there are no standard rules for transferring words between languages, which leads to low detection accuracy. This work aims to improve the accuracy of loanword detection, taking Turkish and Arabic as a case study. The proposed system finds all possible loanwords from any set of characters, whether alphabetically or randomly arranged. It then handles distortion in pronunciation and solves the problem of letters that are missing in Turkish relative to Arabic. A graph mining technique is introduced for identifying Turkish loanwords from Arabic; it is used for the first time for this purpose. The difference between the letters of the two languages is resolved by using a reference language (English) to unify the writing style. The proposed system was tested on 1256 manually annotated words. The results show an F-measure of 0.99, which is a high value for such a system. These contributions reduce the time and effort needed to identify loanwords efficiently and accurately. Moreover, researchers do not need knowledge of the recipient and source languages. In addition, the method can be generalized to any two languages by following the same steps used to obtain Turkish loanwords from Arabic.


Introduction:
Data mining is the task of extracting useful information from data, a task that has traditionally been performed by analysts. Data mining has two primary goals: prediction and description 1,2. Predictive data mining produces a model of the system described by a given data set, whereas descriptive data mining produces new and nontrivial information based on an available data set 3.
In general, a graph is a group of nodes, usually drawn as circles; the link between two nodes, drawn as a line, is called an edge 4. That is, a graph is a set of nodes connected by edges. For example, routers or computers (nodes) are connected by wired or wireless links (edges) 5. These nodes and links can together be represented by a tree 6. The field of extracting useful information by traversing a tree in search of specified nodes is called graph mining 7. Graph mining has recently become an important research topic because of its numerous applications and the wide variety of data mining problems in computational biology, chemical data analysis, drug discovery, and communication networking. Graph mining can also be used to detect loanwords that come from other, non-native languages 8.
Loanwords are words that come from a lending language into a receptor language 9. The language of the original word is called the "donor", "lending", or "source" language, whereas the language that receives the words is called the "receptor", "recipient", or "borrowing" language 10. Taking words from other languages helps to compensate for meanings that are missing in the recipient language 10. This phenomenon has contributed significantly to the field of bilingualism 11. The main reason for choosing Turkish in this work is that it adopted many words from Arabic during the Ottoman invasion.
The Ottoman Empire was the predecessor of the Turkish Republic. The Ottomans invaded and occupied Iraq in 1534, and Iraq remained under their control until 1623. During that era, Turkish troops adopted many words and carried them back to their country after they left Iraq. Because of this invasion, many Arabic words found their way into Turkish. These words adopted from Arabic into Turkish are called loanwords 12.
Loanwords in Turkish are words taken from Arabic and integrated into Turkish. Some of these words were transferred from the lending language (Arabic) to the borrowing language (Turkish) without any change in pronunciation. For example, "تمام" is an Arabic word loaned to Turkish; it has the same pronunciation, "tamam", in both the source (Arabic) and recipient (Turkish) languages. Most loanwords in Turkish, however, have undergone slight changes. For example, the word "كتاب" is a Turkish loanword from Arabic, but its pronunciation differs slightly between the two languages: it is pronounced "kitap" in Turkish but "kitab" in Arabic. There are many other examples; one is given here for each case. In other words, phonological changes have been applied to Turkish loanwords to adapt them to the native language. In general, words are borrowed from other languages to fill lexical gaps in the borrowing language.
This paper explains the reasons for the extensive interaction between Arabic and Turkish, which led to the large number of Turkish words borrowed from Arabic. Previous research on this topic is reviewed. Graph mining is used to find loanwords by generating and filtering all possible words. Changes in words are also processed to reflect the mutations that occurred when words moved between the two languages. This method can be generalized to find loanwords between any two languages.
Finding the origin of words is one of the most difficult tasks for linguists, and it is the main motivation of the proposed work. In addition, existing works in this field are very limited and have low accuracy. The main contributions of our research are (i) using graph mining techniques for loanword detection, (ii) dealing with Turkish and Arabic, for which such a technique is used for the first time, and (iii) building an efficient system for detecting loanwords.

Related Works:
There are few works on the automatic detection of loanwords, and none of them deals with Turkish-Arabic loanwords. This section presents almost all the related works, regardless of the language pair.
Buhmaid S 9 stated that many English words have penetrated Hadhrami Arabic. He used specific phonological, morphological, and semantic features to detect loanwords. Salman et al. 13 showed that "borrowing can be achieved when some words are imported from one language to the lexicon of another language". They discussed the adaptation process at the levels of phonology and morphology: "Loanword adaptations represent phonetically minimal transformations". The original form is not always kept; sometimes changes are made to the borrowed words so that they are compatible with the structure and rules of the borrowing language.
Peperkamp et al. 14 showed that there are three problems in loanword adaptation: learnability, phonetically driven adaptation, and unfaithful perception.
Peperkamp S 15 stated that some loanwords are not compatible with the native phonology and that the adaptation process is subject to a minimal transformation. Rao 16 discussed loanwords between English and eleven other languages. He showed that borrowing words into English is very important because the target language needs to fill gaps in its lexicon. He also stated that most of these words found their way into English from languages such as Arabic, Greek, Russian, Spanish, French, and Latin. Farazandeh-pour et al. 17 described an adaptation mechanism for Persian words of German origin.
Mi et al. 18 proposed a recurrent neural network (RNN) based framework to identify loanwords (Chinese, Russian, and Arabic loanwords) in Uyghur. They also suggested two features, an inverse language model and a collocation feature, to optimize the output of the loanword identification model. They stated that the model achieved significant improvements in the loanword detection task. Koo 19 stated that loanwords coming into Korean from other languages can be identified by using an unsupervised classifier. The results showed that the F-score of the classifier is 94.77. He also showed that the method can be applied to other languages that have the same phonemes, such as Japanese. Miller et al. 20 proposed a method to find loanwords in monolingual texts by using an automated framework that exploits phonological and phonotactic clues. The method relies on Support Vector Machines, Markov models, and recurrent neural networks. The results show that phonological and phonotactic clues derived from a single language are inadequate for identifying loanwords.
Aboh et al. 21 used edit-distance measures and a sound-class based method to measure phonological similarity. They showed that this measure can be neglected because words coming from the source language into a borrowing language are subject to changes that fit them to the phonology of the borrowing language.
To the best of our knowledge, none of these or other works deals with Turkish loanwords of Arabic origin. In addition, graph mining has not previously been used for the loanword detection task.

Extracting Loanwords Method
Our proposed methodology answers two questions: (i) Is a given Turkish word a loanword from Arabic? and (ii) What are the Turkish loanwords for a given set of characters and a specific range of word lengths? Answering these questions requires two stages: (i) transformation of the Turkish and Arabic dictionaries into reference words, and (ii) identification of Turkish loanwords of Arabic origin.

Transformation of Turkish and Arabic Dictionaries into Reference Words
When dealing with Turkish and Arabic, two main problems arise: the two languages have different scripts, and Arabic is rich in synonyms, which produces a dictionary gap between the languages 22.
To address the richness of Arabic compared to Turkish, two dictionaries are used: a Turkish-English dictionary and an Arabic-English dictionary. English serves as the intermediate language because it is very simple compared to Arabic and provides a many-to-one mapping, in which many Arabic synonyms have one meaning in English. The same holds between Turkish and English; therefore, the dictionary gap is resolved without using the semantic relations of the words.
The other main problem is the difference in script between Turkish and Arabic. Therefore, a reference script should be used. This can be done by converting the Turkish words into a reference language (English) based on their pronunciation. Each character in a Turkish word is converted into its equivalent English character, ignoring the vowels, to avoid problems related to accent differences and to obtain correct matching results, as shown in (Fig. 1). Likewise, the Arabic words, held in a separate database, need to be converted into English script in the same way as the Turkish words, as illustrated in (Fig. 2). This transformation unifies the scripts of equivalent Turkish and Arabic words based on their pronunciation. Some letters have more than one pronunciation; therefore, many words will have more than one reference word, i.e., each such word will have a list of reference words based on its different candidate pronunciations.
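The role of English as a pivot can be illustrated with a minimal sketch. The two toy dictionaries, the romanized Arabic keys, and the helper name `shares_meaning` are assumptions made for illustration only; they are not taken from the dictionaries used in this work. Two entries are treated as matching in meaning when they share at least one English gloss, so several Arabic synonyms can map to the same Turkish entry.

```python
# Hypothetical toy dictionaries: word -> set of English glosses.
# Arabic entries are shown in romanized form purely for readability.
turkish_english = {
    "kitap": {"book"},
    "tamam": {"ok", "complete"},
}
arabic_english = {
    "kitab": {"book"},
    "sifr": {"book", "volume"},
    "tamam": {"complete", "perfect"},
}


def shares_meaning(turkish_word, arabic_word):
    """Return True when the two entries have at least one English gloss in common."""
    turkish_glosses = turkish_english.get(turkish_word, set())
    arabic_glosses = arabic_english.get(arabic_word, set())
    return bool(turkish_glosses & arabic_glosses)


print(shares_meaning("kitap", "kitab"))  # both entries gloss as "book"
print(shares_meaning("kitap", "sifr"))   # an Arabic synonym reaches the same gloss
print(shares_meaning("tamam", "kitab"))  # no common gloss
```

A shared gloss alone does not make a word a loanword; in the methodology it is combined with a match on the reference words.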
Algorithm 1 shows the transformation of the Turkish-English and Arabic-English dictionaries into reference words. The input to the algorithm is two dictionaries in the forms <Turkish word, meaning in English> and <Arabic word, meaning in English>, and the output is in the forms <Turkish word, meaning in English, list of reference words> and <Arabic word, meaning in English, list of reference words>. It follows that any Turkish word that has the same reference word and meaning as an Arabic word is a loanword.
Example of a list of reference words: jmhr, jmhrt, jmhrd, jmhvr, jmhvrt, jmhvrd. In some cases the character "t" is pronounced as "d"; therefore, this pronunciation is taken into account.
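The transformation into reference words can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: it assumes the word has already been romanized, and the vowel set `VOWELS`, the ambiguity table `ALTERNATIVES`, and the function name `reference_words` are hypothetical. Vowels are dropped, and every letter with more than one candidate pronunciation is expanded into all of its alternatives.

```python
from itertools import product

# Assumption: vowels of the romanized form, including Turkish-specific vowels.
VOWELS = set("aeiouıöü")

# Hypothetical ambiguity table: each letter maps to its candidate pronunciations.
ALTERNATIVES = {
    "t": ("t", "d"),
    "p": ("p", "b"),
}


def reference_words(romanized):
    """Return every consonant skeleton of a romanized word, one per candidate pronunciation."""
    consonants = [c for c in romanized.lower() if c not in VOWELS]
    options = [ALTERNATIVES.get(c, (c,)) for c in consonants]
    return sorted({"".join(choice) for choice in product(*options)})


turkish_refs = reference_words("kitap")
arabic_refs = reference_words("kitab")
print(turkish_refs)
print(arabic_refs)
print(set(turkish_refs) & set(arabic_refs))  # shared reference words
```

Under these assumed tables, "kitap" and "kitab" share the reference word "ktb", so they would be compared on meaning next.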

Identification of Turkish Loanwords from Arabic Origin:
The identification of a Turkish loanword of Arabic origin is based on the equivalence of the reference words and their meanings. In some cases, however, loanwords have undergone small or large modifications, so there is no exact match in reference words or meaning for such words. For this reason, an efficient technique is needed for detection or identification. Turkish loanwords from Arabic fall into the following types:
1. Loanwords that have the same pronunciation and meaning as the Arabic origin.
2. Loanwords that have the same pronunciation but a modified meaning compared to the Arabic origin.
3. Loanwords that have a modified pronunciation but the same meaning as the Arabic origin.
4. Loanwords that have a modified pronunciation and a modified meaning compared to the Arabic origin.
The modification in the meaning or pronunciation of a loanword relative to its Arabic origin may be large (complete) or small. Because two dictionaries are used, a modification in meaning can be detected if it was made using synonyms. This is the main reason for using two dictionaries.
A small modification in pronunciation can be detected easily using graph theory, as shown below.
In the proposed system, the main objectives (tasks) are the identification of Turkish loanwords (i) for a specific word or (ii) for all words within a specific range of lengths that consist of a limited subset of characters. The first task is a subtask of the second; therefore, only the second task is explained in this section.
According to these objectives, any set of characters can be taken that constructs real Turkish words of length up to n. This is done by selecting all the words in the Turkish dictionary whose length is at most n as real words. A binary search is used for this purpose because the database is sorted in ascending order; the time complexity of binary search is O(log N), compared with O(N) for sequential search, where N is the number of entries. These real words are compared with the character set, and any word containing a character outside this set is discarded. Some of the remaining words were adopted from Arabic. Therefore, the next step is to find the Turkish words of Arabic origin. This can be done using the graph mining technique. It starts by retrieving the reference words list of a given Turkish word; this list is then used to construct a directed graph that represents all the combinations of reference words, as shown in (Fig. 3). Any Arabic word that has a reference word forming a subgraph of this graph is a candidate origin of the selected Turkish word. If the meaning of a candidate origin matches the meaning of the selected Turkish word, the candidate is the Arabic origin of this Turkish word, i.e., the Turkish word is of Arabic origin. Algorithm 2 shows the identification of Turkish loanwords of Arabic origin.

Step2: Build an undirected graph UG in which each character in ST is a node.
Step3: For each word w in TW: if length(w) <= ML and w is a subgraph of UG, add w to RealWords and record its meaning and reference words list.
Step4: For each word w in RealWords do:
A- Produce the graph that represents all reference words of the word w.
B- Search AW for an equivalent reference word that is a subgraph of this graph. // The length of this subgraph should be equal to, or one less than, the length of the original graph.
C- If the two matched reference words have the same meaning, add this word with its equivalent Arabic origin to the Loanwords list.
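The steps above can be sketched in Python as follows. This is one possible reading under stated assumptions, not the authors' VB.net implementation: ST is taken to be the input character set, ML the maximum word length, and TW and AW the Turkish and Arabic dictionaries mapping each word to a pair (set of English meanings, list of reference words). The function names, the toy entries, and the romanized Arabic keys are hypothetical. The directed graph is modeled with (position, letter) nodes, so in this simplified sketch an Arabic reference word that is one letter shorter can only match as a prefix.

```python
def reference_graph(reference_words):
    """Step 4A: directed graph over (position, letter) nodes for all reference words."""
    nodes, edges = set(), set()
    for ref in reference_words:
        positioned = list(enumerate(ref))
        nodes.update(positioned)
        edges.update(zip(positioned, positioned[1:]))
    longest = max((len(ref) for ref in reference_words), default=0)
    return nodes, edges, longest


def is_subgraph(ref, graph):
    """Step 4B: ref is a path of the graph and is at most one letter shorter than it."""
    nodes, edges, longest = graph
    if not ref or not (longest - 1 <= len(ref) <= longest):
        return False
    positioned = list(enumerate(ref))
    return (all(node in nodes for node in positioned)
            and all(edge in edges for edge in zip(positioned, positioned[1:])))


def find_loanwords(st, ml, tw, aw):
    """Return (Turkish word, Arabic origin) pairs following Steps 2 to 4."""
    ug = set(st)  # Step 2: one node per character of ST
    # Step 3: keep words that are short enough and use only characters of ST.
    real_words = {w: entry for w, entry in tw.items()
                  if len(w) <= ml and set(w) <= ug}
    loanwords = []
    for w, (meanings, refs) in real_words.items():  # Step 4
        graph = reference_graph(refs)
        for arabic, (a_meanings, a_refs) in aw.items():
            matched = any(is_subgraph(r, graph) for r in a_refs)
            if matched and meanings & a_meanings:  # Step 4C: shared meaning
                loanwords.append((w, arabic))
    return loanwords


# Hypothetical toy dictionaries (Arabic entries romanized for readability).
tw = {
    "kitap": ({"book"}, ["ktp", "ktb"]),
    "kalem": ({"pen"}, ["klm"]),
    "masa": ({"table"}, ["ms"]),
}
aw = {
    "kitab": ({"book"}, ["ktb"]),
    "qalam": ({"pen"}, ["qlm", "klm"]),
    "mass": ({"touch"}, ["ms"]),
}
print(find_loanwords("kitaplems", 5, tw, aw))
```

In this toy run, "masa" matches the reference word of "mass" but is rejected in Step 4C because the meanings differ, which illustrates why both the reference word and the meaning must agree.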

Results and Discussion:
The proposed algorithm was tested using 1256 manually classified words, of which half (628) are loanwords and the other half are non-loanwords. Inputs of various string lengths were also given to the system, and the results were checked manually. This section has three parts: dataset and experimental setting, transformation of datasets into reference words, and loanword detection.

Dataset and Experimental Setting
The proposed algorithm was implemented in the VB.NET programming language on a laptop with an Intel Core i7 CPU, 8 GB of RAM, and the Windows 10 operating system. Two dictionaries were used in this work. The first is a Turkish-English dictionary in the form <Turkish Entry, English meaning (list)>. The Turkish entry is the lemma of a Turkish word. The second is an Arabic-English dictionary in the form <Arabic Entry, English meaning (list)> where the

Transformation of Datasets into Reference Words
Arabic and Turkish words must be transformed into a reference language (English), as explained in the previous sections and shown in (Figs. 1, 2). The reference language is used because the two languages (Arabic and Turkish) have different writing styles, which makes direct matching of loanwords between them impossible. The output of this step is two dictionaries in the forms <Turkish Entry, English meaning (list), reference words (list)> and <Arabic Entry, English meaning (list), reference words (list)>. Each Turkish and Arabic word has a list of reference words in the reference language. The reference words represent the different spellings of the Turkish or Arabic entry.

Loanwords Detection
First, the system was tested using the 1256 manually classified words. Only nine of the 628 loanwords were not recognized by the system as loanwords, while all the non-loanwords were identified correctly. This means that the precision, recall, and F-measure are 1, 0.98, and 0.99, respectively (the recall is 619/628, approximately 0.986).
Then the system was tested on extracting all the loanwords within a specific length range from a predefined character set. Many tests were carried out with different character sets and different word lengths, and all outputs were checked manually. Two samples of these tests are shown in (Figs. 4, 5). The first step is to enter a set of characters and the maximum length of the generated substrings. Then the button "Generate Substrings" is pressed. This extracts all real words in this range by traversing the graph that represents the entered set of characters. Our methodology has a few limitations, which can be summarized as follows: (i) it does not work directly on the Turkish and Arabic encodings but on a reference language, to bypass encoding problems, and (ii) many comparisons are needed when a character set is taken with a specific range of word lengths.
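The reported scores can be checked from the counts given above. The sketch below assumes that "all the non-loanwords are identified" means there are no false positives; the function name `precision_recall_f1` is introduced here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 628 loanwords in the test set, 9 of them missed; assumed 0 false positives.
p, r, f = precision_recall_f1(tp=628 - 9, fp=0, fn=9)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
# prints "precision=1.000 recall=0.986 f-measure=0.993"
```

To two decimal places the F-measure is 0.99, in agreement with the reported value; the reported recall of 0.98 corresponds to 0.986 truncated rather than rounded.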

Conclusion:
Detection of loanwords is an important task for linguists and historians. It is a complicated and time-consuming task. This paper proposed a novel methodology for detecting such words automatically between Turkish and Arabic. The proposed system detects almost all loanwords of Arabic origin in Turkish. This work faced many challenges, such as the difference between the writing scripts of the two languages. In addition, the two languages, Turkish and Arabic, differ greatly at the level of morphology and word construction. Therefore, distance functions for finding matches among terms or words are very difficult to apply. Consequently, a reference language was employed to represent the terms or words of the two target languages.
The main motivation of the proposed work is to find the loanwords in Turkish that come from Arabic. The main contributions are (i) using graph mining techniques for loanword detection, (ii) dealing with Turkish and Arabic, for which such a technique was used for the first time, and (iii) building an efficient system for detecting loanwords.
Furthermore, data mining handles changes in pronunciation well, especially when words move between languages, through its processing of vowels. It can also be used with different languages without full knowledge of the components of these languages. Manual testing on 1256 words showed that our system works with high precision.
Suggestions for future work are (i) applying the proposed system to other languages, (ii) combining different methodologies to improve the efficiency of the system, and (iii) using deep learning to detect small relations between words.