Short Text Semantic Similarity Measurement Approach Based on Semantic Network

Estimating the semantic similarity between short texts plays an increasingly prominent role in many text mining and natural language processing applications, especially given the large increase in the volume of textual data produced daily. Traditional approaches, which calculate the degree of similarity between two texts from the words they share, do not perform well with short texts because two similar texts may be written in different terms by employing synonyms. Short texts should therefore be compared semantically. In this paper, a semantic similarity measurement method between texts is presented that combines knowledge-based and corpus-based semantic information to build a semantic network that represents the relationship between the compared texts and extracts the degree of similarity between them. Representing a text as a semantic network is the best knowledge representation that comes close to the human mind's understanding of texts, since the semantic network reflects the sentence's semantic, syntactical, and structural knowledge. The network representation is a visual representation of knowledge objects, their qualities, and their relationships. The WordNet lexical database is used as the knowledge-based source, while GloVe pre-trained word embedding vectors are used as the corpus-based source. The proposed method was tested on three datasets: DSCS, SICK, and Mohler. Good results were obtained in terms of RMSE and MAE.


Introduction:
Many NLP and text mining tasks require finding similarity scores between texts, such as information retrieval 1, text classification, text summarization 2, sentiment analysis 3, automatic assessment of students' short answers, and machine translation 4. In the traditional similarity calculation methods, each text is converted into a vector in a vector space 5. The vector is constructed from the concepts or words of the text, and the similarity between the compared texts is then calculated as the cosine similarity of their vectors. This means the similarity score between the compared texts is computed based on the number of words they have in common. These methods work well with large texts the size of an article or document, based on the assumption that similar texts tend to share similar words, but they cannot be relied upon for short texts of one or a few sentences, as two short texts can carry almost the same meaning without having a single word in common. For example, the sentences "She is a beautiful woman" and "She is a nice girl" have similar meanings but do not share any word except stop words. Hence the need for a method that takes into account the semantic similarity between the words of the two texts, which means taking into account synonyms, hypernyms, hyponyms, and other relationships between words.
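The limitation described above can be made concrete with a minimal sketch of bag-of-words cosine similarity. The tokenizer and the small stop-word list are illustrative assumptions, not the preprocessing used in this paper.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"she", "is", "a", "an", "the"}


def bow_cosine(text1, text2, stop_words=STOP_WORDS):
    """Cosine similarity between the word-count vectors of two texts."""
    def counts(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return Counter(t for t in tokens if t not in stop_words)

    c1, c2 = counts(text1), counts(text2)
    # Only words present in both texts contribute to the dot product.
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Under these assumptions, the two example sentences share no content word, so the word-overlap score is zero even though their meanings are close.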
Finding similarities between words is an important element of text similarity and serves as its starting point. Words can be similar in both lexical and semantic aspects. Words are lexically similar if their character sequences are similar. Words are semantically similar if they have the same meaning, are opposites, are employed in the same way, are employed in the same context, or one word is a type of another. String-based methods can be used to compute lexical similarity, whereas corpus-based or knowledge-based algorithms can be used to calculate semantic similarity. String-based algorithms apply a string comparison metric to measure the similarity of distinct character sequences. Corpus-based algorithms use information acquired from a huge corpus to compute the semantic similarity between words, whereas knowledge-based algorithms use the knowledge obtained from semantic networks to determine the semantic closeness of words 6,7.
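As a small illustration of the lexical side of this distinction, the sketch below scores character-sequence similarity with `difflib.SequenceMatcher` from the Python standard library. The choice of this particular string metric is an assumption made for the example; the paper does not prescribe one.

```python
from difflib import SequenceMatcher


def lexical_similarity(word1, word2):
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, word1.lower(), word2.lower()).ratio()
```

A string metric of this kind rates "colour" and "color" as close, but it gives no credit to a pair such as "girl" and "woman", which is why semantic sources are needed.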
In this paper, a method is presented for measuring semantic similarity between texts by building a semantic network that represents the relationship between the words of the texts, where each node represents a word with its part of speech (PoS) tag, and each edge represents the similarity degree between two nodes. To measure the similarity score between words, WordNet 8 is used as a knowledge-based source and GloVe 9 as a corpus-based source. Two sources are used to take advantage of both and to avoid the problem of some words being missing from one of the sources.
The next sections are structured as follows: Section 2 briefly reviews some similar works. Section 3 describes the semantic network. Section 4 explains the word-to-word similarity sources. Section 5 describes the proposed method for semantic similarity measurement between short texts. Section 6 details the results obtained. Section 7 presents the conclusion.

Related Work:
Liu and Wang 10 (2013) adopted a vector space model to consolidate word-to-word similarity. Initially, both sentences are converted to a bag-of-words form. The approach then creates a combined word set by merging the words of sentence 1 and sentence 2. For every sentence, a semantic vector is created, with each word of the combined word set serving as a vector element. Each element of the semantic vector holds the highest similarity between the corresponding word in the combined word set and the words of the sentence. They develop similarity metrics based on concept vectors to assess the similarity of word pairs. Once each sentence's semantic vector has been formed, the cosine coefficient of these semantic vectors is used to determine the similarity of the sentences. This technique yields a precision of 0.738 and a recall of 0.902 in paraphrase detection on the Microsoft Research Paraphrase Corpus (MSRP).
Croft et al. 11 (2013) suggested lightweight semantic similarity (LSS), a short text similarity measure that integrates the vector space model with path-length word-to-word similarity. The first stage in the procedure is to create a combined word set from the two sentences and use it as the vector space dimensions. Every sentence is then represented as a vector. Every word in the combined word set is evaluated against every word in a sentence. The sum of the word-to-word similarity scores for each term in the combined word set is taken as the value of the vector component for that term. The process is repeated until each vector component (term) has a value. The approach creates a vector representation for the second sentence using the same process. Cosine similarity on the sentence vectors is used to calculate the overall sentence similarity. The method's performance is measured on the 65 word pairs from Rubenstein and Goodenough, with every word being substituted by its definition from the Collins Cobuild dictionary. The LSS algorithm and human judgment are then used to assess the similarity of the definitions of each noun pair. They attained a Pearson correlation of 0.807.
Kusner et al. 12 (2015) produced word embeddings from the Google News corpus using the word2vec approach advanced by Mikolov et al. 13. The term "word embedding" refers to the representation of a word as a dense numerical vector. To quantify sentence similarity, the approach represents each text as a normalized bag-of-words vector. The Word Mover's Distance (WMD) function is used to calculate the distance between the two sentences. The function determines the minimum cumulative distance that the words of one sentence must travel to match the words of the second sentence exactly. The Euclidean distance between the word embedding vectors is used as the distance between words. The greater the WMD between two sentences, the less similar the two sentences are.
Vu et al. 14 (2014) used explicit semantic analysis (ESA) in conjunction with ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is a measure of lexical similarity based on n-gram co-occurrence information. They calculated text similarity using each technique, then used a linear combination with a tuning parameter to obtain the final similarity. They tested the method on a dataset they created from Wikipedia articles. They achieved a Pearson correlation of 0.82.
In the biomedical area, Soğancıoğlu et al. 15 (2017) presented a technique to compute text similarity. They used the combined output of numerous sentence similarity metrics as input to a supervised machine learning approach. Following text preparation, the technique assesses each sentence pair's knowledge-based, string-based, and corpus-based similarity. Each measurement's result is fed into a supervised regression model. For testing purposes, they built their own dataset of biomedical sentence pairs. They calculated the Pearson correlation between their results and the human similarity judgments. A Pearson correlation of 0.836 was achieved using this strategy.
Pawar et al. 16 (2018) suggested a technique for determining sentence similarity that takes semantic and word position information into consideration. Their method was both knowledge-based and corpus-based. The approach creates a joint word set by combining the two input texts. The sentences are then translated into semantic vectors using WordNet knowledge and the joint word set. The degree of similarity between compared words was computed from the shortest path between the two words and the depth of the subsumer in WordNet. The similarity score between words was then weighted using information content obtained from a corpus, and cosine similarity was applied to the two vectors. Order vectors were likewise produced, and order similarity was determined using a similar approach. Lastly, the similarity was calculated by merging semantic and order similarity. The Rubenstein and Goodenough word pairs were used to test the approach, which achieved a Pearson correlation of 0.8794. Although it produces promising results, this approach has disadvantages: word sense disambiguation is not performed, and problems arise if sentences contain terms that are not in WordNet.
Yang et al. 17 (2021) suggested a strategy for combining semantic and syntactic information in short text similarity. The semantic information is derived from semantic vectors of short texts, which are dynamically created by comparing short texts and term similarity. A constituency parse tree was used to retrieve syntactic information. The two parts were then linearly integrated. To address polysemy, they employed knowledge and corpora to express the meaning of phrases. They tested their method on semantic textual similarity (STS) tasks comprising 24 datasets. Good results were achieved in terms of the Pearson correlation coefficient; however, using a tree parser was computationally expensive, which made the method unsuitable for real-time applications.
Lubis et al. 18 (2021) proposed semantic similarity based on word embedding for an automatic short answer grading system. They trained a word2vec model on a full Indonesian Wikipedia dump to obtain word embedding vectors. The student answer and the correct answer were converted to sentence vectors by averaging their word vectors, and the semantic similarity was then computed as the cosine similarity between the two sentence vectors. They tested their method on a dataset consisting of 224 student responses from a computer network engineering class. They achieved a Pearson correlation of 0.7.
Mijbel et al. 19 (2021) suggested an approach for measuring semantic similarity between texts based on the semantic network and word description. The method began with text processing, part-of-speech tagging, and building the semantic network. The semantic similarity was calculated from three aspects: the similarity of the nodes, the similarity of the parts of speech, and the similarity of the edge relationships. The final similarity was obtained as a linear combination of the three similarities. They tested the method on the DSCS dataset 20 and achieved a mean absolute error of 1.17.
Through this review of previous works, the following points can be noted:
1-Knowledge-based methods can be limited when some words are not present in the lexical database used, especially when informal words are used in texts.
2-Methods that depend on word embedding vectors are biased by the nature of the corpus that was used to train the vectors. For example, a political corpus shows a great similarity between 'Iraq' and 'Afghanistan', while a cultural or historical corpus shows a similarity between 'Iraq' and 'Mesopotamia'.
Our main contributions are as follows:
-Presenting a new hybrid method for short text semantic similarity measurement that integrates knowledge-based and corpus-based semantic information.
-The proposed method is based on the semantic network, where the semantic, syntactical, and structural knowledge of the sentence is considered in the calculation of the similarity degree.
-Our method is easy to implement, fast, and computationally inexpensive.

Semantic Networks:
The semantic network is one of the ways of representing knowledge, where the nodes represent objects or concepts and the edges represent the binary relationships that connect pairs of nodes. The network representation provides a pictorial representation of knowledge objects, their attributes, and the relationships between them 21.
Representing a text as a semantic network is the best representation of knowledge that comes close to the human mind's understanding of texts, since the semantic network reflects the semantic, syntactical, and structural knowledge of the sentence. The relationship between the nodes can be 'is a', 'a kind of', 'a part of', and so on. Fig. 1 is an example of a semantic network.
As mentioned, calculating the similarity between words is the cornerstone of measuring similarity between texts. In this section, the sources that have been used to compute the semantic similarity between words are described.

WordNet
WordNet is an example of a semantic network in which words (concepts) are linked by synonymy or meronymy links. WordNet is a lexical database with over 100,000 English words that is commonly used for knowledge-based semantic similarity approaches 22. WordNet divides the lexicon into nouns, verbs, adjectives, and adverbs. Within these categories, words are grouped to form synsets, or synonym sets. A synset is a concept in which all of the terms have the same meaning. In certain contexts, the words in a synset are interchangeable. A synset's knowledge includes the definitions of its words as well as pointers to other related synsets. In WordNet, synsets are organized in a tree-like hierarchical structure, with many specialized terms at the bottom and a few general terms at the top. The lexical hierarchy is connected by following trails of superordinate terms in "is a" or "is a sort of" (ISA) relations. To find a path between two words, each word climbs the lexical tree until the two climbing paths meet. The subsumer is the synset at the intersection of the two climbing paths; a path connecting the two words is then identified through the subsumer. Counting synset links along the path between the two words yields the path length. Counting the levels from the subsumer to the top of the lexical hierarchy yields the depth of the subsumer. If a word is polysemous (meaning it has numerous meanings), there may be multiple paths between the two words 23.

GloVe
Word embeddings are representations of words as vectors that maintain the basic linguistic links between words 25. These vectors are computed by a variety of methods, including neural networks, word co-occurrence matrices, and representations based on the context of the word 22. Some of the most often used pre-trained word embeddings are word2vec 26, GloVe 9, fastText 27, and BERT 28. GloVe pre-trained word vectors are adopted in the proposed method, alongside WordNet, as a word-to-word similarity resource.
GloVe, developed at Stanford University, employs a global word co-occurrence matrix built from the underlying corpus. It calculates similarity based on the observation that words that are similar to each other commonly occur together. A single pass over the underlying huge corpus is used to populate the co-occurrence matrix with occurrence values. The GloVe model was trained on five corpora, the majority of which were Wikipedia dumps. Words are chosen within a given context window for constructing vectors because words further away have less significance to the context word under consideration. The GloVe loss function reduces the least-squares distance between the co-occurrence values in the context window and the global co-occurrence values. To discriminate words based on context, GloVe vectors were enhanced to generate contextualized word vectors 22.
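The way pre-trained vectors yield a word-to-word score can be sketched as follows. The code assumes the common plain-text layout in which each line holds a word followed by its space-separated vector components; the three-dimensional vectors in `TOY_LINES` are invented for illustration and are not real GloVe values.

```python
import math


def parse_glove_lines(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy vectors, not taken from any released GloVe file.
TOY_LINES = [
    "mother 0.9 0.1 0.3",
    "mum 0.8 0.2 0.3",
    "park 0.1 0.9 0.2",
]
```

With these toy values, `cosine` ranks "mother" much closer to "mum" than to "park". With real pre-trained vectors the same two functions would be applied to lines read from the vector file.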

Proposed Method:
The proposed method for measuring the semantic similarity between two texts is based on the construction of a semantic network that represents the relationship between the elements of the two texts. Fig. 2 shows the process of finding the semantic similarity between two texts. The method starts with text preprocessing, where the input text is converted into a clean format that can be analyzed and processed. In the next step, the semantic network is built, which represents the binary relationships between the words of the two texts. The word-to-word semantic similarity is found through the lexical database (WordNet) and the pre-trained embedding vectors (GloVe). In the last step, the semantic similarity between the two texts is computed using the information provided by the constructed semantic network.

Text Preprocessing
Text preprocessing is a necessary step to convert the text into a clean format that can be processed and analyzed. It consists of several steps, which are as follows.

Cleaning and Normalizing
The original text often comes with unwanted additions that do not affect its semantic meaning. Cleaning removes these additions, such as duplicated whitespace, special characters, HTML tags, punctuation marks, and URL links. After cleaning, the text is converted to lowercase.
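A minimal sketch of such a cleaning step, using only regular expressions from the standard library, might look like the following. The exact patterns are assumptions chosen for illustration; the paper does not list the rules it applies.

```python
import re


def clean_text(text):
    """Strip HTML tags, URLs, punctuation and extra whitespace, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URL links
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)        # punctuation, special characters
    text = re.sub(r"\s+", " ", text).strip()           # duplicated whitespace
    return text.lower()
```

Note that this simple rule set also splits contractions and possessives (for example, an apostrophe becomes a space), which a production pipeline may want to handle separately.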

Tokenization
Tokenization is the procedure of dividing the text into smaller components known as tokens 29 .

Part of Speech Tagging
Part-of-speech tagging is the procedure of assigning a part-of-speech tag to each word in the text, based on its meaning and its context. Tagging is a disambiguation task: ambiguous words have more than one possible part of speech, and the goal is to find the correct tag for the situation 30.

Stop Word Removal
Stop words are high-frequency words such as pronouns (they, we, you) and conjunctions (for, and, while). They have little impact on the semantic meaning of texts, so deleting them is a suitable option in many NLP applications.
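Tokenization and stop word removal can be sketched together as below. The regular-expression tokenizer and the short stop-word list are illustrative assumptions, and part-of-speech tagging, which the method applies before this step, is omitted here.

```python
import re

# Short illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {
    "a", "an", "the", "of", "is", "are", "in", "with", "his",
    "and", "for", "while", "they", "we", "you",
}


def tokenize(text):
    """Split a text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]
```

Applied to the sentence "A boy of young age is playing in the park with his mother", these two functions leave six content words, and "A young child and his mum are playing in the field." leaves five.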

Lemmatization
Lemmatization is the procedure of returning a word to its base (dictionary) form. For example, 'run', 'ran', and 'running' are all forms of the verb 'run'. In semantic similarity measurement applications, lemmatization is preferred over stemming because it avoids the overgeneralization of stemmers and takes into account the PoS tag of the word: the word 'running' with a noun PoS tag remains the same, while 'running' with a verb PoS tag is reduced to 'run'.
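The role of the PoS tag in lemmatization can be illustrated with a toy lookup. The table `LEMMA_TABLE` and the tag names are hypothetical; a real system would rely on a full lemmatizer backed by a lexical database rather than a hand-written dictionary.

```python
# Hypothetical toy table mapping (word, PoS tag) pairs to lemmas.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); unknown pairs are returned unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)
```

The same surface form is treated differently depending on its tag: `lemmatize("running", "VERB")` gives "run", while `lemmatize("running", "NOUN")` keeps "running".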

Semantic Network Construction
After the texts are preprocessed, the process of building a semantic network that represents the relationship between the two input texts begins. Each word of the two input texts, together with its PoS tag, is a node in this network, while the semantic similarity scores between the words are the edges. To build the network, the semantic similarity between each node of the first text and every node of the second text is computed, and the highest value that links a node to a node of the other text becomes the weight of that node. The weights of the second text's nodes are obtained in the same way.
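The weighting rule described above can be sketched as a small function. Nodes are modelled as `(word, pos)` tuples, and the word-to-word similarity is passed in as a callable; the function names and the toy scores in `TOY_SIM` are assumptions made for this example.

```python
def node_weights(nodes_a, nodes_b, similarity):
    """For each node in nodes_a, the highest similarity to any node in nodes_b."""
    return [
        max((similarity(a, b) for b in nodes_b), default=0.0)
        for a in nodes_a
    ]


# Hypothetical word-to-word scores used only to exercise the function.
TOY_SIM = {
    frozenset({"boy", "child"}): 0.9,
    frozenset({"park", "field"}): 0.8,
}


def toy_similarity(node_a, node_b):
    """Toy similarity between two (word, pos) nodes."""
    if node_a[0] == node_b[0]:
        return 1.0
    return TOY_SIM.get(frozenset({node_a[0], node_b[0]}), 0.1)
```

Calling `node_weights` once in each direction gives the two weight lists that the similarity measurement step consumes.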

Semantic Similarity Measurement
After the semantic network is constructed, both texts have a list of nodes with associated weights that represent their relationship with the other text. To compute the semantic similarity between the two texts, the similarity of the first text to the second is calculated by summing the weights of the text1 nodes that are greater than the threshold (ϴ) value and dividing the sum by the number of text1 nodes. The similarity of the second text to the first is calculated in the same way, and the final similarity score is the average of the two similarities.
Let S1 be the similarity of text1 with text2, S2 the similarity of text2 with text1, and S the final similarity:

$$S_1 = \frac{A_1}{N_1} \tag{2}$$

$$S_2 = \frac{A_2}{N_2} \tag{3}$$

where A1 and A2 are the sums of the weights of the text1 nodes and the text2 nodes that are greater than ϴ, respectively, and N1 and N2 are the numbers of nodes of text1 and text2, respectively. The final similarity is computed as follows:

$$S = \frac{S_1 + S_2}{2}$$
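The scoring step can be written as a short sketch that takes the two lists of node weights and the threshold ϴ. The function names are illustrative, and the handling of an empty text (returning 0) is an assumption not stated in the paper.

```python
def directional_similarity(weights, theta):
    """Sum of the weights above theta, divided by the total number of nodes."""
    if not weights:
        return 0.0  # assumption: an empty text has no similarity
    return sum(w for w in weights if w > theta) / len(weights)


def text_similarity(weights1, weights2, theta=0.7):
    """Average of the two directional similarities S1 and S2."""
    s1 = directional_similarity(weights1, theta)
    s2 = directional_similarity(weights2, theta)
    return (s1 + s2) / 2
```

Weights at or below ϴ contribute nothing to the numerator but still count in the denominator, which is why raising ϴ lowers the final score.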

Illustrative Example
To illustrate how the proposed method calculates the semantic similarity between two sentences, consider this example:
Text1: "A boy of young age is playing in the park with his mother"
Text2: "A young child and his mum are playing in the field."
In the first step, the input texts are passed to the text preprocessing process. The output of this process is a list of words with their part-of-speech tags for each text.
The value of ϴ is inversely related to the semantic similarity score between the texts: the higher the ϴ value, the lower the similarity between the two texts, and vice versa. To exploit the semantic properties of words to the maximum extent possible while keeping noise to a minimum, several ϴ values were tested, and the best results were obtained with ϴ values between 0.7 and 0.85. With ϴ = 0.7:
The sum of the text1 node weights that are greater than ϴ is 4.747 and the number of nodes is 6, so S1 = 4.747/6 = 0.791.
The sum of the text2 node weights that are greater than ϴ is 4.747 and the number of nodes is 5, so S2 = 4.747/5 = 0.949.
The final similarity is S = (S1 + S2)/2 = (0.791 + 0.949)/2 = 0.87.

Experimental Result:
The proposed method was implemented in the Python programming language using several of its text processing libraries. To evaluate the proposed method, it was tested on three datasets that differ in size and domain. The first is a small set of sentence pairs from the field of computer science, the second is a large set of general-purpose English sentence pairs, and the third comes from the field of automated assessment of students' answers.

Datasets Description

DSCS Dataset
The Domain-Specific Complex Sentence (DSCS) semantic similarity dataset 20 comprises 50 sentence pairs from the computer science field, with associated similarity scores supplied by 15 human annotators. The similarity score of each pair was calculated by averaging the responses of the 15 annotators. The similarity score between sentence pairs is given on a scale of 0 to 5, with 0 representing complete dissimilarity and 5 indicating complete similarity.

SICK Dataset
The Sentences Involving Compositional Knowledge (SICK) 31 dataset is made up of around ten thousand English sentence pairs that were constructed from two sources: the 8K ImageFlickr data collection and the SemEval 2012 STS MSR-Video Description dataset. Each pair of sentences has two kinds of annotation: relatedness and entailment. The degree of relatedness ranges from 1 to 5, while the entailment relation is categorical, consisting of neutral, contradiction, and entailment.

Mohler Dataset
The Mohler 32 dataset consists of 10 assignments with four to seven questions each and two tests with ten questions each. These assignments and exams were given to students in an introductory computer science class at the University of North Texas. There are 87 questions in total, and each question has a standard answer provided by the examiner. Each question was answered by 26 to 31 students. Each answer is scored on a scale of 0 (not correct) to 5 (completely correct) by two evaluators who are experts in computer science. The standard score of each answer is the average of the two evaluators' scores.

Results
Mean absolute error (MAE) and root mean squared error (RMSE) were calculated as evaluation metrics to assess the proposed method. The best results achieved with the three datasets are shown in Table 1. The method was tested with several ϴ values, and the best results are obtained with ϴ values in the range (0.7, 0.85), except in the case of the Mohler dataset, where the best results were obtained with no threshold. This is explained by the fact that the scores given by the evaluators tend to be high, which makes each word, even one with a small weight, influential in the total similarity value. For further detail, Table 2 shows the difference between the semantic similarity values of the proposed method and the human similarity values; the difference between the predicted and the actual score does not exceed 1 (on the 0-5 scale) in most cases. Out of 50 sentence pairs in the DSCS dataset, the difference between our method's score and the human score is less than 0.5 (on the 0-5 scale) in 33 pairs, and less than 1.5 in 48 pairs. With the SICK dataset, our method gave a similarity score very close to the human judgment score in about 73% of the compared text pairs. Our results with the Mohler dataset were also good: in 75% of cases, the score given by our method was close to that given by the assessors. Table 3 provides a comparison between our results and the results of similar previous work. Our method obtained better or competitive results compared to the previous works. Compared with Mijbel et al. 19, which is the closest approach to ours since it is also based on the semantic network, our method showed a significant reduction in error in terms of MAE on both the DSCS and Mohler datasets.
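For reference, the two evaluation metrics can be computed as in the sketch below. It assumes the predicted and human scores are already on the same scale (for example, both on 0-5); how the method's scores are rescaled to the dataset's scale is not specified here and is left to the caller.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between two equal-length score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def rmse(predicted, actual):
    """Root mean squared error between two equal-length score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a view of the typical error as well as the influence of outliers.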

Discussion
The proposed method has achieved good and encouraging results. As shown in Table 2, the difference between the similarity score estimated by the proposed method and the actual value is less than 1 (on the 0-5 scale) in most cases. After examining the few cases in which the difference was large, the following causes were found:
1. One of the two texts expresses its meaning in a mathematical form, which causes the system to fail to measure the similarity between the mathematical and textual expressions.
2. One of the two texts expresses a meaning very briefly in one or two words, while the text it is compared with is much longer.
3. The student's answer (in the Mohler dataset) contains spelling errors, which the assessor overlooked when judging the answer to be correct.
In general, our method showed encouraging results, giving a degree of similarity between two short texts close to that given by a human evaluator, but some limitations can be observed:
1-Automatic correction of spelling errors was not adopted; adopting it in future work may give better results.
2-Word order is not taken into account. For example, the sentences "Ahmad bought Ali's car" and "Ali bought Ahmad's car" are considered completely similar. However, even for human judgment, the two sentences remain related.

Conclusion:
A method for semantic similarity measurement between texts, based on the semantic network, has been presented in this paper. Knowledge-based and corpus-based semantic information were combined to build the semantic network. The WordNet lexical database and GloVe pre-trained vectors were combined to calculate word-to-word similarity. The method is simple, effective, and fast to implement. The results achieved are good and can be improved in the future. The threshold (ϴ) value has an important effect on the results. Choosing a higher ϴ value leads to lower similarity values between the compared texts, and vice versa. The experimental results showed that the best ϴ value ranges from 0.7 to 0.85, with some exceptions. Utilizing word embedding vectors trained on domain-specific corpora, together with domain-specific lexical databases, may give better results. For future work, incorporating machine learning techniques that take the information provided by the semantic network as features to predict the semantic similarity value seems a suitable option that could make the similarity estimates more accurate.

Figure 1. The semantic network of the sentence "John gave Merry a book of data structure in the class".
In WordNet, numerous approaches have been developed for identifying semantic similarity between words and concepts. These measures fall into five categories: path-based, information content (IC)-based, gloss-based, feature-based, and hybrid measures. The proposed method uses the Wu and Palmer 24 measure to compute the similarity between words. Wu and Palmer define the similarity between two concepts x1 and x2 as

$$sim(x_1, x_2) = \frac{2k}{a_1 + a_2 + 2k} \tag{1}$$

where a1 and a2 denote the numbers of links from x1 and x2 to the deepest common subsumer x, and k denotes the number of links from x to the root of the taxonomy.
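The Wu and Palmer measure can be traced on a small example. The sketch below applies Equation 1 to a hypothetical toy taxonomy stored as a child-to-parent map; it is not WordNet, and library implementations may count depth slightly differently (for instance, in nodes rather than links).

```python
# Hypothetical toy taxonomy (child -> parent); "entity" is the root.
TOY_PARENT = {
    "entity": None,
    "person": "entity",
    "adult": "person",
    "child": "person",
    "woman": "adult",
    "girl": "child",
}


def path_to_root(concept, parent):
    """List of concepts from `concept` up to the root, inclusive."""
    path = [concept]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def wu_palmer(x1, x2, parent):
    """Equation 1: 2k / (a1 + a2 + 2k), with link counts a1, a2 and k."""
    path1 = path_to_root(x1, parent)
    steps2 = {c: i for i, c in enumerate(path_to_root(x2, parent))}
    for a1, concept in enumerate(path1):
        if concept in steps2:  # first hit going upward is the deepest common subsumer
            a2 = steps2[concept]
            k = len(path_to_root(concept, parent)) - 1  # links from subsumer to root
            denom = a1 + a2 + 2 * k
            return 1.0 if denom == 0 else 2 * k / denom
    return 0.0
```

In this toy tree, "woman" and "girl" meet at "person" with a1 = a2 = 2 and k = 1, giving 2/6. When the only common subsumer is the root, k = 0 and the score under this link-counting convention is 0.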

Fig. 3 shows the semantic network that was constructed between the two texts.

P-ISSN: 2078-8665, E-ISSN: 2411-7986. Published Online First: Suppl. November 2022; 2022, 19(6): 1581-1591.
The node-to-node similarity scores are determined in two cases:
Case 1: node A and node B have the same word, so the relation is assigned a value of 1.
Case 2: node A and node B are not the same. Here the external sources are used to compute the similarity between the two nodes: the Wu & Palmer similarity is calculated with WordNet, and the cosine similarity is calculated between the GloVe word embedding vectors of the two nodes. The relation between node A and node B is then assigned the higher of the two scores.
Algorithm 1 shows the process of semantic network construction.
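The two cases can be expressed as a small function that takes the two similarity sources as callables. The names `knowledge_sim` and `corpus_sim` are placeholders for the WordNet-based and GloVe-based scorers, and the convention that a source returns `None` when a word is missing from it is an assumption of this sketch, not a detail given in the paper.

```python
def node_similarity(word_a, word_b, knowledge_sim, corpus_sim):
    """Case 1: identical words score 1. Case 2: best score over both sources."""
    if word_a == word_b:
        return 1.0
    scores = [
        score
        for score in (knowledge_sim(word_a, word_b), corpus_sim(word_a, word_b))
        if score is not None  # a source may not cover one of the words
    ]
    return max(scores, default=0.0)
```

Taking the maximum lets either source compensate when the other lacks a word or underestimates the relation, which is the motivation for combining the two.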