Estimating the semantic similarity between short texts plays an increasingly prominent role in text mining and natural language processing applications, especially given the large and growing volume of textual data produced daily. Traditional approaches, which calculate the degree of similarity between two texts from the words they share, perform poorly on short texts because two similar texts may be worded differently, for example by using synonyms. Short texts should therefore be compared semantically. This paper presents a method for measuring the semantic similarity between texts that combines knowledge-based and corpus-based semantic information to build a semantic network representing the relationship between the compared texts, from which the degree of similarity is extracted. Representing a text as a semantic network is a form of knowledge representation that comes close to the human understanding of text: the network reflects the sentence's semantic, syntactic, and structural knowledge and gives a visual representation of knowledge objects, their attributes, and their relationships. The WordNet lexical database is used as the knowledge-based source, and GloVe pre-trained word embedding vectors are used as the corpus-based source. The proposed method was tested on three datasets: DSCS, SICK, and MOHLER. It achieved good results in terms of root mean square error (RMSE) and mean absolute error (MAE).
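Two points in the abstract can be made concrete with a small sketch: why a shared-word measure underrates paraphrases that use synonyms, and how RMSE and MAE compare predicted similarity scores with human ratings. This is an illustration only, not the paper's method. The function names are hypothetical, and the sentences and scores are invented for the example rather than taken from DSCS, SICK, or MOHLER.

```python
import math


def word_overlap(text_a: str, text_b: str) -> float:
    """Jaccard overlap of lowercase word sets (a shared-word baseline)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def rmse(predicted, gold):
    """Root mean square error between predicted and gold scores."""
    if len(predicted) != len(gold) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(p - g) ** 2 for p, g in zip(predicted, gold)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, gold):
    """Mean absolute error between predicted and gold scores."""
    if len(predicted) != len(gold) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(predicted)


if __name__ == "__main__":
    # Invented paraphrase pair: same meaning, almost no shared words.
    a = "the physician examined the child"
    b = "the doctor checked the kid"
    print(round(word_overlap(a, b), 3))  # only "the" is shared: 1/7

    # Invented scores on a 0..1 scale, for illustration only.
    predicted = [0.2, 0.9, 0.5]
    gold = [0.4, 0.8, 0.5]
    print(rmse(predicted, gold), mae(predicted, gold))
```

The paraphrase pair shares only the word "the", so the overlap score is low even though the sentences mean nearly the same thing; a semantic measure is meant to close that gap. Lower RMSE and MAE indicate predictions closer to the human ratings, with RMSE penalizing large errors more heavily.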
Received 30/3/2022, Accepted 8/8/2022, Published Online First 25/11/2022
This work is licensed under a Creative Commons Attribution 4.0 International License.