Vowel Recognition for Rehabilitation Assessment of Speech Disorder Patients via Multi-source Frequency Spectrum Images

Authors

  • Nur Syahmina Ahmad Azhar, Fakulti Teknologi dan Kejuruteraan Elektronik dan Komputer, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia.
  • Nik Mohd Zarifie Hashim, Fakulti Teknologi dan Kejuruteraan Elektronik dan Komputer, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia.
  • Masrullizam Mat Ibrahim, Fakulti Teknologi dan Kejuruteraan Elektronik dan Komputer, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia.
  • Mahmud Dwi Sulistiyo, School of Computing, Telkom University, West Java, Indonesia.

DOI:

https://doi.org/10.21123/bsj.2024.9202

Keywords:

Convolutional Neural Network (CNN), Deep Learning, Mel-Frequency Cepstral Coefficient (MFCC), Rehabilitation, Spectrogram, Vowel Recognition

Abstract

Communication impairments have a broad range of medical causes, including speech disorders, hearing loss, brain injury, stroke, and physical impairments, and they can affect social development and interpersonal relationships. Patients with speech impairments benefit from early speech therapy; however, most rehabilitation facilities worldwide still carry out this process manually. Speech processing has been studied extensively for many human languages, and machine learning and deep learning have been applied in medicine and healthcare to enhance rehabilitation. This study aims to develop a new neural network that can distinguish the vowels in the voices of speakers without speech disorders, patients with speech disorders, and a mix of the two groups, using six and twelve classes of Malay vowels. Sound is converted into images so that a neural network can process it visually, through a newly proposed concept named image-profiled data. The classification accuracy of the designed network is compared with that of several pre-trained models (VGG-Net, AlexNet, and Inception) in a comprehensive comparative analysis. The image-profiled datasets built from spectrograms and Mel-frequency cepstral coefficients (MFCC) produced the best results in this study. The designed network, trained with a batch size of 6, 20 epochs, and Adam as the optimizer, achieved the highest accuracy for both the six-class and twelve-class settings on image-profiled audio data in the analyses conducted.
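
The conversion of sound into an image can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract gives no framing parameters, so the frame length, hop size, sample rate, and the synthetic test tone below are all assumptions chosen for illustration, and the function names are hypothetical. A naive DFT is used only to keep the example self-contained; a practical implementation would normally rely on an FFT routine.

```python
import cmath
import math


def spectrogram_image(samples, frame_len=64, hop=32):
    """Return a 2D list (frames x frequency bins) of 0-255 grayscale values.

    frame_len and hop are illustrative assumptions, not values from the paper.
    """
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep only the non-negative frequencies
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT of one windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            # Log magnitude in dB; the small offset avoids log10(0).
            row.append(20 * math.log10(abs(acc) + 1e-10))
        rows.append(row)
    # Rescale the dB values to the 0-255 range of a grayscale image.
    lo = min(min(r) for r in rows)
    hi = max(max(r) for r in rows)
    span = (hi - lo) or 1.0
    return [[round(255 * (v - lo) / span) for v in r] for r in rows]


def synth_tone(freq_hz, sample_rate=8000, n_samples=400):
    """Generate a pure sine tone as a stand-in for a recorded vowel."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


# A 1000 Hz tone at 8000 Hz sampling falls in bin 1000 * 64 / 8000 = 8.
image = spectrogram_image(synth_tone(1000.0))
```

Each row of `image` is one time frame and each column one frequency bin, so the result can be saved as a picture and passed to an image classifier. An MFCC-based image would add a mel filterbank and a cosine transform on top of these log magnitudes.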

How to Cite

1. Vowel Recognition for Rehabilitation Assessment of Speech Disorder Patients via Multi-source Frequency Spectrum Images. Baghdad Sci.J [Internet]. [cited 2024 Nov. 21];22(3). Available from: https://bsj.uobaghdad.edu.iq/index.php/BSJ/article/view/9202