Transformer-Based Turkish Automatic Speech Recognition
Davut Emre Taşar, Kutan Koruyan, Cihan Çılgın
Businesses increasingly use Automatic Speech Recognition (ASR) technology to improve efficiency and productivity across many business functions. With the growth of online meetings in remote working and learning environments after the COVID-19 pandemic, speech recognition systems have come into more frequent use, which underlines their significance. While languages such as English, Spanish, and French have a large amount of labeled data, very little labeled data exists for Turkish, and this scarcity directly reduces the accuracy of ASR systems. This study therefore makes use of unlabeled audio data by learning general data representations through self-supervised, end-to-end modeling. It employs a transformer-based machine learning model, with performance improved through transfer learning, to convert speech recordings to text. The model adopted is the Wav2Vec 2.0 architecture, which masks parts of the audio input and is trained to solve a contrastive task over the masked portions. The XLSR-Wav2Vec 2.0 model, pre-trained on speech data in 53 languages, was fine-tuned on the Mozilla Common Voice Turkish data set. According to the empirical results, a word error rate of 0.23 was reached on the test set of the same data set.
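The word error rate reported above is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the study. The function name `word_error_rate` and the example sentences are hypothetical, and the text normalization the authors applied before scoring (casing, punctuation) is not stated in the abstract, so none is applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped.
print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))  # 0.25
```

Under this definition, a score of 0.23 corresponds to roughly 23 word-level errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.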
Citations
APA
Taşar, D.E., Koruyan, K., & Çılgın, C. (2024). Transformer-Based Turkish Automatic Speech Recognition. Acta Infologica, 8(1), 1-10. https://doi.org/10.26650/acin.1338604