Transformer-Based Turkish Automatic Speech Recognition

Davut Emre Taşar, Kutan Koruyan, Cihan Çılgın

Research Article

DOI: https://doi.org/10.26650/acin.1338604

Businesses increasingly use Automatic Speech Recognition (ASR) technology to raise efficiency and productivity across many business functions. With online meetings becoming more prevalent in remote working and learning environments after the COVID-19 pandemic, speech recognition systems are used more often, which underscores their importance. Languages such as English, Spanish, and French have large amounts of labeled data, whereas very little labeled data exists for Turkish, and this scarcity directly lowers the accuracy of Turkish ASR systems. This study therefore exploits unlabeled audio data by learning general data representations through self-supervised, end-to-end modeling. A transformer-based machine learning model, with performance improved through transfer learning, was used to convert speech recordings to text. The model adopted is the Wav2Vec 2.0 architecture, which masks the audio inputs and solves a contrastive task over the masked segments. The XLSR-Wav2Vec 2.0 model, pre-trained on speech data in 53 languages, was fine-tuned with the Mozilla Common Voice Turkish data set. In the empirical results, the model reached a word error rate (WER) of 0.23 on the test set of the same data set.
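To make the reported metric concrete, the following is a minimal sketch of how a word error rate can be computed for a single utterance: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. It is not the authors' evaluation code; the function name `word_error_rate` and the Turkish example sentence are assumptions made for illustration, and corpus-level scores are usually aggregated over all utterances rather than averaged per sentence.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the recognizer drops one of four words.
    reference = "bugün hava çok güzel"
    hypothesis = "bugün hava güzel"
    print(word_error_rate(reference, hypothesis))  # one deletion over four words
```

Under this definition, a WER of 0.23 corresponds to roughly 23 word-level errors per 100 reference words.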

Keywords: Wav2vec2, automatic speech recognition, speech-to-text transcription, natural language processing, transformer architecture

Citations

APA

Taşar, D.E., Koruyan, K., & Çılgın, C. (2024). Transformer-Based Turkish Automatic Speech Recognition. Acta Infologica, 8(1), 1-10. https://doi.org/10.26650/acin.1338604

Volume 8, Issue 1, 2024, pp. 1-10

TIMELINE

Submitted	06.08.2023
Accepted	30.11.2023
Published online	29.02.2024

LICENSE

Attribution-NonCommercial (CC BY-NC)

This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.
