An Error Coding System for the Turkish Learner Corpus

Golynskaia, Anna

doi:https://dx.doi.org/10.26650/jol.2022.1216218

Research Article

DOI: 10.26650/jol.2022.1216218 IUP: 10.26650/jol.2022.1216218

An Error Coding System for the Turkish Learner Corpus

Learner corpora are electronic collections of texts produced by learners of a foreign or second language. They are reliable tools for investigating learner language and are widely used in the fields of second language acquisition and foreign language learning. This paper describes an error tagging system designed on the basis of a 274,000-word Turkish learner corpus comprising Turkish examination papers written by learners from 94 different countries. The texts were manually keyboarded while retaining all errors, after which a 44,000-word component of the corpus was error coded using the specially devised error tag set. The majority of codes are based on a three- or four-letter system in which the first letter represents the error domain and the following letters identify the error category, as well as the word class where relevant. The error tag set has a total of 58 possible codes. The error tagging system can be used to assess the linguistic competence of learners of Turkish and to build other error-annotated Turkish learner corpora.

Keywords: Error Tagging, Error Taxonomy, Error Analysis, Learner Corpus, Turkish as a Foreign Language

Türkçe Öğrenen Derlemi için Hata Etiketleme Sistemi

Anna Golynskaia

Learner corpora are electronic collections of texts produced by those learning a language as a foreign or second language. Widely used in the fields of second language acquisition and foreign language learning, learner corpora serve as reliable tools for investigating learner language. The aim of this study is to develop a Turkish-specific error tagging system on the basis of the 274,000-word Turkish Learner Corpus, which consists of examination papers written by students from 94 countries. After the texts in the corpus were manually transferred to the computer while remaining faithful to the originals, a 44,000-word core corpus was created and error-tagged. The error tagging system is based on three- or four-letter codes in which the first letter represents the error domain and the following letters indicate the error category and, where applicable, the word class. There are 58 error codes in total. The error tagging system can be used to assess the language proficiency of learners of Turkish and to build other error-tagged Turkish learner corpora.

Keywords: Error Tagging, Error Taxonomy, Error Analysis, Learner Corpus, Turkish as a Foreign Language

EXTENDED ABSTRACT

Learner corpora, also known as interlanguage or L2 corpora, are defined as electronic collections of authentic foreign or second language data. They differ from other data types used in second language acquisition (SLA) and foreign language teaching (FLT) in two main ways. First, because they are computerized, they can be analyzed easily with a range of software tools. Second, the large amount of data they contain allows learner language to be described and modeled accurately. In addition to the morphological and syntactic tagging and lemmatization common to other corpus types, learner corpora are usually annotated with a standardized system of error tags. During error annotation, one or more annotators detect the errors, assign labels indicating the error type, and add an example of the correct usage next to each label. Error tagging follows the error taxonomy developed for the given corpus. The error classifications used in learner corpora today fall into three groups: classifications indicating the source of the error (e.g., morphology, lexis, case, number); classifications based on the ordering of the units in the source text (e.g., omission, addition, improper formation, improper ordering); and classifications that combine several aspects, including error domain (e.g., morphology, lexicology, pragmatics), error category (e.g., diacritic usage, inflection, gender, mood), and word category.
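The annotation step described above (an error label plus a correction attached to the erroneous form) can be illustrated with a minimal sketch. The inline `<err tag="..." corr="...">` markup, the tag `MOM`, and the function names are hypothetical and chosen only for illustration; the source does not specify the annotation format used in the Turkish learner corpus.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    tag: str         # error label assigned by the annotator
    erroneous: str   # form as written by the learner
    correction: str  # correct usage supplied by the annotator


# Hypothetical inline markup: <err tag="XXX" corr="correct form">learner form</err>
ERR_PATTERN = re.compile(r'<err tag="([A-Z]{3,4})" corr="([^"]*)">(.*?)</err>')


def extract_annotations(text: str) -> List[Annotation]:
    """Return every annotated error found in the text, in order."""
    return [
        Annotation(tag=m.group(1), erroneous=m.group(3), correction=m.group(2))
        for m in ERR_PATTERN.finditer(text)
    ]


def learner_text(text: str) -> str:
    """Strip the markup and keep the learner's original forms."""
    return ERR_PATTERN.sub(lambda m: m.group(3), text)


def corrected_text(text: str) -> str:
    """Strip the markup and substitute the annotator's corrections."""
    return ERR_PATTERN.sub(lambda m: m.group(2), text)


# "okul" lacks the dative suffix required here; the correct form is "okula".
sample = 'Ben <err tag="MOM" corr="okula">okul</err> gidiyorum.'
annotations = extract_annotations(sample)
```

Keeping the learner's form and the correction side by side means that both the original text and a corrected version can be recovered from the same annotated file.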

This study describes the design of an error annotation system based on the 44,000-word error-tagged component of a 274,000-word Turkish learner corpus comprising Turkish examination papers written by learners from 94 different countries. The data were collected from eight Turkish teaching centers, located in the provinces of Istanbul, Ankara, Sakarya, Samsun, Edirne, and Kocaeli, that agreed to participate in the study. The corpus therefore comprises data produced by foreign students learning Turkish in an immersive language learning environment. Because the data collection phase coincided with the quarantine restrictions of the COVID-19 pandemic, the study drew on exam papers stored in the archives of the Turkish teaching centers and scanned the writing sections of the Turkish Proficiency Exam and C1 exams. The texts were manually keyboarded while retaining all errors. Each text was saved as a separate Word file, and a tag containing metadata such as the student's nationality and gender, the year of the exam, the subject, and the genre of writing was created for each one.

The digitization of the data obtained from the universities was completed in January 2021. While the texts were being transferred to the computer, personal information (e.g., first and last name, university, residence address, age) was anonymized. Real information was replaced with randomly chosen names and numerical data; where the original contained an orthographic mistake, care was taken to substitute an equivalent word containing the same mistake. To ensure that the core corpus included the maximum variety of texts, a matrix was prepared that specified the students’ genders, nationalities, writing topics, and genres. The core corpus contains 43,518 words and 50,487 tokens.
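The source does not explain the gap between the two figures above (43,518 words versus 50,487 tokens). One plausible reading, offered here only as an assumption, is that the token count also includes punctuation marks. The sketch below shows how such a distinction could be computed; the regular expression and the function name are illustrative and are not the author's tool.

```python
import re
from typing import Tuple

# Assumption: a "word" is a run of word characters, optionally continued after
# an apostrophe (as in Turkish "Ankara'ya"); every other non-space character
# counts as a separate punctuation token.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")
WORD_START = re.compile(r"\w")


def count_words_and_tokens(text: str) -> Tuple[int, int]:
    """Return (number of words, number of tokens) under the assumption above."""
    tokens = TOKEN_PATTERN.findall(text)
    words = [t for t in tokens if WORD_START.match(t)]
    return len(words), len(tokens)


# Four words plus a comma and a full stop give six tokens.
words, tokens = count_words_and_tokens("Merhaba, ben Ankara'ya gidiyorum.")
```

Under this reading, the token count is always at least as large as the word count, which is consistent with the figures reported for the core corpus.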

In addition to keyboarding the texts, the study examined the error annotation systems developed for other corpora and identified the codes that could be used for the Turkish learner corpus. The preliminary error tags were divided into five categories: spelling, punctuation, morphology, syntax, and vocabulary. As the study progressed, further categories were added to these five, bringing the total number of error labels to 58. The majority of codes are based on a three- or four-letter system in which the first letter represents the error domain and the following letters identify the error category, as well as the word class where relevant.
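The layered code structure described above can be sketched as a small decoder. Everything specific here is an assumption made for illustration: the letters in the lookup tables, the two-letter width of the category, and the optional fourth letter for the word class are hypothetical and are not the 58 codes defined in the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables; the actual tag set is defined in the full paper.
DOMAINS = {"S": "spelling", "P": "punctuation", "M": "morphology",
           "X": "syntax", "V": "vocabulary"}
CATEGORIES = {"OM": "omission", "AD": "addition", "WF": "wrong form"}
WORD_CLASSES = {"N": "noun", "V": "verb", "A": "adjective"}


@dataclass(frozen=True)
class ErrorCode:
    domain: str
    category: str
    word_class: Optional[str]  # None when the code has no word-class letter


def parse_code(code: str) -> ErrorCode:
    """Decode a three- or four-letter code into domain, category, word class."""
    if len(code) not in (3, 4):
        raise ValueError(f"expected a three- or four-letter code, got {code!r}")
    try:
        domain = DOMAINS[code[0]]          # first letter: error domain
        category = CATEGORIES[code[1:3]]   # next two letters: error category
        word_class = WORD_CLASSES[code[3]] if len(code) == 4 else None
    except KeyError as exc:
        raise ValueError(f"unknown component {exc} in code {code!r}") from None
    return ErrorCode(domain, category, word_class)


three_letter = parse_code("MOM")   # morphology, omission, no word class
four_letter = parse_code("MWFV")   # morphology, wrong form, verb
```

A positional scheme of this kind lets errors be counted at different levels of detail, for example all morphology errors or only wrong verb forms, by matching on a prefix of the code.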

The error tagging system can be used to assess the linguistic competence of learners of Turkish or to build other error-annotated Turkish learner corpora. Alongside its contributions to the field, the present study has some limitations: it focused mainly on error labels related to the linguistic competence of learners of Turkish, leaving discourse errors beyond the scope of the paper.

Citations

APA

Golynskaia, A. (2022). An Error Coding System for the Turkish Learner Corpus. The Journal of Linguistics, 0(39), 67-87. https://doi.org/10.26650/jol.2022.1216218

Issue 39, 2022, pp. 67-87

TIMELINE

Submitted	08.12.2022
Accepted	28.12.2022
Published Online	18.01.2023

LICENCE

Attribution-NonCommercial (CC BY-NC)

This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.
