Türkçe Öğrenen Derlemi için Hata Etiketleme Sistemi
Anna Golynskaia
Öğrenen derlemleri, bir dili yabancı veya ikinci dil olarak öğrenenlerin ortaya koyduğu metinlerin elektronik koleksiyonlarıdır. İkinci dil edinimi ve yabancı dil öğrenimi alanlarında yaygın olarak kullanılan öğrenen derlemleri, öğrenen dilini araştırmaya olanak sağlayan güvenilir araçlar olarak karşımıza çıkmaktadır. Bu çalışmanın amacı, 94 ülkeden gelen öğrencilere ait sınav kağıtlarından oluşan 274 bin kelimelik Türkçe Öğrenen Derleminin temelinde Türkçeye özgü bir hata etiketleme sistemini geliştirmektir. Derlemdeki metinler, orijinallerine sadık kalınarak manuel olarak bilgisayar ortamına aktarıldıktan sonra 44 bin kelimelik çekirdek derlem oluşturulmuş ve üzerinde hata etiketlemesi yapılmıştır. Hata etiketleme sistemi, ilk harfin hata alanını temsil ettiği, sonraki harflerin ise hata kategorisini ve geçerli ise kelime türünü belirttiği üç veya dört harfli kodlara dayanmaktadır. Toplamda 58 hata kodu mevcuttur. Tasarlanan hata etiketleme sistemi, Türkçe öğrenenlerin dil yeterliliğini değerlendirmek ve hata etiketlemesi içeren diğer Türkçe öğrenen derlemlerini oluşturmak için kullanılabilecektir.
An Error Coding System for the Turkish Learner Corpus
Anna Golynskaia
Learner corpora are electronic collections of texts produced by learners of a foreign or second language. They are reliable tools for investigating learner language and are widely used in the fields of second language acquisition and foreign language learning. This paper describes an error tagging system designed on the basis of a 274,000-word Turkish learner corpus comprising Turkish examination papers written by learners from 94 different countries. The texts were manually keyboarded while retaining all errors, after which a 44,000-word component of the corpus was error coded using the specially devised error tag set. The majority of codes are based on a three- or four-letter system in which the first letter represents the error domain and the following letters identify the error category, as well as the word class where relevant. The error tag set has a total of 58 possible codes. The designed error tagging system can be used to assess the linguistic competence of Turkish learners and to build other error-annotated Turkish learner corpora.
Learner corpora, also known as interlanguage or L2 corpora, are defined as electronic collections of authentic foreign or second language data. Learner corpora differ from other data types used in the fields of second language acquisition (SLA) and foreign language teaching (FLT) in two main ways. Firstly, they are easily analyzable using various software tools due to being computerized. Secondly, learner corpora enable learner language to be correctly described and modeled due to the large amount of data they contain. In addition to the morphological and syntactic tagging and lemmatization commonly used in other corpus types, learner corpora are usually annotated with the help of a standardized system of error tags. During the error annotation process, errors are detected by one or more annotators who assign labels indicating the error type and add examples of the correct usage next to the labels. Error tagging is performed based on the error taxonomy developed for the given corpus. Error classifications used in learner corpora today can be divided into three groups: classifications indicating the source of error (e.g., morphology, lexis, case, number), classifications based on the ordering of the units in the source text (e.g., omission, addition, improper formation, improper ordering), and classifications that take various aspects into consideration, including error domain (e.g., morphology, lexicology, pragmatics), error category (e.g., diacritic usage, inflection, gender, mood), and word category.
This study describes the design of an error annotation system based on the 44,000-word error-tagged component of a 274,000-word Turkish learner corpus composed of Turkish examination papers written by learners from 94 different countries. The data were collected from eight Turkish teaching centers that agreed to participate in the study, located in the Turkish provinces of Istanbul, Ankara, Sakarya, Samsun, Edirne, and Kocaeli. The corpus therefore comprises data produced by foreign students learning Turkish in an immersive language learning environment. Because the data collection phase coincided with the quarantine restrictions resulting from the COVID-19 pandemic, exam papers stored in the archives of the Turkish teaching centers were accessed, and the writing sections of the Turkish Proficiency Exam and C1 exams were scanned. The texts were manually keyboarded while retaining all errors. Each text was saved as a separate Word file, and a tag was created for it containing metadata such as the nationality and gender of the student, the year of the exam, the subject, and the genre of writing.
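The per-text metadata tag described above can be pictured as a small record. The following is only a minimal sketch: the field names, the `<text ...>` header layout, and the sample values are assumptions made for illustration, since the source lists the kinds of metadata but not how the tag is formatted.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TextMetadata:
    """Metadata for one learner text (field names are hypothetical)."""
    nationality: str
    gender: str
    exam_year: int
    topic: str
    genre: str


def format_tag(meta: TextMetadata) -> str:
    """Serialize the metadata as a one-line header (assumed layout)."""
    attributes = " ".join(f'{key}="{value}"' for key, value in asdict(meta).items())
    return f"<text {attributes}>"


# Invented sample values, not taken from the corpus.
example = TextMetadata("Kazakhstan", "F", 2019, "city life", "essay")
print(format_tag(example))
```

Keeping the metadata in a structured record rather than free text makes it straightforward to filter texts by nationality, gender, or genre later on.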
The digitization of the data obtained from the universities was completed in January 2021. While the texts were being transferred to the computer environment, personal information (e.g., first and last name, university, residence address, age) was anonymized. Real information was replaced with randomly chosen names and numerical data; where the original item contained an orthographic mistake, care was taken to substitute an equivalent word containing the same mistake. To ensure that the core corpus included the widest possible variety of texts, a matrix was prepared that specified the students’ genders, nationalities, writing topics, and genres. The core corpus contains 43,518 words and 50,487 tokens.
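The source does not state how words and tokens were distinguished when the core corpus was counted. One common convention, assumed here purely for illustration, is that every punctuation mark counts as a token but not as a word. Under that assumption, a minimal counting sketch could look like this (the regular expression and the function name are hypothetical, not the tool used in the study):

```python
import re

# A word is a run of letters/digits, optionally continued after an apostrophe
# (as in Turkish "Ankara'ya"); any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def count_words_and_tokens(text: str) -> tuple[int, int]:
    """Return (word_count, token_count) under the assumed convention."""
    tokens = TOKEN_RE.findall(text)
    words = [token for token in tokens if token[0].isalnum()]
    return len(words), len(tokens)


words, tokens = count_words_and_tokens("Ben Ankara'ya gittim, sonra döndüm.")
print(words, tokens)  # 5 words; 7 tokens once the comma and full stop are included
```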
In addition to keyboarding the texts, the study examined the error annotation systems developed for other corpora and identified the codes that could be used for the Turkish learner corpus. The preliminary error tags were divided into five categories: spelling, punctuation, morphology, syntax, and vocabulary. As the study progressed, further categories were added to these five, bringing the total number of error labels to 58. The majority of codes are based on a three- or four-letter system in which the first letter represents the error domain and the following letters identify the error category, as well as the word class where relevant.
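The structure of such codes can be illustrated with a short parsing sketch. Everything specific in it is an assumption: the source does not list the actual letters, so the domain letters, the word-class letters, and the rule that a fourth letter always marks the word class are hypothetical stand-ins, and the codes that fall outside the three- or four-letter pattern are not handled.

```python
from typing import NamedTuple, Optional

# Hypothetical letter assignments; the real tag set may use different ones.
DOMAINS = {
    "S": "spelling",
    "P": "punctuation",
    "M": "morphology",
    "X": "syntax",
    "L": "vocabulary",
}
WORD_CLASSES = {"N": "noun", "V": "verb", "A": "adjective"}


class ErrorCode(NamedTuple):
    domain: str
    category: str              # raw two-letter category segment
    word_class: Optional[str]  # None when the code has no word-class letter


def parse_error_code(code: str) -> ErrorCode:
    """Split a three- or four-letter code into domain, category, and word class."""
    code = code.strip().upper()
    if len(code) not in (3, 4) or not code.isalpha():
        raise ValueError(f"expected a three- or four-letter code, got {code!r}")
    if code[0] not in DOMAINS:
        raise ValueError(f"unknown error domain letter {code[0]!r}")
    word_class = None
    if len(code) == 4:  # assumed: the fourth letter marks the word class
        if code[3] not in WORD_CLASSES:
            raise ValueError(f"unknown word class letter {code[3]!r}")
        word_class = WORD_CLASSES[code[3]]
    return ErrorCode(DOMAINS[code[0]], code[1:3], word_class)


print(parse_error_code("MCSN"))  # a made-up code, not one from the tag set
```

A hierarchical layout of this kind lets errors be counted at different levels of detail, for example all morphology errors together or only those affecting nouns.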
The designed error tagging system can be used to assess Turkish learners’ linguistic competence or to build other error-annotated Turkish learner corpora. Alongside these contributions to the field, the study has some limitations. It focused mainly on error labels related to the linguistic competence of Turkish learners, leaving discourse errors beyond the scope of the paper.