Can we automatically determine the proficiency of Hebrew second language learners?
Anat Prior - Department of Learning Disabilities
Shuly Wintner - Department of Computer Science
Artificial Intelligence
Digital Humanities
Natural Language Processing
Database Collection Grant 2022
Assessing the proficiency of second language learners is a complicated and labor-intensive endeavor that has several practical applications. In the Israeli context, for example, institutes of higher education require candidates to reach a proficiency threshold in order to be admitted. The National Institute for Testing and Evaluation (NITE) administers Hebrew proficiency tests for all universities and colleges in Israel. As part of this test, candidates write a short essay in Hebrew. Our motivating research question was whether computational evaluation of these texts can enable us to automatically rate the proficiency of the author.
Natural language processing technology has improved in recent years, to the point that it is now possible to accurately extract from texts different types of information about their authors. Such technology, which is based on machine learning algorithms, has been used for various classification tasks, including identifying the gender of an author or their native language (when they are writing in a second language). Machine learning algorithms require a collection of manually annotated texts, from which they can “learn” patterns that can then be used to classify unseen texts.
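To make this concrete, below is a minimal sketch of such a supervised text classifier, written in Python with scikit-learn. The essays, labels, and feature choices are illustrative placeholders, not the project's actual data, features, or models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Manually annotated training texts (placeholders, not the actual corpus).
essays = [
    "first essay by a native speaker",
    "second essay by a native speaker",
    "first essay by a learner",
    "second essay by a learner",
]
labels = ["native", "native", "non-native", "non-native"]

# Character n-grams are one plausible feature choice for a morphologically
# rich language such as Hebrew; the project's actual features may differ.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(essays, labels)

# Once trained, the model can classify unseen texts.
print(model.predict(["an essay the classifier has not seen before"]))
```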
In this project, we teamed up with NITE to curate a dataset of 3,000 Hebrew essays written by non-native Hebrew speakers whose native language is Arabic, French, or Russian. For comparison, we also included similar essays authored by native Hebrew speakers. All learner essays were manually graded for proficiency by NITE.
We defined three computational tasks: distinguishing between native and non-native authors; identifying the native language of the non-native authors; and determining their proficiency level. Crucially, we identified linguistic features of Hebrew texts that helped the classifiers perform these tasks with high accuracy. The features that turned out to be the most useful for the classifiers also shed light on the linguistic properties of non-native Hebrew texts. Specifically, we discovered typical errors made by Hebrew learners from different linguistic backgrounds. This knowledge can be utilized to design evidence-based pedagogical approaches to teaching Hebrew as a second language.
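As a rough illustration of how such feature analysis can work, the sketch below continues the example above: with a linear classifier, the learned weights indicate which features (here, character n-grams) are most strongly associated with each class. It assumes the fitted `model` from the previous sketch and is not the project's actual analysis.

```python
import numpy as np

# Pull the fitted steps out of the pipeline (make_pipeline lowercases the
# class names to form the step keys).
vectorizer = model.named_steps["tfidfvectorizer"]
classifier = model.named_steps["logisticregression"]

feature_names = vectorizer.get_feature_names_out()

# In the binary case, coef_ has shape (1, n_features): strongly positive
# weights pull toward classes_[1], strongly negative toward classes_[0].
weights = classifier.coef_[0]
print("positive weights indicate:", classifier.classes_[1])

top_indices = np.argsort(weights)[-10:][::-1]
for i in top_indices:
    print(f"{feature_names[i]!r}: {weights[i]:.3f}")
```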