Text readability is a widely studied research area in many languages, but it remains largely underexplored for Arabic. The main challenge in this area is to identify an optimal set of features that represent texts and allow their readability level to be assessed. To address this challenge, we propose in this study various feature selection methods that can effectively identify the most discriminating features for representing Arabic texts. The second aim of this paper is to evaluate different sentence embedding approaches (ArabicBERT, AraBERT, and XLM-R) and to compare their performance with that obtained using the selected linguistic features. We performed experiments with both SVM and Random Forest classifiers on two corpora dedicated to learning Arabic as a foreign language (L2). The results show that reducing the number of features improves the performance of the readability prediction models by more than 25% and 16% on the two corpora, respectively. In addition, the fine-tuned ArabicBERT model outperforms the other sentence embedding methods, but yields smaller gains than the feature-based models. Combining the embedding methods with the most discriminating features produced the best overall performance.
Keywords: Readability, Feature Selection, Sentence Embedding, Arabic Language, Education
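To make the described pipeline concrete, the following minimal sketch (not the authors' implementation) shows ANOVA-based feature selection feeding an SVM classifier with scikit-learn. The synthetic data, the number of retained features k, and the SVM kernel are illustrative assumptions; in the study, X would be a matrix of linguistic features extracted from Arabic texts and y the readability levels.

```python
# Minimal sketch, assuming a feature matrix X and readability labels y.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in for linguistic features and readability labels (hypothetical data).
X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=20, n_classes=3, random_state=0)

# Keep only the k most discriminating features, then classify with an SVM.
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=20),
    SVC(kernel="rbf"),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```

Wrapping the selector and the classifier in a single pipeline ensures that feature selection is refit on each cross-validation fold, avoiding information leakage when comparing feature subsets.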