Root extraction is an essential preliminary step in most Arabic applications, such as information retrieval systems, text mining, text classifiers, question answering systems, data compression, indexing, spelling checkers, text summarization, and machine translation. Any weakness in root extraction negatively affects the performance of these applications. Sonbol's Arabic root extraction algorithm achieves high accuracy and introduces a new classification of Arabic letters that minimizes affix ambiguity. However, comparing and testing the existing Arabic root extraction algorithms on unified datasets shows that they still need enhancement. Arabic root extraction is mainly pattern-based: the more patterns an algorithm has, the better its accuracy. In this study, we improve Sonbol's Arabic root extraction algorithm by enhancing its rules and increasing the number of its patterns. We use 4,320 patterns to extract the roots, which form the longest pattern list and were extracted from Thalji's corpus [1]. We test the new algorithm on Thalji's corpus, which contains 720,000 word-root pairs and was built mainly to test and compare Arabic root extraction algorithms. The new algorithm is compared with Sonbol's Arabic root extraction algorithm: Sonbol's algorithm achieves 68% accuracy, whereas the new algorithm achieves 92%.
Key words: Arabic Root Extraction Algorithm; Stemming; Arabic Language Processing.
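To illustrate what pattern-based root extraction means in general, the following is a minimal sketch and not the algorithm proposed in this study. It assumes the conventional placeholder letters ف, ع and ل mark the root positions in a pattern, that words carry no diacritics, and that a word matches a pattern only when both have the same length. The names `match_pattern`, `extract_root` and the three-entry `PATTERNS` list are hypothetical and stand in for the much larger pattern list described above.

```python
# Illustrative sketch only: a toy pattern matcher, not Sonbol's algorithm
# or the improved algorithm described in the abstract.

ROOT_SLOTS = "فعل"  # assumption: these placeholder letters mark root positions


def match_pattern(word, pattern):
    """Return the root letters if `word` fits `pattern`, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = []
    for word_char, pattern_char in zip(word, pattern):
        if pattern_char in ROOT_SLOTS:
            # A root slot: the word's letter at this position belongs to the root.
            root.append(word_char)
        elif word_char != pattern_char:
            # A fixed (affix) letter in the pattern must match the word exactly.
            return None
    return "".join(root)


def extract_root(word, patterns):
    """Try each pattern in order and return the first root found, or None."""
    for pattern in patterns:
        root = match_pattern(word, pattern)
        if root is not None:
            return root
    return None


# Hypothetical toy pattern list; the study itself uses 4,320 patterns.
PATTERNS = ["مفعول", "فاعل", "فعل"]

if __name__ == "__main__":
    for word in ["مكتوب", "كاتب", "كتب"]:
        print(word, extract_root(word, PATTERNS))
```

In this simplification, a word that fits none of the listed patterns yields no root, which is one way to see why a longer pattern list can raise accuracy.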