A PROPOSED MODEL OF SELECTING FEATURES FOR CLASSIFYING ARABIC TEXT

Ahmed M. D. E. Hassanein; Mohamed Nour

doi:10.5455/jjcit.71-1564059469

JJCIT. 2019; 5(3): 275-290

Abstract
Classification of Arabic text plays an important role in several applications. Text classification aims at assigning predefined classes to text documents. Unstructured Arabic text is easily processed by humans but is harder for machines to interpret and understand, so some preprocessing operations should be carried out before Arabic text or documents are classified. This work presents a proposed model for selecting features from the adopted Arabic documents; the words "text" and "documents" are used interchangeably. The adopted documents are taken from the Al-Khaleej-2004 corpus, which contains thousands of news documents in different domains such as economics, international, local and sport news. Preprocessing operations are applied to extract the highly weighted terms that best describe the content of the documents. The proposed model defines the most relevant features in four steps, which begin once the initial set of features has been defined from the weighted words. The first step calculates the correlation between each feature and the class feature; the most highly correlated features are chosen according to a threshold value, which reduces the number of features. The second step reduces the number of features again by calculating the intra-correlation among the resulting features. The third step selects the best features from the second step by applying logical operations, specifically logical AND or logical OR, to fuse the values of features depending on their structure, nature and semantics, which reduces the features further. The fourth step adopts the idea of document clustering: the features obtained from step three are placed in one cluster, and iterative operations then group them into two clusters. Each cluster can be further partitioned into two clusters, and so on, until the contents of the clusters no longer change. The contents of each cluster are fused together using the cosine rule, which reduces the overall number of features. This work adopts four types of classifiers, namely Naïve Bayes (NB), Decision Tree, CART and KNN. A comparative study of the behavior of the adopted classifiers on the resulting features is carried out, considering measurable criteria, namely precision, recall, F-measure and accuracy. The work is implemented using the WEKA and MATLAB software packages. In the obtained results, the CART classifier achieves the best performance and KNN the worst.

Keywords: Text Classification, Text Clustering, Feature Selection, Arabic Datasets, Machine Learning Methods, Performance Evaluation.
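
The first two steps of the model (choosing features by their correlation with the class, then pruning features that are highly correlated with one another) can be illustrated with a minimal Python sketch. Several details here are assumptions, since the abstract does not give them: Pearson correlation is used as the correlation measure, the two threshold values are arbitrary, the class is coded as a binary number, and the term weights are made-up toy data rather than values from the Al-Khaleej-2004 corpus. The names `pearson` and `select_features` are hypothetical and do not come from the paper, which was implemented in WEKA and MATLAB.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(columns, labels, class_threshold=0.5, redundancy_threshold=0.9):
    """Two-stage filter; both thresholds are illustrative assumptions.

    columns: dict mapping a term to its list of weights, one per document.
    labels:  numeric class code per document.
    """
    # Step 1: keep terms whose absolute correlation with the class
    # reaches the class threshold.
    relevance = {t: abs(pearson(col, labels)) for t, col in columns.items()}
    kept = [t for t in columns if relevance[t] >= class_threshold]

    # Step 2: visit the kept terms from most to least relevant and drop any
    # term that is highly correlated with a term already selected.
    kept.sort(key=lambda t: relevance[t], reverse=True)
    selected = []
    for t in kept:
        if all(abs(pearson(columns[t], columns[s])) < redundancy_threshold
               for s in selected):
            selected.append(t)
    return selected


# Toy data (invented for illustration): six documents, class 1 = sport, 0 = economics.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "goal":  [3, 2, 4, 0, 0, 1],
    "match": [3, 3, 4, 0, 0, 0],
    "bank":  [0, 0, 1, 3, 4, 2],
    "today": [1, 0, 1, 1, 0, 1],
}

print(select_features(columns, labels))  # prints ['match', 'bank']
```

In this toy run, "today" is discarded in step 1 because it is uncorrelated with the class, and "goal" is discarded in step 2 because it is highly correlated with the more relevant term "match". The later steps (logical fusion and cluster-based fusion with the cosine rule) are not sketched here, because the abstract does not specify them in enough detail.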
