Abstract:
Text classification is the process of assigning documents to a predefined set of categories based on their content. Many classification algorithms have been applied to Arabic and English text documents. The main objective of this research is to compare several classification algorithms on Arabic and English text documents that have the same content, called Parallel Documents, to determine which algorithm performs better on them.
In this research, we use four classification algorithms: Naïve Bayes, k-nearest neighbor, Sequential minimal optimization, and J48.
The first algorithm, Naïve Bayes, showed equal efficiency in classifying Arabic and English documents, but it took close to twice the time on the Arabic documents.
The second algorithm, k-nearest neighbor, showed high accuracy on English documents but lower accuracy on Arabic documents.
The third algorithm, Sequential minimal optimization, showed high accuracy on both English and Arabic documents. It was the best of the four algorithms, providing highly efficient classification with very close classification accuracy and classification time on the two languages.
The last algorithm, J48, showed equal efficiency in classifying Arabic and English documents, but it took almost twice as long to classify the English documents as the Arabic documents.
The experiments were carried out using the WEKA data mining tool on United Nations Parallel Documents. We used a platform with a 2.13 GHz Intel Core i3 CPU and 4 GB of RAM. According to these results, some of the classification algorithms achieve higher accuracy on English documents than on Arabic documents.
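The comparison described above, measuring both accuracy and classification time for the same classifier on parallel English and Arabic documents, can be illustrated with a minimal sketch. It is not the WEKA setup used in the experiments: it is a small pure-Python multinomial Naïve Bayes with add-one smoothing, and the toy documents, labels, and function names (train_naive_bayes, predict, evaluate) are hypothetical and introduced only for illustration.

```python
import math
import time
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (multinomial model)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in doc.split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(train_docs, train_labels, test_docs, test_labels):
    """Return (accuracy, elapsed seconds) for training plus classification."""
    start = time.perf_counter()
    model = train_naive_bayes(train_docs, train_labels)
    predictions = [predict(model, doc) for doc in test_docs]
    elapsed = time.perf_counter() - start
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return correct / len(test_labels), elapsed


# Hypothetical toy parallel corpus: each Arabic document mirrors the
# English document at the same position, so the labels are shared.
TRAIN_LABELS = ["economy", "economy", "sport", "sport"]
TEST_LABELS = ["economy", "sport"]
CORPORA = {
    "english": (
        ["market trade growth", "bank trade budget",
         "team match goal", "player match league"],
        ["trade budget", "match goal"],
    ),
    "arabic": (
        ["سوق تجارة نمو", "بنك تجارة ميزانية",
         "فريق مباراة هدف", "لاعب مباراة دوري"],
        ["تجارة ميزانية", "مباراة هدف"],
    ),
}

results = {
    language: evaluate(train, TRAIN_LABELS, test, TEST_LABELS)
    for language, (train, test) in CORPORA.items()
}

for language, (accuracy, seconds) in results.items():
    print(f"{language}: accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

Because both corpora share the same labels and document order, any difference in accuracy or time between the two runs comes from the language side only, which is the point of using parallel documents. Timings on a corpus this small are not meaningful and serve only to show where the measurement is taken.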