A Comparative Study for Two Stemming Algorithms for Arabic Wikipedia Documents Classification Based on Similarity Measures

Khamis, Mohamed Idris Ali; Supervisor, - Ali Ahmed Alfaki Abdalla

dc.contributor.author	Khamis, Mohamed Idris Ali
dc.contributor.author	Supervisor, - Ali Ahmed Alfaki Abdalla
dc.date.accessioned	2018-11-29T06:59:36Z
dc.date.available	2018-11-29T06:59:36Z
dc.date.issued	2018-11-01
dc.identifier.citation	Khamis, Mohamed Idris Ali. A Comparative Study for Two Stemming Algorithms for Arabic Wikipedia Documents Classification Based on Similarity Measures / Mohamed Idris Ali Khamis; Ali Ahmed Alfaki Abdalla. - Khartoum: Sudan University of Science & Technology, College of Computer Science and Information Technology, 2018. - 68 p.: ill.; 28 cm. - M.Sc.	en_US
dc.identifier.uri	http://repository.sustech.edu/handle/123456789/21938
dc.description	Thesis	en_US
dc.description.abstract	Text mining is an important field in information retrieval; it organizes the large number of text documents available on the internet to facilitate retrieval and increase efficiency. Text classification automatically assigns a category to new or unseen documents based on the content of the document itself. In text classification, text preprocessing is a fundamental step toward obtaining better results. Arabic text processing depends on stemming algorithms to achieve high accuracy. This research compares two stemming algorithms, a stem-based approach (Snowball light) and a root-based approach (Shereen Khoja), using three similarity measures: Euclidean distance, cosine similarity, and Pearson correlation distance. The research uses an Arabic Wikipedia dataset and TF-IDF as the weighting scheme to construct the vector space model that represents the weights of the selected text features. For evaluation, the research applies overall accuracy, average recall, average precision, and average F1 measure to assess the classified text documents. The document collection is divided into training and test documents in three experiments, (85% – 15%), (80% – 20%), and (90% – 10%) for training and test documents respectively. The results showed that the overall accuracy of the Shereen Khoja stemmer is better than that of the Snowball stemmer in all experiments, except for cosine similarity in the first experiment and Euclidean distance in the third experiment, where the Snowball stemmer achieves better accuracy.	en_US
dc.description.sponsorship	Sudan University of Science and Technology	en_US
dc.language.iso	en	en_US
dc.publisher	Sudan University of Science & Technology	en_US
dc.subject	Arabic Wikipedia Documents	en_US
dc.subject	Two Stemming Algorithms	en_US
dc.subject	Similarity Measures	en_US
dc.title	A Comparative Study for Two Stemming Algorithms for Arabic Wikipedia Documents Classification Based on Similarity Measures	en_US
dc.title.alternative	دراسة مقارنة لخوارزميتي تحليل الجذور لتصنيف ملفات الويكيبيديا العربية بناءً على مقاييس التشابه	en_US
dc.type	Thesis	en_US
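
The abstract names TF-IDF weighting and three similarity measures but the record does not show how they are computed. The following is a minimal Python sketch of those quantities, not the thesis implementation. It assumes the documents have already been tokenized and stemmed (the Snowball or Khoja step is not reproduced here), and it uses one common TF-IDF variant (relative term frequency times log(N/df)); the thesis may use a different variant. All function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build TF-IDF vectors for a list of non-empty, already-stemmed token lists.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Returns the sorted vocabulary and one weight vector per document.
    """
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_distance(a, b):
    """1 minus the Pearson correlation, computed as the cosine of the
    mean-centered vectors (a constant vector is treated as correlation 0)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return 1.0 - cosine_similarity(centered_a, centered_b)


# Toy stand-ins for stemmed documents (hypothetical tokens).
docs = [["ktb", "drs"], ["ktb", "lab"], ["ktb", "drs"]]
vocab, vectors = tf_idf_vectors(docs)
print(vocab)
print(cosine_similarity(vectors[0], vectors[1]))
print(pearson_distance(vectors[0], vectors[1]))
```

In a classification setting like the one the abstract describes, a test document would be assigned the category of the training documents it is closest to under the chosen measure; that step is left out of the sketch because the record does not specify the classifier.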