Please use this identifier to cite or link to this item: https://repository.sustech.edu/handle/123456789/10497
Title: Improving Stemming Algorithm for Arabic Text Search
Authors: Babiker, Afag Salah Aldeen
Supervisor - Mohammed Mustafa Ali
Keywords: Stemming Algorithm
Arabic Text Search
Arabic language
Stemming
Over-Stemming problem
Light 10
Issue Date: Aug-2014
Publisher: Sudan University of Science and Technology
Citation: Babiker, Afag Salah Aldeen. Improving Stemming Algorithm for Arabic Text Search/ Afag Salah Aldeen Babiker; Mohammed Mustafa Ali.-Khartoum : Sudan University of Science and Technology,network,2014.-88p:ill;28cm.M.Sc.
Abstract: Building an effective stemmer for the Arabic language has long been an active research topic in information retrieval (IR). This is because Arabic, as a Semitic language, has a very rich and complex morphology. Accordingly, several approaches have been developed for Arabic stemming and for determining the best way to index Arabic words. Arabic stemming techniques can be classified into two major groups: root-based techniques (also known as heavy or morphological-analysis-based stemming) and light stemming techniques (also known as affix-removal stemming). Each of the two approaches has major drawbacks. On the one hand, root-based stemming may result in over-stemming, in which words with different meanings are erroneously grouped together. On the other hand, light stemming often results in under-stemming, in which words with the same meaning are not stemmed to the same form. Nevertheless, the IR literature has concluded that light stemming, and light-10 in particular, is the best approach developed so far for indexing Arabic documents. Inspired by light-10, this research attempts to address some of the drawbacks identified in the light-10 stemmer by adding further prefixes and suffixes. These additional affixes were selected after a deep analysis and several experiments conducted by the developer to understand the nature of Arabic words. This step was accompanied by a new algorithm, also inspired by light-10, that controls which prefix and/or suffix should be stripped. Test results showed that the proposed Extended-10 stemmer yielded significantly better results than light-10, the best-known Arabic stemmer so far. The results also indicate that it is effective for improving Arabic IR.
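The affix-removal idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's Extended-10 algorithm, whose affix lists and control logic are not given in this record. The prefix and suffix lists below are the ones commonly attributed to light-10 and are an assumption here; the function name light_stem and the extra_prefixes / extra_suffixes parameters are hypothetical, included only to show where an extended stemmer might plug in additional affixes. Normalization steps (diacritic removal, alef unification) are omitted.

```python
# Simplified light stemmer sketch (assumption: affix lists as commonly
# attributed to light-10; NOT the Extended-10 lists from the thesis).
ARTICLES = ("ال", "وال", "بال", "كال", "فال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")


def light_stem(word, extra_prefixes=(), extra_suffixes=(), min_len=2):
    """Strip at most one prefix, then any matching suffixes, from `word`.

    `extra_prefixes` and `extra_suffixes` are hypothetical hooks for an
    extended affix list; they are empty by default.
    """
    # Strip the conjunction "و" only if at least 3 characters remain.
    if word.startswith("و") and len(word) - 1 >= 3:
        word = word[1:]

    # Try longer prefixes first; strip one if enough characters remain.
    prefixes = sorted(ARTICLES + tuple(extra_prefixes), key=len, reverse=True)
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break

    # Go through the suffix list once, in order, stripping each match
    # that leaves at least `min_len` characters.
    for suffix in SUFFIXES + tuple(extra_suffixes):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[: -len(suffix)]

    return word


if __name__ == "__main__":
    # "the library" and "libraries" are conflated to the same stem.
    print(light_stem("المكتبة"), light_stem("مكتبات"))
    # Under-stemming: the broken plural "كتب" (books) and the singular
    # "كتاب" (book) share no removable affix, so they stay distinct.
    print(light_stem("كتب"), light_stem("كتاب"))
```

In this sketch, the two forms of "library" map to one index term, while the broken plural pair remains unconflated, which is the under-stemming behaviour the abstract attributes to light stemmers.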
Description: Thesis
URI: http://repository.sustech.edu/handle/123456789/10497
Appears in Collections: Masters Dissertations : Computer Science and Information Technology

Files in This Item:
File: Improving Stemming Algorithm .pdf
Description: Research
Size: 1.46 MB
Format: Adobe PDF