Abstract:
Building an effective stemmer for the Arabic language has long been an active research topic in the information retrieval (IR) field. This is because Arabic, as one of the Semitic languages, has a very rich and complex morphology.
Accordingly, several approaches have been developed for Arabic stemming and for determining the best way to index Arabic words. Arabic stemming techniques can be classified into two major groups: root-based techniques (also known as heavy stemming or morphological-analysis-based stemming) and light stemming techniques (also known as affix-removal stemming). Each of the two approaches has a major drawback. On the one hand, root-based stemming may result in over-stemming, in which words with different meanings are erroneously grouped together. On the other hand, light stemming often results in under-stemming, in which words with the same meaning are not stemmed to the same form. Nevertheless, prior IR work has concluded that light stemming, and light-10 in particular, is the most effective approach developed so far for indexing Arabic documents.
Inspired by light-10, this research attempts to address some of the drawbacks identified in the light-10 stemmer. It does so by adding further prefixes and suffixes to those that light-10 removes. These additional affixes were selected after a detailed analysis and several experiments conducted by the developer to understand the nature of Arabic words. This step is accompanied by a new algorithm, also inspired by light-10, that controls which prefix and/or suffix should be stripped off.
Test results showed that the proposed Extended-10 stemmer yields significantly better results than light-10, the best-known Arabic stemmer so far. The results also indicate that the stemmer is effective in improving Arabic information retrieval.
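
To make the affix-removal idea concrete, the following is a minimal Python sketch of a light-10-style stemmer with room for extra affixes. It is an illustration under stated assumptions, not the Extended-10 algorithm itself. The base affix lists and length thresholds follow the light-10 description as it is commonly cited and should be checked against the original light-10 work. The function name `light_stem`, the parameters `extra_prefixes` and `extra_suffixes`, and the example suffix "كم" are hypothetical and do not come from this abstract, which does not list the added affixes.

```python
# Base affixes as commonly cited for light-10 (assumption; verify before use).
ARTICLES = ["وال", "بال", "كال", "فال", "ال", "لل"]
CONJUNCTION = "و"
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, extra_prefixes=(), extra_suffixes=()):
    """Strip affixes from `word`, light-10 style.

    `extra_prefixes` and `extra_suffixes` are hypothetical hooks standing in
    for the additional affixes an extended stemmer might remove.
    """
    # Remove a leading conjunction only if at least 3 characters remain.
    if word.startswith(CONJUNCTION) and len(word) - 1 >= 3:
        word = word[1:]

    # Remove at most one prefix, trying longer ones first,
    # and only if at least 2 characters remain.
    prefixes = sorted(list(ARTICLES) + list(extra_prefixes), key=len, reverse=True)
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break

    # Go through the suffix list once, in order, removing every suffix
    # whose removal leaves at least 2 characters.
    for suffix in list(SUFFIXES) + list(extra_suffixes):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]

    return word


if __name__ == "__main__":
    print(light_stem("والكتاب"))
    print(light_stem("المعلمون"))
    # "كم" is a purely illustrative extra suffix, not taken from the paper.
    print(light_stem("كتابكم", extra_suffixes=("كم",)))
```

The length checks are what keep a light stemmer from stripping letters that belong to a short word: in this sketch, "ولد" is left unchanged because removing the leading "و" would leave only two characters.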