Abstract:
Plagiarism has become a notorious problem in the global academic community. Detecting plagiarism in Arabic documents is a particularly challenging task because of the structural complexity of the language. This dissertation provides a model and framework for detecting plagiarism in Arabic documents, based on a logical representation of a document as paragraphs, sentences, and words. The main purpose of this research is to develop and implement the Arabic Documents Plagiarism Detection Model (ADPDM), which detects plagiarism in Arabic documents and provides a search mechanism for similar candidate documents within the corpus collection. The model relies on a pre-processing method that includes stop-word removal, stemming, and rooting. The implementation is built around a content-based method that consists mainly of fingerprinting the texts according to the specific features of the Arabic language and comparing their logical representations using heuristic algorithms. We introduce a plagiarism detection tool for the Arabic language that uses the Brian Kernighan and Dennis Ritchie (BKDR) hash function to hash chunks (3-grams). A second purpose of the logical document representation is to save computation time by avoiding unnecessary comparisons. For that reason, we define a heuristic algorithm for each level in the tree: document level, paragraph level, and sentence level. Similarity is measured using the Longest Common Substring (LCS) metric. The ADPDM system was tested and evaluated on a corpus of 100 documents: 90% of the documents were collected from the AraPlagDet (Arabic Plagiarism Detection) website and divided into three categories, dataset1 (small), dataset2 (medium), and dataset3 (large), and 10% of the documents were collected from the Decision Support System (DSS) document.
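The fingerprinting step described above can be illustrated with a minimal sketch. It assumes that a chunk is a run of three consecutive pre-processed words, that the BKDR seed is 131, and that hashes are kept to 31 bits; the abstract fixes none of these details. The function names `bkdr_hash`, `fingerprint`, and `overlap` are hypothetical, and the overlap score is a simple illustration rather than the heuristic or LCS comparison used in ADPDM.

```python
def bkdr_hash(text: str, seed: int = 131) -> int:
    """BKDR string hash over the UTF-8 bytes of text, kept to 31 bits.

    The seed 131 and the 31-bit mask are common choices, assumed here.
    """
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0x7FFFFFFF
    return h


def fingerprint(words: list[str], n: int = 3) -> set[int]:
    """Set of BKDR hashes, one per run of n consecutive words."""
    return {
        bkdr_hash(" ".join(words[i:i + n]))
        for i in range(len(words) - n + 1)
    }


def overlap(suspect: set[int], source: set[int]) -> float:
    """Share of the suspect's chunk hashes that also occur in the source.

    Hypothetical score used only for illustration.
    """
    if not suspect:
        return 0.0
    return len(suspect & source) / len(suspect)


# Words are assumed to be already stop-word filtered and stemmed.
source_words = ["كتب", "طالب", "بحث", "علم", "جامع"]
suspect_words = ["طالب", "بحث", "علم", "جامع", "جديد"]

score = overlap(fingerprint(suspect_words), fingerprint(source_words))
print(f"shared 3-gram fingerprints: {score:.2f}")
```

Comparing sets of integer hashes instead of raw text keeps each chunk comparison cheap, which is in line with the stated aim of avoiding unnecessary comparisons.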
The original documents were modified by random replacement to construct cases with different degrees of plagiarism, named dataset4. In this study, preliminary experiments were conducted using our tool, ADPDM, and WCopyFind. On dataset1, ADPDM detected 14% plagiarism in 501 seconds, whereas WCopyFind detected 0% in 135 seconds. On dataset2, ADPDM detected 8% in 1374 seconds, whereas WCopyFind detected 0% in 475 seconds. On dataset3, ADPDM detected 18% in 1430 seconds, whereas WCopyFind detected 6.33% in 271 seconds. On dataset4, ADPDM detected 94% in 1682.79 seconds, whereas WCopyFind detected 81.44% in 357 seconds. The main conclusion is that ADPDM gives the better detection results but is weaker in running time, while WCopyFind is weaker in detection but faster. Finally, the experimental results show strong performance for ADPDM, which achieved a recall of 0.780351, a precision of 0.994264, and an F-measure of 0.865688.
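For readers unfamiliar with the evaluation measures, the sketch below shows the standard definitions of precision, recall, and F-measure. The abstract does not state how the reported scores were computed or averaged, so the function `precision_recall_f` and the counts in the usage lines are hypothetical and are not taken from the experiments.

```python
def precision_recall_f(true_pos: int, false_pos: int, false_neg: int):
    """Standard precision, recall, and balanced F-measure from raw counts.

    true_pos:  plagiarised units that were detected
    false_pos: units flagged as plagiarised that were not
    false_neg: plagiarised units that were missed
    """
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts, chosen only to exercise the function.
p, r, f = precision_recall_f(true_pos=8, false_pos=2, false_neg=2)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

If scores are averaged per document or per dataset, the averaged F-measure need not equal the harmonic mean of the averaged precision and recall.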