Building the Multilingual Hadith Corpus to Enhance Performance of Information Retrieval System for Hadith

Hassan, Samah Mohamed Osman; Supervisor: Eric Atwell

dc.contributor.author	Hassan, Samah Mohamed Osman
dc.contributor.author	Supervisor: Eric Atwell
dc.date.accessioned	2017-10-23T07:16:26Z
dc.date.available	2017-10-23T07:16:26Z
dc.date.issued	2017-08-12
dc.identifier.citation	Hassan, Samah Mohamed Osman. Building the Multilingual Hadith Corpus to Enhance Performance of Information Retrieval System for Hadith / Samah Mohamed Osman Hassan; Eric Atwell. - Khartoum: Sudan University of Science and Technology, College of Computer Science and Information Technology, 2017. - 139 p.: ill.; 28 cm. - PhD	en_US
dc.identifier.uri	http://repository.sustech.edu/handle/123456789/18855
dc.description	Thesis	en_US
dc.description.abstract	Information retrieval (IR) systems retrieve information relevant to a user's query, which requires extracting related information from unstructured data such as text, sound, or images. An important problem in information retrieval, particularly from text files, is reliance on exact matching between the words in the query and the words in a text file. In many cases this loses results: files that contain synonyms of the query words, and that may be useful to the user, are not retrieved. The problem appears in most information retrieval systems for unstructured text and in most languages, and especially in Arabic. This research addresses information retrieval from the Hadith across several languages by building a parallel corpus containing the Hadith in Arabic together with translations in English, French and Russian. The corpus contains the text of 2030 Arabic Hadith along with their English, French, and Russian translations, giving 8120 Hadith texts (2030 in each of the four languages) and 2,470,913 words in total. Our matching algorithm is applied to the data for the retrieval process: it calculates the weight of each word in the query based on its importance and compares these weights with those of the existing documents, which have been processed to calculate the importance of the words in each document. A similarity coefficient is then calculated between the query and each existing document. To improve performance, the system keeps a dictionary of words that identifies all files containing each word, that is, an inverted index. We built a web portal that allows users to search via the World Wide Web. We evaluated the proposed solution using a selection of important concepts for which we had determined the expected results manually, without referring to the system.
The evaluation calculates the average precision and average recall for each language. The results showed that the proposed method retrieves well in all four languages: the average precision and average recall were 96.5% and 82% for Arabic, 98.4% and 90% for English, 97.5% and 91.7% for French, and 98% and 91% for Russian.	en_US
dc.description.sponsorship	Sudan University of Science and Technology	en_US
dc.language.iso	en	en_US
dc.publisher	Sudan University of Science and Technology	en_US
dc.subject	Building the Multilingual	en_US
dc.subject	Hadith Corpus to Enhance	en_US
dc.subject	Retrieval System for Hadith	en_US
dc.title	Building the Multilingual Hadith Corpus to Enhance Performance of Information Retrieval System for Hadith	en_US
dc.title.alternative	بناء مجاميع متعددة اللغات للحديث بغرض تحسين كفاءة نظام استرجاع الأحاديث النبوية	en_US
dc.type	Thesis	en_US
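
The retrieval and evaluation steps summarised in the abstract (word weighting, a similarity coefficient between query and documents, an inverted index, and precision and recall) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not name the weighting scheme or the similarity measure, so TF-IDF weights with a smoothed logarithm and cosine similarity are assumed here, and all function names, the whitespace tokenizer, and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def build_index(docs):
    """Return (inverted index, per-document term counts) for {doc_id: text}."""
    index = defaultdict(set)   # term -> set of doc ids containing it
    counts = {}                # doc id -> Counter of term frequencies
    for doc_id, text in docs.items():
        counts[doc_id] = Counter(tokenize(text))
        for term in counts[doc_id]:
            index[term].add(doc_id)
    return index, counts


def tfidf(term_counts, index, n_docs):
    """Weight each known term by tf * log(1 + N / df) (assumed scheme)."""
    return {
        term: tf * math.log(1 + n_docs / len(index[term]))
        for term, tf in term_counts.items()
        if term in index
    }


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, index, counts):
    """Rank only the documents that share at least one term with the query."""
    n_docs = len(docs)
    q_vec = tfidf(Counter(tokenize(query)), index, n_docs)
    candidates = set()
    for term in q_vec:
        candidates |= index[term]  # inverted index lookup
    scored = [
        (doc_id, cosine(q_vec, tfidf(counts[doc_id], index, n_docs)))
        for doc_id in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a manual judgement."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy documents invented for illustration; they are not corpus text.
docs = {
    "d1": "prayer at dawn",
    "d2": "charity and prayer",
    "d3": "fasting in ramadan",
}
index, counts = build_index(docs)
ranking = search("charity prayer", docs, index, counts)
```

In this sketch the inverted index limits scoring to documents that contain at least one query term, which is the performance benefit the abstract attributes to the word dictionary. Averaging the output of `precision_recall` over a set of manually judged queries would give per-language figures of the kind reported above.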