Abstract:
Information retrieval (IR) systems retrieve information relevant to a user's query, which requires extracting related unstructured information from data that may be text, sound, or images. An important problem in information retrieval, particularly from text files, is the reliance on exact matching between the words in the query and the same words in a given text file. In many cases this causes the loss of results: files that contain synonyms of the query words, and that may be useful to the user, are not returned. This problem appears in most information retrieval systems for unstructured text data and in most languages, and especially in Arabic. This research addresses the problem of information retrieval from the Hadith across multiple languages by building a multilingual parallel corpus containing the Hadith in Arabic together with translations in English, French, and Russian. The corpus contains the text of 2030 Arabic Hadith along with their English, French, and Russian translations, for a total of 8120 Hadith texts consisting of 2,470,913 words. Our matching algorithm is applied to the data for the retrieval process: it calculates the weight of each word in the query based on its importance and compares this with the existing documents, which have been processed in advance to calculate the importance of the words in each document. A similarity coefficient is then calculated between the query and the existing documents. To improve performance, the system maintains a dictionary of words that identifies all files containing each word, serving as an inverted index. We built a web portal that allows users to search via the World Wide Web. We evaluated the proposed solution using a selection of important concepts for which we determined the expected results manually in advance, without referring to the system.
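The weighting, similarity, and inverted-index steps described above can be illustrated with a minimal sketch. The abstract does not name the weighting scheme or the similarity coefficient, so this example assumes TF-IDF weights and cosine similarity as stand-ins. The function names (`tokenize`, `build_index`, `search`) and the toy documents are hypothetical and are not taken from the corpus or the system.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; a real system would need
    # language-specific processing, particularly for Arabic.
    return text.lower().split()


def build_index(docs):
    """docs maps doc_id -> text. Returns (inverted index, document weights, idf)."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Inverted index: each word maps to the set of documents containing it.
    inverted = defaultdict(set)
    for doc_id, counts in term_counts.items():
        for term in counts:
            inverted[term].add(doc_id)
    n_docs = len(docs)
    # Assumed importance measure: smoothed inverse document frequency.
    idf = {term: math.log(n_docs / len(ids)) + 1.0 for term, ids in inverted.items()}
    weights = {
        doc_id: {term: count * idf[term] for term, count in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return inverted, weights, idf


def cosine(a, b):
    # Cosine similarity between two sparse weight vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, inverted, weights, idf):
    # Weight the query words, ignoring words that never occur in the corpus.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    # Use the inverted index so only documents sharing a query word are scored.
    candidates = set()
    for term in q_vec:
        candidates |= inverted[term]
    scored = [(doc_id, cosine(q_vec, weights[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy documents for illustration only (not corpus text).
docs = {
    "d1": "patience and prayer",
    "d2": "charity and kindness",
    "d3": "prayer at night",
}
inverted, weights, idf = build_index(docs)
results = search("prayer", inverted, weights, idf)
```

Because candidates are drawn from the inverted index, documents that share no word with the query are never scored. This is also where the exact-matching limitation discussed above shows up: a document containing only a synonym of a query word is never a candidate.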
The evaluation calculates both the average precision and the average recall for each language. The results showed that the proposed method performs well for retrieval in all four languages: the average precision and average recall were 96.5% and 82% for Arabic, 98.4% and 90% for English, 97.5% and 91.7% for French, and 98% and 91% for Russian, respectively.
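As an illustration of the evaluation measures, the following sketch computes precision and recall for each test concept and averages them. It assumes that "average precision" and "average recall" mean the arithmetic mean of the per-concept values, which the abstract does not state explicitly. The function names and document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(runs):
    """runs is a list of (retrieved, relevant) pairs, one per test concept."""
    if not runs:
        raise ValueError("at least one test concept is required")
    pairs = [precision_recall(retrieved, relevant) for retrieved, relevant in runs]
    avg_precision = sum(p for p, _ in pairs) / len(pairs)
    avg_recall = sum(r for _, r in pairs) / len(pairs)
    return avg_precision, avg_recall


# Hypothetical judgments: the relevant sets play the role of the manually
# pre-determined results, the retrieved sets the role of the system output.
runs = [
    (["a", "b"], ["a", "b", "c", "d"]),  # precision 1.0, recall 0.5
    (["x", "y"], ["x"]),                 # precision 0.5, recall 1.0
]
avg_p, avg_r = average_precision_recall(runs)
```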