Abstract:
Information retrieval (IR) is the activity of satisfying a user's information needs
from a collection of unstructured data (text, images, and video). One disadvantage of
most IR systems is that the search is based only on the query terms entered by the
user. Thus, when an Arab user writes a query using a term in his or her dialect or in
its Modern Standard Arabic (MSA) form, only the documents that contain that exact query
term are retrieved. This problem appears clearly in Arabic scientific documents. For
example, documents that discuss the compiler concept may refer to it by any one of
the following Arabic words: " اٌجعِع " , " اٌّفغش " or " ُاٌّخشا ". Our research
therefore focuses on the Arabic language, as it is a widely spoken language with
different dialects.
We propose a pre-retrieval (offline) method that builds a statistical dictionary
for query expansion. The dictionary is built with statistical methods (a co-occurrence
technique and the Latent Semantic Analysis (LSA) model). The approach is flexible
because it rests on mathematical foundations. It aims to improve the effectiveness of
search results by retrieving the most relevant documents regardless of the dialect
used to formulate the query.
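
As a rough illustration of the co-occurrence idea only (not the authors' implementation, and leaving LSA aside), the sketch below builds a small expansion dictionary offline and then applies it to a query. It assumes that dialect variants rarely appear in the same document but share similar contexts, so it compares the co-occurrence profiles of terms by cosine similarity. The function names, the 0.8 threshold, and the placeholder tokens `variant_a` and `variant_b` are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count, for each term, the terms it shares a document with."""
    cooc = defaultdict(Counter)
    for doc in docs:
        terms = sorted(set(doc.split()))
        for a, b in combinations(terms, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def build_dictionary(docs, threshold=0.8):
    """Offline step: pair terms whose co-occurrence profiles are similar.

    The threshold is an assumed value chosen for this toy example.
    """
    cooc = build_cooccurrence(docs)
    dictionary = defaultdict(list)
    for a, b in combinations(sorted(cooc), 2):
        if cosine(cooc[a], cooc[b]) >= threshold:
            dictionary[a].append(b)
            dictionary[b].append(a)
    return dict(dictionary)


def expand_query(query, dictionary):
    """Query-time step: append the dictionary entries of each query term."""
    expanded = []
    for term in query.split():
        for t in [term] + dictionary.get(term, []):
            if t not in expanded:
                expanded.append(t)
    return expanded


# Toy corpus: variant_a and variant_b stand in for two regional
# spellings of the same concept; they never share a document.
docs = [
    "variant_a translates source code",
    "variant_b translates source code",
    "weather forecast rain",
]
dictionary = build_dictionary(docs)
expanded = expand_query("variant_a code", dictionary)
```

In this toy corpus the two placeholder variants have identical context profiles, so a query containing one is expanded with the other, while unrelated terms stay below the threshold.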
We designed and evaluated our method and the baseline methods on a small corpus
collected manually using the Google search engine. The evaluation used average
recall (Avg-R), average precision (Avg-P), and average F-measure (Avg-F).
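
For reference, the three measures can be sketched as follows. This is a minimal illustration that assumes each average is the mean of the per-query scores; the abstract does not state how the averages were computed, and the function names and document identifiers are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and F-measure for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def average_scores(runs):
    """Mean precision, recall, and F over (retrieved, relevant) pairs.

    Assumption: Avg-F is the mean of per-query F values, not the F of
    the averaged precision and recall.
    """
    scores = [precision_recall_f(ret, rel) for ret, rel in runs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Two toy queries with made-up document identifiers.
runs = [
    (["d1", "d2"], ["d1", "d2", "d3", "d4"]),  # precise but incomplete
    (["d1", "d5"], ["d1"]),                    # complete but noisy
]
avg_p, avg_r, avg_f = average_scores(runs)
```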
The results of our experiments indicate that the proposed method is effective at
improving retrieval by expanding the query with regional-variation synonyms, with an
accuracy of 83% in terms of Avg-F. The improvement over traditional IR systems is
also statistically significant, with the t-test yielding a value of 5.43594E-16.