Abstract:
Cross-language information retrieval (CLIR), where queries and documents are in
different languages, has become one of the major topics within the information
retrieval community. A key step in CLIR is translation. This research proposes a
term translation disambiguation method based on co-occurrence statistics for
query translation in Arabic-English CLIR.
There are several ways to perform query translation: employing machine
translation techniques, using parallel corpora, or using bilingual dictionaries. The
first two approaches are very labour-intensive. Hand-coding of linguistic,
semantic and pragmatic knowledge is required for a machine translation engine to
produce good translations. This can be quite overwhelming when the domain of
coverage is wide.
Building parallel collections for the second approach also requires a great deal
of work. With the increasing availability of machine-readable bilingual
dictionaries, the third approach has become viable for CLIR, but resolving term
ambiguity is a crucial step in it.
In this research, the ambiguity problem is resolved using co-occurrence statistics.
The co-occurrence technique is based on the hypothesis that correct translations
tend to co-occur in the target-language collection. Therefore, the valid translation
among the set of possible candidate translations of a given source query term is
expected to co-occur frequently with the translations of the other terms in the
same source query.
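The selection idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method as implemented in this research: the function name `disambiguate`, the scoring rule (summing, over the other query terms, the best association with any of their candidates), and the toy scores are all hypothetical.

```python
from typing import Callable, Dict, List


def disambiguate(
    candidates: Dict[str, List[str]],
    assoc: Callable[[str, str], float],
) -> Dict[str, str]:
    """Pick one target-language translation per source query term.

    candidates maps each source term to its dictionary translations.
    assoc(a, b) returns a co-occurrence score for two target-language terms.
    The scoring rule used here is an assumption made for illustration.
    """
    chosen: Dict[str, str] = {}
    for term, options in candidates.items():
        if not options:
            continue  # no dictionary entry, nothing to choose from

        def support(option: str) -> float:
            # For every other query term, take the strongest association
            # between this option and any of that term's candidates.
            return sum(
                max((assoc(option, other) for other in others), default=0.0)
                for other_term, others in candidates.items()
                if other_term != term
            )

        chosen[term] = max(options, key=support)
    return chosen


# Toy, made-up association scores between target-language terms.
toy_scores = {
    frozenset(("bank", "money")): 3.0,
    frozenset(("bank", "currency")): 2.0,
    frozenset(("shore", "money")): 0.1,
}


def toy_assoc(a: str, b: str) -> float:
    return toy_scores.get(frozenset((a, b)), 0.0)


result = disambiguate(
    {"src1": ["bank", "shore"], "src2": ["money", "currency"]},
    toy_assoc,
)
print(result)
```

With these toy scores, the candidate that co-occurs most strongly with the translations of the other query term is kept for each source term.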
The document set is first divided into fixed-size windows to overcome the problem
of varying document length. The degree of association is then calculated using the
mutual information measure, because it is simple and produces high correlation
between terms even when they do not appear very frequently in the document set.
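A minimal sketch of windowed mutual information, under assumptions: the abstract does not give the exact formula or window size, so this uses the common pointwise form MI(x, y) = log2(P(x, y) / (P(x) P(y))) with probabilities estimated as fractions of windows. All function names and the window size are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def split_into_windows(tokens: List[str], size: int) -> List[List[str]]:
    # Consecutive, non-overlapping windows; the last one may be shorter.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def count_windows(
    documents: Iterable[List[str]], size: int
) -> Tuple[Counter, Counter, int]:
    """Count, per window, which terms and which term pairs occur."""
    term_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    n_windows = 0
    for tokens in documents:
        for window in split_into_windows(tokens, size):
            n_windows += 1
            unique = sorted(set(window))  # count each term once per window
            term_counts.update(unique)
            pair_counts.update(combinations(unique, 2))
    return term_counts, pair_counts, n_windows


def mutual_information(
    x: str, y: str, term_counts: Counter, pair_counts: Counter, n_windows: int
) -> float:
    """Pointwise mutual information of x and y over windows (assumed form)."""
    joint = pair_counts[tuple(sorted((x, y)))]
    if joint == 0:
        return float("-inf")  # terms never share a window
    return math.log2(joint * n_windows / (term_counts[x] * term_counts[y]))


# Toy documents, already tokenized; window size 2 is arbitrary.
docs = [["a", "b", "c", "d"], ["a", "b", "e", "f"]]
terms, pairs, n = count_windows(docs, size=2)
print(mutual_information("a", "b", terms, pairs, n))
print(mutual_information("c", "d", terms, pairs, n))
```

In this toy data the pair that appears only once but always together scores higher than the more frequent pair, which illustrates why the measure can give strong association to terms that are not frequent in the collection.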
The results of the developed method show that co-occurrence statistics can reduce
the ambiguity problem and that the method works well in cases involving diacritics
and homonyms.