Abstract:
Efficient techniques for detecting similar data across many data sources have become one of the most important and challenging issues in areas such as databases, bioinformatics, and data mining. In this research, a three-phase framework for similarity detection is proposed:
In the first phase:
Data sources are collected from the web according to how closely they relate to a predetermined domain. The base source is the available data source that describes the domain.
In the second phase:
The sources obtained (the "External Sources") are filtered to select those with a greater probability of containing data that describes the domain, by examining the degree of similarity between the base source and each external source. Only the external sources whose simi_degree value is less than or equal to the average of the simi_degree values of all sources are selected.
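The selection rule of this phase can be sketched as follows. This is only a minimal illustration, assuming that a simi_degree value has already been computed for every external source; how that value is computed is not specified here, and the function name, source names, and scores below are hypothetical.

```python
from statistics import mean


def select_external_sources(simi_degree):
    """Return the names of sources whose simi_degree is <= the average.

    `simi_degree` maps a source name to its (precomputed) simi_degree value.
    The mapping and its values are illustrative assumptions.
    """
    if not simi_degree:
        return []
    average = mean(simi_degree.values())
    # Keep only the sources at or below the average, as the rule states.
    return [name for name, value in simi_degree.items() if value <= average]


# Hypothetical scores for three external sources.
scores = {"source_a": 0.25, "source_b": 0.5, "source_c": 0.75}
selected = select_external_sources(scores)
```

With these hypothetical scores the average is 0.5, so source_a and source_b are kept and source_c is discarded.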
In the third phase:
Content similarity is examined between the base source and all the external sources selected in phase 2, using the proposed "Probability Measure", which yields a value that determines whether the content of an external source is similar to the content of the base source. Experimental results show that the proposed similarity framework can achieve better-quality results than conventional approaches.