Abstract:
Web Usage Mining (WUM) is the application of data mining techniques to extract knowledge hidden in Web log files, such as user access patterns. However, the raw structure of these log files does not give an accurate picture of users’ accesses to Web sites. The WUM process goes through three phases: data pre-processing, pattern discovery, and pattern analysis. Data pre-processing filters and organizes the relevant information before Web mining algorithms are applied to the log data for pattern discovery and analysis. Pattern analysis is the process of analyzing the access patterns found in the log file. Many efforts have been made to cluster and classify Web log data. These efforts have resulted in various tools and techniques that can generate fixed reports from Web logs, but they typically do not allow ad-hoc analysis queries. Moreover, such tools cannot discover hidden access patterns embedded in the access logs, and when used as machine learning tools they do not adopt approaches such as ensemble learning. In addition, the tools developed for analysis using a data warehouse populate the fact table directly from the Web logs without a preprocessing step, which is necessary for data cleansing and enrichment. Therefore, the proposed work focuses on closing these gaps in the existing tools. It takes the SUST log file as a case study. A preprocessing step is conducted before the log file data is loaded into a database table, so that the data is ready for the mining and analysis tasks. The results obtained after pre-processing were satisfactory and contained valuable and adequate information about the log files. In the mining process, the following tasks are carried out: clustering, rule-based mining, and classification.
In clustering, the K-means and density-based clustering algorithms are used to cluster the Web log into two types of clusters: user clusters and page clusters. Density-based clustering was found to perform better than K-means clustering, both with and without feature selection. The Apriori algorithm is used for rule-based mining to discover relationships among the data. In this study, the accuracy of ensemble models, which take advantage of groups of base learners, is compared with the accuracy of several base classifiers. Stacking and voting are used as aggregation methods to combine the results of the multiple base learners. The results show that ensemble machine learning models using voting can significantly improve the classification of user sessions. To accomplish the task of pattern analysis, the log data is extracted, transformed, and loaded into a data warehouse. Online Analytical Processing (OLAP) is used to analyze the data loaded into the data warehouse. As for future work, there is a need to solve problems related to parallel processing, especially for the large-scale data produced by click streams from the growing usage of the Web. Also, because such datasets are complex and difficult to understand, visualization tools are needed to render the information in them in an easy and understandable way. In addition, an efficient way to analyze such large-scale and complex data is needed, which can be achieved through the use of parallel algorithms.
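
The voting aggregation mentioned above can be illustrated with a minimal sketch of hard (majority) voting over the outputs of several base learners. This is not the implementation used in the study: the abstract does not state which base classifiers or session labels were used, so the function name `majority_vote`, the labels "buyer" and "browser", and the three prediction lists are hypothetical and serve only to show how the votes are combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per session.

    predictions: one list per base classifier, each holding one predicted
    label per session (all lists have the same length).
    """
    combined = []
    # zip(*predictions) groups the votes cast for the same session.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break (an assumption of this sketch): keep the label from the
        # earliest classifier among those with the highest vote count.
        winner = next(label for label in votes if counts[label] == top)
        combined.append(winner)
    return combined


# Hypothetical outputs of three base classifiers for four user sessions.
base_predictions = [
    ["buyer", "browser", "browser", "buyer"],
    ["buyer", "buyer", "browser", "browser"],
    ["browser", "buyer", "browser", "buyer"],
]

ensemble_labels = majority_vote(base_predictions)
print(ensemble_labels)
```

Each session receives the label chosen by most base learners, so a single classifier's error on a session is outvoted when the other learners agree. Stacking differs in that it trains a further model on the base learners' outputs instead of counting votes.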