Abstract:
Class imbalance is a persistent challenge in machine learning and data mining. An imbalanced data set degrades the performance of data mining and machine learning techniques because overall accuracy and decision-making become biased toward the majority class, so minority class samples are misclassified or even treated as noise. Classifying imbalanced data becomes harder when the class of interest is relatively rare and has a small number of instances compared to the majority class. Moreover, in many real applications, such as medical diagnosis, fraud detection, and network intrusion detection, the cost of misclassifying the minority class is much higher than the cost of misclassifying the majority class.
In this dissertation, we first investigated the two-class classification problem. A series of experiments was conducted using imbalanced data in its original distribution, data balanced with sampling methods, and meta-learning methods. We then developed a hybrid ensemble that applies multiple resampling methods at various rates. Experimental results on many real-world two-class imbalanced data sets confirm that the proposed hybrid ensembles achieve better performance under different evaluation measures.
Next, we investigated the multi-class imbalance problem. A series of experiments was conducted using direct multi-class classification and meta-learning methods. We developed a hybrid Error Correcting Output Code ensemble that uses a weighted Hamming distance and the AdaBoost meta-learning method. Experimental results on many real-world multi-class imbalanced data sets show that the proposed hybrid ensemble improved classification performance on the minority classes and significantly outperformed the other tested methods.
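
As a rough illustration of how a weighted Hamming distance can be used to decode an Error Correcting Output Code, the following Python sketch assigns a sample to the class whose codeword is closest to the vector of binary classifier outputs. It is a minimal sketch under stated assumptions, not the implementation used in this dissertation: the code matrix, the per-bit weights, and the function names `weighted_hamming` and `decode` are hypothetical, and the way the weights are obtained (for example, from the reliability of each binary learner) is an assumption.

```python
from typing import Dict, Hashable, Sequence


def weighted_hamming(codeword: Sequence[int],
                     outputs: Sequence[int],
                     weights: Sequence[float]) -> float:
    """Sum the weights of the bit positions where codeword and outputs differ."""
    if not (len(codeword) == len(outputs) == len(weights)):
        raise ValueError("codeword, outputs and weights must have equal length")
    return sum(w for c, o, w in zip(codeword, outputs, weights) if c != o)


def decode(code_matrix: Dict[Hashable, Sequence[int]],
           outputs: Sequence[int],
           weights: Sequence[float]) -> Hashable:
    """Return the class label whose codeword has the smallest weighted distance."""
    return min(
        code_matrix,
        key=lambda label: weighted_hamming(code_matrix[label], outputs, weights),
    )


# Hypothetical 3-class code matrix with 4 binary learners (one bit per learner).
CODE_MATRIX = {
    "a": (1, 1, 0, 0),
    "b": (0, 0, 1, 1),
    "c": (1, 0, 1, 0),
}

# Assumed per-bit weights: a larger weight means a disagreement on that bit
# is penalized more heavily.
WEIGHTS = (1.0, 2.0, 0.5, 1.0)

if __name__ == "__main__":
    outputs = (1, 0, 0, 0)  # predictions of the four binary learners
    print(decode(CODE_MATRIX, outputs, WEIGHTS))
```

With uniform weights, the outputs `(1, 0, 0, 0)` are equally close to the codewords of classes "a" and "c". The assumed weights break this tie because a disagreement on the second bit costs more than one on the third bit.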