Abstract:
The main aim of this thesis is to develop suitable, high-performance credit scoring models (CSMs) for assessing the credit risk of personal loans at Sudanese commercial banks using data mining techniques.
Two Sudanese credit datasets were constructed from data provided by the Agricultural Bank of Sudan and Al Salam Commercial Bank. In addition to these two datasets, a German credit dataset was used as a benchmark.
Three data mining classification techniques were employed in this research: Artificial Neural Network (ANN), Support Vector Machine (SVM) and Decision Tree (DT). A Genetic Algorithm (GA) was also applied as a feature selection technique. Two validation methods were used to validate the proposed credit scoring models: split validation with two ratios (70:30 and 60:40) and 10-fold cross-validation.
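The two validation schemes described above can be illustrated with a minimal sketch. It is not the tooling used in this research; the function names `split_indices` and `k_fold_indices`, the fixed random seed, and the sample count of 1000 are assumptions made only for illustration.

```python
import random


def split_indices(n_samples, train_ratio, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_ratio)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each sample appears in exactly one test fold and in the training
    part of the remaining k - 1 iterations.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sample count (an assumption, not taken from the datasets).
train_70, test_30 = split_indices(1000, 0.70)
train_60, test_40 = split_indices(1000, 0.60)
folds = list(k_fold_indices(1000, k=10))
```

In split validation a model is trained once on the larger part and tested on the remainder, whereas in 10-fold cross-validation the reported measure is typically the average over the ten test folds.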
Combining GA with the specified classification techniques produced tables of attributes and their weights. Using these tables, a new reduced set of features was identified for each dataset (i.e. new reduced datasets were produced from the original datasets).
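As a rough illustration of wrapper-style feature selection with a GA, the sketch below evolves binary feature masks. It is a simplified stand-in, not the procedure used in this research: the GA parameters, the function name `ga_feature_selection`, and in particular `toy_fitness` are assumptions. In a real wrapper the fitness function would train ANN, SVM or DT on the selected features and return a validated accuracy.

```python
import random


def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search binary feature masks with a simple generational GA.

    `fitness` maps a mask (list of 0/1) to a score to maximise.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Binary tournament: pick two masks at random, keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        offspring = [best[:]]  # elitism: carry the best mask forward
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate and n_features > 1:
                point = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:point] + p2[point:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
    return best


# Hypothetical fitness: rewards three "informative" features and
# penalises mask size. A real wrapper would score a trained classifier.
INFORMATIVE = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask = ga_feature_selection(8, toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
```

The features whose bit is 1 in the best mask form the reduced feature set; a table of attribute weights can be read as such a mask, with one entry per attribute.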
Experiments in this research were conducted in three stages. In stage 1, the classification techniques were applied individually to each dataset. In stage 2, these techniques were combined with GA, and in stage 3 they were applied to the reduced datasets.
For each dataset, nine credit scoring models were developed at each stage. These models were compared for each dataset in terms of five evaluation measures: Accuracy, Precision (Defaulter), Precision (Non-defaulter), Type I error and Type II error. Based on these comparisons, the best models for each dataset were suggested.
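The five evaluation measures can be computed from a confusion matrix, as in the sketch below. The class labels, the function name `evaluate`, and the toy predictions are assumptions for illustration. The sketch also assumes one common convention for the error types: "defaulter" is the positive class, Type I error is the share of non-defaulters classified as defaulters, and Type II error is the share of defaulters classified as non-defaulters. Some credit scoring studies use the opposite convention.

```python
def evaluate(actual, predicted, positive="defaulter"):
    """Return the five measures for a binary credit scoring model."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Guard against empty classes or empty prediction groups.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision_defaulter": ratio(tp, tp + fp),
        "precision_non_defaulter": ratio(tn, tn + fn),
        # Assumed convention: non-defaulter flagged as defaulter.
        "type_1_error": ratio(fp, fp + tn),
        # Assumed convention: defaulter passed as non-defaulter.
        "type_2_error": ratio(fn, fn + tp),
    }


# Toy labels (hypothetical): 4 defaulters and 6 non-defaulters.
D, N = "defaulter", "non-defaulter"
actual = [D, D, D, D, N, N, N, N, N, N]
predicted = [D, D, D, N, N, N, N, N, N, D]
scores = evaluate(actual, predicted)
```

With these toy labels, three of four defaulters and five of six non-defaulters are classified correctly, so the accuracy is 0.8 and the Type II error under the assumed convention is 0.25.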
The experiments carried out in this research show that:
• For all datasets, combining GA as a wrapper feature selection technique with the ANN, SVM and DT classification techniques is more beneficial than applying these techniques individually. Applying the classification techniques to the reduced datasets does not significantly improve most of the models, in terms of the five evaluation measures, compared with the models obtained by applying these techniques to the original datasets. In addition, as is well known, the performance of each technique depends heavily on the nature of the dataset.