Parallel Support Vector Machine for big data Classification

Abdelkarim, Iatimad Mohammed Sati; Supervisor: Johnson Agbinya

URI: http://repository.sustech.edu/handle/123456789/26739

Date: 2020-11-12

Abstract:

With the rapid growth of data in various fields, big data analysis has become a major challenge for traditional data management systems and for scientists. This research addresses big data analysis using parallel computing applied to machine learning algorithms. A framework of parallel SVMs based on MapReduce is implemented on different datasets to perform supervised classification. Support Vector Machines (SVMs) are among the most commonly used methods for solving classification problems. The SVM is a suitable machine learning classifier because of its generalization ability and its capacity to classify data accurately. However, the traditional SVM is not appropriate for huge datasets due to its high computational complexity. This research studies the SVM algorithm and the Parallel Support Vector Machine (PSVM) and their applications in different big data fields. The PSVM is implemented on a Hadoop cluster running in the HPC center in Sudan. Three models are applied to four datasets for classification. The PSVM is applied to real data, and k-means clustering is then combined with the support vector machine. The real water quality dataset from the ministry of health and different water stations in Sudan (2006-2017) is used to classify whether the water is suitable for drinking or not. The Adult dataset is used to classify the income of a person. The diabetes dataset is used to classify whether the patient has diabetes or not. The cover type dataset is used to classify seven forest cover types in wilderness areas located in the Roosevelt National Forest of northern Colorado. In the numerical experiments, the PSVM is compared with the standard SVM and with k-means clustering combined with the SVM. Performance is compared in terms of accuracy and computation time. The results show that the parallel support vector machine gives the highest accuracy and reduces computation time.
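
The abstract does not describe how the MapReduce-based PSVM is structured, so the following is only a minimal sketch of one common pattern, not the thesis implementation. It assumes a cascade-style scheme: each map task trains an SVM on one data partition and emits its support vectors, and the reduce task retrains on the merged support vectors. The linear Pegasos-style trainer, the synthetic two-class data, the thread pool standing in for Hadoop mappers, and all function names are assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def score(w, x):
    # Linear decision value; the last weight acts as the bias term.
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def train_linear_svm(data, lam=0.01, epochs=30, seed=0):
    """Pegasos-style SGD for a linear SVM. data: list of (x, y), y in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * (len(data[0][0]) + 1)
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # shuffled pass over the data
            t += 1
            eta = 1.0 / (lam * t)
            violated = y * score(w, x) < 1        # hinge loss is active
            w = [(1 - eta * lam) * wi for wi in w]
            if violated:
                xa = list(x) + [1.0]              # constant feature for the bias
                w = [wi + eta * y * xi for wi, xi in zip(w, xa)]
    return w

def map_phase(partition):
    # Map: train locally and emit the points on or inside the margin.
    w = train_linear_svm(partition)
    svs = [(x, y) for x, y in partition if y * score(w, x) <= 1.0]
    return svs or partition                       # fall back if none were found

def reduce_phase(sv_lists):
    # Reduce: merge the local support vectors and train the final model.
    merged = [p for svs in sv_lists for p in svs]
    return train_linear_svm(merged)

def parallel_svm(data, n_parts=4):
    parts = [data[i::n_parts] for i in range(n_parts)]
    # Threads only mimic the map tasks here; they give no real speedup.
    with ThreadPoolExecutor() as pool:
        sv_lists = list(pool.map(map_phase, parts))
    return reduce_phase(sv_lists)

def make_blobs(n, seed):
    # Hypothetical synthetic data: two Gaussian blobs, not the thesis datasets.
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1
        data.append(([rng.gauss(2 * y, 0.5), rng.gauss(2 * y, 0.5)], y))
    rng.shuffle(data)
    return data

train, test = make_blobs(200, seed=1), make_blobs(100, seed=2)
w = parallel_svm(train)
accuracy = sum(1 for x, y in test if y * score(w, x) > 0) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

On a real Hadoop cluster the map and reduce functions would run as separate tasks over HDFS blocks, and a kernel SVM could replace the linear trainer. The point of the sketch is only that the final training step sees the merged support vectors rather than the full dataset, which is where the reduction in computation time would come from under this assumed design.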