Abstract:
Predicting protein secondary structure from the amino acid sequence remains an important problem. Determining the secondary structure of a protein in the laboratory is costly and time-consuming, so the development of precise and efficient prediction methods is very important. In this research we propose an approach that uses a clustering algorithm as a preprocessing step for machine learning methods, in order to address the unbalanced dataset problem in protein secondary structure prediction, and we compare the results obtained with and without clustering. We use position-specific scoring matrices (PSSMs) as features. The data are preprocessed with K-means clustering to prepare clusters that serve as input to support vector machine (SVM) and kernel logistic regression (KLR) models. In this study we achieved high prediction accuracy compared with previous work: Qtotal of 86.5% and 77.6% on α-helix and coil, respectively, with the SVM method, and Qtotal of 82.18%, 75.3% and 82.9% on α-helix, coil and extended beta-sheet, respectively, with the KLR method. The approach also achieves satisfactory performance in predicting secondary structure on the RS126 dataset, as measured by the Matthews correlation coefficient (MCC), Qpredicted and Qobserved.
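The abstract does not define its evaluation measures, so the following is only a minimal sketch of how they are commonly computed for one structure class against all others. It assumes the usual one-vs-rest definitions: Qtotal as the percentage of residues classified correctly, Qobserved as TP / (TP + FN), Qpredicted as TP / (TP + FP), and the standard MCC formula. The function name `binary_metrics` and the one-letter labels (H, C, E) are hypothetical and chosen for illustration.

```python
import math


def binary_metrics(observed, predicted, positive):
    """One-vs-rest metrics for a single class, e.g. 'H' for alpha-helix.

    Assumed definitions (not stated in the abstract):
      Qtotal     = 100 * (TP + TN) / N
      Qobserved  = 100 * TP / (TP + FN)
      Qpredicted = 100 * TP / (TP + FP)
      MCC        = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")

    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive and pred == positive:
            tp += 1
        elif obs == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1

    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Qtotal": 100.0 * (tp + tn) / total if total else 0.0,
        "Qobserved": 100.0 * tp / (tp + fn) if (tp + fn) else 0.0,
        "Qpredicted": 100.0 * tp / (tp + fp) if (tp + fp) else 0.0,
        # MCC is reported as 0 when any marginal count is zero.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Toy per-residue labels, purely illustrative (not RS126 data).
    observed = "HHHHCCCCEE"
    predicted = "HHHCCCCHEE"
    for name, value in binary_metrics(observed, predicted, "H").items():
        print(f"{name}: {value:.3f}")
```

Calling the function once per class (helix, coil, beta-sheet) gives per-class figures of the kind reported above; the toy strings stand in for real per-residue labels.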