Saeideh Mahmoudian*, Abdulaziz Yousef and Nasrollah Moghadam Charkari
Protein-Protein Interaction (PPI) data are essential for understanding cellular and biological processes. PPI identification therefore plays an important role in comprehending these processes and in uncovering the causes of progression of numerous diseases, such as COVID-19. Since experimental methods of PPI identification are time-consuming, costly, and inaccurate, numerous machine learning approaches have been developed for this purpose. These approaches reduce noise and yield more accurate and more general PPI predictions. This paper proposes a sequence-based framework for predicting PPIs, called GASA-SVM. The proposed framework uses a Support Vector Machine (SVM) with a Gaussian Radial Basis Function (RBF) kernel for classification. The performance and classification accuracy of an SVM depend strongly on the kernel parameters and on the selection of an appropriate subset of features. Principal Component Analysis (PCA) is employed as a feature extraction algorithm to reduce the training time and minimize the impact of noisy PPI data. A combination of the Genetic Algorithm (GA) and Simulated Annealing (SA) is then used to select the most significant features and to determine the optimal values of the SVM kernel parameters. The proposed method predicts PPIs with an accuracy of 96.373% on the Saccharomyces cerevisiae dataset and 75.31% on the KUPS (University of Kansas Proteomics Service) dataset, outperforming the other methods. According to the experimental results, GASA-SVM can effectively reduce the number of features while maintaining high prediction accuracy compared to the other available methods.
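As an illustration only, the hybrid GA-SA search over feature subsets and SVM parameters can be sketched as follows. This is a minimal sketch under stated assumptions, not the GASA-SVM implementation: the chromosome encoding (a binary feature mask plus log2-scaled C and gamma), the genetic operators, the cooling schedule, and all function names are hypothetical. In particular, `toy_fitness` is a stand-in for the cross-validated accuracy of an RBF-kernel SVM trained on the selected PCA features, which is what would score a candidate in practice.

```python
import math
import random


def random_chromosome(n_features, rng):
    # Assumed encoding: binary feature mask plus log2-scaled SVM parameters.
    return {
        "mask": [rng.random() < 0.5 for _ in range(n_features)],
        "log_c": rng.uniform(-5, 15),
        "log_gamma": rng.uniform(-15, 3),
    }


def mutate(ch, rng, rate=0.1, step=1.0):
    # Flip each mask bit with probability `rate`; perturb the parameters.
    return {
        "mask": [(not b) if rng.random() < rate else b for b in ch["mask"]],
        "log_c": ch["log_c"] + rng.gauss(0, step),
        "log_gamma": ch["log_gamma"] + rng.gauss(0, step),
    }


def crossover(a, b, rng):
    # One-point crossover on the mask; parameters are taken one from each parent.
    cut = rng.randrange(1, len(a["mask"]))
    return {
        "mask": a["mask"][:cut] + b["mask"][cut:],
        "log_c": a["log_c"],
        "log_gamma": b["log_gamma"],
    }


def anneal(ch, fitness, rng, temp, steps=5):
    # SA refinement: a worse neighbour is accepted with probability exp(delta / temp).
    current, cur_fit = ch, fitness(ch)
    best, best_fit = current, cur_fit
    for _ in range(steps):
        cand = mutate(current, rng)
        cand_fit = fitness(cand)
        delta = cand_fit - cur_fit
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_fit = cand, cand_fit
            if cur_fit > best_fit:
                best, best_fit = current, cur_fit
    return best


def gasa_search(fitness, n_features, pop_size=20, generations=30, seed=0):
    # GA loop with elitism; each child is refined by a short SA run.
    rng = random.Random(seed)
    pop = [random_chromosome(n_features, rng) for _ in range(pop_size)]
    temp = 1.0
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # the best half survives unchanged
        children = []
        for _ in range(pop_size - len(elite)):
            child = crossover(rng.choice(elite), rng.choice(elite), rng)
            children.append(anneal(mutate(child, rng), fitness, rng, temp))
        pop = elite + children
        temp *= 0.9  # geometric cooling (an assumed schedule)
    return max(pop, key=fitness)


def toy_fitness(ch):
    # Hypothetical stand-in for cross-validated SVM accuracy: rewards the
    # first three features, penalises extra ones and distant parameters.
    useful = sum(ch["mask"][:3])
    extra = sum(ch["mask"][3:])
    penalty = (ch["log_c"] - 1) ** 2 + (ch["log_gamma"] + 3) ** 2
    return useful - 0.2 * extra - 0.01 * penalty


best = gasa_search(toy_fitness, n_features=8)
print(best["mask"], 2 ** best["log_c"], 2 ** best["log_gamma"])
```

Applying SA to each child lets the search escape local optima early, while the temperature is high, and behave more greedily as it cools. Elitism guarantees that the best candidate found so far is never lost between generations.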