Journal of Shandong University (Engineering Science), 2020, Vol. 50, Issue (2): 76-82. doi: 10.6040/j.issn.1672-3961.0.2019.292

• Machine Learning and Data Mining •

Prediction of microRNA-binding residues based on Laplacian support vector machine and sequence information

Xin MA¹, Xue WANG²

  1. School of Statistics and Mathematics, Nanjing Audit University, Nanjing 211815, Jiangsu, China
  2. Experimental Center, Nanjing Audit University, Nanjing 211815, Jiangsu, China
  • Received: 2019-06-06 Online: 2020-04-20 Published: 2020-04-16
  • About the author: Xin MA (born 1982), female, from Zhenjiang, Jiangsu; associate professor, Ph.D., master's supervisor. Main research interests: bioinformatics and machine learning. E-mail: maxin@nau.edu.cn

Abstract:

A new semi-supervised learning method was proposed to predict microRNA-binding residues in protein sequences. The Laplacian support vector machine (LapSVM) algorithm was combined with a newly proposed hybrid feature to build the prediction model. The hybrid feature combined three types of information: secondary structure information, HKM features, and a newly proposed feature that merges amino acid physicochemical properties with evolutionary information. A performance comparison of the individual features indicated that the newly proposed feature contributed the most to the improvement in prediction. With feature selection, the LapSVM model achieved an accuracy of 88.72%, a sensitivity of 54.18%, and a specificity of 91.15%, significantly outperforming other approaches at microRNA-binding residue prediction.

Key words: microRNA-binding residues, Laplacian support vector machine, evolutionary information, physicochemical properties, feature selection
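
For orientation, the standard LapSVM objective from the manifold regularization framework can be written as below. This is the textbook formulation and is given here only as background; the exact variant, kernel, and hyperparameters used in the paper are not stated on this page. The symbols are assumptions of this sketch: l labeled residues, u unlabeled residues, a kernel space H_K, a graph Laplacian L = D - W built over all l + u samples, and weights gamma_A and gamma_I.

```latex
f^{*} = \arg\min_{f \in \mathcal{H}_K}
  \frac{1}{l} \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i f(x_i)\bigr)
  + \gamma_A \,\lVert f \rVert_K^{2}
  + \frac{\gamma_I}{(l+u)^{2}} \,\mathbf{f}^{\top} L \,\mathbf{f},
\qquad
\mathbf{f} = \bigl(f(x_1), \dots, f(x_{l+u})\bigr)^{\top}
```

The first term is the usual SVM hinge loss on labeled samples, the second controls complexity in the kernel space, and the third penalizes predictions that vary sharply between neighboring samples, which is where the unlabeled data enter.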

CLC number: Q811.4

Table 1. The Main dataset, composed of PDB and UniProt entries

Database   Protein sequence IDs
PDB        2LI8_A, 2N82_B, 3A6P_A, 3A6P_C, 3ADI_A, 3ADL_A, 3TRZ_A, 4L8R_C, 4NGB_A, 4QOZ_B, 4W5N_A
UniProt    O04379, O04492, P09651, P43243, P48432, P98175, Q01860, Q06787, Q1PRL4, Q2VB19, Q3UHX9, Q4R979, Q5D1E8, Q5RCW2, Q6GLC9, Q80U58, Q8CJF8, Q8K3Y3, Q8R205, Q8R418, Q9JIK5, Q9R0B7, Q9SKN5, Q9U489, Q9UGR2, Q9XGW1, Q9ZVD5

Table 2. Prediction performance of LapSVM models built on different features

Features              Accuracy   Sensitivity   Specificity   MCC
PSSM                  0.7112     0.2157        0.7482        0.021
PSSMPP                0.7820     0.3333        0.8155        0.096
PSSMPP+SS             0.8365     0.4510        0.8653        0.221
PSSMPP+HKM            0.8316     0.4142        0.8611        0.188
PSSMPP+SS+HKM         0.8752     0.4933        0.9035        0.291
Optimal 44 features   0.8872     0.5418        0.9115        0.334
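
The four columns of Table 2 are standard binary-classification measures. The sketch below shows how they are usually computed from confusion-matrix counts; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity: fraction of true binding residues that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-binding residues correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical, imbalanced counts (not from the paper): 100 binding
# residues against 1000 non-binding residues.
example = binary_metrics(tp=50, fn=50, tn=900, fp=100)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because binding residues are typically a small minority, accuracy is dominated by the non-binding class, so MCC is the more informative single number when comparing rows.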

Figure 1. MCC curve of the models built on the 165 feature subsets
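
A selection step of the kind summarized by Figure 1 can be sketched as follows: score a series of candidate feature subsets by MCC and keep the best one. This page does not say how the 165 subsets were generated, so the sketch assumes nested top-k subsets of a ranked feature list; `select_by_mcc` and `evaluate_mcc` are hypothetical names, and the scoring callback stands in for training and validating a model.

```python
from typing import Callable, List, Sequence, Tuple


def select_by_mcc(
    ranked_features: Sequence[str],
    evaluate_mcc: Callable[[Sequence[str]], float],
) -> Tuple[int, List[float]]:
    """Return (best subset size, MCC curve) over nested top-k subsets.

    ranked_features: features ordered from most to least important
                     (the ranking method is an assumption of this sketch).
    evaluate_mcc:    callback that trains/validates a model on the given
                     subset and returns its MCC.
    """
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        curve.append(evaluate_mcc(subset))
    # max() returns the first maximum, so ties resolve to the smaller subset.
    best_index = max(range(len(curve)), key=curve.__getitem__)
    return best_index + 1, curve


# Toy scoring function (hypothetical): MCC peaks at three features.
features = ["f1", "f2", "f3", "f4", "f5"]
best_k, mcc_curve = select_by_mcc(features, lambda s: 0.3 - 0.05 * abs(len(s) - 3))
print(best_k, mcc_curve)
```

Plotting the returned curve against subset size gives a figure of the same shape as Figure 1, with the peak marking the chosen subset.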

Table 3. Comparison of prediction performance across machine learning algorithms

Algorithm   Accuracy   Sensitivity   Specificity   MCC
RF          0.7398     0.3529        0.7687        0.072
SVM         0.4101     0.1373        0.4305        0.000
LapSVM      0.8872     0.5418        0.9115        0.334
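
What separates LapSVM from a plain SVM is a graph-Laplacian penalty computed over labeled and unlabeled samples together. The sketch below illustrates that idea with Laplacian regularized least squares (LapRLS), which swaps the hinge loss for a squared loss so the solution has a closed form. It is not the LapSVM solver used in the paper, and the RBF kernel, k-nearest-neighbor graph, function names, and hyperparameter values are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def knn_laplacian(X, k=5):
    """Unnormalized graph Laplacian L = D - W of a symmetrized kNN graph."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Column 0 of the sorted indices is the point itself, so skip it.
    neighbors = np.argsort(d2, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0
    W = np.maximum(W, W.T)
    return np.diag(W.sum(1)) - W


def fit_laprls(X, y_labeled, gamma_A=1e-2, gamma_I=1.0, k=5, gamma=0.5):
    """Closed-form LapRLS expansion coefficients.

    X holds the labeled samples first, then the unlabeled ones;
    y_labeled holds +1/-1 labels for the first len(y_labeled) rows.
    """
    n, l = X.shape[0], len(y_labeled)
    K = rbf_kernel(X, X, gamma)
    L = knn_laplacian(X, k)
    J = np.diag((np.arange(n) < l).astype(float))  # selects labeled rows
    Y = np.zeros(n)
    Y[:l] = y_labeled
    # Solve (J K + gamma_A l I + gamma_I l / n^2 L K) alpha = Y.
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    return np.linalg.solve(A, Y)


def predict_laprls(alpha, X_train, X_new, gamma=0.5):
    """Sign of the kernel expansion f(x) = sum_i alpha_i k(x, x_i)."""
    return np.sign(rbf_kernel(X_new, X_train, gamma) @ alpha)
```

A typical use is to stack a handful of labeled residues on top of a larger pool of unlabeled ones, call `fit_laprls`, and then score new samples with `predict_laprls`. Setting `gamma_I=0` removes the Laplacian term and reduces the model to ordinary kernel ridge classification on the labeled rows alone.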