

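Variable selection methods based on variable importance measurement from random forest and its application in diagnosis of tumor typing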

WANG Wen-jie, MA Jin-sha, GAO Qian, WANG Tong

Citation: WANG Wen-jie, MA Jin-sha, GAO Qian, WANG Tong. Variable selection methods based on variable importance measurement from random forest and its application in diagnosis of tumor typing[J]. CHINESE JOURNAL OF DISEASE CONTROL & PREVENTION, 2023, 27(3): 274-280. doi: 10.16462/j.cnki.zhjbkz.2023.03.005


doi: 10.16462/j.cnki.zhjbkz.2023.03.005
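Funding:

National Natural Science Foundation of China (81872715, 82073674)

Major Science and Technology Project of Shanxi Province (202005D121008)

Key Research and Development Program of Shanxi Province (202102130501003)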

Corresponding author: WANG Tong, E-mail: tongwang@sxmu.edu.cn

Chinese Library Classification (CLC) number: R181.3; R733.4


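  • Abstract:   Objective  To investigate variable selection methods based on random forest (RF) variable importance scores for high-dimensional omics data with a binary outcome, and to choose suitable methods for building an outcome prediction model.  Methods  First, according to the different goals of variable selection, simulations were used to compare how well minimal-optimal variable selection RF algorithms [recursive feature elimination (RFE)-RF and biosigner] and all-relevant variable selection RF algorithms (Boruta, vita, altmann, and r2vim) identify important variables in high-dimensional data. The strengths of the different methods were then combined to select genes related to the subtyping of diffuse large B-cell lymphoma (DLBCL) and to build a DLBCL subtype diagnosis model.  Results  The simulation study showed that the vita method had higher sensitivity and the biosigner method had higher positive predictive value. In the real-data analysis, the vita method selected 1 019 genes related to DLBCL subtyping, and the biosigner method then narrowed these to 77 genes. The area under the receiver operating characteristic (ROC) curve (AUC) of the resulting DLBCL subtype diagnosis model was 0.910.  Conclusions  The vita and biosigner methods can be used for the preliminary and final stages, respectively, of selecting genes related to DLBCL subtyping. The model built from the finally selected genes can effectively diagnose the DLBCL subtype.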
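The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it relies on, not the authors' code. It assumes the vita test builds its null distribution from the non-positive importance scores mirrored around zero, and it computes sensitivity and positive predictive value (PPV) for a selected variable set, as one would in a simulation where the truly relevant variables are known. The function names, the example scores, the set of relevant indices, and the 0.05 threshold are all hypothetical.

```python
from bisect import bisect_left


def vita_style_pvalues(importances):
    """Approximate p-values from a null distribution mirrored around zero.

    Assumption: variables with non-positive importance are noise, so their
    scores, together with the negated negative scores, form the null sample.
    """
    negatives = [v for v in importances if v < 0]
    zeros = [v for v in importances if v == 0]
    null = sorted(negatives + zeros + [-v for v in negatives])
    if not null:
        raise ValueError("no non-positive importances; null sample is empty")
    n = len(null)
    # p-value = share of null scores that are >= the observed score
    return [(n - bisect_left(null, v)) / n for v in importances]


def sensitivity_and_ppv(selected, relevant):
    """Sensitivity and PPV of a selected set against the known relevant set."""
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)
    sensitivity = true_pos / len(relevant) if relevant else float("nan")
    ppv = true_pos / len(selected) if selected else float("nan")
    return sensitivity, ppv


# Hypothetical importance scores for six variables (illustrative only).
scores = [0.30, 0.12, -0.02, 0.0, -0.01, 0.015]
pvals = vita_style_pvalues(scores)
selected = [j for j, p in enumerate(pvals) if p < 0.05]

# Hypothetical ground truth: variables 0, 1 and 5 are truly relevant.
sens, ppv = sensitivity_and_ppv(selected, relevant=[0, 1, 5])
print(selected, round(sens, 3), ppv)
```

In this toy input, the weakly relevant variable 5 has a small positive score that falls inside the mirrored null sample, so it is not selected; that lowers sensitivity while PPV stays high. In practice the importance scores would come from a fitted random forest rather than a hand-written list.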
  • Figure 1. The total number of selected variables in each simulated scenario

    Figure 2. Sensitivity of selected strongly relevant variables in each simulated scenario

    Figure 3. Sensitivity of selected weakly relevant variables in each simulated scenario

    Figure 4. Positive predictive value of selected variables in each simulated scenario

    Figure 5. Kaplan-Meier curves for the original true outcome and the model-predicted outcome

Publication history
  • Received: 2022-02-18
  • Revised: 2022-05-23
  • Published online: 2023-04-04
  • Issue date: 2023-03-10
