Journal of Shandong University(Engineering Science) ›› 2019, Vol. 49 ›› Issue (5): 105-111.doi: 10.6040/j.issn.1672-3961.0.2018.209

• Machine Learning & Data Mining • Previous Articles     Next Articles

An improved algorithm of maximal information coefficient based on dynamic equipartition

Yanxia MENG1(),Yuchen GUO1,Li WANG2,*()   

  1. 1. College of Information and Computer, Taiyuan University of Technology, Jinzhong 030600, Shanxi, China
    2. College of Data Science, Taiyuan University of Technology, Jinzhong 030600, Shanxi, China
  • Received:2018-05-25 Online:2019-10-20 Published:2019-10-18
  • Contact: Li WANG E-mail:2111428372@qq.com;462672475@qq.com
  • Supported by:
    国家自然基金项目(61872260);山西省重点研发计划项目(201703D421013)

Abstract:

In order to solve the problem of high computational time complexity of the maximal information coefficient algorithm, an improved algorithm of the maximal information coefficient(MIC) based on dynamic equipartition was presented. The scattered points shown in the grid were iterated and optimized by using dynamic mean division pairs of variables. The obtained mutual information entropy was regularized to obtain the optimal MIC value, and the multi-thread computation of the data set was carried out by using the POSIX parallel strategy, which made the computation more efficient in the computation of large data sets. Compared with the RapidMIC method on multiple data sets, the DE-MIC algorithm was faster and more efficient under the premise of preserving the generality and equitability of the maximal information coefficient algorithm.

Key words: dynamic equipartition, correlation, maximal information coefficient, statistical analysis

CLC Number: 

  • TP3-0

Fig.1

The process of the MIC algorithm"

Fig.2

The flow chart of the proposed DE-MIC algorithm"

Table 1

Definitions of two-variable functional relationships"

关系类型 具体形式
线性函数 y=x
抛物线函数 y=x2+x+1
立方函数 y=128(x-1/3)3-48(x-1/3)2-12(x-1/3)+2
指数函数 y=2x
周期+线性函数 y=sin(10πx)+x
随机数 (x, y)

Fig.3

Correlation values of two variables calculated by the MIC algorithm"

Fig.4

Correlation values of two variables calculated by the DE-MIC algorithm"

Fig.5

Correlation values of noisy of two variables calculated by the MIC algorithm"

Fig.6

Correlation values of noisy of two variables calculated by the DE-MIC algorithm"

Table 2

The statistics of experimental data"

数据集 变量数N 特征数M
Spellman 4 381 23
RNA 20 422 16
MLB2008 132 337

Fig.7

Comparison of the computation time of Parabolic function"

Fig.8

Comparison of the computation time of Cubic function"

Table 3

The required time to calculate one single variable by DE-MIC algorithm"

s
数据集 RapidMIC DE-MIC
Spellman 0.219 0.019
RNA 0.831 0.058
MLB2008 0.019 0.009

Table 4

The required time to calculate a pair of multi-variates by DE-MIC algorithm"

s
数据集 RapidMIC DE-MIC
Spellman 0.548 0.066
RNA 0.565 0.151
MLB2008 4.805 2.227

Table 5

The required time to calculate two sets of variable by DE-MIC algorithm"

s
数据集 RapidMIC DE-MIC
Spellman 4.852 2.571
RNA 5.056 3.354
MLB2008 43.918 30.398

Table 6

Performance comprehensive comparison of four algorithms"

算法 真实数据集 算法主要思想 优势 不足
KM-MIC[9] 美国联邦铁路事故数据集 利用k-means将数据集中两变量分别划分为不同数量的数据块 利用k-means聚类算法
使得算法计算时间复杂度降低
未考虑计算准确度
只针对于对相关系数要求不高的数据集
CHIMIC[16] Prostate1与2两个基因表达谱数据, WDBC与WPBC两个肿瘤组织图像数据集 利用动态规划算法对其中一个变量划分,并以卡方测验检测各分段点的显著性
利用均分对另一边两变量进行划分
利用卡方检验对网格数进行控制
算法的准确度较高
计算效率较低
应用于医学方面,求取其主要的疾病特征
RapidMIC[18] 表2 对其中一个变量进行动态规划
对另一变量进行均分
同时利用了POSIX并行策略
利用POSIX并行策略,使算法可以在所有线程执行 计算时间复杂度较高,算法效率低
DE-MIC 表2 利用动态均分分别对两变量划分不同的网格
同时利用了POSIX并行策略
利用动态均分原理快速寻找到两变量最优划分
算法时间复杂度降低,效率提升
准确度有待提高
针对大规模数据集
1 樊嵘, 孟大志, 徐大舜. 统计相关性分析方法研究进展[J]. 数学建模及其应用, 2014, 3 (1): 1- 12.
doi: 10.3969/j.issn.2095-3070.2014.01.002
FAN Rong , MENG Dazhi , XU Dashun . Survey of research process on statistical correlation analysis[J]. Mathematical Modeling and Its Applications, 2014, 3 (1): 1- 12.
doi: 10.3969/j.issn.2095-3070.2014.01.002
2 PEARSON K . Mathematical contributions to the theory of evolution:Ⅲ: regression, heredity and panmixia[J]. Philosophical Transactions of the Royal Society of London:Series A: Containing Papers of a Mathematical or Physical Character, 1895, 187, 253- 318.
3 MYERS J L , WELL A D . Research design and statistical analysis[M]. 2nd ed New York, USA: Lawrence Erlbaum Associates, 2003: 15- 89.
4 KRUSKAL W H . Ordinal measures of association[J]. Journal of the American Statistical Association, 1958, 53 (284): 814- 861.
doi: 10.1080/01621459.1958.10501481
5 CIGANOVIC N , BEAUTY N J , RENNER R . Smooth max-information as one-shot generalization for mutual information[J]. IEEE Transactions on Information Theory, 2014, 60 (3): 1573- 1581.
doi: 10.1109/TIT.2013.2295314
6 DELICADO P , SMREKAR M . Measuring non-linear dependence for two random variables distributed along a curve[J]. Statistics and Computing, 2009, 19 (3): 255- 269.
doi: 10.1007/s11222-008-9090-y
7 RESHEF D N , RESHEF Y A , FINUCANE H K , et al. Detecting novel associations in large data sets[J]. Science, 2011, 334 (6062): 1518- 1524.
doi: 10.1126/science.1205438
8 SPEED T . A correlation for the 21st century[J]. Science, 2011, 334 (6062): 1502- 1503.
doi: 10.1126/science.1215894
9 SHAO Fubo , LI Keping , et al. Railway accidents analysis based on the improved algorithm of the maximal information coefficient[J]. Intelligent Data Analysis, 2016, 20 (3): 597- 613.
doi: 10.3233/IDA-160822
10 PANG C N I , GOEL A , LI S S , et al. A multidimensional matrix for systems biology research and its application to interaction networks[J]. Journal of Proteome Research, 2012, 11 (11): 5204- 5220.
doi: 10.1021/pr300405y
11 KOREN O , GOODRICH J K , CULLENDER T C , et al. Host remodeling of the gut microbiome and metabolic cha-nges during pregnancy[J]. Cell, 2012, 150 (3): 470- 480.
doi: 10.1016/j.cell.2012.07.008
12 SAGL G , BLASCHKE T , BEINAT E , et al. Ubiquitous geo-sensing for context-aware analysis: exploring relationships between environmental and human dynamics[J]. Sensors, 2012, 12 (7): 9800- 9822.
doi: 10.3390/s120709800
13 EILER A , ZAREMBA-NIEDZWIEDZKA K , MARTINEZ-CARCIA M , et al. Productivity and salinity structuring of the microplankton revealed by comparative freshwater metagenomics[J]. Environmental Microbiology, 2014, 16 (9): 2682- 2698.
doi: 10.1111/1462-2920.12301
14 ZHANG Yi , JIA Shili , HUANG Haiyun , et al. A novel algorithm for the precise calculation of the maximal information coefficient[J]. Scientific Reports, 2014, (4): 6662.
15 WANG Shuliang , ZHAO Yiping . Analysing large biological datasets with an improved algorithm for MIC[J]. International Journal of Data Mining and Bioinformatics, 2015, 13 (2): 158.
doi: 10.1504/IJDMB.2015.071548
16 CHEN Yuan , ZENG Ying , LUO Feng , et al. A new algorithm to optimize maximal information coefficient[J]. PLos One, 2016, 11 (6): e0157567.
doi: 10.1371/journal.pone.0157567
17 ALBANESE D , FILOSI M , VISINTAINER R , et al. Minerva and minepy:a C engine for the MINE suite and its R, Python and MATLAB wrappers[J]. Bioinformatics, 2013, 29 (3): 407- 408.
doi: 10.1093/bioinformatics/bts707
18 TANG D , WANG M , ZHENG W , et al. RapidMic: rapid computation of the maximal information coefficient[J]. Evolutionary Bioinformatics, 2014, 10- 11.
19 SPELLMAN P T , SHERLOCK G , ZHANG M Q , et al. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization[J]. Molecular Biology of the Cell, 1998, 9 (12): 3273- 3297.
doi: 10.1091/mbc.9.12.3273
20 KALARI K R , ROSSELL D , NECELA B M , et al. Deep sequence analysis of non-small cell lung cancer: integrated analysis of gene expression, alternative splicing, and single nucleotide variations in lung adenocarcinomas with and without oncogenic KRAS mutations[J]. Frontiers in Oncology, 2012, 2- 12.
[1] AN Haiyun, ZHOU Qian, LIU Yufang, HUANG Cheng, CHEN Zhe, WU Qiuwei. A hybrid data-mechanism driven approach to active distribution network line loss calculation [J]. Journal of Shandong University(Engineering Science), 2025, 55(5): 51-61.
[2] XUE Bingbing, WANG Yong, YANG Weihao, WANG Chuan, YU Di, WANG Xu. Real-time expressway traffic data imputation and state prediction based on ETC system data [J]. Journal of Shandong University(Engineering Science), 2025, 55(3): 58-71.
[3] Xinzhang WU,Xiangyu LIANG,Hongyu ZHU,Dongdong ZHANG. Short-term wind power prediction based on CEEMDAN-GRA-PCC-ATCN [J]. Journal of Shandong University(Engineering Science), 2022, 52(6): 146-156.
[4] Wensheng LI, Xian WANG, Yuanze MI, Yongji CAO, Xiaoming LIU, Hengxu ZHANG, Zihan LIU. Method for generation planning with the temporal and spatial correlation of wind and solar power [J]. Journal of Shandong University(Engineering Science), 2022, 52(1): 111-119.
[5] ZHU Changming, YUE Wen, WANG Panhong, SHEN Zhenyu, ZHOU Rigui. Global and local multi-view multi-label learning with active three-way clustering [J]. Journal of Shandong University(Engineering Science), 2021, 51(2): 34-46.
[6] Baocheng LIU,Yan PIAO,Xuemei SONG. Adaptive fusion target tracking based on joint detection [J]. Journal of Shandong University(Engineering Science), 2020, 50(3): 51-57.
[7] Shengnan ZHANG,Lei WANG,Chunhong CHANG,Benli HAO. Image denoising based on 3D shearlet transform and BM4D [J]. Journal of Shandong University(Engineering Science), 2020, 50(2): 83-90.
[8] Hongbin LIU,Liu SONG. Study on modeling methods of wastewater treatment processes with canonical correlation analysis [J]. Journal of Shandong University(Engineering Science), 2020, 50(1): 101-108.
[9] Xinyu DONG,Hanyue CHEN,Jiaguo LI,Qingyan MENG,Shihe XING,Liming ZHANG. An unsupervised color image segmentation method based on fusion of multiple methods [J]. Journal of Shandong University(Engineering Science), 2019, 49(2): 96-101.
[10] Guangli LI,Bin LIU,Tao ZHU,Yi YIN,Hongbin ZHANG. Cross-media retrieval model based on choosing key canonical correlated vectors [J]. Journal of Shandong University(Engineering Science), 2018, 48(5): 38-46.
[11] HAN Xueshan, WANG Junxiong, SUN Donglei, LI Wenbo, ZHANG Xinyi, WEI Zhiqing. Nodal load forecasting method considering spatial correlation and redundancy [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(6): 7-12.
[12] CHEN Zhiwen, PENG Tao, YANG Chunhua , HE Zhangming, YANG Chao, YANG Xiaoyue. A fault detection method based on modified canonical correlation analysis [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(5): 44-50.
[13] ZHANG Milu, WANG Tianzhen, TANG Tianhao, XIN Bin. A mode-correlation principal component analysis for the fault detection of marine current turbine [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(5): 123-129.
[14] WANG Mengyuan, ZHANG Xiong, MA Liang, PENG Kaixiang. Fault diagnosis for industrial processes based on causal topological graph [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(5): 187-194.
[15] HOU Guangsong, GAO Jun, WU Yanda, ZHANG Xin, DENG Ying, LI Changgang, ZHANG Yaping. Correlation analysis between transmission line parameters and operation modes [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(4): 89-95.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] LI Kan . Empolder and implement of the embedded weld control system[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2008, 38(4): 37 -41 .
[2] LAI Xiang . The global domain of attraction for a kind of MKdV equations[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2006, 36(1): 87 -92 .
[3] YU Jia yuan1, TIAN Jin ting1, ZHU Qiang zhong2. Computational intelligence and its application in psychology[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2009, 39(1): 1 -5 .
[4] CHEN Rui, LI Hongwei, TIAN Jing. The relationship between the number of magnetic poles and the bearing capacity of radial magnetic bearing[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, 48(2): 81 -85 .
[5] WANG Bo,WANG Ning-sheng . Automatic generation and combinatory optimization of disassembly sequence for mechanical-electric assembly[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2006, 36(2): 52 -57 .
[6] ZHANG Ying,LANG Yongmei,ZHAO Yuxiao,ZHANG Jianda,QIAO Peng,LI Shanping . Research on technique of aerobic granular sludge cultivationby seeding EGSB anaerobic granular sludge[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2006, 36(4): 56 -59 .
[7] Yue Khing Toh1, XIAO Wendong2, XIE Lihua1. Wireless sensor network for distributed target tracking: practices via real test bed development[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2009, 39(1): 50 -56 .
[8] SUN Weiwei, WANG Yuzhen. Finite gain stabilization of singlemachine infinite bus system subject to saturation[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2009, 39(1): 69 -76 .
[9] SUN Yu-li,LI De-fa,ZUO Dun-wen,QI mei . [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2006, 36(6): 19 -23 .
[10] WANG Yong, XIE Yudong. Gas control technology of largeflow pipe[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2009, 39(2): 70 -74 .