Journal of Shandong University(Engineering Science) ›› 2019, Vol. 49 ›› Issue (5): 105-111.doi: 10.6040/j.issn.1672-3961.0.2018.209

• Machine Learning & Data Mining • Previous Articles     Next Articles

An improved algorithm of maximal information coefficient based on dynamic equipartition

Yanxia MENG1(),Yuchen GUO1,Li WANG2,*()   

  1. 1. College of Information and Computer, Taiyuan University of Technology, Jinzhong 030600, Shanxi, China
    2. College of Data Science, Taiyuan University of Technology, Jinzhong 030600, Shanxi, China
  • Received:2018-05-25 Online:2019-10-20 Published:2019-10-18
  • Contact: Li WANG E-mail:2111428372@qq.com;462672475@qq.com
  • Supported by:
    国家自然基金项目(61872260);山西省重点研发计划项目(201703D421013)

Abstract:

In order to solve the problem of high computational time complexity of the maximal information coefficient algorithm, an improved algorithm of the maximal information coefficient(MIC) based on dynamic equipartition was presented. The scattered points shown in the grid were iterated and optimized by using dynamic mean division pairs of variables. The obtained mutual information entropy was regularized to obtain the optimal MIC value, and the multi-thread computation of the data set was carried out by using the POSIX parallel strategy, which made the computation more efficient in the computation of large data sets. Compared with the RapidMIC method on multiple data sets, the DE-MIC algorithm was faster and more efficient under the premise of preserving the generality and equitability of the maximal information coefficient algorithm.

Key words: dynamic equipartition, correlation, maximal information coefficient, statistical analysis

CLC Number: 

  • TP3-0

Fig.1

The process of the MIC algorithm"

Fig.2

The flow chart of the proposed DE-MIC algorithm"

Table 1

Definitions of two-variable functional relationships"

关系类型 具体形式
线性函数 y=x
抛物线函数 y=x2+x+1
立方函数 y=128(x-1/3)3-48(x-1/3)2-12(x-1/3)+2
指数函数 y=2x
周期+线性函数 y=sin(10πx)+x
随机数 (x, y)

Fig.3

Correlation values of two variables calculated by the MIC algorithm"

Fig.4

Correlation values of two variables calculated by the DE-MIC algorithm"

Fig.5

Correlation values of noisy of two variables calculated by the MIC algorithm"

Fig.6

Correlation values of noisy of two variables calculated by the DE-MIC algorithm"

Table 2

The statistics of experimental data"

数据集 变量数N 特征数M
Spellman 4 381 23
RNA 20 422 16
MLB2008 132 337

Fig.7

Comparison of the computation time of Parabolic function"

Fig.8

Comparison of the computation time of Cubic function"

Table 3

The required time to calculate one single variable by DE-MIC algorithm"

s
数据集 RapidMIC DE-MIC
Spellman 0.219 0.019
RNA 0.831 0.058
MLB2008 0.019 0.009

Table 4

The required time to calculate a pair of multi-variates by DE-MIC algorithm"

s
数据集 RapidMIC DE-MIC
Spellman 0.548 0.066
RNA 0.565 0.151
MLB2008 4.805 2.227

Table 5

The required time to calculate two sets of variable by DE-MIC algorithm"

s
数据集 RapidMIC DE-MIC
Spellman 4.852 2.571
RNA 5.056 3.354
MLB2008 43.918 30.398

Table 6

Performance comprehensive comparison of four algorithms"

算法 真实数据集 算法主要思想 优势 不足
KM-MIC[9] 美国联邦铁路事故数据集 利用k-means将数据集中两变量分别划分为不同数量的数据块 利用k-means聚类算法
使得算法计算时间复杂度降低
未考虑计算准确度
只针对于对相关系数要求不高的数据集
CHIMIC[16] Prostate1与2两个基因表达谱数据, WDBC与WPBC两个肿瘤组织图像数据集 利用动态规划算法对其中一个变量划分,并以卡方测验检测各分段点的显著性
利用均分对另一边两变量进行划分
利用卡方检验对网格数进行控制
算法的准确度较高
计算效率较低
应用于医学方面,求取其主要的疾病特征
RapidMIC[18] 表2 对其中一个变量进行动态规划
对另一变量进行均分
同时利用了POSIX并行策略
利用POSIX并行策略,使算法可以在所有线程执行 计算时间复杂度较高,算法效率低
DE-MIC 表2 利用动态均分分别对两变量划分不同的网格
同时利用了POSIX并行策略
利用动态均分原理快速寻找到两变量最优划分
算法时间复杂度降低,效率提升
准确度有待提高
针对大规模数据集
1 樊嵘, 孟大志, 徐大舜. 统计相关性分析方法研究进展[J]. 数学建模及其应用, 2014, 3 (1): 1- 12.
doi: 10.3969/j.issn.2095-3070.2014.01.002
FAN Rong , MENG Dazhi , XU Dashun . Survey of research process on statistical correlation analysis[J]. Mathematical Modeling and Its Applications, 2014, 3 (1): 1- 12.
doi: 10.3969/j.issn.2095-3070.2014.01.002
2 PEARSON K . Mathematical contributions to the theory of evolution:Ⅲ: regression, heredity and panmixia[J]. Philosophical Transactions of the Royal Society of London:Series A: Containing Papers of a Mathematical or Physical Character, 1895, 187, 253- 318.
3 MYERS J L , WELL A D . Research design and statistical analysis[M]. 2nd ed New York, USA: Lawrence Erlbaum Associates, 2003: 15- 89.
4 KRUSKAL W H . Ordinal measures of association[J]. Journal of the American Statistical Association, 1958, 53 (284): 814- 861.
doi: 10.1080/01621459.1958.10501481
5 CIGANOVIC N , BEAUTY N J , RENNER R . Smooth max-information as one-shot generalization for mutual information[J]. IEEE Transactions on Information Theory, 2014, 60 (3): 1573- 1581.
doi: 10.1109/TIT.2013.2295314
6 DELICADO P , SMREKAR M . Measuring non-linear dependence for two random variables distributed along a curve[J]. Statistics and Computing, 2009, 19 (3): 255- 269.
doi: 10.1007/s11222-008-9090-y
7 RESHEF D N , RESHEF Y A , FINUCANE H K , et al. Detecting novel associations in large data sets[J]. Science, 2011, 334 (6062): 1518- 1524.
doi: 10.1126/science.1205438
8 SPEED T . A correlation for the 21st century[J]. Science, 2011, 334 (6062): 1502- 1503.
doi: 10.1126/science.1215894
9 SHAO Fubo , LI Keping , et al. Railway accidents analysis based on the improved algorithm of the maximal information coefficient[J]. Intelligent Data Analysis, 2016, 20 (3): 597- 613.
doi: 10.3233/IDA-160822
10 PANG C N I , GOEL A , LI S S , et al. A multidimensional matrix for systems biology research and its application to interaction networks[J]. Journal of Proteome Research, 2012, 11 (11): 5204- 5220.
doi: 10.1021/pr300405y
11 KOREN O , GOODRICH J K , CULLENDER T C , et al. Host remodeling of the gut microbiome and metabolic cha-nges during pregnancy[J]. Cell, 2012, 150 (3): 470- 480.
doi: 10.1016/j.cell.2012.07.008
12 SAGL G , BLASCHKE T , BEINAT E , et al. Ubiquitous geo-sensing for context-aware analysis: exploring relationships between environmental and human dynamics[J]. Sensors, 2012, 12 (7): 9800- 9822.
doi: 10.3390/s120709800
13 EILER A , ZAREMBA-NIEDZWIEDZKA K , MARTINEZ-CARCIA M , et al. Productivity and salinity structuring of the microplankton revealed by comparative freshwater metagenomics[J]. Environmental Microbiology, 2014, 16 (9): 2682- 2698.
doi: 10.1111/1462-2920.12301
14 ZHANG Yi , JIA Shili , HUANG Haiyun , et al. A novel algorithm for the precise calculation of the maximal information coefficient[J]. Scientific Reports, 2014, (4): 6662.
15 WANG Shuliang , ZHAO Yiping . Analysing large biological datasets with an improved algorithm for MIC[J]. International Journal of Data Mining and Bioinformatics, 2015, 13 (2): 158.
doi: 10.1504/IJDMB.2015.071548
16 CHEN Yuan , ZENG Ying , LUO Feng , et al. A new algorithm to optimize maximal information coefficient[J]. PLos One, 2016, 11 (6): e0157567.
doi: 10.1371/journal.pone.0157567
17 ALBANESE D , FILOSI M , VISINTAINER R , et al. Minerva and minepy:a C engine for the MINE suite and its R, Python and MATLAB wrappers[J]. Bioinformatics, 2013, 29 (3): 407- 408.
doi: 10.1093/bioinformatics/bts707
18 TANG D , WANG M , ZHENG W , et al. RapidMic: rapid computation of the maximal information coefficient[J]. Evolutionary Bioinformatics, 2014, 10- 11.
19 SPELLMAN P T , SHERLOCK G , ZHANG M Q , et al. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization[J]. Molecular Biology of the Cell, 1998, 9 (12): 3273- 3297.
doi: 10.1091/mbc.9.12.3273
20 KALARI K R , ROSSELL D , NECELA B M , et al. Deep sequence analysis of non-small cell lung cancer: integrated analysis of gene expression, alternative splicing, and single nucleotide variations in lung adenocarcinomas with and without oncogenic KRAS mutations[J]. Frontiers in Oncology, 2012, 2- 12.
[1] Xinyu DONG,Hanyue CHEN,Jiaguo LI,Qingyan MENG,Shihe XING,Liming ZHANG. An unsupervised color image segmentation method based on fusion of multiple methods [J]. Journal of Shandong University(Engineering Science), 2019, 49(2): 96-101.
[2] Guangli LI,Bin LIU,Tao ZHU,Yi YIN,Hongbin ZHANG. Cross-media retrieval model based on choosing key canonical correlated vectors [J]. Journal of Shandong University(Engineering Science), 2018, 48(5): 38-46.
[3] HAN Xueshan, WANG Junxiong, SUN Donglei, LI Wenbo, ZHANG Xinyi, WEI Zhiqing. Nodal load forecasting method considering spatial correlation and redundancy [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(6): 7-12.
[4] CHEN Zhiwen, PENG Tao, YANG Chunhua , HE Zhangming, YANG Chao, YANG Xiaoyue. A fault detection method based on modified canonical correlation analysis [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(5): 44-50.
[5] ZHANG Milu, WANG Tianzhen, TANG Tianhao, XIN Bin. A mode-correlation principal component analysis for the fault detection of marine current turbine [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(5): 123-129.
[6] WANG Mengyuan, ZHANG Xiong, MA Liang, PENG Kaixiang. Fault diagnosis for industrial processes based on causal topological graph [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(5): 187-194.
[7] HOU Guangsong, GAO Jun, WU Yanda, ZHANG Xin, DENG Ying, LI Changgang, ZHANG Yaping. Correlation analysis between transmission line parameters and operation modes [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(4): 89-95.
[8] JI Cuilian, HAN Jitian, YIN Jing, CHEN Changnian, REN Libo, KONG Lingjian. Fitting heat transfer coefficient correlation on flow boiling and error analysis for the helically coiled tube [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2014, 44(5): 83-87.
[9] KONG Chao1,2, ZHANG Huaxiang1,2*, LIU Li1,2. A semi-supervised image retrieval algorithm based onfeature fusion of the region of interest [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2014, 44(3): 22-28.
[10] WANG Danhui1, WANG An2*. The efficiency of power analysis attack based on S-boxes of block ciphers [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2014, 44(2): 6-11.
[11] LI Wu, HOU Zhiqiang*, WEI Guojian, YU Wangsheng. Algorithm of scale change objects tracking with adaptive bandwidth [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2014, 44(2): 28-34.
[12] PAN Pan1, WANG Xi-zhao2, ZHAI Jun-hai2. An improved induction algorithm based on ordinal decision tree [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2014, 44(1): 41-44.
[13] ZHU Na-na1, 2, ZHANG Hua-xiang1, 2*, LIU Li1, 2. Automatic image annotation based on approved FCM algorithm and Bayesian classification [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2013, 43(6): 12-16.
[14] ZHU Hong-jin1, FAN Hong-hui1, CHEN Xing-rui1, TAMURA-Yasutaka2. Image normalization based on local autocorrelation and its application to face detection [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2012, 42(5): 59-64.
[15] LI Jia-jia1,2,3, LU Xiao-ning1, LI Zuo-yong1, XIONG Dong-hong2,3, LIU Zhi-hong1, ZHAI Juan1,2,3. Application of the matter-element model based on normalized index  values to the assessment of water safety [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2012, 42(3): 126-132.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] ZHANG Ying-Chun, WANG Zuo-Xun, WANG Gui-Juan. High voltage cable temperature measurement system based on neural network controller[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2009, 39(5): 62 -67 .
[2] ZHANG Dao-qiang. Knowledge preserving embedding[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2010, 40(2): 1 -10 .
[3] DONG Cheng-xi,WU De-wei,HE Jing . Combat effectiveness evaluation method of satellite navigation
system based on rough fuzzy sets
[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2008, 38(4): 32 -36 .
[4] ZHANG Dun,HOU Jing-ming,LIU Han-sheng,XU Gen-hai . Experimental and numerical modeling for the location of abucket aerator on an arch structure[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2008, 38(2): 101 -105 .
[5] HUANG Jinchao. A new method for muti-objects image segmentation based on faster region proposal networks[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, 48(4): 20 -26 .
[6] LIU Tao, JIA Lei, ZHU WenXin.
A new lattice model of traffic flow with the effects of relative current
[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2009, 39(6): 63 -67 .
[7] ZHU Jun, ZHAO Jianfeng, CHEN Zhijian. Study on the circulating Pushover on ductile ability of pillar piers caused by axial compression ratio[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2016, 46(5): 110 -115 .
[8] MA Zongzheng, SHAO Fengxiang, WANG Xinli, YANG Anjie. Thermoelectric generator system based on engine exhaust gas[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2016, 46(2): 122 -127 .
[9] YANG Min,ZOU Zeng-da,SONG Wen-peng,SUN Xiao-lei .

The mechanical characteristics and microstructure of the partial liquid phase diffusing bonding joint of Si3N4/(Cu,Nb)/Ni/Inconel600high temperature alloy

[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2007, 37(6): 36 -40 .
[10] SHANG Fang, LIU Yun-Gang, ZHANG Cheng-Hui. Disturbance attenuation by output feedback fora class of uncertain nonlinear systems[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2010, 40(1): 19 -27 .