A Distribution-Free Independence Test for High Dimension Data
Abstract: Testing independence is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data is high dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper we propose a general framework for independence testing by first fitting a classifier that distinguishes the joint and product distributions, and then testing the significance of the fitted classifier. This framework allows us to borrow the strength of the most advanced classification algorithms developed by the modern machine learning community, making it applicable to high dimensional, complex data. By combining a sample split and a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that is independent of the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test compared with existing methods. We further apply the new test to a genetic dataset and an economics dataset, where the high dimensionality makes existing methods hard to apply.
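To make the idea in the abstract concrete, here is a minimal sketch of a classifier-based independence check, not the authors' actual statistic. Everything in it is an assumption made for illustration: the function names, the plain logistic regression standing in for "any classifier", the cyclic shift by one used as the fixed permutation, and the z-statistic built from held-out score differences with a lag-1 variance correction.

```python
import math
import numpy as np


def pair_features(x, y):
    """Stack x, y, and all pairwise products x_j * y_k as classifier inputs."""
    inter = (x[:, :, None] * y[:, None, :]).reshape(len(x), -1)
    return np.hstack([x, y, inter])


def fit_logistic(features, labels, steps=500, lr=0.1):
    """Plain gradient-descent logistic regression (stand-in for any classifier)."""
    design = np.hstack([np.ones((len(features), 1)), features])
    w = np.zeros(design.shape[1])
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-design @ w))
        w -= lr * design.T @ (prob - labels) / len(labels)
    return w


def score(w, x, y):
    """Fitted log-odds that a pair (x, y) comes from the joint distribution."""
    f = pair_features(x, y)
    return w[0] + f @ w[1:]


def independence_test(x, y):
    """Hypothetical helper: one-sided z-test, returns (z, p_value)."""
    half = len(x) // 2
    x_tr, y_tr, x_te, y_te = x[:half], y[:half], x[half:], y[half:]

    # Fixed permutation: pairing x_i with y_{i+1} mimics the product distribution.
    f_joint = pair_features(x_tr, y_tr)
    f_prod = pair_features(x_tr, np.roll(y_tr, -1, axis=0))
    labels = np.concatenate([np.ones(half), np.zeros(half)])
    w = fit_logistic(np.vstack([f_joint, f_prod]), labels)

    # On the held-out half, score differences have mean zero under independence.
    d = score(w, x_te, y_te) - score(w, x_te, np.roll(y_te, -1, axis=0))
    c = d - d.mean()
    # Neighbouring differences share one y, so add the circular lag-1 autocovariance.
    var = np.mean(c * c) + 2.0 * np.mean(c * np.roll(c, -1))
    if var <= 0:
        var = np.mean(c * c)  # fall back to the lag-0 variance
    z = d.mean() / math.sqrt(max(var, 1e-12) / len(d))
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p_value


# Toy data (made up for this sketch): y depends strongly on x.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(400, 2))
y_demo = x_demo + 0.5 * rng.normal(size=(400, 2))
z_demo, p_demo = independence_test(x_demo, y_demo)
print(f"z = {z_demo:.2f}, p = {p_demo:.3g}")
```

Because the classifier is fitted on one half and evaluated on the other, the reference distribution does not depend on which classifier is used; a stronger learner could replace `fit_logistic` without changing the rest of the sketch.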
Short bio: Zhanrui Cai is currently a postdoctoral researcher at Carnegie Mellon University. He received his PhD in statistics from Penn State University in May 2021. His research interests include high dimensional inference, model-free inference, and genetic data analysis.