Statistics
1. Causal Inference (GENG Zhi, LIN Wei)
Causal inference is one of the most important goals of many scientific studies. Statistical approaches for causal inference are often used to remove spurious associations, to evaluate causal effects, and to discover causal relationships. It is also one of the most challenging topics in statistics. Researchers from Peking University have made major contributions to causal inference in recent years.
In their earlier work [Chen-Geng-Jia, JRSSB (2007)], Geng and his collaborators first investigated issues related to the Yule-Simpson paradox and spurious associations and proposed the surrogate paradox, which is also called the intermediate variable paradox or the instrumental variable paradox. They pointed out that the existing criteria and conditions for surrogate endpoints cannot avoid the surrogate paradox. Continuing in this direction, Geng and his collaborators have in recent years proposed various new criteria for surrogates that avoid the surrogate paradox, and these results were published in top-tier statistical journals [Ju-Geng, JRSSB (2010); Jiang-Ding-Peng, JRSSB (2016)].
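The surrogate paradox can be made concrete with a small simulation: the treatment improves the surrogate and the surrogate is positively associated with the outcome, yet the treatment harms the true outcome, because an unmeasured confounder drives the surrogate-outcome association. The sketch below is an illustrative construction with hypothetical coefficients, not an example taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
u = rng.normal(size=n)                      # unmeasured confounder of surrogate and outcome
t = rng.integers(0, 2, n)                   # randomized treatment
s = 1.0 * t + 2.0 * u + rng.normal(size=n)  # treatment improves the surrogate
y = -1.0 * s + 4.0 * u + rng.normal(size=n) # surrogate actually harms the true outcome

print("effect of T on surrogate S:", round(s[t == 1].mean() - s[t == 0].mean(), 2))  # positive
print("observed corr(S, Y)       :", round(np.corrcoef(s, y)[0, 1], 2))              # positive
print("effect of T on outcome Y  :", round(y[t == 1].mean() - y[t == 0].mean(), 2))  # negative
```

Both observed quantities that surrogate criteria typically rely on (the treatment's effect on the surrogate and the surrogate-outcome association) are positive, yet the treatment effect on the outcome is negative.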
Another important problem is causal inference with missing data. Missing data arise in many applied studies, and if the missingness mechanism is nonignorable, the model of interest is often not identifiable without further assumptions. In a recent work [Miao-Ding-Geng, JASA (2016)], Geng and his collaborators investigated the identifiability of causal effects and statistical approaches for nonignorable missing data under a number of important model setups, such as normal and normal mixture models. This is a major advance for the field because identifiability is obtained without an instrumental variable, which earlier works often require but which is frequently infeasible to find in real applications.
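To see why nonignorable missingness is difficult, note that when the chance of observing a value depends on the value itself, the complete-case estimate is biased. The sketch below simulates such a mechanism purely for illustration; the logistic missingness model and its parameters are assumptions, and the code does not implement the identification strategy of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
y = rng.normal(1.0, 1.0, n)                     # outcome of interest, true mean = 1.0

# nonignorable mechanism: the probability of being observed depends on y itself
p_obs = 1.0 / (1.0 + np.exp(-(1.5 * y - 0.5)))  # larger y -> more likely observed
observed = rng.random(n) < p_obs

print("true mean of y        :", 1.0)
print("complete-case estimate:", round(y[observed].mean(), 3))   # biased upward
print("fraction observed     :", round(observed.mean(), 3))
```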
High-dimensional data are widespread in many applications, especially in genomic studies. The effort to use high-dimensional genomic data to dissect the causal genetic mechanisms of complex traits, however, has not always been successful and is often compromised by the critical issue of confounding. Many factors, such as unmeasured variables, experimental conditions, and environmental perturbations, may lead to spurious associations or distortion of true effects. Instrumental variable models provide an ideal framework for joint analysis and control of confounding in genomic studies, but high dimensionality of both covariates and instrumental variables poses great challenges. Lin and his collaborators recently developed a class of two-stage regularization methods for identifying and estimating important covariate effects while selecting and estimating optimal instruments [Lin-Feng-Li, JASA (2015)]. The proposed methodology extends the classical two-stage least squares method to high dimensions by exploiting sparsity through sparsity-inducing penalties in both stages.
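A minimal sketch of the general two-stage idea is given below: the first stage regresses each covariate on the candidate instruments with a lasso penalty, and the second stage regresses the outcome on the first-stage fitted values, again with a lasso penalty. This is only an illustration of penalized two-stage least squares under a simulated confounding setup; the data-generating model, the penalty levels, and the use of scikit-learn's Lasso are assumptions, and the sketch is not the estimator proposed in [Lin-Feng-Li, JASA (2015)].

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, q = 500, 30, 100                      # samples, covariates, candidate instruments
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # sparse true covariate effects

Gamma = np.zeros((q, p))
Gamma[np.arange(p), np.arange(p)] = 2.0     # each covariate has one strong instrument

Z = rng.normal(size=(n, q))                 # candidate instruments (e.g., genotypes)
U = rng.normal(size=n)                      # unmeasured confounder
X = Z @ Gamma + rng.normal(size=(n, p))
X[:, :10] += U[:, None]                     # confounder distorts the first 10 covariates
Y = X @ beta + 3.0 * U + rng.normal(size=n)

# Stage 1: lasso regression of each covariate on the instruments keeps only the
# instrument-predicted (exogenous) part of X and drops the confounded part.
X_hat = np.column_stack(
    [Lasso(alpha=0.1).fit(Z, X[:, j]).predict(Z) for j in range(p)]
)

# Stage 2: sparse regression of the outcome on the predicted covariates.
naive = Lasso(alpha=0.05).fit(X, Y)         # one-stage lasso, ignores confounding
two_stage = Lasso(alpha=0.05).fit(X_hat, Y)

print("true nonzero effects:", beta[:3])
print("naive lasso         :", naive.coef_[:3].round(2))   # typically distorted by U
print("two-stage lasso     :", two_stage.coef_[:3].round(2))
```

In this toy setup the naive one-stage lasso absorbs part of the confounder into the covariate effects, while the two-stage fit, which uses only the instrument-predicted part of the covariates, is typically much closer to the true sparse coefficients.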
2. Experimental Design (AI Mingyao)
Experimental design is a statistical methodology for planning experiments so that the collected data yield valid and objective conclusions efficiently. Experimental design procedures have broad applications in many scientific and applied areas. Researchers from Peking University, mainly Ai's group, have been very active in experimental design research. In recent years, they have made major contributions to optimal designs for interference models and to the theory of Latin hypercube sampling.
In many agricultural experiments, the treatment assigned to a particular plot may also affect neighboring plots. To adjust for the biases caused by these neighbor effects, the interference model is widely adopted, and identifying optimal designs for interference models is a fundamental problem for such experiments. In [Li-Zheng-Ai, AoS (2015)], Ai and his collaborators studied optimal circular designs for the proportional interference model, in which the neighbor effects of a treatment are proportional to its direct effect; a small illustration of this model is sketched below. Kiefer's equivalence theorems for both directional and unidirectional models were established, and computer programs can easily be developed to find optimal designs based on these theorems. In [Zheng-Ai-Li, AoS (2017)], Ai and his collaborators studied optimal circular designs for the general interference model. Circular neighbor balanced designs at distances 1 and 2 (CNBD2) are two major classes of designs for estimating direct treatment effects and can be viewed as two special classes of pseudo symmetric designs. Ai and his collaborators showed that CNBD2 is highly efficient among all possible designs when the error terms are homoscedastic and uncorrelated, but not efficient when the error terms are correlated. They further established equivalent conditions for any design, pseudo symmetric or not, to be universally optimal for any size of experiment and any covariance structure of the error terms.
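The sketch below simulates responses from a proportional interference model on circular blocks, with left and right neighbor effects equal to a common multiple gamma of the direct effects, and recovers the direct-effect contrasts and the proportionality constant by ordinary least squares. The design used is a random circular assignment rather than an optimal one, and all effect sizes are hypothetical; the code only illustrates the model structure, not the optimality theory of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
t, b, k = 4, 20, 12                        # treatments, circular blocks, plots per block
tau = np.array([0.0, 1.0, -0.5, 2.0])      # hypothetical direct treatment effects
gamma = 0.3                                # neighbor effects are gamma times direct effects

design = rng.integers(0, t, size=(b, k))   # random circular assignment (not an optimal design)

rows, y = [], []
for i in range(b):
    for j in range(k):
        d = design[i, j]
        l, r = design[i, (j - 1) % k], design[i, (j + 1) % k]   # circular neighbors
        # proportional interference model: block + direct effect + neighbor effects + noise
        y.append(2.0 * i + tau[d] + gamma * (tau[l] + tau[r]) + rng.normal(0, 0.3))
        block = np.eye(b)[i]
        direct = np.eye(t)[d][1:]                    # treatment 0 is the baseline
        neigh = (np.eye(t)[l] + np.eye(t)[r])[1:]    # how many neighbors carry each treatment
        rows.append(np.concatenate([block, direct, neigh]))

X, y = np.array(rows), np.array(y)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)         # ordinary least squares fit
direct_hat, neigh_hat = coef[b:b + t - 1], coef[b + t - 1:]
print("true direct contrasts     :", tau[1:] - tau[0])
print("estimated direct contrasts:", direct_hat.round(2))
print("neighbor/direct ratios    :", (neigh_hat / direct_hat).round(2), "(near gamma = 0.3)")
```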
Orthogonal array-based Latin hypercube sampling (LHS) is widely used for computer experiments. Because of its stratification on multivariate margins in addition to univariate uniformity, the associated samples may provide better estimators of the overall mean of a complex function on a domain. In [Ai-Kong-Li, SS (2016)], Ai and his collaborators developed a unified expression for the variance of the sample mean under LHS based on an orthogonal array of strength t. An approximate estimator of this variance was also established, which is helpful for constructing confidence intervals for the overall mean. They extended these statistical properties to three types of LHS: strong orthogonal array-based LHS, nested orthogonal array-based LHS, and correlation-controlled orthogonal array-based LHS.
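The variance reduction that motivates this line of work can be seen even for ordinary (non-orthogonal-array) LHS. The sketch below compares, by repeated simulation, the variance of the sample mean under simple random sampling and under ordinary Latin hypercube sampling; the test function, sample size, and number of replications are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # smooth test function on [0, 1]^3 whose overall mean we want to estimate
    return np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2 + np.exp(-x[:, 2])

def lhs(n, d, rng):
    # ordinary Latin hypercube sample: one point in each of n equal slices per dimension
    ranks = rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T
    return (ranks + rng.random((n, d))) / n

n, d, reps = 50, 3, 2000
est_srs = [f(rng.random((n, d))).mean() for _ in range(reps)]   # simple random sampling
est_lhs = [f(lhs(n, d, rng)).mean() for _ in range(reps)]       # Latin hypercube sampling
print("variance of sample mean, simple random sampling:", np.var(est_srs))
print("variance of sample mean, Latin hypercube       :", np.var(est_lhs))
```

Because the test function is close to additive, the univariate stratification of LHS already removes most of the variance; orthogonal array-based LHS further stratifies multivariate margins, which is what the cited work quantifies.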
3. Statistical Methods in Computational Biology and Bioinformatics (DENG Minghua, XI Ruibin)
Recent breakthroughs in biological technologies allow biologists to accumulate large amounts of data in a short period of time. Statistical analysis of these big biological data plays a critical role in many biological studies. Recently, researchers from Peking University have developed a series of statistical tools for analyzing such data, and these tools have been widely used in biological research. Their work focuses mainly on statistical methods for biological network analysis and genomic analysis.
Many biological problems are modeled as networks, and network analysis plays an important role in these studies. In recent years, Deng's group and Xi's group have published a series of papers on biological network analysis in top-tier journals. In gene co-expression network analysis, Deng and his colleagues developed a vector-based co-expression network construction method, VCNet [Wang-Fang-Tang-Deng, Bioinformatics (2017)]. A unique advantage of VCNet is that it can handle cases in which the number of samples is smaller than the number of exons, and it is significantly more powerful than existing methods.
In metagenomic studies, one can only observe the relative abundances of different microbial taxa, while biologists are often interested in the correlation network of the underlying communities; such data are called compositional data. Direct application of the traditional Pearson correlation to compositional data can lead to spurious correlations, as illustrated in the sketch below. In [Fang-Huang-Zhao-Deng, Bioinformatics (2016)], Deng and his collaborators developed a lasso-based method, CCLasso, that directly estimates the correlation matrix of the latent absolute abundances; the method has wide applications in metagenomic studies.
In another work [Yuan-Xi-Chen-Deng, Biometrika (2017)], Deng, Xi, and their collaborators studied differential networks and developed a new loss function, the D-trace loss, for estimating them. Many real biological networks change under different genetic and environmental conditions, and investigating the differential network helps to gain insight into biological systems. In this work, each network is modeled as a Gaussian graphical model and the differential network is modeled as the difference of the two precision matrices. The paper showed that, with the lasso penalty and under regularity conditions, the D-trace loss yields consistent estimators even when the network size grows with the sample size. An efficient algorithm based on the alternating direction method of multipliers was also developed.
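The compositional-data issue mentioned above can be demonstrated in a few lines: two taxa that are independent in absolute abundance appear correlated once each sample is normalized to proportions, because all proportions share the same denominator. The simulation below is purely illustrative and is unrelated to the CCLasso estimator itself; the log-normal abundance model and the dominant taxon are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 30
# latent absolute abundances of p taxa (log-normal and mutually independent here)
abs_abund = np.exp(rng.normal(0.0, 1.0, size=(n, p)))
abs_abund[:, 0] *= 50.0          # one dominant, highly variable taxon
# observed data are compositions: each sample is normalized to sum to one
comp = abs_abund / abs_abund.sum(axis=1, keepdims=True)

# Pearson correlation between taxa 1 and 2 (independent in absolute abundance)
r_abs = np.corrcoef(abs_abund[:, 1], abs_abund[:, 2])[0, 1]
r_comp = np.corrcoef(comp[:, 1], comp[:, 2])[0, 1]
print("correlation of latent absolute abundances:", round(r_abs, 3))   # near 0
print("correlation after normalization          :", round(r_comp, 3))  # typically clearly positive
```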
Researchers at Peking University have also developed a series of statistical methods for genomic studies, all published in top bioinformatics or computational biology journals. In [Wang-Zheng-Zhao-Deng, PG (2013)], Deng and his collaborators developed a new method for expression quantitative trait loci (eQTL) analysis. Unlike earlier works that focus on the effect of a single genomic variant on gene expression, this work considered the synergistic effects of pairs of genomic variants and can therefore detect previously unexplored eQTL effects; the method is based on a bivariate model and an efficient screening statistic that speeds up computation. In [Xi-Lee-Xia-Park, NAR (2016)] and [Xia-Liu-Deng-Xi, Bioinformatics (2016)], Xi, Deng, and their collaborators developed two new algorithms, BIC-seq2 and SVmine, for detecting copy number variations (CNVs) and structural variations (SVs) from high-throughput sequencing data. SVs and CNVs are widespread in normal as well as diseased genomes, and their accurate detection is a critical step for biological and biomedical research and for clinical applications. The two methods are significantly superior to existing approaches in terms of sensitivity, specificity, detection resolution, and replicability.
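The pairwise eQTL idea can be illustrated with a brute-force scan: for every pair of variants, fit a linear model with main effects and an interaction term, and rank pairs by the interaction t-statistic. This toy illustration shows only how a synergistic effect that is invisible marginally can be detected jointly; it does not use the bivariate model or the screening statistic of [Wang-Zheng-Zhao-Deng, PG (2013)], and the simulated genotypes and effect sizes are assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, m = 300, 30                                       # samples, SNPs
G = rng.integers(0, 3, size=(n, m)).astype(float)    # genotypes coded 0/1/2
# expression driven by a purely synergistic (interaction) effect of SNPs 3 and 17
expr = 0.8 * G[:, 3] * G[:, 17] + rng.normal(0, 1.0, n)

def interaction_t(g1, g2, y):
    # OLS fit of y ~ 1 + g1 + g2 + g1*g2; return the t-statistic of the interaction term
    X = np.column_stack([np.ones_like(y), g1, g2, g1 * g2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.pinv(X.T @ X)
    return coef[3] / np.sqrt(cov[3, 3])

scores = {(i, j): abs(interaction_t(G[:, i], G[:, j], expr))
          for i, j in combinations(range(m), 2)}
best = max(scores, key=scores.get)
print("top interacting SNP pair:", best, " |t| =", round(scores[best], 1))
```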