Predicting Forced Responses of Probability Distributions via the Fluctuation-Dissipation Theorem and Generative Modeling arxiv.org/abs/2504.13333 .ML .CD .LG

We present a novel data-driven framework for estimating the response of higher-order moments of nonlinear stochastic systems to small external perturbations. The classical Generalized Fluctuation-Dissipation Theorem (GFDT) links the unperturbed steady-state distribution to the system's linear response. Standard implementations rely on Gaussian approximations, which can often accurately predict the mean response but usually introduce significant biases in higher-order moments, such as variance, skewness, and kurtosis. To address this limitation, we combine GFDT with recent advances in score-based generative modeling, which enable direct estimation of the score function from data without requiring full density reconstruction. Our method is validated on three reduced-order stochastic models relevant to climate dynamics: a scalar stochastic model for low-frequency climate variability, a slow-fast triad model mimicking key features of the El Nino-Southern Oscillation (ENSO), and a six-dimensional stochastic barotropic model capturing atmospheric regime transitions. In all cases, the approach captures strongly nonlinear and non-Gaussian features of the system's response, outperforming traditional Gaussian approximations.
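
For orientation, the Gaussian baseline that the abstract contrasts against can be sketched on a toy problem. For a constant forcing perturbation, the GFDT response of an observable A is commonly written R(tau) = -<A(x(tau)) s(x(0))>, where s is the score (the gradient of the log steady-state density); the Gaussian approximation replaces s(x) with -(x - mu)/var. The sketch below assumes a scalar Ornstein-Uhlenbeck process, for which this approximation is exact and the mean response is exp(-theta*tau). It is not the paper's score-based method, and the function names and parameter values are illustrative.

```python
import numpy as np

def simulate_ou(theta, sigma, dt, n_steps, seed=0):
    """Sample a path of dx = -theta*x dt + sigma dW using the exact discretisation."""
    rng = np.random.default_rng(seed)
    a = np.exp(-theta * dt)
    b = sigma * np.sqrt((1.0 - a**2) / (2.0 * theta))
    noise = rng.standard_normal(n_steps)
    x = np.empty(n_steps)
    x[0] = sigma / np.sqrt(2.0 * theta) * noise[0]  # start in the stationary law
    for n in range(1, n_steps):
        x[n] = a * x[n - 1] + b * noise[n]
    return x

def gaussian_fdt_response(x, max_lag):
    """Gaussian-score FDT estimate R(tau) = <x(tau) (x(0) - mu)> / var for lags 0..max_lag."""
    xc = x - x.mean()
    var = xc.var()
    n = len(xc)
    return np.array([np.mean(xc[lag:] * xc[:n - lag]) / var
                     for lag in range(max_lag + 1)])

dt, theta = 0.01, 1.0
x = simulate_ou(theta=theta, sigma=1.0, dt=dt, n_steps=200_000)
response = gaussian_fdt_response(x, max_lag=200)
exact = np.exp(-theta * dt * np.arange(201))  # analytic mean response of the OU process
print("max abs error vs analytic response:", np.max(np.abs(response - exact)))
```

For non-Gaussian systems the same estimator would need the true score in place of the linear one, which is the quantity the abstract proposes to learn from data.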

arXiv.org

Intelligent data collection for network discrimination in material flow analysis using Bayesian optimal experimental design arxiv.org/abs/2504.13382 .AP

Material flow analyses (MFAs) are powerful tools for highlighting resource efficiency opportunities in supply chains. MFAs are often represented as directed graphs, with nodes denoting processes and edges representing mass flows. However, network structure uncertainty -- uncertainty in the presence or absence of flows between nodes -- is common and can compromise flow predictions. While collection of more MFA data can reduce network structure uncertainty, an intelligent data acquisition strategy is crucial to optimize the resources (person-hours and money spent on collecting and purchasing data) invested in constructing an MFA. In this study, we apply Bayesian optimal experimental design (BOED), based on the Kullback-Leibler divergence, to efficiently target high-utility MFA data -- data that minimizes network structure uncertainty. We introduce a new method with reduced bias for estimating expected utility, demonstrating its superior accuracy over traditional approaches. We illustrate these advances with a case study on the U.S. steel sector MFA, where the expected utility of collecting specific single pieces of steel mass flow data aligns with the actual reduction in network structure uncertainty achieved by collecting said data from the United States Geological Survey and the World Steel Association. The results highlight that the optimal MFA data to collect depends on the total amount of data being gathered, making it sensitive to the scale of the data collection effort. Overall, our methods support intelligent data acquisition strategies, accelerating uncertainty reduction in MFAs and enhancing their utility for impact quantification and informed decision-making.
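
As a rough illustration of a Kullback-Leibler-based expected utility for discriminating between candidate networks, the sketch below estimates the expected KL divergence from prior to posterior over a discrete set of candidate models by plain Monte Carlo. It assumes each candidate model predicts a single measured flow with Gaussian measurement noise; the function name, the Gaussian likelihood, and the numbers are illustrative assumptions, and this is not the reduced-bias estimator the abstract introduces.

```python
import numpy as np

def expected_information_gain(means, sigma, prior, n_samples=20_000, seed=0):
    """Monte Carlo estimate of E_y[KL(p(m | y) || p(m))] for a discrete model set.

    means[k] is the value model k predicts for the measured quantity (assumed),
    sigma is the Gaussian measurement noise, prior[k] > 0 is the prior of model k.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    prior = np.asarray(prior, dtype=float)
    m = rng.choice(len(means), size=n_samples, p=prior)   # model drawn from the prior
    y = rng.normal(means[m], sigma)                        # simulated measurement
    # Log-likelihood of every simulated y under every model: shape (n_samples, n_models)
    loglik = (-0.5 * ((y[:, None] - means[None, :]) / sigma) ** 2
              - np.log(sigma * np.sqrt(2.0 * np.pi)))
    # Log marginal likelihood via a stable log-sum-exp over models
    joint = loglik + np.log(prior)[None, :]
    top = joint.max(axis=1)
    log_evidence = top + np.log(np.exp(joint - top[:, None]).sum(axis=1))
    return float(np.mean(loglik[np.arange(n_samples), m] - log_evidence))

# Two hypothetical measurements: one separates the candidate networks, one barely does.
eig_informative = expected_information_gain([0.0, 10.0], sigma=1.0, prior=[0.5, 0.5])
eig_weak = expected_information_gain([0.0, 0.1], sigma=1.0, prior=[0.5, 0.5])
print(eig_informative, eig_weak)
```

With two equally likely models the utility is bounded above by log 2, which a measurement approaches when the models' predictions are far apart relative to the noise.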

A Novel Strategy for Detecting Multiple Mediators in High-Dimensional Mediation Models arxiv.org/abs/2504.11550 .ME

This article presents a novel methodology for detecting multiple biomarkers in high-dimensional mediation models by utilizing a modified Least Absolute Shrinkage and Selection Operator (LASSO) alongside Pathway LASSO. This approach effectively addresses the problem of overestimating direct effects, which can result in the inaccurate identification of mediators with nonzero indirect effects. To mitigate this overestimation and improve the true positive rate for detecting mediators, two constraints on the $L_1$-norm penalty are introduced. The proposed methodology's effectiveness is demonstrated through extensive simulations across various scenarios, highlighting its robustness and reliability under different conditions. Furthermore, a procedure for selecting an optimal threshold for dimension reduction using sure independence screening is introduced, enhancing the accuracy of true biomarker detection and yielding a final model that is both robust and well-suited for real-world applications. To illustrate the practical utility of this methodology, the results are applied to a study dataset involving patients with internalizing psychopathology, showcasing its applicability in clinical settings. Overall, this methodology signifies a substantial advancement in biomarker detection within high-dimensional mediation models, offering promising implications for both research and clinical practices.
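
The screening step mentioned in the abstract can be pictured with a generic sure independence screening sketch: rank predictors by the absolute marginal correlation with the response and keep the top few. The cutoff n_keep is fixed by hand here, whereas the abstract describes a procedure for choosing it; the function name and the simulated data are assumptions for illustration only.

```python
import numpy as np

def sure_independence_screening(X, y, n_keep):
    """Return indices of the n_keep columns of X with the largest |marginal correlation| with y."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    marginal_corr = Xc.T @ yc / len(y)
    return np.argsort(-np.abs(marginal_corr))[:n_keep]

# Simulated high-dimensional data: only columns 0 and 5 carry signal.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.standard_normal(n)

kept = sure_independence_screening(X, y, n_keep=20)
print(sorted(int(j) for j in kept))
```

A penalized regression such as LASSO would then be fitted on the retained columns only.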

Mapping Multivariate Phenotypes in the Presence of Missing Observations for Family-Based Data arxiv.org/abs/2504.11579 .ME

Clinical end-point traits are often characterized by quantitative or qualitative precursors, and it has been argued that analyzing these precursor traits may be a statistically more powerful strategy for deciphering the genetic architecture of the underlying complex end-point trait. While association methods for both quantitative and qualitative traits have been extensively developed to analyze population-level data, the development of such methods is of current research interest for family-level data, which pose the additional challenge of incorporating the correlation of trait values within a family. Haldar and Ghosh (2015) developed a test that is a statistical equivalent of the classical TDT for quantitative traits and multivariate phenotypes. The model does not require a priori assumptions on the probability distributions of the phenotypes. However, it often arises in practice that data on the phenotype of interest are not available for all offspring in a nuclear family. In this study, we explore methodologies to estimate missing phenotypes conditioned on the available ones and carry out the transmission-based test for association on the 'complete' data. We consider three types of phenotypes: continuous, count and categorical. For a missing continuous phenotype, the trait value is estimated using a conditional normal model. For a missing count phenotype, the trait value is estimated using a conditional Poisson model. For a missing categorical phenotype, the risk of the phenotype status is estimated using a conditional logistic model. We shall carry out simulations under a wide spectrum of genetic models and assess the effect of the proposed imputation strategy on the power of the association test vis-à-vis the ideal situation with no missing data.
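
One way to picture the conditional normal step for a continuous phenotype is the textbook conditional mean of a multivariate normal: the missing components are estimated as mu_mis + Cov_mis,obs Cov_obs,obs^{-1} (x_obs - mu_obs). The sketch below assumes the mean vector and covariance matrix are already known or estimated; the abstract does not spell out the exact conditional model, so the function name and the inputs are illustrative assumptions.

```python
import numpy as np

def conditional_normal_mean(mu, cov, observed_idx, observed_values):
    """Conditional mean of the unobserved components of N(mu, cov) given the observed ones."""
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = np.asarray(observed_idx)
    mis = np.setdiff1d(np.arange(len(mu)), obs)          # indices of missing components
    cov_mo = cov[np.ix_(mis, obs)]
    cov_oo = cov[np.ix_(obs, obs)]
    resid = np.asarray(observed_values, dtype=float) - mu[obs]
    return mis, mu[mis] + cov_mo @ np.linalg.solve(cov_oo, resid)

# Two correlated phenotypes (correlation 0.5); the first is missing, the second is observed as 2.0.
missing_idx, imputed = conditional_normal_mean(
    mu=[0.0, 0.0],
    cov=[[1.0, 0.5], [0.5, 1.0]],
    observed_idx=[1],
    observed_values=[2.0],
)
print(missing_idx, imputed)
```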

Towards Interpretable Deep Generative Models via Causal Representation Learning arxiv.org/abs/2504.11609 .ML .ME .AI .LG

Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods' surprising performance is due in part to their ability to learn implicit "representations" of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL), which uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a culmination of three intrinsically statistical problems: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper reviews recent progress in CRL from a statistical perspective, focusing on connections to classical models and on statistical and causal identifiability results. This review also highlights key application areas, implementation strategies, and open statistical questions in CRL.

Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations arxiv.org/abs/2504.11610 -bio.QM .ML .LG

Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data. Results: We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset. Conclusion: GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA.

Statistical Modeling of Combinatorial Response Data arxiv.org/abs/2504.11630 .ME

In categorical data analysis, there is a rich literature on modeling binary and polychotomous responses. However, existing methods are inadequate for handling combinatorial responses, where each response is an array of integers subject to additional constraints. Such data are increasingly common in modern applications, such as surveys collected under skip logic, event propagation on a network, and observed matching in ecology. Ignoring the combinatorial structure in the response data may lead to biased estimation and prediction. The fundamental challenge for modeling these integer-vector data is the lack of a link function that connects a linear or functional predictor with a probability respecting the combinatorial constraints. In this paper, we propose a novel augmented likelihood, in which a combinatorial response can be viewed as a deterministic transform of a continuous latent variable. We specify the transform as the maximizer of an integer linear program, and characterize useful properties such as a dual thresholding representation. When taking a Bayesian approach and considering a multivariate normal distribution for the latent variable, our method becomes a direct generalization of the celebrated probit data augmentation, and enjoys straightforward computation via a Gibbs sampler. We provide theoretical justification for the proposed method at an interesting intersection between duality and probability distributions, and develop useful sufficient conditions that guarantee the applicability of our method. We demonstrate the effectiveness of our method through simulation studies and a real data application on modeling the formation of seasonal matching between waterfowl.

A cautionary note for plasmode simulation studies in the setting of causal inference arxiv.org/abs/2504.11740 .ME

Plasmode simulation has become an important tool for evaluating the operating characteristics of different statistical methods in complex settings, such as pharmacoepidemiological studies of treatment effectiveness using electronic health records (EHR) data. These studies provide insight into how estimator performance is impacted by challenges including rare events, small sample size, etc., that can indicate which among a set of methods performs best in a real-world dataset. Plasmode simulation combines data resampled from a real-world dataset with synthetic data to generate a known truth for an estimand in realistic data. There are different potential plasmode strategies currently in use. We compare two popular plasmode simulation frameworks. We provide numerical evidence and a theoretical result, which shows that one of these frameworks can cause certain estimators to incorrectly appear overly biased with lower than nominal confidence interval coverage. Detailed simulation studies using both synthetic and real-world EHR data demonstrate that these pitfalls remain at large sample sizes and when analyzing data from a randomized controlled trial. We conclude with guidance for the choice of a plasmode simulation approach that maintains good theoretical properties to allow a fair evaluation of statistical methods while also maintaining the desired similarity to real data.
