A Sensitivity Analysis Framework for Quantifying Confidence in Decisions in the Presence of Data Uncertainty arxiv.org/abs/2504.17043 .ME .AP

Nearly all statistical analyses that inform policy-making are based on imperfect data. As examples, the data may suffer from measurement errors, missing values, sample selection bias, or record linkage errors. Analysts have to decide how to handle such data imperfections, e.g., analyze only the complete cases or impute values for the missing items via some posited model. Their choices can influence estimates and hence, ultimately, policy decisions. Thus, it is prudent for analysts to evaluate the sensitivity of estimates and policy decisions to the assumptions underlying their choices. To facilitate this goal, we propose that analysts define metrics and visualizations that target the sensitivity of the ultimate decision to the assumptions underlying their approach to handling the data imperfections. Using these visualizations, the analyst can assess their confidence in the policy decision under their chosen analysis. We illustrate metrics and corresponding visualizations with two examples, namely considering possible measurement error in the inputs of predictive models of presidential vote share and imputing missing values when evaluating the percentage of children exposed to high levels of lead.

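As a toy illustration of the kind of decision-focused sensitivity metric the abstract describes, the sketch below sweeps an assumed measurement-error standard deviation, perturbs the observed inputs accordingly, and records how often a simple threshold-based decision flips. The data, threshold, and perturbation scheme are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(3.05, 1.0, size=500)   # observed, possibly error-prone measurements
threshold = 3.0                           # hypothetical policy threshold on the mean
baseline_decision = x_obs.mean() > threshold

# Sweep assumed measurement-error standard deviations; for each, re-draw plausible
# errors, re-evaluate the decision, and track how often it flips across replicates.
for error_sd in [0.0, 0.25, 0.5, 1.0]:
    n_rep, flips = 1000, 0
    for _ in range(n_rep):
        x_perturbed = x_obs + rng.normal(0.0, error_sd, size=x_obs.size)
        flips += (x_perturbed.mean() > threshold) != baseline_decision
    print(f"assumed error SD {error_sd:.2f}: decision flips in {flips / n_rep:.1%} of replicates")
```

The flip fraction plays the role of a confidence-in-the-decision metric; the paper's own metrics and visualizations are richer than this single number.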

Conditional-Marginal Nonparametric Estimation for Stage Waiting Times from Multi-Stage Models under Dependent Right Censoring arxiv.org/abs/2504.17089 .ME

We investigate two population-level quantities (corresponding to complete data) related to uncensored stage waiting times in a progressive multi-stage model, conditional on a prior stage visit. We show how to estimate these quantities consistently using right-censored data. The first quantity is the stage waiting time distribution (survival function), representing the proportion of individuals who remain in stage j within time t after entering stage j. The second quantity is the cumulative incidence function, representing the proportion of individuals who transition from stage j to stage j' within time t after entering stage j. To estimate these quantities, we present two nonparametric approaches. The first uses an inverse probability of censoring weighting (IPCW) method, which reweights the counting processes and the number of individuals at risk (the at-risk set) to address dependent right censoring. The second method uses fractional observations (FRE), modifying the at-risk set by incorporating the probabilities that individuals (who might have been censored in a prior stage) would eventually enter the stage of interest in the uncensored, full-data experiment. Neither approach requires independent censoring or a Markovian multi-stage framework. Simulation studies demonstrate satisfactory performance for both sets of estimators, though the IPCW estimator generally outperforms the FRE estimator in the setups considered in our simulations. These estimators are further illustrated through applications to two real-world datasets: one from patients undergoing bone marrow transplants and the other from patients diagnosed with breast cancer.

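A minimal sketch of the IPCW idea under the simplifying assumption of independent censoring, with a marginal Kaplan-Meier estimate of the censoring distribution; the paper's estimator instead models the censoring weights to handle dependent censoring, which this sketch does not do.

```python
import numpy as np

def km_survival(time, indicator):
    # Kaplan-Meier survival curve for the events flagged by `indicator`
    order = np.argsort(time)
    t, d = time[order], indicator[order]
    at_risk = len(t) - np.arange(len(t))
    return t, np.cumprod(1.0 - d / at_risk)

def ipcw_stage_survival(time, event, grid):
    # IPCW estimate of S(t) = P(stage waiting time > t);
    # `event` = 1 if the stage exit was observed, 0 if censored
    t_c, G = km_survival(time, 1 - event)               # censoring survival G(t) = P(C > t)
    # evaluate G just before each observed time (left limit of the step function)
    G_minus = np.array([G[t_c < ti][-1] if (t_c < ti).any() else 1.0 for ti in time])
    w = event / np.clip(G_minus, 1e-8, None)             # weight observed exits by 1 / G(T-)
    n = len(time)
    return np.array([1.0 - (w * (time <= s)).sum() / n for s in grid])

rng = np.random.default_rng(0)
T, C = rng.exponential(5.0, 300), rng.exponential(7.0, 300)
time, event = np.minimum(T, C), (T <= C).astype(int)
print(ipcw_stage_survival(time, event, grid=np.array([1.0, 3.0, 5.0])))
```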

MOOSE ProbML: Parallelized Probabilistic Machine Learning and Uncertainty Quantification for Computational Energy Applications arxiv.org/abs/2504.17101 .AP

This paper presents the development and demonstration of massively parallel probabilistic machine learning (ML) and uncertainty quantification (UQ) capabilities within the Multiphysics Object-Oriented Simulation Environment (MOOSE), an open-source computational platform for parallel finite element and finite volume analyses. In addressing the computational expense and uncertainties inherent in complex multiphysics simulations, this paper integrates Gaussian process (GP) variants, active learning, Bayesian inverse UQ, adaptive forward UQ, Bayesian optimization, evolutionary optimization, and Markov chain Monte Carlo (MCMC) within MOOSE. It also elaborates on the interaction among key MOOSE systems -- Sampler, MultiApp, Reporter, and Surrogate -- in enabling these capabilities. The modularity offered by these systems enables development of a multitude of probabilistic ML and UQ algorithms in MOOSE. Example code demonstrations include parallel active learning and parallel Bayesian inference via active learning. The impact of these developments is illustrated through five applications relevant to computational energy applications: UQ of nuclear fuel fission product release, using parallel active learning Bayesian inference; very rare events analysis in nuclear microreactors using active learning; advanced manufacturing process modeling using multi-output GPs (MOGPs) and dimensionality reduction; fluid flow using deep GPs (DGPs); and tritium transport model parameter optimization for fusion energy, using batch Bayesian optimization.

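MOOSE's Sampler, MultiApp, Reporter, and Surrogate systems are not reproduced here; the sketch below is a generic, serial active-learning loop with a scikit-learn Gaussian process surrogate, illustrating the acquire-evaluate-retrain pattern that the paper parallelizes. The `expensive_model` function is a purely hypothetical stand-in for a multiphysics simulation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_model(x):
    # stand-in for an expensive multiphysics simulation (hypothetical)
    return np.sin(3 * x[:, 0]) + 0.1 * x[:, 0] ** 2

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, size=(5, 1))           # small initial design
y = expensive_model(X)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True).fit(X, y)
    cand = rng.uniform(0, 3, size=(200, 1))  # candidate pool
    _, std = gp.predict(cand, return_std=True)
    x_new = cand[[np.argmax(std)]]           # acquire where the surrogate is least certain
    X = np.vstack([X, x_new])
    y = np.append(y, expensive_model(x_new)) # in MOOSE these evaluations are dispatched in parallel
print("final design size:", len(X))
```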

Target trial emulation without matching: a more efficient approach for evaluating vaccine effectiveness using observational data arxiv.org/abs/2504.17104 .ME .AP

Real-world vaccine effectiveness has increasingly been studied using matching-based approaches, particularly in observational cohort studies following the target trial emulation framework. Although matching is appealing in its simplicity, it suffers from important limitations in terms of the clarity of the target estimand and the efficiency or precision with which it is estimated. Scientifically justified causal estimands of vaccine effectiveness may be difficult to define because vaccine uptake varies over calendar time, when infection dynamics may also be changing rapidly. We propose a causal estimand that summarizes vaccine effectiveness over calendar time, similar to how vaccine efficacy is summarized in a randomized controlled trial. We describe the identification of our estimand, including its underlying assumptions, and propose simple-to-implement estimators based on two hazard regression models. We apply our proposed estimator in simulations and in a study assessing the effectiveness of the Pfizer-BioNTech COVID-19 vaccine in preventing SARS-CoV-2 infections in children 5-11 years old. In both settings, we find that our proposed estimator yields similar scientific inferences while providing significant efficiency gains over commonly used matching-based estimators.

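For orientation only: a toy calculation of vaccine effectiveness as one minus a hazard ratio from a single Cox model stratified on calendar period (using lifelines). This is not the authors' two-hazard-model estimator or their calendar-time-summarized estimand; the simulated data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 2000
vaccinated = rng.integers(0, 2, n)
period = rng.integers(0, 2, n)                           # 0 = early, 1 = late calendar period
base = np.where(period == 0, 0.02, 0.05)                 # calendar-varying baseline hazard
rate = base * np.where(vaccinated == 1, 0.4, 1.0)        # true hazard ratio 0.4, i.e. VE = 60%
event_time = rng.exponential(1 / rate)
censor_time = rng.uniform(10, 60, n)

df = pd.DataFrame({
    "time": np.minimum(event_time, censor_time),
    "infected": (event_time <= censor_time).astype(int),
    "vaccinated": vaccinated,
    "calendar_period": period,
})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="infected",
                        strata=["calendar_period"])
print("estimated VE:", 1 - cph.hazard_ratios_["vaccinated"])
```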

Estimation and Inference for the Average Treatment Effect in a Score-Explained Heterogeneous Treatment Effect Model arxiv.org/abs/2504.17126 .ME .ST .TH

In many practical situations, randomly assigning treatments to subjects is uncommon due to feasibility constraints. For example, economic aid programs and merit-based scholarships are often restricted to those meeting specific income or exam score thresholds. In these scenarios, traditional approaches to estimating treatment effects typically focus solely on observations near the cutoff point, thereby excluding a significant portion of the sample and potentially leading to information loss. Moreover, these methods generally achieve a non-parametric convergence rate. While some approaches, e.g., Mukherjee et al. (2021), attempt to tackle these issues, they commonly assume that treatment effects are constant across individuals, an assumption that is often unrealistic in practice. In this study, we propose a differencing and matching-based estimator of the average treatment effect on the treated (ATT) in the presence of heterogeneous treatment effects, utilizing all available observations. We establish the asymptotic normality of our estimator and illustrate its effectiveness through various synthetic and real data analyses. Additionally, we demonstrate that our method yields non-parametric estimates of the conditional average treatment effect (CATE) and individual treatment effect (ITE) as a byproduct.

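A generic 1:1 nearest-neighbour matching estimate of the ATT on simulated data with a heterogeneous effect, shown only to fix ideas; it is not the paper's differencing-and-matching estimator and ignores the score-based assignment structure.

```python
import numpy as np

def matched_att(X, y, treated):
    # nearest-neighbour (1:1, with replacement) matching estimate of the ATT
    Xt, Xc = X[treated == 1], X[treated == 0]
    yt, yc = y[treated == 1], y[treated == 0]
    d = ((Xt[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)   # Euclidean distances
    match = d.argmin(axis=1)                                   # closest control per treated unit
    return np.mean(yt - yc[match])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
treated = (X[:, 0] + rng.normal(size=500) > 0).astype(int)     # non-random assignment
tau = 1.0 + 0.5 * X[:, 1]                                      # heterogeneous treatment effect
y = X @ np.array([0.8, -0.3]) + treated * tau + rng.normal(size=500)
print("ATT estimate:", matched_att(X, y, treated), "| true ATT:", tau[treated == 1].mean())
```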

A Delayed Acceptance Auxiliary Variable MCMC for Spatial Models with Intractable Likelihood Function arxiv.org/abs/2504.17147 .ME .CO

A large class of spatial models, including spatial lattice models, interaction spatial point processes, and social network models, involves intractable normalizing functions. Bayesian inference for such models is challenging since the resulting posterior distribution is doubly intractable. Although auxiliary variable MCMC (AVM) algorithms are known to be the most practical, they are computationally expensive due to the repeated auxiliary variable simulations. To address this, we propose delayed-acceptance AVM (DA-AVM) methods, which can reduce the number of auxiliary variable simulations. The first stage of the kernel uses a cheap surrogate to decide whether to accept or reject the proposed parameter value. The second stage guarantees detailed balance with respect to the posterior. The auxiliary variable simulation is performed only for parameters accepted in the first stage. We construct various surrogates specifically tailored for doubly intractable problems, including a subsampling strategy, Gaussian process emulation, and a frequentist estimator-based approximation. We validate our method through simulated and real data applications, demonstrating its practicality for complex spatial models.

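A minimal sketch of the two-stage delayed-acceptance mechanism for a symmetric random-walk proposal: a cheap surrogate posterior screens proposals, and the expensive posterior (in the paper, the auxiliary-variable estimate) is evaluated only for proposals that survive the screen. The toy densities are illustrative.

```python
import numpy as np

def da_mh(log_post_cheap, log_post_exact, theta0, n_iter=5000, step=0.5, rng=None):
    # Delayed-acceptance Metropolis-Hastings with a symmetric random-walk proposal.
    # Stage 1 screens with the cheap surrogate; stage 2 corrects with the expensive
    # posterior so the exact target is preserved.
    rng = rng or np.random.default_rng()
    theta, chain = theta0, []
    lp_cheap, lp_exact = log_post_cheap(theta0), log_post_exact(theta0)
    for _ in range(n_iter):
        prop = theta + step * rng.normal()
        lp_cheap_prop = log_post_cheap(prop)
        if np.log(rng.uniform()) < lp_cheap_prop - lp_cheap:           # stage 1: cheap screen
            lp_exact_prop = log_post_exact(prop)                        # expensive call only here
            # stage 2: correction ratio (the cheap ratio cancels back out)
            if np.log(rng.uniform()) < (lp_exact_prop - lp_exact) - (lp_cheap_prop - lp_cheap):
                theta, lp_cheap, lp_exact = prop, lp_cheap_prop, lp_exact_prop
        chain.append(theta)
    return np.array(chain)

# toy check: cheap surrogate is a wider normal, exact target is N(0, 1)
chain = da_mh(lambda t: -t**2 / 8.0, lambda t: -t**2 / 2.0, theta0=0.0)
print(chain.mean(), chain.std())
```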

Causal rule ensemble approach for multi-arm data arxiv.org/abs/2504.17166 .ML .LG

Heterogeneous treatment effect (HTE) estimation is critical in medical research. It provides insights into how treatment effects vary among individuals, offering statistical evidence to support precision medicine. While real-world applications often involve multiple interventions, current HTE estimation methods are primarily designed for binary comparisons and often rely on black-box models, which limits their applicability and interpretability in multi-arm settings. To address these challenges, we propose an interpretable machine learning framework for HTE estimation in multi-arm trials. Our method employs a rule-based ensemble approach consisting of rule generation, rule ensemble, and HTE estimation, ensuring both predictive accuracy and interpretability. Through extensive simulation studies and real data applications, the performance of our method was evaluated against state-of-the-art multi-arm HTE estimation approaches. The results indicate that our approach achieved lower bias and higher estimation accuracy than existing methods. Furthermore, the interpretability of our framework allows clearer insights into how covariates influence treatment effects, facilitating clinical decision making. By bridging the gap between accuracy and interpretability, our study contributes a valuable tool for multi-arm HTE estimation, supporting precision medicine.

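A RuleFit-style sketch of the generate-rules / ensemble / estimate pattern: leaves of shallow trees serve as candidate rules, a sparse linear model is fit per arm on rule indicators, and arm-versus-control contrasts give HTE estimates. This illustrates the general recipe only, not the authors' algorithm; the simulated data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, n_arms = 1500, 3
X = rng.normal(size=(n, 4))
arm = rng.integers(0, n_arms, n)                        # 0 = control, 1 and 2 = treatments
tau = np.stack([np.zeros(n), 1.0 + X[:, 0], 0.5 * (X[:, 1] > 0)], axis=1)
y = X[:, 2] + tau[np.arange(n), arm] + rng.normal(scale=0.5, size=n)

# rule generation: leaves of shallow trees act as interpretable AND-rules
forest = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0).fit(X, y)
encoder = OneHotEncoder(handle_unknown="ignore").fit(forest.apply(X))
R = encoder.transform(forest.apply(X)).toarray()        # rule-membership indicators

# rule ensemble: a sparse linear model on the rule indicators, fit separately per arm
models = [LassoCV(cv=3).fit(R[arm == a], y[arm == a]) for a in range(n_arms)]

# HTE estimation: contrast each treatment arm's prediction with the control arm's
preds = np.stack([m.predict(R) for m in models], axis=1)
hte = preds[:, 1:] - preds[:, [0]]
print("estimated mean effects:", hte.mean(axis=0), "true:", tau[:, 1:].mean(axis=0))
```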

A general approach to modeling environmental mixtures with multivariate outcomes arxiv.org/abs/2504.17195 .ME

An important goal of environmental health research is to assess the health risks posed by mixtures of multiple environmental exposures. In these mixtures analyses, flexible models like Bayesian kernel machine regression and multiple index models are appealing because they allow for arbitrary non-linear exposure-outcome relationships. However, this flexibility comes at the cost of low power, particularly when exposures are highly correlated and the health effects are weak, as is typical in environmental health studies. We propose an adaptive index modelling strategy that borrows strength across exposures and outcomes by exploiting similar mixture component weights and exposure-response relationships. In the special case of distributed lag models, in which exposures are measured repeatedly over time, we jointly encourage co-clustering of lag profiles and exposure-response curves to more efficiently identify critical windows of vulnerability and characterize important exposure effects. We then extend the proposed approach to the multivariate index model setting where the true index structure -- the number of indices and their composition -- is unknown, and introduce variable importance measures to quantify component contributions to mixture effects. Using time series data from the National Morbidity, Mortality and Air Pollution Study, we demonstrate the proposed methods by jointly modelling three mortality outcomes and two cumulative air pollution measurements with a maximum lag of 14 days.

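A bare-bones illustration of the distributed-lag index structure (lag matrix, weighted index, exposure-response curve) with fixed, illustrative lag weights; the paper estimates the weights and curves adaptively and shares them across exposures and outcomes, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 400, 14                                        # days of data, maximum lag
pollution = rng.gamma(shape=2.0, scale=5.0, size=T)

# lag matrix: row t holds exposures on days t, t-1, ..., t-L
lags = np.column_stack([pollution[L - l : T - l] for l in range(L + 1)])

# single-index distributed lag term with fixed, illustrative weights
w = np.exp(-0.3 * np.arange(L + 1))
w /= w.sum()                                          # decaying lag profile, sums to one
index = lags @ w

# quadratic exposure-response curve fit by ordinary least squares
y = 0.05 * index - 0.001 * index**2 + rng.normal(scale=0.5, size=index.size)
B = np.column_stack([np.ones_like(index), index, index**2])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
print("fitted exposure-response coefficients:", beta)
```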

Parental Imprints On Birth Weight: A Data-Driven Model For Neonatal Prediction In Low Resource Prenatal Care arxiv.org/abs/2504.15290 .OT

Accurate fetal birth weight prediction is a cornerstone of prenatal care, yet traditional methods often rely on imaging technologies that remain inaccessible in resource-limited settings. This study presents a novel machine learning-based framework that circumvents these conventional dependencies, using a diverse set of physiological, environmental, and parental factors to refine birth weight estimation. A multi-stage feature selection pipeline filters the dataset into an optimized subset, highlighting previously underexplored yet clinically relevant predictors of fetal growth. By integrating advanced regression architectures and ensemble learning strategies, the model captures non-linear relationships often overlooked by traditional approaches, offering a predictive solution that is both interpretable and scalable. Beyond predictive accuracy, this study addresses a central question: whether birth weight can be reliably estimated without conventional diagnostic tools. The findings challenge entrenched methodologies by introducing an alternative pathway that enhances accessibility without compromising clinical utility. While limitations exist, the study lays the foundation for a new era in prenatal analytics, one where data-driven inference competes with, and potentially redefines, established medical assessments. By bridging computational intelligence with obstetric science, this research establishes a framework for equitable, technology-driven advancements in maternal-fetal healthcare.

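An illustrative scikit-learn pipeline in the same spirit (feature filtering followed by a non-linear ensemble regressor, evaluated by cross-validation); the features, model choices, and data are hypothetical and not the study's.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# hypothetical tabular predictors (e.g. maternal age, BMI, parity, smoking, paternal height)
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 12))
birth_weight = 3300 + 120 * X[:, 0] - 90 * X[:, 3] + rng.normal(scale=250, size=800)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=6)),   # feature-filtering stage
    ("model", GradientBoostingRegressor(random_state=0)),    # non-linear ensemble regressor
])
scores = cross_val_score(pipe, X, birth_weight, cv=5, scoring="neg_mean_absolute_error")
print("CV mean absolute error (g):", -scores.mean())
```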

A New Multiple Correlation Coefficient without Specifying the Dependent Variable arxiv.org/abs/2504.15372 .ME

Multiple correlation is a fundamental concept with broad applications. The classical multiple correlation coefficient is developed to assess how strongly a dependent variable is associated with a linear combination of independent variables. To compute this coefficient, the dependent variable must be chosen in advance. In many applications, however, it is difficult and even infeasible to specify the dependent variable, especially when some variables of interest are equally important. To overcome this difficulty, we propose a new coefficient of multiple correlation which (a) does not require the specification of the dependent variable, (b) has a simple formula and shares connections with the classical correlation coefficients, (c) consistently measures the linear correlation between continuous variables, which is 0 if and only if variables are uncorrelated and 1 if and only if one variable is a linear function of others, and (d) has an asymptotic distribution which can be used for hypothesis testing. We study the asymptotic behavior of the sample coefficient under mild regularity conditions. Given that the asymptotic bias of the sample coefficient is not negligible when the data dimension and the sample size are comparable, we propose a bias-corrected estimator that consistently performs well in such cases. Moreover, we develop an efficient strategy for making inferences on multiple correlation based on either the limiting distribution or the resampling methods and the stochastic approximation Monte Carlo algorithm, depending on whether the regularity assumptions are valid or not. Theoretical and numerical studies demonstrate that our coefficient provides a useful tool for evaluating multiple correlation in practice.

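For contrast, a sketch of the classical multiple correlation coefficient the abstract refers to, computed from the sample covariance matrix; its value changes with the choice of dependent variable, which is exactly the dependence the proposed coefficient removes. The new coefficient's formula is not reproduced here.

```python
import numpy as np

def classical_multiple_correlation(data, dep):
    # Classical multiple correlation of column `dep` with the remaining columns:
    # R^2 = s_yx S_xx^{-1} s_xy / s_yy, from the sample covariance matrix.
    S = np.cov(data, rowvar=False)
    idx = [j for j in range(data.shape[1]) if j != dep]
    s_yx = S[dep, idx]
    r2 = s_yx @ np.linalg.solve(S[np.ix_(idx, idx)], s_yx) / S[dep, dep]
    return np.sqrt(r2)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = x @ np.array([0.6, -0.4, 0.2]) + rng.normal(scale=0.5, size=1000)
data = np.column_stack([y, x])
# the value depends on which column is declared dependent -- the limitation the paper removes
print([round(classical_multiple_correlation(data, j), 3) for j in range(4)])
```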

Assessing Surrogate Heterogeneity in Real World Data Using Meta-Learners arxiv.org/abs/2504.15386 .ME .ML .LG

Deep learning with missing data arxiv.org/abs/2504.15388 .ME .ST .ML .TH .LG

In the context of multivariate nonparametric regression with missing covariates, we propose Pattern Embedded Neural Networks (PENNs), which can be applied in conjunction with any existing imputation technique. In addition to a neural network trained on the imputed data, PENNs pass the vectors of observation indicators through a second neural network to provide a compact representation. The outputs are then combined in a third neural network to produce final predictions. Our main theoretical result exploits an assumption that the observation patterns can be partitioned into cells on which the Bayes regression function behaves similarly, and belongs to a compositional Hölder class. It provides a finite-sample excess risk bound that holds for an arbitrary missingness mechanism, and in combination with a complementary minimax lower bound, demonstrates that our PENN estimator attains in typical cases the minimax rate of convergence as if the cells of the partition were known in advance, up to a poly-logarithmic factor in the sample size. Numerical experiments on simulated, semi-synthetic and real data confirm that the PENN estimator consistently improves, often dramatically, on standard neural networks without pattern embedding. Code to reproduce our experiments, as well as a tutorial on how to apply our method, is publicly available.

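A minimal PyTorch sketch of the PENN architecture as described: one network on the imputed covariates, a second on the missingness pattern, and a third combining the two representations. Layer sizes, the imputation rule, and the training loop are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

class PENN(nn.Module):
    # Pattern Embedded Neural Network sketch: data network + pattern network + combiner.
    def __init__(self, d, hidden=32, embed=8):
        super().__init__()
        self.data_net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.pattern_net = nn.Sequential(nn.Linear(d, embed), nn.ReLU(), nn.Linear(embed, embed))
        self.head = nn.Sequential(nn.Linear(hidden + embed, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_imputed, mask):
        h = torch.cat([self.data_net(x_imputed), self.pattern_net(mask)], dim=1)
        return self.head(h).squeeze(-1)

# toy usage with a crude zero (column-mean) imputation
torch.manual_seed(0)
x = torch.randn(256, 5)
mask = (torch.rand(256, 5) > 0.3).float()          # 1 = observed, 0 = missing
x_imputed = x * mask
y = x[:, 0] * mask[:, 0] + torch.randn(256) * 0.1

model = PENN(d=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_imputed, mask), y)
    loss.backward()
    opt.step()
print("final training MSE:", float(loss))
```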

A stochastic method to estimate a zero-inflated two-part mixed model for human microbiome data arxiv.org/abs/2504.15411 .ME .ST .TH

Human microbiome studies based on genetic sequencing techniques produce compositional longitudinal data of the relative abundances of microbial taxa over time, making it possible to understand, through mixed-effects modeling, how microbial communities evolve in response to clinical interventions, environmental changes, or disease progression. In particular, the Zero-Inflated Beta Regression (ZIBR) models jointly and over time the presence and abundance of each microbial taxon, accounting for the compositional nature of the data, its skewness, and the over-abundance of zeros. However, as for other complex random effects models, maximum likelihood estimation suffers from the intractability of likelihood integrals. Available estimation methods rely on log-likelihood approximation, which is prone to limitations such as biased estimates or unstable convergence. In this work we develop an alternative maximum likelihood estimation approach for the ZIBR model, based on the Stochastic Approximation Expectation Maximization (SAEM) algorithm. The proposed methodology can handle unbalanced data, which is not always possible with existing approaches. We also provide estimates of the standard errors and the log-likelihood of the fitted model. The performance of the algorithm is established through simulation, and its use is demonstrated on two microbiome studies, showing its ability to detect changes in both presence and abundance of bacterial taxa over time and in response to treatment.

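A sketch of the observation-level zero-inflated beta log-likelihood underlying ZIBR, with a scalar presence probability and mean for illustration; in the model these are linked to covariates and subject-level random effects, and the paper's contribution is the SAEM fitting of that full model, which is not shown here.

```python
import numpy as np
from scipy.stats import beta

def zibr_loglik(y, p, mu, phi):
    # Zero-inflated beta log-likelihood for one taxon: with probability 1 - p the
    # observation is a structural zero; otherwise the relative abundance follows
    # Beta(mu * phi, (1 - mu) * phi). In ZIBR, p and mu come from regression predictors.
    zero = y == 0
    ll = zero.sum() * np.log1p(-p)
    ll += (~zero).sum() * np.log(p) + beta.logpdf(y[~zero], mu * phi, (1 - mu) * phi).sum()
    return ll

# toy check on simulated relative abundances
rng = np.random.default_rng(0)
present = rng.uniform(size=200) < 0.6
y = np.where(present, rng.beta(2.0, 8.0, size=200), 0.0)
print(zibr_loglik(y, p=0.6, mu=0.2, phi=10.0))
```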

Intra-Class Correlation Coefficient Ignorable Clustered Randomized Trials for Detecting Treatment Effect Heterogeneity arxiv.org/abs/2504.15503 .ME

Accurately estimating the intra-class correlation coefficient (ICC) is crucial for adequately powering clustered randomized trials (CRTs). Challenges arise due to limited prior data on the specific outcome within the target population, making accurate ICC estimation difficult. Furthermore, the ICC can vary considerably across studies, even for the same outcome, influenced by factors such as study design, participant characteristics, and the specific intervention. Power calculations are extremely sensitive to ICC assumptions: minor variation in the assumed ICC can lead to large differences in the number of clusters needed, potentially impacting trial feasibility and cost. This paper identifies a special class of CRTs aimed at detecting treatment effect heterogeneity, wherein the ICC can be completely disregarded in power and sample size calculations. This result offers a solution for research projects lacking preliminary estimates of the ICC or facing challenges in estimating it. Moreover, this design allows power to be improved by increasing cluster sizes rather than the number of clusters, making it particularly advantageous in situations where expanding the number of clusters is difficult or costly. This paper provides a rigorous theoretical foundation for this class of ICC-ignorable CRTs, including mathematical proofs and practical guidance for implementation. We also present illustrative examples to demonstrate the practical implications of this approach in various research contexts in healthcare delivery.

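A worked example of why standard CRT sample sizes are so sensitive to the assumed ICC, using the textbook design effect 1 + (m - 1) x ICC for a difference in means; the effect size, variance, and cluster size below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def clusters_per_arm(delta, sigma, m, icc, alpha=0.05, power=0.8):
    # Standard CRT sample size for a difference in means: individually randomized
    # n per arm, inflated by the design effect 1 + (m - 1) * ICC, then converted
    # to clusters of size m.
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_ind = 2 * (z * sigma / delta) ** 2                # per arm, ignoring clustering
    deff = 1 + (m - 1) * icc
    return int(np.ceil(n_ind * deff / m))

# the required number of clusters changes sharply with the assumed ICC...
for icc in [0.01, 0.05, 0.10]:
    print(f"ICC = {icc:.2f}: {clusters_per_arm(delta=0.3, sigma=1.0, m=30, icc=icc)} clusters per arm")
# ...which is the sensitivity the ICC-ignorable designs for effect heterogeneity avoid
```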

Ridge-Regularized Largest Root Test for General High-dimensional Double Wishart Problems arxiv.org/abs/2504.15510 .ME

In multivariate analysis, many core problems involve the eigen-analysis of an \(F\)-matrix, \(F = W_1 W_2^{-1}\), constructed from two Wishart matrices, \(W_1\) and \(W_2\). These so-called Double Wishart problems arise in contexts such as MANOVA, covariance matrix equality testing, and hypothesis testing in multivariate linear regression. A prominent classical approach, Roy's largest root test, relies on the largest eigenvalue of \(F\) for inference. However, in high-dimensional settings, this test becomes impractical due to the singularity or near-singularity of \(W_2\). To address this challenge, we propose a ridge-regularization framework by introducing a ridge term to \(W_2\). Specifically, we develop a family of ridge-regularized largest root tests, leveraging the largest eigenvalue of \(F_\lambda = W_1 (W_2 + \lambda I)^{-1}\), where \(\lambda > 0\) is the regularization parameter. Under mild assumptions, we establish the asymptotic Tracy-Widom distribution of the largest eigenvalue of \(F_\lambda\) after appropriate scaling. An efficient method for estimating the scaling parameters is proposed using the Marčenko-Pastur equation, and the consistency of these estimators is proven. The proposed framework is applied to illustrative Double Wishart problems, and simulation studies are conducted to evaluate the numerical performance of the methods. Finally, the proposed method is applied to the Human Connectome Project data to test for the presence of associations between volumetric measurements of the human brain and behavioral variables.

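A small numerical sketch of the test statistic itself: the largest eigenvalue of \(F_\lambda = W_1 (W_2 + \lambda I)^{-1}\), obtained from a generalized symmetric eigenproblem; the centring and scaling that give the Tracy-Widom limit are developed in the paper and omitted here.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p, n1, n2 = 200, 150, 180            # p exceeds n2, so W2 is singular and needs the ridge term
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(size=(n2, p))
W1 = X1.T @ X1                        # Wishart-type matrices
W2 = X2.T @ X2

lam = 1.0                             # ridge regularization parameter lambda
# largest root of F_lambda = W1 (W2 + lam I)^{-1}, via the equivalent generalized
# symmetric eigenproblem W1 v = theta (W2 + lam I) v
theta_max = eigh(W1, W2 + lam * np.eye(p), eigvals_only=True)[-1]
print("ridge-regularized largest root:", theta_max)
```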