Learning Enhanced Ensemble Filters arxiv.org/abs/2504.17836 .comp-ph .ML .SY .LG .SY

The filtering distribution in hidden Markov models evolves according to the law of a mean-field model in state--observation space. The ensemble Kalman filter (EnKF) approximates this mean-field model with an ensemble of interacting particles, employing a Gaussian ansatz for the joint distribution of the state and observation at each observation time. These methods are robust, but the Gaussian ansatz limits accuracy. This shortcoming is addressed by approximating the mean-field evolution using a novel form of neural operator taking probability distributions as input: a Measure Neural Mapping (MNM). An MNM is used to design a novel approach to filtering, the MNM-enhanced ensemble filter (MNMEF), which is defined both in the mean-field limit and for interacting ensemble particle approximations. The ensemble approach uses empirical measures as input to the MNM and is implemented using the set transformer, which is invariant to ensemble permutation and allows for different ensemble sizes. The derivation of methods from a mean-field formulation allows a single parameterization of the algorithm to be deployed at different ensemble sizes. In practice, fine-tuning a small number of parameters for specific ensemble sizes further enhances the accuracy of the scheme. The promise of the approach is demonstrated by its superior root-mean-square-error performance relative to leading methods in filtering the Lorenz 96 and Kuramoto-Sivashinsky models.
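
For context, here is a minimal sketch of the stochastic EnKF analysis step that the paper's MNM is designed to generalize: each particle is shifted toward the data through a Kalman gain built from ensemble covariances, which is exactly where the Gaussian ansatz enters. All names and dimensions are illustrative, not the paper's implementation.

```python
import numpy as np

def enkf_analysis(X, y, H, R, rng):
    """One stochastic EnKF analysis step under the Gaussian ansatz.
    X: (d, N) ensemble, y: (m,) observation, H: (m, d) observation
    operator, R: (m, m) observation noise covariance."""
    _, N = X.shape
    # predicted observations, perturbed with simulated observation noise
    Y = H @ X + rng.multivariate_normal(np.zeros(len(y)), R, size=N).T
    A = X - X.mean(axis=1, keepdims=True)
    B = Y - Y.mean(axis=1, keepdims=True)
    C_xy = A @ B.T / (N - 1)          # state-observation cross-covariance
    C_yy = B @ B.T / (N - 1)          # observation covariance (noise included)
    K = C_xy @ np.linalg.inv(C_yy)    # Kalman gain: the Gaussian ansatz
    return X + K @ (y[:, None] - Y)   # move each particle toward the data

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))          # 50-member ensemble in R^3
H = np.eye(2, 3)                      # observe the first two components
Xa = enkf_analysis(X, rng.normal(size=2), H, 0.1 * np.eye(2), rng)
```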

SOFARI-R: High-Dimensional Manifold-Based Inference for Latent Responses arxiv.org/abs/2504.17874 .ME .ML .LG

Data reduction with uncertainty quantification plays a key role in various multi-task learning applications, where large numbers of responses and features are present. To this end, a general framework of high-dimensional manifold-based SOFAR inference (SOFARI) was introduced recently in Zheng, Zhou, Fan and Lv (2024) for interpretable multi-task learning inference, focusing on the left factor vectors and singular values and exploiting the latent singular value decomposition (SVD) structure. Yet, designing a valid inference procedure for the latent right factor vectors is not a straightforward extension of that for the left ones, and can be even more challenging due to the asymmetry of the left and right singular vectors in the response matrix. To tackle these issues, in this paper we suggest a new method of high-dimensional manifold-based SOFAR inference for latent responses (SOFARI-R), of which two variants are introduced. The first variant deals with strongly orthogonal factors by coupling the left singular vectors with the design matrix and then appropriately rescaling them to generate new Stiefel manifolds. The second variant handles the more general weakly orthogonal factors by employing the hard-thresholded SOFARI estimates and delicately incorporating the approximation errors into the distribution. Both variants produce bias-corrected estimators for the latent right factor vectors that enjoy asymptotically normal distributions with justified asymptotic variance estimates. We demonstrate the effectiveness of the newly suggested method using extensive simulation studies and an economic application.
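
To make the latent SVD structure concrete, here is a toy multi-response model Y = XC + E with a rank-2 coefficient matrix; a naive plug-in estimate of the right factor vectors is the SVD of the OLS fit. This recovers the factors only up to sign and carries no bias correction or variance estimate, which is precisely the gap SOFARI-R addresses. The setup and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, r = 500, 10, 8, 2                     # samples, features, responses, rank
L = np.linalg.qr(rng.normal(size=(p, r)))[0]   # left singular vectors
Rt = np.linalg.qr(rng.normal(size=(q, r)))[0]  # right singular vectors ("latent responses")
C = L @ np.diag([5.0, 3.0]) @ Rt.T             # rank-2 coefficient matrix
X = rng.normal(size=(n, p))
Y = X @ C + 0.5 * rng.normal(size=(n, q))

C_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS estimate of C
U, s, Vt = np.linalg.svd(C_hat)                # latent SVD structure
R_hat = Vt[:r].T                               # naive estimate of the right factors
print(np.abs(R_hat.T @ Rt))                    # near identity, up to sign
```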

Bernstein Polynomial Processes for Continuous Time Change Detection arxiv.org/abs/2504.17876 .ME .ST .AP .TH

There is a lack of methodological results for continuous time change detection due to the challenges of noninformative prior specification and efficient posterior inference in this setting. Most methodologies to date assume data are collected at uniformly spaced time intervals. This assumption incurs bias in the continuous time setting where, a priori, two consecutive observations measured closely in time are less likely to straddle a change point than two consecutive observations that are far apart in time. Models proposed in this setting have required MCMC sampling, which is computationally costly. To address these issues, we derive the heterogeneous continuous time Markov chain that models change point transition probabilities noninformatively. By construction, change points under this model can be inferred efficiently using the forward-backward algorithm, without MCMC sampling. We then develop a novel loss function for the continuous time setting, derive its Bayes estimator, and demonstrate its performance on synthetic data. A case study using time series of remotely sensed observations is then carried out on three change detection applications. To reduce falsely detected changes in this setting, we develop a semiparametric mean function that captures interannual variability due to weather in addition to trend and seasonal components.
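
A minimal sketch of the forward-backward inference the abstract alludes to, under simplifying assumptions: a two-state (pre/post change) chain observed at irregular times, where the probability of a change between consecutive observations grows with the time gap, and Gaussian emissions with assumed means. The paper's noninformative heterogeneous chain and Bernstein polynomial machinery are not reproduced; lam, mu, and sd below are invented.

```python
import numpy as np
from scipy.stats import norm

def smoothed_state_probs(t, x, lam=0.1, mu=(0.0, 2.0), sd=1.0):
    """Forward-backward smoothing for a two-state change-point chain
    observed at irregular times t; the change probability over a gap
    dt is 1 - exp(-lam * dt), and the post-change state is absorbing."""
    n = len(t)
    e = np.stack([norm.pdf(x, mu[0], sd), norm.pdf(x, mu[1], sd)])  # (2, n) emissions
    a = np.zeros((2, n))
    a[:, 0] = np.array([1.0, 0.0]) * e[:, 0]          # start in the pre-change state
    a[:, 0] /= a[:, 0].sum()
    for i in range(1, n):                              # forward pass
        p = 1.0 - np.exp(-lam * (t[i] - t[i - 1]))     # change probability over the gap
        T = np.array([[1 - p, p], [0.0, 1.0]])
        a[:, i] = (T.T @ a[:, i - 1]) * e[:, i]
        a[:, i] /= a[:, i].sum()
    b = np.ones((2, n))
    for i in range(n - 2, -1, -1):                     # backward pass
        p = 1.0 - np.exp(-lam * (t[i + 1] - t[i]))
        T = np.array([[1 - p, p], [0.0, 1.0]])
        b[:, i] = T @ (e[:, i + 1] * b[:, i + 1])
        b[:, i] /= b[:, i].sum()
    g = a * b
    return g / g.sum(axis=0)                           # smoothed P(state | all data)

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 10, 60))
x = np.where(t < 6, 0.0, 2.0) + rng.normal(size=60)
print(smoothed_state_probs(t, x)[1])                   # P(change has occurred | data)
```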

Model Error Covariance Estimation for Weak Constraint Data Assimilation arxiv.org/abs/2504.17900 .ME .NA .OC .NA

State estimates from weak constraint 4D-Var data assimilation can vary significantly depending on the data and model error covariances. As a result, the accuracy of these estimates heavily depends on the correct specification of both model and observational data error covariances. In this work, we assume that the data error is known and focus on estimating the model error covariance by framing weak constraint 4D-Var as a regularized inverse problem, where the inverse model error covariance serves as the regularization matrix. We consider both isotropic and non-isotropic forms of the model error covariance. Using the representer method, we reduce the 4D-Var problem from state space to data space, enabling the efficient application of regularization parameter selection techniques. The representer method also provides an analytic expression for the optimal state estimate, allowing us to derive matrix expressions for three regularization parameter selection methods: the L-curve, generalized cross-validation (GCV), and the Chi-square method. We validate our approach by assimilating simulated data into a 1D transport equation modeling wildfire smoke transport under various observational noise and forward model perturbations. In these experiments the goal is to identify model error covariances that accurately capture the influence of observational data versus model predictions on assimilated state estimates. The regularization parameter selection methods successfully estimate hyperparameters, for both isotropic and non-isotropic model error covariances, that reflect whether the first guess model predictions are more or less reliable than the observational data. The results further indicate that isotropic variances are sufficient when the first guess is more accurate than the data, whereas non-isotropic covariances are preferred when the observational data are more reliable.
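
As an illustration of the regularization-parameter selection step, here is GCV for a generic Tikhonov problem, with the scalar lam standing in for an isotropic inverse model-error variance. The representer-method reduction to data space and the L-curve and Chi-square criteria are not shown; A, y, and the lam grid are invented.

```python
import numpy as np

def gcv(A, y, lams):
    """Generalized cross-validation for min ||A x - y||^2 + lam ||x||^2;
    returns the lam minimizing the GCV score, computed via the SVD of A."""
    n = len(y)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    Uty = U.T @ y
    scores = []
    for lam in lams:
        f = s**2 / (s**2 + lam)                                  # filter factors
        resid2 = np.sum(((1 - f) * Uty) ** 2) + (y @ y - Uty @ Uty)
        trace = n - np.sum(f)                                    # tr(I - influence matrix)
        scores.append(n * resid2 / trace**2)
    return lams[int(np.argmin(scores))]

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 20))
y = A @ rng.normal(size=20) + 0.5 * rng.normal(size=100)
print(gcv(A, y, np.logspace(-6, 3, 60)))
```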

Auto-Regressive Standard Precipitation Index: A Bayesian Approach for Drought Characterization arxiv.org/abs/2504.18197 .AP

This study proposes the Auto-Regressive Standardized Precipitation Index (ARSPI) as a novel alternative to the traditional Standardized Precipitation Index (SPI) for measuring drought, relaxing the assumption that rainfall is independent and identically distributed over time. ARSPI utilizes an auto-regressive framework to capture the auto-correlated characteristics of precipitation, providing a more precise depiction of drought dynamics. The proposed model integrates a spike-and-slab log-normal distribution for zero-rainfall seasons. A Bayesian Markov chain Monte Carlo (MCMC) approach simplifies the SPI computation using non-parametric predictive density estimation of total rainfall across various time windows from simulated samples. The MCMC simulations further ensure robust estimation of severity, duration, peak, and return period with greater precision. The study also compares the performance of ARSPI and SPI using precipitation data from the Colorado River Basin (1893-1991). ARSPI proves more efficient than the benchmark SPI in terms of model fit and shows enhanced sensitivity to climatic extremes, making it a valuable tool for hydrological research and water resource management.
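
For reference, a sketch of the classical SPI baseline that ARSPI is compared against: aggregate precipitation over a rolling window, fit a gamma distribution to the nonzero totals with a point mass at zero, and map the probabilities to standard-normal quantiles. ARSPI instead models the totals auto-regressively with a spike-and-slab log-normal likelihood and computes the index from MCMC samples; the data below are simulated for illustration.

```python
import numpy as np
from scipy import stats

def spi(precip, window=3):
    """Classical SPI: rolling totals -> mixed zero/gamma CDF -> normal quantiles."""
    totals = np.convolve(precip, np.ones(window), mode="valid")  # rolling sums
    wet = totals[totals > 0]
    q = np.mean(totals == 0)                             # probability of a dry window
    a, loc, scale = stats.gamma.fit(wet, floc=0)         # gamma fit to wet totals
    cdf = q + (1 - q) * stats.gamma.cdf(totals, a, loc=loc, scale=scale)
    return stats.norm.ppf(np.clip(cdf, 1e-6, 1 - 1e-6))  # SPI values

rng = np.random.default_rng(4)
rain = rng.gamma(2.0, 10.0, size=240) * rng.binomial(1, 0.8, size=240)
print(spi(rain)[:5])
```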

A Sensitivity Analysis Framework for Quantifying Confidence in Decisions in the Presence of Data Uncertainty arxiv.org/abs/2504.17043 .ME .AP

Nearly all statistical analyses that inform policy-making are based on imperfect data. As examples, the data may suffer from measurement errors, missing values, sample selection bias, or record linkage errors. Analysts have to decide how to handle such data imperfections, e.g., analyze only the complete cases or impute values for the missing items via some posited model. Their choices can influence estimates and hence, ultimately, policy decisions. Thus, it is prudent for analysts to evaluate the sensitivity of estimates and policy decisions to the assumptions underlying their choices. To facilitate this goal, we propose that analysts define metrics and visualizations that target the sensitivity of the ultimate decision to the assumptions underlying their approach to handling the data imperfections. Using these visualizations, the analyst can assess their confidence in the policy decision under their chosen analysis. We illustrate metrics and corresponding visualizations with two examples, namely considering possible measurement error in the inputs of predictive models of presidential vote share and imputing missing values when evaluating the percentage of children exposed to high levels of lead.
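
A toy version of the proposed workflow, under invented assumptions: sweep an assumed measurement-error standard deviation, recompute an attenuation-corrected estimate, and record whether a hypothetical policy decision flips. The variables, the error model, and the 0.25 decision threshold are all illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1000)                      # true input, never observed
y = 0.30 * x + rng.normal(size=1000)           # outcome
w = x + rng.normal(scale=0.5, size=1000)       # error-prone measurement of x

var_w = np.var(w, ddof=1)
beta_naive = np.cov(w, y)[0, 1] / var_w        # attenuated slope estimate
for sigma_u in np.linspace(0.0, 0.8, 9):       # sweep the *assumed* error SD
    reliability = max(var_w - sigma_u**2, 1e-8) / var_w
    beta = beta_naive / reliability            # measurement-error-corrected slope
    decision = "act" if beta > 0.25 else "hold"  # hypothetical policy threshold
    print(f"assumed error sd {sigma_u:.1f}: slope {beta:.3f} -> {decision}")
```

Plotting the corrected estimate against the assumed error level, with the threshold overlaid, gives exactly the kind of sensitivity visualization the abstract advocates.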

Conditional-Marginal Nonparametric Estimation for Stage Waiting Times from Multi-Stage Models under Dependent Right Censoring arxiv.org/abs/2504.17089 .ME

We investigate two population-level quantities (corresponding to complete data) related to uncensored stage waiting times in a progressive multi-stage model, conditional on a prior stage visit, and show how to estimate these quantities consistently from right-censored data. The first quantity is the stage waiting time distribution (survival function), representing the proportion of individuals who remain in stage j through time t after entering stage j. The second quantity is the cumulative incidence function, representing the proportion of individuals who transition from stage j to stage j' within time t after entering stage j. To estimate these quantities, we present two nonparametric approaches. The first uses an inverse probability of censoring weighting (IPCW) method, which reweights the counting processes and the number of individuals at risk (the at-risk set) to address dependent right censoring. The second utilizes the notion of fractional observations (FRE), which modifies the at-risk set by incorporating the probabilities that individuals (who might have been censored in a prior stage) would eventually enter the stage of interest in the uncensored, full-data experiment. Neither approach requires independent censoring or a Markovian multi-stage framework. Simulation studies demonstrate satisfactory performance for both sets of estimators, though the IPCW estimator generally outperforms the FRE estimator in the setups considered in our simulations. The estimators are further illustrated through applications to two real-world datasets: one from patients undergoing bone marrow transplants and the other from patients diagnosed with breast cancer.
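
A minimal single-stage sketch of the IPCW idea: estimate the censoring survival function by Kaplan-Meier and reweight each observed event by the inverse probability of remaining uncensored just before it. The paper's estimators handle dependent censoring and stage-conditional quantities, which this marginal-weight sketch does not; the simulated data are illustrative.

```python
import numpy as np

def km_censoring_survival(time, event):
    """Kaplan-Meier estimate of the censoring survival G(t-), treating
    censorings (event == 0) as the 'events'; ties handled naively."""
    order = np.argsort(time)
    t, d = time[order], 1 - event[order]
    at_risk = len(t) - np.arange(len(t))
    G = np.cumprod(1.0 - d / at_risk)
    return lambda s: np.concatenate([[1.0], G])[np.searchsorted(t, s, side="left")]

def ipcw_cdf(time, event, grid):
    """IPCW estimate of P(T <= t): each observed event gets weight 1/G(T-)."""
    G = km_censoring_survival(time, event)
    w = event / np.maximum(G(time), 1e-10)         # weight zero for censored rows
    return np.array([np.mean(w * (time <= t)) for t in grid])

rng = np.random.default_rng(6)
T = rng.exponential(2.0, 500)                      # latent event times
C = rng.exponential(3.0, 500)                      # censoring times
time, event = np.minimum(T, C), (T <= C).astype(int)
print(ipcw_cdf(time, event, np.linspace(0.5, 5, 5)))  # approx 1 - exp(-t / 2)
```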

MOOSE ProbML: Parallelized Probabilistic Machine Learning and Uncertainty Quantification for Computational Energy Applications arxiv.org/abs/2504.17101 .AP

This paper presents the development and demonstration of massively parallel probabilistic machine learning (ML) and uncertainty quantification (UQ) capabilities within the Multiphysics Object-Oriented Simulation Environment (MOOSE), an open-source computational platform for parallel finite element and finite volume analyses. To address the computational expense and uncertainties inherent in complex multiphysics simulations, this paper integrates Gaussian process (GP) variants, active learning, Bayesian inverse UQ, adaptive forward UQ, Bayesian optimization, evolutionary optimization, and Markov chain Monte Carlo (MCMC) within MOOSE. It also elaborates on the interaction among key MOOSE systems -- Sampler, MultiApp, Reporter, and Surrogate -- in enabling these capabilities; the modularity offered by these systems enables development of a multitude of probabilistic ML and UQ algorithms in MOOSE. Example code demonstrations include parallel active learning and parallel Bayesian inference via active learning. The impact of these developments is illustrated through five applications in computational energy: UQ of nuclear fuel fission product release, using parallel active-learning Bayesian inference; very rare events analysis in nuclear microreactors using active learning; advanced manufacturing process modeling using multi-output GPs (MOGPs) and dimensionality reduction; fluid flow using deep GPs (DGPs); and tritium transport model parameter optimization for fusion energy, using batch Bayesian optimization.
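
MOOSE's Sampler/MultiApp/Reporter/Surrogate systems are C++ and are not reproduced here; the following is a generic, serial Python sketch of the active-learning pattern the paper parallelizes: fit a GP, query the candidate with the largest predictive uncertainty, run the expensive model there, and repeat. The toy function and all settings are invented.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_model(x):                      # stand-in for a multiphysics simulation
    return np.sin(3 * x) + 0.1 * x**2

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(4, 1))          # small initial design
y = expensive_model(X).ravel()
pool = np.linspace(-2, 2, 400).reshape(-1, 1)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
for _ in range(10):                          # active-learning loop
    gp.fit(X, y)
    _, sd = gp.predict(pool, return_std=True)
    x_new = pool[[np.argmax(sd)]]            # acquire where the GP is least certain
    X = np.vstack([X, x_new])
    y = np.append(y, expensive_model(x_new).ravel())
print(f"final max predictive sd: {sd.max():.3f}")
```

In the MOOSE setting, the acquisition and model runs are farmed out in parallel rather than executed one at a time as here.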

Target trial emulation without matching: a more efficient approach for evaluating vaccine effectiveness using observational data arxiv.org/abs/2504.17104 .ME .AP

Real-world vaccine effectiveness has increasingly been studied using matching-based approaches, particularly in observational cohort studies following the target trial emulation framework. Although matching is appealing in its simplicity, it suffers important limitations in terms of the clarity of the target estimand and the efficiency or precision with which it is estimated. Scientifically justified causal estimands of vaccine effectiveness may be difficult to define because vaccine uptake varies over calendar time while infection dynamics may also be changing rapidly. We propose a causal estimand of vaccine effectiveness that summarizes effectiveness over calendar time, similar to how vaccine efficacy is summarized in a randomized controlled trial. We describe the identification of our estimand, including its underlying assumptions, and propose simple-to-implement estimators based on two hazard regression models. We apply our proposed estimator in simulations and in a study assessing the effectiveness of the Pfizer-BioNTech COVID-19 vaccine at preventing SARS-CoV-2 infections in children 5-11 years old. In both settings, we find that our proposed estimator yields similar scientific inferences while providing significant efficiency gains over commonly used matching-based estimators.
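
A generic sketch of hazard-regression-based vaccine effectiveness estimation, not the paper's exact two-model estimator: a discrete-time (pooled logistic) hazard model with calendar-period dummies, with VE read off as one minus the estimated hazard ratio (the odds ratio approximates the hazard ratio for rare events). The data-generating process and all numbers are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, periods = 2000, 12
vax = rng.binomial(1, 0.5, size=n)                   # vaccinated indicator
base = 0.02 * np.exp(np.sin(np.arange(periods) / 2))  # calendar-varying baseline hazard
rows = []
for i in range(n):
    h = base * (0.4 if vax[i] else 1.0)               # true hazard ratio 0.4 (VE = 60%)
    for t in range(periods):
        y = rng.binomial(1, h[t])
        rows.append((vax[i], t, y))
        if y:                                          # infected: exits the risk set
            break
data = np.array(rows, dtype=float)
dummies = (data[:, 1:2] == np.arange(periods)).astype(float)[:, 1:]
X = np.column_stack([np.ones(len(data)), data[:, 0], dummies])  # calendar-time effects
fit = sm.GLM(data[:, 2], X, family=sm.families.Binomial()).fit()
print(f"estimated VE: {1 - np.exp(fit.params[1]):.2%}")
```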

Estimation and Inference for the Average Treatment Effect in a Score-Explained Heterogeneous Treatment Effect Model arxiv.org/abs/2504.17126 .ME .ST .TH

In many practical situations, randomly assigning treatments to subjects is infeasible. For example, economic aid programs and merit-based scholarships are often restricted to those meeting specific income or exam score thresholds. In these scenarios, traditional approaches to estimating treatment effects typically focus solely on observations near the cutoff point, thereby excluding a significant portion of the sample and potentially losing information; moreover, they generally achieve only a non-parametric convergence rate. While some approaches, e.g., Mukherjee et al. (2021), attempt to tackle these issues, they commonly assume that treatment effects are constant across individuals, an assumption that is often unrealistic in practice. In this study, we propose a differencing- and matching-based estimator of the average treatment effect on the treated (ATT) in the presence of heterogeneous treatment effects, utilizing all available observations. We establish the asymptotic normality of our estimator and illustrate its effectiveness through various synthetic and real data analyses. Additionally, we demonstrate that our method yields non-parametric estimates of the conditional average treatment effect (CATE) and individual treatment effect (ITE) as byproducts.
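
A toy contrast motivating the problem setup: when treatment is assigned by a score threshold, the naive difference in means is confounded by the score, while a cutoff-local (RD-style) comparison is much less confounded but discards most of the sample. The paper's differencing-and-matching ATT estimator, which uses all observations, is not reproduced here; the simulation is invented.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
score = rng.normal(size=n)                    # running score (e.g., exam score)
x = rng.normal(size=n)                        # additional covariate
treated = score > 0                           # threshold-based assignment
tau = 1.0 + 0.5 * x                           # heterogeneous treatment effect
y = 2.0 * score + x + tau * treated + rng.normal(size=n)

# naive difference in means: badly confounded by the score
print("naive:", y[treated].mean() - y[~treated].mean())

# cutoff-local comparison: far less biased, but keeps only ~15% of the data
band = np.abs(score) < 0.2
print("near cutoff:", y[band & treated].mean() - y[band & ~treated].mean())
```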

A Delayed Acceptance Auxiliary Variable MCMC for Spatial Models with Intractable Likelihood Function arxiv.org/abs/2504.17147 .ME .CO

A large class of spatial models contains intractable normalizing functions; examples include spatial lattice models, interaction spatial point processes, and social network models. Bayesian inference for such models is challenging since the resulting posterior distribution is doubly intractable. Although auxiliary variable MCMC (AVM) algorithms are among the most practical approaches, they are computationally expensive due to the repeated auxiliary variable simulations. To address this, we propose delayed-acceptance AVM (DA-AVM) methods, which reduce the number of auxiliary variable simulations. The first stage of the kernel uses a cheap surrogate to decide whether to accept or reject the proposed parameter value; the second stage guarantees detailed balance with respect to the posterior. The auxiliary variable simulation is performed only for parameters accepted in the first stage. We construct various surrogates specifically tailored to doubly intractable problems, including a subsampling strategy, Gaussian process emulation, and a frequentist estimator-based approximation. We validate our method through simulated and real data applications, demonstrating its practicality for complex spatial models.
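
A generic delayed-acceptance Metropolis-Hastings sketch illustrating the two-stage kernel: the surrogate screens proposals, and the stage-two correction restores detailed balance with respect to the exact target. In the paper's doubly intractable setting, stage two would be an auxiliary-variable (exchange-type) step rather than the tractable log-posterior used here; the targets below are invented.

```python
import numpy as np

def da_mh(logpost_cheap, logpost_exact, theta0, n_iter, step, rng):
    """Delayed-acceptance MH: a cheap surrogate screens proposals; the
    expensive exact target is evaluated only for survivors of stage one."""
    theta = theta0
    lc, le = logpost_cheap(theta), logpost_exact(theta)
    chain, exact_evals = [theta], 0
    for _ in range(n_iter):
        prop = theta + step * rng.normal()
        lc_prop = logpost_cheap(prop)
        # stage 1: accept/reject with the surrogate
        if np.log(rng.uniform()) < lc_prop - lc:
            exact_evals += 1
            le_prop = logpost_exact(prop)
            # stage 2: correct with the exact target (preserves detailed balance)
            if np.log(rng.uniform()) < (le_prop - le) - (lc_prop - lc):
                theta, lc, le = prop, lc_prop, le_prop
        chain.append(theta)
    return np.array(chain), exact_evals

rng = np.random.default_rng(10)
cheap = lambda t: -0.5 * t**2 / 1.2**2            # Gaussian surrogate
exact = lambda t: -0.5 * t**2 - 0.1 * t**4        # "expensive" target
chain, evals = da_mh(cheap, exact, 0.0, 5000, 1.0, rng)
print(f"exact-target evaluations: {evals} of 5000 proposals")
```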

Causal rule ensemble approach for multi-arm data arxiv.org/abs/2504.17166 .ML .LG

Heterogeneous treatment effect (HTE) estimation is critical in medical research: it provides insights into how treatment effects vary among individuals, which can supply statistical evidence for precision medicine. However, while real-world applications often involve multiple interventions, current HTE estimation methods are primarily designed for binary comparisons and often rely on black-box models, which limits their applicability and interpretability in multi-arm settings. To address these challenges, we propose an interpretable machine learning framework for HTE estimation in multi-arm trials. Our method employs a rule-based ensemble approach consisting of rule generation, rule ensembling, and HTE estimation, ensuring both predictive accuracy and interpretability. Through extensive simulation studies and real data applications, we evaluated the performance of our method against state-of-the-art multi-arm HTE estimation approaches. The results indicate that our approach achieves lower bias and higher estimation accuracy than existing methods. Furthermore, the interpretability of our framework allows clearer insights into how covariates influence treatment effects, facilitating clinical decision making. By bridging the gap between accuracy and interpretability, our study contributes a valuable tool for multi-arm HTE estimation, supporting precision medicine.
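
A RuleFit-style sketch of the rule-generation and rule-ensemble steps on a single outcome: harvest leaf-membership indicators from shallow trees as candidate rules, then let a sparse linear fit select a few of them. The paper's extension to multi-arm HTE estimation is not shown; the data and tuning choices are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(11)
n, p = 1500, 5
X = rng.normal(size=(n, p))
y = 2.0 * (X[:, 0] > 0) * (X[:, 1] < 1) + X[:, 2] + rng.normal(size=n)

# step 1: rule generation -- harvest leaves of shallow trees fit on bootstraps
rules = []
for b in range(20):
    idx = rng.integers(0, n, n)
    tree = DecisionTreeRegressor(max_depth=3, random_state=b).fit(X[idx], y[idx])
    leaves = tree.apply(X)                       # leaf id per sample
    for leaf in np.unique(leaves):
        rules.append((leaves == leaf).astype(float))
R = np.column_stack(rules)                       # binary rule-membership matrix

# step 2: rule ensemble -- a sparse linear fit keeps a few interpretable rules
fit = LassoCV(cv=5).fit(R, y)
print(f"{np.sum(fit.coef_ != 0)} rules kept of {R.shape[1]}")
```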

A general approach to modeling environmental mixtures with multivariate outcomes arxiv.org/abs/2504.17195 .ME

An important goal of environmental health research is to assess the health risks posed by mixtures of multiple environmental exposures. In such mixture analyses, flexible models like Bayesian kernel machine regression and multiple index models are appealing because they allow for arbitrary non-linear exposure-outcome relationships. However, this flexibility comes at the cost of low power, particularly when exposures are highly correlated and the health effects are weak, as is typical in environmental health studies. We propose an adaptive index modelling strategy that borrows strength across exposures and outcomes by exploiting similar mixture component weights and exposure-response relationships. In the special case of distributed lag models, in which exposures are measured repeatedly over time, we jointly encourage co-clustering of lag profiles and exposure-response curves to more efficiently identify critical windows of vulnerability and characterize important exposure effects. We then extend the proposed approach to the multivariate index model setting where the true index structure -- the number of indices and their composition -- is unknown, and introduce variable importance measures to quantify component contributions to mixture effects. Using time series data from the National Morbidity, Mortality and Air Pollution Study, we demonstrate the proposed methods by jointly modelling three mortality outcomes and two cumulative air pollution measurements with a maximum lag of 14 days.
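
A single-outcome distributed lag baseline for the special case the abstract mentions: regress the outcome on a matrix of lagged exposures (lags 0-14) and inspect the estimated lag profile for a critical window. The paper's adaptive co-clustering of lag profiles and exposure-response curves across multiple outcomes goes well beyond this; the simulated series is invented.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(12)
T, L = 1000, 14                                        # days, maximum lag
pollution = rng.gamma(2.0, 1.0, size=T)
true_w = np.exp(-0.5 * (np.arange(L + 1) - 3) ** 2)    # effect peaks at lag 3
true_w /= true_w.sum()

# lag matrix: row t holds exposure on days t, t-1, ..., t-L
lagmat = np.column_stack([pollution[L - l : T - l] for l in range(L + 1)])
y = lagmat @ true_w + 0.2 * rng.normal(size=T - L)

w_hat = Ridge(alpha=1.0).fit(lagmat, y).coef_          # estimated lag profile
print("estimated peak lag:", int(np.argmax(w_hat)))    # the critical window
```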
