Efficient Estimation of Causal Effects Under Two-Phase Sampling with Error-Prone Outcome and Treatment Measurements arxiv.org/abs/2506.21777 .ME

Measurement error is a common challenge for causal inference studies using electronic health record (EHR) data, where clinical outcomes and treatments are frequently mismeasured. Researchers often address measurement error by conducting manual chart reviews to validate measurements in a subset of the full EHR data -- a form of two-phase sampling. To improve efficiency, phase-two samples are often collected in a biased manner dependent on the patients' initial, error-prone measurements. In this work, motivated by our aim of performing causal inference with error-prone outcome and treatment measurements under two-phase sampling, we develop solutions applicable to both this specific problem and the broader problem of causal inference with two-phase samples. For our specific measurement error problem, we construct two asymptotically equivalent doubly-robust estimators of the average treatment effect and demonstrate how these estimators arise from two previously disconnected approaches to constructing efficient estimators in general two-phase sampling settings. We document various sources of instability affecting estimators from each approach and propose modifications that can considerably improve finite sample performance in any two-phase sampling context. We demonstrate the utility of our proposed methods through simulation studies and an illustrative example assessing effects of antiretroviral therapy on occurrence of AIDS-defining events in patients with HIV from the Vanderbilt Comprehensive Care Clinic.
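
A generic way to see how the two ingredients interact (not the authors' exact estimators): take the familiar full-data AIPW influence function for the ATE and correct it for biased phase-two selection by inverse-weighting with the known sampling probability and augmenting with a phase-one projection. The notation below (R for the validation indicator, V for phase-one variables, pi for the sampling probability) is illustrative.

```latex
% Generic sketch only, not the paper's exact estimators.
% Full-data AIPW influence function for the ATE \tau:
\varphi(O) \;=\; \frac{A\{Y - \mu_1(X)\}}{e(X)}
          \;-\; \frac{(1-A)\{Y - \mu_0(X)\}}{1 - e(X)}
          \;+\; \mu_1(X) - \mu_0(X) \;-\; \tau .
% With validation indicator R and known phase-two sampling probability
% \pi(V) depending on phase-one variables V, a two-phase estimator solves
\frac{1}{n}\sum_{i=1}^{n}\left[
   \frac{R_i}{\pi(V_i)}\,\varphi(O_i)
   \;-\; \left(\frac{R_i}{\pi(V_i)} - 1\right)\widehat{E}\{\varphi(O)\mid V_i\}
\right] \;=\; 0 .
% The augmentation term has mean zero when \pi is known; its role is to
% recover efficiency lost to the biased phase-two sampling.
```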

Estimating Average Causal Effects with Incomplete Exposure and Confounders arxiv.org/abs/2506.21786 .ME

Standard methods for estimating average causal effects require complete observations of the exposure and confounders. In observational studies, however, missing data are ubiquitous. Motivated by a study on the effect of prescription opioids on mortality, we propose methods for estimating average causal effects when exposures and potential confounders may be missing. We consider missingness at random and additionally propose several specific missing not at random (MNAR) assumptions. Under our proposed MNAR assumptions, we show that the average causal effects are identified from the observed data and derive the corresponding influence functions in a nonparametric model, which form the basis of our proposed estimators. Our simulations show that standard multiple imputation techniques paired with a complete-data estimator are unbiased when data are missing at random (MAR) but can be biased otherwise. For each of the MNAR assumptions, we instead propose doubly robust targeted maximum likelihood estimators (TMLE), allowing misspecification of either (i) the outcome models or (ii) the exposure and missingness models. The proposed methods are suitable for any outcome type, and we apply them to a motivating study that examines the effect of prescription opioid usage on all-cause mortality using data from the National Health and Nutrition Examination Survey (NHANES).
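
The MAR baseline the abstract compares against can be sketched in a few lines: multiply impute the missing exposure, apply a complete-data doubly-robust estimator to each completed dataset, and average. Everything below (data-generating model, imputer, rounding of the binary exposure) is an illustrative stand-in, not the paper's estimators.

```python
# Hedged sketch of the MAR baseline: multiple imputation of a missing binary
# exposure followed by a complete-data AIPW estimator of the ATE.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                      # confounder
a = rng.binomial(1, 1 / (1 + np.exp(-x)))   # binary exposure
y = 1.0 * a + x + rng.normal(size=n)        # outcome, true ATE = 1

# Impose MAR missingness on the exposure, depending only on the observed outcome.
a_obs = a.astype(float)
a_obs[rng.binomial(1, 1 / (1 + np.exp(-(y - 1)))) == 1] = np.nan

def aipw_ate(x, a, y):
    """Complete-data AIPW with logistic propensity and linear outcome models."""
    X = x.reshape(-1, 1)
    e = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]
    mu = LinearRegression().fit(np.column_stack([x, a]), y)
    mu1 = mu.predict(np.column_stack([x, np.ones_like(a)]))
    mu0 = mu.predict(np.column_stack([x, np.zeros_like(a)]))
    return np.mean(a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e) + mu1 - mu0)

# Multiple imputation: impute, round the binary exposure, re-estimate, average.
data = np.column_stack([x, a_obs, y])
estimates = []
for m in range(10):
    filled = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(data)
    a_m = (filled[:, 1] > 0.5).astype(int)   # crude rounding of the imputed exposure
    estimates.append(aipw_ate(x, a_m, y))
print("MI + AIPW estimate of the ATE:", np.mean(estimates))
```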

Classification with Reject Option: Distribution-free Error Guarantees via Conformal Prediction arxiv.org/abs/2506.21802 .ML .LG

Machine learning (ML) models always make a prediction, even when they are likely to be wrong. This causes problems in practical applications, as we do not know whether we should trust a prediction. ML with reject option addresses this issue by abstaining from making a prediction if it is likely to be incorrect. In this work, we formalise the approach to ML with reject option in binary classification, deriving theoretical guarantees on the resulting error rate. This is achieved through conformal prediction (CP), which produces prediction sets with distribution-free validity guarantees. In binary classification, CP can output prediction sets containing exactly one, two or no labels. By accepting only the singleton predictions, we turn CP into a binary classifier with reject option. Here, CP is formally put in the framework of predicting with reject option. We state and prove the resulting error rate and give finite-sample estimates. Numerical examples illustrate the derived error rate in several different conformal prediction settings, ranging from full conformal prediction to offline batch inductive conformal prediction. The former has a direct link to sharp validity guarantees, whereas the latter offers looser validity guarantees but is more practical to use. Error-reject curves illustrate the trade-off between error rate and reject rate, and can help a user set an acceptable error rate or reject rate in practice.
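
A minimal sketch of the accept-only-singletons rule using offline split (inductive) conformal prediction; the classifier, nonconformity score, and data below are illustrative choices, not the paper's setup.

```python
# Split conformal prediction for binary classification with a reject option:
# abstain whenever the conformal prediction set is not a singleton.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
alpha = 0.1  # target miscoverage level

# Nonconformity score: 1 - predicted probability of the true label.
cal_scores = 1.0 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
# Finite-sample-corrected quantile of the calibration scores.
q = np.quantile(cal_scores, np.ceil((len(y_cal) + 1) * (1 - alpha)) / len(y_cal))

# Prediction set = all labels whose score falls below the threshold.
in_set = (1.0 - clf.predict_proba(X_te)) <= q   # shape (n_test, 2), one column per label
accepted = in_set.sum(axis=1) == 1              # accept only singleton sets
predictions = in_set.argmax(axis=1)             # the single label in the set, where accepted

error_rate = np.mean(predictions[accepted] != y_te[accepted])
reject_rate = 1.0 - accepted.mean()
print(f"error on accepted: {error_rate:.3f}, reject rate: {reject_rate:.3f}")
```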

Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon arxiv.org/abs/2506.20779 .ML .LG

We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs -- a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. In two natural settings, (1) the generalization gap for flat solutions and (2) the mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds which establish that, while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between flat solutions and low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" in which neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.

Forecasting Geopolitical Events with a Sparse Temporal Fusion Transformer and Gaussian Process Hybrid: A Case Study in Middle Eastern and U.S. Conflict Dynamics arxiv.org/abs/2506.20935 .ML .AP .CO .LG

Forecasting geopolitical conflict from data sources like the Global Database of Events, Language, and Tone (GDELT) is a critical challenge for national security. The inherent sparsity, burstiness, and overdispersion of such data cause standard deep learning models, including the Temporal Fusion Transformer (TFT), to produce unreliable long-horizon predictions. We introduce STFT-VNNGP, a hybrid architecture designed to overcome these limitations, which won the 2023 Algorithms for Threat Detection (ATD) competition. Our model employs a two-stage process: first, a TFT captures complex temporal dynamics to generate multi-quantile forecasts; these quantiles then serve as informed inputs for a Variational Nearest Neighbor Gaussian Process (VNNGP), which performs principled spatiotemporal smoothing and uncertainty quantification. In a case study forecasting conflict dynamics in the Middle East and the U.S., STFT-VNNGP consistently outperforms a standalone TFT, showing a superior ability to predict the timing and magnitude of bursty event periods, particularly at long-range horizons. This work offers a robust framework for generating more reliable and actionable intelligence from challenging event data, with all code and workflows made publicly available to ensure reproducibility.
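
A heavily simplified sketch of the two-stage pipeline described above, with off-the-shelf stand-ins: gradient-boosted quantile regression in place of the TFT and an exact GP in place of the VNNGP; data and features are synthetic and the fit is in-sample for brevity.

```python
# Stage 1 produces multi-quantile forecasts; stage 2 treats those quantiles
# as informed inputs to a Gaussian process for smoothing and uncertainty.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
T = 500
t = np.arange(T)
# Bursty, overdispersed count-like series (illustrative only).
y = rng.poisson(2 + 8 * (rng.random(T) < 0.05))

# Stage 1: multi-quantile one-step-ahead forecasts from lagged features.
lags = np.column_stack([np.roll(y, k) for k in (1, 2, 3, 7)])[7:]
target = y[7:]
quantile_preds = []
for q in (0.1, 0.5, 0.9):
    gbr = GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0)
    gbr.fit(lags, target)
    quantile_preds.append(gbr.predict(lags))
Q = np.column_stack(quantile_preds)           # quantile forecasts as features

# Stage 2: GP takes the quantile forecasts (plus time) as inputs and returns
# a smoothed mean with uncertainty.
features = np.column_stack([t[7:], Q])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0) + WhiteKernel(), alpha=1e-6)
gp.fit(features, target)
mean, std = gp.predict(features, return_std=True)
print("smoothed forecast for the last step:", mean[-1], "+/-", std[-1])
```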

Disentangling network dependence among multiple variables arxiv.org/abs/2506.20974 .ME

When two variables depend on the same or similar underlying network, their shared network dependence structure can lead to spurious associations. While statistical associations between two variables sampled from interconnected subjects are a common inferential goal across various fields, little research has focused on how to disentangle shared dependence for valid statistical inference. We revisit two different approaches from distinct fields that may address shared network dependence: the pre-whitening approach, commonly used in time series analysis to remove the shared temporal dependence, and the network autocorrelation model, widely used in network analysis often to examine or account for autocorrelation of the outcome variable. We demonstrate how each approach implicitly entails assumptions about how a variable of interest propagates among nodes via network ties given the network structure. We further propose adaptations of existing pre-whitening methods to the network setting by explicitly reflecting underlying assumptions about "level of interaction" that induce network dependence, while accounting for its unique complexities. Our simulation studies demonstrate the effectiveness of the two approaches in reducing spurious associations due to shared network dependence when their respective assumptions hold. However, the results also show the sensitivity to assumption violations, underscoring the importance of correctly specifying the shared dependence structure based on available network information and prior knowledge about the interactions driving dependence.
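
A toy illustration of the shared-dependence problem and of pre-whitening under a network autocorrelation model, assuming rho and W are known (which is exactly the kind of assumption the abstract warns about); all numbers are illustrative.

```python
# Two variables generated independently but sharing network autocorrelation
# (y = rho*W*y + eps) produce inflated type-I error in a naive correlation
# test; multiplying by (I - rho*W) pre-whitens them when rho and W are known.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n, rho, n_sims = 150, 0.45, 500

# One fixed network: sparse symmetric adjacency, row-normalized weight matrix W.
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1)
A = A + A.T
W = A / np.maximum(A.sum(axis=1, keepdims=True), 1)

M = np.linalg.inv(np.eye(n) - rho * W)   # y = rho*W*y + eps  =>  y = M @ eps
P = np.eye(n) - rho * W                  # pre-whitening transform (rho, W taken as known)

naive_rej = white_rej = 0
for _ in range(n_sims):
    x = M @ rng.normal(size=n)   # two variables that share the same network dependence
    y = M @ rng.normal(size=n)   # but are generated independently of each other
    _, p_naive = pearsonr(x, y)
    _, p_white = pearsonr(P @ x, P @ y)
    naive_rej += p_naive < 0.05
    white_rej += p_white < 0.05

print("naive type-I error rate:       ", naive_rej / n_sims)
print("pre-whitened type-I error rate:", white_rej / n_sims)
```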

Leveraging Relational Evidence: Population Size Estimation on Tree-Structured Data with the Weighted Multiplier Method arxiv.org/abs/2506.21020 .ME

Populations of interest are often hidden from data for a variety of reasons, though their magnitude remains important in determining resource allocation and appropriate policy. One popular approach to population size estimation, the multiplier method, is a back-calculation tool requiring only a marginal subpopulation size and an estimate of the proportion belonging to this subgroup. Another approach is to use Bayesian methods, which are inherently well-suited to incorporating multiple data sources. However, both methods have their drawbacks. A framework for applying the multiplier method that combines information from several known subpopulations has not yet been established, and Bayesian models, though able to incorporate complex dependencies and various data sources, can be difficult for researchers in less technical fields to design and implement. Increasing data collection and linkage across diverse fields suggest that accessible methods for estimating population size from synthesized data are needed. We propose an extension to the well-known multiplier method that is applicable to tree-structured data, where multiple subpopulations and corresponding proportions combine to generate a population size estimate via the minimum variance estimator. The methodology and resulting estimates are compared with those from a Bayesian hierarchical model, for both simulated and real-world data. Subsequent analysis elucidates which data are key to estimation in each method, and examines the robustness and feasibility of this new methodology.
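
The back-calculation itself is one line, N = n / p, and combining several such estimates by inverse-variance ("minimum variance") weighting takes only a few more; the delta-method variance used below is an illustrative choice, not necessarily the paper's exact weighting, and all numbers are hypothetical.

```python
# Multiplier method: each known subgroup size n_k and estimated proportion p_k
# gives N_k = n_k / p_k; the estimates are then pooled with minimum-variance
# (inverse-variance) weights.
import numpy as np

n_k = np.array([120.0, 450.0, 80.0])   # known marginal subgroup sizes
p_k = np.array([0.04, 0.15, 0.025])    # estimated share of the hidden population in each subgroup
m_k = np.array([600, 800, 500])        # survey sample sizes behind each proportion

N_k = n_k / p_k                                  # one multiplier estimate per subgroup
var_p = p_k * (1 - p_k) / m_k                    # binomial variance of each proportion
var_N = (n_k**2 / p_k**4) * var_p                # delta method: Var(n/p) ~= n^2 Var(p) / p^4
w = (1 / var_N) / np.sum(1 / var_N)              # minimum-variance (inverse-variance) weights
N_hat = np.sum(w * N_k)
se = np.sqrt(1 / np.sum(1 / var_N))

print("per-subgroup estimates:", N_k.round(0))
print("combined estimate:", round(N_hat), "+/-", round(1.96 * se))
```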

AutoWMM and JAGStree -- R packages for Population Size Estimation on Relational Tree-Structured Data arxiv.org/abs/2506.21023 .CO

Data-Driven Dynamic Factor Modeling via Manifold Learning arxiv.org/abs/2506.19945 .ML .PR .LG

We propose a data-driven dynamic factor framework where a response variable depends on a high-dimensional set of covariates, without imposing any parametric model on the joint dynamics. Leveraging Anisotropic Diffusion Maps, a nonlinear manifold learning technique introduced by Singer and Coifman, our framework uncovers the joint dynamics of the covariates and responses in a purely data-driven way. We approximate the embedding dynamics using linear diffusions, and exploit Kalman filtering to predict the evolution of the covariates and response variables directly from the diffusion map embedding space. We generalize Singer's convergence rate analysis of the graph Laplacian from the case of independent uniform samples on a compact manifold to the case of time series arising from Langevin diffusions in Euclidean space. Furthermore, we provide rigorous justification for our procedure by showing the robustness of approximations of the diffusion map coordinates by linear diffusions, and the convergence of ergodic averages under standard spectral assumptions on the underlying dynamics. We apply our method to the stress testing of equity portfolios using a combination of financial and macroeconomic factors from the Federal Reserve's supervisory scenarios. We demonstrate that our data-driven stress testing method outperforms standard scenario analysis and Principal Component Analysis benchmarks through historical backtests spanning three major financial crises, achieving reductions in mean absolute error of up to 55% and 39% for scenario-based portfolio return prediction, respectively.
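
For readers unfamiliar with the embedding the abstract builds on, a bare-bones diffusion map (the standard construction, not the anisotropic variant used in the paper) can be computed as follows; the data are synthetic.

```python
# Standard diffusion map embedding: Gaussian kernel, Markov normalization,
# eigendecomposition, and coordinates from the leading non-trivial eigenvectors.
import numpy as np

rng = np.random.default_rng(3)
# Noisy samples from a 1-D manifold (a circle) embedded in 3-D.
theta = rng.uniform(0, 2 * np.pi, size=400)
X = np.column_stack([np.cos(theta), np.sin(theta), 0.1 * rng.normal(size=400)])

# Gaussian kernel with a median-heuristic bandwidth, then row-stochastic normalization.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
eps = np.median(d2)
K = np.exp(-d2 / eps)
P = K / K.sum(axis=1, keepdims=True)

# The top non-trivial right eigenvectors of P, scaled by their eigenvalues,
# give the diffusion map coordinates.
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
vals = vals.real[order]
vecs = vecs.real[:, order]
embedding = vecs[:, 1:3] * vals[1:3]   # skip the trivial constant eigenvector
print("leading diffusion coordinates shape:", embedding.shape)
```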

Introducing RobustiPy: An efficient next generation multiversal library with model selection, averaging, resampling, and explainable artificial intelligence arxiv.org/abs/2506.19958 -fin.EC .ME .GN .AP .CO

We present RobustiPy, a next generation Python-based framework for model uncertainty quantification and multiverse analysis, released under the GNU GPL v3.0. Through the integration of efficient bootstrap-based confidence intervals, combinatorial exploration of dependent-variable specifications, model selection and averaging, and two complementary joint-inference routines, RobustiPy transcends existing uncertainty-quantification tools. Its design further supports rigorous out-of-sample evaluation and apportions the predictive contribution of each covariate. We deploy the library across five carefully constructed simulations and ten empirically grounded case studies drawn from high-impact literature and teaching examples, including a novel re-analysis of "unexplained discrepancies" in famous prior work. To illustrate its performance, we time-profile RobustiPy over roughly 672 million simulated linear regressions. These applications showcase how RobustiPy not only accelerates robust inference but also deepens our interpretive insight into model sensitivity across the vast analytical multiverse within which scientists operate.

Tipping Point Sensitivity Analysis for Missing Data in Time-to-Event Endpoints: Model-Based and Model-Free Approaches arxiv.org/abs/2506.19988 .ME

Missing data frequently occur in clinical trials with time-to-event endpoints, often due to administrative censoring. Other reasons, such as loss to follow-up and patient withdrawal of consent, can violate the censoring-at-random assumption and hence lead to biased estimates of the treatment effect under the treatment policy estimand. Numerous methods have been proposed to conduct sensitivity analyses in these situations, one of which is the tipping point analysis. It aims to evaluate the robustness of trial conclusions by varying certain data and/or model aspects while imputing missing data. We provide an overview of the missing data considerations. The main contribution of this paper lies in categorizing and contrasting tipping point methods as two groups, namely model-based and model-free approaches, where the latter is under-emphasized in the literature. We highlight their important differences in terms of assumptions, behaviors, and interpretations. Through two case studies and simulations under various scenarios, we provide insight into how different tipping point methods impact the interpretation of trial outcomes. Through these comparisons, we aim to provide a practical guide for conducting tipping point analyses, from the choice of methods to ways of assessing clinical plausibility, ultimately contributing to more robust and reliable interpretation of clinical trial results in the presence of missing data for time-to-event endpoints.
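
A schematic of a model-based, delta-adjusted tipping point scan for a time-to-event endpoint, in the spirit of the methods surveyed above: impute post-dropout event times under increasingly pessimistic hazards until the test result flips. The data-generating model, the imputation rule, and the use of a median p-value across imputations (instead of Rubin's rules) are all illustrative simplifications.

```python
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(4)
n_per_arm, admin_censor = 300, 24.0      # months of follow-up
lam_ctrl, lam_trt = 0.08, 0.06           # true exponential hazards (illustrative)

def simulate_arm(lam, dropout_rate=0.15):
    t_event = rng.exponential(1 / lam, n_per_arm)
    t_drop = np.where(rng.random(n_per_arm) < dropout_rate,
                      rng.uniform(0, admin_censor, n_per_arm), np.inf)
    time = np.minimum.reduce([t_event, t_drop, np.full(n_per_arm, admin_censor)])
    event = (t_event <= np.minimum(t_drop, admin_censor)).astype(int)
    dropped = (t_drop < t_event) & (t_drop < admin_censor)
    return time, event, dropped

t_c, e_c, _ = simulate_arm(lam_ctrl)
t_t, e_t, drop_t = simulate_arm(lam_trt)

# Primary analysis: dropouts handled as censored at random.
print("primary log-rank p-value:", logrank_test(t_t, t_c, e_t, e_c).p_value)

# Tipping point scan: impute post-dropout event times for treatment-arm dropouts
# from an exponential with hazard delta * (control-arm hazard estimate), re-test,
# and report the first delta at which significance is lost.
lam_ctrl_hat = e_c.sum() / t_c.sum()     # exponential MLE: events / person-time
for delta in np.arange(1.0, 4.01, 0.25):
    pvals = []
    for _ in range(20):                  # a few imputations per delta
        t_imp, e_imp = t_t.copy(), e_t.copy()
        residual = rng.exponential(1 / (delta * lam_ctrl_hat), drop_t.sum())
        t_imp[drop_t] = np.minimum(t_t[drop_t] + residual, admin_censor)
        e_imp[drop_t] = (t_t[drop_t] + residual <= admin_censor).astype(int)
        pvals.append(logrank_test(t_imp, t_c, e_imp, e_c).p_value)
    if np.median(pvals) >= 0.05:
        print(f"tipping point reached near delta = {delta:.2f}")
        break
else:
    print("no tipping point within the scanned range of delta")
```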

Dynamic Causal Mediation Analysis for Intensive Longitudinal Data arxiv.org/abs/2506.20027 .ME

Intensive longitudinal data, characterized by frequent measurements across numerous time points, are increasingly common due to advances in wearable devices and mobile health technologies. We consider evaluating causal mediation pathways between time-varying exposures, time-varying mediators, and a final, distal outcome using such data. Addressing mediation questions in these settings is challenging due to the large number of potential exposures, complex mediation pathways, and intermediate confounding. Existing approaches, such as interventional and path-specific effects, become impractical for intensive longitudinal data. We propose novel mediation effects, termed natural direct and indirect excursion effects, which quantify mediation through the most immediate mediator following each treatment time. These effects are identifiable under plausible assumptions and decompose the total excursion effect. We derive efficient influence functions and propose multiply-robust estimators for these mediation effects, which accommodate flexible machine learning algorithms and optional cross-fitting. In settings where the treatment assignment mechanism is known, such as micro-randomized trials, the estimators are doubly-robust. We establish the consistency and asymptotic normality of the proposed estimators. Our methodology is illustrated using real-world data from the HeartSteps micro-randomized trial and the SleepHealth observational study.
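
For context on the decomposition claimed above, the classic point-treatment natural-effect identity is shown below; the paper's excursion effects generalize this idea to time-varying treatments with mediation through the most immediate mediator, so this display is background only, not the authors' definitions.

```latex
% Point-treatment analogue only; the excursion effects in the paper
% generalize this decomposition to time-varying treatments and mediators.
E\{Y(1, M(1)) - Y(0, M(0))\}
  \;=\; \underbrace{E\{Y(1, M(0)) - Y(0, M(0))\}}_{\text{natural direct effect}}
  \;+\; \underbrace{E\{Y(1, M(1)) - Y(1, M(0))\}}_{\text{natural indirect effect}}
```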
