Bayesian Shrinkage in High-Dimensional VAR Models: A Comparative Study arxiv.org/abs/2504.05489 .ME .AP

High-dimensional vector autoregressive (VAR) models offer a versatile framework for multivariate time series analysis, yet face critical challenges from over-parameterization and uncertain lag order. In this paper, we systematically compare three Bayesian shrinkage priors (horseshoe, lasso, and normal) and two frequentist regularization approaches (ridge and nonparametric shrinkage) under three carefully crafted simulation scenarios. These scenarios encompass (i) overfitting in a low-dimensional setting, (ii) sparse high-dimensional processes, and (iii) a combined scenario where both large dimension and overfitting complicate inference. We evaluate each method in terms of parameter estimation quality (root mean squared error, coverage, and interval length) and out-of-sample forecasting (one-step-ahead forecast RMSE). Our findings show that local-global Bayesian methods, particularly the horseshoe, dominate in maintaining accurate coverage and minimizing parameter error, even when the model is heavily over-parameterized. Frequentist ridge often yields competitive point forecasts but underestimates uncertainty, leading to sub-nominal coverage. A real-data application using macroeconomic variables from Canada illustrates how these methods perform in practice, reinforcing the advantages of local-global priors in stabilizing inference when dimension or lag order is inflated.
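
As a rough illustration of the ridge baseline that the Bayesian priors are compared against, the sketch below simulates a sparse toy VAR(1) and contrasts ridge with plain least squares; the dimension, sample size, and penalty value are arbitrary choices, not the paper's settings.

```python
# Toy illustration (not the paper's code): a sparse VAR(1) fit by ridge vs. OLS.
import numpy as np

rng = np.random.default_rng(0)
k, T, lam = 10, 200, 5.0            # series, sample length, ridge penalty (assumed)

A = np.zeros((k, k))                # sparse true coefficient matrix
A[np.diag_indices(k)] = 0.5         # persistence on the diagonal only

Y = np.zeros((T, k))
for t in range(1, T):
    Y[t] = Y[t - 1] @ A.T + rng.normal(scale=0.5, size=k)

X, Z = Y[:-1], Y[1:]                # lagged design and response
ols = np.linalg.solve(X.T @ X, X.T @ Z)
ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Z)

rmse = lambda B: np.sqrt(np.mean((B - A.T) ** 2))
print(f"OLS coefficient RMSE   {rmse(ols):.3f}")
print(f"Ridge coefficient RMSE {rmse(ridge):.3f}")
```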

Adaptive Design for Contour Estimation from Computer Experiments with Quantitative and Qualitative Inputs arxiv.org/abs/2504.05498 .ME

Computer experiments with quantitative and qualitative inputs are widely used to study many scientific and engineering processes. Much of the existing work has focused on design and modeling or process optimization for such experiments. This paper proposes an adaptive design approach for estimating a contour from computer experiments with quantitative and qualitative inputs. A new criterion is introduced to search for the follow-up inputs. The key features of the proposed criterion are (a) the criterion yields adaptive search regions; and (b) it is region-based cooperative in that, at each stage of the sequential procedure, the candidate points in the design space are divided into two disjoint groups using confidence bounds, and within each group an acquisition function is used to select a candidate point. Of the two selected points, the procedure chooses the point that is closer to the contour level and has the higher uncertainty, or the point with the higher uncertainty when the distance between its prediction and the contour level is within a threshold. The proposed approach provides empirically more accurate contour estimation than existing approaches, as illustrated in numerical examples and a real application. Theoretical justification of the proposed adaptive search region is given.
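
The general flavor of contour-oriented acquisition can be sketched with a generic straddle-style score that trades predictive uncertainty against distance from the target level; this is not the paper's region-based criterion, and the one-dimensional test function, target level, and constants below are purely illustrative.

```python
# Generic contour-seeking acquisition sketch (not the paper's criterion).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + 0.5 * x            # toy simulator (assumed)
level = 0.8                                      # target contour level (assumed)

X = rng.uniform(0, 3, size=(8, 1))               # initial design
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
gp.fit(X, y)

cand = np.linspace(0, 3, 200).reshape(-1, 1)     # candidate grid
mu, sd = gp.predict(cand, return_std=True)

# Straddle-style score: high uncertainty near the contour level is rewarded.
score = 1.96 * sd - np.abs(mu - level)
x_next = cand[np.argmax(score)]
print("next input to evaluate:", x_next)
```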

Bayesian Modal Regression for Forecast Combinations arxiv.org/abs/2504.03859 .ME

Forecast combination methods have traditionally emphasized symmetric loss functions, particularly squared error loss, with equally weighted combinations often justified as a robust approach under such criteria. However, these justifications do not extend to asymmetric loss functions, where optimally weighted combinations may provide superior predictive performance. This study incorporates modal regression into forecast combinations, offering a Bayesian hierarchical framework that models the conditional mode of the response through combinations of time-varying parameters and exponential discounting. The proposed approach utilizes error distributions characterized by asymmetry and heavy tails, specifically the asymmetric Laplace, asymmetric normal, and reverse Gumbel distributions. Simulated data validate the parameter estimation for the modal regression models, confirming the robustness of the proposed methodology. Application of these methodologies to a real-world analyst forecast dataset shows that modal regression with asymmetric Laplace errors outperforms mean regression on two key performance metrics: the hit rate, which measures the accuracy of classifying the sign of revenue surprises, and the win rate, which assesses the proportion of forecasts surpassing the equally weighted consensus. These results underscore the presence of skewness and fat-tailed behavior in forecast combination errors for revenue forecasting, highlighting the advantages of modal regression in financial applications.
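
A minimal sketch of regression under an asymmetric Laplace working likelihood is shown below (a simple maximum-likelihood toy fit, not the paper's Bayesian hierarchical combination model); the asymmetry parameter and the skewed simulated errors are assumptions chosen only to make the asymmetry visible.

```python
# Sketch: linear regression under an asymmetric Laplace (AL) working likelihood
# (illustrative ML fit, not the paper's Bayesian hierarchy).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, tau = 300, 0.3                                  # sample size, asymmetry (assumed)
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.gumbel(scale=0.5, size=n)  # skewed, heavy-tailed errors

def neg_loglik(theta):
    b0, b1, log_s = theta
    s = np.exp(log_s)
    u = (y - b0 - b1 * x) / s
    rho = u * (tau - (u < 0))                      # check loss of the AL density
    return np.sum(rho) + n * np.log(s)             # negative log-likelihood (up to a constant)

fit = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead")
print("fitted AL location coefficients:", fit.x[:2])
```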

MaxTDA: Robust Statistical Inference for Maximal Persistence in Topological Data Analysis arxiv.org/abs/2504.03897 .ME .AT .CO

Persistent homology is an area within topological data analysis (TDA) that can uncover different dimensional holes (connected components, loops, voids, etc.) in data. The holes are characterized, in part, by how long they persist across different scales. Noisy data can result in many additional holes that are not true topological signal. Various robust TDA techniques have been proposed to reduce the number of noisy holes; however, these robust methods tend to also reduce the topological signal. This work introduces Maximal TDA (MaxTDA), a statistical framework addressing a limitation in TDA wherein robust inference techniques systematically underestimate the persistence of significant homological features. MaxTDA combines kernel density estimation with level-set thresholding via rejection sampling to generate consistent estimators for the maximal persistence features that minimize bias while maintaining robustness to noise and outliers. We establish the consistency of the sampling procedure and the stability of the maximal persistence estimator. The framework also enables statistical inference on topological features through rejection bands, constructed from quantiles that bound the estimator's deviation probability. MaxTDA is particularly valuable in applications where precise quantification of statistically significant topological features is essential for revealing underlying structural properties in complex datasets. Numerical simulations across varied datasets, including an example from exoplanet astronomy, highlight the effectiveness of MaxTDA in recovering true topological signals.
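
The level-set rejection-sampling ingredient can be illustrated generically: estimate a density, then keep only resampled points whose estimated density exceeds a threshold, so that outliers are filtered out before any persistence computation. The noisy-circle data and cutoff below are assumptions, and the persistent-homology step itself is omitted.

```python
# Sketch of KDE level-set filtering via rejection sampling
# (generic illustration; the persistence computation is omitted).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=300)
circle = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.1, size=(300, 2))
noise = rng.uniform(-2, 2, size=(60, 2))            # outliers obscuring the loop
data = np.vstack([circle, noise])

kde = gaussian_kde(data.T)                          # kernel density estimate
dens = kde(data.T)
threshold = np.quantile(dens, 0.30)                 # assumed level-set cutoff

# Rejection step: resample the cloud, keeping only points above the density level.
proposals = data[rng.integers(0, len(data), size=2000)]
keep = proposals[kde(proposals.T) > threshold]
print(f"kept {len(keep)} of 2000 proposals above the level set")
```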

Confirmatory Biomarker Identification via Derandomized Knockoffs for Cox Regression with k-FWER Control arxiv.org/abs/2504.03907 .ME .AP

Selecting important features in high-dimensional survival analysis is critical for identifying confirmatory biomarkers while maintaining rigorous error control. In this paper, we propose a derandomized knockoffs procedure for Cox regression that enhances stability in feature selection while maintaining rigorous control over the k-familywise error rate (k-FWER). By aggregating across multiple randomized knockoff realizations, our approach mitigates the instability commonly observed with conventional knockoffs. Through extensive simulations, we demonstrate that our method consistently outperforms standard knockoffs in both selection power and error control. Moreover, we apply our procedure to a clinical dataset on primary biliary cirrhosis (PBC) to identify key prognostic biomarkers associated with patient survival. The results confirm the superior stability of the derandomized knockoffs method, allowing for a more reliable identification of important clinical variables. Additionally, our approach is applicable to datasets containing both continuous and categorical covariates, broadening its utility in real-world biomedical studies. This framework provides a robust and interpretable solution for high-dimensional survival analysis, making it particularly suitable for applications requiring precise and stable variable selection.
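
The derandomization idea, aggregating selections over repeated randomized runs and keeping only stably selected features, can be sketched as below. The per-run selector here is a stand-in (a lasso on simulated Gaussian data, re-randomizing the data rather than the knockoff construction), not a Cox knockoff filter, and the stability threshold is an assumed value.

```python
# Sketch of the derandomization step: aggregate selections over repeated
# randomized runs and keep features selected in a large fraction of runs.
# The per-run selector is a stand-in, not a Cox knockoff filter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, n_runs, eta = 200, 50, 30, 0.8          # eta: required selection frequency (assumed)
beta = np.zeros(p)
beta[:5] = 2.0                                # 5 truly active features

counts = np.zeros(p)
for _ in range(n_runs):
    X = rng.normal(size=(n, p))               # fresh randomization each run
    y = X @ beta + rng.normal(size=n)
    sel = Lasso(alpha=0.2).fit(X, y).coef_ != 0
    counts += sel                             # tally how often each feature is picked

stable = np.where(counts / n_runs >= eta)[0]
print("stably selected features:", stable)
```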

Batch Bayesian Optimization for High-Dimensional Experimental Design: Simulation and Visualization arxiv.org/abs/2504.03943 -mat.mtrl-sci .ML .LG

Bayesian Optimization (BO) is increasingly used to guide experimental optimization tasks. To elucidate BO behavior in noisy and high-dimensional settings typical for materials science applications, we perform batch BO of two six-dimensional test functions: an Ackley function representing a needle-in-a-haystack problem and a Hartmann function representing a problem with a false maximum whose value is close to the global maximum. We show learning curves, performance metrics, and visualizations to effectively track the evolution of optimization in high dimensions and evaluate how they are affected by noise, the batch-picking method, the choice of acquisition function, and its exploration hyperparameter values. We find that the effects of noise depend on the problem landscape; therefore, prior knowledge of the domain structure and noise level is needed when designing BO. The Ackley function optimization is significantly degraded by noise, with a complete loss of ground truth resemblance when noise equals 10% of the maximum objective value. For the Hartmann function, even in the absence of noise, a significant fraction of the initial samplings identify the false maximum instead of the ground truth maximum as the optimum of the function; with increasing noise, BO remains effective, albeit with an increasing probability of landing on the false maximum. This study systematically highlights the critical issues when setting up BO and choosing synthetic data to test experimental design. The results and methodology will facilitate wider utilization of BO in guiding experiments, specifically in high-dimensional settings.
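
A generic batch-BO loop on a noisy six-dimensional Ackley function might look like the sketch below; the surrogate, acquisition rule, batch size, and noise level are illustrative assumptions rather than the paper's benchmark configuration.

```python
# Generic batch-BO sketch on a noisy 6-D Ackley function (illustrative setup):
# GP surrogate + upper confidence bound, greedy batch of 4 per round.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(5)
d, noise_sd, ucb_beta = 6, 0.5, 2.0                # assumed dimension, noise, UCB weight

def ackley(x):                                     # negated Ackley: maximum 0 at the origin
    x = np.atleast_2d(x)
    return -(-20 * np.exp(-0.2 * np.sqrt((x ** 2).mean(1)))
             - np.exp(np.cos(2 * np.pi * x).mean(1)) + 20 + np.e)

X = rng.uniform(-5, 5, size=(20, d))               # initial design
y = ackley(X) + rng.normal(scale=noise_sd, size=len(X))

for _ in range(5):                                 # 5 BO rounds, batch size 4
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=noise_sd ** 2,
                                  normalize_y=True).fit(X, y)
    cand = rng.uniform(-5, 5, size=(2000, d))      # random candidate pool
    mu, sd = gp.predict(cand, return_std=True)
    batch = cand[np.argsort(mu + ucb_beta * sd)[-4:]]   # greedy UCB batch
    X = np.vstack([X, batch])
    y = np.append(y, ackley(batch) + rng.normal(scale=noise_sd, size=4))

print("best noisy observation so far:", y.max())
```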

Spatially-Heterogeneous Causal Bayesian Networks for Seismic Multi-Hazard Estimation: A Variational Approach with Gaussian Processes and Normalizing Flows arxiv.org/abs/2504.04013 .ML .AP .LG

Post-earthquake hazard and impact estimation are critical for effective disaster response, yet current approaches face significant limitations. Traditional models employ fixed parameters regardless of geographical context, misrepresenting how seismic effects vary across diverse landscapes, while remote sensing technologies struggle to distinguish between co-located hazards. We address these challenges with a spatially-aware causal Bayesian network that decouples co-located hazards by modeling their causal relationships with location-specific parameters. Our framework integrates sensing observations, latent variables, and spatial heterogeneity through a novel combination of Gaussian Processes with normalizing flows, enabling us to capture how the same earthquake produces different effects across varied geological and topographical features. Evaluations across three earthquakes demonstrate that Spatial-VCBN achieves Area Under the Curve (AUC) improvements of up to 35.2% over existing methods. These results highlight the critical importance of modeling spatial heterogeneity in causal mechanisms for accurate disaster assessment, with direct implications for improving emergency response resource allocation.
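
One ingredient, a location-specific parameter drawn from a Gaussian-process prior so that the same shaking input yields different damage probabilities at different sites, can be sketched as follows; the causal network, normalizing flows, and variational inference of the full framework are not reproduced here, and all constants are assumptions.

```python
# Minimal sketch of a spatially varying coefficient: the same hazard input maps to
# different response probabilities at different sites via a GP-distributed parameter.
# Purely illustrative; not the paper's model.
import numpy as np

rng = np.random.default_rng(6)
sites = rng.uniform(0, 10, size=(100, 2))          # site coordinates (assumed)

# Squared-exponential GP covariance over site locations.
d2 = ((sites[:, None, :] - sites[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2 / 2.0 ** 2) + 1e-8 * np.eye(len(sites))
w = 1.0 + np.linalg.cholesky(K) @ rng.normal(size=len(sites))   # location-specific coefficients

shaking = rng.uniform(0, 1, size=len(sites))       # shared hazard input
p_damage = 1 / (1 + np.exp(-(w * shaking - 1.0)))  # site-dependent damage probability
print("damage probability range:", p_damage.min().round(3), p_damage.max().round(3))
```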

A Lanczos-Based Algorithmic Approach for Spike Detection in Large Sample Covariance Matrices arxiv.org/abs/2504.03066 .ST .PR .CO .TH

A computational transition for detecting multivariate shuffled linear regression by low-degree polynomials arxiv.org/abs/2504.03097 .ML .PR .ST .TH .LG

In this paper, we study the problem of multivariate shuffled linear regression, where the correspondence between predictors and responses in a linear model is obfuscated by a latent permutation. Specifically, we investigate the model $Y=\tfrac{1}{\sqrt{1+\sigma^2}}(\Pi_* X Q_* + \sigma Z)$, where $X$ is an $n\times d$ standard Gaussian design matrix, $Z$ is an $n\times m$ Gaussian noise matrix, $\Pi_*$ is an unknown $n\times n$ permutation matrix, and $Q_*$ is an unknown $d\times m$ matrix on the Grassmannian manifold satisfying $Q_*^{\top} Q_* = \mathbb{I}_m$. Consider the hypothesis testing problem of distinguishing this model from the case where $X$ and $Y$ are independent Gaussian random matrices of sizes $n\times d$ and $n\times m$, respectively. Our results reveal a phase transition phenomenon in the performance of low-degree polynomial algorithms for this task. (1) When $m=o(d)$, we show that all degree-$D$ polynomials fail to distinguish these two models even when $\sigma=0$, provided that $D^4=o\big( \tfrac{d}{m} \big)$. (2) When $m=d$ and $\sigma=\omega(1)$, we show that all degree-$D$ polynomials fail to distinguish these two models provided that $D=o(\sigma)$. (3) When $m=d$ and $\sigma=o(1)$, we show that there exists a constant-degree polynomial that strongly distinguishes these two models. These results establish a smooth transition in the effectiveness of low-degree polynomial algorithms for this problem, highlighting the interplay between the dimensions $m$ and $d$, the noise level $\sigma$, and the computational complexity of the testing task.
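
The planted model is straightforward to simulate, which helps fix the notation; the sketch below draws one instance of the planted distribution and one of the null, with arbitrary dimensions and noise level (these are assumptions, not values from the paper).

```python
# Simulating the planted model Y = (Pi X Q + sigma Z) / sqrt(1 + sigma^2)
# versus the null of independent Gaussians. Dimensions and noise are arbitrary.
import numpy as np

rng = np.random.default_rng(7)
n, d, m, sigma = 500, 40, 20, 0.5

X = rng.normal(size=(n, d))                       # design matrix
Z = rng.normal(size=(n, m))                       # noise matrix
Pi = np.eye(n)[rng.permutation(n)]                # latent row permutation
Q, _ = np.linalg.qr(rng.normal(size=(d, m)))      # column-orthonormal: Q^T Q = I_m
Y_planted = (Pi @ X @ Q + sigma * Z) / np.sqrt(1 + sigma ** 2)
Y_null = rng.normal(size=(n, m))                  # independent of X under the null

print("Q^T Q close to identity:", np.allclose(Q.T @ Q, np.eye(m)))
print("planted / null shapes:", Y_planted.shape, Y_null.shape)
```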

Interim Analysis in Sequential Multiple Assignment Randomized Trials for Survival Outcomes arxiv.org/abs/2504.03143 .ME

Sequential multiple assignment randomized trials mimic the actual treatment processes experienced by physicians and patients in clinical settings and inform the comparative effectiveness of dynamic treatment regimes. In such trials, patients go through multiple stages of treatment, and the treatment assignment is adapted over time based on individual patient characteristics such as disease status and treatment history. In this work, we develop and evaluate statistically valid interim monitoring approaches to allow for early termination of sequential multiple assignment randomized trials for efficacy, targeting survival outcomes. We propose a weighted log-rank Chi-square statistic to account for overlapping treatment paths and quantify how the log-rank statistics at two different analysis points are correlated. Efficacy boundaries at multiple interim analyses can then be established using the Pocock, O'Brien-Fleming, and Lan-DeMets boundaries. We run extensive simulations to comparatively evaluate the operating characteristics (type I error and power) of our interim monitoring procedure based on the proposed statistic and another existing statistic. The methods are demonstrated via an analysis of a neuroblastoma dataset.
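
The alpha-spending ingredient is easy to illustrate: the Lan-DeMets O'Brien-Fleming-type spending function evaluated at assumed information fractions gives the cumulative type I error available at each look. Turning spent alpha into actual boundary values requires the joint distribution of the correlated interim statistics, which is where a correlation result such as the one in the paper enters; the sketch below shows only the spending step.

```python
# Sketch of the Lan-DeMets O'Brien-Fleming-type alpha-spending function at
# assumed information fractions (boundary computation itself is omitted).
import numpy as np
from scipy.stats import norm

alpha = 0.05
t = np.array([0.25, 0.50, 0.75, 1.00])            # assumed information fractions
z = norm.ppf(1 - alpha / 2)

spent = 2 * (1 - norm.cdf(z / np.sqrt(t)))        # cumulative alpha spent by time t
increments = np.diff(np.concatenate([[0.0], spent]))
for ti, s, inc in zip(t, spent, increments):
    print(f"t={ti:.2f}  cumulative alpha={s:.4f}  increment={inc:.4f}")
```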

A multi-locus predictiveness curve and its summary assessment for genetic risk prediction arxiv.org/abs/2504.00024 .ME .AI .LG
