Bayesian Shrinkage in High-Dimensional VAR Models: A Comparative Study arxiv.org/abs/2504.05489 .ME .AP

High-dimensional vector autoregressive (VAR) models offer a versatile framework for multivariate time series analysis, yet face critical challenges from over-parameterization and uncertain lag order. In this paper, we systematically compare three Bayesian shrinkage priors (horseshoe, lasso, and normal) and two frequentist regularization approaches (ridge and nonparametric shrinkage) under three carefully crafted simulation scenarios. These scenarios encompass (i) overfitting in a low-dimensional setting, (ii) sparse high-dimensional processes, and (iii) a combined scenario where both large dimension and overfitting complicate inference. We evaluate each method in terms of parameter estimation quality (root mean squared error, coverage, and interval length) and out-of-sample forecasting (one-step-ahead forecast RMSE). Our findings show that local-global Bayesian methods, particularly the horseshoe, dominate in maintaining accurate coverage and minimizing parameter error, even when the model is heavily over-parameterized. Frequentist ridge often yields competitive point forecasts but underestimates uncertainty, leading to sub-nominal coverage. A real-data application using macroeconomic variables from Canada illustrates how these methods perform in practice, reinforcing the advantages of local-global priors in stabilizing inference when dimension or lag order is inflated.
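
As a rough illustration of the ridge baseline that the Bayesian priors are compared against, the sketch below simulates a sparse toy VAR(1) and contrasts ridge with plain least squares; the dimension, sample size, and penalty value are arbitrary choices, not the paper's settings.

```python
# Toy illustration (not the paper's code): a sparse VAR(1) fit by ridge vs. OLS.
import numpy as np

rng = np.random.default_rng(0)
k, T, lam = 10, 200, 5.0            # series, sample length, ridge penalty (assumed)

A = np.zeros((k, k))                # sparse true coefficient matrix
A[np.diag_indices(k)] = 0.5         # persistence on the diagonal only

Y = np.zeros((T, k))
for t in range(1, T):
    Y[t] = Y[t - 1] @ A.T + rng.normal(scale=0.5, size=k)

X, Z = Y[:-1], Y[1:]                # lagged design and response
ols = np.linalg.solve(X.T @ X, X.T @ Z)
ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Z)

rmse = lambda B: np.sqrt(np.mean((B - A.T) ** 2))
print(f"OLS coefficient RMSE   {rmse(ols):.3f}")
print(f"Ridge coefficient RMSE {rmse(ridge):.3f}")
```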

Adaptive Design for Contour Estimation from Computer Experiments with Quantitative and Qualitative Inputs arxiv.org/abs/2504.05498 .ME

Computer experiments with quantitative and qualitative inputs are widely used to study many scientific and engineering processes. Much of the existing work has focused on design and modeling or process optimization for such experiments. This paper proposes an adaptive design approach for estimating a contour from computer experiments with quantitative and qualitative inputs. A new criterion is introduced to search for the follow-up inputs. The key features of the proposed criterion are (a) the criterion yields adaptive search regions; and (b) it is region-based cooperative in that, at each stage of the sequential procedure, the candidate points in the design space are divided into two disjoint groups using confidence bounds, and within each group an acquisition function is used to select a candidate point. Of the two selected points, the procedure chooses the point that is closer to the contour level and has the higher uncertainty, or the point with the higher uncertainty when the distance between its prediction and the contour level is within a threshold. The proposed approach provides empirically more accurate contour estimation than existing approaches, as illustrated in numerical examples and a real application. Theoretical justification of the proposed adaptive search region is given.
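
The general flavor of contour-oriented acquisition can be sketched with a generic straddle-style score that trades predictive uncertainty against distance from the target level; this is not the paper's region-based criterion, and the one-dimensional test function, target level, and constants below are purely illustrative.

```python
# Generic contour-seeking acquisition sketch (not the paper's criterion).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + 0.5 * x            # toy simulator (assumed)
level = 0.8                                      # target contour level (assumed)

X = rng.uniform(0, 3, size=(8, 1))               # initial design
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
gp.fit(X, y)

cand = np.linspace(0, 3, 200).reshape(-1, 1)     # candidate grid
mu, sd = gp.predict(cand, return_std=True)

# Straddle-style score: high uncertainty near the contour level is rewarded.
score = 1.96 * sd - np.abs(mu - level)
x_next = cand[np.argmax(score)]
print("next input to evaluate:", x_next)
```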

Bayesian Modal Regression for Forecast Combinations arxiv.org/abs/2504.03859 .ME

Forecast combination methods have traditionally emphasized symmetric loss functions, particularly squared error loss, with equally weighted combinations often justified as a robust approach under such criteria. However, these justifications do not extend to asymmetric loss functions, where optimally weighted combinations may provide superior predictive performance. This study incorporates modal regression into forecast combinations, offering a Bayesian hierarchical framework that models the conditional mode of the response through combinations of time-varying parameters and exponential discounting. The proposed approach utilizes error distributions characterized by asymmetry and heavy tails, specifically the asymmetric Laplace, asymmetric normal, and reverse Gumbel distributions. Simulated data validate the parameter estimation for the modal regression models, confirming the robustness of the proposed methodology. Application of these methodologies to a real-world analyst forecast dataset shows that modal regression with asymmetric Laplace errors outperforms mean regression on two key performance metrics: the hit rate, which measures the accuracy of classifying the sign of revenue surprises, and the win rate, which assesses the proportion of forecasts surpassing the equally weighted consensus. These results underscore the presence of skewness and fat-tailed behavior in forecast combination errors for revenue forecasting, highlighting the advantages of modal regression in financial applications.
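
A minimal sketch of regression under an asymmetric Laplace working likelihood is shown below (a simple maximum-likelihood toy fit, not the paper's Bayesian hierarchical combination model); the asymmetry parameter and the skewed simulated errors are assumptions chosen only to make the asymmetry visible.

```python
# Sketch: linear regression under an asymmetric Laplace (AL) working likelihood
# (illustrative ML fit, not the paper's Bayesian hierarchy).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, tau = 300, 0.3                                  # sample size, asymmetry (assumed)
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.gumbel(scale=0.5, size=n)  # skewed, heavy-tailed errors

def neg_loglik(theta):
    b0, b1, log_s = theta
    s = np.exp(log_s)
    u = (y - b0 - b1 * x) / s
    rho = u * (tau - (u < 0))                      # check loss of the AL density
    return np.sum(rho) + n * np.log(s)             # negative log-likelihood (up to a constant)

fit = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead")
print("fitted AL location coefficients:", fit.x[:2])
```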

MaxTDA: Robust Statistical Inference for Maximal Persistence in Topological Data Analysis arxiv.org/abs/2504.03897 .ME .AT .CO

Persistent homology is an area within topological data analysis (TDA) that can uncover different dimensional holes (connected components, loops, voids, etc.) in data. The holes are characterized, in part, by how long they persist across different scales. Noisy data can result in many additional holes that are not true topological signal. Various robust TDA techniques have been proposed to reduce the number of noisy holes; however, these robust methods tend to also reduce the topological signal. This work introduces Maximal TDA (MaxTDA), a statistical framework addressing a limitation in TDA wherein robust inference techniques systematically underestimate the persistence of significant homological features. MaxTDA combines kernel density estimation with level-set thresholding via rejection sampling to generate consistent estimators for the maximal persistence features that minimize bias while maintaining robustness to noise and outliers. We establish the consistency of the sampling procedure and the stability of the maximal persistence estimator. The framework also enables statistical inference on topological features through rejection bands, constructed from quantiles that bound the estimator's deviation probability. MaxTDA is particularly valuable in applications where precise quantification of statistically significant topological features is essential for revealing underlying structural properties in complex datasets. Numerical simulations across varied datasets, including an example from exoplanet astronomy, highlight the effectiveness of MaxTDA in recovering true topological signals.
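
The level-set rejection-sampling ingredient can be illustrated generically: estimate a density, then keep only resampled points whose estimated density exceeds a threshold, so that outliers are filtered out before any persistence computation. The noisy-circle data and cutoff below are assumptions, and the persistent-homology step itself is omitted.

```python
# Sketch of KDE level-set filtering via rejection sampling
# (generic illustration; the persistence computation is omitted).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=300)
circle = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.1, size=(300, 2))
noise = rng.uniform(-2, 2, size=(60, 2))            # outliers obscuring the loop
data = np.vstack([circle, noise])

kde = gaussian_kde(data.T)                          # kernel density estimate
dens = kde(data.T)
threshold = np.quantile(dens, 0.30)                 # assumed level-set cutoff

# Rejection step: resample the cloud, keeping only points above the density level.
proposals = data[rng.integers(0, len(data), size=2000)]
keep = proposals[kde(proposals.T) > threshold]
print(f"kept {len(keep)} of 2000 proposals above the level set")
```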

Confirmatory Biomarker Identification via Derandomized Knockoffs for Cox Regression with k-FWER Control arxiv.org/abs/2504.03907 .ME .AP

Selecting important features in high-dimensional survival analysis is critical for identifying confirmatory biomarkers while maintaining rigorous error control. In this paper, we propose a derandomized knockoffs procedure for Cox regression that enhances stability in feature selection while maintaining rigorous control over the k-familywise error rate (k-FWER). By aggregating across multiple randomized knockoff realizations, our approach mitigates the instability commonly observed with conventional knockoffs. Through extensive simulations, we demonstrate that our method consistently outperforms standard knockoffs in both selection power and error control. Moreover, we apply our procedure to a clinical dataset on primary biliary cirrhosis (PBC) to identify key prognostic biomarkers associated with patient survival. The results confirm the superior stability of the derandomized knockoffs method, allowing for a more reliable identification of important clinical variables. Additionally, our approach is applicable to datasets containing both continuous and categorical covariates, broadening its utility in real-world biomedical studies. This framework provides a robust and interpretable solution for high-dimensional survival analysis, making it particularly suitable for applications requiring precise and stable variable selection.
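
The derandomization idea, aggregating selections over repeated randomized runs and keeping only stably selected features, can be sketched as below. The per-run selector here is a stand-in (a lasso on simulated Gaussian data, re-randomizing the data rather than the knockoff construction), not a Cox knockoff filter, and the stability threshold is an assumed value.

```python
# Sketch of the derandomization step: aggregate selections over repeated
# randomized runs and keep features selected in a large fraction of runs.
# The per-run selector is a stand-in, not a Cox knockoff filter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, n_runs, eta = 200, 50, 30, 0.8          # eta: required selection frequency (assumed)
beta = np.zeros(p)
beta[:5] = 2.0                                # 5 truly active features

counts = np.zeros(p)
for _ in range(n_runs):
    X = rng.normal(size=(n, p))               # fresh randomization each run
    y = X @ beta + rng.normal(size=n)
    sel = Lasso(alpha=0.2).fit(X, y).coef_ != 0
    counts += sel                             # tally how often each feature is picked

stable = np.where(counts / n_runs >= eta)[0]
print("stably selected features:", stable)
```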

Batch Bayesian Optimization for High-Dimensional Experimental Design: Simulation and Visualization arxiv.org/abs/2504.03943 -mat.mtrl-sci .ML .LG

Bayesian Optimization (BO) is increasingly used to guide experimental optimization tasks. To elucidate BO behavior in noisy and high-dimensional settings typical for materials science applications, we perform batch BO of two six-dimensional test functions: an Ackley function representing a needle-in-a-haystack problem and a Hartmann function representing a problem with a false maximum whose value is close to the global maximum. We show learning curves, performance metrics, and visualizations to effectively track the evolution of optimization in high dimensions and evaluate how they are affected by noise, the batch-picking method, the choice of acquisition function, and its exploration hyperparameter values. We find that the effects of noise depend on the problem landscape; therefore, prior knowledge of the domain structure and noise level is needed when designing BO. The Ackley function optimization is significantly degraded by noise, with a complete loss of ground truth resemblance when noise equals 10% of the maximum objective value. For the Hartmann function, even in the absence of noise, a significant fraction of the initial samplings identify the false maximum instead of the ground truth maximum as the optimum of the function; with increasing noise, BO remains effective, albeit with an increasing probability of landing on the false maximum. This study systematically highlights the critical issues when setting up BO and choosing synthetic data to test experimental design. The results and methodology will facilitate wider utilization of BO in guiding experiments, specifically in high-dimensional settings.
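
A generic batch-BO loop on a noisy six-dimensional Ackley function might look like the sketch below; the surrogate, acquisition rule, batch size, and noise level are illustrative assumptions rather than the paper's benchmark configuration.

```python
# Generic batch-BO sketch on a noisy 6-D Ackley function (illustrative setup):
# GP surrogate + upper confidence bound, greedy batch of 4 per round.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(5)
d, noise_sd, ucb_beta = 6, 0.5, 2.0                # assumed dimension, noise, UCB weight

def ackley(x):                                     # negated Ackley: maximum 0 at the origin
    x = np.atleast_2d(x)
    return -(-20 * np.exp(-0.2 * np.sqrt((x ** 2).mean(1)))
             - np.exp(np.cos(2 * np.pi * x).mean(1)) + 20 + np.e)

X = rng.uniform(-5, 5, size=(20, d))               # initial design
y = ackley(X) + rng.normal(scale=noise_sd, size=len(X))

for _ in range(5):                                 # 5 BO rounds, batch size 4
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=noise_sd ** 2,
                                  normalize_y=True).fit(X, y)
    cand = rng.uniform(-5, 5, size=(2000, d))      # random candidate pool
    mu, sd = gp.predict(cand, return_std=True)
    batch = cand[np.argsort(mu + ucb_beta * sd)[-4:]]   # greedy UCB batch
    X = np.vstack([X, batch])
    y = np.append(y, ackley(batch) + rng.normal(scale=noise_sd, size=4))

print("best noisy observation so far:", y.max())
```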

Spatially-Heterogeneous Causal Bayesian Networks for Seismic Multi-Hazard Estimation: A Variational Approach with Gaussian Processes and Normalizing Flows arxiv.org/abs/2504.04013 .ML .AP .LG

Post-earthquake hazard and impact estimation are critical for effective disaster response, yet current approaches face significant limitations. Traditional models employ fixed parameters regardless of geographical context, misrepresenting how seismic effects vary across diverse landscapes, while remote sensing technologies struggle to distinguish between co-located hazards. We address these challenges with a spatially-aware causal Bayesian network that decouples co-located hazards by modeling their causal relationships with location-specific parameters. Our framework integrates sensing observations, latent variables, and spatial heterogeneity through a novel combination of Gaussian Processes with normalizing flows, enabling us to capture how the same earthquake produces different effects across varied geological and topographical features. Evaluations across three earthquakes demonstrate that Spatial-VCBN achieves Area Under the Curve (AUC) improvements of up to 35.2% over existing methods. These results highlight the critical importance of modeling spatial heterogeneity in causal mechanisms for accurate disaster assessment, with direct implications for improving emergency response resource allocation.
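
One ingredient, a location-specific parameter drawn from a Gaussian-process prior so that the same shaking input yields different damage probabilities at different sites, can be sketched as follows; the causal network, normalizing flows, and variational inference of the full framework are not reproduced here, and all constants are assumptions.

```python
# Minimal sketch of a spatially varying coefficient: the same hazard input maps to
# different response probabilities at different sites via a GP-distributed parameter.
# Purely illustrative; not the paper's model.
import numpy as np

rng = np.random.default_rng(6)
sites = rng.uniform(0, 10, size=(100, 2))          # site coordinates (assumed)

# Squared-exponential GP covariance over site locations.
d2 = ((sites[:, None, :] - sites[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2 / 2.0 ** 2) + 1e-8 * np.eye(len(sites))
w = 1.0 + np.linalg.cholesky(K) @ rng.normal(size=len(sites))   # location-specific coefficients

shaking = rng.uniform(0, 1, size=len(sites))       # shared hazard input
p_damage = 1 / (1 + np.exp(-(w * shaking - 1.0)))  # site-dependent damage probability
print("damage probability range:", p_damage.min().round(3), p_damage.max().round(3))
```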

A Lanczos-Based Algorithmic Approach for Spike Detection in Large Sample Covariance Matrices arxiv.org/abs/2504.03066 .ST .PR .CO .TH

A computational transition for detecting multivariate shuffled linear regression by low-degree polynomials arxiv.org/abs/2504.03097 .ML .PR .ST .TH .LG

In this paper, we study the problem of multivariate shuffled linear regression, where the correspondence between predictors and responses in a linear model is obfuscated by a latent permutation. Specifically, we investigate the model $Y=\tfrac{1}{\sqrt{1+\sigma^2}}(\Pi_* X Q_* + \sigma Z)$, where $X$ is an $n\times d$ standard Gaussian design matrix, $Z$ is an $n\times m$ Gaussian noise matrix, $\Pi_*$ is an unknown $n\times n$ permutation matrix, and $Q_*$ is an unknown $d\times m$ matrix on the Grassmannian manifold satisfying $Q_*^{\top} Q_* = \mathbb{I}_m$. Consider the hypothesis testing problem of distinguishing this model from the case where $X$ and $Y$ are independent Gaussian random matrices of sizes $n\times d$ and $n\times m$, respectively. Our results reveal a phase transition phenomenon in the performance of low-degree polynomial algorithms for this task. (1) When $m=o(d)$, we show that all degree-$D$ polynomials fail to distinguish these two models even when $\sigma=0$, provided that $D^4=o\big( \tfrac{d}{m} \big)$. (2) When $m=d$ and $\sigma=\omega(1)$, we show that all degree-$D$ polynomials fail to distinguish these two models provided that $D=o(\sigma)$. (3) When $m=d$ and $\sigma=o(1)$, we show that there exists a constant-degree polynomial that strongly distinguishes these two models. These results establish a smooth transition in the effectiveness of low-degree polynomial algorithms for this problem, highlighting the interplay between the dimensions $m$ and $d$, the noise level $\sigma$, and the computational complexity of the testing task.
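
The planted model is straightforward to simulate, which helps fix the notation; the sketch below draws one instance of the planted distribution and one of the null, with arbitrary dimensions and noise level (these are assumptions, not values from the paper).

```python
# Simulating the planted model Y = (Pi X Q + sigma Z) / sqrt(1 + sigma^2)
# versus the null of independent Gaussians. Dimensions and noise are arbitrary.
import numpy as np

rng = np.random.default_rng(7)
n, d, m, sigma = 500, 40, 20, 0.5

X = rng.normal(size=(n, d))                       # design matrix
Z = rng.normal(size=(n, m))                       # noise matrix
Pi = np.eye(n)[rng.permutation(n)]                # latent row permutation
Q, _ = np.linalg.qr(rng.normal(size=(d, m)))      # column-orthonormal: Q^T Q = I_m
Y_planted = (Pi @ X @ Q + sigma * Z) / np.sqrt(1 + sigma ** 2)
Y_null = rng.normal(size=(n, m))                  # independent of X under the null

print("Q^T Q close to identity:", np.allclose(Q.T @ Q, np.eye(m)))
print("planted / null shapes:", Y_planted.shape, Y_null.shape)
```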

Interim Analysis in Sequential Multiple Assignment Randomized Trials for Survival Outcomes arxiv.org/abs/2504.03143 .ME

Sequential multiple assignment randomized trials mimic the actual treatment processes experienced by physicians and patients in clinical settings and inform the comparative effectiveness of dynamic treatment regimes. In such trials, patients go through multiple stages of treatment, and the treatment assignment is adapted over time based on individual patient characteristics such as disease status and treatment history. In this work, we develop and evaluate statistically valid interim monitoring approaches to allow for early termination of sequential multiple assignment randomized trials for efficacy, targeting survival outcomes. We propose a weighted log-rank Chi-square statistic to account for overlapping treatment paths and quantify how the log-rank statistics at two different analysis points are correlated. Efficacy boundaries at multiple interim analyses can then be established using the Pocock, O'Brien-Fleming, and Lan-DeMets boundaries. We run extensive simulations to comparatively evaluate the operating characteristics (type I error and power) of our interim monitoring procedure based on the proposed statistic and another existing statistic. The methods are demonstrated via an analysis of a neuroblastoma dataset.
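
The alpha-spending ingredient is easy to illustrate: the Lan-DeMets O'Brien-Fleming-type spending function evaluated at assumed information fractions gives the cumulative type I error available at each look. Turning spent alpha into actual boundary values requires the joint distribution of the correlated interim statistics, which is where a correlation result such as the one in the paper enters; the sketch below shows only the spending step.

```python
# Sketch of the Lan-DeMets O'Brien-Fleming-type alpha-spending function at
# assumed information fractions (boundary computation itself is omitted).
import numpy as np
from scipy.stats import norm

alpha = 0.05
t = np.array([0.25, 0.50, 0.75, 1.00])            # assumed information fractions
z = norm.ppf(1 - alpha / 2)

spent = 2 * (1 - norm.cdf(z / np.sqrt(t)))        # cumulative alpha spent by time t
increments = np.diff(np.concatenate([[0.0], spent]))
for ti, s, inc in zip(t, spent, increments):
    print(f"t={ti:.2f}  cumulative alpha={s:.4f}  increment={inc:.4f}")
```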

A multi-locus predictiveness curve and its summary assessment for genetic risk prediction arxiv.org/abs/2504.00024 .ME .AI .LG
