arXiv Statistics @arxiv_stats@qoto.org

Bot

I post the feed of the arXiv Statistics.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019

2 Following 603 Followers

Posts Posts and replies Media

arXiv Statistics @arxiv_stats@qoto.org

AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse https://arxiv.org/abs/2504.10540 #stat.ML #cs.AI #cs.LG

AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse

Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack theoretical foundation and rely on simplistic computation reuse, often leading to performance degradation. In this work, we provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method, revealing a linear relationship between the outputs of consecutive steps. This analysis explains why the outputs of adjacent steps exhibit a U-shaped pattern. Furthermore, extending Adams-Bashforth method to higher order, we propose a novel caching-based acceleration approach for diffusion models, instead of directly reusing cached results, with a truncation error bound of only \(O(h^k)\) where $h$ is the step size. Extensive validation across diverse image and video diffusion models (including HunyuanVideo and FLUX.1-dev) with various schedulers demonstrates our method's effectiveness in achieving nearly $3\times$ speedup while maintaining original performance levels, offering a practical real-time solution without compromising generation quality.

arXiv Statistics @arxiv_stats@qoto.org

Mitigating Eddington and Malmquist Biases in Latent-Inclination Regression of the Tully-Fisher Relation https://arxiv.org/abs/2504.10589 #astro-ph.GA #astro-ph.IM #stat.ME

Mitigating Eddington and Malmquist Biases in Latent-Inclination Regression of the Tully-Fisher Relation

Precise estimation of the Tully-Fisher relation is compromised by statistical biases and uncertain inclination corrections. To account for selection effects (Malmquist bias) while avoiding individual inclination corrections, I introduce a Bayesian method based on likelihood functions that incorporate Sine-distributed scatter of rotation velocities, Gaussian scatter from intrinsic dispersion and measurement error, and the observational selection function. However, tests of unidirectional models on simulated datasets reveal an additional bias arising from neglect of the Gaussian scatter in the independent variable. This additional bias is identified as a generalized Eddington bias, which distorts the data distribution independently of Malmuqist bias. I introduce two extensions to the Bayesian method that successfully mitigate the Eddington bias: (1) analytical bias corrections of the dependent variable prior to likelihood computation, and (2) a bidirectional dual-scatter model that includes the Gaussian scatter of the independent variable in the likelihood function. By rigorously accounting for Malmquist and Eddington biases in a latent-inclination regression analysis, this work establishes a framework for unbiased distance estimates from standardizable candles, critical for improving determinations of the Hubble constant.

arXiv Statistics @arxiv_stats@qoto.org

Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks https://arxiv.org/abs/2504.10598 #stat.ML #cs.LG

Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks

We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error -- a standard approach that leads to worst-case bounds tied to the Littlestone dimension -- we consider comparing with predictors that are robust to small input perturbations, perform well under Gaussian smoothing, or maintain a prescribed output margin. Previous examples of this were primarily limited to the hinge loss. Our algorithms achieve regret guarantees that depend only on the VC dimension and the complexity of the instance space (e.g., metric entropy), and notably, they incur only an $O(\log(1/γ))$ dependence on the generalized margin $γ$. This stands in contrast to most existing regret bounds, which typically exhibit a polynomial dependence on $1/γ$. We complement this with matching lower bounds. Our analysis connects recent ideas from adversarial robustness and smoothed online learning.

arXiv Statistics @arxiv_stats@qoto.org

Enhancing the Tensor Normal via Geometrically Parameterized Cholesky Factors https://arxiv.org/abs/2504.10645 #stat.ME #stat.CO

Enhancing the Tensor Normal via Geometrically Parameterized Cholesky Factors

In this article, we explore Bayesian extensions of the tensor normal model through a geometric expansion of the multi-way covariance's Cholesky factor inspired by the Fréchet mean under the log-Cholesky metric. Specifically, within a tensor normal framework, we identify three structural components in the covariance of the vectorized data. By parameterizing vector normal covariances through such a Cholesky factor representation, analogous to a finite average of multiway Cholesky factors, we eliminate one of these structural components without compromising the analytical tractability of the likelihood, in which the multiway covariance is a special case. Furthermore, we demonstrate that a specific class of structured Cholesky factors can be precisely represented under this parameterization, serving as an analogue to the Pitsianis-Van Loan decomposition. We apply this model using Hamiltonian Monte Carlo in a fixed-mean setting for two-way covariance relevancy detection of components, where efficient analytical gradient updates are available, as well as in a seasonally-varying covariance process regime.

arXiv Statistics @arxiv_stats@qoto.org

Bayesian analysis of regression discontinuity designs with heterogeneous treatment effects https://arxiv.org/abs/2504.10652 #math.ST #stat.ME #stat.TH

Bayesian analysis of regression discontinuity designs with heterogeneous treatment effects

Regression Discontinuity Design (RDD) is a popular framework for estimating a causal effect in settings where treatment is assigned if an observed covariate exceeds a fixed threshold. We consider estimation and inference in the common setting where the sample consists of multiple known sub-populations with potentially heterogeneous treatment effects. In the applied literature, it is common to account for heterogeneity by either fitting a parametric model or considering each sub-population separately. In contrast, we develop a Bayesian hierarchical model using Gaussian process regression which allows for non-parametric regression while borrowing information across sub-populations. We derive the posterior distribution, prove posterior consistency, and develop a Metropolis-Hastings within Gibbs sampling algorithm. In extensive simulations, we show that the proposed procedure outperforms existing methods in both estimation and inferential tasks. Finally, we apply our procedure to U.S. Senate election data and discover an incumbent party advantage which is heterogeneous over different time periods.

arXiv Statistics @arxiv_stats@qoto.org

On the Contractivity of Stochastic Interpolation Flow https://arxiv.org/abs/2504.10653 #math.ST #stat.ML #stat.TH #cs.LG

On the Contractivity of Stochastic Interpolation Flow

We investigate stochastic interpolation, a recently introduced framework for high dimensional sampling which bears many similarities to diffusion modeling. Stochastic interpolation generates a data sample by first randomly initializing a particle drawn from a simple base distribution, then simulating deterministic or stochastic dynamics such that in finite time the particle's distribution converges to the target. We show that for a Gaussian base distribution and a strongly log-concave target distribution, the stochastic interpolation flow map is Lipschitz with a sharp constant which matches that of Caffarelli's theorem for optimal transport maps. We are further able to construct Lipschitz transport maps between non-Gaussian distributions, generalizing some recent constructions in the literature on transport methods for establishing functional inequalities. We discuss the practical implications of our theorem for the sampling and estimation problems required by stochastic interpolation.

arXiv Statistics @arxiv_stats@qoto.org

Beyond Coordinates: Meta-Equivariance in Statistical Inference https://arxiv.org/abs/2504.10667 #math.ST #math.OC #stat.ML #stat.TH

Beyond Coordinates: Meta-Equivariance in Statistical Inference

Optimal statistical decisions should transcend the language used to describe them. Yet, how do we guarantee that the choice of coordinates - the parameterisation of an optimisation problem - does not subtly dictate the solution? This paper reveals a fundamental geometric invariance principle. We first analyse the optimal combination of two asymptotically normal estimators under a strictly convex trace-AMSE risk. While methods for finding optimal weights are known, we prove that the resulting optimal estimator is invariant under direct affine reparameterisations of the weighting scheme. This exemplifies a broader principle we term meta-equivariance: the unique minimiser of any strictly convex, differentiable scalar objective over a matrix space transforms covariantly under any invertible affine reparameterisation of that space. Distinct from classical statistical equivariance tied to data symmetries, meta-equivariance arises from the immutable geometry of convex optimisation itself. It guarantees that optimality, in these settings, is not an artefact of representation but an intrinsic, coordinate-free truth.

arXiv Statistics @arxiv_stats@qoto.org

Predicting Power Grid Failures Using Self-Organized Criticality: A Case Study of the Texas Grid 2014-2022 https://arxiv.org/abs/2504.10675 #stat.AP

Predicting Power Grid Failures Using Self-Organized Criticality: A Case Study of the Texas Grid 2014-2022

This study develops a novel predictive framework for power grid vulnerability based on the statistical signatures of Self-Organized Criticality (SOC). By analyzing the evolution of the power law critical exponents in outage size distributions from the Texas grid during 2014-2022, we demonstrate the method's ability for forecasting system-wide vulnerability to catastrophic failures. Our results reveal a systematic decline in the critical exponent from 1.45 in 2018 to 0.95 in 2020, followed by a drop below the theoretical critical threshold ($α$ = 1) to 0.62 in 2021, coinciding precisely with the catastrophic February 2021 power crisis. Such predictive signal emerged 6-12 months before the crisis. By monitoring critical exponent transitions through subcritical and supercritical regimes, we provide quantitative early warning capabilities for catastrophic infrastructure failures, with significant implications for grid resilience planning, risk assessment, and emergency preparedness in increasingly stressed power systems.

arXiv Statistics @arxiv_stats@qoto.org

Power properties of the two sample test based on nearest neighbors graphs https://arxiv.org/abs/2504.10719 #math.ST #stat.TH

Power properties of the two sample test based on nearest neighbors graphs

In this paper, we study the problem of testing the equality of two multivariate distributions. One class of tests used for this purpose utilizes geometric graphs constructed using inter-point distances. The asymptotic theory of these tests so far applies only to graphs which fall under the stabilizing graphs framework of Penrose and Yukich. We study the case of the $K$-nearest neighbors graph where $K=k_N$ increases with the sample size, which does not fall under the stabilizing graphs framework. Our main result gives detection thresholds for this test in parametrized families when $k_N = o(N^{1/4})$, thus extending the family of graphs where the theoretical behavior is known. We propose a 2-sided version of the test which removes an exponent gap that plagues the 1-sided test. Our result also shows that using a greater number of nearest neighbors boosts the power of the test. This provides theoretical justification for using denser graphs in testing equality of two distributions.

arXiv Statistics @arxiv_stats@qoto.org

Gaussian Approximation for High-Dimensional $U$-statistics with Size-Dependent Kernels https://arxiv.org/abs/2504.10866 #math.ST #stat.TH

Gaussian Approximation for High-Dimensional $U$-statistics with Size-Dependent Kernels

Motivated by small bandwidth asymptotics for kernel-based semiparametric estimators in econometrics, this paper establishes Gaussian approximation results for high-dimensional fixed-order $U$-statistics whose kernels depend on the sample size. Our results allow for a situation where the dominant component of the Hoeffding decomposition is absent or unknown, including cases with known degrees of degeneracy as special forms. The obtained error bounds for Gaussian approximations are sharp enough to almost recover the weakest bandwidth condition of small bandwidth asymptotics in the fixed-dimensional setting when applied to a canonical semiparametric estimation problem. We also present an application to an adaptive goodness-of-fit testing, along with discussions about several potential applications.

arXiv Statistics @arxiv_stats@qoto.org

Can SGD Select Good Fishermen? Local Convergence under Self-Selection Biases and Beyond https://arxiv.org/abs/2504.07133 #stat.ML #math.ST #stat.TH #cs.DS #cs.LG

Can SGD Select Good Fishermen? Local Convergence under Self-Selection Biases and Beyond

We revisit the problem of estimating $k$ linear regressors with self-selection bias in $d$ dimensions with the maximum selection criterion, as introduced by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [CDIZ23, STOC'23]. Our main result is a $\operatorname{poly}(d,k,1/\varepsilon) + {k}^{O(k)}$ time algorithm for this problem, which yields an improvement in the running time of the algorithms of [CDIZ23] and [GM24, arXiv]. We achieve this by providing the first local convergence algorithm for self-selection, thus resolving the main open question of [CDIZ23]. To obtain this algorithm, we reduce self-selection to a seemingly unrelated statistical problem called coarsening. Coarsening occurs when one does not observe the exact value of the sample but only some set (a subset of the sample space) that contains the exact value. Inference from coarse samples arises in various real-world applications due to rounding by humans and algorithms, limited precision of instruments, and lag in multi-agent systems. Our reduction to coarsening is intuitive and relies on the geometry of the self-selection problem, which enables us to bypass the limitations of previous analytic approaches. To demonstrate its applicability, we provide a local convergence algorithm for linear regression under another self-selection criterion, which is related to second-price auction data. Further, we give the first polynomial time local convergence algorithm for coarse Gaussian mean estimation given samples generated from a convex partition. Previously, only a sample-efficient algorithm was known due to Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21, COLT'21].

arXiv Statistics @arxiv_stats@qoto.org

NFL Draft Modelling: Loss Functional Analysis https://arxiv.org/abs/2504.07291 #stat.AP #stat.ME

NFL Draft Modelling: Loss Functional Analysis

In the NFL draft, teams must strategically balance immediate player impact against long-term value, presenting a complex optimization challenge for draft capital management. This paper introduces a framework for evaluating the fairness and efficiency of draft pick trades using norm-based loss functions. Draft pick valuations are modelled by the Weibull distribution. Utilizing these valuation techniques, the research identifies key trade-offs between aggressive, immediate-impact strategies and conservative, risk-averse approaches. Ultimately, this framework serves as a valuable analytical tool for assessing NFL draft trade fairness and value distribution, aiding team decision-makers and enriching insights within the sports analytics community.

arXiv Statistics @arxiv_stats@qoto.org

Effective treatment allocation strategies under partial interference https://arxiv.org/abs/2504.07305 #stat.ME #stat.AP

Effective treatment allocation strategies under partial interference

Interference occurs when the potential outcomes of a unit depend on the treatment of others. Interference can be highly heterogeneous, where treating certain individuals might have a larger effect on the population's overall outcome. A better understanding of how covariates explain this heterogeneity may lead to more effective interventions. In the presence of clusters of units, we assume that interference occurs within clusters but not across them. We define novel causal estimands under hypothetical, stochastic treatment allocation strategies that fix the marginal treatment probability in a cluster and vary how the treatment probability depends on covariates, such as a unit's network position and characteristics. We illustrate how these causal estimands can shed light on the heterogeneity of interference and on the network and covariate profile of influential individuals. For experimental settings, we develop standardized weighting estimators for our novel estimands and derive their asymptotic distribution. We design an inferential procedure for testing the null hypothesis of interference homogeneity with respect to covariates. We validate the performance of the estimator and inferential procedure through simulations.We then apply the novel estimators to a clustered experiment in China to identify the important characteristics that drive heterogeneity in the effect of providing information sessions on insurance uptake.

arXiv Statistics @arxiv_stats@qoto.org

A Unified Framework for Large-Scale Classification: Error Rate Control and Optimality https://arxiv.org/abs/2504.07321 #stat.ME

A Unified Framework for Large-Scale Classification: Error Rate Control and Optimality

Classification is a fundamental task in supervised learning, while achieving valid misclassification rate control remains challenging due to possibly the limited predictive capability of the classifiers or the intrinsic complexity of the classification task. In this article, we address large-scale multi-class classification problems with general error rate guarantees to enhance algorithmic trustworthiness. To this end, we first introduce a notion of group-wise classification, which unifies the common class-wise and overall classifications as special cases. We then develop a unified algorithmic framework for the general group-wise classification that consists of three steps: Pre-classification, Selective $p$-value construction, and large-scale Post-classification decisions (PSP). Theoretically, PSP is distribution-free and provides valid finite-sample guarantees for controlling general group-wise false decision rates at target levels. To show the power of PSP, we demonstrate that the step of post-classification decisions never degrades the power of pre-classification, provided that pre-classification has been sufficiently powerful to meet the target error levels. Additionally, we further establish general power optimality theories for PSP from both non-asymptotic and asymptotic perspectives. Numerical results in both simulations and real data analysis validate the performance of the proposed PSP approach.

arXiv Statistics @arxiv_stats@qoto.org

What is the price of approximation? The saddlepoint approximation to a likelihood function https://arxiv.org/abs/2504.07324 #stat.ME

What is the price of approximation? The saddlepoint approximation to a likelihood function

The saddlepoint approximation to the likelihood, and its corresponding maximum likelihood estimate (MLE), offer an alternative estimation method when the true likelihood is intractable or computationally expensive. However, maximizing this approximated likelihood instead of the true likelihood inevitably comes at a price: a discrepancy between the MLE derived from the saddlepoint approximation and the true MLE. In previous studies, the size of this discrepancy has been investigated via simulation, or by engaging with the true likelihood despite its computational difficulties. Here, we introduce an explicit and computable approximation formula for the discrepancy, through which the adequacy of the saddlepoint-based MLE can be directly assessed. We present examples demonstrating the accuracy of this formula in specific cases where the true likelihood can be calculated. Additionally, we present asymptotic results that capture the behaviour of the discrepancy in a suitable limiting framework.

arXiv Statistics @arxiv_stats@qoto.org

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents https://arxiv.org/abs/2504.07347 #stat.ML #math.PR #cs.LG

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant efforts have targeted system-level engineering, little is explored through a mathematical modeling and queuing perspective. In this paper, we aim to develop the queuing fundamentals for LLM inference, bridging the gap between queuing and LLM system communities. In particular, we study the throughput aspect in LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for both individual requests and AI agent workloads, highlighting 'work-conserving' as a key design principle in practice. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FastTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits queuing community can offer in improving LLM inference systems and call for more interdisciplinary developments.

arXiv Statistics @arxiv_stats@qoto.org

A GARMA Framework for Unit-Bounded Time Series Based on the Unit-Lindley Distribution with Application to Renewable Energy Data https://arxiv.org/abs/2504.07351 #math.ST #stat.AP #stat.TH

A GARMA Framework for Unit-Bounded Time Series Based on the Unit-Lindley Distribution with Application to Renewable Energy Data

The Unit-Lindley is a one-parameter family of distributions in $(0,1)$ obtained from an appropriate transformation of the Lindley distribution. In this work, we introduce a class of dynamical time series models for continuous random variables taking values in $(0,1)$ based on the Unit-Lindley distribution. The models pertaining to the proposed class are observation-driven ones for which, conditionally on a set of covariates, the random component is modeled by a Unit-Lindley distribution. The systematic component aims at modeling the conditional mean through a dynamical structure resembling the classical ARMA models. Parameter estimation in conducted using partial maximum likelihood, for which an asymptotic theory is available. Based on asymptotic results, the construction of confidence intervals, hypotheses testing, model selection, and forecasting can be carried on. A Monte Carlo simulation study is conducted to assess the finite sample performance of the proposed partial maximum likelihood approach. Finally, an application considering forecasting of the proportion of net electricity generated by conventional hydroelectric power in the United States is presented. The application show the versatility of the proposed method compared to other benchmarks models in the literature.

arXiv Statistics @arxiv_stats@qoto.org

Estimand framework development for eGFR slope estimation and comparative analyses across various estimation methods https://arxiv.org/abs/2504.07411 #stat.ME

Estimand framework development for eGFR slope estimation and comparative analyses across various estimation methods

Chronic kidney disease (CKD) is a global health challenge characterized by progressive kidney function decline, often culminating in end-stage kidney disease (ESKD) and increased mortality. To address the limitations such as the extended trial follow-up necessitated by the low incidence of kidney composite endpoint, the eGFR slope -- a surrogate endpoint reflecting the trajectory of kidney function decline -- has gained prominence for its predictive power and regulatory support. Despite its advantages, the lack of a standardized framework for eGFR slope estimand and estimation complicates consistent interpretation and cross-trial comparisons. Existing methods, including simple linear regression and mixed-effects models, vary in their underlying assumptions, creating a need for a formalized approach to align estimation methods with trial objectives. This manuscript proposes an estimand framework tailored to eGFR slope-based analyses in CKD RCTs, ensuring clarity in defining "what to estimate" and enhancing the comparability of results. Through simulation studies and real-world data applications, we evaluate the performance of various commonly applied estimation techniques under distinct scenarios. By recommending a clear characterization for eGFR slope estimand and providing considerations for estimation approaches, this work aims to improve the reliability and interpretability of CKD trial results, advancing therapeutic development and clinical decision-making.

arXiv Statistics @arxiv_stats@qoto.org

Regression for Left-Truncated and Right-Censored Data: A Semiparametric Sieve Likelihood Approach https://arxiv.org/abs/2504.07413 #stat.ME

Regression for Left-Truncated and Right-Censored Data: A Semiparametric Sieve Likelihood Approach

Cohort studies of the onset of a disease often encounter left-truncation on the event time of interest in addition to right-censoring due to variable enrollment times of study participants. Analysis of such event time data can be biased if left-truncation is not handled properly. We propose a semiparametric sieve likelihood approach for fitting a linear regression model to data where the response variable is subject to both left-truncation and right-censoring. We show that the estimators of regression coefficients are consistent, asymptotically normal and semiparametrically efficient. Extensive simulation studies show the effectiveness of the method across a wide variety of error distributions. We further illustrate the method by analyzing a dataset from The 90+ Study for aging and dementia.

arXiv Statistics @arxiv_stats@qoto.org

Conditional Data Synthesis Augmentation https://arxiv.org/abs/2504.07426 #stat.ME #cs.LG

Conditional Data Synthesis Augmentation

Reliable machine learning and statistical analysis rely on diverse, well-distributed training data. However, real-world datasets are often limited in size and exhibit underrepresentation across key subpopulations, leading to biased predictions and reduced performance, particularly in supervised tasks such as classification. To address these challenges, we propose Conditional Data Synthesis Augmentation (CoDSA), a novel framework that leverages generative models, such as diffusion models, to synthesize high-fidelity data for improving model performance across multimodal domains including tabular, textual, and image data. CoDSA generates synthetic samples that faithfully capture the conditional distributions of the original data, with a focus on under-sampled or high-interest regions. Through transfer learning, CoDSA fine-tunes pre-trained generative models to enhance the realism of synthetic data and increase sample density in sparse areas. This process preserves inter-modal relationships, mitigates data imbalance, improves domain adaptation, and boosts generalization. We also introduce a theoretical framework that quantifies the statistical accuracy improvements enabled by CoDSA as a function of synthetic sample volume and targeted region allocation, providing formal guarantees of its effectiveness. Extensive experiments demonstrate that CoDSA consistently outperforms non-adaptive augmentation strategies and state-of-the-art baselines in both supervised and unsupervised settings.

Bot

I post the feed of the arXiv Statistics.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019