
A generalized Bayesian approach for high-dimensional robust regression with serially correlated errors and predictors arxiv.org/abs/2412.05673 .ME .ST .CO .TH

This paper presents a loss-based generalized Bayesian methodology for high-dimensional robust regression with serially correlated errors and predictors. The proposed framework employs a novel scaled pseudo-Huber (SPH) loss function, which smooths the well-known Huber loss, achieving a balance between quadratic and absolute linear loss behaviors. This flexibility enables the framework to accommodate both thin-tailed and heavy-tailed data effectively. The generalized Bayesian approach constructs a working likelihood utilizing the SPH loss that facilitates efficient and stable estimation while providing rigorous estimation uncertainty quantification for all model parameters. Notably, this allows formal statistical inference without requiring ad hoc tuning parameter selection while adaptively addressing a wide range of tail behavior in the errors. By specifying appropriate prior distributions for the regression coefficients -- e.g., ridge priors for small or moderate-dimensional settings and spike-and-slab priors for high-dimensional settings -- the framework ensures principled inference. We establish rigorous theoretical guarantees for the accurate estimation of underlying model parameters and the correct selection of predictor variables under sparsity assumptions for a wide range of data generating setups. Extensive simulation studies demonstrate the superiority of our approach compared to traditional quadratic and absolute linear loss-based Bayesian regression methods, highlighting its flexibility and robustness in high-dimensional and challenging data contexts.
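
As a rough illustration of the loss-based working-likelihood idea, here is a minimal numpy sketch using the standard pseudo-Huber loss. The paper's scaled variant (SPH) and its tuning differ, so the exact form below, the learning rate eta, and the scale delta are assumptions for illustration only, not the authors' estimator.

```python
# Minimal sketch: pseudo-Huber loss and the generalized-Bayesian
# working likelihood it induces. The paper's scaled pseudo-Huber (SPH)
# loss differs; this is the standard form, used here as an assumption.
import numpy as np

def pseudo_huber(r, delta=1.0):
    """Smooth interpolation between quadratic (|r| << delta)
    and linear (|r| >> delta) loss behavior."""
    return delta**2 * (np.sqrt(1.0 + (r / delta)**2) - 1.0)

def log_working_likelihood(y, X, beta, delta=1.0, eta=1.0):
    """Loss-based (Gibbs) working log-likelihood:
    -eta * sum_i L_delta(y_i - x_i' beta), with learning rate eta."""
    resid = y - X @ beta
    return -eta * pseudo_huber(resid, delta).sum()

# Example: evaluate the working log-likelihood under heavy-tailed errors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_t(df=2, size=50)  # heavy-tailed noise
print(log_working_likelihood(y, X, np.zeros(3)))
```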

Sequential anomaly identification with observation control under generalized error metrics arxiv.org/abs/2412.04693 .ST .TH

The problem of sequential anomaly detection and identification is considered, where multiple data sources are monitored simultaneously and the goal is to identify in real time those, if any, that exhibit "anomalous" statistical behavior. An upper bound is postulated on the number of data sources that can be sampled at each sampling instant, but the decision maker selects which ones to sample based on the already collected data. Thus, in this context, a policy consists not only of a stopping rule and a decision rule that determine when sampling should be terminated and which sources to identify as anomalous upon stopping, but also of a sampling rule that determines which sources to sample at each time instant subject to the sampling constraint. Two distinct formulations are considered, which require control of different "generalized" error metrics: the first tolerates a certain user-specified number of errors of any kind, whereas the second tolerates distinct, user-specified numbers of false positives and false negatives. For each formulation, a universal asymptotic lower bound on the expected stopping time is established as the error probabilities go to 0, and it is shown to be attained by a policy that combines the stopping and decision rules proposed in the full-sampling case with a probabilistic sampling rule that achieves a specific long-run sampling frequency for each source. Moreover, the first-order asymptotic approximation to the optimal expected stopping time is compared in simulation studies with the corresponding quantity in a finite regime, and the impact of the sampling constraint and the tolerance to errors is assessed.
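
A toy version of the policy structure described above, assuming Gaussian mean-shift anomalies: an SPRT-style statistic per source, a probabilistic sampling rule targeting uniform long-run frequencies under a budget of K sources per instant, and a threshold-crossing stopping rule. The thresholds, target frequencies, and decision rule below are placeholders, not the paper's optimal choices.

```python
# Toy illustration (not the paper's exact policy): Gaussian mean-shift
# anomalies, a probabilistic sampling rule under a per-instant budget K,
# and an SPRT-style stopping/decision rule with placeholder thresholds.
import numpy as np

rng = np.random.default_rng(1)
M, K = 5, 2                      # number of sources, sampling budget
mu1 = 1.0                        # anomalous mean (null mean is 0)
anomalous = np.array([0, 3])     # ground truth, used only to simulate data
target_freq = np.full(M, K / M)  # desired long-run sampling frequencies

llr = np.zeros(M)                # cumulative log-likelihood ratios
threshold = 4.0                  # implicitly controls error probabilities
t = 0
while np.any(np.abs(llr) < threshold):   # stop once every source is decided
    t += 1
    # probabilistic sampling rule: draw K distinct sources with
    # probabilities proportional to the target frequencies
    sampled = rng.choice(M, size=K, replace=False,
                         p=target_freq / target_freq.sum())
    for i in sampled:
        x = rng.normal(mu1 if i in anomalous else 0.0)
        llr[i] += mu1 * x - 0.5 * mu1**2  # LLR step for N(mu1,1) vs N(0,1)

print(f"stopped at t={t}; declared anomalous: {np.flatnonzero(llr > 0)}")
```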

Modeling High-Dimensional Dependent Data in the Presence of Many Explanatory Variables and Weak Signals arxiv.org/abs/2412.04736 .ME .ML

This article considers a novel and widely applicable approach to modeling high-dimensional dependent data when a large number of explanatory variables are available and the signal-to-noise ratio is low. We postulate that a $p$-dimensional response series is the sum of a linear regression on many observable explanatory variables and an error term driven by latent common factors plus idiosyncratic noise. The common factors have dynamic dependence, whereas the covariance matrix of the idiosyncratic noise can have diverging eigenvalues to handle the low signal-to-noise ratios commonly encountered in applications. The regression coefficient matrix is estimated by penalized methods when the dimensions involved are high. We apply factor modeling to the regression residuals, employ a high-dimensional white-noise testing procedure to determine the number of common factors, and adopt projected Principal Component Analysis when the signal-to-noise ratio is low. We establish asymptotic properties of the proposed method for both fixed and diverging numbers of regressors, as $p$ and the sample size $T$ approach infinity. Finally, we use simulations and empirical applications to demonstrate the efficacy of the proposed approach in finite samples.
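
The two-stage pipeline can be sketched roughly as follows: penalized (lasso) regression of each response series on the regressors, then a factor analysis of the residuals. The paper determines the number of factors with a high-dimensional white-noise test and uses projected PCA in the low signal-to-noise regime; the eigenvalue-ratio rule below is a simpler stand-in and an assumption for illustration.

```python
# Rough sketch of the two-stage pipeline: penalized regression of each
# response on the explanatory variables, then PCA-based factor analysis
# of the residuals. The eigenvalue-ratio rule stands in for the paper's
# white-noise testing procedure and is an assumption for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
T, p, q, r = 200, 10, 15, 2          # sample size, responses, regressors, factors
X = rng.normal(size=(T, q))
B = np.where(rng.random((q, p)) < 0.2, rng.normal(size=(q, p)), 0.0)
F = rng.normal(size=(T, r))          # latent common factors
L = rng.normal(size=(r, p))          # factor loadings
Y = X @ B + F @ L + rng.normal(scale=2.0, size=(T, p))  # low-SNR noise

# Stage 1: penalized regression, one response series at a time
B_hat = np.column_stack(
    [Lasso(alpha=0.1).fit(X, Y[:, j]).coef_ for j in range(p)]
)
E = Y - X @ B_hat                    # regression residuals

# Stage 2: eigen-analysis of the residual covariance; pick the factor
# number by the largest ratio of consecutive eigenvalues
eigvals, _ = np.linalg.eigh(E.T @ E / T)
eigvals = eigvals[::-1]              # descending order
kmax = p // 2
r_hat = int(np.argmax(eigvals[:kmax] / eigvals[1:kmax + 1])) + 1
print("estimated number of common factors:", r_hat)
```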

Dynamic Prediction of High-density Generalized Functional Data with Fast Generalized Functional Principal Component Analysis arxiv.org/abs/2412.02014 .ME .AP

Dynamic prediction, which typically refers to the prediction of future outcomes from historical records, is often of interest in biomedical research. For datasets with large sample sizes, high measurement density, and complex correlation structures, traditional methods are often infeasible because of the computational burden associated with both data scale and model complexity. Moreover, many models do not directly facilitate out-of-sample predictions for generalized outcomes. To address these issues, we develop a novel approach for dynamic prediction based on a recently developed method for estimating complex patterns of variation in exponential-family data: fast Generalized Functional Principal Components Analysis (fGFPCA). Our method handles large-scale, high-density repeated measures far more efficiently, with an implementation that is feasible even on personal computing resources (e.g., a standard desktop or laptop computer). The proposed method makes highly flexible and accurate predictions of future trajectories for data that exhibit high degrees of nonlinearity, and allows out-of-sample predictions to be obtained without reestimating any parameters. A simulation study is designed and implemented to illustrate the advantages of this method. To demonstrate its practical utility, we also conduct a case study predicting diurnal active/inactive patterns using accelerometry data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014. Both the simulation study and the data application demonstrate the better predictive performance and higher computational efficiency of the proposed method compared to existing methods. The proposed method also produces more personalized predictions that improve as more information becomes available, an essential goal of dynamic prediction that other methods fail to achieve.
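
A Gaussian toy version of the dynamic-prediction step: estimate eigenfunctions from fully observed training curves, score a new partially observed curve against the truncated basis by least squares, and reconstruct the unseen part without refitting anything. fGFPCA itself targets exponential-family outcomes via fast binned estimation, so everything below is a simplified sketch, not the paper's estimator.

```python
# Gaussian toy version of FPCA-based dynamic prediction; fGFPCA handles
# exponential-family outcomes, which this simplified sketch does not.
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 300, 100, 2                       # subjects, grid points, components
t = np.linspace(0, 1, m)
phi = np.stack([np.sqrt(2) * np.sin(np.pi * t),
                np.sqrt(2) * np.sin(2 * np.pi * t)])   # true eigenfunctions
scores = rng.normal(size=(n, k)) * np.array([2.0, 1.0])
Y = scores @ phi + rng.normal(scale=0.3, size=(n, m))

# "Training": estimate the mean and eigenfunctions from complete curves
mu_hat = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - mu_hat, full_matrices=False)
phi_hat = Vt[:k]                            # estimated eigenfunctions

# Dynamic prediction for a new subject observed only on [0, 0.4]:
# score the partial curve against the truncated basis (least squares),
# then reconstruct the full trajectory -- no parameter refitting.
obs = t <= 0.4
xi_new = rng.normal(size=k) * np.array([2.0, 1.0])
y_new = xi_new @ phi + rng.normal(scale=0.3, size=m)
A = phi_hat[:, obs].T                       # truncated basis, observed part
xi_hat = np.linalg.lstsq(A, y_new[obs] - mu_hat[obs], rcond=None)[0]
y_pred = mu_hat + xi_hat @ phi_hat          # predicted full trajectory
print("prediction RMSE on the unseen part:",
      np.sqrt(np.mean((y_pred[~obs] - y_new[~obs])**2)))
```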

Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance arxiv.org/abs/2412.02026 .AP

The widespread adoption of smart meters for monitoring energy consumption has generated vast quantities of high-resolution time series data, which remain underutilised. While clustering has emerged as a fundamental tool for mining smart meter time series (SMTS) data, selecting appropriate clustering methods remains challenging despite numerous comparative studies. These studies often rely on problematic methodologies and consider a limited scope of methods, frequently overlooking compelling methods from the broader time series clustering literature. Consequently, they struggle to provide dependable guidance for practitioners designing their own clustering approaches. This paper presents a comprehensive comparative framework for SMTS clustering methods using expert-informed synthetic datasets that emphasise peak consumption behaviours as fundamental cluster concepts. Using a phased methodology, we first evaluated 31 distance measures and 8 representation methods using leave-one-out classification, then examined the better-suited methods in combination with 11 clustering algorithms. We further assessed the robustness of these combinations to systematic changes in key dataset properties that affect clustering performance on real-world datasets, including cluster balance, noise, and the presence of outliers. Our results revealed that methods accommodating local temporal shifts while maintaining amplitude sensitivity, particularly Dynamic Time Warping and $k$-sliding distance, consistently outperformed traditional approaches. Among other key findings, we identified that these methods, when combined with hierarchical clustering using Ward's linkage, demonstrated consistent robustness across varying dataset characteristics without careful parameter tuning. These and other findings inform actionable recommendations for practitioners.
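
A minimal sketch of the best-performing combination reported (DTW distances fed into hierarchical clustering with Ward's linkage), run on toy morning-peak versus evening-peak load profiles. The unconstrained dynamic-programming DTW and the synthetic profiles below are assumptions for illustration; the paper's exact DTW variant and the $k$-sliding distance are not reproduced here.

```python
# Minimal sketch: pairwise DTW distances + Ward-linkage hierarchical
# clustering on synthetic morning-peak vs evening-peak load profiles.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic dynamic-programming DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy daily load profiles: two clusters defined by peak timing
rng = np.random.default_rng(4)
hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 8) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
series = np.array([(morning if i < 10 else evening)
                   + 0.1 * rng.normal(size=24) for i in range(20)])

# Pairwise DTW distances -> condensed form -> Ward linkage
# (Ward formally assumes Euclidean distances; the paper nevertheless
# reports this combination as consistently robust.)
n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(series[i], series[j])
labels = fcluster(linkage(squareform(D), method="ward"),
                  t=2, criterion="maxclust")
print(labels)
```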
