A Stochastic Weather Model: A Case of Bono Region of Ghana arxiv.org/abs/2409.06731 .AP .PR

The paper sought to fit an Ornstein-Uhlenbeck model with seasonal mean and volatility, where the residuals are generated by a Brownian motion, to Ghanaian daily average temperature. It employed the modified Ornstein-Uhlenbeck model proposed by Bhowan, which has a seasonal mean and a stochastic volatility process. The findings revealed that the Bono region experiences warm temperatures and heavy precipitation, with maxima of 32.67 degrees Celsius and 126.51 mm respectively. The Daily Average Temperature (DAT) of the region was observed to revert to a mean level of approximately 26 degrees Celsius at a rate of 18.72%, with maximum and minimum temperatures of 32.67 and 19.75 degrees Celsius respectively. Although the region lies in the middle belt of Ghana, it still experiences warm (hot) daily temperatures, and over the years considered in our analysis it experienced relatively more dry seasons than wet seasons. Our model explained approximately 50% of the variation in the daily average temperature of the region, which can be regarded as a relatively good fit. The findings of this paper are relevant to the pricing of weather derivatives with temperature as the underlying variable in the Ghanaian financial and agricultural sectors. Furthermore, they would assist in the development and design of tailored agriculture/crop insurance models that incorporate temperature dynamics rather than only extreme weather events such as floods, droughts and wildfires.
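
The model class described here, an Ornstein-Uhlenbeck process reverting to a seasonal mean, can be simulated with a few lines of Euler-Maruyama code. The sketch below is illustrative only: the seasonal amplitude, the volatility, and the use of a plain sinusoidal mean with constant volatility are assumptions, not the paper's fitted specification (which uses a stochastic volatility process).

```python
import numpy as np

# Minimal sketch of an OU temperature model with seasonal mean S_t,
#   dT_t = dS_t/dt + kappa * (S_t - T_t) dt + sigma dW_t,
# simulated with Euler-Maruyama. All parameter values are illustrative.
rng = np.random.default_rng(0)

kappa = 0.1872        # mean-reversion rate (assumed from the ~18.72% figure)
mean_level = 26.0     # long-run mean temperature, degrees Celsius
amplitude = 3.0       # seasonal amplitude (assumption)
sigma = 1.0           # constant volatility (simplifying assumption)
n_days = 3 * 365

def seasonal_mean(t):
    """Seasonal mean S_t: a sinusoid with a one-year period."""
    return mean_level + amplitude * np.sin(2 * np.pi * t / 365.25)

T = np.empty(n_days)
T[0] = seasonal_mean(0)
for t in range(1, n_days):
    drift = (seasonal_mean(t) - seasonal_mean(t - 1)) + kappa * (seasonal_mean(t) - T[t - 1])
    T[t] = T[t - 1] + drift + sigma * rng.standard_normal()

print(f"simulated DAT: mean {T.mean():.2f} C, min {T.min():.2f} C, max {T.max():.2f} C")
```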

Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions arxiv.org/abs/2409.06960 .data-an .ML .AP .LG

In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, such knowledge may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, namely that they are concentrated in a certain area of the feature space. Viewing the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter: we define the SR as the area that is most affected when the observed events are smeared with additive random noise. We overcome the challenges of density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to complementary observations that closely resemble the target events but are free of any signal. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that it can efficiently identify a data-driven SR in a high-dimensional feature space in which a high proportion of the signal events is concentrated.
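
A schematic of the core idea might look like the sketch below: estimate the density ratio of target to control events with a probabilistic classifier, smear the events with additive Gaussian noise, and flag the events whose estimated log-ratio drops the most. This is a hedged reconstruction of the general recipe, not the authors' exact procedure; the toy data, noise scale, and 95% threshold are all assumptions.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy data: control is background-only; target adds a localized "signal" bump.
control = rng.normal(0.0, 1.0, size=(20_000, 4))
target = np.vstack([
    rng.normal(0.0, 1.0, size=(19_000, 4)),
    rng.normal(2.0, 0.2, size=(1_000, 4)),   # localized signal component
])

def log_density_ratio(x_num, x_den):
    """Classifier-based estimate of log(p_num(x) / p_den(x))."""
    X = np.vstack([x_num, x_den])
    y = np.concatenate([np.ones(len(x_num)), np.zeros(len(x_den))])
    clf = HistGradientBoostingClassifier().fit(X, y)
    def log_ratio(x):
        p = clf.predict_proba(x)[:, 1].clip(1e-6, 1 - 1e-6)
        return np.log(p / (1 - p))
    return log_ratio

log_r = log_density_ratio(target, control)
smeared = target + rng.normal(0.0, 0.5, size=target.shape)  # additive-noise smearing
log_r_smeared = log_density_ratio(smeared, control)

# Events whose density ratio drops most under smearing are SR candidates.
drop = log_r(target) - log_r_smeared(target)
sr_mask = drop > np.quantile(drop, 0.95)
print(f"{sr_mask.sum()} events flagged as signal-region candidates")
```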

A Practical Theory of Generalization in Selectivity Learning arxiv.org/abs/2409.07014 .ML .DB .LG

Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
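
For readers unfamiliar with the setup, a query-driven selectivity model in its generic form regresses observed selectivities on featurized predicates. The sketch below illustrates that setup and an OOD test workload; the one-column table, range-predicate featurization, and choice of regressor are assumptions for illustration, not the paper's construction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy table: one numeric column; queries are range predicates [lo, hi].
data = rng.normal(0.0, 1.0, size=50_000)

def true_selectivity(lo, hi):
    """Fraction of rows satisfying the predicate lo <= x <= hi."""
    return np.mean((data >= lo) & (data <= hi))

# Training workload: random ranges with their observed selectivities.
lows = rng.uniform(-3, 2, size=1_000)
highs = lows + rng.uniform(0.1, 2.0, size=1_000)
X_train = np.column_stack([lows, highs])
y_train = np.array([true_selectivity(lo, hi) for lo, hi in X_train])

model = GradientBoostingRegressor().fit(X_train, y_train)

# OOD test workload: ranges drawn from a shifted distribution.
lows_ood = rng.uniform(0, 2.5, size=500)
highs_ood = lows_ood + rng.uniform(0.1, 2.0, size=500)
X_ood = np.column_stack([lows_ood, highs_ood])
y_ood = np.array([true_selectivity(lo, hi) for lo, hi in X_ood])

pred = np.clip(model.predict(X_ood), 0.0, 1.0)
print("OOD mean absolute error:", np.abs(pred - y_ood).mean())
```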

From optimal score matching to optimal sampling arxiv.org/abs/2409.07032 .ML .LG

The recent, impressive advances in the algorithmic generation of high-fidelity images, audio, and video are largely due to the great success of score-based diffusion models. A key implementation step is score matching, that is, the estimation of the score function of the forward diffusion process from training data. As shown in earlier literature, the total variation distance between the law of a sample generated from the trained diffusion model and the ground-truth distribution can be controlled by the score matching risk. Despite the widespread use of score-based diffusion models, basic theoretical questions concerning the exact optimal statistical rates for score estimation and its application to density estimation remain open. We establish the sharp minimax rate of score estimation for smooth, compactly supported densities. Formally, given \(n\) i.i.d. samples from an unknown \(\alpha\)-Hölder density \(f\) supported on \([-1, 1]\), we prove that the minimax rate of estimating the score function of the diffused distribution \(f * \mathcal{N}(0, t)\) with respect to the score matching loss is \(\frac{1}{nt^2} \wedge \frac{1}{nt^{3/2}} \wedge (t^{\alpha-1} + n^{-2(\alpha-1)/(2\alpha+1)})\) for all \(\alpha > 0\) and \(t \ge 0\). As a consequence, we show that the law \(\hat{f}\) of a sample generated from the diffusion model achieves the sharp minimax rate \(\mathbb{E}[\mathrm{TV}(\hat{f}, f)^2] \lesssim n^{-2\alpha/(2\alpha+1)}\) for all \(\alpha > 0\), without the extraneous logarithmic terms that are prevalent in the literature and without the early stopping that, to the best of our knowledge, all existing procedures have required.
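
As context for the score of the diffused distribution \(f * \mathcal{N}(0, t)\): it admits a closed-form representation through Tweedie's formula, a standard identity (not a contribution of this paper) that also underlies denoising score matching. In LaTeX:

```latex
% Tweedie's formula: for X_0 \sim f and X_t = X_0 + \sqrt{t}\, Z with
% Z \sim \mathcal{N}(0, I), the density of X_t is p_t = f * \mathcal{N}(0, t) and
\[
  \nabla_x \log p_t(x) \;=\; \frac{\mathbb{E}[X_0 \mid X_t = x] - x}{t},
\]
% so score estimation at noise level t is equivalent to estimating a posterior
% mean, and the score matching loss is the mean squared error
\[
  \mathcal{R}_t(\hat{s}) \;=\; \mathbb{E}_{X_t \sim p_t}
  \big\| \hat{s}(X_t) - \nabla \log p_t(X_t) \big\|^2 .
\]
```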

Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting arxiv.org/abs/2409.05980 .ML .LG

Rested and restless bandits are two well-known bandit settings that are useful for modeling real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either because of the actions we perform or naturally. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework that generalizes and extends rested and restless bandits. In this setting, the evolution of the arms' expected rewards is governed by a graph defined over the arms: an edge connecting a pair of arms $(i,j)$ represents the fact that a pull of arm $i$ triggers the evolution of arm $j$, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting the complexity of the learning problem with respect to instance-dependent terms that encode specific properties of the underlying graph structure.
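
The triggering mechanism is easy to make concrete. Below is a minimal environment sketch of the setting as described in the abstract (not the authors' algorithm): pulling arm $i$ advances a trigger counter for every arm adjacent to $i$, and each arm's mean reward is a function of its own counter. The class name, reward shapes, and noise level are illustrative assumptions.

```python
import numpy as np

class GraphTriggeredBandit:
    """Pulling arm i advances the state of every arm j with adj[i, j] = True."""

    def __init__(self, adjacency, reward_fns, rng=None):
        self.adj = np.asarray(adjacency, dtype=bool)  # adj[i, j]: pull of i triggers j
        self.reward_fns = reward_fns                  # mean reward of arm j vs. its trigger count
        self.triggers = np.zeros(self.adj.shape[0], dtype=int)
        self.rng = rng or np.random.default_rng(0)

    def pull(self, i):
        self.triggers += self.adj[i]                  # trigger all neighbors of i
        mean = self.reward_fns[i](self.triggers[i])
        return mean + self.rng.normal(0.0, 0.1)

# Degenerate graphs recover the classical settings: the identity matrix gives
# rested bandits (only the pulled arm evolves), while the all-ones matrix makes
# every arm evolve on every pull, as in restless dynamics driven by time.
rising = lambda n: 1.0 - 0.9 ** n                     # "rising": reward grows with triggers
env = GraphTriggeredBandit(np.eye(3), [rising] * 3)
print([round(env.pull(0), 3) for _ in range(3)])      # arm 0's reward rises; arms 1, 2 stay put
```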

Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach arxiv.org/abs/2409.06180 q-bio.GN .ME

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
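
The central computational step named here, fitting a learning curve to relate classification performance to sample size, can be sketched in a few lines. The inverse power-law form, the pilot numbers, and the target accuracy below are illustrative assumptions, not the paper's exact procedure or data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit an inverse power-law learning curve to accuracies measured at pilot
# sample sizes, then extrapolate to the size that reaches a target accuracy.
def learning_curve(n, a, b, c):
    return a - b * n ** (-c)          # a = asymptotic accuracy

sizes = np.array([25, 50, 100, 200, 400])        # hypothetical pilot sizes
accs = np.array([0.62, 0.70, 0.76, 0.80, 0.82])  # hypothetical accuracies

(a, b, c), _ = curve_fit(learning_curve, sizes, accs,
                         p0=[0.85, 1.0, 0.5], maxfev=10_000)

target = 0.83
if a <= target:
    print("target exceeds the estimated asymptotic accuracy; unreachable at any size")
else:
    n_needed = (b / (a - target)) ** (1.0 / c)   # invert a - b * n^(-c) = target
    print(f"estimated sample size for {target:.0%} accuracy: ~{n_needed:,.0f}")
```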

Intrinsic geometry-inspired dependent toroidal distribution: Application to regression model for astigmatism data arxiv.org/abs/2409.06229 .AP

This paper introduces a dependent toroidal distribution to analyze astigmatism data following cataract surgery. Rather than utilizing the flat torus, we opt to represent the bivariate angular data on the surface of a curved torus, which naturally offers smooth edge identifiability and accommodates a variety of curvatures: positive, negative, and zero. Beginning with the area-uniform toroidal distribution on this curved surface, we develop a five-parameter dependent toroidal distribution that harnesses the intrinsic geometry of the surface, via its area element, to model the joint distribution of two dependent circular random variables. We show that both marginal distributions are cardioid, and that one of the conditional distributions is also cardioid. This key feature enables us to propose a circular-circular regression model based on conditional expectations derived from circular moments. To address the high rejection rate (approximately 50%) of existing acceptance-rejection sampling methods for cardioid distributions, we introduce an exact sampling method based on a probabilistic transformation. Additionally, we generate random samples from the proposed dependent toroidal distribution through suitable conditioning. The bivariate distribution and the regression model are applied to analyze astigmatism data from one- and three-month follow-ups after cataract surgery.
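
For reference, the baseline acceptance-rejection sampler whose roughly 50% rejection rate motivates the paper's exact method can be sketched as follows (the paper's own transformation-based sampler is not reproduced here). With a uniform proposal, the acceptance probability for a cardioid density with concentration $\rho$ is $1/(1 + 2\rho)$, which approaches 50% as $\rho \to 1/2$.

```python
import numpy as np

def sample_cardioid(mu, rho, size, rng=None):
    """Rejection sampler for f(t) = (1 + 2*rho*cos(t - mu)) / (2*pi), |rho| <= 1/2,
    with a uniform proposal on [0, 2*pi); acceptance rate is 1 / (1 + 2*rho)."""
    rng = rng or np.random.default_rng(0)
    out = np.empty(size)
    n = 0
    while n < size:
        theta = rng.uniform(0.0, 2.0 * np.pi, size=size)
        u = rng.uniform(0.0, 1.0, size=size)
        # Accept theta with probability f(theta) / envelope, envelope = (1+2*rho)/(2*pi).
        accepted = theta[u * (1.0 + 2.0 * rho) <= 1.0 + 2.0 * rho * np.cos(theta - mu)]
        take = min(size - n, accepted.size)
        out[n:n + take] = accepted[:take]
        n += take
    return out

samples = sample_cardioid(mu=np.pi, rho=0.45, size=10_000)
print("circular mean:", np.angle(np.exp(1j * samples).mean()))  # close to mu = pi
```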

Applications of machine learning to predict seasonal precipitation for East Africa arxiv.org/abs/2409.06238 .AP .ML

Seasonal climate forecasts are commonly based on model runs from fully coupled forecasting systems that use Earth system models to represent interactions between the atmosphere, ocean, land and other Earth-system components. Recently, machine learning (ML) methods have increasingly been investigated for this task, in which large-scale climate variability is linked to local or regional temperature or precipitation in a linear or non-linear fashion. This paper investigates the use of interpretable ML methods to predict seasonal precipitation for East Africa in an operational setting. Dimension reduction is performed by decomposing the precipitation fields via empirical orthogonal functions (EOFs), such that only the respective factor loadings need to be predicted. Indices of large-scale climate variability, including the rate of change in individual indices as well as interactions between different indices, are then used as potential features to obtain tercile forecasts from an interpretable ML algorithm. Several research questions regarding the use of data and the effect of model complexity are studied. The results are compared against the ECMWF seasonal forecasting system (SEAS5) for three seasons (MAM, JJAS and OND) over the period 1993-2020. Compared to climatology for the same period, the ECMWF forecasts have negative skill in MAM and JJAS and significantly positive skill in OND. The ML approach is on par with climatology in MAM and JJAS and shows significantly positive skill in OND, if not quite at the level of the OND ECMWF forecast.
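
The dimension-reduction pipeline described here, EOF decomposition of the precipitation fields followed by prediction of the factor loadings from climate indices, can be sketched with PCA as below. The grid size, synthetic data, and choice of logistic regression as the interpretable classifier are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_years, n_gridpoints, n_indices = 28, 500, 6
precip = rng.gamma(2.0, 1.0, size=(n_years, n_gridpoints))  # seasonal precipitation fields
indices = rng.normal(size=(n_years, n_indices))             # large-scale climate indices

# EOF decomposition via PCA: components_ are the spatial patterns (EOFs);
# the transformed scores are the factor loadings that need to be predicted.
pca = PCA(n_components=3).fit(precip)
loadings = pca.transform(precip)

# Tercile category (below / near / above normal) of the leading loading.
edges = np.quantile(loadings[:, 0], [1 / 3, 2 / 3])
y = np.digitize(loadings[:, 0], edges)

# Interpretable classifier from indices to tercile category; last 5 years held out.
clf = LogisticRegression(max_iter=1000).fit(indices[:-5], y[:-5])
print("held-out tercile accuracy:", clf.score(indices[-5:], y[-5:]))
```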
