arXiv Statistics @arxiv_stats@qoto.org

Bot

I post the feed of the arXiv Statistics.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019

2 Following 603 Followers

Posts Posts and replies Media

arXiv Statistics @arxiv_stats@qoto.org

Cubature-based uncertainty estimation for nonlinear regression models https://arxiv.org/abs/2409.08756 #stat.ME #math.NA #cs.NA

Cubature-based uncertainty estimation for nonlinear regression models

Calibrating model parameters to measured data by minimizing loss functions is an important step in obtaining realistic predictions from model-based approaches, e.g., for process optimization. This is applicable to both knowledge-driven and data-driven model setups. Due to measurement errors, the calibrated model parameters also carry uncertainty. In this contribution, we use cubature formulas based on sparse grids to calculate the variance of the regression results. The number of cubature points is close to the theoretical minimum required for a given level of exactness. We present exact benchmark results, which we also compare to other cubatures. This scheme is then applied to estimate the prediction uncertainty of the NRTL model, calibrated to observations from different experimental designs.

arXiv Statistics @arxiv_stats@qoto.org

Spatial Deep Convolutional Neural Networks https://arxiv.org/abs/2409.07559 #stat.ME #stat.AP

Spatial Deep Convolutional Neural Networks

Spatial prediction problems often use Gaussian process models, which can be computationally burdensome in high dimensions. Specification of an appropriate covariance function for the model can be challenging when complex non-stationarities exist. Recent work has shown that pre-computed spatial basis functions and a feed-forward neural network can capture complex spatial dependence structures while remaining computationally efficient. This paper builds on this literature by tailoring spatial basis functions for use in convolutional neural networks. Through both simulated and real data, we demonstrate that this approach yields more accurate spatial predictions than existing methods. Uncertainty quantification is also considered.

arXiv Statistics @arxiv_stats@qoto.org

Debiased high-dimensional regression calibration for errors-in-variables log-contrast models https://arxiv.org/abs/2409.07568 #stat.ME #stat.ML

Debiased high-dimensional regression calibration for errors-in-variables log-contrast models

Motivated by the challenges in analyzing gut microbiome and metagenomic data, this work aims to tackle the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data affected by mismeasured or contaminated data. We introduce a calibration approach tailored for the linear log-contrast model. Under relatively lenient conditions regarding the sparsity level of the parameter, we have established the asymptotic normality of the estimator for inference. Numerical experiments and an application in microbiome study have demonstrated the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of our proposed methodology extends well beyond compositional data, suggesting its adaptability for a wide range of research contexts.

arXiv Statistics @arxiv_stats@qoto.org

Determining number of factors under stability considerations https://arxiv.org/abs/2409.07617 #stat.ME

Determining number of factors under stability considerations

This paper proposes a novel method for determining the number of factors in linear factor models under stability considerations. An instability measure is proposed based on the principal angle between the estimated loading spaces obtained by data splitting. Based on this measure, criteria for determining the number of factors are proposed and shown to be consistent. This consistency is obtained using results from random matrix theory, especially the complete delocalization of non-outlier eigenvectors. The advantage of the proposed methods over the existing ones is shown via weaker asymptotic requirements for consistency, simulation studies and a real data example.

arXiv Statistics @arxiv_stats@qoto.org

Weather-Informed Probabilistic Forecasting and Scenario Generation in Power Systems https://arxiv.org/abs/2409.07637 #stat.ML #stat.AP #cs.AI #cs.LG

Weather-Informed Probabilistic Forecasting and Scenario Generation in Power Systems

The integration of renewable energy sources (RES) into power grids presents significant challenges due to their intrinsic stochasticity and uncertainty, necessitating the development of new techniques for reliable and efficient forecasting. This paper proposes a method combining probabilistic forecasting and Gaussian copula for day-ahead prediction and scenario generation of load, wind, and solar power in high-dimensional contexts. By incorporating weather covariates and restoring spatio-temporal correlations, the proposed method enhances the reliability of probabilistic forecasts in RES. Extensive numerical experiments compare the effectiveness of different time series models, with performance evaluated using comprehensive metrics on a real-world and high-dimensional dataset from Midcontinent Independent System Operator (MISO). The results highlight the importance of weather information and demonstrate the efficacy of the Gaussian copula in generating realistic scenarios, with the proposed weather-informed Temporal Fusion Transformer (WI-TFT) model showing superior performance.

arXiv Statistics @arxiv_stats@qoto.org

Gaussian Process Upper Confidence Bounds in Distributed Point Target Tracking over Wireless Sensor Networks https://arxiv.org/abs/2409.07652 #stat.ML #math.ST #stat.TH #cs.LG

Gaussian Process Upper Confidence Bounds in Distributed Point Target Tracking over Wireless Sensor Networks

Uncertainty quantification plays a key role in the development of autonomous systems, decision-making, and tracking over wireless sensor networks (WSNs). However, there is a need of providing uncertainty confidence bounds, especially for distributed machine learning-based tracking, dealing with different volumes of data collected by sensors. This paper aims to fill in this gap and proposes a distributed Gaussian process (DGP) approach for point target tracking and derives upper confidence bounds (UCBs) of the state estimates. A unique contribution of this paper includes the derived theoretical guarantees on the proposed approach and its maximum accuracy for tracking with and without clutter measurements. Particularly, the developed approaches with uncertainty bounds are generic and can provide trustworthy solutions with an increased level of reliability. A novel hybrid Bayesian filtering method is proposed to enhance the DGP approach by adopting a Poisson measurement likelihood model. The proposed approaches are validated over a WSN case study, where sensors have limited sensing ranges. Numerical results demonstrate the tracking accuracy and robustness of the proposed approaches. The derived UCBs constitute a tool for trustworthiness evaluation of DGP approaches. The simulation results reveal that the proposed UCBs successfully encompass the true target states with 88% and 42% higher probability in X and Y coordinates, respectively, when compared to the confidence interval-based method.

arXiv Statistics @arxiv_stats@qoto.org

Unsupervised anomaly detection in spatio-temporal stream network sensor data https://arxiv.org/abs/2409.07667 #stat.AP

Unsupervised anomaly detection in spatio-temporal stream network sensor data

The use of in-situ digital sensors for water quality monitoring is becoming increasingly common worldwide. While these sensors provide near real-time data for science, the data are prone to technical anomalies that can undermine the trustworthiness of the data and the accuracy of statistical inferences, particularly in spatial and temporal analyses. Here we propose a framework for detecting anomalies in sensor data recorded in stream networks, which takes advantage of spatial and temporal autocorrelation to improve detection rates. The proposed framework involves the implementation of effective data imputation to handle missing data, alignment of time-series to address temporal disparities, and the identification of water quality events. We explore the effectiveness of a suite of state-of-the-art statistical methods including posterior predictive distributions, finite mixtures, and Hidden Markov Models (HMM). We showcase the practical implementation of automated anomaly detection in near-real time by employing a Bayesian recursive approach. This demonstration is conducted through a comprehensive simulation study and a practical application to a substantive case study situated in the Herbert River, located in Queensland, Australia, which flows into the Great Barrier Reef. We found that methods such as posterior predictive distributions and HMM produce the best performance in detecting multiple types of anomalies. Utilizing data from multiple sensors deployed relatively near one another enhances the ability to distinguish between water quality events and technical anomalies, thereby significantly improving the accuracy of anomaly detection. Thus, uncertainty and biases in water quality reporting, interpretation, and modelling are reduced, and the effectiveness of subsequent management actions improved.

arXiv Statistics @arxiv_stats@qoto.org

Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback--Leibler Divergence Learning https://arxiv.org/abs/2409.07679 #cond-mat.dis-nn #stat.ML #math.ST #stat.ME #stat.TH #cs.LG

Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback--Leibler Divergence Learning

We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), which are a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strength of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the "notorious" issues of underfitting with the forward KLD and mode-collapse with the reverse KLD. Since the summation of forward and reverse KLD seems to be sufficient to combine the strength of both approaches, we include this learning method as a direct baseline in numerical experiments to evaluate its effectiveness. Numerical experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy function fitting, mode-covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other learning methods become more pronounced as the dimensions of target models increase.

arXiv Statistics @arxiv_stats@qoto.org

Critically Damped Third-Order Langevin Dynamics https://arxiv.org/abs/2409.07697 #stat.ML #eess.SP #eess.SY #cs.LG #cs.SY

Critically Damped Third-Order Langevin Dynamics

While systems analysis has been studied for decades in the context of control theory, it has only been recently used to improve the convergence of Denoising Diffusion Probabilistic Models. This work describes a novel improvement to Third- Order Langevin Dynamics (TOLD), a recent diffusion method that performs better than its predecessors. This improvement, abbreviated TOLD++, is carried out by critically damping the TOLD forward transition matrix similarly to Dockhorn's Critically-Damped Langevin Dynamics (CLD). Specifically, it exploits eigen-analysis of the forward transition matrix to derive the optimal set of dynamics under the original TOLD scheme. TOLD++ is theoretically guaranteed to converge faster than TOLD, and its faster convergence is verified on the Swiss Roll toy dataset and CIFAR-10 dataset according to the FID metric.

arXiv Statistics @arxiv_stats@qoto.org

Dataset-Free Weight-Initialization on Restricted Boltzmann Machine https://arxiv.org/abs/2409.07708 #cond-mat.dis-nn #stat.ML #cs.LG

Dataset-Free Weight-Initialization on Restricted Boltzmann Machine

In feed-forward neural networks, dataset-free weight-initialization method such as LeCun, Xavier (or Glorot), and He initializations have been developed. These methods randomly determine the initial values of weight parameters based on specific distributions (e.g., Gaussian or uniform distributions) without using training datasets. To the best of the authors' knowledge, such a dataset-free weight-initialization method is yet to be developed for restricted Boltzmann machines (RBMs), which are probabilistic neural networks consisting of two layers, In this study, we derive a dataset-free weight-initialization method for Bernoulli--Bernoulli RBMs based on a statistical mechanical analysis. In the proposed weight-initialization method, the weight parameters are drawn from a Gaussian distribution with zero mean. The standard deviation of the Gaussian distribution is optimized based on our hypothesis which is that a standard deviation providing a larger layer correlation (LC) between the two layers improves the learning efficiency. The expression of the LC is derived based on a statistical mechanical analysis. The optimal value of the standard deviation corresponds to the maximum point of the LC. The proposed weight-initialization method is identical to Xavier initialization in a specific case (i.e., in the case the sizes of the two layers are the same, the random variables of the layers are $\{-1,1\}$-binary, and all bias parameters are zero).

arXiv Statistics @arxiv_stats@qoto.org

A model-based approach for clustering binned data https://arxiv.org/abs/2409.07738 #stat.ME #math.ST #stat.TH

A model-based approach for clustering binned data

Binned data often appears in different fields of research, and it is generated after summarizing the original data in a sequence of pairs of bins (or their midpoints) and frequencies. There may exist different reasons to only provide this summary, but more importantly, it is necessary being able to perform statistical analyses based only on it. We present a Bayesian nonparametric model for clustering applicable for binned data. Clusters are modeled via random partitions, and within them a model-based approach is assumed. Inferences are performed by a Markov chain Monte Carlo method and the complete proposal is tested using simulated and real data. Having particular interest in studying marine populations, we analyze samples of Lobatus (Strobus) gigas' lengths and found the presence of up to three cohorts along the year.

arXiv Statistics @arxiv_stats@qoto.org

A Stochastic Weather Model: A Case of Bono Region of Ghana https://arxiv.org/abs/2409.06731 #stat.AP #math.PR

A Stochastic Weather Model: A Case of Bono Region of Ghana

The paper sought to fit an Ornstein Uhlenbeck model with seasonal mean and volatility, where the residuals are generated by a Brownian motion for Ghanian daily average temperature. This paper employed the modified Ornstein Uhlenbeck model proposed by Bhowan which has a seasonal mean and stochastic volatility process. The findings revealed that, the Bono region experiences warm temperatures and maximum precipitation up to 32.67 degree celsius and 126.51mm respectively. It was observed that the Daily Average Temperature (DAT) of the region reverts to a temperature of approximately 26 degree celsius at a rate of 18.72% with maximum and minimum temperatures of 32.67degree celsius and 19.75degree celsius respectively. Although the region is in the middle belt of Ghana, it still experiences warm(hot) temperatures daily and experiences dry seasons relatively more than wet seasons in the number of years considered for our analysis. Our model explained approximately 50% of the variations in the daily average temperature of the region which can be regarded as relatively a good model. The findings of this paper are relevant in the pricing of weather derivatives with temperature as an underlying variable in the Ghanaian financial and agricultural sector. Furthermore, it would assist in the development and design of tailored agriculture/crop insurance models which would incorporate temperature dynamics rather than extreme weather conditions/events such as floods, drought and wildfires.

arXiv Statistics @arxiv_stats@qoto.org

Kramnik vs Nakamura: A Chess Scandal https://arxiv.org/abs/2409.06739 #stat.AP

Kramnik vs Nakamura: A Chess Scandal

We provide a statistical analysis of the recent controversy between Vladimir Kramnik (ex chess world champion) and Hikaru Nakamura. Hikaru Nakamura is a chess prodigy and a five-time United States chess champion. Kramnik called into question Nakamura's 45.5 out of 46 win streak in an online blitz contest at chess.com. We assess the weight of evidence using a priori assessment of Viswanathan Anand and the streak evidence. Based on this evidence, we show that Nakamura has a 99.6 percent chance of not cheating. We study the statistical fallacies prevalent in both their analyses. On the one hand Kramnik bases his argument on the probability of such a streak is very small. This falls precisely into the Prosecutor's Fallacy. On the other hand, Nakamura tries to refute the argument using a cherry-picking argument. This violates the likelihood principle. We conclude with a discussion of the relevant statistical literature on the topic of fraud detection and the analysis of streaks in sports data.

arXiv Statistics @arxiv_stats@qoto.org

Learning Deep Kernels for Non-Parametric Independence Testing https://arxiv.org/abs/2409.06890 #stat.ML #cs.LG

Learning Deep Kernels for Non-Parametric Independence Testing

The Hilbert-Schmidt Independence Criterion (HSIC) is a powerful tool for nonparametric detection of dependence between random variables. It crucially depends, however, on the selection of reasonable kernels; commonly-used choices like the Gaussian kernel, or the kernel that yields the distance covariance, are sufficient only for amply sized samples from data distributions with relatively simple forms of dependence. We propose a scheme for selecting the kernels used in an HSIC-based independence test, based on maximizing an estimate of the asymptotic test power. We prove that maximizing this estimate indeed approximately maximizes the true power of the test, and demonstrate that our learned kernels can identify forms of structured dependence between random variables in various experiments.

arXiv Statistics @arxiv_stats@qoto.org

k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation https://arxiv.org/abs/2409.06938 #stat.ML #cs.LG

k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation

We develop hard clustering based on likelihood rather than distance and prove convergence. We also provide simulations and real data examples.

arXiv Statistics @arxiv_stats@qoto.org

Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions https://arxiv.org/abs/2409.06960 #physics.data-an #stat.ML #stat.AP #cs.LG

Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions

In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, it may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, in which they are concentrated in a certain area of the feature space. Considering the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter. We define the SR as an area which is most affected when the observed events are smeared with additive random noise. We overcome challenges in density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to the complementary observation of events that closely resemble the target events but are free of any signals. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that the method can efficiently identify a data-driven SR in a high-dimensional feature space in which a high portion of signal events concentrate.

arXiv Statistics @arxiv_stats@qoto.org

A Practical Theory of Generalization in Selectivity Learning https://arxiv.org/abs/2409.07014 #stat.ML #cs.DB #cs.LG

A Practical Theory of Generalization in Selectivity Learning

Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.

arXiv Statistics @arxiv_stats@qoto.org

Clustered Factor Analysis for Multivariate Spatial Data https://arxiv.org/abs/2409.07018 #stat.ME

Clustered Factor Analysis for Multivariate Spatial Data

Factor analysis has been extensively used to reveal the dependence structures among multivariate variables, offering valuable insight in various fields. However, it cannot incorporate the spatial heterogeneity that is typically present in spatial data. To address this issue, we introduce an effective method specifically designed to discover the potential dependence structures in multivariate spatial data. Our approach assumes that spatial locations can be approximately divided into a finite number of clusters, with locations within the same cluster sharing similar dependence structures. By leveraging an iterative algorithm that combines spatial clustering with factor analysis, we simultaneously detect spatial clusters and estimate a unique factor model for each cluster. The proposed method is evaluated through comprehensive simulation studies, demonstrating its flexibility. In addition, we apply the proposed method to a dataset of railway station attributes in the Tokyo metropolitan area, highlighting its practical applicability and effectiveness in uncovering complex spatial dependencies.

arXiv Statistics @arxiv_stats@qoto.org

From optimal score matching to optimal sampling https://arxiv.org/abs/2409.07032 #stat.ML #cs.LG

From optimal score matching to optimal sampling

The recent, impressive advances in algorithmic generation of high-fidelity image, audio, and video are largely due to great successes in score-based diffusion models. A key implementing step is score matching, that is, the estimation of the score function of the forward diffusion process from training data. As shown in earlier literature, the total variation distance between the law of a sample generated from the trained diffusion model and the ground truth distribution can be controlled by the score matching risk. Despite the widespread use of score-based diffusion models, basic theoretical questions concerning exact optimal statistical rates for score estimation and its application to density estimation remain open. We establish the sharp minimax rate of score estimation for smooth, compactly supported densities. Formally, given \(n\) i.i.d. samples from an unknown \(α\)-Hölder density \(f\) supported on \([-1, 1]\), we prove the minimax rate of estimating the score function of the diffused distribution \(f * \mathcal{N}(0, t)\) with respect to the score matching loss is \(\frac{1}{nt^2} \wedge \frac{1}{nt^{3/2}} \wedge (t^{α-1} + n^{-2(α-1)/(2α+1)})\) for all \(α> 0\) and \(t \ge 0\). As a consequence, it is shown the law \(\hat{f}\) of a sample generated from the diffusion model achieves the sharp minimax rate \(\bE(\dTV(\hat{f}, f)^2) \lesssim n^{-2α/(2α+1)}\) for all \(α> 0\) without any extraneous logarithmic terms which are prevalent in the literature, and without the need for early stopping which has been required for all existing procedures to the best of our knowledge.

arXiv Statistics @arxiv_stats@qoto.org

Identifiability of Polynomial Models from First Principles and via a Gr\"obner Basis Approach https://arxiv.org/abs/2409.07062 #math.ST #stat.TH

Identifiability of Polynomial Models from First Principles and via a Gröbner Basis Approach

The relationship between a set of design points and the class of hierarchical polynomial models identifiable from the design is investigated. Saturated models are of particular interest. Necessary and sufficient conditions are derived on the set of design points for specific terms to be included in leaves of the statistical fan. A practitioner led approach to building hierarchical saturated models that are identifiable is developed. This approach is compared to the method of model building based on Gröbner bases. The main results are illustrated by examples.

Bot

I post the feed of the arXiv Statistics.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019