A Stochastic Weather Model: A Case of Bono Region of Ghana arxiv.org/abs/2409.06731 .AP .PR

The paper sought to fit an Ornstein-Uhlenbeck model with seasonal mean and volatility, where the residuals are generated by a Brownian motion, to Ghanaian daily average temperature. It employed the modified Ornstein-Uhlenbeck model proposed by Bhowan, which has a seasonal mean and a stochastic volatility process. The findings revealed that the Bono region experiences warm temperatures and heavy precipitation, with maxima of 32.67 degrees Celsius and 126.51 mm respectively. The Daily Average Temperature (DAT) of the region was observed to revert to a mean level of approximately 26 degrees Celsius at a rate of 18.72%, with maximum and minimum temperatures of 32.67 and 19.75 degrees Celsius respectively. Although the region lies in the middle belt of Ghana, it still experiences warm (hot) daily temperatures, and over the years considered in our analysis it experienced relatively more dry seasons than wet seasons. Our model explained approximately 50% of the variation in the daily average temperature of the region, which can be regarded as a relatively good fit. The findings of this paper are relevant to the pricing of weather derivatives with temperature as the underlying variable in the Ghanaian financial and agricultural sectors. Furthermore, they would assist in the development and design of tailored agriculture/crop insurance models that incorporate temperature dynamics rather than only extreme weather events such as floods, droughts and wildfires.
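
The model class described here, an Ornstein-Uhlenbeck process reverting to a seasonal mean, can be simulated with a few lines of Euler-Maruyama code. The sketch below is illustrative only: the seasonal amplitude, the volatility, and the use of a plain sinusoidal mean with constant volatility are assumptions, not the paper's fitted specification (which uses a stochastic volatility process).

```python
import numpy as np

# Minimal sketch of an OU temperature model with seasonal mean S_t,
#   dT_t = dS_t/dt + kappa * (S_t - T_t) dt + sigma dW_t,
# simulated with Euler-Maruyama. All parameter values are illustrative.
rng = np.random.default_rng(0)

kappa = 0.1872        # mean-reversion rate (assumed from the ~18.72% figure)
mean_level = 26.0     # long-run mean temperature, degrees Celsius
amplitude = 3.0       # seasonal amplitude (assumption)
sigma = 1.0           # constant volatility (simplifying assumption)
n_days = 3 * 365

def seasonal_mean(t):
    """Seasonal mean S_t: a sinusoid with a one-year period."""
    return mean_level + amplitude * np.sin(2 * np.pi * t / 365.25)

T = np.empty(n_days)
T[0] = seasonal_mean(0)
for t in range(1, n_days):
    drift = (seasonal_mean(t) - seasonal_mean(t - 1)) + kappa * (seasonal_mean(t) - T[t - 1])
    T[t] = T[t - 1] + drift + sigma * rng.standard_normal()

print(f"simulated DAT: mean {T.mean():.2f} C, min {T.min():.2f} C, max {T.max():.2f} C")
```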

Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions arxiv.org/abs/2409.06960 .data-an .ML .AP .LG

In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, such knowledge may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, namely that they are concentrated in a certain area of the feature space. Viewing the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter: we define the SR as the area that is most affected when the observed events are smeared with additive random noise. We overcome the challenges of density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to complementary observations that closely resemble the target events but are free of any signal. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that it can efficiently identify a data-driven SR in a high-dimensional feature space in which a high proportion of the signal events is concentrated.
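
A schematic of the core idea might look like the sketch below: estimate the density ratio of target to control events with a probabilistic classifier, smear the events with additive Gaussian noise, and flag the events whose estimated log-ratio drops the most. This is a hedged reconstruction of the general recipe, not the authors' exact procedure; the toy data, noise scale, and 95% threshold are all assumptions.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy data: control is background-only; target adds a localized "signal" bump.
control = rng.normal(0.0, 1.0, size=(20_000, 4))
target = np.vstack([
    rng.normal(0.0, 1.0, size=(19_000, 4)),
    rng.normal(2.0, 0.2, size=(1_000, 4)),   # localized signal component
])

def log_density_ratio(x_num, x_den):
    """Classifier-based estimate of log(p_num(x) / p_den(x))."""
    X = np.vstack([x_num, x_den])
    y = np.concatenate([np.ones(len(x_num)), np.zeros(len(x_den))])
    clf = HistGradientBoostingClassifier().fit(X, y)
    def log_ratio(x):
        p = clf.predict_proba(x)[:, 1].clip(1e-6, 1 - 1e-6)
        return np.log(p / (1 - p))
    return log_ratio

log_r = log_density_ratio(target, control)
smeared = target + rng.normal(0.0, 0.5, size=target.shape)  # additive-noise smearing
log_r_smeared = log_density_ratio(smeared, control)

# Events whose density ratio drops most under smearing are SR candidates.
drop = log_r(target) - log_r_smeared(target)
sr_mask = drop > np.quantile(drop, 0.95)
print(f"{sr_mask.sum()} events flagged as signal-region candidates")
```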

A Practical Theory of Generalization in Selectivity Learning arxiv.org/abs/2409.07014 .ML .DB .LG

Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
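
For readers unfamiliar with the setup, a query-driven selectivity model in its generic form regresses observed selectivities on featurized predicates. The sketch below illustrates that setup and an OOD test workload; the one-column table, range-predicate featurization, and choice of regressor are assumptions for illustration, not the paper's construction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy table: one numeric column; queries are range predicates [lo, hi].
data = rng.normal(0.0, 1.0, size=50_000)

def true_selectivity(lo, hi):
    """Fraction of rows satisfying the predicate lo <= x <= hi."""
    return np.mean((data >= lo) & (data <= hi))

# Training workload: random ranges with their observed selectivities.
lows = rng.uniform(-3, 2, size=1_000)
highs = lows + rng.uniform(0.1, 2.0, size=1_000)
X_train = np.column_stack([lows, highs])
y_train = np.array([true_selectivity(lo, hi) for lo, hi in X_train])

model = GradientBoostingRegressor().fit(X_train, y_train)

# OOD test workload: ranges drawn from a shifted distribution.
lows_ood = rng.uniform(0, 2.5, size=500)
highs_ood = lows_ood + rng.uniform(0.1, 2.0, size=500)
X_ood = np.column_stack([lows_ood, highs_ood])
y_ood = np.array([true_selectivity(lo, hi) for lo, hi in X_ood])

pred = np.clip(model.predict(X_ood), 0.0, 1.0)
print("OOD mean absolute error:", np.abs(pred - y_ood).mean())
```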

From optimal score matching to optimal sampling arxiv.org/abs/2409.07032 .ML .LG

The recent, impressive advances in the algorithmic generation of high-fidelity images, audio, and video are largely due to the great success of score-based diffusion models. A key implementation step is score matching, that is, the estimation of the score function of the forward diffusion process from training data. As shown in earlier literature, the total variation distance between the law of a sample generated from the trained diffusion model and the ground-truth distribution can be controlled by the score matching risk. Despite the widespread use of score-based diffusion models, basic theoretical questions concerning the exact optimal statistical rates for score estimation and its application to density estimation remain open. We establish the sharp minimax rate of score estimation for smooth, compactly supported densities. Formally, given \(n\) i.i.d. samples from an unknown \(\alpha\)-Hölder density \(f\) supported on \([-1, 1]\), we prove that the minimax rate of estimating the score function of the diffused distribution \(f * \mathcal{N}(0, t)\) with respect to the score matching loss is \(\frac{1}{nt^2} \wedge \frac{1}{nt^{3/2}} \wedge (t^{\alpha-1} + n^{-2(\alpha-1)/(2\alpha+1)})\) for all \(\alpha > 0\) and \(t \ge 0\). As a consequence, we show that the law \(\hat{f}\) of a sample generated from the diffusion model achieves the sharp minimax rate \(\mathbb{E}[\mathrm{TV}(\hat{f}, f)^2] \lesssim n^{-2\alpha/(2\alpha+1)}\) for all \(\alpha > 0\), without the extraneous logarithmic terms that are prevalent in the literature and without the early stopping that, to the best of our knowledge, all existing procedures have required.
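
As context for the score of the diffused distribution \(f * \mathcal{N}(0, t)\): it admits a closed-form representation through Tweedie's formula, a standard identity (not a contribution of this paper) that also underlies denoising score matching. In LaTeX:

```latex
% Tweedie's formula: for X_0 \sim f and X_t = X_0 + \sqrt{t}\, Z with
% Z \sim \mathcal{N}(0, I), the density of X_t is p_t = f * \mathcal{N}(0, t) and
\[
  \nabla_x \log p_t(x) \;=\; \frac{\mathbb{E}[X_0 \mid X_t = x] - x}{t},
\]
% so score estimation at noise level t is equivalent to estimating a posterior
% mean, and the score matching loss is the mean squared error
\[
  \mathcal{R}_t(\hat{s}) \;=\; \mathbb{E}_{X_t \sim p_t}
  \big\| \hat{s}(X_t) - \nabla \log p_t(X_t) \big\|^2 .
\]
```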

Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting arxiv.org/abs/2409.05980 .ML .LG

Rested and restless bandits are two well-known bandit settings that are useful for modeling real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either because of the actions we perform or naturally. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework that generalizes and extends rested and restless bandits. In this setting, the evolution of the arms' expected rewards is governed by a graph defined over the arms: an edge connecting a pair of arms $(i,j)$ represents the fact that a pull of arm $i$ triggers the evolution of arm $j$, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting the complexity of the learning problem with respect to instance-dependent terms that encode specific properties of the underlying graph structure.
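
The triggering mechanism is easy to make concrete. Below is a minimal environment sketch of the setting as described in the abstract (not the authors' algorithm): pulling arm $i$ advances a trigger counter for every arm adjacent to $i$, and each arm's mean reward is a function of its own counter. The class name, reward shapes, and noise level are illustrative assumptions.

```python
import numpy as np

class GraphTriggeredBandit:
    """Pulling arm i advances the state of every arm j with adj[i, j] = True."""

    def __init__(self, adjacency, reward_fns, rng=None):
        self.adj = np.asarray(adjacency, dtype=bool)  # adj[i, j]: pull of i triggers j
        self.reward_fns = reward_fns                  # mean reward of arm j vs. its trigger count
        self.triggers = np.zeros(self.adj.shape[0], dtype=int)
        self.rng = rng or np.random.default_rng(0)

    def pull(self, i):
        self.triggers += self.adj[i]                  # trigger all neighbors of i
        mean = self.reward_fns[i](self.triggers[i])
        return mean + self.rng.normal(0.0, 0.1)

# Degenerate graphs recover the classical settings: the identity matrix gives
# rested bandits (only the pulled arm evolves), while the all-ones matrix makes
# every arm evolve on every pull, as in restless dynamics driven by time.
rising = lambda n: 1.0 - 0.9 ** n                     # "rising": reward grows with triggers
env = GraphTriggeredBandit(np.eye(3), [rising] * 3)
print([round(env.pull(0), 3) for _ in range(3)])      # arm 0's reward rises; arms 1, 2 stay put
```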

Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach arxiv.org/abs/2409.06180 q-bio.GN .ME

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
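
The central computational step named here, fitting a learning curve to relate classification performance to sample size, can be sketched in a few lines. The inverse power-law form, the pilot numbers, and the target accuracy below are illustrative assumptions, not the paper's exact procedure or data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit an inverse power-law learning curve to accuracies measured at pilot
# sample sizes, then extrapolate to the size that reaches a target accuracy.
def learning_curve(n, a, b, c):
    return a - b * n ** (-c)          # a = asymptotic accuracy

sizes = np.array([25, 50, 100, 200, 400])        # hypothetical pilot sizes
accs = np.array([0.62, 0.70, 0.76, 0.80, 0.82])  # hypothetical accuracies

(a, b, c), _ = curve_fit(learning_curve, sizes, accs,
                         p0=[0.85, 1.0, 0.5], maxfev=10_000)

target = 0.83
if a <= target:
    print("target exceeds the estimated asymptotic accuracy; unreachable at any size")
else:
    n_needed = (b / (a - target)) ** (1.0 / c)   # invert a - b * n^(-c) = target
    print(f"estimated sample size for {target:.0%} accuracy: ~{n_needed:,.0f}")
```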

Intrinsic geometry-inspired dependent toroidal distribution: Application to regression model for astigmatism data arxiv.org/abs/2409.06229 .AP

This paper introduces a dependent toroidal distribution to analyze astigmatism data following cataract surgery. Rather than utilizing the flat torus, we opt to represent the bivariate angular data on the surface of a curved torus, which naturally offers smooth edge identifiability and accommodates a variety of curvatures: positive, negative, and zero. Beginning with the area-uniform toroidal distribution on this curved surface, we develop a five-parameter dependent toroidal distribution that harnesses the intrinsic geometry of the surface, via its area element, to model the joint distribution of two dependent circular random variables. We show that both marginal distributions are cardioid, and that one of the conditional distributions is also cardioid. This key feature enables us to propose a circular-circular regression model based on conditional expectations derived from circular moments. To address the high rejection rate (approximately 50%) of existing acceptance-rejection sampling methods for cardioid distributions, we introduce an exact sampling method based on a probabilistic transformation. Additionally, we generate random samples from the proposed dependent toroidal distribution through suitable conditioning. The bivariate distribution and the regression model are applied to analyze astigmatism data from one- and three-month follow-ups after cataract surgery.
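
For reference, the baseline acceptance-rejection sampler whose roughly 50% rejection rate motivates the paper's exact method can be sketched as follows (the paper's own transformation-based sampler is not reproduced here). With a uniform proposal, the acceptance probability for a cardioid density with concentration $\rho$ is $1/(1 + 2\rho)$, which approaches 50% as $\rho \to 1/2$.

```python
import numpy as np

def sample_cardioid(mu, rho, size, rng=None):
    """Rejection sampler for f(t) = (1 + 2*rho*cos(t - mu)) / (2*pi), |rho| <= 1/2,
    with a uniform proposal on [0, 2*pi); acceptance rate is 1 / (1 + 2*rho)."""
    rng = rng or np.random.default_rng(0)
    out = np.empty(size)
    n = 0
    while n < size:
        theta = rng.uniform(0.0, 2.0 * np.pi, size=size)
        u = rng.uniform(0.0, 1.0, size=size)
        # Accept theta with probability f(theta) / envelope, envelope = (1+2*rho)/(2*pi).
        accepted = theta[u * (1.0 + 2.0 * rho) <= 1.0 + 2.0 * rho * np.cos(theta - mu)]
        take = min(size - n, accepted.size)
        out[n:n + take] = accepted[:take]
        n += take
    return out

samples = sample_cardioid(mu=np.pi, rho=0.45, size=10_000)
print("circular mean:", np.angle(np.exp(1j * samples).mean()))  # close to mu = pi
```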

Applications of machine learning to predict seasonal precipitation for East Africa arxiv.org/abs/2409.06238 .AP .ML

Seasonal climate forecasts are commonly based on model runs from fully coupled forecasting systems that use Earth system models to represent interactions between the atmosphere, ocean, land and other Earth-system components. Recently, machine learning (ML) methods have increasingly been investigated for this task, in which large-scale climate variability is linked to local or regional temperature or precipitation in a linear or non-linear fashion. This paper investigates the use of interpretable ML methods to predict seasonal precipitation for East Africa in an operational setting. Dimension reduction is performed by decomposing the precipitation fields via empirical orthogonal functions (EOFs), such that only the respective factor loadings need to be predicted. Indices of large-scale climate variability, including the rate of change in individual indices as well as interactions between different indices, are then used as potential features to obtain tercile forecasts from an interpretable ML algorithm. Several research questions regarding the use of data and the effect of model complexity are studied. The results are compared against the ECMWF seasonal forecasting system (SEAS5) for three seasons (MAM, JJAS and OND) over the period 1993-2020. Compared to climatology for the same period, the ECMWF forecasts have negative skill in MAM and JJAS and significantly positive skill in OND. The ML approach is on par with climatology in MAM and JJAS and shows significantly positive skill in OND, if not quite at the level of the OND ECMWF forecast.
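
The dimension-reduction pipeline described here, EOF decomposition of the precipitation fields followed by prediction of the factor loadings from climate indices, can be sketched with PCA as below. The grid size, synthetic data, and choice of logistic regression as the interpretable classifier are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_years, n_gridpoints, n_indices = 28, 500, 6
precip = rng.gamma(2.0, 1.0, size=(n_years, n_gridpoints))  # seasonal precipitation fields
indices = rng.normal(size=(n_years, n_indices))             # large-scale climate indices

# EOF decomposition via PCA: components_ are the spatial patterns (EOFs);
# the transformed scores are the factor loadings that need to be predicted.
pca = PCA(n_components=3).fit(precip)
loadings = pca.transform(precip)

# Tercile category (below / near / above normal) of the leading loading.
edges = np.quantile(loadings[:, 0], [1 / 3, 2 / 3])
y = np.digitize(loadings[:, 0], edges)

# Interpretable classifier from indices to tercile category; last 5 years held out.
clf = LogisticRegression(max_iter=1000).fit(indices[:-5], y[:-5])
print("held-out tercile accuracy:", clf.score(indices[-5:], y[-5:]))
```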
