Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting arxiv.org/abs/2409.05980 .ML .LG

Rested and Restless Bandits are two well-known bandit settings that are useful to model real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either because of the actions we perform or naturally. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework that generalizes and extends rested and restless bandits. In this setting, the evolution of the arms' expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms $(i,j)$ represents the fact that a pull of arm $i$ triggers the evolution of arm $j$, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting how the complexity of the learning problem depends on instance-dependent terms that encode specific properties of the underlying graph structure.
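
To make the triggering mechanism concrete, here is a minimal simulation sketch (not from the paper): pulling an arm advances the trigger count of every neighboring arm in the graph, and each arm's expected reward is a monotone function of its own trigger count, increasing for rising arms and decreasing for rotting arms. The graph, reward curves, and noise level are illustrative assumptions.

```python
import numpy as np

class GraphTriggeredBandit:
    """Toy graph-triggered bandit simulator (illustrative sketch, not the paper's code).

    adjacency[i][j] = 1 means a pull of arm i triggers the evolution of arm j
    (the abstract describes these edges as symmetric). reward_curves[j](t) is
    arm j's expected reward after t triggers: non-decreasing for rising arms,
    non-increasing for rotting arms.
    """

    def __init__(self, adjacency, reward_curves, noise_std=0.1, seed=0):
        self.adjacency = np.asarray(adjacency)
        self.reward_curves = reward_curves
        self.triggers = np.zeros(len(reward_curves), dtype=int)
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        mean = self.reward_curves[arm](self.triggers[arm])
        reward = mean + self.rng.normal(0.0, self.noise_std)
        self.triggers += self.adjacency[arm]   # pulling `arm` triggers its neighbors
        return reward

rising = lambda t: 1.0 - 0.9 * 0.8 ** t    # grows with the number of triggers
rotting = lambda t: 0.8 * 0.95 ** t        # decays with the number of triggers

# An identity adjacency matrix recovers rested behavior (each arm triggers only
# itself); an all-ones matrix recovers restless behavior (every pull triggers all arms).
env = GraphTriggeredBandit(adjacency=[[1, 1, 0], [1, 1, 0], [0, 0, 1]],
                           reward_curves=[rising, rotting, rotting])
print([round(env.pull(0), 3) for _ in range(5)])
```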

Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach arxiv.org/abs/2409.06180 -bio.GN .ME

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
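
As background for the learning-curve step (a generic sketch, not the authors' GitHub implementation): one common choice is an inverse power law, where classifier performance at training size n behaves like a - b * n^(-c); fitting it to performance at a few pilot sample sizes lets you extrapolate the sample size needed to reach a target. The performance values, functional form, and target below are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, b, c):
    # Common learning-curve form: performance approaches the plateau a as n grows.
    return a - b * n ** (-c)

# Hypothetical classifier performance (e.g. accuracy or power) at pilot training sizes.
sizes = np.array([25, 50, 100, 200, 400])
perf = np.array([0.62, 0.68, 0.74, 0.78, 0.81])

params, _ = curve_fit(inverse_power_law, sizes, perf, p0=[0.9, 1.0, 0.5], maxfev=10000)

# Smallest sample size whose predicted performance reaches a chosen target.
target = 0.80
grid = np.arange(25, 5001)
pred = inverse_power_law(grid, *params)
needed = grid[np.argmax(pred >= target)] if (pred >= target).any() else None
print(params, needed)
```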

Intrinsic geometry-inspired dependent toroidal distribution: Application to regression model for astigmatism data arxiv.org/abs/2409.06229 .AP

This paper introduces a dependent toroidal distribution to analyze astigmatism data following cataract surgery. Rather than utilizing the flat torus, we opt to represent the bivariate angular data on the surface of a curved torus, which naturally offers smooth edge identifiability and accommodates a variety of curvatures: positive, negative, and zero. Beginning with the area-uniform toroidal distribution on this curved surface, we develop a five-parameter dependent toroidal distribution that harnesses its intrinsic geometry via the area element to model the distribution of two dependent circular random variables. We show that both marginal distributions are Cardioid, with one of the conditional variables also following a Cardioid distribution. This key feature enables us to propose a circular-circular regression model based on conditional expectations derived from circular moments. To address the high rejection rate (approximately 50%) in existing acceptance-rejection sampling methods for Cardioid distributions, we introduce an exact sampling method based on a probabilistic transformation. Additionally, we generate random samples from the proposed dependent toroidal distribution through suitable conditioning. This bivariate distribution and the regression model are applied to analyze astigmatism data collected at one- and three-month follow-ups after cataract surgery.
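
For context on the sampling issue mentioned above (a sketch of the standard acceptance-rejection baseline, not the paper's exact sampler): the Cardioid density f(θ) = (1 + 2ρ cos(θ - μ)) / (2π), with |ρ| ≤ 1/2, is bounded by (1 + 2ρ)/(2π), so rejection sampling from a uniform proposal on [0, 2π) accepts with probability 1/(1 + 2ρ), which approaches 50% as ρ nears 1/2. The parameter values below are illustrative.

```python
import numpy as np

def rcardioid_rejection(size, mu=0.0, rho=0.45, seed=0):
    """Rejection sampler for the Cardioid density
    f(theta) = (1 + 2*rho*cos(theta - mu)) / (2*pi), |rho| <= 1/2,
    with a uniform proposal on [0, 2*pi). The acceptance rate is 1/(1 + 2*rho),
    which is the ~50% rejection rate the abstract refers to when rho is near 1/2.
    """
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < size:
        theta = rng.uniform(0.0, 2 * np.pi)
        u = rng.uniform(0.0, 1.0)
        # Accept theta with probability f(theta) / (M * g(theta)), where M = 1 + 2*rho.
        if u <= (1 + 2 * rho * np.cos(theta - mu)) / (1 + 2 * rho):
            out.append(theta)
    return np.array(out)

samples = rcardioid_rejection(1000)
print(samples.mean(), samples.std())
```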

Applications of machine learning to predict seasonal precipitation for East Africa arxiv.org/abs/2409.06238 .AP .ML

Seasonal climate forecasts are commonly based on model runs from fully coupled forecasting systems that use Earth system models to represent interactions between the atmosphere, ocean, land and other Earth-system components. Recently, machine learning (ML) methods have increasingly been investigated for this task, in which large-scale climate variability is linked to local or regional temperature or precipitation in a linear or non-linear fashion. This paper investigates the use of interpretable ML methods to predict seasonal precipitation for East Africa in an operational setting. Dimension reduction is performed by decomposing the precipitation fields via empirical orthogonal functions (EOFs), such that only the respective factor loadings need to be predicted. Indices of large-scale climate variability--including the rate of change in individual indices as well as interactions between different indices--are then used as potential features to obtain tercile forecasts from an interpretable ML algorithm. Several research questions regarding the use of data and the effect of model complexity are studied. The results are compared against the ECMWF seasonal forecasting system (SEAS5) for three seasons--MAM, JJAS and OND--over the period 1993-2020. Compared to climatology for the same period, the ECMWF forecasts have negative skill in MAM and JJAS and significant positive skill in OND. The ML approach is on par with climatology in MAM and JJAS and shows significantly positive skill in OND, though not quite at the level of the OND ECMWF forecast.
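
A minimal sketch of the EOF-based dimension reduction described above (array shapes, the number of retained EOFs, and the linear stand-in for the ML model are assumptions, not the authors' code): decompose a seasons-by-gridpoints precipitation anomaly matrix with an SVD, keep the leading EOF patterns, and predict only the retained factor loadings from large-scale climate indices.

```python
import numpy as np

# Hypothetical precipitation anomalies: (n_seasons, n_gridpoints).
rng = np.random.default_rng(0)
precip = rng.normal(size=(28, 500))              # e.g. the 1993-2020 seasons
precip_anom = precip - precip.mean(axis=0)

# EOF decomposition via SVD: rows of Vt are the spatial EOF patterns,
# U * S are the corresponding factor loadings (principal component time series).
U, S, Vt = np.linalg.svd(precip_anom, full_matrices=False)
k = 3                                            # number of EOFs retained (assumption)
loadings = U[:, :k] * S[:k]                      # (n_seasons, k) targets to predict
eofs = Vt[:k]                                    # (k, n_gridpoints) spatial patterns

# Hypothetical large-scale climate indices used as predictors (e.g. ENSO, IOD).
indices = rng.normal(size=(28, 4))
coef, *_ = np.linalg.lstsq(indices, loadings, rcond=None)  # stand-in for the ML model
reconstructed = (indices @ coef) @ eofs          # predicted anomaly fields
print(reconstructed.shape)
```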

Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm arxiv.org/abs/2409.04500 .ML .ME .LG

Estimating the effect of treatments from natural experiments, where treatments are pre-assigned, is an important and well-studied problem. We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit. Surprisingly, applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy. To address this, we create a benchmark to evaluate estimator accuracy using synthetic outcomes, whose design was guided by domain experts. The benchmark extensively explores performance as real-world conditions such as sample size, treatment correlation, and propensity score accuracy vary. Based on our benchmark, we observe that doubly robust treatment effect estimators, which are based on simple and intuitive regression adjustment, generally outperform other, more complicated estimators by orders of magnitude. To better support our theoretical understanding of doubly robust estimators, we derive a closed-form expression for the variance of any such estimator that uses dataset splitting to obtain an unbiased estimate. This expression motivates the design of a new doubly robust estimator that uses a novel loss function when fitting functions for regression adjustment. We release the dataset and benchmark in a Python package; the package is built in a modular way to facilitate new datasets and estimators.
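
For readers unfamiliar with this estimator class, here is a generic cross-fitted augmented inverse propensity weighted (AIPW) estimator, the textbook doubly robust construction that combines regression adjustment with an inverse-propensity correction. It is shown for context only; it is not the paper's new estimator or loss function, and the nuisance models and synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import KFold

def aipw_ate(X, T, Y, n_splits=2, seed=0):
    """Cross-fitted AIPW (doubly robust) estimate of the average treatment effect."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Nuisance models fitted on one fold, evaluated on the held-out fold.
        ps = LogisticRegression(max_iter=1000).fit(X[train], T[train])
        mu1 = LinearRegression().fit(X[train][T[train] == 1], Y[train][T[train] == 1])
        mu0 = LinearRegression().fit(X[train][T[train] == 0], Y[train][T[train] == 0])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        t, y = T[test], Y[test]
        # Regression adjustment plus inverse-propensity-weighted residual correction.
        psi[test] = (m1 - m0
                     + t * (y - m1) / e
                     - (1 - t) * (y - m0) / (1 - e))
    return psi.mean()

# Tiny synthetic example (purely illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * T + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=500)
print(round(aipw_ate(X, T, Y), 2))   # should land near the true effect of 2
```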

Enhancing Electrocardiography Data Classification Confidence: A Robust Gaussian Process Approach (MuyGPs) arxiv.org/abs/2409.04642 .AP

Analyzing electrocardiography (ECG) data is essential for diagnosing and monitoring various heart diseases. The clinical adoption of automated methods requires accurate confidence measurements, which are largely absent from existing classification methods. In this paper, we present a robust Gaussian process classification hyperparameter training model (MuyGPs) for discerning normal heartbeat signals from signals affected by different arrhythmias and myocardial infarction. We compare the performance of MuyGPs with a traditional Gaussian process classifier as well as conventional machine learning models such as Random Forest, Extra Trees, k-Nearest Neighbors, and Convolutional Neural Networks. Comparing these models reveals MuyGPs as the most performant model for making confident predictions on individual patient ECGs. Furthermore, we explore the posterior distribution obtained from the Gaussian process to interpret the predictions and quantify uncertainty. In addition, we provide a guideline for obtaining the prediction confidence of the machine learning models and quantitatively compare the uncertainty measures of these models. In particular, we identify a class of less-accurate (ambiguous) signals for further diagnosis by an expert.
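
As a generic illustration of posterior-based confidence (using scikit-learn's standard Gaussian process classifier, i.e. the "traditional" baseline the abstract compares against, not the MuyGPs library itself): beats whose posterior class probability sits near 0.5 can be flagged as ambiguous and referred to an expert. The features, labels, and threshold are synthetic assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Hypothetical feature matrix of heartbeat windows and binary labels
# (0 = normal, 1 = abnormal); shapes and threshold are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
gpc.fit(X[:200], y[:200])

proba = gpc.predict_proba(X[200:])[:, 1]
# Flag "ambiguous" beats whose posterior probability is close to 0.5
# for referral to an expert, as in the workflow the abstract describes.
ambiguous = np.abs(proba - 0.5) < 0.15
print(f"{ambiguous.sum()} of {len(proba)} test beats flagged as ambiguous")
```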

A Multi-objective Economic Statistical Design of the CUSUM chart: NSGA II Approach arxiv.org/abs/2409.04673 .AP

This paper presents an approach for the economic statistical design of the Cumulative Sum (CUSUM) control chart in a multi-objective optimization framework. The proposed methodology integrates economic considerations with statistical aspects to optimize the design parameters of the CUSUM chart: the sample size ($n$), sampling interval ($h$), and decision interval ($H$). The Non-dominated Sorting Genetic Algorithm II (NSGA-II) is employed to solve the multi-objective optimization problem, aiming to minimize both the average cost per cycle ($C_E$) and the out-of-control Average Run Length ($ARL_\delta$) simultaneously. The effectiveness of the proposed approach is demonstrated through a numerical example by determining the optimized CUSUM chart parameters using NSGA-II. Additionally, sensitivity analysis is conducted to assess the impact of variations in input parameters. The corresponding results indicate that the proposed methodology reduces the expected cost per cycle by about 43% compared to the findings of the 2011 article by M. Lee. A more extensive comparison with respect to both $C_E$ and $ARL_\delta$ is also provided to justify the proposed methodology. This highlights the practical relevance and potential of this study for the correct application of CUSUM charts to process control in industry.
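
As a sketch of how one of the two objectives could be evaluated for a candidate design point inside the optimizer, here is a Monte Carlo estimate of the out-of-control ARL for a one-sided CUSUM on subgroup means. The cost model $C_E$ and the NSGA-II wiring are omitted, the shift size and reference value are assumptions, and this is not the authors' code.

```python
import numpy as np

def cusum_arl(n, H, shift=1.0, k=0.5, n_runs=2000, seed=0):
    """Monte Carlo out-of-control ARL for a one-sided CUSUM on subgroup means.

    n: subgroup size, H: decision interval (standardized units), shift: true mean
    shift in process standard deviations, k: reference value (often shift/2).
    ARL is counted in sampling points; the sampling interval h only enters the
    design through the time/cost model, which is not shown here.
    """
    rng = np.random.default_rng(seed)
    run_lengths = []
    for _ in range(n_runs):
        c, t = 0.0, 0
        while True:
            t += 1
            # Standardized subgroup mean under the shifted process.
            z = rng.normal(shift, 1.0, size=n).mean() * np.sqrt(n)
            c = max(0.0, c + z - k)
            if c > H:
                run_lengths.append(t)
                break
    return float(np.mean(run_lengths))

# In a multi-objective design, this ARL and an expected cost per cycle would be
# the two objectives handed to NSGA-II.
print(cusum_arl(n=4, H=5.0))
```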

Establishing the Parallels and Differences Between Right-Censored and Missing Covariates arxiv.org/abs/2409.04684 .ME .AP

While right-censored time-to-event outcomes have been studied for decades, handling time-to-event covariates, also known as right-censored covariates, is now of growing interest. So far, the literature has treated right-censored covariates as distinct from missing covariates, overlooking the potential applicability of estimators to both scenarios. We bridge this gap by establishing connections between right-censored and missing covariates under various assumptions about censoring and missingness, allowing us to identify parallels and differences and to determine when estimators can be used in both contexts. These connections allow us to adapt five estimators for right-censored covariates to the unexplored setting of informative covariate right-censoring, where the event time depends on the censoring time, and to formulate a new estimator for this setting. We establish the asymptotic properties of the six estimators, evaluate their robustness under incorrect distributional assumptions, and establish their comparative efficiency. We conduct a simulation study to confirm our theoretical results and then apply all estimators to a Huntington disease observational study to analyze cognitive impairments as a function of time to clinical diagnosis.
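
To make the parallel concrete, here is a standard inverse-probability-weighted complete-case construction of the kind being connected here (textbook background, not a result from the paper; the exact conditioning sets in the weights depend on the missingness and censoring assumptions being made):

```latex
% Missing covariate: R_i = 1 if X_i is observed; weighted complete-case estimating equation
\sum_{i=1}^{n} \frac{R_i}{P(R_i = 1 \mid V_i)}\, U(Y_i, X_i, Z_i; \beta) = 0 .
% Right-censored covariate: \Delta_i = 1\{X_i \le C_i\} plays the role of R_i,
% and the weight becomes the probability of X_i being uncensored
\sum_{i=1}^{n} \frac{\Delta_i}{P(\Delta_i = 1 \mid X_i, Z_i)}\, U(Y_i, X_i, Z_i; \beta) = 0 .
```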

Privacy enhanced collaborative inference in the Cox proportional hazards model for distributed data arxiv.org/abs/2409.04716 .AP .ST .TH

Data sharing barriers are a paramount challenge in multicenter clinical studies, where multiple data sources are stored in a distributed fashion at different local study sites. Particularly in time-to-event analysis, where global risk sets are needed for the Cox proportional hazards model, access to a centralized database is typically necessary. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming. Furthermore, the construction and distribution of risk sets to participating clinical centers for subsequent calculations may pose a risk of revealing individual-level information. We propose a new collaborative Cox model that eliminates the need for accessing the centralized database and constructing global risk sets, requiring only the sharing of summary statistics of significantly smaller dimension than the risk sets. Thus, the proposed collaborative inference enjoys maximal protection of data privacy. We show theoretically and numerically that the new distributed proportional hazards model approach has little loss of statistical power compared to the centralized method that requires merging the entire dataset. We present a renewable sieve method to establish large-sample properties of the proposed method. We illustrate its performance through simulation experiments and a real-world data example from patients with kidney transplantation in the Organ Procurement and Transplantation Network (OPTN), studying the factors associated with 5-year death-censored graft failure (DCGF) for patients who underwent kidney transplants in the US.
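
To see why global risk sets normally force data pooling, here is the standard (centralized) Cox partial log-likelihood, in which each event's term sums over every subject still at risk across all sites. This is the baseline the collaborative method avoids, not the paper's summary-statistics approach, and the synthetic data are illustrative.

```python
import numpy as np

def cox_partial_loglik(beta, times, events, X):
    """Centralized Cox partial log-likelihood (Breslow form, ties not handled).

    Each event's denominator sums over the global risk set R(t_i) = {j : t_j >= t_i},
    which is why pooled, centralized data (or shared risk sets) are normally needed.
    """
    eta = X @ beta
    loglik = 0.0
    for i in np.where(events == 1)[0]:
        risk_set = times >= times[i]              # requires access to all subjects
        loglik += eta[i] - np.log(np.exp(eta[risk_set]).sum())
    return loglik

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
times = rng.exponential(scale=np.exp(-X @ np.array([0.5, -0.3])))
events = rng.binomial(1, 0.8, size=50)
print(cox_partial_loglik(np.array([0.5, -0.3]), times, events, X))
```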
