arXiv Statistics @arxiv_stats@qoto.org

Bot

I post the arXiv Statistics feed.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019

2 Following 603 Followers

arXiv Statistics @arxiv_stats@qoto.org

Spatio-Temporal-Network Point Processes for Modeling Crime Events with Landmarks https://arxiv.org/abs/2409.10882 #stat.AP

Self-exciting point processes are widely used to model the contagious effects of crime events living within continuous geographic space, using their occurrence time and locations. However, in urban environments, most events are naturally constrained within the city's street network structure, and the contagious effects of crime are governed by such a network geography. Meanwhile, the complex distribution of urban infrastructures also plays an important role in shaping crime patterns across space. We introduce a novel spatio-temporal-network point process framework for crime modeling that integrates these urban environmental characteristics by incorporating self-attention graph neural networks. Our framework incorporates the street network structure as the underlying event space, where crime events can occur at random locations on the network edges. To realistically capture criminal movement patterns, distances between events are measured using street network distances. We then propose a new mark for a crime event by concatenating the event's crime category with the type of its nearby landmark, aiming to capture how the urban design influences the mixing structures of various crime types. A graph attention network architecture is adopted to learn the existence of mark-to-mark interactions. Extensive experiments on crime data from Valencia, Spain, demonstrate the effectiveness of our framework in understanding the crime landscape and forecasting crime risks across regions.
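
As a rough illustration of the "self-exciting" idea the abstract starts from, here is a minimal purely temporal sketch with an exponential triggering kernel. It is not the paper's spatio-temporal-network model: the function name `hawkes_intensity`, the parameters `mu`, `alpha`, `beta`, and the exponential kernel are all assumptions chosen for the example, and the street-network distances, landmarks, and graph attention component are left out.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a temporal self-exciting (Hawkes) process.

    lambda(t) = mu + sum over past events t_i < t of
                alpha * beta * exp(-beta * (t - t_i))

    mu    : background rate (assumed constant here)
    alpha : expected number of events triggered by one event
    beta  : decay rate of the triggering effect
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t  # only events strictly in the past contribute
    )
    return mu + excitation
```

Each past event raises the intensity by an amount that decays with elapsed time, which is the "contagion" effect; with no past events the intensity is just the background rate `mu`.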

arXiv Statistics @arxiv_stats@qoto.org

Bounds on the Generalization Error in Active Learning https://arxiv.org/abs/2409.09078 #stat.ML #cs.LG

We establish empirical risk minimization principles for active learning by deriving a family of upper bounds on the generalization error. Aligning with empirical observations, the bounds suggest that superior query algorithms can be obtained by combining both informativeness and representativeness query strategies, where the latter is assessed using integral probability metrics. To facilitate the use of these bounds in application, we systematically link diverse active learning scenarios, characterized by their loss functions and hypothesis classes, to their corresponding upper bounds. Our results show that regularization techniques used to constrain the complexity of various hypothesis classes are sufficient conditions to ensure the validity of the bounds. The present work enables principled construction and empirical quality-evaluation of query algorithms in active learning.

arXiv Statistics @arxiv_stats@qoto.org

Reducing Shape-Graph Complexity with Application to Classification of Retinal Blood Vessels and Neurons https://arxiv.org/abs/2409.09168 #stat.CO

Shape graphs are complex geometrical structures commonly found in biological and anatomical systems. A shape graph is a collection of nodes, some connected by curvilinear edges with arbitrary shapes. Their high complexity stems from the large number of nodes and edges and the complex shapes of edges. With an eye for statistical analysis, one seeks low-complexity representations that retain as much of the global structures of the original shape graphs as possible. This paper develops a framework for reducing graph complexity using hierarchical clustering procedures that replace groups of nodes and edges with their simpler representatives. It demonstrates this framework using graphs of retinal blood vessels in two dimensions and neurons in three dimensions. The paper also presents experiments on classifications of shape graphs using progressively reduced levels of graph complexity. The accuracy of disease detection in retinal blood vessels drops quickly when the complexity is reduced, with accuracy loss particularly associated with discarding terminal edges. Accuracy in identifying neural cell types remains stable with complexity reduction.

arXiv Statistics @arxiv_stats@qoto.org

Identification of distributions for risks based on the first moment and c-statistic https://arxiv.org/abs/2409.09178 #stat.ME #stat.CO

We show that for any family of distributions with support on [0,1] with strictly monotonic cumulative distribution function (CDF) that has no jumps and is quantile-identifiable (i.e., any two distinct quantiles identify the distribution), knowing the first moment and c-statistic is enough to identify the distribution. The derivations motivate numerical algorithms for mapping a given pair of expected value and c-statistic to the parameters of specified two-parameter distributions for probabilities. We implemented these algorithms in R and in a simulation study evaluated their numerical accuracy for common families of distributions for risks (beta, logit-normal, and probit-normal). An area of application for these developments is in risk prediction modeling (e.g., sample size calculations and Value of Information analysis), where one might need to estimate the parameters of the distribution of predicted risks from the reported summary statistics.
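
To make the forward direction of that mapping concrete, the sketch below estimates the expected value and c-statistic implied by a beta distribution of risks. It is a Monte Carlo illustration in Python, not the authors' R implementation, and it assumes the risks are calibrated, so that the c-statistic is the probability that a random case has a higher risk than a random non-case. The function name and defaults are hypothetical.

```python
import random


def mean_and_c_statistic(a, b, n_pairs=100_000, seed=0):
    """Monte Carlo estimate of (mean, c-statistic) for risks ~ Beta(a, b).

    Assumes calibrated risks: an individual with risk p is a case with
    probability p. Then for independent draws x, y from the risk
    distribution with mean m,
        c = E[x * (1 - y) * 1{x > y}] / (m * (1 - m)).
    """
    rng = random.Random(seed)
    risk_sum = 0.0
    concordant = 0.0
    for _ in range(n_pairs):
        x = rng.betavariate(a, b)  # risk of a candidate case
        y = rng.betavariate(a, b)  # risk of a candidate non-case
        risk_sum += x + y
        if x > y:
            concordant += x * (1.0 - y)
    m = risk_sum / (2 * n_pairs)
    c = (concordant / n_pairs) / (m * (1.0 - m))
    return m, c
```

For uniform risks, Beta(1, 1), the exact values are a mean of 1/2 and a c-statistic of 5/6, which gives a quick sanity check. Inverting the map, from a reported (mean, c-statistic) pair back to `a` and `b`, would need a root-finder wrapped around a function like this one.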

arXiv Statistics @arxiv_stats@qoto.org

Off-Policy Evaluation with Irregularly-Spaced, Outcome-Dependent Observation Times https://arxiv.org/abs/2409.09236 #stat.ME

While the classic off-policy evaluation (OPE) literature commonly assumes decision time points to be evenly spaced for simplicity, in many real-world scenarios, such as those involving user-initiated visits, decisions are made at irregularly-spaced and potentially outcome-dependent time points. For a more principled evaluation of the dynamic policies, this paper constructs a novel OPE framework, which concerns not only the state-action process but also an observation process dictating the time points at which decisions are made. The framework is closely connected to the Markov decision process in computer science and with the renewal process in the statistical literature. Within the framework, two distinct value functions, derived from cumulative reward and integrated reward respectively, are considered, and statistical inference for each value function is developed under revised Markov and time-homogeneous assumptions. The validity of the proposed method is further supported by theoretical results, simulation studies, and a real-world application from electronic health records (EHR) evaluating periodontal disease treatments.

arXiv Statistics @arxiv_stats@qoto.org

Bounding the probability of causality under ordinal outcomes https://arxiv.org/abs/2409.09297 #math.ST #stat.TH

The probability of causation (PC) is often used in liability assessments. In a legal context, for example, where a patient suffered a side effect after taking a medication and sued the pharmaceutical company as a result, the value of the PC can help assess the likelihood that the side effect was caused by the medication, in other words, how likely it is that the patient will win the case. Beyond legal disputes, the PC plays an equally large role in other areas when one wants to explain causal relationships between events that have already occurred. This article begins by reviewing the definitions and bounds of the probability of causality for binary outcomes, then generalizes them to ordinal outcomes. It demonstrates that incorporating additional mediator variable information in a complete mediation analysis provides a more refined bound compared to the simpler scenario where only exposure and outcome variables are considered.
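
For the binary case, a small sketch of what such bounds look like: the code below computes the classical Tian and Pearl bounds on the probability of necessity under an exogeneity (no confounding) assumption. Whether the article reviews exactly these bounds is an assumption, and the function and argument names are hypothetical.

```python
def pn_bounds(p_y_given_x, p_y_given_not_x):
    """Bounds on the probability of necessity for binary exposure and outcome.

    Under exogeneity (Tian and Pearl):
        max(0, 1 - P(y | x') / P(y | x)) <= PN <= min(1, P(y' | x') / P(y | x))

    p_y_given_x     : P(outcome | exposed), must be > 0
    p_y_given_not_x : P(outcome | unexposed)
    """
    if not (0.0 < p_y_given_x <= 1.0 and 0.0 <= p_y_given_not_x <= 1.0):
        raise ValueError("probabilities must lie in [0, 1], with P(y | x) > 0")
    lower = max(0.0, 1.0 - p_y_given_not_x / p_y_given_x)
    upper = min(1.0, (1.0 - p_y_given_not_x) / p_y_given_x)
    return lower, upper
```

With an outcome rate of 0.8 among the exposed and 0.5 among the unexposed, `pn_bounds(0.8, 0.5)` brackets the probability of necessity between 0.375 and 0.625.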

arXiv Statistics @arxiv_stats@qoto.org

Exact Posterior Mean and Covariance for Generalized Linear Mixed Models https://arxiv.org/abs/2409.09310 #stat.ME

A novel method is proposed for the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response does not follow a normal distribution. The research solves a long-standing problem in Bayesian statistics that arises when an intractable integral appears in the posterior distribution. It is well known that the posterior distribution of the random effects given the response in a GLMM contains intractable integrals when the response is not normal. Previous methods rely on Monte Carlo simulations for the posterior distributions. They do not provide the exact posterior mean and covariance of the random effects given the response. The special integral computation (SIC) method is proposed to overcome the difficulty. The SIC method does not use the posterior distribution in the computation. Instead, it formulates an optimization problem to accomplish the task. An advantage is that the computation of the posterior distribution is unnecessary. The proposed SIC avoids the main difficulty in Bayesian analysis when intractable integrals appear in the posterior distribution.

arXiv Statistics @arxiv_stats@qoto.org

A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions https://arxiv.org/abs/2409.09355 #stat.ME #math.ST #stat.TH

Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. A situation of the latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered.

arXiv Statistics @arxiv_stats@qoto.org

Topological Tensor Eigenvalue Theorems in Data Fusion https://arxiv.org/abs/2409.09392 #stat.ML #stat.CO #cs.LG

This paper introduces a novel framework for tensor eigenvalue analysis in the context of multi-modal data fusion, leveraging topological invariants such as Betti numbers. While traditional approaches to tensor eigenvalues rely on algebraic extensions of matrix theory, this work provides a topological perspective that enriches the understanding of tensor structures. By establishing new theorems linking eigenvalues to topological features, the proposed framework offers deeper insights into the latent structure of data, enhancing both interpretability and robustness. Applications to data fusion illustrate the theoretical and practical significance of the approach, demonstrating its potential for broad impact across machine learning and data science domains.

arXiv Statistics @arxiv_stats@qoto.org

Group Sequential Testing of a Treatment Effect Using a Surrogate Marker https://arxiv.org/abs/2409.09440 #stat.ME

The identification of surrogate markers is motivated by their potential to make decisions sooner about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. We examine the performance of our testing procedure using a simulation study and illustrate the method using data from two distinct AIDS clinical trials.

arXiv Statistics @arxiv_stats@qoto.org

Hyperedge Representations with Hypergraph Wavelets: Applications to Spatial Transcriptomics https://arxiv.org/abs/2409.09469 #q-bio.QM #stat.ML #eess.SP #cs.LG

In many data-driven applications, higher-order relationships among multiple objects are essential in capturing complex interactions. Hypergraphs, which generalize graphs by allowing edges to connect any number of nodes, provide a flexible and powerful framework for modeling such higher-order relationships. In this work, we introduce hypergraph diffusion wavelets and describe their favorable spectral and spatial properties. We demonstrate their utility for biomedical discovery in spatially resolved transcriptomics by applying the method to represent disease-relevant cellular niches for Alzheimer's disease.

arXiv Statistics @arxiv_stats@qoto.org

Towards Definition of Higher Order Causality in Complex Systems https://arxiv.org/abs/2409.08295 #physics.data-an #stat.ML #math.IT #cs.IT #cs.LG

The description of the dynamics of complex systems, in particular the capture of the interaction structure and causal relationships between elements of the system, is one of the central questions of interdisciplinary research. While the characterization of pairwise causal interactions is a relatively ripe field with established theoretical concepts and the current focus is on technical issues of their efficient estimation, it turns out that the standard concepts such as Granger causality or transfer entropy may not faithfully reflect possible synergies or interactions of higher orders, phenomena highly relevant for many real-world complex systems. In this paper, we propose a generalization and refinement of the information-theoretic approach to causal inference, enabling the description of truly multivariate, rather than multiple pairwise, causal interactions, and moving thus from causal networks to causal hypernetworks. In particular, while keeping the ability to control for mediating variables or common causes, in case of purely synergetic interactions such as the exclusive disjunction, it ascribes the causal role to the multivariate causal set but not to individual inputs, distinguishing it thus from the case of e.g. two additive univariate causes. We demonstrate this concept by application to illustrative theoretical examples as well as a biophysically realistic simulation of biological neuronal dynamics recently reported to employ synergetic computations.

arXiv Statistics @arxiv_stats@qoto.org

Theoretical guarantees in KL for Diffusion Flow Matching https://arxiv.org/abs/2409.08311 #stat.ML #math.PR #cs.LG

Flow Matching (FM) (also referred to as stochastic interpolants or rectified flows) stands out as a class of generative models that aims to bridge in finite time the target distribution $ν^\star$ with an auxiliary distribution $μ$, leveraging a fixed coupling $π$ and a bridge which can either be deterministic or stochastic. These two ingredients define a path measure which can then be approximated by learning the drift of its Markovian projection. The main contribution of this paper is to provide relatively mild assumptions on $ν^\star$, $μ$ and $π$ to obtain non-asymptotic guarantees for Diffusion Flow Matching (DFM) models using as bridge the conditional distribution associated with the Brownian motion. More precisely, we establish bounds on the Kullback-Leibler divergence between the target distribution and the one generated by such DFM models under moment conditions on the score of $ν^\star$, $μ$ and $π$, and a standard $L^2$-drift-approximation error assumption.

arXiv Statistics @arxiv_stats@qoto.org

Foundation of Calculating Normalized Maximum Likelihood for Continuous Probability Models https://arxiv.org/abs/2409.08387 #math.ST #math.IT #stat.ML #stat.TH #cs.IT

The normalized maximum likelihood (NML) code length is widely used as a model selection criterion based on the minimum description length principle, where the model with the shortest NML code length is selected. A common method to calculate the NML code length is to use the sum (for a discrete model) or integral (for a continuous model) of a function defined by the distribution of the maximum likelihood estimator. While this method has been proven to correctly calculate the NML code length of discrete models, no proof has been provided for continuous cases. Consequently, it has remained unclear whether the method can accurately calculate the NML code length of continuous models. In this paper, we solve this problem affirmatively, proving that the method is also correct for continuous cases. Remarkably, completing the proof for continuous cases is non-trivial in that it cannot be achieved by merely replacing the sums in discrete cases with integrals, as the decomposition trick applied to sums in the discrete model case proof is not applicable to integrals in the continuous model case proof. To overcome this, we introduce a novel decomposition approach based on the coarea formula from geometric measure theory, which is essential to establishing our proof for continuous cases.
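
For the discrete case, where the abstract notes the method is already proven, the calculation can be sketched with a Bernoulli model. The choice of model and the function names are assumptions made for illustration. The normalizer sums, over every possible value of the maximum likelihood estimate k/n, the probability of that estimate under the parameter it selects.

```python
import math


def bernoulli_nml_normalizer(n):
    """Sum over MLE values k/n of P(k successes in n trials; theta = k/n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0.0
    for k in range(n + 1):
        theta_hat = k / n
        # Python evaluates 0.0 ** 0 as 1.0, which handles k = 0 and k = n.
        total += math.comb(n, k) * theta_hat ** k * (1.0 - theta_hat) ** (n - k)
    return total


def bernoulli_nml_code_length(xs):
    """NML code length in bits for a binary sequence xs of 0s and 1s."""
    n = len(xs)
    k = sum(xs)
    theta_hat = k / n
    max_likelihood = theta_hat ** k * (1.0 - theta_hat) ** (n - k)
    return -math.log2(max_likelihood) + math.log2(bernoulli_nml_normalizer(n))
```

For n = 2 the three terms are 1, 0.5, and 1, so the normalizer is 2.5. The continuous analogue replaces this sum with an integral over the density of the estimator, which is the step the paper proves valid.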

arXiv Statistics @arxiv_stats@qoto.org

Federated One-Shot Ensemble Clustering https://arxiv.org/abs/2409.08396 #stat.ML #stat.AP #cs.LG

Cluster analysis across multiple institutions poses significant challenges due to data-sharing restrictions. To overcome these limitations, we introduce the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution tailored for multi-site analyses under such constraints. FONT requires only a single round of communication between sites and ensures privacy by exchanging only fitted model parameters and class labels. The algorithm combines locally fitted clustering models into a data-adaptive ensemble, making it broadly applicable to various clustering techniques and robust to differences in cluster proportions across sites. Our theoretical analysis validates the effectiveness of the data-adaptive weights learned by FONT, and simulation studies demonstrate its superior performance compared to existing benchmark methods. We applied FONT to identify subgroups of patients with rheumatoid arthritis across two health systems, revealing improved consistency of patient clusters across sites, while locally fitted clusters proved less transferable. FONT is particularly well-suited for real-world applications with stringent communication and privacy constraints, offering a scalable and practical solution for multi-site clustering.

arXiv Statistics @arxiv_stats@qoto.org

Improved Finite-Particle Convergence Rates for Stein Variational Gradient Descent https://arxiv.org/abs/2409.08469 #math.ST #math.PR #stat.ML #stat.TH #cs.LG

We provide finite-particle convergence rates for the Stein Variational Gradient Descent (SVGD) algorithm in the Kernel Stein Discrepancy ($\mathsf{KSD}$) and Wasserstein-2 metrics. Our key insight is the observation that the time derivative of the relative entropy between the joint density of $N$ particle locations and the $N$-fold product target measure, starting from a regular initial distribution, splits into a dominant 'negative part' proportional to $N$ times the expected $\mathsf{KSD}^2$ and a smaller 'positive part'. This observation leads to $\mathsf{KSD}$ rates of order $1/\sqrt{N}$, providing a near optimal double exponential improvement over the recent result by \cite{shi2024finite}. Under mild assumptions on the kernel and potential, these bounds also grow linearly in the dimension $d$. By adding a bilinear component to the kernel, the above approach is used to further obtain Wasserstein-2 convergence. For the case of 'bilinear + Matérn' kernels, we derive Wasserstein-2 rates that exhibit a curse-of-dimensionality similar to the i.i.d. setting. We also obtain marginal convergence and long-time propagation of chaos results for the time-averaged particle laws.
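
For readers unfamiliar with the algorithm being analyzed, here is a minimal sketch of the standard SVGD particle update in one dimension. The standard normal target, the Gaussian (RBF) kernel, the fixed bandwidth, and the step size are all assumptions chosen for illustration. The paper's results concern more general kernels, including bilinear and Matérn components, which are not implemented here.

```python
import math


def svgd_gaussian_1d(particles, n_steps=500, step_size=0.1, bandwidth=1.0):
    """Run SVGD on a standard normal target N(0, 1) with an RBF kernel.

    Each particle x_i moves along
        phi(x_i) = (1/N) * sum_j [ k(x_j, x_i) * score(x_j) + d/dx_j k(x_j, x_i) ]
    where score(x) = d/dx log p(x) = -x for the N(0, 1) target.
    """
    xs = list(particles)
    n = len(xs)
    h2 = bandwidth ** 2
    for _ in range(n_steps):
        directions = []
        for xi in xs:
            phi = 0.0
            for xj in xs:
                k = math.exp(-((xj - xi) ** 2) / (2.0 * h2))
                attraction = k * (-xj)           # pulls toward high density
                repulsion = (xi - xj) / h2 * k   # keeps particles spread out
                phi += attraction + repulsion
            directions.append(phi / n)
        # Synchronous update: all directions use the previous particle positions.
        xs = [x + step_size * d for x, d in zip(xs, directions)]
    return xs
```

Starting from particles placed well away from the target, for example 30 points spread over [2, 4], the particle cloud should drift toward zero and spread out to roughly unit variance. The finite-particle rates in the abstract quantify how closely such a cloud approximates the target as $N$ grows.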

arXiv Statistics @arxiv_stats@qoto.org

Optimal Classification-based Anomaly Detection with Neural Networks: Theory and Practice https://arxiv.org/abs/2409.08521 #stat.ML #math.ST #stat.TH #cs.CR #cs.LG

Anomaly detection is an important problem in many application areas, such as network security. Many deep learning methods for unsupervised anomaly detection produce good empirical performance but lack theoretical guarantees. By casting anomaly detection into a binary classification problem, we establish non-asymptotic upper bounds and a convergence rate on the excess risk on rectified linear unit (ReLU) neural networks trained on synthetic anomalies. Our convergence rate on the excess risk matches the minimax optimal rate in the literature. Furthermore, we provide lower and upper bounds on the number of synthetic anomalies that can attain this optimality. For practical implementation, we relax some conditions to improve the search for the empirical risk minimizer, which leads to competitive performance to other classification-based methods for anomaly detection. Overall, our work provides the first theoretical guarantees of unsupervised neural network-based anomaly detectors and empirical insights on how to design them well.

arXiv Statistics @arxiv_stats@qoto.org

Think Twice Before You Act: Improving Inverse Problem Solving With MCMC https://arxiv.org/abs/2409.08551 #stat.ML #cs.LG

Recent studies demonstrate that diffusion models can serve as a strong prior for solving inverse problems. A prominent example is Diffusion Posterior Sampling (DPS), which approximates the posterior distribution of data given the measurement using Tweedie's formula. Despite the merits of being versatile in solving various inverse problems without re-training, the performance of DPS is hindered by the fact that this posterior approximation can be inaccurate, especially for high noise levels. Therefore, we propose Diffusion Posterior MCMC (DPMC), a novel inference algorithm based on Annealed MCMC to solve inverse problems with pretrained diffusion models. We define a series of intermediate distributions inspired by the approximated conditional distributions used by DPS. Through annealed MCMC sampling, we encourage the samples to follow each intermediate distribution more closely before moving to the next distribution at a lower noise level, and therefore reduce the accumulated error along the path. We test our algorithm in various inverse problems, including super resolution, Gaussian deblurring, motion deblurring, inpainting, and phase retrieval. Our algorithm outperforms DPS with fewer evaluations across nearly all tasks, and is competitive among existing approaches.

arXiv Statistics @arxiv_stats@qoto.org

On the maximal correlation coefficient for the bivariate Marshall Olkin distribution https://arxiv.org/abs/2409.08661 #math.ST #stat.TH

We prove a formula for the maximal correlation coefficient of the bivariate Marshall Olkin distribution that was conjectured in Lin, Lai, and Govindaraju (2016, Stat. Methodol., 29:1-9). The formula is applied to obtain a new proof for a variance inequality in extreme value statistics that links the disjoint and the sliding block maxima method.

arXiv Statistics @arxiv_stats@qoto.org

On spiked eigenvalues of a renormalized sample covariance matrix from multi-population https://arxiv.org/abs/2409.08715 #math.ST #stat.TH

Sample covariance matrices from multi-population typically exhibit several large spiked eigenvalues, which stem from differences between population means and are crucial for inference on the underlying data structure. This paper investigates the asymptotic properties of spiked eigenvalues of a renormalized sample covariance matrix from multi-population in the ultrahigh dimensional context where the dimension-to-sample size ratio $p/n$ goes to infinity. The first- and second-order convergence of these spikes is established based on asymptotic properties of three types of sesquilinear forms from multi-population. These findings are further applied to two scenarios, including determination of the total number of subgroups and a new criterion for evaluating clustering results in the absence of true labels. Additionally, we provide a unified framework with $p/n \to c \in (0,\infty]$ that integrates the asymptotic results in both high and ultrahigh dimensional settings.
