Efficient Analysis of Latent Spaces in Heterogeneous Networks arxiv.org/abs/2412.02151 .ME

Searching for local associations while controlling the false discovery rate arxiv.org/abs/2412.02182 .ME

Selective Reviews of Bandit Problems in AI via a Statistical View arxiv.org/abs/2412.02251 .ML .EM .PR .AI .LG

The Relative Information Generating Function-A Quantile Approach arxiv.org/abs/2412.02253 .ST .TH

Scalable computation of the maximum flow in large brain connectivity networks arxiv.org/abs/2412.00106 .ME

We are interested in computing an approximation of the maximum flow in large (brain) connectivity networks. The maximum flow in such networks is of interest in order to better understand the routing of information in the human brain. However, the runtime of $O(|V||E|^2)$ for the classic Edmonds-Karp algorithm renders computations of the maximum flow on networks with millions of vertices infeasible, where $V$ is the set of vertices and $E$ is the set of edges. In this contribution, we propose a new Monte Carlo algorithm which is capable of computing an approximation of the maximum flow in networks with millions of vertices via subsampling. Apart from giving a point estimate of the maximum flow, our algorithm also returns valid confidence bounds for the true maximum flow. Importantly, its runtime only scales as $O(B \cdot |\tilde{V}| |\tilde{E}|^2)$, where $B$ is the number of Monte Carlo samples, $\tilde{V}$ is the set of subsampled vertices, and $\tilde{E}$ is the edge set induced by $\tilde{V}$. Choosing $B \in O(|V|)$ and $|\tilde{V}| \in O(\sqrt{|V|})$ (implying $|\tilde{E}| \in O(|V|)$) yields an algorithm with runtime $O(|V|^{3.5})$ while still guaranteeing the usual "root-n" convergence of the confidence interval of the maximum flow estimate. We evaluate our proposed algorithm with respect to both accuracy and runtime on simulated graphs as well as graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com).
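
The subsample-and-solve idea is easy to prototype. The sketch below is a minimal illustration under assumptions not taken from the paper (a networkx graph with 'capacity' edge attributes, a fixed source and sink, and a plain normal-approximation interval over the Monte Carlo replicates); the paper's actual estimator of the full-graph maximum flow and its validity guarantees are more refined than this.

```python
# Minimal sketch of max-flow estimation by vertex subsampling (illustrative,
# not the paper's algorithm). Assumes G is a networkx (di)graph whose edges
# carry a 'capacity' attribute and that source/sink are fixed in advance.
import math
import random
import networkx as nx

def subsampled_max_flow(G, source, sink, n_sub, B, seed=0):
    rng = random.Random(seed)
    others = [v for v in G.nodes if v not in (source, sink)]
    flows = []
    for _ in range(B):
        keep = rng.sample(others, n_sub) + [source, sink]
        H = G.subgraph(keep)                      # induced subgraph on ~n_sub vertices
        if nx.has_path(H, source, sink):
            flows.append(nx.maximum_flow_value(H, source, sink))
        else:
            flows.append(0.0)                     # disconnected subsample: no s-t flow
    mean = sum(flows) / B
    sd = math.sqrt(sum((f - mean) ** 2 for f in flows) / (B - 1))
    half = 1.96 * sd / math.sqrt(B)               # 95% normal-approximation half-width
    return mean, (mean - half, mean + half)       # estimate and interval at the
                                                  # subsample level, not rescaled to |V|
```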

A Doubly Robust Method to Counteract Outcome-Dependent Selection Bias in Multi-Cohort EHR Studies arxiv.org/abs/2412.00228 .ME

Selection bias can hinder accurate estimation of association parameters in binary disease risk models using non-probability samples like electronic health records (EHRs). The issue is compounded when participants are recruited from multiple clinics or centers with varying selection mechanisms that may depend on the disease or outcome of interest. Traditional inverse-probability-weighted (IPW) methods, based on constructed parametric selection models, often struggle with misspecifications when selection mechanisms vary across cohorts. This paper introduces a new Joint Augmented Inverse Probability Weighted (JAIPW) method, which integrates individual-level data from multiple cohorts collected under potentially outcome-dependent selection mechanisms, with data from an external probability sample. JAIPW offers double robustness by incorporating a flexible auxiliary score model to address potential misspecifications in the selection models. We outline the asymptotic properties of the JAIPW estimator, and our simulations reveal that JAIPW achieves up to five times lower relative bias and three times lower root mean square error (RMSE) compared to the best performing joint IPW methods under scenarios with misspecified selection models. Applying JAIPW to the Michigan Genomics Initiative (MGI), a multi-clinic EHR-linked biobank, combined with external national probability samples, resulted in cancer-sex association estimates more closely aligned with national estimates. We also analyzed the association between cancer and polygenic risk scores (PRS) in MGI to illustrate a situation where the exposure is not available in the external probability sample.
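
As background on the double robustness being claimed, the classical augmented IPW estimator of a mean under selection illustrates the mechanism; this is the textbook form, not the paper's JAIPW estimator. With selection indicator $R$, selection probability $\pi(X)$, and auxiliary outcome model $m(X)$:

```latex
% Textbook AIPW estimator of E[Y]; it remains consistent if either pi(.) or
% m(.) is correctly specified, which is the sense of "doubly robust" above.
\hat{\mu}_{\mathrm{AIPW}}
  = \frac{1}{n}\sum_{i=1}^{n}\left[
      \frac{R_i Y_i}{\pi(X_i)}
      - \frac{R_i - \pi(X_i)}{\pi(X_i)}\, m(X_i)
    \right]
```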

Benchmarking covariates balancing methods, a simulation study arxiv.org/abs/2412.00280 .ME .AP

Causal inference in observational studies has advanced significantly since Rosenbaum and Rubin introduced propensity score methods. Inverse probability of treatment weighting (IPTW) is widely used to handle confounding bias. However, newer methods, such as energy balancing (EB), kernel optimal matching (KOM), and covariate balancing propensity scores via tailored loss functions (TLF), offer model-free or non-parametric alternatives. Despite these developments, guidance remains limited on selecting the most suitable method for treatment effect estimation in practical applications. This study compares IPTW with EB, KOM, and TLF, focusing on their ability to estimate treatment effects, since this is the primary objective in many applications. Monte Carlo simulations are used to assess how well these balancing methods, combined with different estimators, estimate the average treatment effect. We compare these methods across a range of scenarios varying the sample size, the level of confounding, and the proportion of treated units. In our simulations, we observe no significant advantage in using EB, KOM, or TLF over IPTW. Moreover, these recent methods make it difficult to obtain confidence intervals with nominal coverage. We also compare the methods on the PROBITsim dataset and obtain results similar to those of our simulations.
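
For readers who want to see the baseline being benchmarked, the following is a minimal IPTW estimate of the average treatment effect on simulated data (the data-generating values and the Hajek-style weighting are illustrative assumptions, not the paper's simulation design; EB, KOM, and TLF replace the logistic propensity model with other weighting schemes).

```python
# Minimal IPTW sketch: logistic-regression propensity scores and a weighted
# difference in means (Hajek-style) for the ATE. The data-generating values
# below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))                                   # confounders
p_treat = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.25 * X[:, 1])))
A = rng.binomial(1, p_treat)                                  # treatment indicator
Y = 1.0 * A + X @ np.array([0.3, -0.2, 0.1]) + rng.normal(size=n)  # true ATE = 1

ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]    # estimated propensity scores
w = A / ps + (1 - A) / (1 - ps)                               # inverse-probability weights
ate = (np.average(Y[A == 1], weights=w[A == 1])
       - np.average(Y[A == 0], weights=w[A == 0]))
print(f"IPTW ATE estimate: {ate:.3f}")                        # should be close to 1
```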

Sparse Bayesian Factor Models with Mass-Nonlocal Factor Scores arxiv.org/abs/2412.00304 .ME

Bayesian factor models are widely used for dimensionality reduction and pattern discovery in high-dimensional datasets across diverse fields. These models typically focus on imposing priors on the factor loadings to induce sparsity and improve interpretability. The factor scores, which play a critical role in individual-level associations with the factors, have received less attention and are usually assumed to follow a standard multivariate normal distribution. This oversimplification fails to capture the heterogeneity observed in real-world applications. We propose the Sparse Bayesian Factor Model with Mass-Nonlocal Factor Scores (BFMAN), a novel framework that addresses these limitations by introducing a mass-nonlocal prior for the factor scores. This prior yields a more flexible posterior distribution that captures individual heterogeneity while assigning positive probability to exact zeros. The zero entries in the score matrix characterize the sparsity, offering a robust and novel approach for determining the optimal number of factors. Model parameters are estimated using a fast and efficient Gibbs sampler. Extensive simulations demonstrate that BFMAN outperforms standard Bayesian sparse factor models in factor recovery, sparsity detection, and score estimation. We apply BFMAN to the Hispanic Community Health Study/Study of Latinos and identify dietary patterns and their associations with cardiovascular outcomes, showcasing the model's ability to uncover meaningful insights into diet.
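
Schematically, the model class being modified is the standard sparse Bayesian factor model; the display below uses generic notation, not the paper's, and only indicates where the mass-nonlocal prior enters.

```latex
% Generic Bayesian factor model: BFMAN replaces the standard N(0, I) prior on
% the scores eta_i with a mass-nonlocal prior that puts positive probability
% on exact zeros, so sparsity in the score matrix is modeled directly.
y_i = \Lambda \eta_i + \varepsilon_i, \qquad
\varepsilon_i \sim \mathcal{N}(0, \Sigma), \qquad
\eta_i \sim \mathcal{N}(0, I_k) \ \ \text{(standard assumption)}
```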

Nonlinearity and Uncertainty Informed Moment-Matching Gaussian Mixture Splitting arxiv.org/abs/2412.00343 .ML .SP .LG

Many problems in navigation and tracking require increasingly accurate characterizations of the evolution of uncertainty in nonlinear systems. Nonlinear uncertainty propagation approaches based on Gaussian mixture density approximations offer distinct advantages over sampling-based methods in their computational cost and continuous representation. State-of-the-art Gaussian mixture approaches are adaptive in that individual Gaussian mixands are selectively split into mixtures to yield better approximations of the true propagated distribution. Despite the importance of the splitting process to accuracy and computational efficiency, relatively little work has been devoted to mixand selection and splitting direction optimization. The first part of this work presents splitting methods that preserve the mean and covariance of the original distribution. Then, we present and compare a number of novel heuristics for selecting the splitting direction. The choice of splitting direction is informed by the initial uncertainty distribution, properties of the nonlinear function through which the original distribution is propagated, and a whitening-based natural scaling method that avoids dependence of the splitting direction on the scaling of coordinates. We compare these novel heuristics to existing techniques in three distinct examples involving the Cartesian-to-polar coordinate transformation, Keplerian orbital element propagation, and uncertainty propagation in the circular restricted three-body problem.
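
To make "moment-matching splitting" concrete, the sketch below implements an equal-weight two-component split along a chosen direction; this is a standard construction for illustration only, not the specific split libraries or selection heuristics studied in the paper.

```python
# Equal-weight, two-component, moment-matching split of N(mu, Sigma) along a
# unit direction u (standard construction, for illustration). delta in (0, 1)
# trades component separation against covariance shrinkage along u; it should
# be small enough that the shrunken covariance stays positive semidefinite.
import numpy as np

def split_gaussian(mu, Sigma, u, delta=0.5):
    u = u / np.linalg.norm(u)
    s = np.sqrt(u @ Sigma @ u)                  # std. deviation of N(mu, Sigma) along u
    shift = delta * s * u
    Sigma_c = Sigma - np.outer(shift, shift)    # shrink along u to offset the mean shifts
    weights = np.array([0.5, 0.5])
    means = np.stack([mu + shift, mu - shift])
    return weights, means, np.stack([Sigma_c, Sigma_c])

# Check that the mixture's first two moments match the original Gaussian.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
w, m, S = split_gaussian(mu, Sigma, np.array([1.0, 1.0]))
mix_mean = (w[:, None] * m).sum(axis=0)
mix_cov = sum(wk * (Sk + np.outer(mk - mix_mean, mk - mix_mean))
              for wk, mk, Sk in zip(w, m, S))
assert np.allclose(mix_mean, mu) and np.allclose(mix_cov, Sigma)
```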

Functional worst risk minimization arxiv.org/abs/2412.00412 .ST .PR .TH

The aim of this paper is to extend worst risk minimization, also called worst average loss minimization, to the functional realm. This means finding a functional regression representation that will be robust to future distribution shifts on the basis of data from two environments. In the classical non-functional realm, structural equations are based on a transfer matrix $B$. In section~\ref{sec:sfr}, we generalize this to consider a linear operator $\mathcal{T}$ on square integrable processes that plays the part of $B$. By requiring that $(I-\mathcal{T})^{-1}$ is bounded -- as opposed to $\mathcal{T}$ itself -- this allows a large class of unbounded operators to be considered. Section~\ref{sec:worstrisk} considers two separate cases that both lead to the same worst-risk decomposition. Remarkably, this decomposition has the same structure as in the non-functional case. We consider any operator $\mathcal{T}$ that makes $(I-\mathcal{T})^{-1}$ bounded and define the future shift set in terms of the covariance functions of the shifts. In section~\ref{sec:minimizer}, we prove a necessary and sufficient condition for the existence of a minimizer of this worst risk in the space of square integrable kernels. Previously, such minimizers were expressed in terms of the unknown eigenfunctions of the target and covariate integral operators (see for instance \cite{HeMullerWang} and \cite{YaoAOS}). This means that in order to estimate the minimizer, one must first estimate these unknown eigenfunctions. In contrast, the solution provided here is expressed in an arbitrary ON-basis, which completely removes the need to estimate eigenfunctions. This pays dividends in section~\ref{sec:estimation}, where we provide a family of estimators that are consistent, with a large-sample bound. Proofs of all the results are provided in the appendix.
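
Schematically, the generalization described above replaces the finite-dimensional structural equation with an operator equation; the exact form shown here is an assumption for illustration, following the usual convention rather than the paper's definitions.

```latex
% Classical structural equation with transfer matrix B, and its functional
% analogue with a linear operator T on square-integrable processes; only
% (I - T)^{-1} is required to be bounded, not T itself.
X = BX + \varepsilon
\quad\longrightarrow\quad
X = \mathcal{T}X + \varepsilon,
\qquad (I - \mathcal{T})^{-1}\ \text{bounded}
```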

Defective regression models for cure rate modeling in Marshall-Olkin family arxiv.org/abs/2411.17841 .ME .AP

Regression models have a substantial impact on the interpretation of treatments, genetic characteristics, and other covariates in survival analysis. In many datasets, the pattern of censoring and the survival curve reveal the presence of a cure fraction, which calls for alternative modelling. The most common approach to introducing covariates under parametric estimation is through cure rate models and their variations, although the use of defective distributions offers a more parsimonious and integrated alternative. A defective distribution is given by a density function whose integral is not one after changing the domain of one of its parameters. In this work, we introduce two new defective regression models for long-term survival data in the Marshall-Olkin family: the Marshall-Olkin Gompertz and the Marshall-Olkin inverse Gaussian. The estimation process is conducted using maximum likelihood estimation and Bayesian inference. We evaluate the asymptotic properties of the classical approach in Monte Carlo studies, as well as the behavior of the Bayes estimates under vague prior information. The application of both models under classical and Bayesian inference is provided in an experiment of time until death from colon cancer with a dichotomous covariate. The Marshall-Olkin Gompertz regression presented the best fit, and we provide global diagnostics and residual analysis for this proposal.
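
To see what "defective" means in the Gompertz case, consider the baseline (non-Marshall-Olkin) distribution in its usual shape-rate parametrization; this standard illustration is background, not material from the paper.

```latex
% Gompertz survival function: for shape alpha < 0 the distribution is
% defective, and the limiting survival probability is the cure fraction.
S(t) = \exp\!\left\{-\frac{\beta}{\alpha}\left(e^{\alpha t} - 1\right)\right\},
\qquad
\lim_{t \to \infty} S(t) = \exp\!\left(\frac{\beta}{\alpha}\right) \in (0, 1)
\quad \text{for } \alpha < 0,\ \beta > 0.
```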

GeneralizIT: A Python Solution for Generalizability Theory Computations arxiv.org/abs/2411.17880 .AP

GeneralizIT is a Python package designed to streamline the application of Generalizability Theory (G-Theory) in research and practice. G-Theory extends classical test theory by estimating multiple sources of error variance, providing a more flexible and detailed approach to reliability assessment. Despite its advantages, G-Theory's complexity can present a significant barrier to researchers. GeneralizIT addresses this challenge by offering an intuitive, user-friendly way to calculate variance components, generalizability coefficients ($E\rho^2$) and dependability coefficients ($\Phi$), and to perform decision (D) studies. D-studies allow users to evaluate potential study designs and target improvements in the reliability of specific facets. The package supports both fully crossed and nested designs, enabling users to perform in-depth reliability analysis with minimal coding effort. With built-in visualization tools and detailed reporting functions, GeneralizIT empowers researchers across disciplines such as education, psychology, healthcare, and the social sciences to harness the power of G-Theory for robust, evidence-based insights. Whether applied to small or large datasets, GeneralizIT offers an accessible and computationally efficient solution for improving measurement reliability in complex data environments.
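
For reference, in the simplest one-facet crossed $p \times i$ design the two coefficients mentioned above are defined as follows; these are the standard G-theory definitions, given only as background, so consult the package documentation for its exact conventions.

```latex
% Relative (generalizability) and absolute (dependability) coefficients for a
% p x i crossed design with n_i conditions of the facet per person.
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pi}/n_i},
\qquad
\Phi = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \left(\sigma^{2}_{i} + \sigma^{2}_{pi}\right)/n_i}
```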
