Efficient Analysis of Latent Spaces in Heterogeneous Networks arxiv.org/abs/2412.02151 .ME

Searching for local associations while controlling the false discovery rate arxiv.org/abs/2412.02182 .ME

Selective Reviews of Bandit Problems in AI via a Statistical View arxiv.org/abs/2412.02251 .ML .EM .PR .AI .LG

The Relative Information Generating Function-A Quantile Approach arxiv.org/abs/2412.02253 .ST .TH

Scalable computation of the maximum flow in large brain connectivity networks arxiv.org/abs/2412.00106 .ME

We are interested in computing an approximation of the maximum flow in large (brain) connectivity networks. The maximum flow in such networks is of interest in order to better understand the routing of information in the human brain. However, the runtime of $O(|V||E|^2)$ for the classic Edmonds-Karp algorithm renders computations of the maximum flow on networks with millions of vertices infeasible, where $V$ is the set of vertices and $E$ is the set of edges. In this contribution, we propose a new Monte Carlo algorithm which is capable of computing an approximation of the maximum flow in networks with millions of vertices via subsampling. Apart from giving a point estimate of the maximum flow, our algorithm also returns valid confidence bounds for the true maximum flow. Importantly, its runtime only scales as $O(B \cdot |\tilde{V}| |\tilde{E}|^2)$, where $B$ is the number of Monte Carlo samples, $\tilde{V}$ is the set of subsampled vertices, and $\tilde{E}$ is the edge set induced by $\tilde{V}$. Choosing $B \in O(|V|)$ and $|\tilde{V}| \in O(\sqrt{|V|})$ (implying $|\tilde{E}| \in O(|V|)$) yields an algorithm with runtime $O(|V|^{3.5})$ while still guaranteeing the usual "root-n" convergence of the confidence interval of the maximum flow estimate. We evaluate our proposed algorithm with respect to both accuracy and runtime on simulated graphs as well as graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com).
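
The subsample-and-solve idea is easy to prototype. The sketch below is a minimal illustration under assumptions not taken from the paper (a networkx graph with 'capacity' edge attributes, a fixed source and sink, and a plain normal-approximation interval over the Monte Carlo replicates); the paper's actual estimator of the full-graph maximum flow and its validity guarantees are more refined than this.

```python
# Minimal sketch of max-flow estimation by vertex subsampling (illustrative,
# not the paper's algorithm). Assumes G is a networkx (di)graph whose edges
# carry a 'capacity' attribute and that source/sink are fixed in advance.
import math
import random
import networkx as nx

def subsampled_max_flow(G, source, sink, n_sub, B, seed=0):
    rng = random.Random(seed)
    others = [v for v in G.nodes if v not in (source, sink)]
    flows = []
    for _ in range(B):
        keep = rng.sample(others, n_sub) + [source, sink]
        H = G.subgraph(keep)                      # induced subgraph on ~n_sub vertices
        if nx.has_path(H, source, sink):
            flows.append(nx.maximum_flow_value(H, source, sink))
        else:
            flows.append(0.0)                     # disconnected subsample: no s-t flow
    mean = sum(flows) / B
    sd = math.sqrt(sum((f - mean) ** 2 for f in flows) / (B - 1))
    half = 1.96 * sd / math.sqrt(B)               # 95% normal-approximation half-width
    return mean, (mean - half, mean + half)       # estimate and interval at the
                                                  # subsample level, not rescaled to |V|
```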

A Doubly Robust Method to Counteract Outcome-Dependent Selection Bias in Multi-Cohort EHR Studies arxiv.org/abs/2412.00228 .ME

Selection bias can hinder accurate estimation of association parameters in binary disease risk models using non-probability samples like electronic health records (EHRs). The issue is compounded when participants are recruited from multiple clinics or centers with varying selection mechanisms that may depend on the disease or outcome of interest. Traditional inverse-probability-weighted (IPW) methods, based on constructed parametric selection models, often struggle with misspecifications when selection mechanisms vary across cohorts. This paper introduces a new Joint Augmented Inverse Probability Weighted (JAIPW) method, which integrates individual-level data from multiple cohorts collected under potentially outcome-dependent selection mechanisms, with data from an external probability sample. JAIPW offers double robustness by incorporating a flexible auxiliary score model to address potential misspecifications in the selection models. We outline the asymptotic properties of the JAIPW estimator, and our simulations reveal that JAIPW achieves up to five times lower relative bias and three times lower root mean square error (RMSE) compared to the best performing joint IPW methods under scenarios with misspecified selection models. Applying JAIPW to the Michigan Genomics Initiative (MGI), a multi-clinic EHR-linked biobank, combined with external national probability samples, resulted in cancer-sex association estimates more closely aligned with national estimates. We also analyzed the association between cancer and polygenic risk scores (PRS) in MGI to illustrate a situation where the exposure is not available in the external probability sample.
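
As background on the double robustness being claimed, the classical augmented IPW estimator of a mean under selection illustrates the mechanism; this is the textbook form, not the paper's JAIPW estimator. With selection indicator $R$, selection probability $\pi(X)$, and auxiliary outcome model $m(X)$:

```latex
% Textbook AIPW estimator of E[Y]; it remains consistent if either pi(.) or
% m(.) is correctly specified, which is the sense of "doubly robust" above.
\hat{\mu}_{\mathrm{AIPW}}
  = \frac{1}{n}\sum_{i=1}^{n}\left[
      \frac{R_i Y_i}{\pi(X_i)}
      - \frac{R_i - \pi(X_i)}{\pi(X_i)}\, m(X_i)
    \right]
```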

Benchmarking covariates balancing methods, a simulation study arxiv.org/abs/2412.00280 .ME .AP

Causal inference in observational studies has advanced significantly since Rosenbaum and Rubin introduced propensity score methods. Inverse probability of treatment weighting (IPTW) is widely used to handle confounding bias. However, newer methods, such as energy balancing (EB), kernel optimal matching (KOM), and covariate balancing propensity scores via tailored loss functions (TLF), offer model-free or non-parametric alternatives. Despite these developments, guidance remains limited on selecting the most suitable method for treatment effect estimation in practical applications. This study compares IPTW with EB, KOM, and TLF, focusing on their ability to estimate treatment effects, since this is the primary objective in many applications. Monte Carlo simulations are used to assess how well these balancing methods, combined with different estimators, estimate the average treatment effect. We compare these methods across a range of scenarios varying the sample size, the level of confounding, and the proportion of treated units. In our simulations, we observe no significant advantage in using EB, KOM, or TLF over IPTW. Moreover, these recent methods make it difficult to obtain confidence intervals with nominal coverage. We also compare the methods on the PROBITsim dataset and obtain results similar to those of our simulations.
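
For readers who want to see the baseline being benchmarked, the following is a minimal IPTW estimate of the average treatment effect on simulated data (the data-generating values and the Hajek-style weighting are illustrative assumptions, not the paper's simulation design; EB, KOM, and TLF replace the logistic propensity model with other weighting schemes).

```python
# Minimal IPTW sketch: logistic-regression propensity scores and a weighted
# difference in means (Hajek-style) for the ATE. The data-generating values
# below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))                                   # confounders
p_treat = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.25 * X[:, 1])))
A = rng.binomial(1, p_treat)                                  # treatment indicator
Y = 1.0 * A + X @ np.array([0.3, -0.2, 0.1]) + rng.normal(size=n)  # true ATE = 1

ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]    # estimated propensity scores
w = A / ps + (1 - A) / (1 - ps)                               # inverse-probability weights
ate = (np.average(Y[A == 1], weights=w[A == 1])
       - np.average(Y[A == 0], weights=w[A == 0]))
print(f"IPTW ATE estimate: {ate:.3f}")                        # should be close to 1
```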

Sparse Bayesian Factor Models with Mass-Nonlocal Factor Scores arxiv.org/abs/2412.00304 .ME

Bayesian factor models are widely used for dimensionality reduction and pattern discovery in high-dimensional datasets across diverse fields. These models typically focus on imposing priors on the factor loadings to induce sparsity and improve interpretability. The factor scores, which play a critical role in individual-level associations with the factors, have received less attention and are usually assumed to follow a standard multivariate normal distribution. This oversimplification fails to capture the heterogeneity observed in real-world applications. We propose the Sparse Bayesian Factor Model with Mass-Nonlocal Factor Scores (BFMAN), a novel framework that addresses these limitations by introducing a mass-nonlocal prior for the factor scores. This prior yields a more flexible posterior distribution that captures individual heterogeneity while assigning positive probability to exact zeros. The zero entries in the score matrix characterize the sparsity, offering a robust and novel approach for determining the optimal number of factors. Model parameters are estimated using a fast and efficient Gibbs sampler. Extensive simulations demonstrate that BFMAN outperforms standard Bayesian sparse factor models in factor recovery, sparsity detection, and score estimation. We apply BFMAN to the Hispanic Community Health Study/Study of Latinos and identify dietary patterns and their associations with cardiovascular outcomes, showcasing the model's ability to uncover meaningful insights into diet.
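
Schematically, the model class being modified is the standard sparse Bayesian factor model; the display below uses generic notation, not the paper's, and only indicates where the mass-nonlocal prior enters.

```latex
% Generic Bayesian factor model: BFMAN replaces the standard N(0, I) prior on
% the scores eta_i with a mass-nonlocal prior that puts positive probability
% on exact zeros, so sparsity in the score matrix is modeled directly.
y_i = \Lambda \eta_i + \varepsilon_i, \qquad
\varepsilon_i \sim \mathcal{N}(0, \Sigma), \qquad
\eta_i \sim \mathcal{N}(0, I_k) \ \ \text{(standard assumption)}
```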

Nonlinearity and Uncertainty Informed Moment-Matching Gaussian Mixture Splitting arxiv.org/abs/2412.00343 .ML .SP .LG

Many problems in navigation and tracking require increasingly accurate characterizations of the evolution of uncertainty in nonlinear systems. Nonlinear uncertainty propagation approaches based on Gaussian mixture density approximations offer distinct advantages over sampling-based methods in their computational cost and continuous representation. State-of-the-art Gaussian mixture approaches are adaptive in that individual Gaussian mixands are selectively split into mixtures to yield better approximations of the true propagated distribution. Despite the importance of the splitting process to accuracy and computational efficiency, relatively little work has been devoted to mixand selection and splitting direction optimization. The first part of this work presents splitting methods that preserve the mean and covariance of the original distribution. Then, we present and compare a number of novel heuristics for selecting the splitting direction. The choice of splitting direction is informed by the initial uncertainty distribution, properties of the nonlinear function through which the original distribution is propagated, and a whitening-based natural scaling method that avoids dependence of the splitting direction on the scaling of coordinates. We compare these novel heuristics to existing techniques in three distinct examples involving the Cartesian-to-polar coordinate transformation, Keplerian orbital element propagation, and uncertainty propagation in the circular restricted three-body problem.
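
To make "moment-matching splitting" concrete, the sketch below implements an equal-weight two-component split along a chosen direction; this is a standard construction for illustration only, not the specific split libraries or selection heuristics studied in the paper.

```python
# Equal-weight, two-component, moment-matching split of N(mu, Sigma) along a
# unit direction u (standard construction, for illustration). delta in (0, 1)
# trades component separation against covariance shrinkage along u; it should
# be small enough that the shrunken covariance stays positive semidefinite.
import numpy as np

def split_gaussian(mu, Sigma, u, delta=0.5):
    u = u / np.linalg.norm(u)
    s = np.sqrt(u @ Sigma @ u)                  # std. deviation of N(mu, Sigma) along u
    shift = delta * s * u
    Sigma_c = Sigma - np.outer(shift, shift)    # shrink along u to offset the mean shifts
    weights = np.array([0.5, 0.5])
    means = np.stack([mu + shift, mu - shift])
    return weights, means, np.stack([Sigma_c, Sigma_c])

# Check that the mixture's first two moments match the original Gaussian.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
w, m, S = split_gaussian(mu, Sigma, np.array([1.0, 1.0]))
mix_mean = (w[:, None] * m).sum(axis=0)
mix_cov = sum(wk * (Sk + np.outer(mk - mix_mean, mk - mix_mean))
              for wk, mk, Sk in zip(w, m, S))
assert np.allclose(mix_mean, mu) and np.allclose(mix_cov, Sigma)
```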

Functional worst risk minimization arxiv.org/abs/2412.00412 .ST .PR .TH

The aim of this paper is to extend worst risk minimization, also called worst average loss minimization, to the functional realm. This means finding a functional regression representation that will be robust to future distribution shifts on the basis of data from two environments. In the classical non-functional realm, structural equations are based on a transfer matrix $B$. In section~\ref{sec:sfr}, we generalize this to consider a linear operator $\mathcal{T}$ on square integrable processes that plays the part of $B$. By requiring that $(I-\mathcal{T})^{-1}$ is bounded -- as opposed to $\mathcal{T}$ itself -- this allows a large class of unbounded operators to be considered. Section~\ref{sec:worstrisk} considers two separate cases that both lead to the same worst-risk decomposition. Remarkably, this decomposition has the same structure as in the non-functional case. We consider any operator $\mathcal{T}$ that makes $(I-\mathcal{T})^{-1}$ bounded and define the future shift set in terms of the covariance functions of the shifts. In section~\ref{sec:minimizer}, we prove a necessary and sufficient condition for the existence of a minimizer of this worst risk in the space of square integrable kernels. Previously, such minimizers were expressed in terms of the unknown eigenfunctions of the target and covariate integral operators (see for instance \cite{HeMullerWang} and \cite{YaoAOS}). This means that in order to estimate the minimizer, one must first estimate these unknown eigenfunctions. In contrast, the solution provided here is expressed in an arbitrary ON-basis, which completely removes the need to estimate eigenfunctions. This pays dividends in section~\ref{sec:estimation}, where we provide a family of estimators that are consistent, with a large-sample bound. Proofs of all the results are provided in the appendix.
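
Schematically, the generalization described above replaces the finite-dimensional structural equation with an operator equation; the exact form shown here is an assumption for illustration, following the usual convention rather than the paper's definitions.

```latex
% Classical structural equation with transfer matrix B, and its functional
% analogue with a linear operator T on square-integrable processes; only
% (I - T)^{-1} is required to be bounded, not T itself.
X = BX + \varepsilon
\quad\longrightarrow\quad
X = \mathcal{T}X + \varepsilon,
\qquad (I - \mathcal{T})^{-1}\ \text{bounded}
```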

Defective regression models for cure rate modeling in Marshall-Olkin family arxiv.org/abs/2411.17841 .ME .AP

Regression models have a substantial impact on the interpretation of treatments, genetic characteristics, and other covariates in survival analysis. In many datasets, the pattern of censoring and the survival curve reveal the presence of a cure fraction, which calls for alternative modelling. The most common approach to introducing covariates under parametric estimation is through cure rate models and their variations, although the use of defective distributions offers a more parsimonious and integrated alternative. A defective distribution is given by a density function whose integral is not one after changing the domain of one of its parameters. In this work, we introduce two new defective regression models for long-term survival data in the Marshall-Olkin family: the Marshall-Olkin Gompertz and the Marshall-Olkin inverse Gaussian. The estimation process is conducted using maximum likelihood estimation and Bayesian inference. We evaluate the asymptotic properties of the classical approach in Monte Carlo studies, as well as the behavior of the Bayes estimates under vague prior information. The application of both models under classical and Bayesian inference is provided in an experiment of time until death from colon cancer with a dichotomous covariate. The Marshall-Olkin Gompertz regression presented the best fit, and we provide global diagnostics and residual analysis for this proposal.
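
To see what "defective" means in the Gompertz case, consider the baseline (non-Marshall-Olkin) distribution in its usual shape-rate parametrization; this standard illustration is background, not material from the paper.

```latex
% Gompertz survival function: for shape alpha < 0 the distribution is
% defective, and the limiting survival probability is the cure fraction.
S(t) = \exp\!\left\{-\frac{\beta}{\alpha}\left(e^{\alpha t} - 1\right)\right\},
\qquad
\lim_{t \to \infty} S(t) = \exp\!\left(\frac{\beta}{\alpha}\right) \in (0, 1)
\quad \text{for } \alpha < 0,\ \beta > 0.
```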

GeneralizIT: A Python Solution for Generalizability Theory Computations arxiv.org/abs/2411.17880 .AP

GeneralizIT is a Python package designed to streamline the application of Generalizability Theory (G-Theory) in research and practice. G-Theory extends classical test theory by estimating multiple sources of error variance, providing a more flexible and detailed approach to reliability assessment. Despite its advantages, G-Theory's complexity can present a significant barrier to researchers. GeneralizIT addresses this challenge by offering an intuitive, user-friendly way to calculate variance components, generalizability coefficients ($E\rho^2$) and dependability coefficients ($\Phi$), and to perform decision (D) studies. D-studies allow users to evaluate potential study designs and target improvements in the reliability of specific facets. The package supports both fully crossed and nested designs, enabling users to perform in-depth reliability analysis with minimal coding effort. With built-in visualization tools and detailed reporting functions, GeneralizIT empowers researchers across disciplines such as education, psychology, healthcare, and the social sciences to harness the power of G-Theory for robust, evidence-based insights. Whether applied to small or large datasets, GeneralizIT offers an accessible and computationally efficient solution for improving measurement reliability in complex data environments.
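
For reference, in the simplest one-facet crossed $p \times i$ design the two coefficients mentioned above are defined as follows; these are the standard G-theory definitions, given only as background, so consult the package documentation for its exact conventions.

```latex
% Relative (generalizability) and absolute (dependability) coefficients for a
% p x i crossed design with n_i conditions of the facet per person.
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pi}/n_i},
\qquad
\Phi = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \left(\sigma^{2}_{i} + \sigma^{2}_{pi}\right)/n_i}
```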
