
Simultaneous Estimation of Many Sparse Networks via Hierarchical Poisson Log-Normal Model arxiv.org/abs/2409.12275 .ME

The advancement of single-cell RNA-sequencing (scRNA-seq) technologies allows us to study individual-level, cell-type-specific gene expression networks by directly inferring genes' conditional independence structures. scRNA-seq data facilitate the analysis of gene expression across different conditions or samples, enabling simultaneous estimation of condition- or sample-specific gene networks. Since scRNA-seq data are counts with many zeros, existing network inference methods based on Gaussian graphical models cannot be applied to such single-cell data directly. We propose a hierarchical Poisson log-normal model to simultaneously estimate many such networks while effectively incorporating shared network structures. We develop an efficient simultaneous estimation method that combines variational EM with the alternating direction method of multipliers (ADMM), optimized for parallel processing. Simulation studies show this method outperforms traditional methods in network structure recovery and parameter estimation across various network models. We apply the method to two single-cell RNA-seq datasets: a yeast single-cell gene expression dataset measured under 11 different environmental conditions, and single-cell gene expression data from 13 inflammatory bowel disease patients. We demonstrate that simultaneous estimation can uncover a wider range of conditional dependence networks among genes, offering deeper insights into gene expression mechanisms.
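
For background, the Poisson log-normal model couples a latent multivariate Gaussian layer, whose sparse precision matrix encodes the conditional independence graph among genes, with observed Poisson counts. A minimal sketch of the generative side in numpy (illustrative only; the paper's hierarchical extension and its variational EM/ADMM estimator are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 500

# Sparse precision matrix: zeros off the diagonal mean conditional independence.
Omega = np.eye(p)
Omega[0, 1] = Omega[1, 0] = 0.4  # one conditionally dependent gene pair
Sigma = np.linalg.inv(Omega)

# Latent Gaussian layer, then Poisson counts with many zeros, as in scRNA-seq.
Z = rng.multivariate_normal(mean=-1.0 * np.ones(p), cov=Sigma, size=n)
Y = rng.poisson(np.exp(Z))       # observed count matrix, n cells x p genes
print(Y.shape, (Y == 0).mean())  # fraction of zeros
```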


Optimizing MCMC-Driven Bayesian Neural Networks for High-Precision Medical Image Classification in Small Sample Sizes arxiv.org/abs/2409.12355 .CO

This paper discusses the application of a Bayesian neural network (BNN) based on the Markov chain Monte Carlo (MCMC) method to medical image classification with small samples. Experimental results on two medical image datasets, lung X-ray images and breast tissue slice images, show that this MCMC-based BNN model performs very well on small-sample data and substantially improves the robustness and accuracy of classification, reaching 85% accuracy on the lung X-ray dataset and 88% on the breast tissue slice dataset. To achieve this, we combine data augmentation techniques such as rotation, flipping, and scaling with regularization methods like dropout and weight decay to effectively improve the diversity of the training data and the generalization ability of the model. Model performance was evaluated using several metrics, including accuracy, precision, recall, and F1 score, all of which confirm the advantages of BNNs in small-sample medical image classification. This study not only enriches the application of BNNs in medical image classification, but also provides concrete implementation paths and optimization methods, offering new solutions for future medical image analysis.
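
The abstract does not give the sampler's details, so the following is a generic sketch of the MCMC-driven idea: random-walk Metropolis over the weights of a one-layer "network" (logistic regression) with a Gaussian prior, averaging predictions over posterior samples. All sizes, priors, and step sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))   # toy features (stand-in for image features)
y = (X @ np.array([1., -1., 0.5, 0., 0.]) + rng.normal(size=100) > 0).astype(float)

def log_post(w):
    # Gaussian prior + Bernoulli likelihood of a one-layer "network".
    logits = X @ w
    return -0.5 * w @ w + np.sum(y * logits - np.logaddexp(0.0, logits))

w, samples = np.zeros(5), []
lp = log_post(w)
for t in range(5000):
    prop = w + 0.1 * rng.normal(size=5)       # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
        w, lp = prop, lp_prop
    if t >= 1000 and t % 10 == 0:             # burn-in, then thin
        samples.append(w)

# Bayesian prediction: average class probabilities over posterior samples.
W = np.array(samples)
p_hat = (1.0 / (1.0 + np.exp(-(X @ W.T)))).mean(axis=1)
print("train accuracy:", ((p_hat > 0.5) == y).mean())
```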


Two New Families of Local Asymptotically Minimax Lower Bounds in Parameter Estimation arxiv.org/abs/2409.12491 .ST .IT .TH

We propose two families of asymptotically local minimax lower bounds on parameter estimation performance. The first family of bounds applies to any convex, symmetric loss function that depends solely on the difference between the estimate and the true underlying parameter value (i.e., the estimation error), whereas the second is more specifically oriented to the moments of the estimation error. The proposed bounds are relatively easy to calculate numerically (in the sense that their optimization is over relatively few auxiliary parameters), yet they turn out to be tighter (sometimes significantly so) than previously reported bounds that are associated with similar calculation efforts, across a variety of application examples. In addition to their relative simplicity, they also have the following advantages: (i) Essentially no regularity conditions are required regarding the parametric family of distributions; (ii) The bounds are local (in a sense to be specified); (iii) The bounds provide the correct order of decay as functions of the number of observations, at least in all examples examined; (iv) At least the first family of bounds extends straightforwardly to vector parameters.
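
For orientation, the quantity being lower-bounded is, schematically, the local minimax risk around a fixed point $\theta_0$, with $\ell$ a convex, symmetric loss of the estimation error as in the first family (schematic only; the paper's precise localization and asymptotics may differ):

```latex
% Local asymptotically minimax risk near \theta_0 (schematic):
R_n(\theta_0, \delta) \;=\; \inf_{\hat{\theta}_n}\,
  \sup_{\theta:\, |\theta - \theta_0| \le \delta}
  \mathbb{E}_{\theta}\!\left[\ell\bigl(\hat{\theta}_n - \theta\bigr)\right],
\qquad \text{with } n \text{ observations and } \delta \downarrow 0 \text{ slowly in } n.
```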


Neymanian inference in randomized experiments arxiv.org/abs/2409.12498 .ME .ST .TH

In his seminal 1923 work, Neyman studied the variance estimation problem for the difference-in-means estimator of the average treatment effect in completely randomized experiments. He proposed a variance estimator that is conservative in general and unbiased when treatment effects are homogeneous. While this estimator is widely used under complete randomization, there is no unique or natural way to extend it to more complex designs. To this end, we show that Neyman's estimator can be alternatively derived in two ways, leading to two novel variance estimation approaches: the imputation approach and the contrast approach. While both approaches recover Neyman's estimator under complete randomization, they yield fundamentally different variance estimators for more general designs. In the imputation approach, the variance is expressed as a function of observed and missing potential outcomes and then estimated by imputing the missing potential outcomes, akin to Fisherian inference. In the contrast approach, the variance is expressed as a function of several unobservable contrasts of potential outcomes and then estimated by exchanging each unobservable contrast for an observable contrast. Unlike the imputation approach, the contrast approach does not require separately estimating the missing potential outcome for each unit. We examine the theoretical properties of both approaches, showing that for a large class of designs, each produces conservative variance estimators that are unbiased in finite samples or asymptotically under homogeneous treatment effects.
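
The classical objects discussed here are concrete enough to state in a few lines: under complete randomization, the difference-in-means estimator and Neyman's conservative variance estimator are as below (a sketch of the classical estimator only, not the paper's new imputation or contrast estimators):

```python
import numpy as np

def neyman(y, z):
    """Difference-in-means estimate and Neyman's conservative variance.

    y: outcomes; z: binary treatment indicator (complete randomization).
    """
    y1, y0 = y[z == 1], y[z == 0]
    tau_hat = y1.mean() - y0.mean()
    # Neyman (1923): S1^2/n1 + S0^2/n0, conservative in general,
    # unbiased when treatment effects are homogeneous.
    v_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    return tau_hat, v_hat

rng = np.random.default_rng(0)
z = rng.permutation(np.repeat([0, 1], 50))  # completely randomized assignment
y = 1.0 * z + rng.normal(size=100)          # toy outcomes, true effect = 1
tau, v = neyman(y, z)
print(f"tau_hat = {tau:.3f}, 95% CI half-width = {1.96 * v**0.5:.3f}")
```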


CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity arxiv.org/abs/2409.12578 .CO .AP

Recently, SHapley Additive exPlanations (SHAP) has been widely utilized across research domains. This is particularly evident in medical applications, where SHAP analysis serves as a crucial tool for identifying biomarkers and assisting in result validation. However, despite its frequent usage, SHAP is often not applied in a manner that maximizes its potential contributions. A review of recent papers employing SHAP reveals that many studies subjectively select a limited number of features as 'important' and analyze SHAP values by informally inspecting plots, without assessing statistical significance. Such superficial application may hinder meaningful contributions to the applied fields. To address this, we propose a library package designed to simplify the interpretation of SHAP values. Given only the original data and SHAP values, our library provides: 1) the number of important features to analyze, 2) the pattern of each feature via univariate analysis, and 3) the interactions between features. All information is extracted based on statistical significance and presented in simple, comprehensible sentences, enabling users of all levels to understand the interpretations. We hope this library fosters a comprehensive understanding of statistically valid SHAP results.
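
As an illustration of the kind of check being automated (this is not the CLE-SH API, which the abstract does not spell out; all function and variable names are hypothetical), one can rank features by mean |SHAP| and test each feature-SHAP association for significance instead of judging plots by eye:

```python
import numpy as np
from scipy.stats import spearmanr

def shap_feature_report(X, shap_values, names, alpha=0.05):
    """Rank features by mean |SHAP| and test each feature's monotone
    association with its SHAP values, rather than eyeballing plots."""
    order = np.argsort(-np.abs(shap_values).mean(axis=0))
    for j in order:
        rho, p = spearmanr(X[:, j], shap_values[:, j])
        verdict = "significant" if p < alpha else "not significant"
        print(f"{names[j]}: mean|SHAP|={np.abs(shap_values[:, j]).mean():.3f}, "
              f"monotone trend rho={rho:.2f} ({verdict}, p={p:.3g})")

# toy usage with synthetic SHAP values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
shap_values = np.column_stack(
    [2 * X[:, 0], 0.1 * rng.normal(size=200), -X[:, 2] ** 2])
shap_feature_report(X, shap_values, ["age", "noise", "dose"])
```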


Test-Time Augmentation Meets Variational Bayes arxiv.org/abs/2409.12587 .ML .AI .LG

Data augmentation is known to contribute significantly to the robustness of machine learning models. In most instances, data augmentation is utilized during the training phase. Test-Time Augmentation (TTA) is a technique that instead leverages data augmentations during the testing phase to achieve robust predictions: TTA averages the predictions over multiple augmented copies of an instance to produce a final prediction. Although the effectiveness of TTA has been reported empirically, the predictive performance achieved depends on the set of data augmentation methods used at test time. In particular, individual augmentations contribute to performance to differing degrees, and poorly chosen ones can harm prediction performance. In this study, we consider a weighted version of TTA based on the contribution of each data augmentation; several existing TTA variants can be regarded as solving the problem of determining appropriate weights. We demonstrate that the determination of the coefficients of this weighted TTA can be formalized in a variational Bayesian framework. We also show that optimizing the weights to maximize the marginal log-likelihood suppresses unwanted data augmentation candidates at test time.
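
The core operation is easy to state: plain TTA is a uniform average of predictions over augmented copies, and the weighted variant replaces the uniform weights. A minimal numpy sketch (the model, augmentations, and weights are illustrative; the paper learns the weights variationally rather than fixing them by hand):

```python
import numpy as np

def tta_predict(model, x, augmentations, weights=None):
    """Weighted test-time augmentation: average predictions over
    augmented copies of x. weights=None recovers plain (uniform) TTA."""
    preds = np.stack([model(aug(x)) for aug in augmentations])
    if weights is None:
        weights = np.full(len(augmentations), 1.0 / len(augmentations))
    weights = np.asarray(weights) / np.sum(weights)  # normalize
    return np.tensordot(weights, preds, axes=1)      # weighted average

# toy usage: a fake "model" with identity/flip/noise augmentations
model = lambda x: x.mean(keepdims=True)
augs = [lambda x: x, lambda x: x[::-1], lambda x: x + 0.1]
x = np.arange(8.0)
print(tta_predict(model, x, augs))                           # uniform TTA
print(tta_predict(model, x, augs, weights=[0.7, 0.3, 0.0]))  # suppress a harmful augmentation
```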


Decomposing Gaussians with Unknown Covariance arxiv.org/abs/2409.11497 .ME .ML

Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently available to decompose multivariate Gaussian data require knowledge of the covariance matrix. In many important problems (such as in spatial or longitudinal data analysis, and graphical modeling), the covariance matrix may be unknown and even of primary interest. Thus, in this work we develop new approaches to decompose Gaussians with unknown covariance. First, we present a general algorithm that encompasses all previous decomposition approaches for Gaussian data as special cases, and can further handle the case of an unknown covariance. It yields a new and more flexible alternative to sample splitting when $n>1$. When $n=1$, we prove that it is impossible to partition the information in a multivariate Gaussian into independent portions without knowing the covariance matrix. Thus, we use the general algorithm to decompose a single multivariate Gaussian with unknown covariance into dependent parts with tractable conditional distributions, and demonstrate their use for inference and validation. The proposed decomposition strategy extends naturally to Gaussian processes. In simulation and on electroencephalography data, we apply these decompositions to the tasks of model selection and post-selection inference in settings where alternative strategies are unavailable.
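
For orientation, the known-covariance decomposition that this work generalizes is a two-liner: adding and subtracting an independent Gaussian with the same covariance yields two independent parts, since their covariance cancels. A sketch assuming $\Sigma$ is known, which is exactly the assumption the paper removes:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)  # a known covariance, for illustration
x = rng.multivariate_normal(np.zeros(p), Sigma)  # one observation, n = 1

# Known-Sigma decomposition: X1 = X + W, X2 = X - W with W ~ N(0, Sigma)
# are independent, since Cov(X1, X2) = Sigma - Sigma = 0 and both are Gaussian.
w = rng.multivariate_normal(np.zeros(p), Sigma)
x1, x2 = x + w, x - w  # each ~ N(mu, 2*Sigma), mutually independent
```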


Interpretability Indices and Soft Constraints for Factor Models arxiv.org/abs/2409.11525 .ME

Factor analysis is a way to characterize the relationships between many (observable) variables in terms of a smaller number of unobservable random variables called factors. However, the application of factor models and their success can be subjective or difficult to gauge, since infinitely many factor models that produce the same correlation matrix can be fit to given sample data. Thus, there is a need to operationalize a criterion that measures how meaningful or "interpretable" a factor model is, in order to select the best among many candidates. While techniques that aim to measure and enhance interpretability already exist, new interpretability indices are proposed here, together with rotation methods that mathematically optimize those indices. The proposed methods directly incorporate semantics with the help of natural language processing and generalize to incorporate any "prior information". Moreover, the indices allow for complete or partial specification of relationships at a pairwise level. Two further benefits of the proposed methods are that they do not require the estimation of factor scores, which avoids the factor score indeterminacy problem, and that no additional explanatory variables are necessary. The implementation of the proposed methods is written in Python 3 and is made available, together with several helper functions, through the package interpretablefa on the Python Package Index. The methods' application is demonstrated here using data on the Experiences in Close Relationships Scale, obtained from the Open-Source Psychometrics Project.
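
For contrast, the classical route to a simple structure is an analytic rotation such as varimax; a baseline sketch with scikit-learn (this shows the standard rotation that the proposed optimization-based indices go beyond, not the interpretablefa API):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, :3] += rng.normal(size=(300, 1))  # one latent factor drives the first 3 variables
X[:, 3:] += rng.normal(size=(300, 1))  # another drives the last 3

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
# After rotation, loadings should be near-zero or large: a "simple structure".
print(np.round(fa.components_.T, 2))   # variables x factors loading matrix
```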


A Robust Approach to Gaussian Processes Implementation arxiv.org/abs/2409.11577 .CO

Gaussian Process (GP) regression is a flexible modeling technique used to predict outputs and to capture uncertainty in the predictions. However, GP regression becomes computationally intensive when the training spatial dataset has a large number of observations. To address this challenge, we introduce a scalable GP algorithm, termed MuyGPs, which incorporates nearest neighbors and leave-one-out cross-validation during training. This approach enables the evaluation of large spatial datasets with state-of-the-art accuracy and speed in certain spatial problems. Despite these advantages, conventional quadratic loss functions used in MuyGPs optimization, such as the Root Mean Squared Error (RMSE), are highly influenced by outliers. We explore the behavior of MuyGPs in cases involving outlying observations and subsequently develop a robust approach to handle and mitigate their impact. Specifically, we introduce a novel leave-one-out loss function based on the pseudo-Huber function (LOOPH) that effectively accounts for outliers in large spatial datasets within the MuyGPs framework. Our simulation study shows that the LOOPH loss method maintains accuracy despite outlying observations, establishing MuyGPs as a powerful tool for mitigating the impact of unusual observations in the large-data regime. In an analysis of U.S. ozone data, MuyGPs provides accurate predictions and uncertainty quantification, demonstrating its utility in managing data anomalies. Through these efforts, we advance the understanding of GP regression in spatial contexts.
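
The pseudo-Huber function at the heart of the proposed loss is standard: quadratic near zero and linear in the tails, so outliers contribute linearly rather than quadratically. A sketch of a leave-one-out loss built from it (illustrative; not the MuyGPs implementation, whose LOOPH may also involve predictive variances):

```python
import numpy as np

def pseudo_huber(r, delta=1.0):
    """Quadratic near zero, linear in the tails, so an outlier
    contributes O(|r|) rather than O(r^2) to the loss."""
    return delta**2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)

def looph_like_loss(y, loo_pred, delta=1.0):
    """Sum of pseudo-Huber losses over leave-one-out residuals:
    y[i] versus the prediction made without observation i."""
    return np.sum(pseudo_huber(y - loo_pred, delta))

# toy comparison: one gross outlier barely moves the robust loss
y = np.array([0.1, -0.2, 0.0, 8.0])  # last value is an outlier
pred = np.zeros(4)
print("squared loss:", np.sum((y - pred) ** 2))
print("pseudo-Huber:", looph_like_loss(y, pred))
```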


Outlier Detection with Cluster Catch Digraphs arxiv.org/abs/2409.11596 .ML .LG

This paper introduces a novel family of outlier detection algorithms based on Cluster Catch Digraphs (CCDs), specifically tailored to address the challenges of high dimensionality and varying cluster shapes, which deteriorate the performance of most traditional outlier detection methods. We propose the Uniformity-Based CCD with Mutual Catch Graph (U-MCCD), the Uniformity- and Neighbor-Based CCD with Mutual Catch Graph (UN-MCCD), and their shape-adaptive variants (SU-MCCD and SUN-MCCD), which are designed to detect outliers in data sets with arbitrary cluster shapes and high dimensions. We present the advantages and shortcomings of these algorithms and the motivation for defining each of them. Through comprehensive Monte Carlo simulations, we assess their performance and demonstrate the robustness and effectiveness of our algorithms across various settings and contamination levels. We also illustrate the use of our algorithms on various real-life data sets. The U-MCCD algorithm efficiently identifies outliers while maintaining high true negative rates, and the SU-MCCD algorithm shows substantial improvement in handling non-uniform clusters. Additionally, the UN-MCCD and SUN-MCCD algorithms address the limitations of existing methods in high-dimensional spaces by utilizing Nearest Neighbor Distances (NND) for clustering and outlier detection. Our results indicate that these novel algorithms offer substantial advancements in the accuracy and adaptability of outlier detection, providing a valuable tool for various real-world applications.

Keywords: outlier detection, graph-based clustering, cluster catch digraphs, $k$-nearest-neighborhood, mutual catch graphs, nearest neighbor distance.
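
As a point of reference, the simplest use of Nearest Neighbor Distances (NND) for outlier scoring looks as follows (a bare-bones sketch; the U-MCCD family additionally builds cluster catch digraphs and adapts to cluster shape):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nnd_outlier_scores(X, k=5):
    """Score each point by its distance to its k-th nearest neighbor;
    large scores suggest outliers."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: a point is its own 0-th neighbor
    dist, _ = nn.kneighbors(X)
    return dist[:, -1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one cluster + one outlier
scores = nnd_outlier_scores(X)
print("most outlying point:", np.argmax(scores))          # index 200
```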


Bias Reduction in Matched Observational Studies with Continuous Treatments: Calipered Non-Bipartite Matching and Bias-Corrected Estimation and Inference arxiv.org/abs/2409.11701 .ME .AP

Matching is a commonly used causal inference framework in observational studies. By pairing individuals with different treatment values but the same values of covariates (i.e., exact matching), the sample average treatment effect (SATE) can be consistently estimated and inferred using the classic Neyman-type (difference-in-means) estimator and confidence interval. However, inexact matching typically occurs in practice and may cause substantial bias in the downstream treatment effect estimation and inference. Many methods have been proposed to reduce bias due to inexact matching in the binary treatment case. However, to our knowledge, no existing work has systematically investigated bias due to inexact matching in the continuous treatment case. To fill this gap, we propose a general framework for reducing bias in inexactly matched observational studies with continuous treatments. In the matching stage, we propose a carefully formulated caliper that incorporates information on both the paired covariates and the treatment doses to better tailor matching for the downstream SATE estimation and inference. In the estimation and inference stage, we propose a bias-corrected Neyman estimator, paired with a corresponding bias-corrected variance estimator, that leverages information on propensity density discrepancies after matching to further reduce the bias due to inexact matching. We apply our proposed framework to COVID-19 social mobility data to showcase the differences between classic and bias-corrected SATE estimation and inference.
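
A sketch of the matching stage's basic idea, with a caliper that screens candidate pairs by covariate distance (the paper's caliper also incorporates treatment doses; all names here are illustrative):

```python
import numpy as np
import networkx as nx

def calipered_nonbipartite_match(X, caliper):
    """Pair units (any two, not treated-vs-control) whose covariate
    distance is within the caliper, minimizing total pair distance."""
    n = len(X)
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            if d <= caliper:                # caliper: forbid poor pairs outright
                G.add_edge(i, j, weight=-d) # max weight == min distance
    return nx.max_weight_matching(G, maxcardinality=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
pairs = calipered_nonbipartite_match(X, caliper=2.0)
print(sorted(tuple(sorted(p)) for p in pairs))
```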


Model-Embedded Gaussian Process Regression for Parameter Estimation in Dynamical System arxiv.org/abs/2409.11745 .CO .DS

Identifying dynamical systems (DS) is a vital task in science and engineering. Traditional methods require numerous calls to the DS solver, rendering likelihood-based or least-squares inference frameworks impractical. For efficient parameter inference, two state-of-the-art techniques are the kernel method for modeling and the "one-step framework" for jointly inferring unknown parameters and hyperparameters. The kernel method is quick and straightforward, but it cannot estimate solutions and their derivatives in a way that strictly adheres to physical laws. We propose a model-embedded "one-step" Bayesian framework for joint inference of unknown parameters and hyperparameters by maximizing the marginal likelihood. This approach models the solution and its derivatives using Gaussian process regression (GPR), taking smoothness and continuity properties into account, and treats differential equations as constraints that can be naturally integrated into the Bayesian framework in the linear case. Additionally, we prove the convergence of the model-embedded Gaussian process regression (ME-GPR), supporting its theoretical development. Motivated by the Taylor expansion, we introduce a piecewise first-order linearization strategy to handle nonlinear dynamical systems. We derive estimates and confidence intervals and demonstrate that they exhibit low bias and good coverage properties on both simulated models and real data.
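
The GPR building block of the "one-step" framework, choosing hyperparameters by maximizing the marginal likelihood, is available off the shelf; a scikit-learn sketch fitting a surrogate to a noisy ODE solution (this omits the paper's differential-equation constraints and joint parameter inference):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Noisy observations of an ODE solution x(t) = exp(-theta * t), theta = 0.5
rng = np.random.default_rng(0)
t = np.linspace(0, 5, 30)[:, None]
x = np.exp(-0.5 * t[:, 0]) + 0.05 * rng.normal(size=30)

# Hyperparameters (length scale, noise level) are chosen by maximizing
# the marginal likelihood -- the same criterion the paper extends to
# cover the unknown ODE parameters jointly.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel).fit(t, x)
mean, std = gp.predict(t, return_std=True)
print(gp.kernel_)  # fitted kernel after marginal-likelihood optimization
```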


Symmetry-Based Structured Matrices for Efficient Approximately Equivariant Networks arxiv.org/abs/2409.11772 .ML .LG

There has been much recent interest in designing symmetry-aware neural networks (NNs) exhibiting relaxed equivariance. Such NNs aim to interpolate between being exactly equivariant and being fully flexible, affording consistent performance benefits. In a separate line of work, certain structured parameter matrices, namely those with displacement structure characterized by low displacement rank (LDR), have been used to design small-footprint NNs. Displacement structure enables fast function and gradient evaluation, but has permitted accurate compression-based approximation primarily of classical convolutional neural networks (CNNs). In this work, we propose a general framework, based on a novel construction of symmetry-based structured matrices, for building approximately equivariant NNs with significantly reduced parameter counts. Our framework integrates the two aforementioned lines of work via so-called Group Matrices (GMs), a forgotten precursor to the modern notion of regular representations of finite groups. GMs allow the design of structured matrices, resembling LDR matrices, that generalize the linear operations of a classical CNN from cyclic groups to general finite groups and their homogeneous spaces. We show that GMs can be employed to extend all the elementary operations of CNNs to general discrete groups. Further, the theory of structured matrices based on GMs generalizes LDR theory, which focuses on matrices with cyclic structure, and provides a tool for implementing approximate equivariance for discrete groups. We test GM-based architectures on a variety of tasks in the presence of relaxed symmetry and find that our framework consistently performs competitively with approximately equivariant NNs and other structured-matrix-based compression frameworks, sometimes with one to two orders of magnitude fewer parameters.
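
The central object is concrete: a group matrix for a finite group $G$ places the parameter $f(g^{-1}h)$ at entry $(g, h)$, so for the cyclic group it is a circulant matrix, the structure behind ordinary circular convolution. A small illustrative construction for the cyclic case:

```python
import numpy as np

def group_matrix_cyclic(f):
    """Group matrix M[g, h] = f((h - g) mod n) for the cyclic group Z_n.
    This is exactly a circulant matrix, i.e. a 1-D circular convolution."""
    n = len(f)
    idx = (np.arange(n)[None, :] - np.arange(n)[:, None]) % n
    return np.asarray(f)[idx]

f = [1.0, 2.0, 0.0, 0.0]  # one free parameter per group element
M = group_matrix_cyclic(f)
print(M)

# Applying M.T to a signal equals circularly convolving it with f:
x = np.arange(4.0)
print(M.T @ x)
print(np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(f))))
```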


Sparse Factor Analysis for Categorical Data with the Group-Sparse Generalized Singular Value Decomposition arxiv.org/abs/2409.11789 .ST .TH

Correspondence analysis, multiple correspondence analysis, and their discriminant counterparts (i.e., discriminant simple correspondence analysis and discriminant multiple correspondence analysis) are methods of choice for analyzing multivariate categorical data. In these methods, variables are integrated into optimal components computed as linear combinations whose weights are obtained from a generalized singular value decomposition (GSVD) that integrates specific metric constraints on the rows and columns of the original data matrix. The weights of the linear combinations are, in turn, used to interpret the components, and this interpretation is facilitated when components are 1) pairwise orthogonal and 2) when the values of the weights are either large or small but not intermediate (a pattern called a simple or sparse structure). To obtain such simple configurations, the optimization problem solved by the GSVD is extended to include new constraints that implement component orthogonality and sparse weights. Because multiple correspondence analysis represents qualitative variables by a set of binary variables, an additional group constraint is added to the optimization problem in order to sparsify the whole set of variables representing one qualitative variable. This new algorithm, called group-sparse GSVD (gsGSVD), integrates these constraints via an iterative projection scheme onto the intersection of subspaces, where each subspace implements a specific constraint. In this paper, we present this new algorithm, show how it can be adapted to the sparsification of simple and multiple correspondence analysis, and illustrate its applications with the analysis of four different data sets, each illustrating the sparsification of a particular CA-based analysis.
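
For reference, the GSVD at the core of plain correspondence analysis (before the orthogonality and group-sparsity constraints added here) reduces to an ordinary SVD of a standardized residual matrix; a compact sketch:

```python
import numpy as np

def correspondence_analysis(N):
    """Plain CA via the (G)SVD: row/column margins act as the metric
    constraints; components come from the SVD of standardized residuals."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    row_scores = np.diag(r**-0.5) @ U * d   # principal row coordinates
    col_scores = np.diag(c**-0.5) @ Vt.T * d
    return row_scores, col_scores, d**2     # d^2: inertia per component

# toy contingency table (rows: groups, columns: categories)
N = np.array([[20.0, 5.0, 2.0], [3.0, 18.0, 6.0], [1.0, 4.0, 25.0]])
rows, cols, inertia = correspondence_analysis(N)
print(np.round(inertia, 3))
```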
