Bayesian Variable Selection in Multivariate Regression Under Collinearity in the Design Matrix arxiv.org/abs/2507.17975 .ME .CO

We consider the problem of variable selection in Bayesian multivariate linear regression models, involving multiple response and predictor variables, under multivariate normal errors. In the absence of a known covariance structure, specifying a model with a non-diagonal covariance matrix is appealing. Modeling dependency in the random errors through a non-diagonal covariance matrix is generally expected to lead to improved estimation of the regression coefficients. In this article, we highlight an interesting exception: modeling the dependency in errors can significantly worsen both estimation and prediction. We demonstrate that Bayesian multi-outcome regression models using several popular variable selection priors can suffer from poor estimation properties in low-information settings--such as scenarios with weak signals, high correlation among predictors and responses, and small sample sizes. In such cases, the simultaneous estimation of all unknown parameters in the model becomes difficult when using a non-diagonal covariance matrix. Through simulation studies and a dataset with measurements from NIR spectroscopy, we illustrate that a two-step procedure--estimating the mean and the covariance matrix separately--can provide more accurate estimates in such cases. Thus, a potential solution to avoid the problem altogether is to routinely perform an additional analysis with a diagonal covariance matrix, even if the errors are expected to be correlated.
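The two-step procedure is easy to sketch: fit the mean structure response-by-response (implicitly assuming a diagonal error covariance), then estimate the error covariance separately from the residuals. A minimal numpy sketch on simulated data, with ridge shrinkage standing in for a variable-selection prior (the priors, dimensions, and penalty here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated low-information setting: small n, correlated predictors,
# correlated errors, weak sparse signal.
n, p, m = 30, 10, 3                              # samples, predictors, responses
C = 0.8 * np.ones((p, p)) + 0.2 * np.eye(p)      # equicorrelated predictors
X = rng.normal(size=(n, p)) @ np.linalg.cholesky(C).T
B = np.zeros((p, m)); B[:2] = 0.5                # only two active predictors
S = 0.7 * np.ones((m, m)) + 0.3 * np.eye(m)      # correlated errors
Y = X @ B + rng.multivariate_normal(np.zeros(m), S, size=n)

# Step 1: estimate the mean structure response-by-response, i.e. with an
# implicit diagonal error covariance (ridge stands in for a shrinkage prior).
lam = 1.0
B_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Step 2: estimate the error covariance separately, from the residuals.
R = Y - X @ B_hat
Sigma_hat = R.T @ R / (n - 1)

print("coefficient MSE:", float(np.mean((B_hat - B) ** 2)))
```

The point of the abstract is that this separation can beat joint estimation with a non-diagonal covariance precisely in settings like the one simulated here.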

arXiv.org

Zeroth-order log-concave sampling arxiv.org/abs/2507.18021 .ST .FA .PR .TH .DS .LG

We study the zeroth-order query complexity of log-concave sampling, specifically uniform sampling from convex bodies using membership oracles. We propose a simple variant of the proximal sampler that achieves the query complexity with matched Rényi orders between the initial warmness and output guarantee. Specifically, for any $\varepsilon>0$ and $q\geq2$, the sampler, initialized at $π_{0}$, outputs a sample whose law is $\varepsilon$-close in $q$-Rényi divergence to $π$, the uniform distribution over a convex body in $\mathbb{R}^{d}$, using $\widetilde{O}(qM_{q}^{q/(q-1)}d^{2}\,\lVert\operatorname{cov}π\rVert\log\frac{1}{\varepsilon})$ membership queries, where $M_{q}=\lVert\text{d}π_{0}/\text{d}π\rVert_{L^{q}(π)}$. We further introduce a simple annealing scheme that produces a warm start in $q$-Rényi divergence (i.e., $M_{q}=O(1)$) using $\widetilde{O}(qd^{2}R^{3/2}\,\lVert\operatorname{cov}π\rVert^{1/4})$ queries, where $R^{2}=\mathbb{E}_π[|\cdot|^{2}]$. This interpolates between known complexities for warm-start generation in total variation and Rényi-infinity divergence. To relay a Rényi warmness across the annealing scheme, we establish hypercontractivity under simultaneous heat flow and translate it into an improved mixing guarantee for the proximal sampler under a logarithmic Sobolev inequality. These results extend naturally to general log-concave distributions accessible via evaluation oracles, incurring additional quadratic queries.
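Stripped of the complexity analysis, a proximal sampler for the uniform distribution on a convex body needs only membership queries: a forward Gaussian step, then a backward Gaussian step restricted to the body via rejection. A toy sketch (this is the textbook proximal sampler, not the paper's exact variant; the unit ball, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def membership(x):
    """Zeroth-order oracle: is x in the body K? Here K is the unit ball."""
    return float(np.dot(x, x)) <= 1.0

def proximal_step(x, eta):
    """One proximal-sampler iteration using only membership queries."""
    y = x + np.sqrt(eta) * rng.normal(size=x.shape)   # forward: y ~ N(x, eta I)
    while True:                                       # backward: N(y, eta I) restricted to K
        x_new = y + np.sqrt(eta) * rng.normal(size=x.shape)
        if membership(x_new):
            return x_new

d, eta, iters = 3, 0.05, 500
x = np.zeros(d)                      # feasible starting point
for _ in range(iters):
    x = proximal_step(x, eta)
print("final iterate in K:", membership(x))
```

The rejection loop is where the membership queries accrue; the paper's contribution is the query-complexity bound and the Rényi-matched warm start, not the basic scheme.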

Regression approaches for modelling genotype-environment interaction and making predictions into unseen environments arxiv.org/abs/2507.18125 .ME .AP

In plant breeding and variety testing, there is an increasing interest in making use of environmental information to enhance predictions for new environments. Here, we will review linear mixed models that have been proposed for this purpose. The emphasis will be on predictions and on methods to assess the uncertainty of predictions for new environments. Our point of departure is straight-line regression, which may be extended to multiple environmental covariates and genotype-specific responses. When observable environmental covariates are used, this is also known as factorial regression. Early work along these lines can be traced back to Stringfield & Salter (1934) and Yates & Cochran (1938), who proposed a method nowadays best known as Finlay-Wilkinson regression. This method, in turn, has close ties with regression on latent environmental covariates and factor-analytic variance-covariance structures for genotype-environment interaction. Extensions of these approaches - reduced rank regression, kernel- or kinship-based approaches, random coefficient regression, and extended Finlay-Wilkinson regression - will be the focus of this paper. Our objective is to demonstrate how seemingly disparate methods are very closely linked and fall within a common model-based prediction framework. The framework considers environments as random throughout, with genotypes also modelled as random in most cases. We will discuss options for assessing uncertainty of predictions, including cross validation and model-based estimates of uncertainty. The methods are illustrated using a long-term rice variety trial dataset from Bangladesh.
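Finlay-Wilkinson regression itself is simple to demonstrate: use the environment mean yield as the environmental index and regress each genotype's yields on that index. A toy numpy sketch (the simulated genotype-by-environment table is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy genotype-by-environment table of yields (genotypes x environments).
G, E = 5, 8
sens = rng.uniform(0.5, 1.5, size=G)            # genotype-specific sensitivities
env_effect = rng.normal(0, 2, size=E)           # latent environment quality
Y = 10 + np.outer(sens, env_effect) + rng.normal(0, 0.3, size=(G, E))

# Finlay-Wilkinson regression: the environment mean serves as the
# environmental index; regress each genotype's yields on it.
index = Y.mean(axis=0)
X = np.column_stack([np.ones(E), index])
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # per-genotype intercept, slope
intercepts, slopes = coef

print("slopes:", slopes.round(2))   # slopes average exactly 1 by construction
```

A slope above 1 marks a genotype that responds more than average to better environments; the factor-analytic and random-coefficient extensions in the paper generalize this single latent index.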

Complex Dynamics in Psychological Data: Mapping Individual Symptom Trajectories to Group-Level Patterns arxiv.org/abs/2507.14161 .AP .ML .LG

This study integrates causal inference, graph analysis, temporal complexity measures, and machine learning to examine whether individual symptom trajectories can reveal meaningful diagnostic patterns. Using a longitudinal dataset of N=45 individuals affected by General Anxiety Disorder (GAD) and/or Major Depressive Disorder (MDD), derived from Fisher et al. 2017, we propose a novel pipeline for analyzing the temporal dynamics of psychopathological symptoms. First, we employ the PCMCI+ algorithm with a nonparametric independence test to determine the causal network of nonlinear dependencies between symptoms in individuals with different mental disorders. We found that PCMCI+ effectively highlights the individual peculiarities of each symptom network, which could be leveraged towards personalized therapies. At the same time, aggregating the networks by diagnosis sheds light on disorder-specific causal mechanisms, in agreement with previous psychopathological literature. Then, we enrich the dataset by computing complexity-based measures (e.g. entropy, fractal dimension, recurrence) from the symptom time series and feed it to a suitably selected machine learning algorithm to aid the diagnosis of each individual. The enriched dataset yields 91% accuracy in the classification of the symptom dynamics, proving to be an effective diagnostic support tool. Overall, these findings highlight how integrating causal modeling and temporal complexity can enhance diagnostic differentiation, offering a principled, data-driven foundation both for personalized assessment in clinical psychology and for structural advances in psychological research.
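As an illustration of the complexity-based features, here is a small sketch computing normalized permutation entropy (one common entropy measure; the paper's exact feature set and classifier are not reproduced here) for a regular versus an irregular series:

```python
import numpy as np
from collections import Counter
from math import factorial, log

def permutation_entropy(x, order=3):
    """Normalized permutation entropy (Bandt-Pompe) of a 1-D series:
    entropy of the distribution of ordinal patterns of length `order`."""
    counts = Counter(tuple(np.argsort(x[i:i + order]))
                     for i in range(len(x) - order + 1))
    probs = np.array(list(counts.values()), float)
    probs /= probs.sum()
    return float(-(probs * np.log(probs)).sum() / log(factorial(order)))

rng = np.random.default_rng(3)
t = np.arange(500)
regular = np.sin(0.2 * t)            # low-complexity series
irregular = rng.normal(size=500)     # high-complexity series
pe_reg = permutation_entropy(regular)
pe_irr = permutation_entropy(irregular)
print(round(pe_reg, 3), round(pe_irr, 3))
```

Features like this one, computed per symptom series, are what get stacked into the enriched dataset fed to the classifier.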

On the Testing of complete causal mediation and its applications arxiv.org/abs/2507.14246 .ME .AP

The Complete Mediation Test (CMT) is a specialized form of mediation analysis that assesses whether an independent variable A influences an outcome variable Y exclusively through a mediator M, without any direct effect. One application of the CMT lies in Mendelian Randomization (MR) studies, where it can be used to investigate non-pleiotropy, that is, to test whether genetic variants impact a disease outcome solely through their effect on a target exposure variable. Traditionally, the CMT has relied on two significance-based criteria and a proportion-based criterion with a heuristic threshold that has not been rigorously evaluated. In this paper, we explore the theoretical properties of the conventional CMT and propose using the standardized absolute proportion of mediation (SAPM) as a criterion for the CMT. We systematically assess the performance of various CMT criteria via simulation and demonstrate their practical utility in the context of MR studies. Our results indicate that the SAPM criterion offers the best performance. We also propose using different optimal thresholds depending on whether the mediator and outcome are continuous or binary. The SAPM with proper thresholds ensures that the indirect pathway meaningfully accounts for the effect of the exposure on the outcome, thereby strengthening the case for complete mediation.
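The abstract does not spell out the SAPM formula, so as a hedged illustration here is the conventional product-of-coefficients mediation computation on simulated data with complete mediation, reporting a plain (unstandardized) proportion of mediation; the simulated coefficients and sample size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
A = rng.normal(size=n)                 # exposure (e.g. a genetic variant score)
M = 0.8 * A + rng.normal(size=n)       # mediator carries the whole effect
Y = 0.5 * M + rng.normal(size=n)       # outcome: no direct A -> Y path

a = np.polyfit(A, M, 1)[0]             # slope of M ~ A
X = np.column_stack([np.ones(n), A, M])
_, c_direct, b = np.linalg.lstsq(X, Y, rcond=None)[0]   # Y ~ A + M

indirect = a * b
prop_mediated = abs(indirect) / (abs(indirect) + abs(c_direct))
print(round(prop_mediated, 3))
```

Under complete mediation this proportion sits near 1; the paper's SAPM standardizes such a quantity and calibrates thresholds for deciding when "near 1" is near enough.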

Distributed Kaplan-Meier Analysis via the Influence Function with Application to COVID-19 and COVID-19 Vaccine Adverse Events arxiv.org/abs/2507.14351 .ME

During the COVID-19 pandemic, regulatory decision-making was hampered by a lack of timely and high-quality data on rare outcomes. Studying rare outcomes following infection and vaccination requires conducting multi-center observational studies, where sharing individual-level data is a privacy concern. In this paper, we conduct a multi-center observational study of thromboembolic events following COVID-19 and COVID-19 vaccination without sharing individual-level data. We accomplish this by developing a novel distributed learning method for constructing Kaplan-Meier (KM) curves and inverse propensity weighted KM curves with statistical inference. We sequentially update curves site-by-site using the KM influence function, which is a measure of the direction in which an observation should shift our estimate and so can be used to incorporate new observations without access to previous data. We show in simulations that our distributed estimator is unbiased and achieves equal efficiency to the combined data estimator. Applying our method to Beaumont Health, Spectrum Health, and Michigan Medicine data, we find a much higher covariate-adjusted incidence of blood clots after SARS-CoV-2 infection (3.13%, 95% CI: [2.93, 3.35]) compared to first COVID-19 vaccine (0.08%, 95% CI: [0.08, 0.09]). This suggests that the protection vaccines provide against COVID-19-related clots outweighs the risk of vaccine-related adverse events, and shows the potential of distributed survival analysis to provide actionable evidence for time-sensitive decision making.
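The influence-function estimator itself is not reproduced here, but the underlying goal, a KM curve computed without pooling individual records, can be illustrated with a simpler scheme in which each site shares only per-time event and censoring counts (note this sketch omits the paper's IPW adjustment and inference, and sharing exact counts has its own privacy trade-offs):

```python
from collections import Counter

def site_summary(times, events):
    """Aggregates a site can share: event and censoring counts per time."""
    d, c = Counter(), Counter()
    for t, e in zip(times, events):
        (d if e else c)[t] += 1
    return d, c

def km_from_summaries(summaries):
    """Kaplan-Meier curve from pooled per-time counts (no individual records)."""
    deaths, cens = Counter(), Counter()
    for d, c in summaries:
        deaths.update(d)
        cens.update(c)
    n_risk = sum(deaths.values()) + sum(cens.values())
    surv, S = {}, 1.0
    for t in sorted(set(deaths) | set(cens)):
        if deaths[t]:
            S *= 1 - deaths[t] / n_risk      # KM product-limit step
        surv[t] = S
        n_risk -= deaths[t] + cens[t]        # remove deaths and censorings
    return surv

site1 = site_summary([1, 2, 2, 4], [1, 0, 1, 1])
site2 = site_summary([1, 3, 4, 5], [1, 1, 0, 1])
curve = km_from_summaries([site1, site2])
print(curve)
```

The paper's influence-function approach instead updates the estimator site by site without even sharing per-time counts, and adds valid standard errors.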

A Hybrid Mixture Approach for Clustering and Characterizing Cancer Data arxiv.org/abs/2507.14380 -bio.TO .ME .AP .CO .ML

Model-based clustering is widely used for identifying and distinguishing types of diseases. However, modern high-dimensional biomedical data make model estimation challenging in traditional cluster analysis. Incorporating factor analyzers into the mixture model provides a way to characterize a large set of data features, but the current estimation method is computationally impractical for massive data due to the intrinsically slow convergence of the embedded algorithms and the inability to vary the size of the factor analyzers, which prevents the implementation of a generalized mixture of factor analyzers and further characterization of the data clusters. We propose a hybrid matrix-free computational scheme to efficiently estimate the clusters and model parameters based on a Gaussian mixture along with generalized factor analyzers, summarizing the large number of variables using a small set of underlying factors. Our approach outperforms the existing method, converging faster while maintaining high clustering accuracy. Our algorithms are applied to accurately identify and distinguish types of breast cancer based on large tumor samples, and to provide a generalized characterization for subtypes of lymphoma using massive gene records.
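The "matrix-free" flavor of such schemes can be illustrated with the factor-analyzer covariance structure Sigma = L L^T + Psi: the Woodbury identity and the matrix determinant lemma let one evaluate the Gaussian log-density without ever forming the p x p covariance. A sketch (dimensions and parameters are illustrative, and this is only the density evaluation, not the paper's full estimation scheme):

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 200, 3                        # many variables, few latent factors
L = rng.normal(size=(p, q))          # factor loadings
psi = rng.uniform(0.5, 1.5, p)       # diagonal noise variances
x = rng.normal(size=p)
mu = np.zeros(p)

def fa_logpdf(x, mu, L, psi):
    """log N(x; mu, L L^T + diag(psi)) via Woodbury and the determinant
    lemma, using only q x q solves instead of a p x p factorization."""
    r = (x - mu) / psi                                # Psi^{-1} (x - mu)
    M = np.eye(L.shape[1]) + (L.T / psi) @ L          # q x q core matrix
    quad = r @ (x - mu) - r @ L @ np.linalg.solve(M, L.T @ r)
    logdet = np.log(psi).sum() + np.linalg.slogdet(M)[1]
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

# Check against the direct O(p^3) computation.
Sigma = L @ L.T + np.diag(psi)
direct = -0.5 * (p * np.log(2 * np.pi)
                 + np.linalg.slogdet(Sigma)[1]
                 + (x - mu) @ np.linalg.solve(Sigma, x - mu))
print(float(fa_logpdf(x, mu, L, psi)), float(direct))
```

Repeating this density evaluation inside an EM E-step is what makes large p tractable for mixtures of factor analyzers.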

Bivariate generalized autoregressive models for forecasting bivariate non-Gaussian time series arxiv.org/abs/2507.14442 .ME .ST .AP .TH

This paper introduces a novel approach, the bivariate generalized autoregressive (BGAR) model, for modeling and forecasting bivariate time series data. The BGAR model generalizes bivariate vector autoregressive (VAR) models by allowing for data that do not necessarily follow a normal distribution. We consider a random vector of two time series and assume each belongs to the canonical exponential family, similarly to the univariate generalized autoregressive moving average (GARMA) model. We include autoregressive terms of one series in the dynamical structure of the other and vice versa. The model parameters are estimated using the conditional maximum likelihood (CML) method. We provide general closed-form expressions for the conditional score vector and conditional Fisher information matrix, encompassing all canonical exponential family distributions. We develop asymptotic confidence intervals and hypothesis tests. We discuss techniques for model selection, residual diagnostic analysis, and forecasting. We carry out Monte Carlo simulation studies to evaluate the performance of the finite-sample CML inferences, including point and interval estimation. An application to real data analyzes the effect of the number of leptospirosis cases on hospitalizations due to leptospirosis in São Paulo state, Brazil. Competing models such as GARMA, autoregressive integrated moving average (ARIMA), and VAR models are considered for comparison purposes. The new model outperforms the competing models by providing more accurate out-of-sample forecasts and allowing quantification of the lagged effect of the case count series on hospitalizations due to leptospirosis.
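A stripped-down version of the cross-lagged idea can be simulated directly: a bivariate Poisson model with log links in which each series' conditional mean depends on lags of both series (the coefficients and link specification are illustrative assumptions, not the paper's BGAR specification):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 300

# log mu1_t = b1 + a11*log(y1_{t-1}+1) + a12*log(y2_{t-1}+1), and symmetrically
# for series 2: each series' mean depends on the lag of BOTH series.
b = np.array([0.5, 0.3])
A = np.array([[0.4, 0.2],
              [0.3, 0.4]])

y = np.zeros((T, 2))
y[0] = [2, 2]
for t in range(1, T):
    eta = b + A @ np.log(y[t - 1] + 1)     # linear predictor on the log scale
    y[t] = rng.poisson(np.exp(eta))        # conditionally Poisson counts

# One-step-ahead forecast is the conditional mean given the last observation.
forecast = np.exp(b + A @ np.log(y[-1] + 1))
print("one-step forecast:", forecast.round(2))
```

The off-diagonal entries of A are the cross-lagged effects the paper quantifies, e.g. case counts driving hospitalizations.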

Statistical and Algorithmic Foundations of Reinforcement Learning arxiv.org/abs/2507.14444 .ML .OC .ST .TH .AI .LG

As a paradigm for sequential decision making in unknown environments, reinforcement learning (RL) has received a flurry of attention in recent years. However, the explosion of model complexity in emerging applications and the presence of nonconvexity exacerbate the challenge of achieving efficient RL in sample-starved situations, where data collection is expensive, time-consuming, or even high-stakes (e.g., in clinical trials, autonomous systems, and online advertising). How to understand and enhance the sample and computational efficacies of RL algorithms is thus of great interest. In this tutorial, we aim to introduce several important algorithmic and theoretical developments in RL, highlighting the connections between new ideas and classical topics. Employing Markov Decision Processes as the central mathematical model, we cover several distinctive RL scenarios (i.e., RL with a simulator, online RL, offline RL, robust RL, and RL with human feedback), and present several mainstream RL approaches (i.e., model-based approach, value-based approach, and policy optimization). Our discussions gravitate around the issues of sample complexity, computational efficiency, as well as algorithm-dependent and information-theoretic lower bounds from a non-asymptotic viewpoint.
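As a concrete instance of the model-based approach the tutorial covers, here is value iteration on a tiny two-state MDP (the MDP itself is an illustrative assumption):

```python
import numpy as np

# Tiny 2-state, 2-action MDP, solved by value iteration.
P = np.array([  # P[a, s, s'] transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0],       # R[a, s] immediate reward
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V       # Q[a, s] = R[a, s] + gamma * sum_s' P[a,s,s'] V[s']
    V_new = Q.max(axis=0)       # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmax(axis=0)
print("V*:", V.round(3), "policy:", policy)
```

The contraction of the Bellman backup guarantees convergence at rate gamma; sample-complexity questions arise once P and R must be estimated from data rather than given, which is the setting the tutorial analyzes.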

Qoto Mastodon