
An $\mathcal{O}(\log_2N)$ SMC$^2$ Algorithm on Distributed Memory with an Approx. Optimal L-Kernel. (arXiv:2311.12973v1 [stat.AP]) arxiv.org/abs/2311.12973


Calibrating statistical models using Bayesian inference often requires both accurate and timely estimates of the parameters of interest. Particle Markov Chain Monte Carlo (p-MCMC) and Sequential Monte Carlo Squared (SMC$^2$) are two methods that use an unbiased estimate of the log-likelihood obtained from a particle filter (PF) to evaluate the target distribution. P-MCMC constructs a single Markov chain, which is sequential by nature and so cannot be readily parallelized on Distributed Memory (DM) architectures. In contrast, SMC$^2$ includes processes, such as importance sampling, that are embarrassingly parallel, although difficulties arise when attempting to parallelize resampling. Nonetheless, the choice of backward kernel, the recycling scheme, and compatibility with DM architectures make SMC$^2$ an attractive option compared with p-MCMC. In this paper, we present an SMC$^2$ framework with the following features: an optimal (in terms of time complexity) $\mathcal{O}(\log_2 N)$ parallelization for DM architectures, an approximately optimal (in terms of accuracy) backward kernel, and an efficient recycling scheme. On a cluster of $128$ DM processors, results on a biomedical application show that SMC$^2$ achieves up to a $70\times$ speed-up over its sequential implementation. It is also more accurate and roughly $54\times$ faster than p-MCMC. A GitHub link to the code is provided.
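
For orientation, the quantity both methods build on is the particle-filter estimate of the log-likelihood of the data given a parameter value. Below is a minimal Python sketch of such an estimator; the model functions init_fn, transition_fn and loglik_fn are placeholders, and the multinomial resampling step written serially here is exactly the part the paper parallelizes in $\mathcal{O}(\log_2 N)$ time on distributed memory. This is an illustrative sketch, not the authors' code (their implementation is the one linked from the paper).

    import numpy as np

    def pf_loglik(y, n_particles, init_fn, transition_fn, loglik_fn, rng):
        # Bootstrap particle filter estimate of log p(y_{1:T} | theta).
        # init_fn, transition_fn, loglik_fn are user-supplied model functions (placeholders).
        x = init_fn(n_particles, rng)                 # initial particles
        log_lik = 0.0
        for t in range(len(y)):
            x = transition_fn(x, rng)                 # propagate through the state model
            logw = loglik_fn(y[t], x)                 # per-particle observation log-likelihoods
            m = np.max(logw)
            w = np.exp(logw - m)                      # stabilised weights
            log_lik += m + np.log(np.mean(w))         # log of the average weight
            idx = rng.choice(n_particles, size=n_particles, p=w / np.sum(w))
            x = x[idx]                                # multinomial resampling (serial here)
        return log_lik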


Physics-Informed Priors with Application to Boundary Layer Velocity. (arXiv:2311.12978v1 [physics.flu-dyn]) arxiv.org/abs/2311.12978


One of the most popular recent areas of machine learning builds on neural networks augmented with information about the underlying process in the form of Partial Differential Equations (PDEs). These physics-informed neural networks are obtained by penalizing the inference with a PDE and are cast as a minimization problem that currently lacks a formal approach to quantifying uncertainty. In this work, we propose a novel model-based framework which regards the PDE as prior information for a deep Bayesian neural network. The prior is calibrated without data so that its mean resembles the PDE solution, while our degree of confidence in the PDE relative to the data is expressed through the prior variance. The information embedded in the PDE is then propagated to the posterior, yielding physics-informed forecasts with uncertainty quantification. We apply our approach to a simulated viscous fluid and to experimentally obtained turbulent boundary layer velocity in a wind tunnel, using an appropriately simplified Navier-Stokes equation. Our approach requires very few observations to produce physically consistent forecasts, as opposed to the non-physical forecasts stemming from non-informed priors, thereby allowing the forecasting of complex systems where some data and some contextual knowledge are available.
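
One way to read the construction (notation mine, not the authors'): place a Gaussian prior on the unknown velocity field, $u(\cdot) \sim \mathcal{GP}\big(m_{\mathrm{PDE}}(\cdot),\, \sigma^2_{\mathrm{prior}}\, k(\cdot,\cdot)\big)$, where the mean $m_{\mathrm{PDE}}$ is fit, without data, to a numerical solution of the simplified PDE and $\sigma^2_{\mathrm{prior}}$ encodes how much the PDE is trusted relative to the observations $y_i = u(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2_y)$. Conditioning on the few available observations then yields a posterior whose mean and variance are the physics-informed forecast and its uncertainty.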


Beyond Catoni: Sharper Rates for Heavy-Tailed and Robust Mean Estimation. (arXiv:2311.13010v1 [math.ST]) arxiv.org/abs/2311.13010


We study the fundamental problem of estimating the mean of a $d$-dimensional distribution with covariance $\Sigma \preccurlyeq \sigma^2 I_d$ given $n$ samples. When $d = 1$, Catoni \cite{catoni} showed an estimator with error $(1+o(1)) \cdot \sigma\sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$, with probability $1 - \delta$, matching the Gaussian error rate. For $d>1$, a natural estimator outputs the center of the minimum enclosing ball of one-dimensional confidence intervals to achieve a $1-\delta$ confidence radius of $\sqrt{\frac{2 d}{d+1}} \cdot \sigma\left(\sqrt{\frac{d}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}\right)$, incurring a $\sqrt{\frac{2d}{d+1}}$-factor loss over the Gaussian rate. When the $\sqrt{\frac{d}{n}}$ term dominates by a $\sqrt{\log \frac{1}{\delta}}$ factor, \cite{lee2022optimal-highdim} showed an improved estimator matching the Gaussian rate. This raises a natural question: is the Gaussian rate achievable in general? Or is the $\sqrt{\frac{2 d}{d+1}}$ loss \emph{necessary} when the $\sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$ term dominates? We show that the answer to both questions is \emph{no}: \emph{some} constant-factor loss over the Gaussian rate is necessary, but we construct an estimator that improves over the above naive estimator by a constant factor. We also consider robust estimation, where an adversary is allowed to corrupt an $\epsilon$-fraction of samples arbitrarily: in this case, we show that the above strategy of combining one-dimensional estimates and incurring the $\sqrt{\frac{2d}{d+1}}$-factor loss \emph{is} optimal in the infinite-sample limit.
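
As a quick check of the size of the loss in question (my arithmetic, not the paper's): $\sqrt{2d/(d+1)}$ equals $\sqrt{4/3} \approx 1.15$ at $d=2$ and increases toward $\sqrt{2} \approx 1.41$ as $d \to \infty$, so the naive estimator is at most about $41\%$ worse than the Gaussian rate, and the paper's results pin down how much of that bounded gap is unavoidable.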


W-kernel and essential subspace for frequentist's evaluation of Bayesian estimators. (arXiv:2311.13017v1 [stat.ME]) arxiv.org/abs/2311.13017


The posterior covariance matrix W defined by the log-likelihood of each observation plays important roles both in sensitivity analysis and in the frequentist evaluation of Bayesian estimators. This study focuses on the matrix W and its principal space, which we term the essential subspace. First, we show that they appear in various statistical settings, such as evaluating posterior sensitivity, assessing frequentist uncertainty from posterior samples, and stochastic expansion of the loss; a key tool for treating frequentist properties is the recently proposed Bayesian infinitesimal jackknife approximation (Giordano and Broderick (2023)). We then show that the matrix W can be interpreted as a reproducing kernel, which we name the W-kernel. Using the W-kernel, the essential subspace is expressed as a principal space given by kernel PCA. A relation to the Fisher kernel and the neural tangent kernel is established, which elucidates the connection to classical asymptotic theory and leads to a form of Bayesian-frequentist duality. Finally, two applications are discussed: selection of a representative set of observations and dimension reduction in the approximate bootstrap. In the former, incomplete Cholesky decomposition is introduced as an efficient method to compute the essential subspace. In the latter, different implementations of the approximate bootstrap for posterior means are compared.
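
One plausible reading of the construction (my notation, not the paper's code): with $S$ posterior draws $\theta_s$ and per-observation log-likelihoods $\ell_i(\theta) = \log p(y_i \mid \theta)$, take $W_{ij} = \mathrm{Cov}_{\mathrm{post}}(\ell_i, \ell_j)$ and read the essential subspace off its leading eigenvectors. A small Python sketch:

    import numpy as np

    def w_kernel_subspace(loglik_samples, n_components):
        # loglik_samples: shape (S, n), entry (s, i) = log p(y_i | theta_s).
        # Returns the n x n posterior covariance matrix W and its leading
        # eigenpairs, taken here as the "essential subspace".
        centered = loglik_samples - loglik_samples.mean(axis=0, keepdims=True)
        W = centered.T @ centered / (loglik_samples.shape[0] - 1)
        evals, evecs = np.linalg.eigh(W)              # symmetric eigendecomposition
        order = np.argsort(evals)[::-1][:n_components]
        return W, evals[order], evecs[:, order]

The paper's incomplete Cholesky route would avoid forming and decomposing the full $n \times n$ matrix; the dense eigendecomposition above is only for illustration.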


Favour: FAst Variance Operator for Uncertainty Rating. (arXiv:2311.13036v1 [cs.LG]) arxiv.org/abs/2311.13036


Bayesian Neural Networks (BNNs) have emerged as a crucial approach for interpreting ML predictions. By sampling from the posterior distribution, data scientists may estimate the uncertainty of an inference. Unfortunately, many inference samples are often needed, and the resulting overhead greatly hinders the wide adoption of BNNs. To mitigate this, previous work proposed propagating the first and second moments of the posterior directly through the network. However, on its own this method is even slower than sampling, so the propagated variance needs to be approximated, for example by assuming independence between neural nodes. The resulting trade-off between quality and inference time did not match even that of plain Monte Carlo sampling. Our contribution is a more principled variance propagation framework based on "spiked covariance matrices", which smoothly interpolates between quality and inference time. This is made possible by a new fast algorithm for updating a diagonal-plus-low-rank matrix approximation under various operations. We tested our algorithm against sampling-based MC Dropout and Variational Inference on a number of downstream uncertainty-themed tasks, such as calibration and out-of-distribution testing. We find that Favour is as fast as performing 2-3 inference samples while matching the performance of 10-100 samples. In summary, this work enables the use of BNNs in performance-critical tasks where they have previously been out of reach.
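
As a rough illustration of what a "spiked" (diagonal-plus-low-rank) covariance buys, here is how such an approximation passes through a linear layer. This is a generic sketch under my own assumptions, not Favour's operator; in particular the re-compression after nonlinearities, where the paper's fast update algorithm matters, is only hinted at in the comments.

    import numpy as np

    def propagate_linear(A, b, mean, diag, U):
        # Push a Gaussian approximation N(mean, diag(diag) + U @ U.T)
        # through the affine map z = A @ x + b.
        new_mean = A @ mean + b
        new_U = A @ U                          # low-rank factor maps exactly
        dense = (A * diag) @ A.T               # A diag(d) A^T is generally dense...
        new_diag = np.diag(dense).copy()       # ...so keep only its diagonal here;
        # folding the dropped off-diagonal mass back into the low-rank factor is
        # the kind of update the paper's algorithm performs efficiently.
        return new_mean, new_diag, new_U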


Sensitivity analysis with multiple treatments and multiple outcomes with applications to air pollution mixtures. (arXiv:2311.12252v1 [stat.ME]) arxiv.org/abs/2311.12252


Understanding the health impacts of air pollution is vital in public health research. Numerous studies have estimated negative health effects of a variety of pollutants, but accurately gauging these impacts remains challenging due to the potential for unmeasured confounding bias, which is ubiquitous in observational studies. In this study, we develop a framework for sensitivity analysis in settings with multiple treatments and multiple outcomes simultaneously. This setting is of particular interest because, under a factor confounding assumption, one can identify the strength of association between the unmeasured confounders and both the treatments and the outcomes. This yields informative bounds on the causal effects, leading to partial identification regions for the effects of multivariate treatments that account for the maximum possible bias from unmeasured confounding. We also show that when negative controls are available, the partial identification regions can be refined substantially and, in certain cases, the causal effect can even be identified in the presence of unmeasured confounding. We derive partial identification regions for general estimands in this setting and develop a novel computational approach to finding these regions.
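
To make the factor confounding idea concrete, here is a generic linear sketch (my notation, not the paper's model): with multivariate treatments $T \in \mathbb{R}^p$, outcomes $Y \in \mathbb{R}^q$, and unmeasured confounders $U \in \mathbb{R}^k$, one may posit $T = B U + \varepsilon_T$ and $Y = \Theta^\top T + \Gamma U + \varepsilon_Y$. The factor structure in $T$ constrains how strongly $U$ can act on the treatments, which is what yields bounds on the treatment effects $\Theta$ rather than point identification, and negative controls (components of $\Theta$ known a priori to be zero) tighten those bounds.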


The limitation of neural nets for approximation and optimization. (arXiv:2311.12253v1 [cs.LG]) arxiv.org/abs/2311.12253


We are interested in assessing the use of neural networks as surrogate models to approximate and minimize objective functions in optimization problems. While neural networks are widely used for machine learning tasks such as classification and regression, their application to solving optimization problems has been limited. Our study begins by determining the best activation function for approximating the objective functions of popular nonlinear optimization test problems, and the evidence provided shows that SiLU has the best performance. We then analyze the accuracy of function value, gradient, and Hessian approximations for such objective functions obtained through interpolation/regression models and neural networks. Compared to interpolation/regression models, neural networks can deliver competitive zero- and first-order approximations (at a high training cost) but underperform on second-order approximation. However, it is shown that combining a neural net activation function with the natural basis for quadratic interpolation/regression can remove the need to include cross terms in the natural basis, leading to models with fewer parameters to determine. Lastly, we provide evidence that the performance of a state-of-the-art derivative-free optimization algorithm can hardly be improved when the gradient of an objective function is approximated using any of the surrogate models considered, including neural networks.
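
To give a sense of the parameter savings being described (a generic sketch of the idea, not the authors' construction; the projection parameters W and b below are placeholders): the full quadratic basis in $d$ variables has $(d+1)(d+2)/2$ terms (66 for $d=10$), whereas dropping the cross terms leaves $2d+1$ terms (21 for $d=10$), with SiLU-activated features standing in for the missing interactions.

    import numpy as np

    def silu(z):
        # SiLU activation: z * sigmoid(z)
        return z / (1.0 + np.exp(-z))

    def reduced_quadratic_features(X, W=None, b=0.0):
        # Quadratic regression basis without cross terms (1, x_i, x_i^2),
        # optionally augmented with SiLU-activated linear features.
        n, d = X.shape
        feats = [np.ones((n, 1)), X, X ** 2]          # 2d + 1 columns
        if W is not None:
            feats.append(silu(X @ W + b))             # placeholder activation features
        return np.hstack(feats)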


Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity. (arXiv:2311.12267v1 [cs.LG]) arxiv.org/abs/2311.12267


This paper studies causal representation learning, the task of recovering high-level latent variables and their causal relationships from the low-level data that we observe, assuming access to observations generated from multiple environments. While existing works are able to prove full identifiability of the underlying data-generating process, they typically assume access to single-node hard interventions, which is rather unrealistic in practice. The main contribution of this paper is to characterize a notion of identifiability which is provably the best one can achieve when hard interventions are not available. First, for linear causal models, we provide an identifiability guarantee for data observed from general environments without assuming any similarities between them. While the causal graph is shown to be fully recovered, the latent variables are only identified up to an effect-domination ambiguity (EDA). We then propose an algorithm, LiNGCReL, which is guaranteed to recover the ground-truth model up to the EDA, and we demonstrate its effectiveness via numerical experiments. Moving on to general non-parametric causal models, we prove the same identifiability guarantee assuming access to groups of soft interventions. Finally, we provide counterparts of our identifiability results, indicating that the EDA is essentially inevitable in our setting.
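
For orientation, a generic linear version of the setup (my notation, not necessarily the paper's) takes latent variables $Z \in \mathbb{R}^k$ generated by a linear structural model $Z = A Z + \varepsilon$ over a causal DAG, with observations $X = G Z$ for an unknown mixing matrix $G$, and each environment $e$ altering some noise distributions or mechanisms rather than applying a single-node hard intervention. Identifiability then means recovering the DAG and the latent variables from the pooled environments, here only up to the stated effect-domination ambiguity.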


Sample size calculation based on the difference in restricted mean time lost for clinical trials with competing risks. (arXiv:2311.12293v1 [stat.ME]) arxiv.org/abs/2311.12293


Computation of the sample size is important when designing clinical trials. The presence of competing risks makes the design of clinical trials with time-to-event endpoints cumbersome. A model based on the subdistribution hazard ratio (SHR) is commonly used for trials under competing risks. However, this approach has some limitations related to model assumptions and clinical interpretation. Considering these limitations, the difference in restricted mean time lost (RMTLd) is recommended as an alternative indicator. In this paper, we propose a sample size calculation method based on the RMTLd under the Weibull distribution (RMTLdWeibull) for clinical trials, which accommodates experimental conditions such as equal allocation, uniform accrual, uniform loss to follow-up, and administrative censoring. Simulation results show that sample size calculation based on the RMTLdWeibull can generally achieve a predefined power level and remains relatively robust. Moreover, the performance of the sample size calculation based on the RMTLdWeibull is similar or superior to that based on the SHR. Even if the event time does not follow the Weibull distribution, the sample size calculation based on the RMTLdWeibull still performs well, which further verifies the method's performance. In view of these results, as well as clinical interpretation, application conditions, and statistical performance, we recommend that the RMTLd indicator be applied for sample size calculation and subsequent effect size measurement when designing clinical trials in the presence of competing risks.
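
For context, the effect measure itself is simple: the restricted mean time lost for the event of interest up to a horizon $\tau$ is the area under its cumulative incidence function, $\mathrm{RMTL}(\tau) = \int_0^\tau F_1(t)\,dt$, and the RMTLd is the between-arm difference of this quantity. A small Python sketch follows; the inputs are placeholders (e.g. a Weibull-based or estimated cumulative incidence curve), and this is not the paper's sample size formula.

    import numpy as np

    def rmtl(event_times, cif_values, tau, n_grid=1000):
        # Restricted mean time lost up to tau: the area under the cumulative
        # incidence function F1, given as a right-continuous step function.
        grid = np.linspace(0.0, tau, n_grid)
        idx = np.searchsorted(event_times, grid, side="right") - 1
        F1 = np.where(idx >= 0, np.asarray(cif_values)[np.maximum(idx, 0)], 0.0)
        return np.trapz(F1, grid)

    # RMTLd is then rmtl(times_ctrl, cif_ctrl, tau) - rmtl(times_trt, cif_trt, tau).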


A unified consensus-based parallel ADMM algorithm for high-dimensional regression with combined regularizations. (arXiv:2311.12319v1 [stat.ML]) arxiv.org/abs/2311.12319


The parallel alternating direction method of multipliers (ADMM) algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical learning models. However, there is currently limited research on parallel algorithms specifically designed for high-dimensional regression with combined (composite) regularization terms. Such terms, including the elastic net, sparse group lasso, sparse fused lasso, and their nonconvex variants, have gained significant attention in various fields due to their ability to incorporate prior information and promote sparsity within specific groups or fused variables. The scarcity of parallel algorithms for combined regularizations can be attributed to the inherent nonsmoothness and complexity of these terms, as well as the absence of closed-form solutions for certain proximal operators associated with them. In this paper, we propose a unified constrained-optimization formulation based on the consensus problem for these convex and nonconvex regression problems and derive the corresponding parallel ADMM algorithms. Furthermore, we prove that the proposed algorithm not only converges globally but also exhibits a linear convergence rate. Extensive simulation experiments, along with a financial example, demonstrate the reliability, stability, and scalability of our algorithm. The R package implementing the proposed algorithms can be obtained at https://github.com/xfwu1016/CPADMM.
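
For readers unfamiliar with the consensus formulation the paper builds on, here is a minimal Python sketch of consensus ADMM for a plain lasso with data split across blocks (one block per worker). It is not the paper's unified algorithm, whose z-update handles combined regularizers rather than the simple soft-thresholding used here, and the authors' actual implementation is the CPADMM R package linked above.

    import numpy as np

    def soft_threshold(v, kappa):
        return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

    def consensus_lasso_admm(X_blocks, y_blocks, lam, rho=1.0, n_iter=200):
        # min sum_i 0.5*||X_i x_i - y_i||^2 + lam*||z||_1  s.t. x_i = z (consensus)
        M = len(X_blocks)
        p = X_blocks[0].shape[1]
        x = [np.zeros(p) for _ in range(M)]
        u = [np.zeros(p) for _ in range(M)]
        z = np.zeros(p)
        chol = [np.linalg.cholesky(Xb.T @ Xb + rho * np.eye(p)) for Xb in X_blocks]
        Xty = [Xb.T @ yb for Xb, yb in zip(X_blocks, y_blocks)]
        for _ in range(n_iter):
            for i in range(M):                           # local updates (parallel across workers)
                rhs = Xty[i] + rho * (z - u[i])
                x[i] = np.linalg.solve(chol[i].T, np.linalg.solve(chol[i], rhs))
            z = soft_threshold(np.mean([xi + ui for xi, ui in zip(x, u)], axis=0),
                               lam / (rho * M))          # global consensus + l1 proximal step
            for i in range(M):
                u[i] += x[i] - z                         # dual updates
        return z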
