Estimand-based Inference in Presence of Long-Term Survivors arxiv.org/abs/2409.02209 .ME

In this article, we develop nonparametric inference methods for comparing survival data across two samples, which are beneficial for clinical trials of novel cancer therapies where long-term survival is a critical outcome. These therapies, including immunotherapies and other advanced treatments, aim to establish durable effects. They often exhibit distinct survival patterns, such as crossing or delayed separation and potential leveling-off at the tails of the survival curves, clearly violating the proportional hazards assumption and rendering the hazard ratio inappropriate for measuring treatment effects. The proposed methodology utilizes the mixture cure framework to separately analyze the cure rates of long-term survivors and the survival functions of susceptible individuals. We evaluate a nonparametric estimator of the susceptible survival function in the one-sample setting. Under sufficient follow-up, it is expressed as a location-scale-shift variant of the Kaplan-Meier (KM) estimator. It retains several desirable features of the KM estimator, including inverse-probability-of-censoring weighting, product-limit estimation, self-consistency, and nonparametric efficiency. In scenarios of insufficient follow-up, it can easily be adapted by incorporating a suitable cure rate estimator. In the two-sample setting, besides using the difference in cure rates to measure the long-term effect, we propose a graphical estimand to compare the relative treatment effects on the susceptible subgroups. This estimand, inspired by Kendall's tau, compares the order of survival times among susceptible individuals. The large-sample properties of the proposed methods are derived for further inference, and their finite-sample properties are examined through extensive simulation studies. The proposed methodology is applied to analyze digitized data from the CheckMate 067 immunotherapy clinical trial.
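Under sufficient follow-up, the location-scale-shift relation can be sketched directly: in the mixture cure model S(t) = pi + (1 - pi) * S_s(t), so a plug-in susceptible survival estimate is (S_hat(t) - pi_hat) / (1 - pi_hat), with the cure rate pi_hat read off the KM plateau. A minimal illustration, not the paper's full estimator; the function names and the plateau-based cure-rate estimate are illustrative assumptions:

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimator.
    times: observed times; events: 1 = event, 0 = censored.
    Returns a list of (event time, survival probability) pairs."""
    order = np.argsort(times)
    t, d = np.asarray(times)[order], np.asarray(events)[order]
    surv, s = [], 1.0
    for u in np.unique(t[d == 1]):
        at_risk = np.sum(t >= u)                 # risk-set size at u
        deaths = np.sum((t == u) & (d == 1))
        s *= 1.0 - deaths / at_risk              # product-limit update
        surv.append((u, s))
    return surv

def susceptible_survival(times, events):
    """Location-scale shift of KM: S_s(t) = (S(t) - pi) / (1 - pi),
    with the cure rate pi estimated by the KM plateau, which is only
    sensible under sufficient follow-up."""
    km = kaplan_meier(times, events)
    pi_hat = km[-1][1]                           # plateau = cure-rate estimate
    return [(u, (s - pi_hat) / (1.0 - pi_hat)) for u, s in km]
```

With insufficient follow-up the plateau under-states the cure rate, which is where the abstract's "suitable cure rate estimator" would be substituted for `pi_hat`.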

Simulation-calibration testing for inference in Lasso regressions arxiv.org/abs/2409.02269 .ME .ST .CO .TH

We propose a test of the significance of a variable appearing on the Lasso path and use it in a procedure for selecting one of the models of the Lasso path, controlling the Family-Wise Error Rate. Our null hypothesis depends on a set A of already selected variables and states that it contains all the active variables. We focus on the regularization parameter value from which a first variable outside A is selected. As the test statistic, we use this quantity's conditional p-value, which we define conditional on the non-penalized estimated coefficients of the model restricted to A. We estimate this by simulating outcome vectors and then calibrating them on the observed outcome's estimated coefficients. We adapt the calibration heuristically to the case of generalized linear models in which it turns into an iterative stochastic procedure. We prove that the test controls the risk of selecting a false positive in linear models, both under the null hypothesis and, under a correlation condition, when A does not contain all active variables. We assess the performance of our procedure through extensive simulation studies. We also illustrate it in the detection of exposures associated with drug-induced liver injuries in the French pharmacovigilance database.
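The simulation idea for linear models can be sketched roughly: simulate outcome vectors from the null model restricted to A and compare their entry statistics with the observed one. This simplification omits the paper's calibration on the observed outcome's estimated coefficients and the iterative GLM variant; the function names and the residual-correlation proxy for the entry point are assumptions:

```python
import numpy as np

def entry_statistic(X, y, A):
    """Largest absolute correlation of the variables outside A with the
    residual of the OLS fit restricted to A -- a proxy for the Lasso
    regularization level at which a first variable outside A enters."""
    XA = X[:, A]
    beta_A, *_ = np.linalg.lstsq(XA, y, rcond=None)
    fitted = XA @ beta_A
    resid = y - fitted
    outside = [j for j in range(X.shape[1]) if j not in A]
    return np.max(np.abs(X[:, outside].T @ resid)), fitted, resid

def simulation_pvalue(X, y, A, n_sim=500, rng=None):
    """Monte-Carlo p-value: simulate outcomes under the null model
    restricted to A and count how often the simulated entry statistic
    exceeds the observed one."""
    rng = np.random.default_rng(rng)
    obs, fitted, resid = entry_statistic(X, y, A)
    sigma = np.std(resid, ddof=len(A))           # residual scale estimate
    count = 0
    for _ in range(n_sim):
        y_star = fitted + sigma * rng.standard_normal(len(y))
        stat, _, _ = entry_statistic(X, y_star, A)
        count += stat >= obs
    return (1 + count) / (1 + n_sim)             # add-one MC correction
```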

Generative Principal Component Regression via Variational Inference arxiv.org/abs/2409.02327 .ML .LG

The ability to manipulate complex systems, such as the brain, to modify specific outcomes has far-reaching implications, particularly in the treatment of psychiatric disorders. One approach to designing appropriate manipulations is to target key features of predictive models. While generative latent variable models, such as probabilistic principal component analysis (PPCA), are powerful tools for identifying targets, they struggle to incorporate information relevant to low-variance outcomes into the latent space. When stimulation targets are designed on the latent space in such a scenario, the intervention can be suboptimal, with minimal efficacy. To address this problem, we develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces the representation of such information in the latent space. The novel objective can be used with linear models, such as PPCA, which we refer to as generative principal component regression (gPCR). We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs. As part of these simulations, we develop a metric for detecting when relevant information is not properly incorporated into the loadings. We then show in two neural datasets related to stress and social behavior that gPCR dramatically outperforms PCR in predictive performance and that SVAEs exhibit low incorporation of relevant information into the loadings. Overall, this work suggests that our method significantly improves target selection for manipulation using latent variable models over competitor inference schemes.
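As a rough linear analogue of the idea, and emphatically not the paper's SVAE objective, one can force outcome-relevant but low-variance directions into the leading loadings by appending a scaled copy of the outcome to the data matrix before the SVD and then regressing the outcome on the resulting scores. The `weight` knob and the function name are illustrative assumptions:

```python
import numpy as np

def gpcr_sketch(X, y, k=2, weight=5.0):
    """Illustrative supervised PCR: append the (scaled) outcome as an
    extra column before the SVD, so directions predictive of y enter the
    leading components even when they carry little variance in X, then
    regress the centered outcome on the k latent scores."""
    Xc = X - X.mean(axis=0)
    yc = (y - y.mean()).reshape(-1, 1)
    aug = np.hstack([Xc, weight * yc])       # outcome-augmented data matrix
    U, S, Vt = np.linalg.svd(aug, full_matrices=False)
    Z = U[:, :k] * S[:k]                     # latent scores
    coef, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    return Z, coef
```

Setting `weight=0` recovers ordinary PCR on the centered data, which makes the failure mode in the abstract easy to reproduce: a low-variance, outcome-aligned direction drops out of the top-k loadings.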

Optimal sampling for least-squares approximation arxiv.org/abs/2409.02342 .ML .NA .LG .NA

Least-squares approximation is one of the most important methods for recovering an unknown function from data. While in many applications the data is fixed, in many others there is substantial freedom to choose where to sample. In this paper, we review recent progress on optimal sampling for (weighted) least-squares approximation in arbitrary linear spaces. We introduce the Christoffel function as a key quantity in the analysis of (weighted) least-squares approximation from random samples, then show how it can be used to construct sampling strategies that possess near-optimal sample complexity: namely, the number of samples scales log-linearly in $n$, the dimension of the approximation space. We discuss a series of variations, extensions and further topics, and throughout highlight connections to approximation theory, machine learning, information-based complexity and numerical linear algebra. Finally, motivated by various contemporary applications, we consider a generalization of the classical setting where the samples need not be pointwise samples of a scalar-valued function, and the approximation space need not be linear. We show that even in this significantly more general setting suitable generalizations of the Christoffel function still determine the sample complexity. This provides a unified procedure for designing improved sampling strategies for general recovery problems. This article is largely self-contained, and intended to be accessible to nonspecialists.
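The Christoffel-function strategy can be illustrated for the space of polynomials of degree < n on [-1, 1]: draw samples from the density proportional to the inverse Christoffel function k(x) = sum_j phi_j(x)^2, weight each sample by n / k(x), and solve the weighted least-squares problem. A small sketch; the rejection sampler and the Legendre basis are implementation choices, not prescribed by the article:

```python
import numpy as np
from numpy.polynomial import legendre

def basis(x, n):
    """Legendre basis, orthonormalized w.r.t. the uniform measure on [-1, 1]."""
    V = legendre.legvander(x, n - 1)
    return V * np.sqrt(2 * np.arange(n) + 1)

def christoffel_sample(n, m, rng=None):
    """Draw m points from the density proportional to the inverse
    Christoffel function k(x) = sum_j phi_j(x)^2 by rejection sampling,
    returning them with the weights w(x) = n / k(x)."""
    rng = np.random.default_rng(rng)
    kmax = n ** 2                          # k attains its maximum n^2 at x = +-1
    out = []
    while len(out) < m:
        x = rng.uniform(-1, 1)
        k = np.sum(basis(np.array([x]), n) ** 2)
        if rng.uniform() < k / kmax:
            out.append((x, n / k))
    xs, ws = map(np.array, zip(*out))
    return xs, ws

def weighted_lsq(f, n, m, rng=None):
    """Weighted least-squares fit of f in the degree-(n-1) polynomial
    space from m Christoffel-weighted samples."""
    xs, ws = christoffel_sample(n, m, rng)
    A = basis(xs, n) * np.sqrt(ws)[:, None]
    b = f(xs) * np.sqrt(ws)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c, xs
```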

Distribution-Based Sub-Population Selection (DSPS): A Method for in-Silico Reproduction of Clinical Trials Outcomes arxiv.org/abs/2409.00232 .AP

Background and Objective: Diabetes presents a significant challenge to healthcare due to the negative impact of poor blood sugar control on health and its associated complications. Computer simulation platforms, notably exemplified by the UVA/Padova Type 1 Diabetes simulator, have emerged as promising tools for advancing diabetes treatments by simulating patient responses in a virtual environment. The UVA Virtual Lab (UVLab) is a new simulation platform that mimics the metabolic behavior of people with Type 2 diabetes (T2D) with a large population of 6062 virtual subjects. Methods: This work introduces the Distribution-Based Sub-Population Selection (DSPS) method, a systematic approach to identifying virtual subsets that mimic the clinical behavior observed in real trials. The method transforms the sub-population selection task into a Linear Programming problem, enabling the identification of the largest representative virtual cohort. This selection process centers on key clinical outcomes in diabetes research, such as HbA1c and Fasting Plasma Glucose (FPG), ensuring that the statistical properties (moments) of the selected virtual sub-population closely resemble those observed in real-world clinical trials. Results: The DSPS method was applied to the insulin degludec (IDeg) arm of a phase 3 clinical trial. It selected a sub-population of virtual subjects that closely mirrored the clinical trial data across multiple key metrics, including glycemic efficacy, insulin dosages, and cumulative hypoglycemia events over a 26-week period. Conclusion: The DSPS algorithm is able to select a virtual sub-population within UVLab that reproduces and predicts the outcomes of a clinical trial. This statistical method can bridge the gap between large population simulation platforms and previously conducted clinical trials.
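The Linear Programming formulation can be sketched in a stripped-down form: maximize 0-1 bounded (i.e. relaxed) inclusion weights subject to the weighted feature means staying within a tolerance of the trial targets. The ratio constraint |sum_i w_i (x_ij - m_j)| / sum_i w_i <= tol becomes linear after multiplying through by sum_i w_i. This toy version matches only first moments of generic features, whereas the paper matches moments of outcomes such as HbA1c and FPG, so treat it as an illustrative assumption:

```python
import numpy as np
from scipy.optimize import linprog

def dsps_sketch(features, targets, tol=0.05):
    """Relaxed LP for sub-population selection: maximize the cohort size
    (sum of inclusion weights w in [0, 1]) while keeping each weighted
    feature mean within `tol` of its clinical-trial target.
    features: (n_subjects, n_features); targets: (n_features,)."""
    n, p = features.shape
    A_ub, b_ub = [], []
    for j in range(p):
        dev = features[:, j] - targets[j]
        A_ub.append(dev - tol)     # sum w_i (x_ij - m_j) <=  tol * sum w_i
        A_ub.append(-dev - tol)    # sum w_i (x_ij - m_j) >= -tol * sum w_i
        b_ub += [0.0, 0.0]
    res = linprog(c=-np.ones(n),   # linprog minimizes, so negate
                  A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=[(0, 1)] * n)
    return res.x
```

Rounding or thresholding the fractional weights would give a hard subject selection; the LP value upper-bounds the achievable cohort size.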

Response probability distribution estimation of expensive computer simulators: A Bayesian active learning perspective using Gaussian process regression arxiv.org/abs/2409.00407 .CO

Estimation of the response probability distributions of computer simulators in the presence of randomness is a crucial task in many fields. However, achieving this task with guaranteed accuracy remains an open computational challenge, especially for expensive-to-evaluate computer simulators. In this work, a Bayesian active learning perspective is presented to address the challenge, based on Gaussian process (GP) regression. First, estimation of the response probability distributions is conceptually interpreted as a Bayesian inference problem, as opposed to frequentist inference. This interpretation provides several important benefits: (1) it quantifies and propagates discretization error probabilistically; (2) it incorporates prior knowledge of the computer simulator; and (3) it enables the effective reduction of numerical uncertainty in the solution to a prescribed level. The conceptual Bayesian idea is then realized using GP regression, where we derive the posterior statistics of the response probability distributions in semi-analytical form and also provide a numerical solution scheme. Based on this practical Bayesian approach, a Bayesian active learning (BAL) method is further proposed for estimating the response probability distributions. In this context, the key contribution lies in the development of two crucial components for active learning, i.e., the stopping criterion and the learning function, by taking advantage of the posterior statistics. It is empirically demonstrated on five numerical examples that the proposed BAL method can efficiently estimate the response probability distributions with the desired accuracy.
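A generic variance-driven version of such a loop is easy to sketch with a GP surrogate: refit, evaluate a learning function over candidate inputs, query the simulator where it peaks, and stop once posterior uncertainty is small. The paper's learning function and stopping criterion target posterior statistics of the response probability distribution itself; the plain posterior-standard-deviation versions below are simplifying assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bal_loop(simulator, candidates, n_init=5, tol=1e-3, max_iter=30, rng=None):
    """Bayesian active learning with a GP surrogate for an
    expensive-to-evaluate simulator over a 1-D candidate grid."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(candidates), n_init, replace=False)
    X = candidates[idx].reshape(-1, 1)
    y = np.array([simulator(x[0]) for x in X])
    for _ in range(max_iter):
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
        mu, sd = gp.predict(candidates.reshape(-1, 1), return_std=True)
        if sd.max() < tol:                    # stopping criterion
            break
        x_new = candidates[np.argmax(sd)]     # learning function: max post. sd
        X = np.vstack([X, [[x_new]]])
        y = np.append(y, simulator(x_new))    # one expensive simulator call
    return gp, X, y
```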

Bayesian nonparametric mixtures of categorical directed graphs for heterogeneous causal inference arxiv.org/abs/2409.00453 .ME .AP

Quantifying the causal effects of exposures on outcomes, such as a treatment and a disease respectively, is a crucial issue in medical science for the administration of effective therapies. Importantly, any related causal analysis should account for all those variables, e.g. clinical features, that can act as risk factors involved in the occurrence of a disease. In addition, the selection of targeted strategies for therapy administration requires quantifying such treatment effects at the individual level rather than at the population level. We address these issues by proposing a methodology based on categorical Directed Acyclic Graphs (DAGs), which provide an effective tool to infer causal relationships and causal effects between variables. In addition, we account for population heterogeneity by considering a Dirichlet Process mixture of categorical DAGs, which clusters individuals into homogeneous groups characterized by common causal structures, dependence parameters and causal effects. We develop computational strategies for Bayesian posterior inference, from which a battery of causal effects at the subject-specific level is recovered. Our methodology is evaluated through simulations and applied to a dataset of breast cancer patients to investigate cardiotoxic side effects that can be induced by the administered anticancer therapies.
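For a single categorical DAG, the causal effects being recovered are the familiar truncated-factorization (back-door) quantities: e.g. for the DAG Z -> X -> Y with Z -> Y, P(Y = 1 | do(X = x)) = sum_z P(z) P(Y = 1 | x, z). A toy sketch of that computation, with the Dirichlet Process mixture machinery deliberately not reproduced:

```python
def backdoor_effect(p_z, p_y_given_xz, x):
    """Back-door adjustment in the categorical DAG Z -> X -> Y, Z -> Y:
    P(Y = 1 | do(X = x)) = sum_z P(z) * P(Y = 1 | X = x, Z = z).
    p_z: list of P(Z = z); p_y_given_xz: dict mapping (x, z) to
    P(Y = 1 | x, z)."""
    return sum(p_z[z] * p_y_given_xz[(x, z)] for z in range(len(p_z)))
```

In the mixture setting each cluster carries its own DAG and parameters, so a subject-specific effect would be this quantity averaged over that subject's posterior cluster allocation.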
