Variable selection for partially linear single-index varying-coefficient model arxiv.org/abs/2412.13468 .ST .TH

Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study arxiv.org/abs/2412.13570 .AP .ML

Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options, as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.
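
A minimal sketch (not the paper's study) of the impute-then-learn workflow the abstract describes: impute missing values first, then fit a tree-based learner and rank features by importance. scikit-learn's IterativeImputer loosely stands in for MICE here; the paper itself uses the R packages MICE, missRanger and mixGBoost, so everything below is illustrative only.

```python
# Impute first, then learn, then rank features by importance.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X[rng.random(X.shape) < 0.2] = np.nan                 # inject 20% missingness

X_imputed = IterativeImputer(random_state=0).fit_transform(X)   # impute first
rf = RandomForestRegressor(random_state=0).fit(X_imputed, y)    # then learn
ranking = np.argsort(rf.feature_importances_)[::-1]             # importance ranking
print(ranking[:3])   # the informative features 0 and 1 should rank near the top
```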

A Model-Based Clustering Approach for Bounded Data Using Transformation-Based Gaussian Mixture Models arxiv.org/abs/2412.13572 .ME

The clustering of bounded data presents unique challenges in statistical analysis due to the constraints imposed on the data values. This paper introduces a novel method for model-based clustering specifically designed for bounded data. Building on the transformation-based approach to Gaussian mixture density estimation introduced by Scrucca (2019), we extend this framework to develop a probabilistic clustering algorithm for data with bounded support that allows for accurate clustering while respecting the natural bounds of the variables. In our proposal, a flexible range-power transformation is employed to map the data from its bounded domain to the unrestricted real space, hence enabling the estimation of Gaussian mixture models in the transformed space. This approach leads to improved cluster recovery and interpretation, especially for complex distributions within bounded domains. The performance of the proposed method is evaluated through real-world data applications involving both fully and partially bounded data, in both univariate and multivariate settings. The results demonstrate the effectiveness and advantages of our approach over traditional and advanced model-based clustering techniques that employ distributions with bounded support.
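
An illustrative sketch only of the transform-then-cluster idea for data on the unit interval: a fixed logit map stands in for the flexible range-power transformation of Scrucca (2019) used in the paper, and a Gaussian mixture is then fitted in the transformed space.

```python
# Map bounded data to the real line, then cluster with a Gaussian mixture there.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# two clusters on the unit interval (fully bounded support)
x = np.concatenate([rng.beta(2, 8, 300), rng.beta(8, 2, 300)]).reshape(-1, 1)

z = np.log(x / (1.0 - x))                       # logit: (0, 1) -> real line
gmm = GaussianMixture(n_components=2, random_state=1).fit(z)
labels = gmm.predict(z)                         # clustering in transformed space
print(np.bincount(labels))                      # roughly 300 points per cluster
```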

Time-Reversible Bridges of Data with Machine Learning arxiv.org/abs/2412.13665 .ML .LG

The analysis of dynamical systems is a fundamental tool in the natural sciences and engineering. It is used to understand the evolution of systems as large as entire galaxies and as small as individual molecules. With predefined conditions on the evolution of dynamical systems, the underlying differential equations have to fulfill specific constraints in time and space. This class of problems is known as boundary value problems. This thesis presents novel approaches to learn time-reversible deterministic and stochastic dynamics constrained by initial and final conditions. The dynamics are inferred by machine learning algorithms from observed data, which is in contrast to the traditional approach of solving differential equations by numerical integration. The work in this thesis examines a set of problems of increasing difficulty, each of which is concerned with learning a different aspect of the dynamics. Initially, we consider learning deterministic dynamics from ground truth solutions which are constrained by deterministic boundary conditions. Secondly, we study a boundary value problem in discrete state spaces, where the forward dynamics follow a stochastic jump process and the boundary conditions are discrete probability distributions. In particular, the stochastic dynamics of a specific jump process, the Ehrenfest process, are considered, and the reverse-time dynamics are inferred with machine learning. Finally, we investigate the problem of inferring the dynamics of a continuous-time stochastic process between two probability distributions without any reference information. Here, we propose a novel criterion to learn time-reversible dynamics of two stochastic processes to solve the Schrödinger Bridge Problem.
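
The thesis itself is about learning such dynamics from data; purely to fix ideas about the forward process, here is a small Gillespie-style simulation of the Ehrenfest jump process mentioned in the abstract, under one common parameterization in which each of N particles switches urns at unit rate.

```python
# Forward simulation of the Ehrenfest process: k counts particles in urn 1,
# so the jump rates out of state k are (N - k) upward and k downward.
import numpy as np

def simulate_ehrenfest(N=50, k0=0, t_max=10.0, seed=0):
    rng = np.random.default_rng(seed)
    t, k, path = 0.0, k0, [(0.0, k0)]
    while t < t_max:
        up, down = N - k, k                       # jump rates out of state k
        t += rng.exponential(1.0 / (up + down))   # waiting time to the next jump
        k += 1 if rng.random() < up / (up + down) else -1
        path.append((t, k))
    return path

path = simulate_ehrenfest()
print(path[-1])   # the state drifts toward the equilibrium region around N / 2
```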

Reliability analysis for non-deterministic limit-states using stochastic emulators arxiv.org/abs/2412.13731 .CO .ME .ML

Reliability analysis is a sub-field of uncertainty quantification that assesses the probability of a system performing as intended under various uncertainties. Traditionally, this analysis relies on deterministic models, where experiments are repeatable, i.e., they produce consistent outputs for a given set of inputs. However, real-world systems often exhibit stochastic behavior, leading to non-repeatable outcomes. These so-called stochastic simulators produce different outputs each time the model is run, even with fixed inputs. This paper formally introduces reliability analysis for stochastic models and addresses it by using suitable surrogate models to lower its typically high computational cost. Specifically, we focus on the recently introduced generalized lambda models and stochastic polynomial chaos expansions. These emulators are designed to learn the inherent randomness of the simulator's response and enable efficient uncertainty quantification at a much lower cost than traditional Monte Carlo simulation. We validate our methodology through three case studies. First, using an analytical function with a closed-form solution, we demonstrate that the emulators converge to the correct solution. Second, we present results obtained from the surrogates using a toy example of a simply supported beam. Finally, we apply the emulators to perform reliability analysis on a realistic wind turbine case study, where only a dataset of simulation results is available.
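
To make the target quantity concrete: reliability analysis estimates a failure probability P(g <= 0), and for a stochastic simulator the limit-state g is random even at a fixed input. The toy simulator and threshold below are invented; the paper's point is to replace this brute-force Monte Carlo loop with generalized lambda models or stochastic polynomial chaos expansions as emulators.

```python
# Plain Monte Carlo estimate of P_f = P(g(X, Z) <= 0) for a stochastic limit-state.
import numpy as np

rng = np.random.default_rng(2)

def stochastic_limit_state(x, rng):
    # non-repeatable simulator: latent noise makes repeated runs at the
    # same input x return different outputs
    return 3.0 - x**2 + rng.normal(scale=0.5)

n = 100_000
x = rng.normal(size=n)                                    # input uncertainty
g = np.array([stochastic_limit_state(xi, rng) for xi in x])
p_f = np.mean(g <= 0.0)                                   # fraction of failures
print(f"estimated failure probability: {p_f:.4f}")
```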

Breaching 1.5°C: Give me the odds arxiv.org/abs/2412.13855 .AP .ME

Climate change communication is crucial to raising awareness and motivating action. In the context of breaching the limits set out by the Paris Agreement, we argue that climate scientists should move away from point estimates and towards reporting probabilities. Reporting probabilities will provide policymakers with a range of possible outcomes and will allow them to make informed timely decisions. To achieve this goal, we propose a method to calculate the probability of breaching the limits set out by the Paris Agreement. The method can be summarized as predicting future temperatures under different scenarios and calculating the number of possible outcomes that breach the limits as a proportion of the total number of outcomes. The probabilities can be computed for different time horizons and can be updated as new data become available. As an illustration, we performed a simulation study to investigate the probability of breaching the limits in a statistical model. Our results show that the probability of breaching the 1.5°C limit is already greater than zero for 2024. Moreover, the probability of breaching the limit is greater than 99% by 2042 if no action is taken to reduce greenhouse gas emissions. Our methodology is simple to implement and can easily be extended to more complex models of the climate system. We encourage climate model developers to include the probabilities of breaching the limits in their reports.
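
The counting recipe in the abstract can be made concrete with a toy trend-plus-noise model: simulate many future temperature paths and report the fraction that breach 1.5°C in each year. All numbers below (baseline anomaly, trend, noise level) are invented and are not the paper's fitted model.

```python
# Probability of breaching 1.5°C as the proportion of simulated paths above the limit.
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(2024, 2051)
n_paths = 10_000
anomaly_2023, trend, sigma = 1.3, 0.02, 0.1     # hypothetical parameters

# trajectories: linear warming trend plus interannual noise
paths = anomaly_2023 + trend * (years - 2023) \
    + rng.normal(0.0, sigma, size=(n_paths, years.size))
prob_breach = (paths > 1.5).mean(axis=0)        # proportion of outcomes above 1.5°C

for yr, p in zip(years[::5], prob_breach[::5]):
    print(yr, round(float(p), 3))
```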

On consistent estimation of dimension values arxiv.org/abs/2412.13898 .ST .ML .TH

The problem of estimating, from a random sample of points, the dimension of a compact subset S of the Euclidean space is considered. The emphasis is put on consistency results in the statistical sense. That is, statements of convergence to the true dimension value when the sample size grows to infinity. Among the many available definitions of dimension, we have focused (on the grounds of its statistical tractability) on three notions: the Minkowski dimension, the correlation dimension and the, perhaps less popular, concept of pointwise dimension. We prove the statistical consistency of some natural estimators of these quantities. Our proofs partially rely on the use of an instrumental estimator formulated in terms of the empirical volume function V_n(r), defined as the Lebesgue measure of the set of points whose distance to the sample is at most r. In particular, we explore the case in which the true volume function V(r) of the target set S is a polynomial on some interval starting at zero. An empirical study is also included. Our study aims to provide some theoretical support, and some practical insights, for the problem of deciding whether or not the set S has a dimension smaller than that of the ambient space. This is a major statistical motivation of the dimension studies, in connection with the so-called Manifold Hypothesis.
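
For intuition about the quantities involved, here is a standard slope-based estimate of the correlation dimension from the empirical correlation integral (Grassberger-Procaccia style); the paper's own estimators, and the route via the empirical volume function V_n(r), are related but not identical to this sketch.

```python
# Correlation dimension as the log-log slope of the empirical correlation integral.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
# sample from a circle (intrinsic dimension 1) embedded in R^3 (ambient dimension 3)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
S = np.column_stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)])

d = pdist(S)                                     # all pairwise distances
radii = np.logspace(-2, -0.5, 15)
C = np.array([(d <= r).mean() for r in radii])   # empirical correlation integral
slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
print(f"estimated correlation dimension: {slope:.2f}")   # should be close to 1
```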

KenCoh: A Ranked-Based Canonical Coherence arxiv.org/abs/2412.10521 .ME

In this paper, we consider the problem of characterizing a robust global dependence between two brain regions where each region may contain several voxels or channels. This work is driven by experiments to investigate the dependence between two cortical regions and to identify differences in brain networks between brain states, e.g., alert and drowsy states. The most common approach to exploring dependence between two groups of variables (or signals) is canonical correlation analysis (CCA). However, it is limited to capturing only linear associations and is sensitive to outlier observations. These limitations are crucial because brain network connectivity is likely to be more complex than linear and brain signals may exhibit heavy-tailed properties. To overcome these limitations, we develop a robust method, Kendall canonical coherence (KenCoh), for learning the monotonic connectivity structure among neuronal signals filtered at given frequency bands. Furthermore, we propose a KenCoh-based permutation test to investigate differences in brain network connectivity between two different states. Our simulation study demonstrates that KenCoh is competitive with the traditional variance-covariance estimator and outperforms the latter when the underlying distributions are heavy-tailed. We apply our method to EEG recordings from a virtual-reality driving experiment. Our proposed method yields further insights into the differences in the frontal-parietal cross-dependence network between the alert and drowsy states, and shows that the left-parietal channel drives this dependence in the beta band.
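
A rough sketch of the rank-based flavour of the idea: replace the Pearson correlation matrix in classical CCA with a Kendall-tau correlation matrix and read off canonical correlations from its blocks. This ignores the frequency-band filtering and coherence machinery that KenCoh actually uses; the channel counts and data-generating process below are invented.

```python
# Rank-based canonical correlations from the blocks of a Kendall-tau matrix.
import numpy as np
import pandas as pd

def inv_sqrt(A):
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(5)
n, p, q = 300, 4, 3                                 # samples, channels per region
X = rng.standard_t(df=3, size=(n, p))               # heavy-tailed signals, region 1
Y = 0.5 * X[:, :q] + rng.standard_t(df=3, size=(n, q))   # region 2 depends on region 1

R = pd.DataFrame(np.hstack([X, Y])).corr(method="kendall").to_numpy()
Rxx, Ryy, Rxy = R[:p, :p], R[p:, p:], R[:p, p:]
rho = np.linalg.svd(inv_sqrt(Rxx) @ Rxy @ inv_sqrt(Ryy), compute_uv=False)
print(rho)                                          # leading value: global dependence
```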

Augmented two-stage estimation for treatment crossover in oncology trials: Leveraging external data for improved precision arxiv.org/abs/2412.10563 .ME

Randomized controlled trials (RCTs) in oncology often allow control group participants to cross over to experimental treatments, a practice that, while often ethically necessary, complicates the accurate estimation of long-term treatment effects. When crossover rates are high or sample sizes are limited, commonly used methods for crossover adjustment (such as the rank-preserving structural failure time model, inverse probability of censoring weights, and two-stage estimation (TSE)) may produce imprecise estimates. Real-world data (RWD) can be used to develop an external control arm for the RCT, although this approach ignores evidence from trial subjects who did not cross over and evidence from the data obtained prior to crossover for those subjects who did. This paper introduces augmented two-stage estimation (ATSE), a method that combines data from non-switching participants in an RCT with an external dataset, forming a 'hybrid non-switching arm'. With a simulation study, we evaluate the ATSE method's effectiveness compared to TSE crossover adjustment and an external control arm approach. Results indicate that, relative to TSE and the external control arm approach, ATSE can increase precision and may be less susceptible to bias due to unmeasured confounding.
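
Purely to illustrate the 'hybrid non-switching arm' construction: pool the trial's non-switching control patients with external-control patients and summarize survival in the pooled arm. Kaplan-Meier is a generic stand-in summary here, not the ATSE estimator itself, and all data below are synthetic.

```python
# Build a hybrid non-switching arm and summarize it with a Kaplan-Meier curve.
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate evaluated at the observed event times."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    n_at_risk, surv = len(time), 1.0
    times, survival = [], []
    for t in np.unique(time):
        at_t = time == t
        d = int(event[at_t].sum())            # events at time t
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            times.append(t)
            survival.append(surv)
        n_at_risk -= int(at_t.sum())          # drop events and censorings at t
    return np.array(times), np.array(survival)

rng = np.random.default_rng(8)
non_switchers = rng.exponential(12.0, 40)     # hypothetical trial controls who never switched
external = rng.exponential(12.0, 200)         # hypothetical external (real-world) controls
hybrid_time = np.concatenate([non_switchers, external])   # hybrid non-switching arm
hybrid_event = np.ones_like(hybrid_time, dtype=int)       # no censoring in this toy example
t, s = kaplan_meier(hybrid_time, hybrid_event)
print(t[:3], s[:3])
```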

Statistical Problems in the Diagnosis of Shaken Baby Syndrome/Abusive Head Trauma: Limitations to Algorithms and the Need for Reliable Data arxiv.org/abs/2412.10648 .AP

The medical and legal controversy surrounding the diagnosis of Shaken Baby Syndrome/Abusive Head Trauma (SBS/AHT) raises critical questions about its scientific foundation and reliability. This article argues that SBS/AHT can only be understood by studying the statistical challenges with the data. Current health records are insufficient because there is a lack of ground truth, reliance on circular reasoning, contextual bias, heterogeneity across institutions, and integration of legal decisions into medical assessments. There exists no comprehensive source of legal data. Thus, current data are insufficient to reliably distinguish SBS/AHT from other medical conditions or accidental injuries. A privately collected medico-legal dataset that has the relevant contextual information, but is limited by being a convenience sample, is used to show how a data analysis might be performed with higher-quality data. There is a need for systematic collection of the additional contextual information used by physicians and pathologists to make determinations of abuse. Furthermore, because of the legal nature of the diagnosis, its accuracy, repeatability, and reproducibility must be tested. Better data and evaluation of the scientific validity of SBS/AHT are essential to protect vulnerable children while ensuring fairness and accuracy in legal proceedings involving allegations of abuse.

Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling arxiv.org/abs/2412.10658 .ME .AI .LG

Confidence calibration of classification models is a technique to estimate the true posterior probability of the predicted class, which is critical for ensuring reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but they often fail to fully exploit the prior distribution behind the calibration curve. However, a well-informed prior distribution can provide valuable insights beyond the empirical data in settings with limited data or low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve, which is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of the binomial process. We prove that the calibration curve estimation method is Lipschitz continuous with respect to the data distribution and requires a sample size of $3/B$ of that required for histogram binning, where $B$ represents the number of bins. Also, a new calibration metric ($TCE_{bpm}$), which leverages the estimated calibration curve to estimate the true calibration error (TCE), is designed. $TCE_{bpm}$ is proven to be a consistent calibration measure. Furthermore, realistic calibration datasets can be generated by the binomial process modeling from a preset true calibration curve and confidence score distribution, which can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric is verified on real-world and simulated data.
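
To make the setup concrete: calibration data can be viewed as pairs (c_i, y_i) with y_i | c_i ~ Bernoulli(g(c_i)) for a latent calibration curve g. In the sketch below, isotonic regression is a generic stand-in estimator of g (the paper instead maximizes a binomial-process likelihood), and the mean |g(c) - c| is a crude plug-in proxy for the true calibration error; the 0.1 overconfidence gap is invented.

```python
# Estimate a calibration curve from (confidence, correctness) pairs and a plug-in error.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(6)
c = rng.uniform(0.5, 1.0, 2000)               # reported confidence scores
g_true = np.clip(c - 0.1, 0.0, 1.0)           # hypothetical overconfident classifier
y = rng.binomial(1, g_true)                   # correctness indicators

g_hat = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(c, y)
tce_plugin = np.mean(np.abs(g_hat.predict(c) - c))
print(f"plug-in calibration error: {tce_plugin:.3f}")    # true gap is 0.1
```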

Model checking for high dimensional generalized linear models based on random projections arxiv.org/abs/2412.10721 .ME .ST .TH

Most existing tests in the literature for model checking do not work in high-dimensional settings, due to challenges arising from the "curse of dimensionality" or dependence on the normality of parameter estimators. To address these challenges, we propose a new goodness-of-fit test based on random projections for generalized linear models, when the dimension of covariates may substantially exceed the sample size. The tests only require the convergence rate of parameter estimators to derive the limiting distribution. The dimension is allowed to grow at an exponential rate relative to the sample size. As random projection converts covariates to a one-dimensional space, our tests can detect local alternatives departing from the null at the rate of $n^{-1/2}h^{-1/4}$, where $h$ is the bandwidth and $n$ is the sample size. This sensitive rate does not depend on the dimension of covariates, and thus the "curse of dimensionality" is largely alleviated for our tests. An interesting and unexpected result is that, for randomly chosen projections, the resulting test statistics can be asymptotically independent. We then propose combination methods to enhance the power performance of the tests. Detailed simulation studies and a real data analysis are conducted to illustrate the effectiveness of our methodology.
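
A hedged sketch of the basic ingredient: project the covariates onto a random one-dimensional direction and form a kernel-smoothed U-statistic of the null-model residuals (a Zheng-type statistic). The paper's actual statistic, its standardization, and the rules for combining several projections differ; the sparse logistic model and bandwidth rule below are illustrative choices only.

```python
# Random 1-D projection of high-dimensional covariates + kernel-smoothed residual statistic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p = 400, 1000                                     # dimension larger than sample size
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))  # sparse logistic truth

fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
resid = y - fit.predict_proba(X)[:, 1]               # residuals under the fitted null model

a = rng.normal(size=p)
a /= np.linalg.norm(a)                               # random projection direction
z = X @ a                                            # one-dimensional projected covariate
h = 1.06 * z.std() * n ** (-1 / 5)                   # rule-of-thumb bandwidth
K = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)   # Gaussian kernel weights
np.fill_diagonal(K, 0.0)                             # drop i == j terms
T = resid @ K @ resid / (n * (n - 1) * h)            # smoothing-based test statistic
print(T)
```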
