Variable selection for partially linear single-index varying-coefficient model arxiv.org/abs/2412.13468 .ST .TH

Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study arxiv.org/abs/2412.13570 .AP .ML

Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options, as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.
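
A minimal sketch (not the paper's study) of the impute-then-learn workflow the abstract describes: impute missing values first, then fit a tree-based learner and rank features by importance. scikit-learn's IterativeImputer loosely stands in for MICE here; the paper itself uses the R packages MICE, missRanger and mixGBoost, so everything below is illustrative only.

```python
# Impute first, then learn, then rank features by importance.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X[rng.random(X.shape) < 0.2] = np.nan                 # inject 20% missingness

X_imputed = IterativeImputer(random_state=0).fit_transform(X)   # impute first
rf = RandomForestRegressor(random_state=0).fit(X_imputed, y)    # then learn
ranking = np.argsort(rf.feature_importances_)[::-1]             # importance ranking
print(ranking[:3])   # the informative features 0 and 1 should rank near the top
```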

A Model-Based Clustering Approach for Bounded Data Using Transformation-Based Gaussian Mixture Models arxiv.org/abs/2412.13572 .ME

The clustering of bounded data presents unique challenges in statistical analysis due to the constraints imposed on the data values. This paper introduces a novel method for model-based clustering specifically designed for bounded data. Building on the transformation-based approach to Gaussian mixture density estimation introduced by Scrucca (2019), we extend this framework to develop a probabilistic clustering algorithm for data with bounded support that allows for accurate clustering while respecting the natural bounds of the variables. In our proposal, a flexible range-power transformation is employed to map the data from its bounded domain to the unrestricted real space, hence enabling the estimation of Gaussian mixture models in the transformed space. This approach leads to improved cluster recovery and interpretation, especially for complex distributions within bounded domains. The performance of the proposed method is evaluated through real-world data applications involving both fully and partially bounded data, in both univariate and multivariate settings. The results demonstrate the effectiveness and advantages of our approach over traditional and advanced model-based clustering techniques that employ distributions with bounded support.
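
An illustrative sketch only of the transform-then-cluster idea for data on the unit interval: a fixed logit map stands in for the flexible range-power transformation of Scrucca (2019) used in the paper, and a Gaussian mixture is then fitted in the transformed space.

```python
# Map bounded data to the real line, then cluster with a Gaussian mixture there.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# two clusters on the unit interval (fully bounded support)
x = np.concatenate([rng.beta(2, 8, 300), rng.beta(8, 2, 300)]).reshape(-1, 1)

z = np.log(x / (1.0 - x))                       # logit: (0, 1) -> real line
gmm = GaussianMixture(n_components=2, random_state=1).fit(z)
labels = gmm.predict(z)                         # clustering in transformed space
print(np.bincount(labels))                      # roughly 300 points per cluster
```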

Time-Reversible Bridges of Data with Machine Learning arxiv.org/abs/2412.13665 .ML .LG

The analysis of dynamical systems is a fundamental tool in the natural sciences and engineering. It is used to understand the evolution of systems as large as entire galaxies and as small as individual molecules. With predefined conditions on the evolution of dynamical systems, the underlying differential equations have to fulfill specific constraints in time and space. This class of problems is known as boundary value problems. This thesis presents novel approaches to learn time-reversible deterministic and stochastic dynamics constrained by initial and final conditions. The dynamics are inferred by machine learning algorithms from observed data, which is in contrast to the traditional approach of solving differential equations by numerical integration. The work in this thesis examines a set of problems of increasing difficulty, each of which is concerned with learning a different aspect of the dynamics. Initially, we consider learning deterministic dynamics from ground truth solutions which are constrained by deterministic boundary conditions. Secondly, we study a boundary value problem in discrete state spaces, where the forward dynamics follow a stochastic jump process and the boundary conditions are discrete probability distributions. In particular, the stochastic dynamics of a specific jump process, the Ehrenfest process, are considered, and the reverse-time dynamics are inferred with machine learning. Finally, we investigate the problem of inferring the dynamics of a continuous-time stochastic process between two probability distributions without any reference information. Here, we propose a novel criterion to learn time-reversible dynamics of two stochastic processes to solve the Schrödinger Bridge Problem.
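
The thesis itself is about learning such dynamics from data; purely to fix ideas about the forward process, here is a small Gillespie-style simulation of the Ehrenfest jump process mentioned in the abstract, under one common parameterization in which each of N particles switches urns at unit rate.

```python
# Forward simulation of the Ehrenfest process: k counts particles in urn 1,
# so the jump rates out of state k are (N - k) upward and k downward.
import numpy as np

def simulate_ehrenfest(N=50, k0=0, t_max=10.0, seed=0):
    rng = np.random.default_rng(seed)
    t, k, path = 0.0, k0, [(0.0, k0)]
    while t < t_max:
        up, down = N - k, k                       # jump rates out of state k
        t += rng.exponential(1.0 / (up + down))   # waiting time to the next jump
        k += 1 if rng.random() < up / (up + down) else -1
        path.append((t, k))
    return path

path = simulate_ehrenfest()
print(path[-1])   # the state drifts toward the equilibrium region around N / 2
```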

Reliability analysis for non-deterministic limit-states using stochastic emulators arxiv.org/abs/2412.13731 .CO .ME .ML

Reliability analysis is a sub-field of uncertainty quantification that assesses the probability of a system performing as intended under various uncertainties. Traditionally, this analysis relies on deterministic models, where experiments are repeatable, i.e., they produce consistent outputs for a given set of inputs. However, real-world systems often exhibit stochastic behavior, leading to non-repeatable outcomes. These so-called stochastic simulators produce different outputs each time the model is run, even with fixed inputs. This paper formally introduces reliability analysis for stochastic models and addresses it by using suitable surrogate models to lower its typically high computational cost. Specifically, we focus on the recently introduced generalized lambda models and stochastic polynomial chaos expansions. These emulators are designed to learn the inherent randomness of the simulator's response and enable efficient uncertainty quantification at a much lower cost than traditional Monte Carlo simulation. We validate our methodology through three case studies. First, using an analytical function with a closed-form solution, we demonstrate that the emulators converge to the correct solution. Second, we present results obtained from the surrogates using a toy example of a simply supported beam. Finally, we apply the emulators to perform reliability analysis on a realistic wind turbine case study, where only a dataset of simulation results is available.
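
To make the target quantity concrete: reliability analysis estimates a failure probability P(g <= 0), and for a stochastic simulator the limit-state g is random even at a fixed input. The toy simulator and threshold below are invented; the paper's point is to replace this brute-force Monte Carlo loop with generalized lambda models or stochastic polynomial chaos expansions as emulators.

```python
# Plain Monte Carlo estimate of P_f = P(g(X, Z) <= 0) for a stochastic limit-state.
import numpy as np

rng = np.random.default_rng(2)

def stochastic_limit_state(x, rng):
    # non-repeatable simulator: latent noise makes repeated runs at the
    # same input x return different outputs
    return 3.0 - x**2 + rng.normal(scale=0.5)

n = 100_000
x = rng.normal(size=n)                                    # input uncertainty
g = np.array([stochastic_limit_state(xi, rng) for xi in x])
p_f = np.mean(g <= 0.0)                                   # fraction of failures
print(f"estimated failure probability: {p_f:.4f}")
```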

Breaching 1.5°C: Give me the odds arxiv.org/abs/2412.13855 .AP .ME

Climate change communication is crucial to raising awareness and motivating action. In the context of breaching the limits set out by the Paris Agreement, we argue that climate scientists should move away from point estimates and towards reporting probabilities. Reporting probabilities will provide policymakers with a range of possible outcomes and will allow them to make informed timely decisions. To achieve this goal, we propose a method to calculate the probability of breaching the limits set out by the Paris Agreement. The method can be summarized as predicting future temperatures under different scenarios and calculating the number of possible outcomes that breach the limits as a proportion of the total number of outcomes. The probabilities can be computed for different time horizons and can be updated as new data become available. As an illustration, we performed a simulation study to investigate the probability of breaching the limits in a statistical model. Our results show that the probability of breaching the 1.5°C limit is already greater than zero for 2024. Moreover, the probability of breaching the limit is greater than 99% by 2042 if no action is taken to reduce greenhouse gas emissions. Our methodology is simple to implement and can easily be extended to more complex models of the climate system. We encourage climate model developers to include the probabilities of breaching the limits in their reports.
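
The counting recipe in the abstract can be made concrete with a toy trend-plus-noise model: simulate many future temperature paths and report the fraction that breach 1.5°C in each year. All numbers below (baseline anomaly, trend, noise level) are invented and are not the paper's fitted model.

```python
# Probability of breaching 1.5°C as the proportion of simulated paths above the limit.
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(2024, 2051)
n_paths = 10_000
anomaly_2023, trend, sigma = 1.3, 0.02, 0.1     # hypothetical parameters

# trajectories: linear warming trend plus interannual noise
paths = anomaly_2023 + trend * (years - 2023) \
    + rng.normal(0.0, sigma, size=(n_paths, years.size))
prob_breach = (paths > 1.5).mean(axis=0)        # proportion of outcomes above 1.5°C

for yr, p in zip(years[::5], prob_breach[::5]):
    print(yr, round(float(p), 3))
```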

On consistent estimation of dimension values arxiv.org/abs/2412.13898 .ST .ML .TH

The problem of estimating, from a random sample of points, the dimension of a compact subset S of the Euclidean space is considered. The emphasis is put on consistency results in the statistical sense. That is, statements of convergence to the true dimension value when the sample size grows to infinity. Among the many available definitions of dimension, we have focused (on the grounds of its statistical tractability) on three notions: the Minkowski dimension, the correlation dimension and the, perhaps less popular, concept of pointwise dimension. We prove the statistical consistency of some natural estimators of these quantities. Our proofs partially rely on the use of an instrumental estimator formulated in terms of the empirical volume function V_n(r), defined as the Lebesgue measure of the set of points whose distance to the sample is at most r. In particular, we explore the case in which the true volume function V(r) of the target set S is a polynomial on some interval starting at zero. An empirical study is also included. Our study aims to provide some theoretical support, and some practical insights, for the problem of deciding whether or not the set S has a dimension smaller than that of the ambient space. This is a major statistical motivation of the dimension studies, in connection with the so-called Manifold Hypothesis.
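
For intuition about the quantities involved, here is a standard slope-based estimate of the correlation dimension from the empirical correlation integral (Grassberger-Procaccia style); the paper's own estimators, and the route via the empirical volume function V_n(r), are related but not identical to this sketch.

```python
# Correlation dimension as the log-log slope of the empirical correlation integral.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
# sample from a circle (intrinsic dimension 1) embedded in R^3 (ambient dimension 3)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
S = np.column_stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)])

d = pdist(S)                                     # all pairwise distances
radii = np.logspace(-2, -0.5, 15)
C = np.array([(d <= r).mean() for r in radii])   # empirical correlation integral
slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
print(f"estimated correlation dimension: {slope:.2f}")   # should be close to 1
```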

KenCoh: A Ranked-Based Canonical Coherence arxiv.org/abs/2412.10521 .ME

In this paper, we consider the problem of characterizing a robust global dependence between two brain regions where each region may contain several voxels or channels. This work is driven by experiments to investigate the dependence between two cortical regions and to identify differences in brain networks between brain states, e.g., alert and drowsy states. The most common approach to exploring dependence between two groups of variables (or signals) is canonical correlation analysis (CCA). However, it is limited to capturing only linear associations and is sensitive to outlier observations. These limitations are crucial because brain network connectivity is likely to be more complex than linear and brain signals may exhibit heavy-tailed properties. To overcome these limitations, we develop a robust method, Kendall canonical coherence (KenCoh), for learning the monotonic connectivity structure among neuronal signals filtered at given frequency bands. Furthermore, we propose a KenCoh-based permutation test to investigate differences in brain network connectivity between two different states. Our simulation study demonstrates that KenCoh is competitive with the traditional variance-covariance estimator and outperforms the latter when the underlying distributions are heavy-tailed. We apply our method to EEG recordings from a virtual-reality driving experiment. Our proposed method yields further insights into the differences in the frontal-parietal cross-dependence network between the alert and drowsy states, and shows that the left-parietal channel drives this dependence in the beta band.
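
A rough sketch of the rank-based flavour of the idea: replace the Pearson correlation matrix in classical CCA with a Kendall-tau correlation matrix and read off canonical correlations from its blocks. This ignores the frequency-band filtering and coherence machinery that KenCoh actually uses; the channel counts and data-generating process below are invented.

```python
# Rank-based canonical correlations from the blocks of a Kendall-tau matrix.
import numpy as np
import pandas as pd

def inv_sqrt(A):
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(5)
n, p, q = 300, 4, 3                                 # samples, channels per region
X = rng.standard_t(df=3, size=(n, p))               # heavy-tailed signals, region 1
Y = 0.5 * X[:, :q] + rng.standard_t(df=3, size=(n, q))   # region 2 depends on region 1

R = pd.DataFrame(np.hstack([X, Y])).corr(method="kendall").to_numpy()
Rxx, Ryy, Rxy = R[:p, :p], R[p:, p:], R[:p, p:]
rho = np.linalg.svd(inv_sqrt(Rxx) @ Rxy @ inv_sqrt(Ryy), compute_uv=False)
print(rho)                                          # leading value: global dependence
```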

Augmented two-stage estimation for treatment crossover in oncology trials: Leveraging external data for improved precision arxiv.org/abs/2412.10563 .ME

Randomized controlled trials (RCTs) in oncology often allow control group participants to cross over to experimental treatments, a practice that, while often ethically necessary, complicates the accurate estimation of long-term treatment effects. When crossover rates are high or sample sizes are limited, commonly used methods for crossover adjustment (such as the rank-preserving structural failure time model, inverse probability of censoring weights, and two-stage estimation (TSE)) may produce imprecise estimates. Real-world data (RWD) can be used to develop an external control arm for the RCT, although this approach ignores evidence from trial subjects who did not cross over and evidence from the data obtained prior to crossover for those subjects who did. This paper introduces augmented two-stage estimation (ATSE), a method that combines data from non-switching participants in an RCT with an external dataset, forming a 'hybrid non-switching arm'. With a simulation study, we evaluate the ATSE method's effectiveness compared to TSE crossover adjustment and an external control arm approach. Results indicate that, relative to TSE and the external control arm approach, ATSE can increase precision and may be less susceptible to bias due to unmeasured confounding.
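
Purely to illustrate the 'hybrid non-switching arm' construction: pool the trial's non-switching control patients with external-control patients and summarize survival in the pooled arm. Kaplan-Meier is a generic stand-in summary here, not the ATSE estimator itself, and all data below are synthetic.

```python
# Build a hybrid non-switching arm and summarize it with a Kaplan-Meier curve.
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate evaluated at the observed event times."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    n_at_risk, surv = len(time), 1.0
    times, survival = [], []
    for t in np.unique(time):
        at_t = time == t
        d = int(event[at_t].sum())            # events at time t
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            times.append(t)
            survival.append(surv)
        n_at_risk -= int(at_t.sum())          # drop events and censorings at t
    return np.array(times), np.array(survival)

rng = np.random.default_rng(8)
non_switchers = rng.exponential(12.0, 40)     # hypothetical trial controls who never switched
external = rng.exponential(12.0, 200)         # hypothetical external (real-world) controls
hybrid_time = np.concatenate([non_switchers, external])   # hybrid non-switching arm
hybrid_event = np.ones_like(hybrid_time, dtype=int)       # no censoring in this toy example
t, s = kaplan_meier(hybrid_time, hybrid_event)
print(t[:3], s[:3])
```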

Statistical Problems in the Diagnosis of Shaken Baby Syndrome/Abusive Head Trauma: Limitations to Algorithms and the Need for Reliable Data arxiv.org/abs/2412.10648 .AP

The medical and legal controversy surrounding the diagnosis of Shaken Baby Syndrome/Abusive Head Trauma (SBS/AHT) raises critical questions about its scientific foundation and reliability. This article argues that SBS/AHT can only be understood by studying the statistical challenges with the data. Current health records are insufficient because there is a lack of ground truth, reliance on circular reasoning, contextual bias, heterogeneity across institutions, and integration of legal decisions into medical assessments. There exists no comprehensive source of legal data. Thus, current data are insufficient to reliably distinguish SBS/AHT from other medical conditions or accidental injuries. A privately collected medico-legal dataset that has the relevant contextual information, but is limited by being a convenience sample, is used to show how a data analysis might be performed with higher-quality data. There is a need for systematic collection of the additional contextual information used by physicians and pathologists to make determinations of abuse. Furthermore, because of the legal nature of the diagnosis, its accuracy, repeatability, and reproducibility must be tested. Better data and evaluation of the scientific validity of SBS/AHT are essential to protect vulnerable children while ensuring fairness and accuracy in legal proceedings involving allegations of abuse.

Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling arxiv.org/abs/2412.10658 .ME .AI .LG

Confidence calibration of classification models is a technique to estimate the true posterior probability of the predicted class, which is critical for ensuring reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but they often fail to fully exploit the prior distribution behind the calibration curve. However, a well-informed prior distribution can provide valuable insights beyond the empirical data in settings with limited data or low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve, which is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of the binomial process. We prove that the calibration curve estimation method is Lipschitz continuous with respect to the data distribution and requires a sample size of $3/B$ of that required for histogram binning, where $B$ represents the number of bins. Also, a new calibration metric ($TCE_{bpm}$), which leverages the estimated calibration curve to estimate the true calibration error (TCE), is designed. $TCE_{bpm}$ is proven to be a consistent calibration measure. Furthermore, realistic calibration datasets can be generated by the binomial process modeling from a preset true calibration curve and confidence score distribution, which can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric is verified on real-world and simulated data.
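
To make the setup concrete: calibration data can be viewed as pairs (c_i, y_i) with y_i | c_i ~ Bernoulli(g(c_i)) for a latent calibration curve g. In the sketch below, isotonic regression is a generic stand-in estimator of g (the paper instead maximizes a binomial-process likelihood), and the mean |g(c) - c| is a crude plug-in proxy for the true calibration error; the 0.1 overconfidence gap is invented.

```python
# Estimate a calibration curve from (confidence, correctness) pairs and a plug-in error.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(6)
c = rng.uniform(0.5, 1.0, 2000)               # reported confidence scores
g_true = np.clip(c - 0.1, 0.0, 1.0)           # hypothetical overconfident classifier
y = rng.binomial(1, g_true)                   # correctness indicators

g_hat = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(c, y)
tce_plugin = np.mean(np.abs(g_hat.predict(c) - c))
print(f"plug-in calibration error: {tce_plugin:.3f}")    # true gap is 0.1
```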

Model checking for high dimensional generalized linear models based on random projections arxiv.org/abs/2412.10721 .ME .ST .TH

Most existing tests in the literature for model checking do not work in high-dimensional settings, due to challenges arising from the "curse of dimensionality" or dependence on the normality of parameter estimators. To address these challenges, we propose a new goodness-of-fit test based on random projections for generalized linear models, when the dimension of covariates may substantially exceed the sample size. The tests only require the convergence rate of parameter estimators to derive the limiting distribution. The dimension is allowed to grow at an exponential rate relative to the sample size. As random projection converts covariates to a one-dimensional space, our tests can detect local alternatives departing from the null at the rate of $n^{-1/2}h^{-1/4}$, where $h$ is the bandwidth and $n$ is the sample size. This sensitive rate does not depend on the dimension of covariates, and thus the "curse of dimensionality" is largely alleviated for our tests. An interesting and unexpected result is that, for randomly chosen projections, the resulting test statistics can be asymptotically independent. We then propose combination methods to enhance the power performance of the tests. Detailed simulation studies and a real data analysis are conducted to illustrate the effectiveness of our methodology.
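
A hedged sketch of the basic ingredient: project the covariates onto a random one-dimensional direction and form a kernel-smoothed U-statistic of the null-model residuals (a Zheng-type statistic). The paper's actual statistic, its standardization, and the rules for combining several projections differ; the sparse logistic model and bandwidth rule below are illustrative choices only.

```python
# Random 1-D projection of high-dimensional covariates + kernel-smoothed residual statistic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p = 400, 1000                                     # dimension larger than sample size
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))  # sparse logistic truth

fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
resid = y - fit.predict_proba(X)[:, 1]               # residuals under the fitted null model

a = rng.normal(size=p)
a /= np.linalg.norm(a)                               # random projection direction
z = X @ a                                            # one-dimensional projected covariate
h = 1.06 * z.std() * n ** (-1 / 5)                   # rule-of-thumb bandwidth
K = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)   # Gaussian kernel weights
np.fill_diagonal(K, 0.0)                             # drop i == j terms
T = resid @ K @ resid / (n * (n - 1) * h)            # smoothing-based test statistic
print(T)
```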
