Investigating the Impact of Balancing, Filtering, and Complexity on Predictive Multiplicity: A Data-Centric Perspective arxiv.org/abs/2412.09712 .ML .LG

The Rashomon effect presents a significant challenge in model selection. It occurs when multiple models achieve similar performance on a dataset but produce different predictions, resulting in predictive multiplicity. This is especially problematic in high-stakes environments, where arbitrary model outcomes can have serious consequences. Traditional model selection methods prioritize accuracy and fail to address this issue. Factors such as class imbalance and irrelevant variables further complicate the situation, making it harder for models to provide trustworthy predictions. Data-centric AI approaches can mitigate these problems by prioritizing data optimization, particularly through preprocessing techniques. However, recent studies suggest that preprocessing methods may inadvertently inflate predictive multiplicity. This paper investigates how data preprocessing techniques such as balancing and filtering methods affect predictive multiplicity and model stability, taking the complexity of the data into account. We conduct experiments on 21 real-world datasets, applying various balancing and filtering techniques, and assess, by leveraging the Rashomon effect, the level of predictive multiplicity these methods introduce. Additionally, we examine how filtering techniques reduce redundancy and enhance model generalization. The findings provide insights into the relationship between balancing methods, data complexity, and predictive multiplicity, demonstrating how data-centric AI strategies can improve model performance.
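
A hedged sketch of how predictive multiplicity is commonly quantified over a Rashomon set may help fix ideas. Everything below is an assumption chosen for illustration (synthetic data, decision trees as the model class, an epsilon of 0.01, ambiguity as the disagreement metric); it is not the paper's experimental pipeline.

```python
# Minimal sketch: quantify predictive multiplicity via the Rashomon effect.
# Assumptions (not from the paper): synthetic data, decision trees as the
# model class, epsilon = 0.01 accuracy tolerance, "ambiguity" = share of
# test points on which near-optimal models disagree.
import numpy as np
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Candidate models: same family, different hyper-parameters and seeds.
models = [DecisionTreeClassifier(max_depth=d, random_state=s).fit(X_tr, y_tr)
          for d, s in product([3, 5, 7, None], range(5))]
accs = np.array([m.score(X_te, y_te) for m in models])

# Rashomon set: models within epsilon of the best observed accuracy.
eps = 0.01
rashomon = [m for m, a in zip(models, accs) if a >= accs.max() - eps]

# Ambiguity: fraction of test points where the Rashomon models disagree.
preds = np.stack([m.predict(X_te) for m in rashomon])      # (n_models, n_test)
ambiguity = np.mean(preds.min(axis=0) != preds.max(axis=0))
print(f"{len(rashomon)} near-optimal models, ambiguity = {ambiguity:.3f}")
```

Running the same measurement before and after a balancing or filtering step would indicate whether the preprocessing shrinks or inflates the disagreement among near-optimal models.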

A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data arxiv.org/abs/2412.09779 .ML .ST .TH .LG

Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension, not the dimension $d$ of the input space. The existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $\beta$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta + \bar{d}_{2\beta}(\lambda)}}\right)$, where $\bar{d}_{2\beta}(\lambda)$ is the $2\beta$-entropic dimension of $\lambda$, the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left( d^{\frac{2\lfloor\beta\rfloor(\beta + d)}{2\beta + d}}\, n^{-\frac{2\beta}{2\beta + d}}\right)$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower-bounded density, this rate, in terms of the number of data samples, is nearly optimal for learning the dependence structure for exponential families.
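
A purely arithmetic reading of the exponent (the values of $\beta$, $\bar{d}_{2\beta}(\lambda)$, and $d$ below are assumptions chosen for illustration, not taken from the paper) shows why an entropic intrinsic dimension can matter so much compared with the classical ambient-dimension rate $n^{-2\beta/(2\beta+d)}$:

```latex
% Illustrative arithmetic only: beta = 1, entropic dimension 2, ambient dimension d = 100.
\beta = 1,\quad \bar{d}_{2\beta}(\lambda) = 2
\;\Longrightarrow\;
n^{-\frac{2\beta}{2\beta + \bar{d}_{2\beta}(\lambda)}} = n^{-\frac{2}{4}} = n^{-1/2},
\qquad\text{whereas}\qquad
n^{-\frac{2\beta}{2\beta + d}} = n^{-\frac{2}{102}} \approx n^{-0.02}.
```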

A class of nonparametric methods for evaluating the effect of continuous treatments on survival outcomes arxiv.org/abs/2412.09786 .ME

In randomized trials and observational studies, it is often necessary to evaluate the extent to which an intervention affects a time-to-event outcome that is only partially observed due to right censoring. For instance, in infectious disease studies, it is frequently of interest to characterize the relationship between the risk of acquiring infection with a pathogen and a previously measured biomarker of an immune response against that pathogen induced by prior infection and/or vaccination. It is common to conduct inference within a causal framework, wherein we desire to make inferences about the counterfactual probability of survival through a given time point at any given exposure level. To determine whether a causal effect is present, one can assess whether this quantity differs by exposure level. Recent work shows that, under typical causal assumptions, summaries of the counterfactual survival distribution are identifiable. Moreover, when the treatment is multilevel, these summaries are also pathwise differentiable in a nonparametric probability model, making it possible to construct estimators thereof that are unbiased and approximately normal. In cases where the treatment is continuous, the target estimand is no longer pathwise differentiable, rendering it difficult to construct well-behaved estimators without strong parametric assumptions. In this work, we extend beyond the traditional setting with multilevel interventions to develop approaches to nonparametric inference with a continuous exposure. We introduce methods for testing whether the counterfactual probability of survival through a given time point remains constant across the range of continuous exposure levels. The performance of our proposed methods is evaluated via numerical studies, and we apply our method to data from a recent pair of efficacy trials of an HIV monoclonal antibody.
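
In symbols (the notation $T(a)$, $S_a$, and $t_0$ is introduced here as a schematic restatement of the abstract, not quoted from the paper), the target quantity and the null hypothesis of no effect at a fixed time point $t_0$ can be written as:

```latex
% Schematic notation: T(a) is the counterfactual event time under exposure level a.
S_a(t_0) \;=\; \Pr\{T(a) > t_0\},
\qquad
H_0:\; S_a(t_0) = S_{a'}(t_0)
\ \text{ for all exposure levels } a, a' \text{ in the support of the continuous exposure.}
```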

Flexible Bayesian Nonparametric Product Mixtures for Multi-scale Functional Clustering arxiv.org/abs/2412.09792 .ME .ST .TH

There is a rich literature on clustering functional data, with applications to time-series modeling, trajectory data, and even spatio-temporal settings. However, existing methods routinely perform global clustering that enforces identical atom values within the same cluster. Such grouping may be inadequate for high-dimensional functions, where the clustering patterns may change between the more dominant high-level features and the finer-resolution local features. While there is some limited literature on local clustering approaches to deal with the above problems, these methods are typically not scalable to high-dimensional functions, and their theoretical properties are not well investigated. Focusing on basis expansions for high-dimensional functions, we propose a flexible nonparametric Bayesian approach for multi-resolution clustering. The proposed method imposes independent Dirichlet process (DP) priors on different subsets of basis coefficients, which ultimately results in a product of DP mixture priors that induces local clustering. We generalize the approach to incorporate spatially correlated error terms when modeling random spatial functions, providing improved model fitting. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for implementation. We show posterior consistency properties of the local clustering approach, which asymptotically recovers the true density of random functions. Extensive simulations illustrate the improved clustering and function estimation under the proposed method compared to classical approaches. We apply the proposed approach to a spatial transcriptomics application in which the goal is to infer clusters of genes with distinct spatial patterns of expression. Our method makes an important contribution by expanding the limited literature on local clustering methods for high-dimensional functions with theoretical guarantees.
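
A minimal sketch of the local-clustering construction may be useful: independent (truncated) DP mixtures on two blocks of basis coefficients, so that the induced partition is a product of partitions. The truncation level, concentration parameter, Gaussian atoms, and block sizes below are illustrative assumptions; the paper's spatially correlated errors and MCMC sampler are not reproduced.

```python
# Sketch: independent DP mixtures on two blocks of basis coefficients, so the
# induced partition is a product of partitions; two curves can share a cluster
# at the coarse scale yet differ at the fine scale.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, truncation, alpha = 50, 20, 1.0

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking representation of DP weights."""
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()                      # renormalise after truncation

blocks = {"coarse": 4, "fine": 12}          # basis coefficients per block
labels, coeffs = {}, {}
for name, p in blocks.items():
    w = stick_breaking_weights(alpha, truncation, rng)
    atoms = rng.normal(0.0, 1.0, size=(truncation, p))   # Gaussian atoms (illustrative)
    z = rng.choice(truncation, size=n_subjects, p=w)     # per-block cluster labels
    labels[name], coeffs[name] = z, atoms[z]

# The local (joint) clustering is the pair of labels carried by each subject.
joint = list(zip(labels["coarse"], labels["fine"]))
print("distinct coarse clusters:", len(set(labels["coarse"])))
print("distinct fine clusters:  ", len(set(labels["fine"])))
print("distinct joint clusters: ", len(set(joint)))
```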

Unified optimal model averaging with a general loss function based on cross-validation arxiv.org/abs/2412.09804 .ME

Studying unified model averaging estimation for situations with complicated data structures, we propose a novel model averaging method based on cross-validation (MACV). MACV unifies a large class of new and existing model averaging estimators and covers a very general class of loss functions. Furthermore, to reduce the computational burden caused by conventional leave-one-out or leave-subject-out cross-validation, we propose a SEcond-order-Approximated Leave-one/subject-out (SEAL) cross-validation, which substantially improves computational efficiency. In the context of non-independent and non-identically distributed random variables, we establish a unified theory for analyzing the asymptotic behavior of the proposed MACV and SEAL methods, where the number of candidate models is allowed to diverge with the sample size. To demonstrate the breadth of the proposed methodology, we exemplify four optimal model averaging estimators in four important situations: longitudinal data with discrete responses, within-cluster correlation structure modeling, conditional prediction in spatial data, and quantile regression with a potential correlation structure. We conduct extensive simulation studies and analyze real-data examples to illustrate the advantages of the proposed methods.
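
A hedged sketch of the basic idea behind cross-validation-weighted model averaging is given below. The squared-error loss, K-fold splits, and sklearn candidate models are assumptions for illustration; the paper's general loss functions, dependent-data settings, and SEAL second-order approximation are not implemented here.

```python
# Sketch: choose simplex weights for candidate models by minimising the
# cross-validated squared-error loss of the averaged prediction.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
candidates = [LinearRegression(), Ridge(alpha=10.0), Lasso(alpha=1.0)]

# Out-of-fold predictions for every candidate model.
oof = np.zeros((len(y), len(candidates)))
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for j, model in enumerate(candidates):
        oof[te, j] = model.fit(X[tr], y[tr]).predict(X[te])

def cv_loss(w):
    """Cross-validated loss of the weighted-average prediction."""
    return np.mean((y - oof @ w) ** 2)

k = len(candidates)
res = minimize(cv_loss, x0=np.full(k, 1.0 / k),          # start from equal weights
               bounds=[(0.0, 1.0)] * k,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print("model-averaging weights:", np.round(res.x, 3))
```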

Understanding algorithmic fairness for clinical prediction in terms of subgroup net benefit and health equity arxiv.org/abs/2412.07879 .AP

There are concerns about the fairness of clinical prediction models. 'Fair' models are defined as those whose performance or predictions are not inappropriately influenced by protected attributes such as ethnicity, gender, or socio-economic status. Researchers have raised concerns that current algorithmic fairness paradigms enforce strict egalitarianism in healthcare, levelling down the performance of models in higher-performing subgroups instead of improving it in lower-performing ones. We propose assessing the fairness of a prediction model by expanding the concept of net benefit, using it to quantify and compare the clinical impact of a model in different subgroups. We use this to explore how a model distributes benefit across a population, its impact on health inequalities, and its role in the achievement of health equity. We show how resource constraints might introduce necessary trade-offs between health equity and other objectives of healthcare systems. We showcase our proposed approach with the development of two clinical prediction models: 1) a prognostic type 2 diabetes model used by clinicians to enrol patients into a preventive care lifestyle intervention programme, and 2) a lung cancer screening algorithm used to allocate diagnostic scans across the population. This approach helps modellers better understand whether a model upholds health equity by considering its performance in a clinical and social context.
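
Since the net benefit has a standard closed form from decision curve analysis, a per-subgroup comparison is straightforward to sketch. The simulated risks, outcomes, subgroup labels, and the 10% threshold below are assumptions for illustration; the paper's equity and resource-constraint analyses go well beyond this.

```python
# Sketch: net benefit at threshold t is TP/n - FP/n * t/(1-t) (the standard
# decision-curve formula), computed separately for each protected subgroup.
import numpy as np

def net_benefit(risk, outcome, threshold):
    treat = risk >= threshold
    n = len(outcome)
    tp = np.sum(treat & (outcome == 1))
    fp = np.sum(treat & (outcome == 0))
    return tp / n - fp / n * threshold / (1.0 - threshold)

rng = np.random.default_rng(0)
n = 10_000
group = rng.choice(["A", "B"], size=n)
true_risk = np.where(group == "A", 0.10, 0.15)             # different base rates
outcome = rng.binomial(1, true_risk)
risk = np.clip(true_risk + rng.normal(0, 0.05, n), 0, 1)   # a noisy "model"

for g in ["A", "B"]:
    nb = net_benefit(risk[group == g], outcome[group == g], threshold=0.10)
    print(f"subgroup {g}: net benefit at 10% threshold = {nb:.4f}")
```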

The continuous net benefit: Assessing the clinical utility of prediction models when informing a continuum of decisions arxiv.org/abs/2412.07882 .AP

Clinical prognostic models help inform decision-making by estimating a patient's risk of experiencing an outcome in the future. The net benefit is increasingly being used to assess the clinical utility of such models. By calculating an appropriately weighted average of the true and false positives of a model, the net benefit assesses the value added by the binary decision policy obtained by thresholding the model. Although such 'treat or not' decisions are common, prognostic models are also often used to tailor and personalise the care of patients, which implicitly involves the consideration of multiple interventions at different risk thresholds. We extend the net benefit to consider multiple decision thresholds simultaneously by taking a weighted area under a rescaled version of the net benefit curve, deriving the continuous net benefit. In addition to the consideration of a continuum of interventions, we also show how the continuous net benefit can be used for populations with a range of optimal thresholds for a single treatment, due to individual variation in expected treatment benefit or harm, highlighting limitations of currently proposed methods that calculate the area under the decision curve. We showcase the continuous net benefit through two examples of cardiovascular preventive care, comparing two modelling choices using the continuous net benefit. The continuous net benefit informs researchers of the clinical utility of models during selection, development, and validation, and helps decision makers understand the models' usefulness, improving their viability for implementation.
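
Schematically (the weight function $w$ and the threshold range below are placeholders introduced here; the paper's exact rescaling and weighting are not reproduced), the quantity being constructed is a weighted area of the net benefit $\mathrm{NB}(t)$ over a range of decision thresholds:

```latex
% Schematic form only; w(t) and [t_lo, t_hi] are illustrative placeholders.
\mathrm{cNB} \;=\; \int_{t_{\mathrm{lo}}}^{t_{\mathrm{hi}}} w(t)\,\mathrm{NB}(t)\,dt,
\qquad \int_{t_{\mathrm{lo}}}^{t_{\mathrm{hi}}} w(t)\,dt = 1,
```
where $w$ encodes how decision thresholds are distributed across interventions or patients.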

Dirichlet-Neumann Averaging: The DNA of Efficient Gaussian Process Simulation arxiv.org/abs/2412.07929 .CO .NA .PR .ME .NA

Gaussian processes (GPs) and Gaussian random fields (GRFs) are essential for modelling spatially varying stochastic phenomena. Yet, the efficient generation of corresponding realisations on high-resolution grids remains challenging, particularly when a large number of realisations are required. This paper presents two novel contributions. First, we propose a new methodology based on Dirichlet-Neumann averaging (DNA) to generate GPs and GRFs with isotropic covariance on regularly spaced grids. The combination of discrete cosine and sine transforms in the DNA sampling approach allows for rapid evaluations without the need for modification or padding of the desired covariance function. While this introduces an error in the covariance, our numerical experiments show that this error is negligible for most relevant applications, representing a trade-off between efficiency and precision. We provide explicit error estimates for Matérn covariances. The second contribution links our new methodology to the stochastic partial differential equation (SPDE) approach for sampling GRFs. We demonstrate that the concepts developed in our methodology can also guide the selection of boundary conditions in the SPDE framework. We prove that averaging specific GRFs sampled via the SPDE approach yields genuinely isotropic realisations without domain extension, with the error bounds established in the first part remaining valid.
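
For reference, since the explicit error estimates concern Matérn covariances, the standard isotropic Matérn family (with variance $\sigma^2$, smoothness $\nu$, and length scale $\ell$; this is the common parameterisation, not necessarily the one used in the paper) is

```latex
C(r) \;=\; \sigma^2\,\frac{2^{1-\nu}}{\Gamma(\nu)}
\left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^{\!\nu}
K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right),
\qquad r = \lVert s - s' \rVert,
```
where $K_\nu$ is the modified Bessel function of the second kind.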

Spatial scale-aware tail dependence modeling for high-dimensional spatial extremes arxiv.org/abs/2412.07957 .ME

Extreme events over large spatial domains may exhibit highly heterogeneous tail dependence characteristics, yet most existing spatial extremes models yield only one dependence class over the entire spatial domain. To accurately characterize "data-level dependence" in analysis of extreme events, we propose a mixture model that achieves flexible dependence properties and allows high-dimensional inference for extremes of spatial processes. We modify the popular random scale construction that multiplies a Gaussian random field by a single radial variable; we allow the radial variable to vary smoothly across space and add non-stationarity to the Gaussian process. As the level of extremeness increases, this single model exhibits both asymptotic independence at long ranges and either asymptotic dependence or independence at short ranges. We make joint inference on the dependence model and a marginal model using a copula approach within a Bayesian hierarchical model. Three different simulation scenarios show close to nominal frequentist coverage rates. Lastly, we apply the model to a dataset of extreme summertime precipitation over the central United States. We find that the joint tail of precipitation exhibits non-stationary dependence structure that cannot be captured by limiting extreme value models or current state-of-the-art sub-asymptotic models.
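
A minimal simulation sketch of the random-scale construction the abstract starts from is given below. The one-dimensional grid, exponential covariance, and Pareto-type radial variable are assumptions for illustration; the paper's spatially varying radial process, non-stationary Gaussian component, and copula-based Bayesian inference are not reproduced.

```python
# Sketch: the classical random-scale mixture X(s) = R * W(s), with W a Gaussian
# random field and R a single positive "radial" variable, next to a variant in
# which the radial term varies smoothly over space.
import numpy as np

rng = np.random.default_rng(0)
s = np.linspace(0.0, 10.0, 200)                        # 1-D grid of locations
cov = np.exp(-np.abs(s[:, None] - s[None, :]))         # exponential covariance
L = np.linalg.cholesky(cov + 1e-10 * np.eye(len(s)))   # jitter for stability

W = L @ rng.standard_normal(len(s))                    # Gaussian field
R_single = rng.pareto(3.0) + 1.0                       # one heavy-tailed radial variable
X_classical = R_single * W

# Spatially varying radial term: a smooth positive transform of an
# independent Gaussian field (illustrative choice).
R_spatial = np.exp(0.5 * (L @ rng.standard_normal(len(s))))
X_varying = R_spatial * W

print(f"max |X|, single radial variable:   {np.abs(X_classical).max():.2f}")
print(f"max |X|, spatially varying radial: {np.abs(X_varying).max():.2f}")
```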

Network Structural Equation Models for Causal Mediation and Spillover Effects arxiv.org/abs/2412.05397 .ME

Social network interference induces spillover effects from neighbors' exposures, and the complexity of statistical analysis increases when mediators are involved alongside network interference. To address the various technical challenges, we develop a theoretical framework employing a structural graphical modeling approach to investigate both mediation and interference effects within network data. Our framework enables us to capture the multifaceted mechanistic pathways through which neighboring units' exposures and mediators exert direct and indirect influences on an individual's outcome. We extend the exposure mapping paradigm in the context of a random-effects network structural equation model (REN-SEM), establishing its capacity to delineate the spillover effects of interest. Our methodological contributions include maximum likelihood estimation for REN-SEM and inference procedures with theoretical guarantees. Such guarantees encompass consistent asymptotic variance estimators, derived under a non-i.i.d. asymptotic theory. The robustness and practical utility of our methodology are demonstrated through simulation experiments and a real-world data analysis of the Twitch Gamers Network dataset, underscoring its effectiveness in capturing the intricate dynamics of network-mediated exposure effects. This work is the first to provide a rigorous theoretical framework and analytic toolbox for the mediation analysis of network data, including a robust assessment of the interplay between mediation and interference.
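
A toy data-generating sketch of the spillover-through-mediators structure described above is shown here. It is a plain linear SEM on a simulated network; the Erdős–Rényi graph, the neighbour-averaging exposure map, the coefficients, and the absence of the paper's random effects and likelihood machinery are all illustrative assumptions.

```python
# Sketch: exposures A, mediators M, and outcomes Y on a network, where both M
# and Y depend on the unit's own exposure and on neighbour-averaged
# exposure/mediator terms (spillover).
import numpy as np

rng = np.random.default_rng(0)
n = 200
adj = (rng.random((n, n)) < 0.03).astype(float)
adj = np.triu(adj, 1)
adj = adj + adj.T                                        # undirected, no self-loops
deg = np.maximum(adj.sum(axis=1), 1.0)
nbr_avg = adj / deg[:, None]                             # row-normalised exposure map

A = rng.binomial(1, 0.5, size=n).astype(float)           # exposure
M = 0.8 * A + 0.5 * (nbr_avg @ A) + rng.normal(0, 1, n)  # mediator with spillover
Y = (1.0 * A + 0.7 * M                                   # direct and mediated effects
     + 0.4 * (nbr_avg @ A) + 0.3 * (nbr_avg @ M)         # exposure/mediator spillover
     + rng.normal(0, 1, n))

# Keeping the neighbour-averaged terms in the design matrix separates the
# spillover coefficients from the individual-level ones.
design = np.column_stack([np.ones(n), A, M, nbr_avg @ A, nbr_avg @ M])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("recovered [intercept, A, M, nbrA, nbrM]:", np.round(coef, 2))
```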

The BPgWSP test: a Bayesian Weibull Shape Parameter signal detection test for adverse drug reactions arxiv.org/abs/2412.05463 .ME

We develop a Bayesian Power generalized Weibull shape parameter (PgWSP) test as a statistical method for signal detection of possible drug-adverse event associations, using electronic health records for pharmacovigilance. The Bayesian approach allows the incorporation of prior knowledge about the likely time of occurrence alongside the time-to-event data. The test is based on the shape parameters of the Power generalized Weibull (PgW) distribution. When both shape parameters are equal to one, the PgW distribution reduces to an exponential distribution, i.e. a constant hazard function; this is interpreted as no temporal association between drug and adverse event. The Bayesian PgWSP test involves comparing a region of practical equivalence (ROPE) around one, reflecting the null hypothesis, with estimated credibility intervals reflecting the posterior means of the shape parameters. The decision to raise a signal is based on the outcomes of the ROPE test and the selected combination rule for these outcomes. The development of the test requires a simulation study for tuning the ROPE and credibility intervals to optimize the specificity and sensitivity of the test. Samples are generated under various conditions, including differences in sample size, prevalence of adverse drug reactions (ADRs), and the proportion of adverse events. We explore prior assumptions reflecting the belief in the presence or absence of ADRs at different points in the observation period. Various types of ROPE, credibility intervals, and combination rules are assessed, and optimal tuning parameters are identified based on the area under the curve. The tuned Bayesian PgWSP test is illustrated in a case study in which the time-dependent correlation between the intake of bisphosphonates and four adverse events is investigated.
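
A minimal sketch of the ROPE decision step is shown below. The posterior draws are simulated stand-ins rather than output from an actual PgW fit, and the ROPE half-width, credibility level, and "either parameter" combination rule are illustrative tuning choices of the kind the paper optimises in its simulation study.

```python
# Sketch: signal detection via a ROPE around 1 for the two PgW shape
# parameters.  If both shape parameters are practically equal to 1, the hazard
# is constant (no temporal drug-adverse-event association); a credibility
# interval escaping the ROPE is treated as evidence of a signal.
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for MCMC draws of the two shape parameters (illustrative only).
shape1_draws = rng.normal(1.45, 0.10, size=4000)   # e.g. an increasing hazard
shape2_draws = rng.normal(1.00, 0.04, size=4000)   # close to 1

rope = (0.90, 1.10)          # region of practical equivalence around 1
level = 0.95                 # credibility-interval level

def escapes_rope(draws, rope, level):
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    inside = rope[0] <= lo and hi <= rope[1]
    return not inside, (lo, hi)

flag1, ci1 = escapes_rope(shape1_draws, rope, level)
flag2, ci2 = escapes_rope(shape2_draws, rope, level)

# Combination rule (one of several that can be tuned): raise a signal if
# either shape parameter's interval falls outside the ROPE.
signal = flag1 or flag2
print("interval, shape 1:", np.round(ci1, 3), "outside ROPE:", flag1)
print("interval, shape 2:", np.round(ci2, 3), "outside ROPE:", flag2)
print("raise signal:", signal)
```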
