On the ERM Principle in Meta-Learning arxiv.org/abs/2411.17898 .ML .LG

Classic supervised learning involves algorithms trained on $n$ labeled examples to produce a hypothesis $h \in \mathcal{H}$ aimed at performing well on unseen examples. Meta-learning extends this by training across $n$ tasks, with $m$ examples per task, producing a hypothesis class $\mathcal{H}$ within some meta-class $\mathbb{H}$. This setting applies to many modern problems such as in-context learning, hypernetworks, and learning-to-learn. A common method for evaluating the performance of supervised learning algorithms is through their learning curve, which depicts the expected error as a function of the number of training examples. In meta-learning, the learning curve becomes a two-dimensional learning surface, which evaluates the expected error on unseen domains for varying values of $n$ (number of tasks) and $m$ (number of training examples). Our findings characterize the distribution-free learning surfaces of meta-Empirical Risk Minimizers when either $m$ or $n$ tends to infinity: we show that the number of tasks must increase inversely with the desired error. In contrast, we show that the number of examples exhibits very different behavior: it satisfies a dichotomy where every meta-class conforms to one of the following conditions: (i) either $m$ must grow inversely with the error, or (ii) a finite number of examples per task suffices for the error to vanish as $n$ goes to infinity. This finding illustrates and characterizes cases in which a small number of examples per task is sufficient for successful learning. We further refine this picture for positive values of $\varepsilon$ and identify, for each $\varepsilon$, how many examples per task are needed to achieve an error of $\varepsilon$ in the limit as the number of tasks $n$ goes to infinity. We achieve this by developing a necessary and sufficient condition for meta-learnability using a bounded number of examples per domain.
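
To fix ideas, here is one standard way to write the meta-ERM objective and the learning surface discussed above; the loss $\ell$ and the exact convention for the risk on a new task are our assumptions, not taken from the paper. The meta-ERM selects

$\hat{\mathcal{H}} \in \arg\min_{\mathcal{H} \in \mathbb{H}} \; \frac{1}{n} \sum_{i=1}^{n} \min_{h \in \mathcal{H}} \frac{1}{m} \sum_{j=1}^{m} \ell\big(h(x_{ij}), y_{ij}\big),$

and the learning surface records the expected error of the learned class on a freshly drawn task $D$, e.g.

$\mathrm{err}(n, m) = \mathbb{E}\Big[\, \inf_{h \in \hat{\mathcal{H}}} \; \mathbb{E}_{(x,y) \sim D}\, \ell\big(h(x), y\big) \Big],$

viewed as a surface over $(n, m)$. The dichotomy above concerns how this surface behaves as $n \to \infty$ with $m$ held fixed.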

Repeated sampling of different individuals but the same clusters to improve precision of difference-in-differences estimators: the DISC design arxiv.org/abs/2411.17905 .ME

We describe the DISC (Different Individuals, Same Clusters) design, a sampling scheme that can improve the precision of difference-in-differences (DID) estimators in settings involving repeated sampling of a population at multiple time points. Although cohort designs typically lead to more efficient DID estimators than repeated cross-sectional (RCS) designs, they are often infeasible in practice due to high rates of loss to follow-up, individuals leaving the risk set, or other reasons. The DISC design is a hybrid of a cohort sampling design and an RCS sampling design: the researcher takes a single sample of clusters, but then draws different cross-sectional samples of individuals within each cluster at two or more time points. We show that the DISC design can yield DID estimators with much higher precision than an RCS design, particularly if random cluster effects are present in the data-generating mechanism. For example, for a design in which 40 clusters and 25 individuals per cluster are sampled (for a total sample size of n=1,000), the variance of a commonly used DID treatment effect estimator is 2.3 times higher in the RCS design for an intraclass correlation coefficient (ICC) of 0.05, 3.8 times higher for an ICC of 0.1, and 7.3 times higher for an ICC of 0.2.
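
A simplified random-intercept sketch of the intuition (our notation and model; the paper's data-generating mechanism and estimator may differ): suppose $Y_{ijt} = \mu_t + \tau A_j \mathbb{1}\{t=1\} + \alpha_j + \varepsilon_{ijt}$ with cluster effects $\alpha_j \sim N(0, \sigma_\alpha^2)$, individual noise $\varepsilon_{ijt} \sim N(0, \sigma_\varepsilon^2)$, and ICC $\rho = \sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_\varepsilon^2)$. Under DISC, the same clusters appear at both time points, so each cluster's pre-post contrast $\bar{Y}_{j1} - \bar{Y}_{j0}$ cancels $\alpha_j$ exactly and only individual-level noise remains. Under a repeated cross-sectional design, the clusters at the two time points are independent draws, so the cluster effects do not cancel and add roughly $2\sigma_\alpha^2$ to the variance of each contrast, which is why the inflation factors quoted above grow with the ICC.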

Bayesian Variable Selection for High-Dimensional Mediation Analysis: Application to Metabolomics Data in Epidemiological Studies arxiv.org/abs/2411.17910 .ME .AP

In epidemiological research, causal models incorporating potential mediators along a pathway are crucial for understanding how exposures influence health outcomes. This work is motivated by integrated epidemiological and blood biomarker studies investigating the relationship between long-term adherence to a Mediterranean diet and cardiometabolic health, with plasma metabolomes as potential mediators. Analyzing causal mediation in such high-dimensional omics data presents substantial challenges, including complex dependencies among mediators and the need for advanced regularization or Bayesian techniques to ensure stable and interpretable estimation and selection of indirect effects. To this end, we propose a novel Bayesian framework for identifying active pathways and estimating indirect effects in the presence of high-dimensional multivariate mediators. Our approach adopts a multivariate stochastic search variable selection method tailored to such complex mediation scenarios. Central to our method is the introduction of a set of selection priors: a Markov random field prior and sequential subsetting Bernoulli priors. The first prior's Markov property leverages the inherent correlations among mediators, thereby increasing power to detect mediated effects. The sequential subsetting aspect of the second prior encourages the simultaneous selection of relevant mediators and their corresponding indirect effects from the two parts of the mediation model, providing a more coherent and efficient variable selection framework specific to mediation analysis. Comprehensive simulation studies demonstrate that the proposed method provides superior power in detecting active mediating pathways. We further illustrate the practical utility of the method through its application to metabolome data from two cohort studies, highlighting its effectiveness in real-data settings.
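
A schematic of the high-dimensional mediation setup the method targets, in our notation (the exact model and prior forms in the paper may differ): with exposure $X_i$, mediators $M_{i1}, \dots, M_{ip}$, and outcome $Y_i$,

$M_{ij} = \alpha_j X_i + \epsilon_{ij}, \qquad Y_i = c X_i + \sum_{j=1}^{p} \beta_j M_{ij} + e_i,$

the indirect effect through mediator $j$ is $\alpha_j \beta_j$, and selection indicators $\gamma_j \in \{0, 1\}$ flag which pairs $(\alpha_j, \beta_j)$ are active. A Markov random field prior of the generic form $p(\boldsymbol{\gamma}) \propto \exp\big(a \mathbf{1}^\top \boldsymbol{\gamma} + b\, \boldsymbol{\gamma}^\top G \boldsymbol{\gamma}\big)$, with $G$ encoding dependence among mediators, lets correlated mediators borrow strength, which is the mechanism behind the increased power described above.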

Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization arxiv.org/abs/2411.17983 .ME .ML .AI .LG

Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select "interesting" instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite samples. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting. This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and the alignment of large language models in radiology report generation.
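
For context, the vanilla conformal selection baseline that OptCS builds on looks roughly as follows (a standard split-conformal construction shown only to fix ideas; the score orientation, tie-breaking, and the OptCS modifications themselves are not reproduced here). One fits a conformity score $V(x, y)$ on training data, computes calibration scores $V_i = V(X_i, Y_i)$ on held-out labeled data, and for each unlabeled candidate $j$ evaluates the score at the threshold defining "interesting" to obtain $\hat{V}_j$. The conformal p-value

$p_j = \frac{1 + \#\{i : V_i \ge \hat{V}_j\}}{n_{\mathrm{cal}} + 1}$

is then passed to a Benjamini-Hochberg-style procedure to select candidates while controlling the FDR. The difficulty OptCS addresses is that choosing or refitting $V$ on the same data breaks the exchangeability this construction relies on.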

Statistical Emulation of Human Operational Motions arxiv.org/abs/2411.16929 .AP

This paper addresses the critical and challenging task of developing emulators for simulating human operational motions in industrial workplaces. We conceptualize human motion as a sequence of human body shapes. Leveraging statistical shape theory, we develop statistical generative models for sequences of body shapes of workers in workplaces. Our goal is to create emulators that generate random, realistic human motions statistically consistent with past work performances, essential for simulating industrial operations involving human labor. We derive the space of shapes as a Riemannian shape manifold, modeling human motion as a continuous-time stochastic process on this manifold. Representing such processes is challenging due to the nonlinearity of the shape manifold, variability in execution rates across observations, the infinite dimensionality of stochastic processes, and population variability within and across action classes. This paper proposes multiple solutions to these challenges, presenting a comprehensive framework that incorporates (1) time warping for temporal alignment of training data, (2) Riemannian geometry for handling manifold nonlinearity, and (3) Shape- and Functional-PCA for dimension reduction, leading to traditional finite-dimensional Euclidean representations. In particular, it develops the transported velocity field representation for motion sequences and imposes a Gaussian model on the resulting Euclidean spaces. It then applies these models to emulate human shape sequences from an industrial operation dataset and evaluates the resulting simulations in multiple ways.
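
A minimal sketch of the last two stages of this pipeline, i.e. PCA-based dimension reduction followed by a Gaussian model and resampling. It assumes the sequences have already been time-aligned and mapped to Euclidean coordinates (e.g. via the transported velocity field representation), which is where most of the paper's actual machinery lives; everything below that point is standard.

```python
# Hedged sketch: emulate new motion sequences by fitting a Gaussian model
# in a PCA basis. Input: an (N, D) array whose rows are training sequences
# that have ALREADY been temporally aligned and flattened to Euclidean
# coordinates (an assumption; the manifold/warping steps are not shown).
import numpy as np


def fit_pca_gaussian(X, n_components=10):
    """Return (mean, principal directions, per-component std devs)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # thin SVD: rows of Vt are principal directions in the data space
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    stds = S[:k] / np.sqrt(max(X.shape[0] - 1, 1))
    return mean, Vt[:k], stds


def emulate(mean, directions, stds, n_samples, rng=None):
    """Draw new sequences: Gaussian scores in the PCA basis, then reconstruct."""
    rng = np.random.default_rng(rng)
    scores = rng.normal(size=(n_samples, len(stds))) * stds
    return mean + scores @ directions


# toy usage with synthetic "aligned sequences"
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 300))   # 40 sequences, 300 coordinates each
params = fit_pca_gaussian(X_train, n_components=5)
new_motions = emulate(*params, n_samples=3, rng=2)
print(new_motions.shape)               # (3, 300)
```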

Fragility Index for Time-to-Event Endpoints in Single-Arm Clinical Trials arxiv.org/abs/2411.16938 .ME .AP

The reliability of clinical trial outcomes is crucial, especially in guiding medical decisions. In this paper, we introduce the Fragility Index (FI) for time-to-event endpoints in single-arm clinical trials, a novel metric designed to quantify the robustness of study conclusions. The FI represents the smallest number of censored observations that, when reclassified as events, causes the posterior probability of the median survival time exceeding a specified threshold to fall below a predefined confidence level. While drug effectiveness is typically assessed by determining whether this posterior probability exceeds a specified confidence level, the FI offers a complementary measure, indicating how robust these conclusions are to potential shifts in the data. Using a Bayesian approach, we develop a practical framework for computing the FI based on the exponential survival model. To facilitate application of the method, we have developed an R package, fi, which provides a tool to compute the Fragility Index. Through real-world case studies involving time-to-event data from single-arm clinical trials, we demonstrate the utility of this index. Our findings highlight how the FI can be a valuable tool for assessing the robustness of survival analyses in single-arm studies, aiding researchers and clinicians in making more informed decisions.
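
The FI computation under the exponential model is simple enough to sketch end to end. The snippet below is a hedged illustration, not the authors' implementation: it assumes a conjugate Gamma prior on the hazard rate (so the posterior is Gamma with shape increased by the event count and rate increased by the total follow-up time), and it reclassifies censored observations as events by incrementing the event count while leaving the total follow-up time unchanged; the authors' R package fi is the reference implementation.

```python
# Hedged sketch: Bayesian fragility index for a single-arm trial under an
# exponential survival model with a conjugate Gamma(prior_shape, prior_rate)
# prior on the hazard rate. Prior, threshold conventions, and the
# reclassification rule are illustrative assumptions.
import numpy as np
from scipy.stats import gamma


def post_prob_median_exceeds(n_events, total_time, m0,
                             prior_shape=0.001, prior_rate=0.001):
    """P(median survival > m0 | data) under Exp(rate) with a Gamma prior.

    Median = ln(2)/rate, so {median > m0} is the event {rate < ln(2)/m0}.
    """
    shape = prior_shape + n_events
    rate = prior_rate + total_time
    return gamma.cdf(np.log(2) / m0, a=shape, scale=1.0 / rate)


def fragility_index(times, events, m0, conf_level=0.95, **prior):
    """Smallest number of censored observations that, when reclassified as
    events, drops P(median > m0 | data) below conf_level. Returns None if
    the conclusion survives reclassifying every censored observation."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)       # 1 = event, 0 = censored
    d, T = int(events.sum()), float(times.sum())
    if post_prob_median_exceeds(d, T, m0, **prior) < conf_level:
        return 0                                 # not significant to begin with
    n_censored = int((events == 0).sum())
    for k in range(1, n_censored + 1):
        # reclassify k censored observations as events: the event count grows,
        # total follow-up time stays the same because observed times are unchanged
        if post_prob_median_exceeds(d + k, T, m0, **prior) < conf_level:
            return k
    return None


# toy usage: 30 patients, threshold median survival of 12 months
rng = np.random.default_rng(0)
t = rng.exponential(scale=20.0, size=30)
e = rng.integers(0, 2, size=30)
print(fragility_index(t, e, m0=12.0))
```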

Interval-Valued Fuzzy Fault Tree Analysis through Qualitative Data Processing and its Applications in Marine Operations arxiv.org/abs/2411.15249 .AP .PR

Marine accidents highlight the crucial need for human safety. They result in loss of life, environmental harm, and significant economic costs, underscoring the importance of proactive, precautionary measures. This study aims to identify the root causes of accidents in order to develop effective strategies for preventing them. Because accurate quantitative data and reliable probability information are often lacking, we employ qualitative approaches to assess the reliability of complex systems. We collect expert judgments regarding the failure likelihood of each basic event and aggregate those opinions using the Similarity-based Aggregation Method (SAM) to form a collective assessment. In SAM, we convert expert opinions into failure probabilities using interval-valued triangular fuzzy numbers. Since each expert possesses different knowledge and varying levels of experience, we assign weights to their opinions to reflect their relative expertise. We employ the Best-Worst Method (BWM) to calculate the weights of each criterion, and then use the weighting scores to determine the weight of each expert. Ranking basic events by criticality is a crucial step, and in this study we use the FVI measure to prioritize and rank these events. To demonstrate the effectiveness and validity of the proposed methodology, we apply our method to two case studies: (1) chemical cargo contamination, and (2) the loss of ship steering ability. These case studies illustrate the practicality and utility of our approach in evaluating criticality and assessing risk in complex systems.
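
A toy sketch of the aggregation-and-conversion step, using plain (not interval-valued) triangular fuzzy numbers and pre-computed expert weights. The similarity-based consensus adjustment in SAM, the BWM weighting itself, and the FVI ranking are not reproduced here, and the possibility-to-probability conversion shown is the commonly used Onisawa formula, which may differ from the paper's choice.

```python
# Hedged sketch: aggregate expert opinions expressed as triangular fuzzy
# numbers (l, m, u) with given expert weights, defuzzify by centroid, and
# convert the resulting fuzzy possibility score to a failure probability.
# Simplifications: plain triangular numbers instead of interval-valued ones,
# and weights assumed to come from an upstream BWM/SAM step.
import numpy as np


def aggregate_opinions(opinions, weights):
    """Weighted aggregate of triangular fuzzy numbers, one (l, m, u) per expert."""
    opinions = np.asarray(opinions, dtype=float)   # shape (n_experts, 3)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ opinions                      # component-wise weighted sum


def centroid(tfn):
    """Centroid defuzzification of a triangular fuzzy number (l, m, u)."""
    l, m, u = tfn
    return (l + m + u) / 3.0


def onisawa_probability(fps):
    """Convert a fuzzy possibility score in (0, 1] to a failure probability."""
    if fps <= 0.0:
        return 0.0
    k = ((1.0 - fps) / fps) ** (1.0 / 3.0) * 2.301
    return 10.0 ** (-k)


# toy usage: three experts rate one basic event on a [0, 1] linguistic scale
opinions = [(0.1, 0.2, 0.3), (0.2, 0.3, 0.4), (0.1, 0.3, 0.5)]
weights = [0.5, 0.3, 0.2]                          # e.g. from an upstream BWM step
fps = centroid(aggregate_opinions(opinions, weights))
print(fps, onisawa_probability(fps))
```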

Heavy-tailed Contamination is Easier than Adversarial Contamination arxiv.org/abs/2411.15306 .ST .ME .ML .TH .DS .LG

A large body of work in the statistics and computer science communities, dating back to Huber (1960), has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data, allowing outliers to occur naturally as part of the data-generating process. In the first setting, the goal is to develop estimators robust to the largest possible fraction of outliers, while in the second one seeks estimators that combat the loss of statistical efficiency, where the dependence on the failure probability is paramount. Despite these distinct motivations, the algorithmic approaches to the two settings have converged, prompting questions about the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d. data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for mean estimation, arguably the simplest high-dimensional estimation task, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation, opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high probability.
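
For readers less familiar with the terminology, the two models are usually formulated roughly as follows (standard definitions, not quoted from the paper). In the adversarial (strong contamination) model, $X_1, \dots, X_n$ are drawn i.i.d. from a distribution $D$ and an adversary who sees the sample may replace up to $\varepsilon n$ points arbitrarily before the estimator is run; the goal is an error bound that degrades gracefully in $\varepsilon$. In the heavy-tailed model there is no adversary, but $D$ is only assumed to satisfy weak moment conditions (e.g., a finite covariance $\Sigma$ for mean estimation), and the goal is a sub-Gaussian-style confidence bound such as $\|\hat{\mu} - \mu\| \lesssim \sqrt{\mathrm{tr}(\Sigma)/n} + \sqrt{\|\Sigma\| \log(1/\delta)/n}$ with probability $1 - \delta$. The paper's first result says that any estimator meeting the adversarial guarantee automatically meets a heavy-tailed guarantee of this type; the construction in the converse direction shows that the reverse reduction requires removing almost all outliers.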

Scalar-on-Shape Regression Models for Functional Data Analysis arxiv.org/abs/2411.15326 .ME

Functional data contains two components: shape (or amplitude) and phase. This paper focuses on a branch of functional data analysis (FDA), namely Shape-Based FDA, that isolates and analyzes the shapes of functions. Specifically, it studies Scalar-on-Shape (ScoSh) regression models that incorporate the shapes of predictor functions and discard their phases. This aspect sets ScoSh models apart from traditional Scalar-on-Function (ScoF) regression models, which use full predictor functions. ScoSh is motivated by object data analysis, e.g., for neuro-anatomical objects, where object morphologies are relevant and their parameterizations are arbitrary. ScoSh also differs from methods that arbitrarily pre-register data and use it in subsequent analysis. In contrast, ScoSh models perform registration during regression, using the (non-parametric) Fisher-Rao inner product and nonlinear index functions to capture complex predictor-response relationships. This formulation gives rise to the novel concepts of regression phase and regression mean of functions. Regression phases are time warpings of predictor functions that optimize prediction errors, and regression means are optimal regression coefficients. We demonstrate practical applications of the ScoSh model using extensive simulated and real-data examples, including predicting COVID outcomes using daily rate curves as predictors.
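
One way to write down a model of this type, in our notation (the paper's exact formulation may differ): letting $q_f$ denote the square-root velocity representation of a predictor function $f$ and $\Gamma$ the group of time warpings, a ScoSh-style model takes

$y_i = h\Big( \sup_{\gamma \in \Gamma} \big\langle q_\beta, \, (q_{f_i} \circ \gamma)\sqrt{\dot{\gamma}} \big\rangle \Big) + \epsilon_i,$

where $\beta$ plays the role of the regression mean, the optimizing warp $\gamma_i$ is the regression phase of $f_i$, $h$ is the nonlinear index function, and the inner product under the square-root velocity map corresponds to the Fisher-Rao inner product mentioned above.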
