Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting arxiv.org/abs/2409.05980 .ML .LG

Rested and Restless Bandits are two well-known bandit settings that are useful to model real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either because of the actions we perform or naturally. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework that generalizes and extends rested and restless bandits. In this setting, the evolution of the arms' expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms $(i,j)$ represents the fact that a pull of arm $i$ triggers the evolution of arm $j$, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting how the complexity of the learning problem depends on instance-dependent terms that encode specific properties of the underlying graph structure.
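
To make the triggering mechanism concrete, here is a minimal simulation sketch (not from the paper): pulling an arm advances the trigger count of every neighboring arm in the graph, and each arm's expected reward is a monotone function of its own trigger count, increasing for rising arms and decreasing for rotting arms. The graph, reward curves, and noise level are illustrative assumptions.

```python
import numpy as np

class GraphTriggeredBandit:
    """Toy graph-triggered bandit simulator (illustrative sketch, not the paper's code).

    adjacency[i][j] = 1 means a pull of arm i triggers the evolution of arm j
    (the abstract describes these edges as symmetric). reward_curves[j](t) is
    arm j's expected reward after t triggers: non-decreasing for rising arms,
    non-increasing for rotting arms.
    """

    def __init__(self, adjacency, reward_curves, noise_std=0.1, seed=0):
        self.adjacency = np.asarray(adjacency)
        self.reward_curves = reward_curves
        self.triggers = np.zeros(len(reward_curves), dtype=int)
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        mean = self.reward_curves[arm](self.triggers[arm])
        reward = mean + self.rng.normal(0.0, self.noise_std)
        self.triggers += self.adjacency[arm]   # pulling `arm` triggers its neighbors
        return reward

rising = lambda t: 1.0 - 0.9 * 0.8 ** t    # grows with the number of triggers
rotting = lambda t: 0.8 * 0.95 ** t        # decays with the number of triggers

# An identity adjacency matrix recovers rested behavior (each arm triggers only
# itself); an all-ones matrix recovers restless behavior (every pull triggers all arms).
env = GraphTriggeredBandit(adjacency=[[1, 1, 0], [1, 1, 0], [0, 0, 1]],
                           reward_curves=[rising, rotting, rotting])
print([round(env.pull(0), 3) for _ in range(5)])
```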

Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach arxiv.org/abs/2409.06180 -bio.GN .ME

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
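
As background for the learning-curve step (a generic sketch, not the authors' GitHub implementation): one common choice is an inverse power law, where classifier performance at training size n behaves like a - b * n^(-c); fitting it to performance at a few pilot sample sizes lets you extrapolate the sample size needed to reach a target. The performance values, functional form, and target below are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, b, c):
    # Common learning-curve form: performance approaches the plateau a as n grows.
    return a - b * n ** (-c)

# Hypothetical classifier performance (e.g. accuracy or power) at pilot training sizes.
sizes = np.array([25, 50, 100, 200, 400])
perf = np.array([0.62, 0.68, 0.74, 0.78, 0.81])

params, _ = curve_fit(inverse_power_law, sizes, perf, p0=[0.9, 1.0, 0.5], maxfev=10000)

# Smallest sample size whose predicted performance reaches a chosen target.
target = 0.80
grid = np.arange(25, 5001)
pred = inverse_power_law(grid, *params)
needed = grid[np.argmax(pred >= target)] if (pred >= target).any() else None
print(params, needed)
```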

Intrinsic geometry-inspired dependent toroidal distribution: Application to regression model for astigmatism data arxiv.org/abs/2409.06229 .AP

This paper introduces a dependent toroidal distribution to analyze astigmatism data following cataract surgery. Rather than utilizing the flat torus, we opt to represent the bivariate angular data on the surface of a curved torus, which naturally offers smooth edge identifiability and accommodates a variety of curvatures: positive, negative, and zero. Beginning with the area-uniform toroidal distribution on this curved surface, we develop a five-parameter dependent toroidal distribution that harnesses its intrinsic geometry via the area element to model the distribution of two dependent circular random variables. We show that both marginal distributions are Cardioid, with one of the conditional variables also following a Cardioid distribution. This key feature enables us to propose a circular-circular regression model based on conditional expectations derived from circular moments. To address the high rejection rate (approximately 50%) in existing acceptance-rejection sampling methods for Cardioid distributions, we introduce an exact sampling method based on a probabilistic transformation. Additionally, we generate random samples from the proposed dependent toroidal distribution through suitable conditioning. This bivariate distribution and the regression model are applied to analyze astigmatism data collected at one- and three-month follow-ups after cataract surgery.
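
For context on the sampling issue mentioned above (a sketch of the standard acceptance-rejection baseline, not the paper's exact sampler): the Cardioid density f(θ) = (1 + 2ρ cos(θ - μ)) / (2π), with |ρ| ≤ 1/2, is bounded by (1 + 2ρ)/(2π), so rejection sampling from a uniform proposal on [0, 2π) accepts with probability 1/(1 + 2ρ), which approaches 50% as ρ nears 1/2. The parameter values below are illustrative.

```python
import numpy as np

def rcardioid_rejection(size, mu=0.0, rho=0.45, seed=0):
    """Rejection sampler for the Cardioid density
    f(theta) = (1 + 2*rho*cos(theta - mu)) / (2*pi), |rho| <= 1/2,
    with a uniform proposal on [0, 2*pi). The acceptance rate is 1/(1 + 2*rho),
    which is the ~50% rejection rate the abstract refers to when rho is near 1/2.
    """
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < size:
        theta = rng.uniform(0.0, 2 * np.pi)
        u = rng.uniform(0.0, 1.0)
        # Accept theta with probability f(theta) / (M * g(theta)), where M = 1 + 2*rho.
        if u <= (1 + 2 * rho * np.cos(theta - mu)) / (1 + 2 * rho):
            out.append(theta)
    return np.array(out)

samples = rcardioid_rejection(1000)
print(samples.mean(), samples.std())
```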

Applications of machine learning to predict seasonal precipitation for East Africa arxiv.org/abs/2409.06238 .AP .ML

Seasonal climate forecasts are commonly based on model runs from fully coupled forecasting systems that use Earth system models to represent interactions between the atmosphere, ocean, land and other Earth-system components. Recently, machine learning (ML) methods have increasingly been investigated for this task, in which large-scale climate variability is linked to local or regional temperature or precipitation in a linear or non-linear fashion. This paper investigates the use of interpretable ML methods to predict seasonal precipitation for East Africa in an operational setting. Dimension reduction is performed by decomposing the precipitation fields via empirical orthogonal functions (EOFs), such that only the respective factor loadings need to be predicted. Indices of large-scale climate variability--including the rate of change in individual indices as well as interactions between different indices--are then used as potential features to obtain tercile forecasts from an interpretable ML algorithm. Several research questions regarding the use of data and the effect of model complexity are studied. The results are compared against the ECMWF seasonal forecasting system (SEAS5) for three seasons--MAM, JJAS and OND--over the period 1993-2020. Compared to climatology for the same period, the ECMWF forecasts have negative skill in MAM and JJAS and significant positive skill in OND. The ML approach is on par with climatology in MAM and JJAS and shows significantly positive skill in OND, though not quite at the level of the OND ECMWF forecast.
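
A minimal sketch of the EOF-based dimension reduction described above (array shapes, the number of retained EOFs, and the linear stand-in for the ML model are assumptions, not the authors' code): decompose a seasons-by-gridpoints precipitation anomaly matrix with an SVD, keep the leading EOF patterns, and predict only the retained factor loadings from large-scale climate indices.

```python
import numpy as np

# Hypothetical precipitation anomalies: (n_seasons, n_gridpoints).
rng = np.random.default_rng(0)
precip = rng.normal(size=(28, 500))              # e.g. the 1993-2020 seasons
precip_anom = precip - precip.mean(axis=0)

# EOF decomposition via SVD: rows of Vt are the spatial EOF patterns,
# U * S are the corresponding factor loadings (principal component time series).
U, S, Vt = np.linalg.svd(precip_anom, full_matrices=False)
k = 3                                            # number of EOFs retained (assumption)
loadings = U[:, :k] * S[:k]                      # (n_seasons, k) targets to predict
eofs = Vt[:k]                                    # (k, n_gridpoints) spatial patterns

# Hypothetical large-scale climate indices used as predictors (e.g. ENSO, IOD).
indices = rng.normal(size=(28, 4))
coef, *_ = np.linalg.lstsq(indices, loadings, rcond=None)  # stand-in for the ML model
reconstructed = (indices @ coef) @ eofs          # predicted anomaly fields
print(reconstructed.shape)
```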

Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm arxiv.org/abs/2409.04500 .ML .ME .LG

Estimating the effect of treatments from natural experiments, where treatments are pre-assigned, is an important and well-studied problem. We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit. Surprisingly, applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy. To address this, we create a benchmark to evaluate estimator accuracy using synthetic outcomes, whose design was guided by domain experts. The benchmark extensively explores performance as real-world conditions such as sample size, treatment correlation, and propensity score accuracy vary. Based on our benchmark, we observe that doubly robust treatment effect estimators, which are based on simple and intuitive regression adjustment, generally outperform other, more complicated estimators by orders of magnitude. To better support our theoretical understanding of doubly robust estimators, we derive a closed-form expression for the variance of any such estimator that uses dataset splitting to obtain an unbiased estimate. This expression motivates the design of a new doubly robust estimator that uses a novel loss function when fitting functions for regression adjustment. We release the dataset and benchmark in a Python package; the package is built in a modular way to facilitate new datasets and estimators.
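
For readers unfamiliar with this estimator class, here is a generic cross-fitted augmented inverse propensity weighted (AIPW) estimator, the textbook doubly robust construction that combines regression adjustment with an inverse-propensity correction. It is shown for context only; it is not the paper's new estimator or loss function, and the nuisance models and synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import KFold

def aipw_ate(X, T, Y, n_splits=2, seed=0):
    """Cross-fitted AIPW (doubly robust) estimate of the average treatment effect."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Nuisance models fitted on one fold, evaluated on the held-out fold.
        ps = LogisticRegression(max_iter=1000).fit(X[train], T[train])
        mu1 = LinearRegression().fit(X[train][T[train] == 1], Y[train][T[train] == 1])
        mu0 = LinearRegression().fit(X[train][T[train] == 0], Y[train][T[train] == 0])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        t, y = T[test], Y[test]
        # Regression adjustment plus inverse-propensity-weighted residual correction.
        psi[test] = (m1 - m0
                     + t * (y - m1) / e
                     - (1 - t) * (y - m0) / (1 - e))
    return psi.mean()

# Tiny synthetic example (purely illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * T + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=500)
print(round(aipw_ate(X, T, Y), 2))   # should land near the true effect of 2
```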

Enhancing Electrocardiography Data Classification Confidence: A Robust Gaussian Process Approach (MuyGPs) arxiv.org/abs/2409.04642 .AP

Analyzing electrocardiography (ECG) data is essential for diagnosing and monitoring various heart diseases. The clinical adoption of automated methods requires accurate confidence measurements, which are largely absent from existing classification methods. In this paper, we present a robust Gaussian process classification hyperparameter training model (MuyGPs) for discerning normal heartbeat signals from signals affected by different arrhythmias and myocardial infarction. We compare the performance of MuyGPs with a traditional Gaussian process classifier as well as conventional machine learning models such as Random Forest, Extra Trees, k-Nearest Neighbors, and Convolutional Neural Networks. Comparing these models reveals MuyGPs as the most performant model for making confident predictions on individual patient ECGs. Furthermore, we explore the posterior distribution obtained from the Gaussian process to interpret the predictions and quantify uncertainty. In addition, we provide a guideline for obtaining the prediction confidence of the machine learning models and quantitatively compare the uncertainty measures of these models. In particular, we identify a class of less-accurate (ambiguous) signals for further diagnosis by an expert.
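
As a generic illustration of posterior-based confidence (using scikit-learn's standard Gaussian process classifier, i.e. the "traditional" baseline the abstract compares against, not the MuyGPs library itself): beats whose posterior class probability sits near 0.5 can be flagged as ambiguous and referred to an expert. The features, labels, and threshold are synthetic assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Hypothetical feature matrix of heartbeat windows and binary labels
# (0 = normal, 1 = abnormal); shapes and threshold are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
gpc.fit(X[:200], y[:200])

proba = gpc.predict_proba(X[200:])[:, 1]
# Flag "ambiguous" beats whose posterior probability is close to 0.5
# for referral to an expert, as in the workflow the abstract describes.
ambiguous = np.abs(proba - 0.5) < 0.15
print(f"{ambiguous.sum()} of {len(proba)} test beats flagged as ambiguous")
```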

A Multi-objective Economic Statistical Design of the CUSUM chart: NSGA II Approach arxiv.org/abs/2409.04673 .AP

This paper presents an approach for the economic statistical design of the Cumulative Sum (CUSUM) control chart in a multi-objective optimization framework. The proposed methodology integrates economic considerations with statistical aspects to optimize the design parameters of the CUSUM chart: the sample size ($n$), sampling interval ($h$), and decision interval ($H$). The Non-dominated Sorting Genetic Algorithm II (NSGA-II) is employed to solve the multi-objective optimization problem, aiming to minimize both the average cost per cycle ($C_E$) and the out-of-control Average Run Length ($ARL_\delta$) simultaneously. The effectiveness of the proposed approach is demonstrated through a numerical example by determining the optimized CUSUM chart parameters using NSGA-II. Additionally, sensitivity analysis is conducted to assess the impact of variations in input parameters. The corresponding results indicate that the proposed methodology reduces the expected cost per cycle by about 43% compared to the findings of the 2011 article by M. Lee. A more extensive comparison with respect to both $C_E$ and $ARL_\delta$ is also provided to justify the proposed methodology. This highlights the practical relevance and potential of this study for the correct application of CUSUM charts to process control in industry.
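
As a sketch of how one of the two objectives could be evaluated for a candidate design point inside the optimizer, here is a Monte Carlo estimate of the out-of-control ARL for a one-sided CUSUM on subgroup means. The cost model $C_E$ and the NSGA-II wiring are omitted, the shift size and reference value are assumptions, and this is not the authors' code.

```python
import numpy as np

def cusum_arl(n, H, shift=1.0, k=0.5, n_runs=2000, seed=0):
    """Monte Carlo out-of-control ARL for a one-sided CUSUM on subgroup means.

    n: subgroup size, H: decision interval (standardized units), shift: true mean
    shift in process standard deviations, k: reference value (often shift/2).
    ARL is counted in sampling points; the sampling interval h only enters the
    design through the time/cost model, which is not shown here.
    """
    rng = np.random.default_rng(seed)
    run_lengths = []
    for _ in range(n_runs):
        c, t = 0.0, 0
        while True:
            t += 1
            # Standardized subgroup mean under the shifted process.
            z = rng.normal(shift, 1.0, size=n).mean() * np.sqrt(n)
            c = max(0.0, c + z - k)
            if c > H:
                run_lengths.append(t)
                break
    return float(np.mean(run_lengths))

# In a multi-objective design, this ARL and an expected cost per cycle would be
# the two objectives handed to NSGA-II.
print(cusum_arl(n=4, H=5.0))
```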

Establishing the Parallels and Differences Between Right-Censored and Missing Covariates arxiv.org/abs/2409.04684 .ME .AP

While right-censored time-to-event outcomes have been studied for decades, handling time-to-event covariates, also known as right-censored covariates, is now of growing interest. So far, the literature has treated right-censored covariates as distinct from missing covariates, overlooking the potential applicability of estimators to both scenarios. We bridge this gap by establishing connections between right-censored and missing covariates under various assumptions about censoring and missingness, allowing us to identify parallels and differences and to determine when estimators can be used in both contexts. These connections allow us to adapt five estimators for right-censored covariates to the unexplored setting of informative covariate right-censoring, where the event time depends on the censoring time, and to formulate a new estimator for this setting. We establish the asymptotic properties of the six estimators, evaluate their robustness under incorrect distributional assumptions, and establish their comparative efficiency. We conduct a simulation study to confirm our theoretical results and then apply all estimators to a Huntington disease observational study to analyze cognitive impairments as a function of time to clinical diagnosis.
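
To make the parallel concrete, here is a standard inverse-probability-weighted complete-case construction of the kind being connected here (textbook background, not a result from the paper; the exact conditioning sets in the weights depend on the missingness and censoring assumptions being made):

```latex
% Missing covariate: R_i = 1 if X_i is observed; weighted complete-case estimating equation
\sum_{i=1}^{n} \frac{R_i}{P(R_i = 1 \mid V_i)}\, U(Y_i, X_i, Z_i; \beta) = 0 .
% Right-censored covariate: \Delta_i = 1\{X_i \le C_i\} plays the role of R_i,
% and the weight becomes the probability of X_i being uncensored
\sum_{i=1}^{n} \frac{\Delta_i}{P(\Delta_i = 1 \mid X_i, Z_i)}\, U(Y_i, X_i, Z_i; \beta) = 0 .
```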

Privacy enhanced collaborative inference in the Cox proportional hazards model for distributed data arxiv.org/abs/2409.04716 .AP .ST .TH

Data sharing barriers are a paramount challenge in multicenter clinical studies, where multiple data sources are stored in a distributed fashion at different local study sites. Particularly in time-to-event analysis, where global risk sets are needed for the Cox proportional hazards model, access to a centralized database is typically necessary. Merging such data sources into a common data storage for a centralized statistical analysis requires a data use agreement, which is often time-consuming. Furthermore, the construction and distribution of risk sets to participating clinical centers for subsequent calculations may pose a risk of revealing individual-level information. We propose a new collaborative Cox model that eliminates the need for accessing the centralized database and constructing global risk sets, requiring only the sharing of summary statistics of significantly smaller dimension than the risk sets. Thus, the proposed collaborative inference enjoys maximal protection of data privacy. We show theoretically and numerically that the new distributed proportional hazards model approach has little loss of statistical power compared to the centralized method that requires merging the entire dataset. We present a renewable sieve method to establish large-sample properties of the proposed method. We illustrate its performance through simulation experiments and a real-world data example from patients with kidney transplantation in the Organ Procurement and Transplantation Network (OPTN), studying the factors associated with 5-year death-censored graft failure (DCGF) for patients who underwent kidney transplants in the US.
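
To see why global risk sets normally force data pooling, here is the standard (centralized) Cox partial log-likelihood, in which each event's term sums over every subject still at risk across all sites. This is the baseline the collaborative method avoids, not the paper's summary-statistics approach, and the synthetic data are illustrative.

```python
import numpy as np

def cox_partial_loglik(beta, times, events, X):
    """Centralized Cox partial log-likelihood (Breslow form, ties not handled).

    Each event's denominator sums over the global risk set R(t_i) = {j : t_j >= t_i},
    which is why pooled, centralized data (or shared risk sets) are normally needed.
    """
    eta = X @ beta
    loglik = 0.0
    for i in np.where(events == 1)[0]:
        risk_set = times >= times[i]              # requires access to all subjects
        loglik += eta[i] - np.log(np.exp(eta[risk_set]).sum())
    return loglik

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
times = rng.exponential(scale=np.exp(-X @ np.array([0.5, -0.3])))
events = rng.binomial(1, 0.8, size=50)
print(cox_partial_loglik(np.array([0.5, -0.3]), times, events, X))
```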
