Can SGD Select Good Fishermen? Local Convergence under Self-Selection Biases and Beyond arxiv.org/abs/2504.07133 .ML .ST .TH .DS .LG

We revisit the problem of estimating $k$ linear regressors with self-selection bias in $d$ dimensions with the maximum selection criterion, as introduced by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [CDIZ23, STOC'23]. Our main result is a $\operatorname{poly}(d,k,1/\varepsilon) + {k}^{O(k)}$ time algorithm for this problem, which yields an improvement in the running time of the algorithms of [CDIZ23] and [GM24, arXiv]. We achieve this by providing the first local convergence algorithm for self-selection, thus resolving the main open question of [CDIZ23]. To obtain this algorithm, we reduce self-selection to a seemingly unrelated statistical problem called coarsening. Coarsening occurs when one does not observe the exact value of the sample but only some set (a subset of the sample space) that contains the exact value. Inference from coarse samples arises in various real-world applications due to rounding by humans and algorithms, limited precision of instruments, and lag in multi-agent systems. Our reduction to coarsening is intuitive and relies on the geometry of the self-selection problem, which enables us to bypass the limitations of previous analytic approaches. To demonstrate its applicability, we provide a local convergence algorithm for linear regression under another self-selection criterion, which is related to second-price auction data. Further, we give the first polynomial time local convergence algorithm for coarse Gaussian mean estimation given samples generated from a convex partition. Previously, only a sample-efficient algorithm was known due to Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21, COLT'21].
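
As a concrete picture of the observation model, here is a minimal Python sketch (my own illustration, not code from the paper) of the maximum self-selection criterion: k unknown linear regressors act on the same covariates, and only the largest noisy response is observed. Dimensions, noise level, and variable names are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, k, n, sigma = 5, 3, 10_000, 0.1     # illustrative sizes and noise level

W = rng.normal(size=(k, d))            # ground-truth regressors w_1, ..., w_k
X = rng.normal(size=(n, d))            # covariates x ~ N(0, I_d)
Y_all = X @ W.T + sigma * rng.normal(size=(n, k))  # noisy responses of all k regressors
y_obs = Y_all.max(axis=1)              # self-selection: only the maximum response is observed

# A local convergence algorithm would take warm-start estimates of W and refine
# them from (X, y_obs) alone; the argmax index of each sample stays latent.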

Effective treatment allocation strategies under partial interference arxiv.org/abs/2504.07305 .ME .AP

Interference occurs when the potential outcomes of a unit depend on the treatment of others. Interference can be highly heterogeneous, where treating certain individuals might have a larger effect on the population's overall outcome. A better understanding of how covariates explain this heterogeneity may lead to more effective interventions. In the presence of clusters of units, we assume that interference occurs within clusters but not across them. We define novel causal estimands under hypothetical, stochastic treatment allocation strategies that fix the marginal treatment probability in a cluster and vary how the treatment probability depends on covariates, such as a unit's network position and characteristics. We illustrate how these causal estimands can shed light on the heterogeneity of interference and on the network and covariate profile of influential individuals. For experimental settings, we develop standardized weighting estimators for our novel estimands and derive their asymptotic distribution. We design an inferential procedure for testing the null hypothesis of interference homogeneity with respect to covariates. We validate the performance of the estimator and inferential procedure through simulations. We then apply the novel estimators to a clustered experiment in China to identify the important characteristics that drive heterogeneity in the effect of providing information sessions on insurance uptake.
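
To make the estimands concrete, a small sketch (my own illustration, with assumed variable names) of two stochastic allocation strategies for one cluster that share the same marginal treatment probability alpha but differ in how individual probabilities depend on a covariate such as network degree:

import numpy as np

rng = np.random.default_rng(1)
n_cluster, alpha = 20, 0.3
degree = rng.poisson(3, size=n_cluster) + 1        # network-position covariate

p_uniform = np.full(n_cluster, alpha)              # strategy A: everyone treated w.p. alpha
p_targeted = np.clip(alpha * degree / degree.mean(), 0.0, 1.0)  # strategy B: tilt toward hubs
print(p_uniform.mean(), p_targeted.mean())          # same (approximate) cluster-level probability

A = rng.random(n_cluster) < p_targeted              # one stochastic allocation drawn from strategy B
# Contrasting expected cluster outcomes under A-type vs. B-type strategies is the
# kind of estimand the abstract defines.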

A Unified Framework for Large-Scale Classification: Error Rate Control and Optimality arxiv.org/abs/2504.07321 .ME

Classification is a fundamental task in supervised learning, yet achieving valid misclassification rate control remains challenging, possibly due to the limited predictive capability of the classifiers or the intrinsic complexity of the classification task. In this article, we address large-scale multi-class classification problems with general error rate guarantees to enhance algorithmic trustworthiness. To this end, we first introduce a notion of group-wise classification, which unifies the common class-wise and overall classifications as special cases. We then develop a unified algorithmic framework for the general group-wise classification that consists of three steps: Pre-classification, Selective $p$-value construction, and large-scale Post-classification decisions (PSP). Theoretically, PSP is distribution-free and provides valid finite-sample guarantees for controlling general group-wise false decision rates at target levels. To show the power of PSP, we demonstrate that the step of post-classification decisions never degrades the power of pre-classification, provided that pre-classification has been sufficiently powerful to meet the target error levels. Additionally, we establish general power optimality theories for PSP from both non-asymptotic and asymptotic perspectives. Numerical results in both simulations and real data analysis validate the performance of the proposed PSP approach.
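
The abstract does not spell out the PSP construction, so the following is only a generic stand-in: split-conformal-style selective p-values computed from the scores of any pre-trained classifier, followed by a thresholding decision at a target level. The function names and the decision rule are assumptions for illustration, not the paper's procedure.

import numpy as np

def selective_pvalues(cal_scores, test_scores):
    # p_j = (1 + #{calibration scores >= test score j}) / (n + 1)
    cal_sorted = np.sort(cal_scores)
    n = len(cal_sorted)
    n_below = np.searchsorted(cal_sorted, test_scores, side="left")
    return (1 + n - n_below) / (n + 1)

rng = np.random.default_rng(2)
cal_scores = rng.normal(size=500)             # nonconformity scores from a pre-classification model
test_scores = rng.normal(loc=0.5, size=100)   # scores for new points
pvals = selective_pvalues(cal_scores, test_scores)
decisions = pvals <= 0.1                      # post-classification decisions at target level 0.1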

A GARMA Framework for Unit-Bounded Time Series Based on the Unit-Lindley Distribution with Application to Renewable Energy Data arxiv.org/abs/2504.07351 .ST .AP .TH

The Unit-Lindley is a one-parameter family of distributions in $(0,1)$ obtained from an appropriate transformation of the Lindley distribution. In this work, we introduce a class of dynamical time series models for continuous random variables taking values in $(0,1)$ based on the Unit-Lindley distribution. The models in the proposed class are observation-driven: conditionally on a set of covariates, the random component is modeled by a Unit-Lindley distribution, while the systematic component models the conditional mean through a dynamical structure resembling the classical ARMA models. Parameter estimation is conducted using partial maximum likelihood, for which an asymptotic theory is available. Based on the asymptotic results, the construction of confidence intervals, hypothesis testing, model selection, and forecasting can be carried out. A Monte Carlo simulation study is conducted to assess the finite sample performance of the proposed partial maximum likelihood approach. Finally, an application considering forecasting of the proportion of net electricity generated by conventional hydroelectric power in the United States is presented. The application shows the versatility of the proposed method compared to other benchmark models in the literature.
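
For reference, the Unit-Lindley arises from Y = X/(1+X) with X ~ Lindley(theta), and its mean is 1/(1+theta). A small sketch of my own, using the standard exponential/gamma mixture representation of the Lindley distribution, that samples it and checks the mean:

import numpy as np

def sample_unit_lindley(theta, size, rng):
    # Lindley(theta) is a mixture: Exp(theta) w.p. theta/(1+theta), Gamma(2, rate=theta) otherwise
    exp_part = rng.exponential(1 / theta, size)
    gamma_part = rng.gamma(2, 1 / theta, size)
    x = np.where(rng.random(size) < theta / (1 + theta), exp_part, gamma_part)
    return x / (1 + x)                         # Unit-Lindley on (0, 1)

rng = np.random.default_rng(3)
theta = 2.5
y = sample_unit_lindley(theta, 100_000, rng)
print(y.mean(), 1 / (1 + theta))               # both close to 0.2857

# In the GARMA-type model of the abstract, the conditional mean (equivalently theta)
# evolves over time through a link-based ARMA-like recursion on past values.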

Estimand framework development for eGFR slope estimation and comparative analyses across various estimation methods arxiv.org/abs/2504.07411 .ME

Chronic kidney disease (CKD) is a global health challenge characterized by progressive kidney function decline, often culminating in end-stage kidney disease (ESKD) and increased mortality. To address limitations such as the extended trial follow-up necessitated by the low incidence of the kidney composite endpoint, the eGFR slope -- a surrogate endpoint reflecting the trajectory of kidney function decline -- has gained prominence for its predictive power and regulatory support. Despite its advantages, the lack of a standardized framework for the eGFR slope estimand and its estimation complicates consistent interpretation and cross-trial comparisons. Existing methods, including simple linear regression and mixed-effects models, vary in their underlying assumptions, creating a need for a formalized approach to align estimation methods with trial objectives. This manuscript proposes an estimand framework tailored to eGFR slope-based analyses in CKD RCTs, ensuring clarity in defining "what to estimate" and enhancing the comparability of results. Through simulation studies and real-world data applications, we evaluate the performance of various commonly applied estimation techniques under distinct scenarios. By recommending a clear characterization of the eGFR slope estimand and providing considerations for estimation approaches, this work aims to improve the reliability and interpretability of CKD trial results, advancing therapeutic development and clinical decision-making.
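
To ground the comparison of estimation methods, a toy sketch (simulated data, illustrative variable names, assuming pandas and statsmodels are available) of two common eGFR slope estimators: per-patient ordinary least squares versus a random-slope mixed-effects model, whose time-by-treatment coefficient is the treatment effect on the slope.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
rows = []
for pid in range(100):
    treated = pid % 2
    slope = -3.0 + 1.5 * treated + rng.normal(0, 1)      # annual eGFR change (illustrative)
    intercept = 60 + rng.normal(0, 10)
    for t in np.linspace(0, 2, 5):                        # five visits over two years
        rows.append((pid, treated, t, intercept + slope * t + rng.normal(0, 3)))
df = pd.DataFrame(rows, columns=["id", "treatment", "time", "egfr"])

# (a) per-patient simple linear regression slopes
ols_slopes = df.groupby("id")[["time", "egfr"]].apply(
    lambda g: np.polyfit(g["time"], g["egfr"], 1)[0])

# (b) random-intercept / random-slope mixed model
mm = smf.mixedlm("egfr ~ time * treatment", df, groups=df["id"], re_formula="~time").fit()
print(mm.params["time:treatment"])                        # treatment effect on the eGFR slope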

Conditional Data Synthesis Augmentation arxiv.org/abs/2504.07426 .ME .LG

Reliable machine learning and statistical analysis rely on diverse, well-distributed training data. However, real-world datasets are often limited in size and exhibit underrepresentation across key subpopulations, leading to biased predictions and reduced performance, particularly in supervised tasks such as classification. To address these challenges, we propose Conditional Data Synthesis Augmentation (CoDSA), a novel framework that leverages generative models, such as diffusion models, to synthesize high-fidelity data for improving model performance across multimodal domains including tabular, textual, and image data. CoDSA generates synthetic samples that faithfully capture the conditional distributions of the original data, with a focus on under-sampled or high-interest regions. Through transfer learning, CoDSA fine-tunes pre-trained generative models to enhance the realism of synthetic data and increase sample density in sparse areas. This process preserves inter-modal relationships, mitigates data imbalance, improves domain adaptation, and boosts generalization. We also introduce a theoretical framework that quantifies the statistical accuracy improvements enabled by CoDSA as a function of synthetic sample volume and targeted region allocation, providing formal guarantees of its effectiveness. Extensive experiments demonstrate that CoDSA consistently outperforms non-adaptive augmentation strategies and state-of-the-art baselines in both supervised and unsupervised settings.
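
A schematic sketch of the augmentation idea (my own toy stand-in: a class-conditional Gaussian plays the role of the fine-tuned conditional generative model, and all names are illustrative, not CoDSA's implementation):

import numpy as np

rng = np.random.default_rng(5)
X_major = rng.normal(0.0, 1.0, size=(1000, 2))       # well-represented subpopulation
X_minor = rng.normal(3.0, 1.0, size=(50, 2))         # under-represented, high-interest region

mu, cov = X_minor.mean(axis=0), np.cov(X_minor.T)    # "generator" fit to the sparse region
X_synth = rng.multivariate_normal(mu, cov, size=500) # targeted synthetic samples

X_aug = np.vstack([X_major, X_minor, X_synth])       # augmented training set
y_aug = np.concatenate([np.zeros(1000), np.ones(50 + 500)])
# A downstream classifier trained on (X_aug, y_aug) sees the minority region at a
# much higher density than in the raw data.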

Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks arxiv.org/abs/2504.06470 .ML .LG

LassoRNet: Accurate dim-light melatonin onset time prediction from multiple blood tissue samples arxiv.org/abs/2504.06494 .AP .CO

Research on chemotherapy, heart surgery, and vaccines has indicated that the risks and benefits of a treatment could vary depending on the time of day it is administered. A challenge with performing studies on timing treatment administration is that the optimal treatment time is different for each patient, as it would be based on a patient's internal clock time (ICT) rather than the 24-hour day-night cycle time. Prediction methods have been developed to determine a patient's ICT based on biomarker measurements, which can be leveraged to personalize treatment time. However, these methods face two limitations. First, these methods are designed to output predictions given biomarker measurements from a single tissue sample, when multiple tissue samples can be collected over time. Second, these methods are based on linear modelling frameworks, which would not capture the potentially complex relationships between biomarkers and a patient's ICT. To address these two limitations, this paper introduces a recurrent neural network framework, which we refer to as LassoRNet, for predicting the ICT at which a patient's biomarkers are measured as well as the underlying offset between a patient's ICT and the 24-hour day-night cycle time, or that patient's dim-light melatonin onset (DLMO) time. A novel feature of LassoRNet is a proposed variable selection scheme that minimizes the number of biomarkers needed to predict ICT. We evaluate LassoRNet on three longitudinal circadian transcriptome study data sets where DLMO time was determined for each study participant, and find that it consistently outperforms the state of the art in both ICT and DLMO time prediction. Notably, LassoRNet obtains a median absolute error of approximately one hour in ICT prediction and 30 to 40 minutes in DLMO time prediction, where DLMO time prediction is performed using three samples collected at sequential time points.
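
A schematic PyTorch sketch of the general idea (not the authors' LassoRNet; the architecture, sizes, and placement of the penalty are assumptions): a recurrent network over three sequential samples with an L1 penalty on the input-to-hidden weights, so biomarkers whose weight columns shrink toward zero are effectively deselected.

import torch
import torch.nn as nn

n_biomarkers, hidden, seq_len, batch = 40, 16, 3, 8

class TinyLassoRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(n_biomarkers, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)              # predicted internal clock time (hours)
    def forward(self, x):
        _, h = self.gru(x)
        return self.head(h[-1])

model = TinyLassoRNN()
x = torch.randn(batch, seq_len, n_biomarkers)         # three sequential blood samples per subject
y = torch.rand(batch, 1) * 24                         # illustrative ICT targets (a real loss would respect circularity)
lam = 1e-3
loss = nn.functional.mse_loss(model(x), y) + lam * model.gru.weight_ih_l0.abs().sum()
loss.backward()                                       # gradients for one training step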

Microbial correlation: a semi-parametric model for investigating microbial co-metabolism arxiv.org/abs/2504.05450 .ME .AP

The gut microbiome plays a crucial role in human health, yet the mechanisms underlying host-microbiome interactions remain unclear, limiting its translational potential. Recent microbiome multiomics studies, particularly paired microbiome-metabolome studies (PM2S), provide valuable insights into gut metabolism as a key mediator of these interactions. Our preliminary data reveal strong correlations among certain gut metabolites, suggesting shared metabolic pathways and microbial co-metabolism. However, these findings are confounded by various factors, underscoring the need for a more rigorous statistical approach. Thus, we introduce microbial correlation, a novel metric that quantifies how two metabolites are co-regulated by the same gut microbes while accounting for confounders. Statistically, it is based on a partially linear model that isolates microbial-driven associations, and a consistent estimator is established based on semi-parametric theory. To improve efficiency, we develop a calibrated estimator with a parametric rate, maximizing the use of large external metagenomic datasets without paired metabolomic profiles. This calibrated estimator also enables efficient p-value calculation for identifying significant microbial co-metabolism signals. Through extensive numerical analysis, our method identifies important microbial co-metabolism patterns for healthy individuals, serving as a benchmark for future studies in diseased populations.
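
A naive plug-in sketch of the idea (not the paper's semi-parametric estimator; the data and names are simulated for illustration): remove confounder effects from each metabolite, predict the residual variation from microbiome features, and correlate the microbe-driven components.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n, p_microbe = 300, 30
Z = rng.normal(size=(n, 3))                      # confounders (e.g. age, BMI, diet)
M = rng.normal(size=(n, p_microbe))              # microbiome features
shared = M[:, :5].sum(axis=1)                    # microbes co-regulating both metabolites
met1 = Z @ np.array([1.0, -0.5, 0.2]) + shared + rng.normal(size=n)
met2 = Z @ np.array([0.3, 0.8, -0.1]) + shared + rng.normal(size=n)

def microbial_component(y):
    resid = y - LinearRegression().fit(Z, y).predict(Z)   # strip confounder-driven variation
    return LinearRegression().fit(M, resid).predict(M)    # microbe-driven part of the metabolite

mc = np.corrcoef(microbial_component(met1), microbial_component(met2))[0, 1]
print(mc)                                        # close to 1 here because the microbial signal is shared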
