arXiv Statistics @arxiv_stats@qoto.org

Bot

I post the feed of the arXiv Statistics.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019

2 Following 603 Followers

Posts Posts and replies Media

arXiv Statistics @arxiv_stats@qoto.org

Nonparametric Likelihood Ratio Test for Univariate Shape-constrained Densities. (arXiv:2211.13272v1 [math.ST]) http://arxiv.org/abs/2211.13272

Nonparametric Likelihood Ratio Test for Univariate Shape-constrained Densities

We provide a comprehensive study of a nonparametric likelihood ratio test on whether a random sample follows a distribution in a prespecified class of shape-constrained densities. While the conventional definition of likelihood ratio is not well-defined for general nonparametric problems, we consider a working sub-class of alternative densities that leads to test statistics with desirable properties. Under the null, a scaled and centered version of the test statistic is asymptotic normal and distribution-free, which comes from the fact that the asymptotic dominant term under the null depends only on a function of spacings of transformed outcomes that are uniform distributed. The nonparametric maximum likelihood estimator (NPMLE) under the hypothesis class appears only in an average log-density ratio which often converges to zero at a faster rate than the asymptotic normal term under the null, while diverges in general test so that the test is consistent. The main technicality is to show these results for log-density ratio which requires a case-by-case analysis, including new results for k-monotone densities with unbounded support and completely monotone densities that are of independent interest. A bootstrap method by simulating from the NPMLE is shown to have the same limiting distribution as the test statistic.

arXiv Statistics @arxiv_stats@qoto.org

Shapley Curves: A Smoothing Perspective. (arXiv:2211.13289v1 [stat.ML]) http://arxiv.org/abs/2211.13289

Shapley Curves: A Smoothing Perspective

Originating from cooperative game theory, Shapley values have become one of the most widely used measures for variable importance in applied Machine Learning. However, the statistical understanding of Shapley values is still limited. In this paper, we take a nonparametric (or smoothing) perspective by introducing Shapley curves as a local measure of variable importance. We propose two estimation strategies and derive the consistency and asymptotic normality both under independence and dependence among the features. This allows us to construct confidence intervals and conduct inference on the estimated Shapley curves. The asymptotic results are validated in extensive experiments. In an empirical application, we analyze which attributes drive the prices of vehicles.

arXiv Statistics @arxiv_stats@qoto.org

Learning and Testing Latent-Tree Ising Models Efficiently. (arXiv:2211.13291v1 [cs.LG]) http://arxiv.org/abs/2211.13291

Learning and Testing Latent-Tree Ising Models Efficiently

We provide time- and sample-efficient algorithms for learning and testing latent-tree Ising models, i.e. Ising models that may only be observed at their leaf nodes. On the learning side, we obtain efficient algorithms for learning a tree-structured Ising model whose leaf node distribution is close in Total Variation Distance, improving on the results of prior work. On the testing side, we provide an efficient algorithm with fewer samples for testing whether two latent-tree Ising models have leaf-node distributions that are close or far in Total Variation distance. We obtain our algorithms by showing novel localization results for the total variation distance between the leaf-node distributions of tree-structured Ising models, in terms of their marginals on pairs of leaves.

arXiv Statistics @arxiv_stats@qoto.org

Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data. (arXiv:2211.13297v1 [cs.LG]) http://arxiv.org/abs/2211.13297

Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data

Missing data are ubiquitous in real world applications and, if not adequately handled, may lead to the loss of information and biased findings in downstream analysis. Particularly, high-dimensional incomplete data with a moderate sample size, such as analysis of multi-omics data, present daunting challenges. Imputation is arguably the most popular method for handling missing data, though existing imputation methods have a number of limitations. Single imputation methods such as matrix completion methods do not adequately account for imputation uncertainty and hence would yield improper statistical inference. In contrast, multiple imputation (MI) methods allow for proper inference but existing methods do not perform well in high-dimensional settings. Our work aims to address these significant methodological gaps, leveraging recent advances in neural network Gaussian process (NNGP) from a Bayesian viewpoint. We propose two NNGP-based MI methods, namely MI-NNGP, that can apply multiple imputations for missing values from a joint (posterior predictive) distribution. The MI-NNGP methods are shown to significantly outperform existing state-of-the-art methods on synthetic and real datasets, in terms of imputation error, statistical inference, robustness to missing rates, and computation costs, under three missing data mechanisms, MCAR, MAR, and MNAR.

arXiv Statistics @arxiv_stats@qoto.org

Nepal Himalaya Offers Considerable Potential for Pumped Storage Hydropower. (arXiv:2211.13306v1 [stat.AP]) http://arxiv.org/abs/2211.13306

Nepal Himalaya Offers Considerable Potential for Pumped Storage Hydropower

There is a pressing need for a transition from fossil-fuel to renewable energy to meet the increasing energy demands and reduce greenhouse gas emissions. The Nepal Himalaya possesses substantial renewable energy potential that can be harnessed through hydropower projects due to its peculiar topographic characteristics and abundant water resources. However, the current exploitation rate is low owing to the predominance of run-of-river hydropower systems to support the nation's power system. The utility-scale storage facility is crucial in the load scenario of an integrated Nepalese power system to manage diurnal variation, peak demand, and penetration of intermittent energy sources. In this study, we first identify the potential of pumped storage hydropower across the country under multiple configurations by pairing lakes, hydropower projects, rivers, and available flat terrains. We then identify technically feasible pairs from those of potential locations. Infrastructural, environmental, operational, and other technical constraints govern the choice of feasible locations. We find the flat land-to-river configuration most promising over other configurations for Nepal. Our results provide insight into the potential of pumped storage hydropower and are of practical importance in planning sustainable power systems in the Himalayas.

arXiv Statistics @arxiv_stats@qoto.org

A Moment-Matching Approach to Testable Learning and a New Characterization of Rademacher Complexity. (arXiv:2211.13312v1 [cs.LG]) http://arxiv.org/abs/2211.13312

A Moment-Matching Approach to Testable Learning and a New Characterization of Rademacher Complexity

A remarkable recent paper by Rubinfeld and Vasilyan (2022) initiated the study of \emph{testable learning}, where the goal is to replace hard-to-verify distributional assumptions (such as Gaussianity) with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test. In this model, they gave an efficient algorithm for learning halfspaces under testable assumptions that are provably satisfied by Gaussians. In this paper we give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree \emph{sandwiching polynomials}, capturing most important examples for which we have ordinary agnostic learners. We recover the results of Rubinfeld and Vasilyan as a corollary of our techniques while achieving improved, near-optimal sample complexity bounds for a broad range of concept classes and distributions. Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.

arXiv Statistics @arxiv_stats@qoto.org

Extent of Safety Database in Pediatric Drug Development: Types of Assessment, Analytical Precision, and Pathway for Extrapolation through On-Target Effects. (arXiv:2211.13329v1 [stat.AP]) http://arxiv.org/abs/2211.13329

Extent of Safety Database in Pediatric Drug Development: Types of Assessment, Analytical Precision, and Pathway for Extrapolation through On-Target Effects

Pediatric patients should have access to medicines that have been appropriately evaluated for safety and efficacy. Given this goal of revised labelling, the adequacy of the pediatric clinical development plan and resulting safety database must inform a favorable benefit-risk assessment for the intended use of the medicinal product. While extrapolation from adults can be used to support efficacy of drugs in children, there may be a reluctance to use the same approach in safety assessments, wiping out potential gains in trial efficiency through a reduction of sample size. To address this reluctance, we explore safety review in pediatric trials, including factors affecting these data, specific types of safety assessments, and precision on the estimation of event rates for specific adverse events (AEs) that can be achieved. In addition, we discuss the assessments which can provide a benchmark for the use of extrapolation of safety that focuses on on-target effects. Finally, we explore a unified approach for understanding precision using Bayesian approaches as the most appropriate methodology to describe/ascertain risk in probabilistic terms for the estimate of the event rate of specific AEs.

arXiv Statistics @arxiv_stats@qoto.org

A Multivariate Non-Gaussian Bayesian Filter Using Power Moments. (arXiv:2211.13374v1 [stat.ME]) http://arxiv.org/abs/2211.13374

A Multivariate Non-Gaussian Bayesian Filter Using Power Moments

In this paper, which is a very preliminary version, we extend our results on the univariate non-Gaussian Bayesian filter using power moments to the multivariate systems, which can be either linear or nonlinear. Doing this introduces several challenging problems, for example a positive parametrization of the density surrogate, which is not only a problem of filter design, but also one of the multiple dimensional Hamburger moment problem. We propose a parametrization of the density surrogate with the proofs to its existence, Positivstellensatze and uniqueness. Based on it, we analyze the error of moments of the density estimates through the filtering process with the proposed density surrogate. An error upper bound in the sense of total variation distance is also given. A discussion on continuous and discrete treatments to the non-Gaussian Bayesian filtering problem is proposed to explain why our proposed filter shall also be a mainstream of the non-Gaussian Bayesian filtering research and motivate the research on continuous parametrization of the system state. Last but not the least, simulation results on estimating different types of multivariate density functions are given to validate our proposed filter. To the best of our knowledge, the proposed filter is the first one implementing the multivariate Bayesian filter with the system state parameterized as a continuous function, which only requires the true states being Lebesgue integrable.

arXiv Statistics @arxiv_stats@qoto.org

Lifting Weak Supervision To Structured Prediction. (arXiv:2211.13375v1 [cs.LG]) http://arxiv.org/abs/2211.13375

Lifting Weak Supervision To Structured Prediction

Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates from a variety of sources. WS is theoretically well understood for binary classification, where simple approaches enable consistent estimation of pseudolabel noise rates. Using this result, it has been shown that downstream models trained on the pseudolabels have generalization guarantees nearly identical to those trained on clean labels. While this is exciting, users often wish to use WS for structured prediction, where the output space consists of more than a binary or multi-class label set: e.g. rankings, graphs, manifolds, and more. Do the favorable theoretical properties of WS for binary classification lift to this setting? We answer this question in the affirmative for a wide range of scenarios. For labels taking values in a finite metric space, we introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions, providing a nearly-consistent noise rate estimator. For labels in constant-curvature Riemannian manifolds, we introduce new invariants that also yield consistent noise rate estimation. In both cases, when using the resulting pseudolabels in concert with a flexible downstream model, we obtain generalization guarantees nearly identical to those for models trained on clean data. Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest. Empirical evaluation validates our claims and shows the merits of the proposed method.

arXiv Statistics @arxiv_stats@qoto.org

A Non-Gaussian Bayesian Filter Using Power and Generalized Logarithmic Moments. (arXiv:2211.13383v1 [math.OC]) http://arxiv.org/abs/2211.13383

A Non-Gaussian Bayesian Filter Using Power and Generalized Logarithmic Moments

In our previous paper, we proposed a non-Gaussian Bayesian filter using power moments of the system state. A density surrogate parameterized as an analytic function is proposed to approximate the true system state, of which the distribution is only assumed Lebesgue integrable. To our knowledge, it is the first Bayesian filter where there is no prior constraints on the true density of the state and the state estimate has a continuous form of function. In this very preliminary version of paper, we propose a new type of statistics, which is called the generalized logarithmic moments. They are used to parameterize the state distribution together with the power moments. The map from the parameters of the proposed density surrogate to the power moments is proved to be a diffeomorphism, which allows to use gradient methods to treat the optimization problem determining the parameters. The simulation results reveal the advantage of using both moments for estimating mixtures of complicated types of functions.

arXiv Statistics @arxiv_stats@qoto.org

Testing for Publication Bias in Diagnostic Meta-Analysis: A Simulation Study. (arXiv:2211.12538v1 [stat.ME]) http://arxiv.org/abs/2211.12538

Testing for Publication Bias in Diagnostic Meta-Analysis: A Simulation Study

The present study investigates the performance of several statistical tests to detect publication bias in diagnostic meta-analysis by means of simulation. While bivariate models should be used to pool data from primary studies in diagnostic meta-analysis, univariate measures of diagnostic accuracy are preferable for the purpose of detecting publication bias. In contrast to earlier research, which focused solely on the diagnostic odds ratio or its logarithm ($\lnω$), the tests are combined with four different univariate measures of diagnostic accuracy. For each combination of test and univariate measure, both type I error rate and statistical power are examined under diverse conditions. The results indicate that tests based on linear regression or rank correlation cannot be recommended in diagnostic meta-analysis, because type I error rates are either inflated or power is too low, irrespective of the applied univariate measure. In contrast, the combination of trim and fill and $\lnω$ has non-inflated or only slightly inflated type I error rates and medium to high power, even under extreme circumstances (at least when the number of studies per meta-analysis is large enough). Therefore, we recommend the application of trim and fill combined with $\lnω$ to detect funnel plot asymmetry in diagnostic meta-analysis. Please cite this paper as published in Statistics in Medicine (https://doi.org/10.1002/sim.6177).

arXiv Statistics @arxiv_stats@qoto.org

Optimal design of the Wilcoxon-Mann-Whitney-test. (arXiv:2211.12556v1 [stat.ME]) http://arxiv.org/abs/2211.12556

Optimal design of the Wilcoxon-Mann-Whitney-test

In scientific research, many hypotheses relate to the comparison of two independent groups. Usually, it is of interest to use a design (i.e., the allocation of sample sizes $m$ and $n$ for fixed $N = m + n$) that maximizes the power of the applied statistical test. It is known that the two-sample t-tests for homogeneous and heterogeneous variances may lose substantial power when variances are unequal but equally large samples are used. We demonstrate that this is not the case for the non-parametric Wilcoxon-Mann-Whitney-test, whose application in biometrical research fields is motivated by two examples from cancer research. We prove the optimality of the design $m = n$ in case of symmetric and identically shaped distributions using normal approximations and show that this design generally offers power only negligibly lower than the optimal design for a wide range of distributions. Please cite this paper as published in the Biometrical Journal (https://doi.org/10.1002/bimj.201600022).

arXiv Statistics @arxiv_stats@qoto.org

Coverage of Credible Intervals in Bayesian Multivariate Isotonic Regression. (arXiv:2211.12566v1 [math.ST]) http://arxiv.org/abs/2211.12566

Coverage of Credible Intervals in Bayesian Multivariate Isotonic Regression

We consider the nonparametric multivariate isotonic regression problem, where the regression function is assumed to be nondecreasing with respect to each predictor. Our goal is to construct a Bayesian credible interval for the function value at a given interior point with assured limiting frequentist coverage. We put a prior on unrestricted step-functions, but make inference using the induced posterior measure by an "immersion map" from the space of unrestricted functions to that of multivariate monotone functions. This allows maintaining the natural conjugacy for posterior sampling. A natural immersion map to use is a projection via a distance, but in the present context, a block isotonization map is found to be more useful. The approach of using the induced "immersion posterior" measure instead of the original posterior to make inference provides a useful extension of the Bayesian paradigm, particularly helpful when the model space is restricted by some complex relations. We establish a key weak convergence result for the posterior distribution of the function at a point in terms of some functional of a multi-indexed Gaussian process that leads to an expression for the limiting coverage of the Bayesian credible interval. Analogous to a recent result for univariate monotone functions, we find that the limiting coverage is slightly higher than the credibility, the opposite of a phenomenon observed in smoothing problems. Interestingly, the relation between credibility and limiting coverage does not involve any unknown parameter. Hence by a recalibration procedure, we can get a predetermined asymptotic coverage by choosing a suitable credibility level smaller than the targeted coverage, and thus also shorten the credible intervals.

arXiv Statistics @arxiv_stats@qoto.org

Online Federated Learning via Non-Stationary Detection and Adaptation amidst Concept Drift. (arXiv:2211.12578v1 [cs.LG]) http://arxiv.org/abs/2211.12578

Online Federated Learning via Non-Stationary Detection and Adaptation amidst Concept Drift

Federated Learning (FL) is an emerging domain in the broader context of artificial intelligence research. Methodologies pertaining to FL assume distributed model training, consisting of a collection of clients and a server, with the main goal of achieving optimal global model with restrictions on data sharing due to privacy concerns. It is worth highlighting that the diverse existing literature in FL mostly assume stationary data generation processes; such an assumption is unrealistic in real-world conditions where concept drift occurs due to, for instance, seasonal or period observations, faults in sensor measurements. In this paper, we introduce a multiscale algorithmic framework which combines theoretical guarantees of \textit{FedAvg} and \textit{FedOMD} algorithms in near stationary settings with a non-stationary detection and adaptation technique to ameliorate FL generalization performance in the presence of model/concept drifts. We present a multi-scale algorithmic framework leading to $\Tilde{\mathcal{O}} ( \min \{ \sqrt{LT} , Δ^{\frac{1}{3}}T^{\frac{2}{3}} + \sqrt{T} \})$ \textit{dynamic regret} for $T$ rounds with an underlying general convex loss function, where $L$ is the number of times non-stationary drifts occured and $Δ$ is the cumulative magnitude of drift experienced within $T$ rounds.

arXiv Statistics @arxiv_stats@qoto.org

Quasi-Newton Sequential Monte Carlo. (arXiv:2211.12580v1 [stat.ME]) http://arxiv.org/abs/2211.12580

Quasi-Newton Sequential Monte Carlo

Sequential Monte Carlo samplers represent a compelling approach to posterior inference in Bayesian models, due to being parallelisable and providing an unbiased estimate of the posterior normalising constant. In this work, we significantly accelerate sequential Monte Carlo samplers by adopting the L-BFGS Hessian approximation which represents the state-of-the-art in full-batch optimisation techniques. The L-BFGS Hessian approximation has only linear complexity in the parameter dimension and requires no additional posterior or gradient evaluations. The resulting sequential Monte Carlo algorithm is adaptive, parallelisable and well-suited to high-dimensional and multi-modal settings, which we demonstrate in numerical experiments on challenging posterior distributions.

arXiv Statistics @arxiv_stats@qoto.org

Characterizing Persistence and Disparity of Covid-19 Infection Rates with City Level Demographic and Regional Features. (arXiv:2211.12583v1 [stat.AP]) http://arxiv.org/abs/2211.12583

Characterizing Persistence and Disparity of Covid-19 Infection Rates with City Level Demographic and Regional Features

The design of data-driven dashboards that inform municipalities on ongoing changes in infections within their community is addressed in this research. Daily reports of Covid-19 infections published by the state of Wisconsin as the initial surge in the pandemic ensued during the October 2020 to September 2021 time frame is considered as a case study. Of particular interest is the identification of regions and population groups distinguished by race and ethnicity that may be experiencing a disproportional rate of infections over time. This study integrates the municipality-level daily positive cases that are disaggregated by race and ethnicity and population size data derived from the US Census Bureau. The goal is to present timely data-driven information in a manner that is accessible to the general population, is relatable to the constituents and promotes community engagement in managing and mitigating the infections. A statistical metric referred to as the rank difference and its persistence over time is used to capture the disproportional incidence of Covid-19 positive cases on particular race and ethnic groups in relation to their population size. A persistence index is derived to identify regions that continually exhibit positive rank differences on a daily time scale and indicate disparity in disease incidence. The analysis leads to the identification that several municipalities in Wisconsin that are located in regions of low population and away from the denser urban centers are those that continue to exhibit disparity in the infection rates for Black/African American and Hispanic/Latino population groups. Examples of a dashboard that can be utilized to capture both aggregate level and temporal patterns of Covid-19 infections are presented.

arXiv Statistics @arxiv_stats@qoto.org

A new and asymptotically optimally contracting coupling for the random walk Metropolis. (arXiv:2211.12585v1 [stat.CO]) http://arxiv.org/abs/2211.12585

A new and asymptotically optimally contracting coupling for the random walk Metropolis

The reflection-maximal coupling of the random walk Metropolis (RWM) algorithm was recently proposed for use within unbiased MCMC. Numerically, when the target is spherical this coupling has been shown to perform well even in high dimensions. We derive high-dimensional ODE limits for Gaussian targets, which confirm this behaviour in the spherical case. However, we then extend our theory to the elliptical case and find that as the dimension increases the reflection coupling performs increasingly poorly relative to the mixing of the underlying RWM chains. To overcome this obstacle, we introduce gradient common random number (GCRN) couplings, which leverage gradient information. We show that the behaviour of GCRN couplings does not break down with the ellipticity or dimension. Moreover, we show that GCRN couplings are asymptotically optimal for contraction, in a sense which we make precise, and scale in proportion to the mixing of the underling RWM chains. Numerically, we apply GCRN couplings for convergence and bias quantification, and demonstrate that our theoretical findings extend beyond the Gaussian case.

arXiv Statistics @arxiv_stats@qoto.org

Posterior Contraction and Testing for Multivariate Isotonic Regression. (arXiv:2211.12595v1 [math.ST]) http://arxiv.org/abs/2211.12595

Posterior Contraction and Testing for Multivariate Isotonic Regression

We consider the nonparametric regression problem with multiple predictors and an additive error, where the regression function is assumed to be coordinatewise nondecreasing. We propose a Bayesian approach to make an inference on the multivariate monotone regression function, obtain the posterior contraction rate, and construct a universally consistent Bayesian testing procedure for multivariate monotonicity. To facilitate posterior analysis, we set aside the shape restrictions temporarily, and endow a prior on blockwise constant regression functions with heights independently normally distributed. The unknown variance of the error term is either estimated by the marginal maximum likelihood estimate or is equipped with an inverse-gamma prior. Then the unrestricted block heights are a posteriori also independently normally distributed given the error variance, by conjugacy. To comply with the shape restrictions, we project samples from the unrestricted posterior onto the class of multivariate monotone functions, inducing the "projection-posterior distribution", to be used for making an inference. Under an $\mathbb{L}_1$-metric, we show that the projection-posterior based on $n$ independent samples contracts around the true monotone regression function at the optimal rate $n^{-1/(2+d)}$. Then we construct a Bayesian test for multivariate monotonicity based on the posterior probability of a shrinking neighborhood of the class of multivariate monotone functions. We show that the test is universally consistent, that is, the level of the Bayesian test goes to zero, and the power at any fixed alternative goes to one. Moreover, we show that for a smooth alternative function, power goes to one as long as its distance to the class of multivariate monotone functions is at least of the order of the estimation error for a smooth function.

arXiv Statistics @arxiv_stats@qoto.org

Transfer Learning for Contextual Multi-armed Bandits. (arXiv:2211.12612v1 [stat.ML]) http://arxiv.org/abs/2211.12612

Transfer Learning for Contextual Multi-armed Bandits

Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the minimax regret is proposed. The results quantify the contribution of the data from the source domains for learning in the target domain in the context of nonparametric contextual multi-armed bandits. In view of the general impossibility of adaptation to unknown smoothness, we develop a data-driven algorithm that achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces under an additional self-similarity assumption. A simulation study is carried out to illustrate the benefits of utilizing the data from the auxiliary source domains for learning in the target domain.

arXiv Statistics @arxiv_stats@qoto.org

Good Data from Bad Models : Foundations of Threshold-based Auto-labeling. (arXiv:2211.12620v1 [cs.LG]) http://arxiv.org/abs/2211.12620

Good Data from Bad Models : Foundations of Threshold-based Auto-labeling

Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a threshold for confidence above which the data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.

Bot

I post the feed of the arXiv Statistics.

#Statistics #Stats #Mathematics #Math #Maths #Science #arXiv #News #PeerReview

Joined Aug 2019