Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water. (arXiv:2209.14297v1 [q-bio.OT]) arxiv.org/abs/2209.14297

There exist opportunities for Multivariate Linear Regression (MLR) in the prediction of Biochemical Oxygen Demand (BOD) in waste water, using diverse water quality parameters as input variables. The goal of this work is to examine the capability of MLR for predicting BOD in waste water from four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. These four variables showed the strongest correlation with BOD among the seven parameters examined. Models were trained on both 80% and 90% of the data, with the remaining 20% and 10%, respectively, used as the test set. MLR performance was evaluated through the coefficient of correlation (r), Root Mean Square Error (RMSE) and the percentage accuracy in predicting BOD. With these four input variables, the model achieved RMSE = 6.77 mg/L, r = 0.60 and 70.3% accuracy for the 80% training set, and RMSE = 6.74 mg/L, r = 0.60 and 87.5% accuracy for the 90% training set. Increasing the training set beyond 80% of the data improved the reported accuracy but did not significantly change the model's predictive capacity. The results show that an MLR model can be successfully employed to estimate BOD in waste water using appropriately selected input parameters.
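
As a concrete illustration of this workflow, a minimal sketch in Python: a multivariate linear regression fitted on synthetic stand-in data (the paper's water-quality measurements are not reproduced here), evaluated with the same 80/20 and 90/10 splits and the RMSE and correlation metrics named above. Feature names and coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                         # stand-ins for DO, Nitrogen, Fecal Coliform, Total Coliform
true_coef = np.array([-2.0, 1.5, 0.8, 0.5])         # made-up coefficients
y = X @ true_coef + rng.normal(scale=3.0, size=n)   # synthetic BOD (mg/L)

for test_frac in (0.2, 0.1):                        # 80/20 and 90/10 splits, as in the abstract
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    r = np.corrcoef(y_te, pred)[0, 1]
    print(f"test fraction {test_frac}: RMSE = {rmse:.2f} mg/L, r = {r:.2f}")
```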

Model Specification in Mixed-Effects Models: A Focus on Random Effects. (arXiv:2209.14349v1 [stat.ME]) arxiv.org/abs/2209.14349

Mixed-effects regression models are powerful tools for researchers in a myriad of fields. Part of the appeal of mixed-effects models is their great flexibility, but that flexibility comes at the cost of complexity, and if users are not careful in how their model is specified, they could be making faulty inferences from their data. As others have argued, we think there is a great deal of confusion around the appropriate random effects to include in a model given the study design, with researchers generally being better at specifying the fixed effects of a model, which map onto their research hypotheses. To that end, we present an instructive framework for evaluating the random effects of a model in three different situations: (1) longitudinal designs; (2) factorial repeated measures; and (3) when dealing with multiple sources of variance. We provide worked examples with open-access code and data in an online repository. This framework will be helpful for students and researchers who are new to mixed-effects models, and to reviewers who may have to evaluate a novel model as part of their review. Ultimately, it is difficult to specify "the" appropriate random-effects structure for a mixed model, but by giving users tools to think more deeply about their random effects, we can improve the validity of statistical conclusions in many areas of research.
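
For readers new to random-effects specification, here is a minimal sketch of situation (1), a longitudinal design, using Python's statsmodels rather than the paper's own worked examples or repository code: random intercepts and random slopes for time, grouped by subject, fitted to simulated data. Variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, n_obs = 40, 6
subj = np.repeat(np.arange(n_subj), n_obs)
time = np.tile(np.arange(n_obs), n_subj)
u0 = rng.normal(scale=1.0, size=n_subj)        # subject-specific intercept deviations
u1 = rng.normal(scale=0.3, size=n_subj)        # subject-specific slope deviations
y = 2.0 + 0.5 * time + u0[subj] + u1[subj] * time + rng.normal(scale=0.5, size=subj.size)
df = pd.DataFrame({"y": y, "time": time, "subject": subj})

# Fixed effect of time; random intercept and random slope of time per subject.
model = smf.mixedlm("y ~ time", df, groups=df["subject"], re_formula="~time")
print(model.fit().summary())
```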

Fast Inference for Quantile Regression with Millions of Observations. (arXiv:2209.14502v1 [econ.EM]) arxiv.org/abs/2209.14502

While applications of big data analytics have brought many new opportunities to economic research, with datasets containing millions of observations, making the usual econometric inferences based on extreme estimators would require huge computing power and memory that are often not accessible. In this paper, we focus on linear quantile regression employed to analyze "ultra-large" datasets such as U.S. decennial censuses. We develop a new inference framework that runs very fast, based on stochastic sub-gradient descent (S-subGD) updates. The cross-sectional data are fed sequentially into the inference procedure: (i) the parameter estimate is updated when each "new observation" arrives, (ii) it is aggregated as the Polyak-Ruppert average, and (iii) a pivotal statistic for inference is computed using the solution path only. We leverage insights from time series regression and construct an asymptotically pivotal statistic via random scaling. Our proposed test statistic is computed in a fully online fashion and the critical values are obtained without any resampling methods. We conduct extensive numerical studies to showcase the computational merits of our proposed inference. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method can generate new insights beyond the computational capabilities of existing inference methods. Specifically, we uncover trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for over $10^3$ covariates to mitigate confounding effects.
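
A rough sketch of the core S-subGD update for linear quantile regression with Polyak-Ruppert averaging, run on simulated data. The learning-rate constants and the data-generating process are illustrative assumptions, and the random-scaling inference statistic built from the solution path is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, tau = 100_000, 10, 0.5                   # sample size, regressors, quantile level
X = rng.normal(size=(n, d))
beta_true = np.linspace(1.0, 2.0, d)
y = X @ beta_true + rng.standard_cauchy(size=n)    # heavy-tailed noise

beta = np.zeros(d)                             # current iterate
beta_bar = np.zeros(d)                         # Polyak-Ruppert average
for t in range(1, n + 1):                      # observations processed one at a time
    x_t, y_t = X[t - 1], y[t - 1]
    gamma_t = 0.5 * t ** -0.67                 # step size gamma_t = c * t^(-a), a in (1/2, 1)
    grad = -x_t * (tau - float(y_t - x_t @ beta <= 0.0))   # subgradient of the check loss
    beta = beta - gamma_t * grad
    beta_bar += (beta - beta_bar) / t          # running average over the solution path

print("averaged estimate:", np.round(beta_bar, 2))
```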

A new method to construct high-dimensional copulas with Bernoulli and Coxian-2 distributions. (arXiv:2209.13675v1 [math.ST]) arxiv.org/abs/2209.13675

We propose an approach to construct a new family of generalized Farlie-Gumbel-Morgenstern (GFGM) copulas that naturally scales to high dimensions. A GFGM copula can model moderate positive and negative dependence, cover different types of asymmetries, and admits exact expressions for many quantities of interest, such as measures of association or risk measures in actuarial science or quantitative risk management. More importantly, this paper presents a new method to construct high-dimensional copulas based on mixtures of power functions, which may be adapted to more general contexts to construct broader families of copulas. We construct a family of copulas through a stochastic representation based on multivariate Bernoulli distributions and Coxian-2 distributions. The paper covers the construction of a GFGM copula and studies its measures of multivariate association and dependence properties. We explain how to sample random vectors from the new family of copulas in high dimensions. Then, we study the bivariate case in detail and find that our construction leads to an asymmetric modified Huang-Kotz FGM copula. Finally, we study the exchangeable case and provide some insights into the most negative dependence structure within this new class of high-dimensional copulas.
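
For intuition about the family being generalized, here is a small sketch of sampling from the classical bivariate FGM copula by conditional inversion. This is not the paper's Bernoulli/Coxian-2 stochastic representation (which is what makes the high-dimensional construction work); it only illustrates the base FGM dependence structure, with a sanity check against Spearman's rho, which equals theta/3 for this copula.

```python
import numpy as np
from scipy.stats import spearmanr

def sample_fgm(n, theta, seed=None):
    """Draw n pairs (u, v) from the FGM copula C(u, v) = u v [1 + theta (1-u)(1-v)], |theta| <= 1."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)                    # uniform draw used to invert C(v | u)
    a = 1.0 + theta * (1.0 - 2.0 * u)
    # C(v | u) = v * (1 + theta * (1 - v) * (1 - 2u)); take the stable root of the quadratic.
    v = 2.0 * w / (a + np.sqrt(a * a - 4.0 * (a - 1.0) * w))
    return u, v

u, v = sample_fgm(200_000, theta=0.8, seed=3)
print("empirical Spearman's rho:", round(spearmanr(u, v).correlation, 3),
      "theoretical theta/3:", round(0.8 / 3, 3))
```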

Hamiltonian Adaptive Importance Sampling. (arXiv:2209.13716v1 [cs.LG]) arxiv.org/abs/2209.13716

Importance sampling (IS) is a powerful Monte Carlo (MC) methodology for approximating integrals, for instance in the context of Bayesian inference. In IS, the samples are simulated from the so-called proposal distribution, and the choice of this proposal is key for achieving high performance. In adaptive IS (AIS) methods, a set of proposals is iteratively improved. AIS is a relevant and timely methodology, although many limitations remain to be overcome, e.g., the curse of dimensionality in high-dimensional and multi-modal problems. Moreover, the Hamiltonian Monte Carlo (HMC) algorithm has become increasingly popular in machine learning and statistics. HMC has several appealing features, such as its exploratory behavior in high-dimensional targets where other methods suffer. In this paper, we introduce the novel Hamiltonian adaptive importance sampling (HAIS) method. HAIS implements a two-step adaptive process with parallel HMC chains that cooperate at each iteration. The proposed HAIS efficiently adapts a population of proposals, extracting the advantages of HMC. HAIS can be understood as a particular instance of the generic layered AIS family with an additional resampling step. HAIS achieves a significant performance improvement in high-dimensional problems w.r.t. state-of-the-art algorithms. We discuss the statistical properties of HAIS and show its high performance in two challenging examples.
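
A heavily simplified sketch of the underlying idea: Gaussian proposals whose means are adapted with one HMC (leapfrog) transition per iteration, followed by importance sampling with deterministic-mixture weights. It omits the resampling step, the cooperation between chains, and any tuning, so it should be read as an illustration of HMC-driven proposal adaptation rather than as the authors' HAIS algorithm; all constants and the toy Gaussian target are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_chains, n_iters, n_per = 2, 4, 50, 100
target_mean = np.array([2.0, -1.0])            # toy Gaussian target

def log_target(x):
    d = x - target_mean
    return -0.5 * np.sum(d * d, axis=-1)       # unnormalized log-density

def grad_log_target(x):
    return -(x - target_mean)

def hmc_step(x, step=0.2, n_leapfrog=10):
    """One leapfrog HMC transition leaving the target invariant."""
    p0 = rng.normal(size=x.shape)
    x_new, p = x.copy(), p0 + 0.5 * step * grad_log_target(x)
    for _ in range(n_leapfrog):
        x_new = x_new + step * p
        p = p + step * grad_log_target(x_new)
    p = p - 0.5 * step * grad_log_target(x_new)            # undo the extra half step
    log_alpha = (log_target(x_new) - 0.5 * p @ p) - (log_target(x) - 0.5 * p0 @ p0)
    return x_new if np.log(rng.uniform()) < log_alpha else x

means = rng.normal(scale=3.0, size=(n_chains, dim))        # initial proposal locations
prop_std = 1.0
for _ in range(n_iters):
    # Adaptation layer: move each proposal mean with one HMC transition.
    means = np.array([hmc_step(m) for m in means])
    # Sampling layer: draw from each Gaussian proposal and weight against the target.
    samples = means[:, None, :] + prop_std * rng.normal(size=(n_chains, n_per, dim))
    flat = samples.reshape(-1, dim)
    diff = flat[:, None, :] - means[None, :, :]
    log_mix = np.logaddexp.reduce(-0.5 * np.sum(diff * diff, axis=-1) / prop_std**2, axis=1)
    log_w = log_target(flat) - log_mix         # deterministic-mixture weights (constants cancel)
    w = np.exp(log_w - log_w.max())
    estimate = (w[:, None] * flat).sum(axis=0) / w.sum()

print("self-normalized IS estimate of the target mean:", np.round(estimate, 2))
```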

Statistical limits of correlation detection in trees. (arXiv:2209.13723v1 [math.ST]) arxiv.org/abs/2209.13723

In this paper we address the problem of testing whether two observed trees $(t,t')$ are sampled either independently or from a joint distribution under which they are correlated. This problem, which we refer to as correlation detection in trees, plays a key role in the study of graph alignment for two correlated random graphs. Motivated by graph alignment, we investigate the conditions of existence of one-sided tests, i.e. tests which have vanishing type I error and non-vanishing power in the limit of large tree depth. For the correlated Galton-Watson model with Poisson offspring of mean $\lambda > 0$ and correlation parameter $s \in (0,1)$, we identify a phase transition in the limit of large degrees at $s = \sqrt{\alpha}$, where $\alpha \sim 0.3383$ is Otter's constant. Namely, we prove that no such test exists for $s \leq \sqrt{\alpha}$, and that such a test exists whenever $s > \sqrt{\alpha}$, for $\lambda$ large enough. This result sheds new light on the graph alignment problem in the sparse regime (with $O(1)$ average node degrees) and on the performance of the MPAlign method studied in Ganassali et al. (2021), Piccioli et al. (2021), proving in particular the conjecture of Piccioli et al. (2021) that MPAlign succeeds in the partial recovery task for correlation parameter $s > \sqrt{\alpha}$, provided the average node degree $\lambda$ is large enough.
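
For readers unfamiliar with the model, here is a small simulation sketch of a pair of correlated Poisson Galton-Watson trees, using the common "shared children" construction (each node spawns Poisson(s*lambda) children common to both trees plus independent Poisson((1-s)*lambda) tree-specific children, as in Ganassali et al.); the paper's exact model may differ in details, and the one-sided test itself is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(5)

def gw_size(lam, depth):
    """Node count of an independent Poisson(lam) Galton-Watson tree grown to `depth`."""
    if depth == 0:
        return 1
    return 1 + sum(gw_size(lam, depth - 1) for _ in range(rng.poisson(lam)))

def correlated_gw_sizes(lam, s, depth):
    """Node counts of two correlated Poisson(lam) Galton-Watson trees grown to `depth`."""
    if depth == 0:
        return 1, 1
    n1 = n2 = 1
    for _ in range(rng.poisson(s * lam)):          # children shared by both trees: recurse jointly
        c1, c2 = correlated_gw_sizes(lam, s, depth - 1)
        n1, n2 = n1 + c1, n2 + c2
    for _ in range(rng.poisson((1 - s) * lam)):    # children present only in the first tree
        n1 += gw_size(lam, depth - 1)
    for _ in range(rng.poisson((1 - s) * lam)):    # children present only in the second tree
        n2 += gw_size(lam, depth - 1)
    return n1, n2

alpha = 0.3383                                     # Otter's constant (approximate value)
print("detection threshold sqrt(alpha) ~", round(alpha ** 0.5, 4))
print("correlated tree sizes:", correlated_gw_sizes(lam=2.0, s=0.7, depth=6))
```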

Multi-Stage Multi-Fidelity Gaussian Process Modeling, with Application to Heavy-Ion Collisions. (arXiv:2209.13748v1 [stat.ME]) arxiv.org/abs/2209.13748

In an era where scientific experimentation is often costly, multi-fidelity emulation provides a powerful tool for predictive scientific computing. While there has been notable work on multi-fidelity modeling, existing models do not incorporate an important multi-stage property of multi-fidelity simulators, where multiple fidelity parameters control for accuracy at different experimental stages. Such multi-stage simulators are widely encountered in complex nuclear physics and astrophysics problems. We thus propose a new Multi-stage Multi-fidelity Gaussian Process (M$^2$GP) model, which embeds this multi-stage structure within a novel non-stationary covariance function. We show that the M$^2$GP model can capture prior knowledge on the numerical convergence of multi-stage simulators, which allows for cost-efficient emulation of multi-fidelity systems. We demonstrate the improved predictive performance of the M$^2$GP model over state-of-the-art methods in a suite of numerical experiments and two applications, the first for emulation of cantilever beam deflection and the second for emulating the evolution of the quark-gluon plasma, which was theorized to have filled the Universe shortly after the Big Bang.
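
As a toy illustration of emulation with an explicit fidelity parameter, the sketch below fits a standard Gaussian process over the joint input (x, t), where t plays the role of a mesh-size-like fidelity parameter, and predicts at t = 0 (the emulated exact solution). The product RBF kernel, the toy simulator, and the hyperparameters are assumptions for illustration only and are not the paper's multi-stage non-stationary M$^2$GP covariance.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulator(x, t):
    """Toy simulator: the 'truth' sin(2*pi*x) plus a fidelity-dependent bias that vanishes at t = 0."""
    return np.sin(2 * np.pi * x) + 2.0 * t * np.cos(4 * np.pi * x)

def kernel(a, b, ls=(0.2, 0.3), var=1.0):
    """Product RBF kernel on (x, t) pairs; a and b have shapes (n, 2) and (m, 2)."""
    sq = sum(((a[:, None, i] - b[None, :, i]) / ls[i]) ** 2 for i in range(2))
    return var * np.exp(-0.5 * sq)

# Training design: many cheap low-fidelity runs (large t), a few high-fidelity runs (small t).
x_lo, x_hi = rng.uniform(size=30), rng.uniform(size=6)
X = np.column_stack([np.concatenate([x_lo, x_hi]),
                     np.concatenate([np.full(30, 0.4), np.full(6, 0.1)])])
y = simulator(X[:, 0], X[:, 1])

# Standard GP posterior mean, evaluated at fidelity t = 0.
K = kernel(X, X) + 1e-6 * np.eye(len(X))
x_test = np.linspace(0.0, 1.0, 200)
X_test = np.column_stack([x_test, np.zeros_like(x_test)])
mean = kernel(X_test, X) @ np.linalg.solve(K, y)
err = np.abs(mean - np.sin(2 * np.pi * x_test)).max()
print(f"max emulation error at t = 0: {err:.3f}")
```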
