Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water. (arXiv:2209.14297v1 [q-bio.OT]) arxiv.org/abs/2209.14297

There exist opportunities for Multivariate Linear Regression (MLR) in the prediction of Biochemical Oxygen Demand (BOD) in waste water, using diverse water quality parameters as input variables. The goal of this work is to examine the capability of MLR for predicting BOD in waste water from four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. These four variables showed the strongest correlation with BOD among the seven parameters examined. Models were trained on both 80% and 90% of the data, with the remaining 20% and 10%, respectively, used as the test set. MLR performance was evaluated through the coefficient of correlation (r), Root Mean Square Error (RMSE) and the percentage accuracy in predicting BOD. With these four input variables, the model achieved RMSE = 6.77 mg/L, r = 0.60 and 70.3% accuracy for the 80% training set, and RMSE = 6.74 mg/L, r = 0.60 and 87.5% accuracy for the 90% training set. Increasing the training set beyond 80% of the data improved the reported accuracy but did not significantly change the model's predictive capacity. The results show that an MLR model can be successfully employed to estimate BOD in waste water using appropriately selected input parameters.
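
As a concrete illustration of this workflow, a minimal sketch in Python: a multivariate linear regression fitted on synthetic stand-in data (the paper's water-quality measurements are not reproduced here), evaluated with the same 80/20 and 90/10 splits and the RMSE and correlation metrics named above. Feature names and coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                         # stand-ins for DO, Nitrogen, Fecal Coliform, Total Coliform
true_coef = np.array([-2.0, 1.5, 0.8, 0.5])         # made-up coefficients
y = X @ true_coef + rng.normal(scale=3.0, size=n)   # synthetic BOD (mg/L)

for test_frac in (0.2, 0.1):                        # 80/20 and 90/10 splits, as in the abstract
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    r = np.corrcoef(y_te, pred)[0, 1]
    print(f"test fraction {test_frac}: RMSE = {rmse:.2f} mg/L, r = {r:.2f}")
```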

Model Specification in Mixed-Effects Models: A Focus on Random Effects. (arXiv:2209.14349v1 [stat.ME]) arxiv.org/abs/2209.14349

Mixed-effects regression models are powerful tools for researchers in a myriad of fields. Part of the appeal of mixed-effects models is their great flexibility, but that flexibility comes at the cost of complexity, and if users are not careful in how their model is specified, they could be making faulty inferences from their data. As others have argued, we think there is a great deal of confusion around the appropriate random effects to include in a model given the study design, with researchers generally being better at specifying the fixed effects of a model, which map onto their research hypotheses. To that end, we present an instructive framework for evaluating the random effects of a model in three different situations: (1) longitudinal designs; (2) factorial repeated measures; and (3) when dealing with multiple sources of variance. We provide worked examples with open-access code and data in an online repository. This framework will be helpful for students and researchers who are new to mixed-effects models, and to reviewers who may have to evaluate a novel model as part of their review. Ultimately, it is difficult to specify "the" appropriate random-effects structure for a mixed model, but by giving users tools to think more deeply about their random effects, we can improve the validity of statistical conclusions in many areas of research.
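
For readers new to random-effects specification, here is a minimal sketch of situation (1), a longitudinal design, using Python's statsmodels rather than the paper's own worked examples or repository code: random intercepts and random slopes for time, grouped by subject, fitted to simulated data. Variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, n_obs = 40, 6
subj = np.repeat(np.arange(n_subj), n_obs)
time = np.tile(np.arange(n_obs), n_subj)
u0 = rng.normal(scale=1.0, size=n_subj)        # subject-specific intercept deviations
u1 = rng.normal(scale=0.3, size=n_subj)        # subject-specific slope deviations
y = 2.0 + 0.5 * time + u0[subj] + u1[subj] * time + rng.normal(scale=0.5, size=subj.size)
df = pd.DataFrame({"y": y, "time": time, "subject": subj})

# Fixed effect of time; random intercept and random slope of time per subject.
model = smf.mixedlm("y ~ time", df, groups=df["subject"], re_formula="~time")
print(model.fit().summary())
```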

Fast Inference for Quantile Regression with Millions of Observations. (arXiv:2209.14502v1 [econ.EM]) arxiv.org/abs/2209.14502

While applications of big data analytics have brought many new opportunities to economic research, with datasets containing millions of observations, making the usual econometric inferences based on extreme estimators would require huge computing power and memory that are often not accessible. In this paper, we focus on linear quantile regression employed to analyze "ultra-large" datasets such as U.S. decennial censuses. We develop a new inference framework that runs very fast, based on stochastic sub-gradient descent (S-subGD) updates. The cross-sectional data are fed sequentially into the inference procedure: (i) the parameter estimate is updated when each "new observation" arrives, (ii) it is aggregated as the Polyak-Ruppert average, and (iii) a pivotal statistic for inference is computed using the solution path only. We leverage insights from time series regression and construct an asymptotically pivotal statistic via random scaling. Our proposed test statistic is computed in a fully online fashion and the critical values are obtained without any resampling methods. We conduct extensive numerical studies to showcase the computational merits of our proposed inference. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method can generate new insights beyond the computational capabilities of existing inference methods. Specifically, we uncover trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for over $10^3$ covariates to mitigate confounding effects.
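
A rough sketch of the core S-subGD update for linear quantile regression with Polyak-Ruppert averaging, run on simulated data. The learning-rate constants and the data-generating process are illustrative assumptions, and the random-scaling inference statistic built from the solution path is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, tau = 100_000, 10, 0.5                   # sample size, regressors, quantile level
X = rng.normal(size=(n, d))
beta_true = np.linspace(1.0, 2.0, d)
y = X @ beta_true + rng.standard_cauchy(size=n)    # heavy-tailed noise

beta = np.zeros(d)                             # current iterate
beta_bar = np.zeros(d)                         # Polyak-Ruppert average
for t in range(1, n + 1):                      # observations processed one at a time
    x_t, y_t = X[t - 1], y[t - 1]
    gamma_t = 0.5 * t ** -0.67                 # step size gamma_t = c * t^(-a), a in (1/2, 1)
    grad = -x_t * (tau - float(y_t - x_t @ beta <= 0.0))   # subgradient of the check loss
    beta = beta - gamma_t * grad
    beta_bar += (beta - beta_bar) / t          # running average over the solution path

print("averaged estimate:", np.round(beta_bar, 2))
```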

A new method to construct high-dimensional copulas with Bernoulli and Coxian-2 distributions. (arXiv:2209.13675v1 [math.ST]) arxiv.org/abs/2209.13675

We propose an approach to construct a new family of generalized Farlie-Gumbel-Morgenstern (GFGM) copulas that naturally scales to high dimensions. A GFGM copula can model moderate positive and negative dependence, cover different types of asymmetries, and admits exact expressions for many quantities of interest, such as measures of association or risk measures in actuarial science or quantitative risk management. More importantly, this paper presents a new method to construct high-dimensional copulas based on mixtures of power functions, which may be adapted to more general contexts to construct broader families of copulas. We construct a family of copulas through a stochastic representation based on multivariate Bernoulli distributions and Coxian-2 distributions. The paper covers the construction of a GFGM copula and studies its measures of multivariate association and dependence properties. We explain how to sample random vectors from the new family of copulas in high dimensions. Then, we study the bivariate case in detail and find that our construction leads to an asymmetric modified Huang-Kotz FGM copula. Finally, we study the exchangeable case and provide some insights into the most negative dependence structure within this new class of high-dimensional copulas.
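
For intuition about the family being generalized, here is a small sketch of sampling from the classical bivariate FGM copula by conditional inversion. This is not the paper's Bernoulli/Coxian-2 stochastic representation (which is what makes the high-dimensional construction work); it only illustrates the base FGM dependence structure, with a sanity check against Spearman's rho, which equals theta/3 for this copula.

```python
import numpy as np
from scipy.stats import spearmanr

def sample_fgm(n, theta, seed=None):
    """Draw n pairs (u, v) from the FGM copula C(u, v) = u v [1 + theta (1-u)(1-v)], |theta| <= 1."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)                    # uniform draw used to invert C(v | u)
    a = 1.0 + theta * (1.0 - 2.0 * u)
    # C(v | u) = v * (1 + theta * (1 - v) * (1 - 2u)); take the stable root of the quadratic.
    v = 2.0 * w / (a + np.sqrt(a * a - 4.0 * (a - 1.0) * w))
    return u, v

u, v = sample_fgm(200_000, theta=0.8, seed=3)
print("empirical Spearman's rho:", round(spearmanr(u, v).correlation, 3),
      "theoretical theta/3:", round(0.8 / 3, 3))
```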

Hamiltonian Adaptive Importance Sampling. (arXiv:2209.13716v1 [cs.LG]) arxiv.org/abs/2209.13716

Importance sampling (IS) is a powerful Monte Carlo (MC) methodology for approximating integrals, for instance in the context of Bayesian inference. In IS, the samples are simulated from the so-called proposal distribution, and the choice of this proposal is key for achieving high performance. In adaptive IS (AIS) methods, a set of proposals is iteratively improved. AIS is a relevant and timely methodology, although many limitations remain to be overcome, e.g., the curse of dimensionality in high-dimensional and multi-modal problems. Moreover, the Hamiltonian Monte Carlo (HMC) algorithm has become increasingly popular in machine learning and statistics. HMC has several appealing features, such as its exploratory behavior in high-dimensional targets where other methods suffer. In this paper, we introduce the novel Hamiltonian adaptive importance sampling (HAIS) method. HAIS implements a two-step adaptive process with parallel HMC chains that cooperate at each iteration. The proposed HAIS efficiently adapts a population of proposals, extracting the advantages of HMC. HAIS can be understood as a particular instance of the generic layered AIS family with an additional resampling step. HAIS achieves a significant performance improvement in high-dimensional problems w.r.t. state-of-the-art algorithms. We discuss the statistical properties of HAIS and show its high performance in two challenging examples.
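
A heavily simplified sketch of the underlying idea: Gaussian proposals whose means are adapted with one HMC (leapfrog) transition per iteration, followed by importance sampling with deterministic-mixture weights. It omits the resampling step, the cooperation between chains, and any tuning, so it should be read as an illustration of HMC-driven proposal adaptation rather than as the authors' HAIS algorithm; all constants and the toy Gaussian target are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_chains, n_iters, n_per = 2, 4, 50, 100
target_mean = np.array([2.0, -1.0])            # toy Gaussian target

def log_target(x):
    d = x - target_mean
    return -0.5 * np.sum(d * d, axis=-1)       # unnormalized log-density

def grad_log_target(x):
    return -(x - target_mean)

def hmc_step(x, step=0.2, n_leapfrog=10):
    """One leapfrog HMC transition leaving the target invariant."""
    p0 = rng.normal(size=x.shape)
    x_new, p = x.copy(), p0 + 0.5 * step * grad_log_target(x)
    for _ in range(n_leapfrog):
        x_new = x_new + step * p
        p = p + step * grad_log_target(x_new)
    p = p - 0.5 * step * grad_log_target(x_new)            # undo the extra half step
    log_alpha = (log_target(x_new) - 0.5 * p @ p) - (log_target(x) - 0.5 * p0 @ p0)
    return x_new if np.log(rng.uniform()) < log_alpha else x

means = rng.normal(scale=3.0, size=(n_chains, dim))        # initial proposal locations
prop_std = 1.0
for _ in range(n_iters):
    # Adaptation layer: move each proposal mean with one HMC transition.
    means = np.array([hmc_step(m) for m in means])
    # Sampling layer: draw from each Gaussian proposal and weight against the target.
    samples = means[:, None, :] + prop_std * rng.normal(size=(n_chains, n_per, dim))
    flat = samples.reshape(-1, dim)
    diff = flat[:, None, :] - means[None, :, :]
    log_mix = np.logaddexp.reduce(-0.5 * np.sum(diff * diff, axis=-1) / prop_std**2, axis=1)
    log_w = log_target(flat) - log_mix         # deterministic-mixture weights (constants cancel)
    w = np.exp(log_w - log_w.max())
    estimate = (w[:, None] * flat).sum(axis=0) / w.sum()

print("self-normalized IS estimate of the target mean:", np.round(estimate, 2))
```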

Statistical limits of correlation detection in trees. (arXiv:2209.13723v1 [math.ST]) arxiv.org/abs/2209.13723

In this paper we address the problem of testing whether two observed trees $(t,t')$ are sampled either independently or from a joint distribution under which they are correlated. This problem, which we refer to as correlation detection in trees, plays a key role in the study of graph alignment for two correlated random graphs. Motivated by graph alignment, we investigate the conditions of existence of one-sided tests, i.e. tests which have vanishing type I error and non-vanishing power in the limit of large tree depth. For the correlated Galton-Watson model with Poisson offspring of mean $\lambda > 0$ and correlation parameter $s \in (0,1)$, we identify a phase transition in the limit of large degrees at $s = \sqrt{\alpha}$, where $\alpha \sim 0.3383$ is Otter's constant. Namely, we prove that no such test exists for $s \leq \sqrt{\alpha}$, and that such a test exists whenever $s > \sqrt{\alpha}$, for $\lambda$ large enough. This result sheds new light on the graph alignment problem in the sparse regime (with $O(1)$ average node degrees) and on the performance of the MPAlign method studied in Ganassali et al. (2021), Piccioli et al. (2021), proving in particular the conjecture of Piccioli et al. (2021) that MPAlign succeeds in the partial recovery task for correlation parameter $s > \sqrt{\alpha}$, provided the average node degree $\lambda$ is large enough.
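
For readers unfamiliar with the model, here is a small simulation sketch of a pair of correlated Poisson Galton-Watson trees, using the common "shared children" construction (each node spawns Poisson(s*lambda) children common to both trees plus independent Poisson((1-s)*lambda) tree-specific children, as in Ganassali et al.); the paper's exact model may differ in details, and the one-sided test itself is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(5)

def gw_size(lam, depth):
    """Node count of an independent Poisson(lam) Galton-Watson tree grown to `depth`."""
    if depth == 0:
        return 1
    return 1 + sum(gw_size(lam, depth - 1) for _ in range(rng.poisson(lam)))

def correlated_gw_sizes(lam, s, depth):
    """Node counts of two correlated Poisson(lam) Galton-Watson trees grown to `depth`."""
    if depth == 0:
        return 1, 1
    n1 = n2 = 1
    for _ in range(rng.poisson(s * lam)):          # children shared by both trees: recurse jointly
        c1, c2 = correlated_gw_sizes(lam, s, depth - 1)
        n1, n2 = n1 + c1, n2 + c2
    for _ in range(rng.poisson((1 - s) * lam)):    # children present only in the first tree
        n1 += gw_size(lam, depth - 1)
    for _ in range(rng.poisson((1 - s) * lam)):    # children present only in the second tree
        n2 += gw_size(lam, depth - 1)
    return n1, n2

alpha = 0.3383                                     # Otter's constant (approximate value)
print("detection threshold sqrt(alpha) ~", round(alpha ** 0.5, 4))
print("correlated tree sizes:", correlated_gw_sizes(lam=2.0, s=0.7, depth=6))
```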

Multi-Stage Multi-Fidelity Gaussian Process Modeling, with Application to Heavy-Ion Collisions. (arXiv:2209.13748v1 [stat.ME]) arxiv.org/abs/2209.13748

In an era where scientific experimentation is often costly, multi-fidelity emulation provides a powerful tool for predictive scientific computing. While there has been notable work on multi-fidelity modeling, existing models do not incorporate an important multi-stage property of multi-fidelity simulators, where multiple fidelity parameters control for accuracy at different experimental stages. Such multi-stage simulators are widely encountered in complex nuclear physics and astrophysics problems. We thus propose a new Multi-stage Multi-fidelity Gaussian Process (M$^2$GP) model, which embeds this multi-stage structure within a novel non-stationary covariance function. We show that the M$^2$GP model can capture prior knowledge on the numerical convergence of multi-stage simulators, which allows for cost-efficient emulation of multi-fidelity systems. We demonstrate the improved predictive performance of the M$^2$GP model over state-of-the-art methods in a suite of numerical experiments and two applications, the first for emulation of cantilever beam deflection and the second for emulating the evolution of the quark-gluon plasma, which was theorized to have filled the Universe shortly after the Big Bang.
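
As a toy illustration of emulation with an explicit fidelity parameter, the sketch below fits a standard Gaussian process over the joint input (x, t), where t plays the role of a mesh-size-like fidelity parameter, and predicts at t = 0 (the emulated exact solution). The product RBF kernel, the toy simulator, and the hyperparameters are assumptions for illustration only and are not the paper's multi-stage non-stationary M$^2$GP covariance.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulator(x, t):
    """Toy simulator: the 'truth' sin(2*pi*x) plus a fidelity-dependent bias that vanishes at t = 0."""
    return np.sin(2 * np.pi * x) + 2.0 * t * np.cos(4 * np.pi * x)

def kernel(a, b, ls=(0.2, 0.3), var=1.0):
    """Product RBF kernel on (x, t) pairs; a and b have shapes (n, 2) and (m, 2)."""
    sq = sum(((a[:, None, i] - b[None, :, i]) / ls[i]) ** 2 for i in range(2))
    return var * np.exp(-0.5 * sq)

# Training design: many cheap low-fidelity runs (large t), a few high-fidelity runs (small t).
x_lo, x_hi = rng.uniform(size=30), rng.uniform(size=6)
X = np.column_stack([np.concatenate([x_lo, x_hi]),
                     np.concatenate([np.full(30, 0.4), np.full(6, 0.1)])])
y = simulator(X[:, 0], X[:, 1])

# Standard GP posterior mean, evaluated at fidelity t = 0.
K = kernel(X, X) + 1e-6 * np.eye(len(X))
x_test = np.linspace(0.0, 1.0, 200)
X_test = np.column_stack([x_test, np.zeros_like(x_test)])
mean = kernel(X_test, X) @ np.linalg.solve(K, y)
err = np.abs(mean - np.sin(2 * np.pi * x_test)).max()
print(f"max emulation error at t = 0: {err:.3f}")
```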
