A response-adaptive multi-arm design for continuous endpoints based on a weighted information measure arxiv.org/abs/2409.04970 .ME .IT .AP

Multi-arm trials are gaining interest in practice given the statistical and logistical advantages that they can offer. The standard approach is to use an allocation ratio that is fixed throughout the trial, but there is a call for making it adaptive and skewing the allocation of patients towards better performing arms. However, among other challenges, it is well-known that these approaches might suffer from lower statistical power. We present a response-adaptive design for continuous endpoints which explicitly allows control of the trade-off between the number of patients allocated to the 'optimal' arm and the statistical power. Such a balance is achieved through the calibration of a tuning parameter, and we explore various strategies to effectively select it. The proposed criterion is based on a context-dependent information measure which gives greater weight to those treatment arms whose characteristics are close to a pre-specified clinical target. We also introduce a simulation-based hypothesis testing procedure which focuses on selecting the target arm, discussing strategies to effectively control the type-I error rate. The potential advantage of the proposed criterion over currently used alternatives is evaluated in simulations, and its practical implementation is illustrated in the context of early Phase IIa proof-of-concept oncology clinical trials.
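
A minimal simulation sketch of the core idea: allocation probabilities are skewed toward arms whose running mean is closest to the clinical target, with `gamma` playing the role of the tuning parameter. The exponential weighting, burn-in length, and Gaussian outcome model are illustrative stand-ins, not the paper's actual weighted information measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_trial(true_means, target, gamma, n_patients=200, burn_in=30, sigma=1.0):
    """Skew allocation toward arms whose running mean is near the target;
    gamma = 0 recovers equal randomisation."""
    k = len(true_means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for i in range(n_patients):
        if i < burn_in:                                # equal allocation to start
            arm = i % k
        else:
            means = sums / np.maximum(counts, 1)
            w = np.exp(-gamma * np.abs(means - target))  # closeness-to-target weights
            arm = rng.choice(k, p=w / w.sum())
        y = rng.normal(true_means[arm], sigma)           # continuous endpoint
        counts[arm] += 1
        sums[arm] += y
    return counts, sums / np.maximum(counts, 1)

counts, est = adaptive_trial(true_means=[0.2, 0.5, 0.9], target=1.0, gamma=4.0)
print("patients per arm:", counts)   # allocation concentrates near the target arm
```

Raising `gamma` shifts patients toward the target arm at the cost of fewer observations, and hence lower power, on the competing arms, which is exactly the trade-off the tuning parameter is meant to control.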

Resultant: Incremental Effectiveness on Likelihood for Unsupervised Out-of-Distribution Detection arxiv.org/abs/2409.03801 .ML .LG

Unsupervised out-of-distribution (U-OOD) detection aims to identify OOD data samples with a detector trained solely on unlabeled in-distribution (ID) data. The likelihood function estimated by a deep generative model (DGM) could be a natural detector, but its performance is limited in some popular "hard" benchmarks, such as FashionMNIST (ID) vs. MNIST (OOD). Recent studies have developed various detectors based on DGMs to move beyond likelihood. However, despite their success on "hard" benchmarks, most of them struggle to consistently surpass or match the performance of likelihood on some "non-hard" cases, such as SVHN (ID) vs. CIFAR10 (OOD), where likelihood could be a nearly perfect detector. Therefore, we call for more attention to incremental effectiveness on likelihood, i.e., whether a method can always surpass or at least match the performance of likelihood in U-OOD detection. We first investigate the likelihood of variational DGMs and find that its detection performance could be improved in two directions: i) alleviating latent distribution mismatch, and ii) calibrating the dataset entropy-mutual integration. We then apply one technique in each direction: a post-hoc prior and dataset entropy-mutual calibration, respectively. The final method, named Resultant, combines these two directions for better incremental effectiveness than either technique alone. Experimental results demonstrate that Resultant could be a new state-of-the-art U-OOD detector while maintaining incremental effectiveness on likelihood in a wide range of tasks.
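
A small self-contained illustration of "incremental effectiveness": starting from toy, simulated DGM log-likelihoods, additive correction terms stand in for the paper's post-hoc prior and entropy-mutual calibration, and a rank-based AUROC checks that the corrected score at least matches likelihood alone. All numbers here are synthetic assumptions, not results from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def auroc(scores_id, scores_ood):
    """Rank-based AUROC: probability an ID sample scores above an OOD sample."""
    s = np.concatenate([scores_id, scores_ood])
    ranks = s.argsort().argsort() + 1
    n_id, n_ood = len(scores_id), len(scores_ood)
    return (ranks[:n_id].sum() - n_id * (n_id + 1) / 2) / (n_id * n_ood)

# Toy stand-ins for DGM log-likelihoods on ID vs. OOD data.
log_px_id = rng.normal(0.0, 1.0, 1000)
log_px_ood = rng.normal(-0.5, 1.0, 1000)

# Hypothetical additive corrections standing in for the paper's two terms.
corr_id = rng.normal(0.3, 0.5, 1000)
corr_ood = rng.normal(-0.3, 0.5, 1000)

print("likelihood alone:", auroc(log_px_id, log_px_ood))
print("with corrections:", auroc(log_px_id + corr_id, log_px_ood + corr_ood))
```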

Active Sampling of Interpolation Points to Identify Dominant Subspaces for Model Reduction arxiv.org/abs/2409.03892 .ML .DS .NA .LG

Model reduction is an active research field that constructs low-dimensional surrogate models of high fidelity to accelerate engineering design cycles. In this work, we investigate model reduction for linear structured systems using dominant reachable and observable subspaces. When the training set, containing all possible interpolation points, is large, these subspaces can be determined by solving many large-scale linear systems. However, for high-fidelity models, this easily becomes computationally intractable. To circumvent this issue, we propose an active sampling strategy that samples only a few points from the given training set yet still allows us to estimate those subspaces accurately. To this end, we formulate the identification of the subspaces as the solution of generalized Sylvester equations, guiding us to select the most relevant samples from the training set. Consequently, we construct solutions of the matrix equations in low-rank forms, which encode the subspace information. We extensively discuss computational aspects and the efficient usage of the low-rank factors in the process of obtaining reduced-order models. We illustrate the proposed active sampling scheme by obtaining reduced-order models via dominant reachable and observable subspaces, and we compare it with the method in which all points from the training set are taken into account. It is shown that the active sampling strategy can provide a $17\times$ speed-up without sacrificing any noticeable accuracy.
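
The following is a generic residual-driven greedy sketch of active sampling for a reachability-type subspace: the full linear system is solved only at the candidate point where the current basis is worst, with a cheap residual norm as the error surrogate. The toy system `A`, `B`, the candidate grid, and the residual-based selection rule are assumptions for illustration; the paper's actual criterion comes from its generalized Sylvester-equation formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200                                           # full state dimension
A = -np.diag(rng.uniform(0.5, 5.0, n)) + 0.01 * rng.standard_normal((n, n))
B = rng.standard_normal((n, 1))
candidates = 1j * np.logspace(-1, 2, 100)         # training set of interpolation points

def greedy_subspace(n_picks):
    """Solve the full system only where the current basis is worst,
    judged by the (cheap) residual of the Galerkin-reduced solution."""
    V = np.linalg.qr(np.linalg.solve(candidates[0] * np.eye(n) - A, B))[0]
    for _ in range(n_picks - 1):
        res = []
        for s in candidates:
            Ar = V.conj().T @ (s * V - A @ V)     # reduced operator V^H (sI - A) V
            xr = V @ np.linalg.solve(Ar, V.conj().T @ B)
            res.append(np.linalg.norm(s * xr - A @ xr - B))   # residual, no full solve
        s_star = candidates[int(np.argmax(res))]
        x_new = np.linalg.solve(s_star * np.eye(n) - A, B)    # one full solve per pick
        V = np.linalg.qr(np.hstack([V, x_new]))[0]
    return V

V = greedy_subspace(8)   # reachability-style basis from only 8 full solves
print("basis shape:", V.shape)
```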

Average Causal Effect Estimation in DAGs with Hidden Variables: Extensions of Back-Door and Front-Door Criteria arxiv.org/abs/2409.03962 .ME .ML .LG

The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, existing estimators face challenges, particularly with density estimation and numerical integration for continuous variables, and their estimates may fall outside the parameter space of the target estimand. Their asymptotic properties are also underexplored, especially when flexible statistical and machine learning models are used for nuisance estimation. This study addresses these challenges by introducing novel one-step corrected plug-in and targeted minimum loss-based estimators of causal effects for a class of DAGs that extends the classical back-door and front-door criteria (known as the treatment primal fixability criterion in prior literature). These estimators leverage machine learning to minimize modeling assumptions while ensuring key statistical properties such as asymptotic linearity, double robustness, and efficiency, and they remain within the bounds of the target parameter space. We establish conditions on the nuisance functional estimates, in terms of $L_2(P)$-norms, under which the causal effect estimates are root-$n$ consistent. To facilitate practical application, we have developed the flexCausal package in R.
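
In the special case where the classical back-door criterion holds with fully observed confounders, a one-step corrected plug-in estimator reduces to the familiar AIPW form. A minimal sketch of that special case on simulated data (the data-generating process and the plain linear/logistic nuisance models are illustrative; the paper's estimators cover a much broader class of DAGs and are implemented in the R package flexCausal):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 2))                        # measured confounders
p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))   # true propensity
A = rng.binomial(1, p)
Y = 1.0 * A + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # true ATE = 1

# Nuisance fits; flexible ML models could be swapped in here.
pi = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

# One-step (AIPW) correction of the plug-in g-formula estimate.
plug_in = np.mean(mu1 - mu0)
eif = (A / pi) * (Y - mu1) - ((1 - A) / (1 - pi)) * (Y - mu0) + (mu1 - mu0)
one_step = np.mean(eif)
se = np.std(eif, ddof=1) / np.sqrt(n)
print(f"plug-in {plug_in:.3f}  one-step {one_step:.3f} +/- {1.96 * se:.3f}")
```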

Entry-Specific Matrix Estimation under Arbitrary Sampling Patterns through the Lens of Network Flows arxiv.org/abs/2409.03980 .ML .LG

Matrix completion tackles the task of predicting missing values in a low-rank matrix from a sparse set of observed entries. It is often assumed that the observation pattern is generated uniformly at random or has a very specific structure tuned to a given algorithm. There is still a gap in our understanding when it comes to arbitrary sampling patterns. Given an arbitrary sampling pattern, we introduce a matrix completion algorithm based on network flows in the bipartite graph induced by the observation pattern. For additive matrices, the particular flow we use is the electrical flow, and we establish error upper bounds customized to each entry as a function of the observation set, along with matching minimax lower bounds. Our results show that the minimax squared error for recovery of a particular entry in the matrix is proportional to the effective resistance of the corresponding edge in the graph. Furthermore, we show that our estimator is equivalent to the least squares estimator. We apply our estimator to the two-way fixed effects model and show that it enables us to accurately infer individual causal effects and the unit-specific and time-specific confounders. For rank-$1$ matrices, we use edge-disjoint paths to form an estimator that achieves minimax optimal estimation when the sampling is sufficiently dense. Our work introduces a new family of estimators parametrized by network flows, which provide a fine-grained and intuitive understanding of how a given sampling pattern affects the relative difficulty of estimation at an entry-specific level. This graph-based approach allows us to quantify the inherent complexity of matrix completion for individual entries, rather than relying solely on global measures of performance.
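
The entry-specific quantity in the bound is easy to probe numerically: build the bipartite graph of the observation pattern, take the pseudoinverse of its Laplacian, and read off the effective resistance between a row node and a column node. The sketch below uses a random mask, unit edge weights, and assumes the sampled graph is connected; all of these are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rows, n_cols = 8, 8
mask = rng.random((n_rows, n_cols)) < 0.35        # arbitrary sampling pattern

# Laplacian of the bipartite graph: rows are nodes 0..n_rows-1, columns are
# nodes n_rows..n_rows+n_cols-1, and each observed entry is an edge.
N = n_rows + n_cols
Lap = np.zeros((N, N))
for i, j in zip(*np.nonzero(mask)):
    u, v = i, n_rows + j
    Lap[u, u] += 1; Lap[v, v] += 1
    Lap[u, v] -= 1; Lap[v, u] -= 1

Lpinv = np.linalg.pinv(Lap)                       # assumes a connected graph

def effective_resistance(i, j):
    """R_eff between row node i and column node j; the minimax squared
    error for entry (i, j) scales with this quantity."""
    e = np.zeros(N); e[i] = 1.0; e[n_rows + j] = -1.0
    return float(e @ Lpinv @ e)

i, j = map(int, np.argwhere(mask)[0])
print(f"effective resistance of entry ({i},{j}):", effective_resistance(i, j))
```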

Co-Developing Causal Graphs with Domain Experts Guided by Weighted FDR-Adjusted p-values arxiv.org/abs/2409.03126 .ME

This paper proposes an approach facilitating the co-design of causal graphs between subject matter experts and statistical modellers. Modern causal analysis, starting with the formulation of causal graphs, provides benefits for robust analysis and well-grounded decision support; moreover, this process can enrich the discovery and planning phase of data science projects. The key premise is that plotting relevant statistical information on a causal graph structure can facilitate an intuitive discussion between domain experts and modellers. Furthermore, hand-crafting causal graphs, integrating human expertise with robust statistical methodology, helps ensure responsible AI practices. The paper focuses on using multiplicity-adjusted p-values, controlling the false discovery rate (FDR), as an aid for co-designing the graph. A family of hypotheses relevant to causal graph construction is identified, including assessing correlation strengths, directions of causal effects, and how well the covariance structure induced by an estimated structural causal model matches the observed one. An iterative flow is described in which an initial causal graph is drafted based on expert beliefs about likely causal relationships. The subject matter expert's beliefs, communicated as ranked scores, can be incorporated into the control of the measure proposed by Benjamini and Kling, the FDCR (False Discovery Cost Rate). The FDCR-adjusted p-values then provide feedback on which parts of the graph are supported or contradicted by the data. This co-design process continues, adding, removing, or revising arcs in the graph, until the expert and modeller converge on a satisfactory causal structure grounded in both domain knowledge and data evidence.
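
As a simplified stand-in for FDCR control (the Benjamini-Kling measure itself incorporates per-hypothesis costs), the sketch below runs a weighted Benjamini-Hochberg step-up in which expert-ranked weights loosen or tighten each edge hypothesis's threshold; the p-values, weights, and the substitution of plain weighted BH for FDCR are all hypothetical.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted BH step-up: reject hypotheses with small weight-adjusted
    p-values p_i / w_i, where the weights average to 1."""
    pvals = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.mean()                       # normalise weights to average 1
    q = pvals / w                          # weight-adjusted p-values
    order = np.argsort(q)
    m = len(q)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = q[order] <= thresh
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

p = [0.001, 0.04, 0.03, 0.20, 0.60]   # hypothetical per-edge p-values
w = [2.0, 2.0, 0.5, 1.0, 1.0]         # expert-ranked weights: >1 = believed edge
print(weighted_bh(p, w))               # which arcs the data supports at level 0.05
```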

Non-stationary and Sparsely-correlated Multi-output Gaussian Process with Spike-and-Slab Prior arxiv.org/abs/2409.03149 .ML .SY .LG .MA

The multi-output Gaussian process (MGP) is commonly used as a transfer learning method to leverage information among multiple outputs. A key advantage of the MGP is that it provides uncertainty quantification for prediction, which is highly important for subsequent decision-making tasks. However, traditional MGPs may not be sufficiently flexible to handle multivariate data with dynamic characteristics, particularly when dealing with complex temporal correlations. Additionally, since some outputs may lack correlation, transferring information among them may lead to negative transfer. To address these issues, this study proposes a non-stationary MGP model that can capture both the dynamic and sparse correlation among outputs. Specifically, the covariance functions of the MGP are constructed using convolutions of time-varying kernel functions, and a dynamic spike-and-slab prior is placed on the correlation parameters to automatically decide, during training, which sources are informative for the target output. An expectation-maximization (EM) algorithm is proposed for efficient model fitting. Both numerical studies and a real case demonstrate the model's efficacy in capturing dynamic and sparse correlation structure and mitigating negative transfer for high-dimensional time-series data. Finally, a mountain-car reinforcement learning case highlights its potential application in decision-making problems.
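
The source-selection step can be sketched in isolation: given estimated source-to-target correlation parameters, an EM loop for a two-component (spike vs. slab) Gaussian mixture returns the posterior probability that each source is informative. The toy correlation values and the fixed spike/slab variances are assumptions; the paper embeds this idea in a full non-stationary MGP with a dynamic prior.

```python
import numpy as np

def spike_slab_em(rho, v_spike=1e-3, v_slab=0.5, n_iter=50):
    """Posterior probability that each parameter comes from the slab
    (i.e. the source is informative), via EM on a two-component mixture."""
    rho = np.asarray(rho, dtype=float)
    pi = 0.5                                   # prior inclusion probability
    for _ in range(n_iter):
        # E-step: slab responsibilities (unnormalised Gaussian likelihoods;
        # the 1/sqrt(2*pi) constant cancels in the ratio)
        like_slab = np.exp(-rho**2 / (2 * v_slab)) / np.sqrt(v_slab)
        like_spike = np.exp(-rho**2 / (2 * v_spike)) / np.sqrt(v_spike)
        r = pi * like_slab / (pi * like_slab + (1 - pi) * like_spike)
        pi = r.mean()                          # M-step: update inclusion rate
    return r

# Two informative sources, three spurious ones (toy values).
rho = [0.8, 0.6, 0.02, -0.03, 0.01]
print(np.round(spike_slab_em(rho), 3))         # ~1 informative, ~0 otherwise
```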
