
A generalized Bayesian approach for high-dimensional robust regression with serially correlated errors and predictors arxiv.org/abs/2412.05673 .ME .ST .CO .TH

This paper presents a loss-based generalized Bayesian methodology for high-dimensional robust regression with serially correlated errors and predictors. The proposed framework employs a novel scaled pseudo-Huber (SPH) loss function, which smooths the well-known Huber loss, achieving a balance between quadratic and absolute linear loss behaviors. This flexibility enables the framework to accommodate both thin-tailed and heavy-tailed data effectively. The generalized Bayesian approach constructs a working likelihood utilizing the SPH loss that facilitates efficient and stable estimation while providing rigorous estimation uncertainty quantification for all model parameters. Notably, this allows formal statistical inference without requiring ad hoc tuning parameter selection while adaptively addressing a wide range of tail behavior in the errors. By specifying appropriate prior distributions for the regression coefficients -- e.g., ridge priors for small or moderate-dimensional settings and spike-and-slab priors for high-dimensional settings -- the framework ensures principled inference. We establish rigorous theoretical guarantees for the accurate estimation of underlying model parameters and the correct selection of predictor variables under sparsity assumptions for a wide range of data generating setups. Extensive simulation studies demonstrate the superiority of our approach compared to traditional quadratic and absolute linear loss-based Bayesian regression methods, highlighting its flexibility and robustness in high-dimensional and challenging data contexts.
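
As a rough illustration of the loss-based working-likelihood idea, here is a minimal numpy sketch using the standard pseudo-Huber loss. The paper's scaled variant (SPH) and its tuning differ, so the exact form below, the learning rate eta, and the scale delta are assumptions for illustration only, not the authors' estimator.

```python
# Minimal sketch: pseudo-Huber loss and the generalized-Bayesian
# working likelihood it induces. The paper's scaled pseudo-Huber (SPH)
# loss differs; this is the standard form, used here as an assumption.
import numpy as np

def pseudo_huber(r, delta=1.0):
    """Smooth interpolation between quadratic (|r| << delta)
    and linear (|r| >> delta) loss behavior."""
    return delta**2 * (np.sqrt(1.0 + (r / delta)**2) - 1.0)

def log_working_likelihood(y, X, beta, delta=1.0, eta=1.0):
    """Loss-based (Gibbs) working log-likelihood:
    -eta * sum_i L_delta(y_i - x_i' beta), with learning rate eta."""
    resid = y - X @ beta
    return -eta * pseudo_huber(resid, delta).sum()

# Example: evaluate the working log-likelihood under heavy-tailed errors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_t(df=2, size=50)  # heavy-tailed noise
print(log_working_likelihood(y, X, np.zeros(3)))
```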

Sequential anomaly identification with observation control under generalized error metrics arxiv.org/abs/2412.04693 .ST .TH

The problem of sequential anomaly detection and identification is considered, where multiple data sources are monitored simultaneously and the goal is to identify in real time those, if any, that exhibit "anomalous" statistical behavior. An upper bound is postulated on the number of data sources that can be sampled at each sampling instant, but the decision maker selects which ones to sample based on the already collected data. Thus, in this context, a policy consists not only of a stopping rule and a decision rule that determine when sampling should be terminated and which sources to identify as anomalous upon stopping, but also of a sampling rule that determines which sources to sample at each time instant subject to the sampling constraint. Two distinct formulations are considered, which require control of different "generalized" error metrics: the first tolerates a certain user-specified number of errors of any kind, whereas the second tolerates distinct, user-specified numbers of false positives and false negatives. For each formulation, a universal asymptotic lower bound on the expected stopping time is established as the error probabilities go to 0, and it is shown to be attained by a policy that combines the stopping and decision rules proposed in the full-sampling case with a probabilistic sampling rule that achieves a specific long-run sampling frequency for each source. Moreover, the first-order asymptotic approximation to the optimal expected stopping time is compared in simulation studies with the corresponding quantity in a finite regime, and the impact of the sampling constraint and the tolerance to errors is assessed.
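
A toy version of the policy structure described above, assuming Gaussian mean-shift anomalies: an SPRT-style statistic per source, a probabilistic sampling rule targeting uniform long-run frequencies under a budget of K sources per instant, and a threshold-crossing stopping rule. The thresholds, target frequencies, and decision rule below are placeholders, not the paper's optimal choices.

```python
# Toy illustration (not the paper's exact policy): Gaussian mean-shift
# anomalies, a probabilistic sampling rule under a per-instant budget K,
# and an SPRT-style stopping/decision rule with placeholder thresholds.
import numpy as np

rng = np.random.default_rng(1)
M, K = 5, 2                      # number of sources, sampling budget
mu1 = 1.0                        # anomalous mean (null mean is 0)
anomalous = np.array([0, 3])     # ground truth, used only to simulate data
target_freq = np.full(M, K / M)  # desired long-run sampling frequencies

llr = np.zeros(M)                # cumulative log-likelihood ratios
threshold = 4.0                  # implicitly controls error probabilities
t = 0
while np.any(np.abs(llr) < threshold):   # stop once every source is decided
    t += 1
    # probabilistic sampling rule: draw K distinct sources with
    # probabilities proportional to the target frequencies
    sampled = rng.choice(M, size=K, replace=False,
                         p=target_freq / target_freq.sum())
    for i in sampled:
        x = rng.normal(mu1 if i in anomalous else 0.0)
        llr[i] += mu1 * x - 0.5 * mu1**2  # LLR step for N(mu1,1) vs N(0,1)

print(f"stopped at t={t}; declared anomalous: {np.flatnonzero(llr > 0)}")
```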

Modeling High-Dimensional Dependent Data in the Presence of Many Explanatory Variables and Weak Signals arxiv.org/abs/2412.04736 .ME .ML

This article considers a novel and widely applicable approach to modeling high-dimensional dependent data when a large number of explanatory variables are available and the signal-to-noise ratio is low. We postulate that a $p$-dimensional response series is the sum of a linear regression on many observable explanatory variables and an error term driven by latent common factors plus idiosyncratic noise. The common factors have dynamic dependence, whereas the covariance matrix of the idiosyncratic noise can have diverging eigenvalues to handle the low signal-to-noise ratios commonly encountered in applications. The regression coefficient matrix is estimated by penalized methods when the dimensions involved are high. We apply factor modeling to the regression residuals, employ a high-dimensional white-noise testing procedure to determine the number of common factors, and adopt projected Principal Component Analysis when the signal-to-noise ratio is low. We establish asymptotic properties of the proposed method for both fixed and diverging numbers of regressors, as $p$ and the sample size $T$ approach infinity. Finally, we use simulations and empirical applications to demonstrate the efficacy of the proposed approach in finite samples.
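
The two-stage pipeline can be sketched roughly as follows: penalized (lasso) regression of each response series on the regressors, then a factor analysis of the residuals. The paper determines the number of factors with a high-dimensional white-noise test and uses projected PCA in the low signal-to-noise regime; the eigenvalue-ratio rule below is a simpler stand-in and an assumption for illustration.

```python
# Rough sketch of the two-stage pipeline: penalized regression of each
# response on the explanatory variables, then PCA-based factor analysis
# of the residuals. The eigenvalue-ratio rule stands in for the paper's
# white-noise testing procedure and is an assumption for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
T, p, q, r = 200, 10, 15, 2          # sample size, responses, regressors, factors
X = rng.normal(size=(T, q))
B = np.where(rng.random((q, p)) < 0.2, rng.normal(size=(q, p)), 0.0)
F = rng.normal(size=(T, r))          # latent common factors
L = rng.normal(size=(r, p))          # factor loadings
Y = X @ B + F @ L + rng.normal(scale=2.0, size=(T, p))  # low-SNR noise

# Stage 1: penalized regression, one response series at a time
B_hat = np.column_stack(
    [Lasso(alpha=0.1).fit(X, Y[:, j]).coef_ for j in range(p)]
)
E = Y - X @ B_hat                    # regression residuals

# Stage 2: eigen-analysis of the residual covariance; pick the factor
# number by the largest ratio of consecutive eigenvalues
eigvals, _ = np.linalg.eigh(E.T @ E / T)
eigvals = eigvals[::-1]              # descending order
kmax = p // 2
r_hat = int(np.argmax(eigvals[:kmax] / eigvals[1:kmax + 1])) + 1
print("estimated number of common factors:", r_hat)
```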

Dynamic Prediction of High-density Generalized Functional Data with Fast Generalized Functional Principal Component Analysis arxiv.org/abs/2412.02014 .ME .AP

Dynamic prediction, which typically refers to the prediction of future outcomes from historical records, is often of interest in biomedical research. For datasets with large sample sizes, high measurement density, and complex correlation structures, traditional methods are often infeasible because of the computational burden associated with both data scale and model complexity. Moreover, many models do not directly facilitate out-of-sample predictions for generalized outcomes. To address these issues, we develop a novel approach for dynamic prediction based on a recently developed method for estimating complex patterns of variation in exponential-family data: fast Generalized Functional Principal Components Analysis (fGFPCA). Our method handles large-scale, high-density repeated measures far more efficiently, with an implementation that is feasible even on personal computing resources (e.g., a standard desktop or laptop computer). The proposed method makes highly flexible and accurate predictions of future trajectories for data that exhibit high degrees of nonlinearity, and allows out-of-sample predictions to be obtained without reestimating any parameters. A simulation study is designed and implemented to illustrate the advantages of this method. To demonstrate its practical utility, we also conduct a case study predicting diurnal active/inactive patterns using accelerometry data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014. Both the simulation study and the data application demonstrate the better predictive performance and higher computational efficiency of the proposed method compared to existing methods. The proposed method also produces more personalized predictions that improve as more information becomes available, an essential goal of dynamic prediction that other methods fail to achieve.
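
A Gaussian toy version of the dynamic-prediction step: estimate eigenfunctions from fully observed training curves, score a new partially observed curve against the truncated basis by least squares, and reconstruct the unseen part without refitting anything. fGFPCA itself targets exponential-family outcomes via fast binned estimation, so everything below is a simplified sketch, not the paper's estimator.

```python
# Gaussian toy version of FPCA-based dynamic prediction; fGFPCA handles
# exponential-family outcomes, which this simplified sketch does not.
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 300, 100, 2                       # subjects, grid points, components
t = np.linspace(0, 1, m)
phi = np.stack([np.sqrt(2) * np.sin(np.pi * t),
                np.sqrt(2) * np.sin(2 * np.pi * t)])   # true eigenfunctions
scores = rng.normal(size=(n, k)) * np.array([2.0, 1.0])
Y = scores @ phi + rng.normal(scale=0.3, size=(n, m))

# "Training": estimate the mean and eigenfunctions from complete curves
mu_hat = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - mu_hat, full_matrices=False)
phi_hat = Vt[:k]                            # estimated eigenfunctions

# Dynamic prediction for a new subject observed only on [0, 0.4]:
# score the partial curve against the truncated basis (least squares),
# then reconstruct the full trajectory -- no parameter refitting.
obs = t <= 0.4
xi_new = rng.normal(size=k) * np.array([2.0, 1.0])
y_new = xi_new @ phi + rng.normal(scale=0.3, size=m)
A = phi_hat[:, obs].T                       # truncated basis, observed part
xi_hat = np.linalg.lstsq(A, y_new[obs] - mu_hat[obs], rcond=None)[0]
y_pred = mu_hat + xi_hat @ phi_hat          # predicted full trajectory
print("prediction RMSE on the unseen part:",
      np.sqrt(np.mean((y_pred[~obs] - y_new[~obs])**2)))
```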

Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance arxiv.org/abs/2412.02026 .AP

The widespread adoption of smart meters for monitoring energy consumption has generated vast quantities of high-resolution time series data, which remain underutilised. While clustering has emerged as a fundamental tool for mining smart meter time series (SMTS) data, selecting appropriate clustering methods remains challenging despite numerous comparative studies. These studies often rely on problematic methodologies and consider a limited scope of methods, frequently overlooking compelling methods from the broader time series clustering literature. Consequently, they struggle to provide dependable guidance for practitioners designing their own clustering approaches. This paper presents a comprehensive comparative framework for SMTS clustering methods using expert-informed synthetic datasets that emphasise peak consumption behaviours as fundamental cluster concepts. Using a phased methodology, we first evaluated 31 distance measures and 8 representation methods using leave-one-out classification, then examined the better-suited methods in combination with 11 clustering algorithms. We further assessed the robustness of these combinations to systematic changes in key dataset properties that affect clustering performance on real-world datasets, including cluster balance, noise, and the presence of outliers. Our results revealed that methods accommodating local temporal shifts while maintaining amplitude sensitivity, particularly Dynamic Time Warping and $k$-sliding distance, consistently outperformed traditional approaches. Among other key findings, we identified that these methods, when combined with hierarchical clustering using Ward's linkage, demonstrated consistent robustness across varying dataset characteristics without careful parameter tuning. These and other findings inform actionable recommendations for practitioners.
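
A minimal sketch of the best-performing combination reported (DTW distances fed into hierarchical clustering with Ward's linkage), run on toy morning-peak versus evening-peak load profiles. The unconstrained dynamic-programming DTW and the synthetic profiles below are assumptions for illustration; the paper's exact DTW variant and the $k$-sliding distance are not reproduced here.

```python
# Minimal sketch: pairwise DTW distances + Ward-linkage hierarchical
# clustering on synthetic morning-peak vs evening-peak load profiles.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic dynamic-programming DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy daily load profiles: two clusters defined by peak timing
rng = np.random.default_rng(4)
hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 8) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
series = np.array([(morning if i < 10 else evening)
                   + 0.1 * rng.normal(size=24) for i in range(20)])

# Pairwise DTW distances -> condensed form -> Ward linkage
# (Ward formally assumes Euclidean distances; the paper nevertheless
# reports this combination as consistently robust.)
n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(series[i], series[j])
labels = fcluster(linkage(squareform(D), method="ward"),
                  t=2, criterion="maxclust")
print(labels)
```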
