Instance-Specific Asymmetric Sensitivity in Differential Privacy. (arXiv:2311.14681v1 [cs.DS]) arxiv.org/abs/2311.14681

Business Policy Experiments using Fractional Factorial Designs: Consumer Retention on DoorDash. (arXiv:2311.14698v1 [stat.ME]) arxiv.org/abs/2311.14698

Deep State-Space Model for Predicting Cryptocurrency Price. (arXiv:2311.14731v1 [q-fin.ST]) arxiv.org/abs/2311.14731

Forecasting Cryptocurrency Prices Using Deep Learning: Integrating Financial, Blockchain, and Text Data. (arXiv:2311.14759v1 [q-fin.ST]) arxiv.org/abs/2311.14759

This paper explores the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques in cryptocurrency price forecasting, specifically Bitcoin (BTC) and Ethereum (ETH). Focusing on news and social media data, primarily from Twitter and Reddit, we analyse the influence of public sentiment on cryptocurrency valuations using advanced deep learning NLP methods. Alongside conventional price regression, we treat cryptocurrency price forecasting as a classification problem. This includes both the prediction of price movements (up or down) and the identification of local extrema. We compare the performance of various ML models, both with and without NLP data integration. Our findings reveal that incorporating NLP data significantly enhances the forecasting performance of our models. We discover that pre-trained models, such as Twitter-RoBERTa and BART MNLI, are highly effective in capturing market sentiment, and that fine-tuning Large Language Models (LLMs) also yields substantial forecasting improvements. Notably, the BART MNLI zero-shot classification model shows considerable proficiency in extracting bullish and bearish signals from textual data. All of our models consistently generate profit across different validation scenarios, with no observed decline in profits or reduction in the impact of NLP data over time. The study highlights the potential of text analysis in improving financial forecasts and demonstrates the effectiveness of various NLP techniques in capturing nuanced market sentiment.
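
A minimal sketch of the zero-shot signal extraction described above, using the Hugging Face transformers pipeline with the facebook/bart-large-mnli checkpoint. The label set, example headlines, and the bullish-minus-bearish score are illustrative assumptions, not the paper's exact configuration or aggregation.

```python
# Zero-shot bullish/bearish scoring of crypto-related text with BART MNLI.
# Labels, headlines, and the signal definition are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

headlines = [
    "Bitcoin surges past resistance as institutional inflows accelerate",
    "Regulators signal a crackdown; Ethereum slides on heavy volume",
]
labels = ["bullish", "bearish", "neutral"]

for text in headlines:
    result = classifier(text, candidate_labels=labels)
    scores = dict(zip(result["labels"], result["scores"]))
    # A simple daily sentiment feature could average this signal over posts.
    signal = scores["bullish"] - scores["bearish"]
    print(f"{result['labels'][0]:>8}  signal={signal:+.3f}  {text}")
```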

Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing. (arXiv:2311.14766v1 [cs.LG]) arxiv.org/abs/2311.14766

Deep Latent Force Models: ODE-based Process Convolutions for Bayesian Deep Learning. (arXiv:2311.14828v1 [stat.ML]) arxiv.org/abs/2311.14828

Kriging Methods for Modelling Spatial Basis Risk in Weather Index Insurances: A Technical Note. (arXiv:2311.14844v1 [stat.AP]) arxiv.org/abs/2311.14844

Fast Estimation of the Renshaw-Haberman Model and Its Variants. (arXiv:2311.14846v1 [stat.ME]) arxiv.org/abs/2311.14846

Effective Structural Encodings via Local Curvature Profiles. (arXiv:2311.14864v1 [cs.LG]) arxiv.org/abs/2311.14864

Depth-Based Statistical Inferences in the Spike Train Space. (arXiv:2311.13676v1 [stat.AP]) arxiv.org/abs/2311.13676

Metric-based summary statistics such as the mean and covariance have been introduced in the space of neural spike trains. They can properly describe the template and variability in spike train data, but are often sensitive to outliers and expensive to compute. Recent studies have also examined outlier detection and classification methods on point processes. These tools provide reasonable and efficient results, but their accuracy remains low in certain cases. In this study, we propose to adapt the well-established notion of statistical depth to the spike train space. This framework naturally defines the median of a set of spike trains, which provides a robust description of the 'center' or 'template' of the observations. It also provides a principled method to identify 'outliers' in the data and to classify data from different categories. We systematically compare the median with the state-of-the-art 'mean spike trains' in terms of robustness and efficiency, and compare the performance of our outlier detection and classification tools with previous methods. The results show that the median describes the 'template' better than the mean, and that the proposed outlier detection and classification are more accurate than previous methods. These advantages are illustrated with simulations and real data.
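
The abstract does not spell out the depth construction, so the following is only a generic metric-based sketch: spike trains are compared with a van Rossum-style distance, a train's depth is taken to be inversely related to its average distance to the rest of the sample, the deepest train plays the role of the median, and low-depth trains are flagged as outliers. The kernel width, depth formula, and outlier threshold are all assumptions for illustration.

```python
# Generic metric-depth sketch for spike trains (not the paper's definition):
# exponential filtering, a van Rossum-style L2 distance, depth = 1/(1 + mean
# distance), median = deepest train, outliers = lowest-depth trains.
import numpy as np

def filtered(train, t_grid, tau=0.05):
    """Convolve spike times with a causal exponential kernel on a time grid."""
    diffs = t_grid[None, :] - np.asarray(train, dtype=float)[:, None]
    return np.where(diffs >= 0, np.exp(-diffs / tau), 0.0).sum(axis=0)

def van_rossum(a, b, t_grid, tau=0.05):
    diff = filtered(a, t_grid, tau) - filtered(b, t_grid, tau)
    dt = t_grid[1] - t_grid[0]
    return np.sqrt(np.sum(diff**2) * dt / tau)

rng = np.random.default_rng(0)
t_grid = np.linspace(0.0, 1.0, 500)
trains = [np.sort(rng.uniform(0.0, 1.0, rng.poisson(20))) for _ in range(30)]

n = len(trains)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = van_rossum(trains[i], trains[j], t_grid)

depth = 1.0 / (1.0 + D.mean(axis=1))               # deeper = closer to the bulk
median_idx = int(np.argmax(depth))                 # depth-based "median" train
outliers = np.flatnonzero(depth < np.quantile(depth, 0.1))
print(f"median train: {median_idx}, flagged outliers: {outliers.tolist()}")
```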

Reexamining Statistical Significance and P-Values in Nursing Research: Historical Context and Guidance for Interpretation, Alternatives, and Reporting. (arXiv:2311.13701v1 [stat.ME]) arxiv.org/abs/2311.13701

Nurses should rely on the best available evidence, but many struggle with statistics, which impedes the integration of research into clinical practice. Statistical significance, a key concept in classical statistics, and its primary metric, the p-value, are frequently misused. This topic has been debated in many disciplines but rarely in nursing. The aim is to present key arguments in the debate surrounding the misuse of p-values, discuss their relevance to nursing, and offer recommendations to address them. The literature indicates that the concept of probability in classical statistics is not easily understood, leading to misinterpretations of statistical significance. Much of the critique concerning p-values arises from such misunderstandings and imprecise terminology, and some scholars have therefore argued for abandoning p-values entirely. Instead of discarding p-values, this article provides a comprehensive account of their historical context and the information they convey, clarifying why they are widely used yet often misunderstood. The article also offers recommendations for accurate interpretation of statistical significance by incorporating other key metrics, and recommends pre-registering the analysis plan to mitigate publication bias resulting from p-value misuse. Alternative approaches, particularly Bayes factors, are also explored, as they may resolve several of these issues. P-values serve a purpose in nursing research as an initial safeguard against the influence of randomness, and much of the criticism directed towards them arises from misunderstandings and inaccurate terminology. Several considerations and measures are recommended, some of which go beyond the conventional, to obtain accurate p-values and to better understand statistical significance. Nurse educators and researchers should consider these in their educational and research reporting practices.
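
As a small, self-contained illustration of reporting more than a lone p-value, the sketch below pairs a two-group comparison's p-value with an effect estimate, a 95% confidence interval, and a rough BIC-based Bayes factor approximation. The simulated data and the BIC shortcut are assumptions for illustration; the article's recommendations are broader than this.

```python
# Report a p-value alongside an effect size, a confidence interval, and an
# approximate Bayes factor (BF10 ~ exp((BIC_null - BIC_alt) / 2)).
# Simulated data; the BIC shortcut is one convenient approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(50, 10, 40)   # e.g. outcome scores under usual care
treated = rng.normal(55, 10, 40)   # e.g. outcome scores under an intervention

t, p = stats.ttest_ind(treated, control)
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)

# BIC for "one common mean" vs "two group means", assuming normal errors.
y = np.concatenate([control, treated])
n = len(y)
rss0 = np.sum((y - y.mean()) ** 2)
rss1 = np.sum((control - control.mean()) ** 2) + np.sum((treated - treated.mean()) ** 2)
bic0 = n * np.log(rss0 / n) + 2 * np.log(n)   # one mean + variance
bic1 = n * np.log(rss1 / n) + 3 * np.log(n)   # two means + variance
bf10 = np.exp((bic0 - bic1) / 2)

print(f"p = {p:.4f}, mean difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"approximate BF10 = {bf10:.2f}")
```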

Bayes-xG: Player and Position Correction on Expected Goals (xG) using Bayesian Hierarchical Approach. (arXiv:2311.13707v1 [stat.AP]) arxiv.org/abs/2311.13707

This study employs Bayesian methodologies to explore the influence of player or positional factors in predicting the probability of a shot resulting in a goal, measured by the expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian hierarchical logistic regressions are constructed, analysing approximately 10,000 shots from the English Premier League to ascertain whether positional or player-level effects impact xG. The findings reveal positional effects in a basic model that includes only distance to goal and shot angle as predictors, indicating that strikers and attacking midfielders have a higher likelihood of scoring; these effects diminish when more informative predictors are introduced. Even with additional predictors, however, player-level effects persist, indicating that certain players carry notable positive or negative xG adjustments that influence their likelihood of scoring a given chance. The study extends its analysis to data from Spain's La Liga and Germany's Bundesliga, yielding comparable results. Additionally, the paper assesses the impact of prior distribution choices on outcomes, concluding that the priors employed in the models provide sound results but could be refined to improve sampling efficiency, making it feasible to construct more complex and extensive models.
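
A hierarchical logistic regression of the general shape described above can be written in a few lines of PyMC. This is a minimal sketch rather than the authors' model: the shots.csv file, the column names (goal, distance, angle, position), and the priors are assumptions, and player-level intercepts would be added the same way as the position-level ones.

```python
# Hierarchical Bayesian logistic regression for xG: distance and angle as
# fixed effects, partially pooled position-level intercepts.
import pandas as pd
import pymc as pm

shots = pd.read_csv("shots.csv")                  # hypothetical file: one row per shot
pos_idx, positions = pd.factorize(shots["position"])

with pm.Model() as xg_model:
    beta0 = pm.Normal("beta0", 0.0, 2.0)
    beta_dist = pm.Normal("beta_dist", 0.0, 1.0)
    beta_angle = pm.Normal("beta_angle", 0.0, 1.0)

    sigma_pos = pm.HalfNormal("sigma_pos", 1.0)   # spread of position effects
    a_pos = pm.Normal("a_pos", 0.0, sigma_pos, shape=len(positions))

    logit_p = (beta0
               + beta_dist * shots["distance"].values
               + beta_angle * shots["angle"].values
               + a_pos[pos_idx])
    pm.Bernoulli("goal", logit_p=logit_p, observed=shots["goal"].values)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```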

Hierarchical False Discovery Rate Control for High-dimensional Survival Analysis with Interactions. (arXiv:2311.13767v1 [stat.ME]) arxiv.org/abs/2311.13767

With the development of data collection techniques, analysis with a survival response and high-dimensional covariates has become routine. Here we consider an interaction model, which includes a set of low-dimensional covariates, a set of high-dimensional covariates, and their interactions. This model has been motivated by gene-environment (G-E) interaction analysis, where the E variables have a low dimension, and the G variables have a high dimension. For such a model, there has been extensive research on estimation and variable selection. Comparatively, inference studies with a valid false discovery rate (FDR) control have been very limited. The existing high-dimensional inference tools cannot be directly applied to interaction models, as interactions and main effects are not "equal". In this article, for high-dimensional survival analysis with interactions, we model survival using the Accelerated Failure Time (AFT) model and adopt a "weighted least squares + debiased Lasso" approach for estimation and selection. A hierarchical FDR control approach is developed for inference that respects the "main effects, interactions" hierarchy. The asymptotic distributional properties of the debiased Lasso estimators are rigorously established. Simulations demonstrate the satisfactory performance of the proposed approach, and the analysis of a breast cancer dataset further establishes its practical utility.
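
The abstract does not detail the hierarchical FDR procedure itself, so the following is only a generic two-stage sketch of hierarchy-respecting testing: main-effect p-values are screened with Benjamini-Hochberg, and interaction p-values are tested only within surviving main effects, at a level scaled by the selection fraction. The simulated p-values and the specific level adjustment are assumptions, not the paper's method.

```python
# Two-stage, hierarchy-respecting FDR sketch (illustrative only): screen
# main effects with Benjamini-Hochberg, then test interactions only within
# selected main effects at a level scaled by the selection fraction.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
p_main = rng.uniform(size=200)                     # placeholder G main-effect p-values
p_int = rng.uniform(size=(200, 5))                 # placeholder G x E interaction p-values
p_main[:10] = rng.uniform(0.0, 1e-4, 10)           # a few strong main-effect signals
p_int[:10, :2] = rng.uniform(0.0, 1e-3, (10, 2))   # with a couple of true interactions

alpha = 0.10
keep_main = multipletests(p_main, alpha=alpha, method="fdr_bh")[0]

selected = np.flatnonzero(keep_main)
alpha_child = alpha * len(selected) / len(p_main)  # scale child level by selection fraction
keep_int = np.zeros_like(p_int, dtype=bool)
for g in selected:
    keep_int[g] = multipletests(p_int[g], alpha=alpha_child, method="fdr_bh")[0]

print(f"selected main effects: {selected.size}, selected interactions: {int(keep_int.sum())}")
```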

Valid confidence intervals for regression with best subset selection. (arXiv:2311.13768v1 [stat.ME]) arxiv.org/abs/2311.13768

Classical confidence intervals after best subset selection are widely implemented in statistical software and are routinely used to guide practitioners in scientific fields to conclude significance. However, there are increasing concerns in the recent literature about the validity of these confidence intervals in that the intended frequentist coverage is not attained. In the context of the Akaike information criterion (AIC), recent studies observe an under-coverage phenomenon in terms of overfitting, where the estimate of error variance under the selected submodel is smaller than that for the true model. Under-coverage is particularly troubling in selective inference as it points to inflated Type I errors that would invalidate significant findings. In this article, we delineate a complementary, yet provably more deciding factor behind the incorrect coverage of classical confidence intervals under AIC, in terms of altered conditional sampling distributions of pivotal quantities. Resting on selective techniques developed in other settings, our finite-sample characterization of the selection event under AIC uncovers its geometry as a union of finitely many intervals on the real line, based on which we derive new confidence intervals with guaranteed coverage for any sample size. This geometry derived for AIC selection enables exact (and typically less than exact) conditioning, circumventing the need for the excessive conditioning common in other post-selection methods. The proposed methods are easy to implement and can be broadly applied to other commonly used best subset selection criteria. In an application to a classical US consumption dataset, the proposed confidence intervals arrive at different conclusions compared to the conventional ones, even when the selected model is the full model, leading to interpretable findings that better align with empirical observations.
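
The under-coverage concern can be reproduced in a few lines of simulation. This only illustrates the problem; it is not the paper's proposed interval, and the design, sample size, and number of candidate predictors are arbitrary assumptions.

```python
# Illustrative simulation of under-coverage after AIC-based selection:
# ten candidate predictors with no true effect, AIC picks the best
# single-predictor model, and we check how often the classical 95% CI
# for the selected slope covers its true value (zero).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, p, reps = 50, 10, 2000
beta_true = 0.0                                    # every candidate effect is truly zero

covered = 0
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    fits = [sm.OLS(y, sm.add_constant(X[:, [j]])).fit() for j in range(p)]
    j = int(np.argmin([f.aic for f in fits]))      # best size-one subset by AIC
    lo, hi = fits[j].conf_int(alpha=0.05)[1]       # classical CI for the kept slope
    covered += (lo <= beta_true <= hi)

print(f"empirical coverage of the nominal 95% interval: {covered / reps:.3f}")
```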

Learning Hierarchical Polynomials with Three-Layer Neural Networks. (arXiv:2311.13774v1 [cs.LG]) arxiv.org/abs/2311.13774

We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde{\Theta}(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior work~\citep{nichani2023provable} requiring a sample size of $\widetilde{\Theta}(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and, as a result, learn a broad class of hierarchical functions.
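
A toy version of the setup above can be written in PyTorch: the target is h(x) = g(p(x)) with p a quadratic and g a cubic, and a three-layer network is trained layerwise, first the bottom layer and then the remaining layers. The architecture, polynomials, optimizer, and two-stage schedule are illustrative assumptions and do not reproduce the paper's algorithm or its guarantees.

```python
# Layerwise training of a three-layer network on a hierarchical target
# h(x) = g(p(x)) with p quadratic and g cubic (illustrative sketch only).
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 20, 5000
X = torch.randn(n, d)
A = torch.randn(d, d) / d
p = (X @ A * X).sum(dim=1)           # quadratic feature p(x) = x^T A x
y = p**3 - 3 * p                     # g(z) = z^3 - 3z, so the target is h = g o p

net = nn.Sequential(nn.Linear(d, 128), nn.ReLU(),
                    nn.Linear(128, 64), nn.ReLU(),
                    nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(params, epochs):
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return loss.item()

# Stage 1: update only the first layer (feature learning for p).
stage1 = train(net[0].parameters(), epochs=200)
# Stage 2: update the remaining layers on top of the learned features.
stage2 = train(list(net[2].parameters()) + list(net[4].parameters()), epochs=200)
print(f"loss after stage 1: {stage1:.3f}, after stage 2: {stage2:.3f}")
```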
