
Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and Stable Online Fine-Tuning. (arXiv:2211.11802v1 [cs.LG]) arxiv.org/abs/2211.11802

The ability to discover optimal behaviour from fixed data sets has the potential to transfer the successes of reinforcement learning (RL) to domains where data collection is acutely problematic. In this offline setting, a key challenge is overcoming overestimation bias for actions not present in the data, which, without the ability to correct through interaction with the environment, can propagate and compound during training, leading to highly sub-optimal policies. One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC), which encourages agents to pick actions closer to the source data. By finding the right balance between RL and BC, such approaches have been shown to be surprisingly effective while requiring minimal changes to the underlying algorithms they are based on. To date this balance has been held constant, but in this work we explore the idea of tipping it towards RL following initial training. Using TD3-BC, we demonstrate that by continuing to train a policy offline while reducing the influence of the BC component, we can produce refined policies that outperform the original baseline and match or exceed the performance of more complex alternatives. Furthermore, we demonstrate that such an approach can be used for stable online fine-tuning, allowing policies to be safely improved during deployment.
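To make the RL/BC balance concrete, here is a minimal PyTorch-style sketch of a TD3-BC actor loss with an annealable behavioural-cloning weight. The function names and the schedule parameters (`warmup`, `decay`, `start`, `end`) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def td3bc_actor_loss(actor, critic, state, action, bc_weight):
    """TD3-BC-style actor loss: a Q-maximizing RL term balanced
    against a behavioural-cloning (BC) regression term."""
    pi = actor(state)
    q = critic(state, pi)
    lam = 1.0 / q.abs().mean().detach()   # Q-scale normalization, as in TD3-BC
    rl_term = -lam * q.mean()
    bc_term = F.mse_loss(pi, action)
    return rl_term + bc_weight * bc_term

def bc_schedule(step, warmup=500_000, decay=500_000, start=1.0, end=0.1):
    """Illustrative anneal: hold the original balance during initial
    offline training, then linearly reduce the BC influence."""
    if step < warmup:
        return start
    frac = min(1.0, (step - warmup) / decay)
    return start + frac * (end - start)
```

Keeping `bc_weight` constant recovers standard TD3-BC; decaying it after initial offline training is the relaxation the abstract describes.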

A Bi-level Nonlinear Eigenvector Algorithm for Wasserstein Discriminant Analysis. (arXiv:2211.11891v1 [stat.ML]) arxiv.org/abs/2211.11891

Much like classical Fisher linear discriminant analysis, Wasserstein discriminant analysis (WDA) is a supervised linear dimensionality reduction method that seeks a projection matrix to maximize the dispersion between different data classes and minimize the dispersion within the same class. In contrast, however, WDA can account for both global and local inter-connections between data classes by using a regularized Wasserstein distance. WDA is formulated as a bi-level nonlinear trace ratio optimization. In this paper, we present a bi-level nonlinear eigenvector (NEPv) algorithm, called WDA-nepv. The inner kernel of WDA-nepv, which computes the optimal transport matrix of the regularized Wasserstein distance, is formulated as an NEPv, while the outer kernel for the trace ratio optimization is formulated as another NEPv. Consequently, both kernels can be computed efficiently via self-consistent-field iterations and modern solvers for linear eigenvalue problems. Compared with existing algorithms for WDA, WDA-nepv is derivative-free and surrogate-model-free. The computational efficiency and classification accuracy of WDA-nepv are demonstrated on synthetic and real-life datasets.
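As a rough illustration of the outer kernel, the following NumPy/SciPy sketch runs a self-consistent-field (SCF) iteration for a generic trace-ratio NEPv. The maps `A_of` and `B_of` are placeholders for the between-/within-class dispersion matrices WDA builds from the regularized Wasserstein transport plans (themselves produced by the inner NEPv, omitted here); this is a sketch of the technique, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def scf_trace_ratio(A_of, B_of, p, d, n_iter=50, tol=1e-8, seed=0):
    """SCF iteration for a trace-ratio NEPv:
    maximize trace(V^T A(V) V) / trace(V^T B(V) V) over V^T V = I,
    where A_of, B_of map a p-by-d orthonormal V to symmetric p-by-p matrices."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((p, d)))
    rho = 0.0
    for _ in range(n_iter):
        A, B = A_of(V), B_of(V)
        rho_new = np.trace(V.T @ A @ V) / np.trace(V.T @ B @ V)
        # Each step is a *linear* eigenvalue subproblem:
        # take the top-d eigenvectors of A - rho * B.
        _, V = eigh(A - rho_new * B, subset_by_index=[p - d, p - 1])
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return V, rho
```

Reducing each step to a linear eigenvalue problem is what lets such a scheme stay derivative-free and surrogate-model-free.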

Curiosity in hindsight. (arXiv:2211.10515v1 [stat.ML]) arxiv.org/abs/2211.10515

Consider exploration in sparse-reward or reward-free environments, such as Montezuma's Revenge. The curiosity-driven paradigm dictates an intuitive technique: at each step, the agent is rewarded for how much the realized outcome differs from its predicted outcome. However, using predictive error as intrinsic motivation is prone to failure in stochastic environments, as the agent may become hopelessly drawn to high-entropy areas of the state-action space, such as a noisy TV. It is therefore important to distinguish between aspects of world dynamics that are inherently predictable and aspects that are inherently unpredictable: the former should constitute a source of intrinsic reward, whereas the latter should not. In this work, we study a natural solution derived from structural causal models of the world: our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome -- no more, no less -- which we use as additional input for predictions, such that intrinsic rewards vanish in the limit. First, we propose incorporating such hindsight representations into the agent's model to disentangle "noise" from "novelty", yielding Curiosity in Hindsight: a simple and scalable generalization of curiosity that is robust to all types of stochasticity. Second, we implement this framework as a drop-in modification of any prediction-based exploration bonus, and instantiate it for the recently introduced BYOL-Explore algorithm as a prime example, resulting in the noise-robust BYOL-Hindsight. Third, we illustrate its behavior under various stochasticities in a grid world, and find improvements over BYOL-Explore in hard-exploration Atari games with sticky actions. Notably, we show state-of-the-art results in exploring Montezuma's Revenge with sticky actions, while preserving performance in the non-sticky setting.
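The following PyTorch-style sketch shows only the *shape* of a hindsight-conditioned prediction bonus. All module names (`encoder`, `hindsight_enc`, `predictor`) are assumptions, and the essential training objective that stops the hindsight latent from leaking predictable information (as in BYOL-Hindsight) is omitted.

```python
import torch

def hindsight_curiosity_bonus(encoder, hindsight_enc, predictor, s, a, s_next):
    """Hindsight-conditioned prediction bonus: the predictor receives a
    latent computed from the realized outcome, so error on inherently
    unpredictable dynamics ('noise') can be explained away, while error
    on learnable dynamics ('novelty') still yields intrinsic reward."""
    with torch.no_grad():
        target = encoder(s_next)      # representation of the realized outcome
    z = hindsight_enc(s_next)         # hindsight latent of the outcome
    pred = predictor(s, a, z)         # prediction conditioned on hindsight
    return ((pred - target) ** 2).mean(dim=-1)   # per-sample intrinsic reward
```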

Phase transition and higher order analysis of $L_q$ regularization under dependence. (arXiv:2211.10541v1 [math.ST]) arxiv.org/abs/2211.10541

We study the problem of estimating a $k$-sparse signal $\beta_0 \in \mathbb{R}^p$ from a set of noisy observations $y \in \mathbb{R}^n$ under the model $y = X\beta + w$, where $X \in \mathbb{R}^{n\times p}$ is the measurement matrix whose rows are drawn from the distribution $N(0, \Sigma)$. We consider the class of $L_q$-regularized least squares (LQLS) estimators given by the formulation $\hat{\beta}(\lambda, q) = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_q^q$, where $\|\cdot\|_q$ $(0 \le q \le 2)$ denotes the $L_q$-norm. In the setting $p, n, k \rightarrow \infty$ with fixed $k/p = \epsilon$ and $n/p = \delta$, we derive the asymptotic risk of $\hat{\beta}(\lambda, q)$ for an arbitrary covariance matrix $\Sigma$, which generalizes the existing results for the standard Gaussian design, i.e. $X_{ij} \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$. We perform a higher-order analysis of LQLS in the small-error regime, in which the first dominant term can be used to determine its phase transition behavior. Our results show that the first dominant term does not depend on the covariance structure of $\Sigma$ for the cases $0 \le q < 1$ and $1 < q \le 2$, which indicates that correlations among predictors only affect the phase transition curve in the case $q = 1$, also known as the LASSO. To study the influence of the covariance structure of $\Sigma$ on the performance of LQLS in the cases $0 \le q < 1$ and $1 < q \le 2$, we derive explicit formulas for the second dominant term in the expansion of the asymptotic risk in terms of the small error. Extensive computational experiments confirm that our analytical predictions are consistent with the numerical results.
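For the $q = 1$ (LASSO) member of the LQLS family, the estimator can be computed by proximal gradient descent; a minimal NumPy sketch follows. This is standard ISTA, included only to make the objective concrete, not the paper's analysis machinery.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for the q = 1 case of LQLS:
    argmin_beta 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        # Soft-thresholding: the proximal operator of the L1 norm.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta
```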
