Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees

Inverse reinforcement learning (IRL) aims to recover the reward function and
the associated optimal policy that best fits observed sequences of states and
actions implemented by an expert. Many algorithms for IRL have an inherently
nested structure: the inner loop finds the optimal policy given parametrized
rewards while the outer loop updates the estimates towards optimizing a measure
of fit. For high-dimensional environments, such a nested-loop structure entails a
significant computational burden. To reduce this burden, novel methods such as
SQIL [1] and IQ-Learn [2] emphasize policy
estimation at the expense of reward estimation accuracy. However, without
accurately estimated rewards, it is not possible to perform counterfactual analysis
such as predicting the optimal policy under different environment dynamics
and/or learning new tasks. In this paper we develop a novel single-loop
algorithm for IRL that does not compromise reward estimation accuracy. In the
proposed algorithm, each policy improvement step is followed by a stochastic
gradient step for likelihood maximization. We show that the proposed algorithm
provably converges to a stationary solution with a finite-time guarantee. If
the reward is parameterized linearly, we show that the identified solution
corresponds to the solution of the maximum entropy IRL problem. Finally, on
robotics control problems in MuJoCo and their transfer settings, we show
that the proposed algorithm achieves superior performance compared with other
IRL and imitation learning benchmarks.
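As a rough illustration of the single-loop structure described in the abstract, the sketch below alternates one soft policy-improvement step with one gradient step on a linearly parameterized reward in a small synthetic tabular MDP. This is not the authors' implementation or their MuJoCo setup: the toy transition kernel P, features phi, temperature alpha, step size lr, and the synthetic expert feature expectation are all assumptions made for illustration, and the likelihood gradient uses the standard maximum-entropy identity (expert minus learner feature expectations) that holds for linear rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP, purely illustrative.
n_states, n_actions, n_features, gamma = 5, 3, 4, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
phi = rng.normal(size=(n_states, n_features))                     # state features
alpha, lr = 1.0, 0.05                                             # entropy temperature, reward step size

def soft_policy(Q):
    """Boltzmann (soft-greedy) policy induced by the current soft Q-values."""
    z = (Q - Q.max(axis=1, keepdims=True)) / alpha
    pi = np.exp(z)
    return pi / pi.sum(axis=1, keepdims=True)

def feature_expectation(pi, start=0, horizon=200):
    """Approximate discounted state-feature expectation under policy pi."""
    d = np.zeros(n_states)
    d[start] = 1.0
    feat, disc = np.zeros(n_features), 1.0
    for _ in range(horizon):
        feat += disc * (d @ phi)
        d = np.einsum("s,sa,sat->t", d, pi, P)  # propagate the state distribution one step
        disc *= gamma
    return (1.0 - gamma) * feat

# Stand-in for expert demonstrations: feature expectation of an arbitrary "expert" policy.
expert_feat = feature_expectation(soft_policy(rng.normal(size=(n_states, n_actions))))

theta = np.zeros(n_features)         # linear reward parameters, r(s) = phi(s) @ theta
Q = np.zeros((n_states, n_actions))  # soft Q estimate

for k in range(300):
    r = phi @ theta
    # (1) One soft policy-improvement step (a single soft Bellman backup).
    m = Q.max(axis=1)
    V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))  # soft value
    Q = r[:, None] + gamma * (P @ V)
    pi = soft_policy(Q)
    # (2) One gradient step on the max-ent log-likelihood of the expert data;
    # for linear rewards the gradient is expert minus learner feature expectations.
    grad = expert_feat - feature_expectation(pi)
    theta += lr * grad
```

In contrast to a nested-loop method, the reward parameters here are updated after every single policy-improvement step, rather than only after the inner policy-optimization problem has been solved to convergence for the current reward estimate.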