Hidden Markov Pólya trees for high-dimensional distributionsThe Pólya tree (PT) process is a general-purpose Bayesian nonparametric
model that has found wide application in a range of inference problems. The PT
has a simple analytic form and the resulting posterior computation boils down
to straight-forward beta-binomial conjugate updates along a partition tree over
the sample space. Recent development in PT models shows that performance of
these models can be substantially improved by (i) incorporating latent state
variables that characterize local features of the underlying distributions and
(ii) allowing the partition tree to adapt to the structure of the underlying
distribution. Despite these advances, however, some important limitations of
the PT that remain include---(i) the sensitivity in the posterior inference
with respect to the choice of the partition points, and (ii) the lack of
computational scalability to multivariate problems beyond a small number
($<10$) of dimensions. We consider a modeling strategy for PT models that
incorporates a very flexible prior on the partition tree along with latent
states that can be first-order dependent (i.e., following a Markov process),
and introduce a hybrid algorithm that combines sequential Monte Carlo (SMC) and
recursive message passing for posterior inference that can readily accommodate
PT models with or without latent states as well as flexible partition points in
problems up to 100 dimensions. Moreover, we investigate the large sample
properties of the tree structures and latent states under the posterior model.
We carry out extensive numerical experiments in the context of density
estimation and two-sample testing, which show that flexible partitioning can
substantially improve the performance of PT models in both inference tasks. We
demonstrate an application to a flow cytometry data set with 19 dimensions and
over 200,000 observations.
arxiv.org