https://github.com/dinhanhx/VisualRoBERTa/
@dinhanhx i wonder how varied your data source is
@xarvos it's the COCO dataset translated to Vietnamese. COCO mainly has Flickr images from before 2017. So yeah, not varied enough to perform at human level.