One of the main challenges in learning fine-grained visual categories is gathering training images. Recent work in Zero-Shot Learning (ZSL) circumvents this challenge by describing categories via attributes or text. However, not all visual concepts, e.g., two people dancing, are easily amenable to such descriptions. In this paper, we propose a new modality for ZSL using visual abstraction to learn difficult-to-describe concepts. Specifically, we explore concepts related to people and their interactions with others. Our proposed modality allows one to provide training data by manipulating abstract visualizations, e.g., one can illustrate interactions between two clipart people by manipulating each person's pose, expression, gaze, and gender. The feasibility of our approach is shown on a human pose dataset and a new dataset containing complex interactions between two people, where we outperform several baselines. To better match across the two domains, we learn an explicit mapping between the abstract and real worlds.
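The explicit cross-domain mapping mentioned above could, for instance, be realized with a simple multi-output ridge regression from abstract (clipart) features to real-image features, with the zero-shot classifier then trained purely on mapped abstract data. The sketch below is a minimal illustration of that idea under assumed feature dimensions and variable names; it is not the exact formulation used in the paper.

```python
# Minimal sketch (hypothetical): learn a linear mapping from abstract (clipart)
# features to real-image features, then train the zero-shot classifier on
# mapped abstract data only. Dimensions and names are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Paired examples of the same concepts rendered in both domains
# (used only to learn the domain mapping, not the novel target categories).
X_abstract_paired = rng.normal(size=(200, 64))   # clipart pose/expression/gaze features
X_real_paired = rng.normal(size=(200, 128))      # features extracted from real images

# 1) Learn an explicit mapping from the abstract world to the real world.
mapping = Ridge(alpha=1.0).fit(X_abstract_paired, X_real_paired)

# 2) "Illustrate" a novel category by manipulating clipart, extract abstract
#    features, and map them into the real-image feature space.
X_abstract_novel = rng.normal(size=(50, 64))     # illustrations of the novel concept
y_novel = rng.integers(0, 2, size=50)            # e.g., concept vs. not-concept
X_mapped = mapping.predict(X_abstract_novel)

# 3) Train a classifier on the mapped abstract data...
clf = LinearSVC().fit(X_mapped, y_novel)

# 4) ...and apply it directly to real test images (no real training images used).
X_real_test = rng.normal(size=(10, 128))
print(clf.predict(X_real_test))
```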
Anyone who has tried to collect an image dataset from the web knows that it can be very difficult and time-consuming. Categories on the web follow a heavy-tailed distribution, meaning that the majority of categories have very few images available. The more specific or fine-grained the concepts of interest are, the harder collection becomes.
Problem 1: Requiring difficult-to-come-by datasets to train visual models.
To get around Problem 1, researchers have been investigating Zero-Shot Learning (ZSL), i.e., training models without any training images of the categories of interest. Current approaches to ZSL for visual concepts rely on attribute- or text-based descriptions of the categories to train visual models.
Problem 2: Attribute- or text-based descriptions are not intuitive for all visual concepts.
Our insight: visual concepts that are difficult to describe in words are often easy to illustrate.
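As a hypothetical illustration of how "illustrating" a concept can yield a training example, a manipulated two-person clipart scene might be serialized into a flat feature vector. The parameterization below (joint positions, expression, gaze, gender per person) is an assumption made for this sketch, not the paper's exact feature set.

```python
# Hypothetical sketch: turning one manipulated two-person clipart scene into a
# flat feature vector. The specific fields and their encoding are assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClipartPerson:
    joints: List[Tuple[float, float]]  # 2D positions of the body joints
    expression: int                    # index into a discrete set of expressions
    gaze: float                        # gaze direction in radians
    gender: int                        # 0 or 1

def featurize(person_a: ClipartPerson, person_b: ClipartPerson) -> List[float]:
    """Concatenate both people's parameters into one training feature vector."""
    feats: List[float] = []
    for p in (person_a, person_b):
        for (x, y) in p.joints:
            feats.extend([x, y])
        feats.extend([float(p.expression), p.gaze, float(p.gender)])
    return feats
```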
We evaluate our approach on two image datasets: PARSE, a standard human pose dataset, and INTERACT, our new dataset of interactions between two people.
Qualitative results: INTERACT and PARSE.
See the paper for more results and details.
@inproceedings{Antol2014,
  title     = {{Zero-Shot Learning via Visual Abstraction}},
  author    = {Antol, Stanislaw and Zitnick, C. Lawrence and Parikh, Devi},
  booktitle = {ECCV},
  year      = {2014}
}