One of the main challenges in learning fine-grained visual categories is gathering training images. Recent work in Zero-Shot Learning (ZSL) circumvents this challenge by describing categories via attributes or text. However, not all visual concepts, e.g., two people dancing, are easily amenable to such descriptions. In this paper, we propose a new modality for ZSL using visual abstraction to learn difficult-to-describe concepts. Specifically, we explore concepts related to people and their interactions with others. Our proposed modality allows one to provide training data by manipulating abstract visualizations, e.g., one can illustrate interactions between two clipart people by manipulating each person's pose, expression, gaze, and gender. The feasibility of our approach is shown on a human pose dataset and a new dataset containing complex interactions between two people, where we outperform several baselines. To better match across the two domains, we learn an explicit mapping between the abstract and real worlds.

Motivation: Dataset collection is difficult

Anyone who has tried to collect an image dataset online knows that it can be very difficult and time-consuming. This is because categories on the web follow a heavy-tailed distribution: the majority of categories have very few images available. The more specific or fine-grained the concepts of interest are, the more difficult collection becomes.

Image from the Caltech-UCSD Birds dataset page.
Figure: A sampling of the popular fine-grained image dataset CUB-200. It consists of 11,788 images, each showing one of 200 different bird species.

Problem 1: Training visual models requires datasets that are difficult to come by.

To get around Problem 1, researchers have been investigating Zero-Shot Learning (ZSL), i.e., training a model for a concept without any training images of that concept. Current ZSL approaches for visual concepts rely on attribute- or text-based descriptions to train visual models.

Figure: Some concepts are easy to describe (left), but other concepts can be difficult to describe (right).

Problem 2: Attribute- or text-based descriptions are not intuitive for all visual concepts.

Solution idea: Visual abstraction as ZSL modality

Visual concepts that are difficult to describe are probably easy to illustrate.

Figure: A concept that is difficult to describe (left) is easy to illustrate (right).



Main result: Classification of interactions/poses in images with no training images

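We evaluate our approach on two image datasets: INTERACT, our new dataset of complex interactions between two people, and PARSE, a human pose dataset.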

Figure: Even with 1 training illustration per category, we perform several times better than random chance when using Perfect Pose (PP) detection. Performance improves significantly as we train on more illustrations per category, although it begins to saturate. We also beat an attribute baseline (Attributes w/ PP), showing the advantage of this approach for certain visual concepts. We further evaluate our method with the Yang and Ramanan pose detector (YR), showing that it still performs reasonably in a more automatic setting. For INTERACT, we assisted the YR detector by providing bounding boxes around the people, which improves performance.
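To make the "times better than random" comparison concrete, here is a minimal sketch of the evaluation setup: fit a classifier on features from illustrations only, test it on features from real images, and report accuracy relative to chance (1 divided by the number of categories). Everything in it is an assumption for illustration. The nearest-centroid classifier is a simple stand-in rather than the classifier used in the paper, the features are synthetic random vectors rather than pose features, and all function names are hypothetical.

```python
import random

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train_nearest_centroid(features, labels):
    """Return {label: centroid}, computed from illustration features only."""
    by_label = {}
    for vec, lab in zip(features, labels):
        by_label.setdefault(lab, []).append(vec)
    return {lab: centroid(vecs) for lab, vecs in by_label.items()}

def predict(model, vec):
    """Pick the label whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(model, key=lambda lab: sq_dist(model[lab]))

def accuracy_vs_chance(model, features, labels):
    """Return (accuracy, accuracy / chance), where chance = 1 / #categories."""
    correct = sum(predict(model, v) == y for v, y in zip(features, labels))
    acc = correct / len(labels)
    chance = 1.0 / len(model)
    return acc, acc / chance

def make_synthetic(rng, prototypes, per_class, noise):
    """Hypothetical data: noisy copies of one prototype vector per category."""
    feats, labs = [], []
    for lab, proto in enumerate(prototypes):
        for _ in range(per_class):
            feats.append([p + rng.gauss(0.0, noise) for p in proto])
            labs.append(lab)
    return feats, labs

if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed setup: 5 categories, 8-dimensional features.
    prototypes = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
    # "Illustrations": few, clean samples. "Real images": more, noisier samples.
    illus_x, illus_y = make_synthetic(rng, prototypes, per_class=3, noise=0.2)
    real_x, real_y = make_synthetic(rng, prototypes, per_class=20, noise=0.4)
    model = train_nearest_centroid(illus_x, illus_y)
    acc, ratio = accuracy_vs_chance(model, real_x, real_y)
    print(f"accuracy={acc:.2f}, {ratio:.1f}x chance")
```

Varying `per_class` for the illustration set in this toy setup mirrors the experiment of training on more illustrations per category, though the synthetic numbers say nothing about the real datasets.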

Qualitative results: INTERACT and PARSE.

See the paper for more results and details.


@inproceedings{Antol2014,
  title = {{Zero-Shot Learning via Visual Abstraction}},
  author = {Antol, Stanislaw and Zitnick, C. Lawrence and Parikh, Devi},
  booktitle = {ECCV},
  year = {2014}
}



Supplementary material

ECCV 2014 poster


Features as .mat files

INTERACT Dataset webpage

PARSE Dataset webpage