To fully solve artificial intelligence, multimodal understanding of vision and natural language is paramount.
The task of visual question answering (VQA) was recently introduced to evaluate a system's fine-grained understanding of both images and language.
Since humans are capable of attending to specific regions of a scene based on their needs, we believe attention will be useful for visual question answering.
This form of attention differs from earlier work, which focused on finding salient regions in an image:
attention for VQA is task-specific, since the region to attend to depends on the question and varies from image to image.
We collect human annotation data by showing subjects image-question pairs and asking them to mark the regions necessary to answer
the question. We incorporate this attention into the VQA pipeline in various ways and study the resulting performance.
Multimodal understanding of vision and natural language is one of the big challenges of artificial intelligence.
The aim is to build models powerful enough to recognize objects in a scene, understand and reason about their
relationships and express these relationships in natural language.
To this end, there have been several recent efforts on tasks such as visual question answering [2, 6, 19] and image caption generation
[5, 8, 9, 14, 15, 17, 20].
A visual question answering (VQA) system takes as input an image and a question about that image and produces
an answer as output as shown in the figure below. This goal-driven task is applicable to scenarios where
visually-impaired humans or intelligence analysts obtain visual information from an intelligent system.
Unlike image captioning, where a coarse understanding is often sufficient to describe an image, a VQA model needs
to pay attention to fine-grained details to answer a question correctly.
Figure: example image-question pairs, e.g. "What color are her eyes?", "What is the mustache made of?", "How many slices of pizza are there?", "Is this a vegetarian pizza?", "Does it appear to be rainy?", "Does this person have 20/20 vision?"
Prior work has shown that humans can quickly perceive a scene,
shifting focus from one part of the scene to another multiple times per second.
Visual questions selectively target different areas of an image, including background details
and underlying context. This suggests that attention mechanisms should play a crucial role in any
successful VQA system. For instance, for the first image in the above figure, if the question was
"What color are her eyes?", a VQA system that doesn't attend to the region
in the image corresponding to the eyes is unlikely to answer the question
Figure: more visual questions that require attending to different image regions: "What is on the front of the bus?", "Which company is on the left of the player?", "What are the people in the background doing?"
We believe attention is critical to VQA, and in this project we study the role it can play.
We design and conduct human studies to collect "human attention maps", i.e., where
people look when asked to answer a question about an image. Concretely, our contributions are twofold:
We experiment with multiple variants of our VQA attention-annotation interface that typically requires
the subject to sharpen regions of a blurred image to answer visual questions accurately.
Using the interface that captures attention best, we collect human annotations for the VQA dataset
which will be made publicly available.
We study two different ways of incorporating attention in the original VQA pipeline.
We use attention maps to dampen
intermediate-layer activations of the VGG-16 feature extraction pipeline and finetune the VQA model introduced
in [2]. We also finetune the VQA model when, instead of the original image, we pass the blurred image as input.
In the blurred image, only those regions that are important for answering the question are sharp, while the rest of the image
remains blurred. We also calculate accuracies
per question type, such as
"What color ...", "How many ...", "Is there ...", etc.
Additionally, we train a variant of the original VQA model in which, instead of using point-wise multiplication
to fuse the image and question features, we use simple concatenation.
Automatically generating question-specific attention maps for images is a fruitful direction of future research,
and our dataset and initial approach can help gain insights into how to incorporate attention for the task of
visual question answering.
Dataset: Collection and Analysis
In order to accurately capture attention regions that help in answering visual questions, we experimented with
three variants of our attention-annotation interface.
In all of these, we present a blurred image and a question about the image, and ask subjects to sharpen regions
of the image that are relevant to answering the question correctly, in a smooth, click-and-drag, 'coloring'
motion with the mouse.
Starting with only a blurred image and question (in our first interface) has two consequences.
Sometimes we are able to capture exploratory attention, where the subject lightly sharpens large regions of an
image to find salient regions that eventually lead them to the answer.
However, for certain question types, such as counting ("How many ...") and binary ("Is there ..."), the captured
attention maps are often incomplete, and hence inaccurate.
So in addition to the question and blurred image, in our second interface, we show the correct answer and
ask subjects to sharpen as few regions as possible such that someone can answer the question just by looking at
the blurred image with sharpened regions.
To encourage exploitation instead of exploration, in our third interface we show subjects the question-answer
pair along with the full-resolution image.
We expected that presenting the full-resolution image and answer would enable subjects to sharpen the regions most
relevant to answering the question correctly, but this task turns out to be counter-intuitive:
subjects see the full-resolution image, yet have to imagine a scenario where someone else must answer the question
without seeing it.
We ensure that we don't present the same image to the same subject twice.
Since each image has three associated questions, it was important that the subjects didn't become familiar
with the image as that would bias the attention annotations.
We also performed a human study to compare the quality of annotations collected by the different interfaces.
Table 1 shows that Interface 2 (blurred image with answer) yields annotations that let subjects answer almost
as accurately as when looking at the entire image and the question. We therefore used Interface 2
to collect the rest of the annotations. For this project, we collected
annotations for 60,000 image-question pairs in the training set and 5,000 image-question pairs
in the validation set. The figure below shows all three interfaces.
(a) Interface 1: Blurred image and question without answer.
(b) Interface 2: Blurred image and question with answer.
(c) Interface 3: Blurred image and original image with question and answer.
(d) Human study (Table 1) comparing the quality of annotations collected by the different interfaces: blurred image without answer, blurred image with answer, and blurred & original image with answer.
Note: you can click an interface screenshot to visit the corresponding interface page and test it, to get
a sense of the different ways in which we tried collecting annotations.
Attention Map Examples:
Here are some examples of the attention maps we collected.
Q: What is in the vase?
Q: What is the man using to take a picture?
Q: What is floating in the sky?
The model used for this project is based on the model introduced in the VQA paper [2].
It uses an LSTM to encode the question into a 1024-dimensional vector,
taking one-hot encodings of the question words as input. On the image side, fc7 features
from VGG-16 are passed through a linear transformation to 1024 dimensions to match
the LSTM encoding of the question. The question and image encodings are fused via element-wise multiplication or simple concatenation.
Models using element-wise multiplication fusion carry the prefix VQA-qxi, while concatenation models carry the prefix VQA-q+i.
The fused features are then passed through a multi-layer perceptron (MLP) classifier with 2 hidden layers of
1000 hidden units each. Each fully connected layer is followed by a dropout layer with a dropout ratio of 0.5 and a
tanh non-linearity. The output is a 1000-way softmax that predicts one of the top 1000 answers in the training dataset.
It was observed in [2] that classifying into the 1000 most frequent
answers covers 82.67% of the train + val answers.
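For concreteness, here is a minimal sketch of this architecture in Keras. The vocabulary size, question length, and embedding width are hypothetical placeholders, and a dense embedding layer stands in for the one-hot word inputs:

```python
from tensorflow.keras import Model, layers

VOCAB_SIZE = 10000  # hypothetical question-word vocabulary size
MAX_Q_LEN = 26      # hypothetical maximum question length (in words)

# Question branch: word indices -> LSTM -> 1024-d encoding.
q_in = layers.Input(shape=(MAX_Q_LEN,), dtype="int32")
q_enc = layers.LSTM(1024)(layers.Embedding(VOCAB_SIZE, 300)(q_in))

# Image branch: 4096-d VGG-16 fc7 features -> linear map to 1024-d.
img_in = layers.Input(shape=(4096,))
img_enc = layers.Dense(1024)(img_in)

# Fusion: element-wise multiplication ("VQA-qxi");
# use layers.Concatenate() here for the "VQA-q+i" variant.
x = layers.Multiply()([q_enc, img_enc])

# MLP classifier: 2 hidden layers of 1000 units with tanh and dropout 0.5,
# followed by a 1000-way softmax over the top training answers.
for _ in range(2):
    x = layers.Dropout(0.5)(layers.Dense(1000, activation="tanh")(x))
out = layers.Dense(1000, activation="softmax")(x)

model = Model(inputs=[q_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```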
Various ways of incorporating attention:
We leverage human attention maps to modulate the convolutional-layer activations of a VGG-16 network pretrained on ImageNet.
The VQA model uses fc7 features from the VGG-16 network, and we incorporate attention in multiple ways:
By passing the blurred image as input. The blurred input image is obtained during the annotation collection process
described earlier (a sketch of constructing such an input appears after this list). For this input, we evaluate multiple models:
passing the blurred image through the original VQA pipeline (VQA-qxi-blur-wo-ft),
fine-tuning the last two
fully connected layers of the VQA pipeline (VQA-qxi-blur-last2-ft),
and learning the weights of the last
two layers starting from random initialization (VQA-qxi-blur-last2-rand).
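Constructing this blurred input might look like the following sketch, where the blur strength `sigma` and the blending scheme are assumptions rather than the exact interface parameters:

```python
import cv2
import numpy as np

def attention_blur(image, attention, sigma=15):
    """Blend a sharp image with its blurred version so that attended
    regions stay sharp and the rest of the image is blurred.

    image:     (H, W, 3) uint8 image.
    attention: (H, W) soft attention map, higher = more important.
    sigma:     assumed blur strength, not the exact interface setting.
    """
    blurred = cv2.GaussianBlur(image, (0, 0), sigma)
    alpha = (attention / (attention.max() + 1e-8))[..., np.newaxis]
    return (alpha * image + (1.0 - alpha) * blurred).astype(image.dtype)
```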
Additionally, we dampen the activations obtained after the pool5 layer using soft attention maps.
For a given question-image pair, we take the soft attention map and resize it to match the spatial size of
the pool5 output feature map. Multiplying the pool5 activations by the normalized soft attention
map leaves higher activations in regions that are important, while activations in
unimportant regions are dampened. For this dampened input, we repeat the three settings described earlier.
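A minimal sketch of this dampening step, assuming Caffe's channels-first layout and VGG-16's 512×7×7 pool5 output for 224×224 inputs:

```python
import cv2
import numpy as np

def dampen_pool5(pool5, attention):
    """Dampen VGG-16 pool5 activations with a soft attention map.

    pool5:     (512, 7, 7) feature map (channels-first, as in Caffe).
    attention: (H, W) soft attention map for the question-image pair.
    """
    h, w = pool5.shape[1], pool5.shape[2]
    # Resize the attention map to the pool5 spatial resolution
    # (cv2.resize takes the target size as (width, height)).
    att = cv2.resize(attention.astype(np.float32), (w, h))
    att /= att.max() + 1e-8  # normalize so important regions keep full strength
    # Broadcast the same spatial weighting across all 512 channels.
    return pool5 * att[np.newaxis, :, :]
```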
We also use a variant of the original VQA model where, instead of point-wise multiplication to fuse image and question features,
we simply concatenate them. To remain consistent with our naming convention, these models are prefixed "VQA-q+i" in the rest of the paper.
We also tried extracting crops of the original image using attention maps. For each image-question pair, we extracted two crops
such that 90% and 70% of the mass of the soft attention map lies inside the crop. These two approaches are denoted VQA-qxi-crop-0.9 and VQA-qxi-crop-0.7, respectively.
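One simple way to obtain such crops (a sketch, not necessarily our exact procedure) is to keep the highest-attention pixels until they account for the desired fraction of the mass, then take their bounding box:

```python
import numpy as np

def attention_crop(image, attention, mass=0.9):
    """Crop `image` to a box containing at least `mass` of the attention.

    Keeps the highest-attention pixels until they account for `mass` of
    the total, then takes the bounding box of those pixels.
    """
    att = attention / attention.sum()
    flat = att.ravel()
    order = np.argsort(flat)[::-1]  # pixel indices, strongest first
    cum = np.cumsum(flat[order])
    keep = order[: np.searchsorted(cum, mass) + 1]
    ys, xs = np.unravel_index(keep, att.shape)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```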
To compare against models trained with attention annotations and random initialization of the last two layers,
we also trained a model on original images, randomly initializing the last two layers but using only those image-question pairs
for which we had attention annotations, i.e., the 60K image-question pairs of the training set. Since we had only 60K annotations in the training set, it would be unfair
to compare the attention approaches with models trained on all 240K image-question pairs of the original training set. These
models are named "VQA-qxi-last2-rand" and "VQA-q+i-last2-rand" in the results tables.
All models were trained using Keras, and the finetuning experiments were done using Caffe. Both libraries are open source
and widely used for experiments involving convolutional and recurrent neural networks.
Experiments and Results
Training and Testing Data:
We used the original VQA model trained on the entire training split of the VQA dataset. To incorporate attention, the models were
finetuned using attention annotations collected for 60K image-question pairs of the training set. The models were then
evaluated on the 5K image-question pairs of the validation split for which we had collected human annotations.
The evaluation metric used for our experiments is taken from [2]: accuracy(ans) = min(#humans that provided ans / 3, 1).
According to this metric, an answer is deemed 100% accurate if at least 3 workers provided that exact answer.
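In code, the metric reduces to the following (a sketch; `human_answers` would be the ten ground-truth answers collected for a question):

```python
def vqa_accuracy(predicted, human_answers):
    """VQA accuracy of `predicted` against the human answers:
    full credit if at least 3 humans gave exactly that answer."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)
```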
The two tables below show the accuracies obtained by the various models under this metric. Table 2 on the left
shows results for models that fuse question and image features using point-wise multiplication, while Table 3 on the right shows results for models
where the fusion step is simple concatenation.
Table 2: Point-wise multiplication model.
Table 3: Concatenation model.
Table 4: Validation accuracy per question type for all methods, using point-wise multiplication models.
Table 5: Validation accuracy per question type for all methods, using concatenation models.
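Per-question-type accuracies like those in Tables 4 and 5 can be computed by bucketing questions on their opening words; the two-word key in this sketch is a simplification of the actual VQA question types:

```python
from collections import defaultdict

def accuracy_by_question_type(results):
    """Average accuracy per crude question type.

    results: iterable of (question, accuracy) pairs; the type key is
    the first two words of the question ("what color", "how many", ...).
    """
    buckets = defaultdict(list)
    for question, acc in results:
        key = " ".join(question.lower().split()[:2])
        buckets[key].append(acc)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```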
Looking at Table 2, we can see that incorporating attention does not help in the "VQA-qxi" setting.
Passing the blurred image as input
does better than dampening the pool5 output feature maps. One possible reason
is that when the question features are point-wise multiplied with the image features, the question is arguably already providing a form of attention;
dampening the pool5 activations beforehand creates a misalignment between the two forms of attention. One possible way
to fix this would be to finetune the fully connected layers of the VGG-Net along with the transformation layer that maps the 4096-dimensional
image feature vector into the 1024-dimensional question feature space. However, given the limited amount of data (60K image-question pairs vs.
the 1.2 million images of ImageNet), finetuning these layers is not feasible. This explanation is further supported by the fact that
both "VQA-q+i-damp-last2-ft" and "VQA-q+i-damp-last2-rand" perform significantly better than "VQA-qxi-damp-last2-ft" and
"VQA-qxi-damp-last2-rand": since the image and question features are concatenated, no implicit attention is provided by the question.
In this setting, the dampened activations also outperform the model in which the blurred image is passed as input.
The overall performance of the methods using visual attention seems to suggest that incorporating attention into the original VQA model
does not improve performance, which is counter-intuitive. A possible explanation is that the features
produced by the VGG-Net architecture do not capture fine-grained details of the image, so attention has little to act on.
This claim is supported by two observations: VGG-Net was trained for classification, where fine-grained image details matter less,
and the VQA paper showed only a small improvement (~2%) over the question-only baseline.
Another interesting observation is that models using cropped images, such as "VQA-qxi-crop-0.9" and "VQA-qxi-crop-0.7",
perform nearly as well as the original model, which supports the claim that only the attended regions of the image are important for
answering the question.
Here are some qualitative results, produced by the VQA-q+i-conc-last2-ft model. Answers in green are correct
predictions, while answers in red are incorrect. The images are overlaid with their corresponding human attention maps.
Q: What color are the shoes for the woman on the left? A: yellow
Q: Are the women cold? A: yes
Q: Is there any purple? A: no
Q: What color is the floor? A: brown
Q: What brand of tennis shoes is the boy wearing? A: nike
Q: What sport is the boy playing? A: tennis
Q: What is the person holding up to his mouth? A: food
Q: Is this indoors? A: yes
Q: What color is the board? A: blue
Q: What does the person have on his face? A: glasses
Q: What ethnicity is this meal? A: banana
Q: Was this photo taken this century? A: yes
The experiments conducted in this project provide many interesting insights. Although humans have the capability of attending to different parts
of a scene to answer different questions, the current model does not capture fine-grained details well enough for attention to matter.
In the VQA-qxi model, the question features seem to provide some form of attention, and using question features to determine
important regions in an image appears to be a good approach for incorporating attention. Such an approach would provide attention
in an unsupervised manner and would not suffer from a data bottleneck, since it would not rely on human attention annotations. Recently,
there has been a flurry of papers [23, 24, 25, 26] that incorporate attention in their pipelines for VQA; however, none of them clearly
shows that attention is the major cause of improvement over the VQA baseline. Performing ablation studies on those models would help
discover which part of the pipeline causes the major improvement. Another way to improve accuracies is
to devise proxy tasks, such as attribute classification, that help VGG capture fine-grained details. Overall, visual
question answering is a task that requires understanding fine-grained details of an image, and using attention to
focus on regions of importance and reduce background clutter seems like a promising approach.
References
[1] M. Malinowski and M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. NIPS, 2014.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. ICCV, 2015.
[3] J. L. Ba, V. Mnih, and K. Kavukcuoglu. Multiple Object Recognition with Visual Attention. ICLR, 2015.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473, 2014.
[5] X. Chen and C. L. Zitnick. Mind's Eye: A Recurrent Visual Representation for Image Caption Generation. CVPR, 2015.
[6] J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the Wisdom of the Crowd for Fine-Grained Recognition. PAMI.
[7] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring Nearest Neighbor Approaches for Image Captioning. arXiv preprint, 2015.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CoRR, abs/1411.4389, 2014.
[9] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From Captions to Visual Concepts and Back. CoRR, abs/1411.4952, 2014.
[10] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona. What do we perceive in a glance of a real-world scene? Journal of Vision, 2007.
[11] M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in Context. CVPR, 2015.
[12] M. Jiang, J. Xu, and Q. Zhao. Saliency in Crowd. ECCV, 2014.
[13] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to Predict Where Humans Look. ICCV, 2009.
[14] A. Karpathy and F. Li. Deep Visual-Semantic Alignments for Generating Image Descriptions. CoRR, abs/1412.2306, 2014.
[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. CoRR, abs/1411.2539, 2014.
[16] L. Ma, Z. Lu, and H. Li. Learning to Answer Questions from Image using Convolutional Neural Network. arXiv preprint, 2015.
[17] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain Images with Multimodal Recurrent Neural Networks. CoRR, 2014.
[18] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent Models of Visual Attention. NIPS, 2014.
[19] M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering. arXiv preprint, 2015.
[20] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. CoRR, abs/1411.4555, 2014.
[21] L. von Ahn and L. Dabbish. Labeling Images with a Computer Game. CHI, 2004.
[22] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv, 2015.
[23] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep Compositional Question Answering with Neural Module Networks. arXiv, 2015.
[24] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question Answering. arXiv, 2015.
[25] H. Noh, P. H. Seo, and B. Han. Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. arXiv, 2015.
[26] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. arXiv, 2015.
This work was done in collaboration with Abhishek Das, an intern at the Machine Learning and Perception Lab,
under the guidance of Prof. Dhruv Batra and Prof. Devi Parikh. I would also like to acknowledge the efforts
of the workers on Amazon Mechanical Turk, who did an amazing job providing attention annotations for the dataset.
Note that the images displayed on this page are either taken from the VQA paper or are part of the VQA dataset.