Multimodal understanding of vision and natural language is paramount for progress toward artificial intelligence. The task of visual question answering (VQA) was recently introduced to evaluate a system's fine-grained understanding of both images and language. Since humans are capable of attending to specific regions of a scene based on their need, we believe attention will be useful for visual question answering. This form of attention differs from earlier work, which focused on finding salient regions in an image: attention for VQA is task-specific, since the region to attend to depends on the question as well as the image. We collect human annotation data by showing subjects image-question pairs and asking them to mark the regions necessary to answer the question. We incorporate this attention into the VQA pipeline in various ways and study the resulting performance.


Multimodal understanding of vision and natural language is one of the big challenges of artificial intelligence. The aim is to build models powerful enough to recognize objects in a scene, understand and reason about their relationships and express these relationships in natural language. To this end, there have been several recent efforts on tasks such as visual question answering [2, 6, 19] and image caption generation [5, 8, 9, 14, 15, 17, 20].

A visual question answering (VQA) system takes as input an image and a question about that image and produces an answer as output, as shown in the figure below. This goal-driven task is applicable to scenarios where visually-impaired humans or intelligence analysts obtain visual information from an intelligent system. Unlike image captioning, where a coarse understanding is often sufficient to describe an image [7], a VQA model needs to pay attention to fine-grained details to answer a question correctly.

What color are her eyes?
What is the mustache made of?

How many slices of pizza are there?
Is this a vegetarian pizza?

Does it appear to be rainy?
Does this person have 20/20 vision?

Prior work [10] has shown that humans have the ability to quickly perceive a scene, shifting focus from one part of the scene to another multiple times per second. Visual questions selectively target different areas of an image, including background details and underlying context. This suggests that attention mechanisms should play a crucial role in any successful VQA system. For instance, for the first image in the above figure, if the question was "What color are her eyes?", a VQA system that doesn't attend to the region in the image corresponding to the eyes is unlikely to answer the question correctly.

What is on the front of the bus?

Which company is on the left of the player?

What are the people in the background doing?

We believe attention is critical in VQA, and in this project we study the role it can play. We design and conduct human studies to collect "human attention maps", i.e., where people look when asked to answer a question. Concretely, our contributions are twofold: a dataset of question-specific human attention maps, and an initial study of ways to incorporate these maps into a VQA model. Automatically generating question-specific attention maps for images is a fruitful direction for future research, and our dataset and initial approach can help gain insights into how to incorporate attention for the task of visual question answering.

Dataset: Collection and Analysis

In order to accurately capture attention regions that help in answering visual questions, we experimented with three variants of our attention-annotation interface. In all of these, we present a blurred image and a question about the image, and ask subjects to sharpen regions of the image that are relevant to answering the question correctly, in a smooth, click-and-drag, 'coloring' motion with the mouse.
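The compositing step behind this interface can be sketched with NumPy. This is a minimal illustration, not the actual annotation tool: the image, the crude 3x3 box blur, and the rectangular "stroke" mask are all stand-ins (the real interface presumably uses a proper Gaussian blur and free-form mouse strokes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "sharp" grayscale image and a blurred copy (a crude
# 3x3 box blur here; the real interface presumably uses a Gaussian blur).
img = rng.random((64, 64))
pad = np.pad(img, 1, mode="edge")
blurred = sum(pad[i:i + 64, j:j + 64] for i in range(3) for j in range(3)) / 9.0

# The subject's click-and-drag strokes accumulate into a [0, 1] mask.
# Here a hypothetical rectangular "sharpened" region stands in for strokes.
mask = np.zeros((64, 64))
mask[20:40, 24:44] = 1.0

# What the subject sees: sharp inside the stroked region, blurred elsewhere.
shown = mask * img + (1.0 - mask) * blurred

assert np.allclose(shown[25, 30], img[25, 30])   # inside mask: sharp
assert np.allclose(shown[5, 5], blurred[5, 5])   # outside mask: blurred
```

The accumulated mask itself is what we record as the attention annotation.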

Starting with only a blurred image and question (in our first interface) has two consequences. Sometimes we are able to capture exploratory attention, where the subject lightly sharpens large regions of an image to find salient regions that eventually lead them to the answer. However, for certain question types, such as counting ("How many ...") and binary ("Is there ...") questions, the captured attention maps are often incomplete, and hence inaccurate.

So in addition to the question and blurred image, in our second interface, we show the correct answer and ask subjects to sharpen as few regions as possible such that someone can answer the question just by looking at the blurred image with sharpened regions.

To encourage exploitation instead of exploration, in our third interface we show subjects the question-answer pair together with the full-resolution image. We expected that seeing the full-resolution image and the answer would let subjects sharpen exactly the regions most relevant to answering the question correctly, but the task turns out to be counter-intuitive: subjects see the full-resolution image, yet must imagine a scenario where someone else has to answer the question without seeing it.

We ensure that we don't present the same image to the same subject twice. Since each image has three associated questions, it was important that the subjects didn't become familiar with the image as that would bias the attention annotations.

We also performed a human study to compare the quality of the annotations collected by the different interfaces. Table 1 shows that Interface 2 (blurred image with answer) yields annotations that let subjects answer almost as accurately as seeing the entire image along with the question. We therefore used Interface 2 to collect the rest of the annotations. For this project, we collected annotations on 60,000 image-question pairs from the training set and 5,000 image-question pairs from the validation set. The figure below shows all three interfaces.

(a) Interface 1: Blurred image and question without answer.

(b) Interface 2: Blurred image and question with answer.

(c) Interface 3: Blurred image and original image with question and answer.
Interface Type Human Accuracy
Blurred Image without Ans. 75.17%
Blurred Image with Ans. 78.69%
Blurred & Original Image with Ans. 71.2%
Original Image 80.0%

(d) Human study to compare the quality of annotations collected by different interfaces.

Attention Map Examples:

Here are some of the attention maps we collected.

Q: What is in the vase?

Q: What is the man using to take a picture?

Q: What is floating in the sky?


VQA Model:

The model used for this project is based on the VQA model of [2]. An LSTM converts the question into a 1024-dimensional encoding, taking one-hot encodings of the question words as input. The image is represented by fc7 features from a VGG-16 network, followed by a linear transformation that maps them to 1024 dimensions to match the LSTM encoding of the question. The question and image encodings are fused via element-wise multiplication or simple concatenation; models using element-wise multiplication are prefixed "VQA-qxi", while concatenation models are prefixed "VQA-q+i". The fused features are then passed through a multi-layer perceptron (MLP) classifier with 2 hidden layers of 1000 units each, where each fully connected layer is followed by a tanh non-linearity and dropout with ratio 0.5. The output is a 1000-way softmax that predicts one of the top 1000 answers in the training dataset; it was observed in [2] that classifying into the 1000 most frequent answers covers 82.67% of the train + val answers.
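The two fusion variants can be sketched in a few lines of NumPy. The vectors and projection matrix below are random stand-ins for the real LSTM encoding, fc7 features, and learned linear layer; only the shapes and the fusion operations reflect the model described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real encoders (random values, correct shapes).
q_enc = rng.standard_normal(1024)              # 1024-d LSTM question encoding
fc7 = rng.standard_normal(4096)                # VGG-16 fc7 image feature
W = rng.standard_normal((1024, 4096)) * 0.01   # learned linear projection (assumed init)
img_enc = np.tanh(W @ fc7)                     # 1024-d image encoding

# Fusion option 1: element-wise multiplication (the "VQA-qxi" models).
fused_mul = q_enc * img_enc                    # shape (1024,)

# Fusion option 2: simple concatenation (the "VQA-q+i" models).
fused_cat = np.concatenate([q_enc, img_enc])   # shape (2048,)

assert fused_mul.shape == (1024,)
assert fused_cat.shape == (2048,)
```

Either fused vector is what feeds the 2-layer MLP classifier.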

Various ways of incorporating attention:

We leverage the human attention maps to modulate the activations of a VGG-16 network pretrained on ImageNet. The VQA model uses fc7 features from this network, and we incorporate attention in multiple ways: by blurring the image outside the attended regions before feeding it to the network (the "blur" models), by dampening the pool-5 output feature maps with the attention map (the "damp" models), and by cropping the image to the attended region (the "crop" models).
All the models were trained using Keras, and the finetuning experiments were done using Caffe. Both libraries are open source and widely used for experiments involving convolutional and recurrent neural networks.
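The dampening variant can be sketched as follows. The shapes are assumptions for illustration: VGG-16's pool5 output is 512 channels of 7x7, and we pretend the human attention map arrives at 14x14 and is average-pooled down to the pool5 grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: VGG-16 pool5 activations and a human attention map
# (14x14 is an assumed resolution for the collected map).
pool5 = rng.random((512, 7, 7))
attn = rng.random((14, 14))

# Downsample the attention map to the pool5 grid via 2x2 average pooling.
attn7 = attn.reshape(7, 2, 7, 2).mean(axis=(1, 3))
attn7 = attn7 / attn7.max()                 # normalize to [0, 1]

# "Dampen" pool5: scale every channel by the spatial attention weights.
pool5_damped = pool5 * attn7[None, :, :]

# Activations outside attended regions shrink; attended ones are preserved.
assert pool5_damped.shape == pool5.shape
assert np.all(pool5_damped <= pool5 + 1e-12)
```

The dampened maps then flow through the remaining (fully connected) VGG-16 layers to produce the fc7 features used by the VQA model.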

Experiments and Results

Training and Testing Data:

We used the original VQA model trained on the entire training split of the VQA dataset. To incorporate attention, the models were finetuned using the attention annotations collected for 60K image-question pairs of the training set. The models were then evaluated on the 5K image-question pairs of the validation split for which we had collected human annotations.

Evaluation Metric:

The evaluation metric used for our experiments is taken from [1]: Acc(ans) = min(#humans that provided ans / 3, 1).
According to this metric, an answer is deemed 100% accurate if at least 3 workers provided that exact answer.
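The metric is simple enough to state directly in code (a sketch of the rule described above, not the official evaluation script):

```python
def vqa_accuracy(predicted, human_answers):
    """Accuracy for one question: min(#humans who gave this answer / 3, 1).

    `human_answers` is the list of human-provided answers for the question
    (typically 10 per question in the VQA dataset).
    """
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# An answer given by 3+ humans scores 100%; by 2 humans, 2/3; by 1, 1/3.
humans = ["brown"] * 3 + ["tan"] * 2 + ["beige"] * 5
assert vqa_accuracy("brown", humans) == 1.0
assert abs(vqa_accuracy("tan", humans) - 2.0 / 3.0) < 1e-9
```

A model's overall score is this per-question accuracy averaged over the evaluation set.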


The two tables below show the accuracies obtained by the various models under the accuracy metric described above. Table 2 (left) shows results for the model that fuses question and image features via point-wise multiplication, while Table 3 (right) shows results for the model whose fusion step is simple concatenation.

Method Accuracy
VQA-qxi 59.96%
VQA-qxi-last2-rand. 45.19%
VQA-qxi-blur-wo-ft 48.56%
VQA-qxi-damp-wo-ft 40.94%
VQA-qxi-blur-last2-ft 50.19%
VQA-qxi-damp-last2-ft 44.47%
VQA-qxi-blur-last2-rand 44.85%
VQA-qxi-damp-last2-rand 37.10%
VQA-qxi-crop-0.9 60.73%
VQA-qxi-crop-0.7 60.06%

Table 2: Point-wise multiplication model.
Method Accuracy
VQA-q+i 62.06%
VQA-q+i-last2-rand. 48.04%
VQA-q+i-blur-wo-ft 49.95%
VQA-q+i-damp-wo-ft 49.94%
VQA-q+i-blur-last2-ft 51.05%
VQA-q+i-damp-last2-ft 55.44%
VQA-q+i-blur-last2-rand 46.67%
VQA-q+i-damp-last2-rand 47.74%
VQA-q+i-crop-0.9 60.55%
VQA-q+i-crop-0.7 60.32%

Table 3: Concatenation model.

Table 4: Table displaying validation accuracy per question type for all the methods when using point-wise multiplication models.

Table 5: Table displaying validation accuracy per question type for all the methods when using concatenation models.

Looking at Table 2, we can see that incorporating attention does not perform well in the "VQA-qxi" setting. Passing the blurred image as input does better than dampening the pool-5 output feature maps. One possible reason is that when the question features are point-wise multiplied with the image features, the multiplication itself provides a form of attention; dampening the pool-5 activations then creates a misalignment between the two forms of attention. One possible fix would be to finetune the fully connected layers of the VGG-Net and the transformation layer that maps the 4096-dimensional image feature vector into the 1024-dimensional question feature space. However, given only a limited amount of data (50K images vs. the 1.2 million images of ImageNet), finetuning these layers is not feasible. This explanation is further supported by the fact that both "VQA-q+i-damp-last2-ft" and "VQA-q+i-damp-last2-rand" perform significantly better than "VQA-qxi-damp-last2-ft" and "VQA-qxi-last2-rand": since the image and question features are concatenated, no implicit attention is being provided by the question. In this setting, the dampened activations also outperform the model that takes the blurred image as input.

The overall performance of the methods using visual attention suggests that incorporating attention into the original VQA model does not improve performance, which is counter-intuitive. A possible explanation is that the features produced by the VGG-Net architecture do not capture fine-grained details of the image, so attention has little to act on. This claim is supported by two facts: VGG-Net was trained for classification, where fine-grained details matter less, and the VQA paper showed only a small improvement (~2%) over the question-only baseline. Another interesting observation is that models using cropped images, such as "VQA-qxi-crop-0.9" and "VQA-qxi-crop-0.7", perform nearly as well as the original model, which supports the idea that only the attended regions of the image are important for answering the question.
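One plausible reading of the crop variants (an assumption on our part; the report does not define the suffix) is that "crop-0.9" crops the image to the tightest axis-aligned box whose rows and columns cover 90% of the attention mass. Under that assumption, the cropping step can be sketched as:

```python
import numpy as np

def crop_to_attention(img, attn, frac=0.9):
    """Crop `img` to the tightest box covering `frac` of the attention mass.

    Hypothetical interpretation of the crop-0.9 / crop-0.7 variants: keep
    the central band of rows (and columns) whose cumulative attention mass
    spans the middle `frac` fraction.
    """
    total = attn.sum()
    rows = attn.sum(axis=1).cumsum() / total   # cumulative row-wise mass
    cols = attn.sum(axis=0).cumsum() / total   # cumulative column-wise mass
    lo = (1.0 - frac) / 2.0
    hi = 1.0 - lo
    r0, r1 = np.searchsorted(rows, [lo, hi])
    c0, c1 = np.searchsorted(cols, [lo, hi])
    return img[r0:r1 + 1, c0:c1 + 1]

# A map concentrated in a 10x10 patch yields roughly that patch back.
img = np.arange(64 * 64, dtype=float).reshape(64, 64)
attn = np.zeros((64, 64))
attn[30:40, 20:30] = 1.0
assert crop_to_attention(img, attn, 0.9).shape == (10, 10)
```

A smaller `frac` (e.g. 0.7) produces a tighter crop around the attention peak.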

Qualitative Results

Here are some qualitative results, produced by the VQA-q+i-conc-last2-ft model. Answers in green are correct predictions, while answers in red are incorrect predictions. The images are overlaid with their corresponding human attention maps.

Q: What color are the shoes for the woman on the left? A: yellow

Q: Are the women cold? A: yes

Q: Is there any purple? A: no

Q: What color is the floor? A: brown

Q: What brand of tennis shoes is the boy wearing? A: nike

Q: What sport is the boy playing? A: tennis

Q: What is the person holding up to his mouth? A: food

Q: Is this indoors? A: yes

Q: What color is the board? A: blue

Q: What does the person have on his face? A: glasses

Q: What ethnicity is this meal? A: banana

Q: Was this photo taken this century? A: yes


The experiments conducted in this project provide many interesting insights. Although humans have the capability of attending to different parts of a scene to answer different questions, the current model does not capture the fine-grained details needed for attention to matter. In the VQA-qxi model, the question features appear to provide some form of attention, so using question features to determine important regions in an image seems a promising way to incorporate attention. Such an approach would provide attention in an unsupervised manner and would not suffer from a data bottleneck, since it would not rely on human attention annotations. Recently, there has been a flurry of papers [23, 24, 25, 26] that incorporate attention into their pipelines for VQA, but none of them clearly shows that attention is the major cause of improvement over the VQA baseline. Performing ablation studies on those models would help uncover which part of the pipeline causes the major improvement. Accuracies could also be improved through proxy tasks, such as attribute classification, that push the VGG network to capture fine-grained details. Overall, visual question answering is a task that requires understanding fine-grained details of an image, and using attention to focus on the regions of importance while reducing background clutter seems like a promising approach.


[1] M. Malinowski, M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. NIPS, 2014.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015.
[3] J. L. Ba, V. Mnih, and K. Kavukcuoglu. Multiple Object Recognition With Visual Attention. ICLR, 2015.
[4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[5] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. June 2015.
[6] J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the Wisdom of the Crowd for Fine-Grained Recognition. PAMI, 2015.
[7] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring Nearest Neighbor Approaches for Image Captioning. arXiv preprint, 2015.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
[9] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. CoRR, abs/1411.4952, 2014.
[10] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona. What do we perceive in a glance of a real-world scene? Journal of Vision, 7(1):10, 2007.
[11] M. Jiang, S. Huang, J. Duan, and Q. Zhao. Salicon: Saliency in context. In CVPR, June 2015.
[12] M. Jiang, J. Xu, and Q. Zhao. Saliency in Crowd. ECCV, 2014.
[13] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), 2009.
[14] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
[16] L. Ma, Z. Lu, and H. Li. Learning to Answer Questions From Image using Convolutional Neural Network. 2015.
[17] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. CoRR, abs/1410.1090, 2014.
[18] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent Models of Visual Attention. 2014.
[19] M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering. arXiv preprint arXiv:1505.02074, 2015.
[20] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[21] L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI, CHI ’04, 2004.
[22] K. Xu, A. Courville, R. S. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv, 2015.
[23] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep Compositional Question Answering with Neural Module Networks. arXiv, 2015.
[24] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question Answering. arXiv, 2015.
[25] H. Noh, P. H. Seo, and B. Han. Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. arXiv, 2015.
[26] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. arXiv, 2015.