Untitled Presentation - Hacettepe Üniversitesi (aykut/classes/spring...)

Slide Credits: Agrawal

Kolmogorov-Smirnov Test

p(Captions vs (Q+A)) < 0.001
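
For readers unfamiliar with the test named above, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test in plain Python. The function names `ks_statistic` and `ks_pvalue` and the toy samples are hypothetical and not from the slides; the p-value uses the standard asymptotic series, so it is only an approximation for small samples.

```python
import bisect
import math

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximation, assumes no exact small-sample table)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return 1.0  # the series is numerically 1 here and converges slowly
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical toy samples, NOT data from the slides.
sample_a = [1, 2, 3, 4, 5]
sample_b = [6, 7, 8, 9, 10]
d = ks_statistic(sample_a, sample_b)
p = ks_pvalue(d, len(sample_a), len(sample_b))
print(d, p)
```

A small p-value, such as the p < 0.001 reported on the slide, indicates that the two samples are unlikely to come from the same distribution.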

LSTM: one hidden layer, output size 1024, each word size 300
MLP: 2 hidden layer fc network, 1000 units, dropout (0.5), tanh
Training: end-to-end learning, cross-entropy
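
The MLP settings listed above (two fully connected hidden layers, tanh, dropout 0.5, cross-entropy) could be sketched roughly as follows. This is a plain-Python illustration under assumptions: layer sizes are scaled down from 1000 units to keep it fast, weights are random rather than trained, and all helper names (`dense`, `dropout`, `mlp_forward`, ...) are hypothetical.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def dropout(x, rate, rng, train):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not train:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability of the correct answer index."""
    return -math.log(probs[target])

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (placeholder for trained parameters)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, layers, rng, train=False, rate=0.5):
    """Two tanh hidden layers with dropout, then softmax over answers."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = dropout([math.tanh(v) for v in dense(x, w1, b1)], rate, rng, train)
    h = dropout([math.tanh(v) for v in dense(h, w2, b2)], rate, rng, train)
    return softmax(dense(h, w3, b3))

rng = random.Random(0)
IN_DIM, HIDDEN, NUM_ANSWERS = 16, 10, 5  # ASSUMPTION: toy sizes; the slide lists 1000 hidden units
layers = [
    init_layer(IN_DIM, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, NUM_ANSWERS, rng),
]
x = [rng.uniform(-1, 1) for _ in range(IN_DIM)]
probs = mlp_forward(x, layers, rng, train=False)
loss = cross_entropy(probs, target=2)  # target index is arbitrary here
print(len(probs), loss)
```

Dropout is active only when `train=True`; at evaluation time the layer passes its input through unchanged.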

Deeper LSTM: two hidden layers
output: 2048 > fc+tanh > 1024
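
The "fc+tanh" step above can be illustrated with a short sketch. It assumes, without confirmation from the slide, that the 2048-dim vector is a concatenation of four equally sized LSTM state vectors (the last hidden and cell states of both layers). Dimensions are scaled down, the states are random stand-ins, and the names `fc_tanh` and `question_embedding` are hypothetical.

```python
import math
import random

def fc_tanh(x, weights, bias):
    """Fully connected layer followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def question_embedding(states, weights, bias):
    """Concatenate the LSTM state vectors, then apply fc + tanh."""
    concat = [v for state in states for v in state]
    return fc_tanh(concat, weights, bias)

rng = random.Random(0)
STATE_DIM = 8   # ASSUMPTION: stands in for 512 (4 x 512 = 2048)
OUT_DIM = 16    # stands in for the 1024-dim output on the slide

# Random stand-ins for the four state vectors of a two-layer LSTM.
states = [[rng.uniform(-1, 1) for _ in range(STATE_DIM)] for _ in range(4)]
concat_dim = 4 * STATE_DIM
scale = 1.0 / math.sqrt(concat_dim)
weights = [[rng.uniform(-scale, scale) for _ in range(concat_dim)] for _ in range(OUT_DIM)]
bias = [0.0] * OUT_DIM

emb = question_embedding(states, weights, bias)
print(len(emb))
```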

Input Vocabulary: All question words
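
Building an input vocabulary from all question words might look like the sketch below. The tokenization rule, the function names, and the second example question are assumptions made for illustration; the slides do not specify them.

```python
import re

def tokenize(question):
    """Lowercase and split into word tokens (a simple, assumed rule)."""
    return re.findall(r"[a-z0-9']+", question.lower())

def build_vocab(questions):
    """Map every distinct question word to an integer index."""
    vocab = {}
    for question in questions:
        for token in tokenize(question):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(question, vocab):
    """Turn a question into word indices, skipping unknown words."""
    return [vocab[t] for t in tokenize(question) if t in vocab]

questions = [
    "How many horses are in this image?",  # example question from the slides
    "What color are the horses?",          # hypothetical extra question
]
vocab = build_vocab(questions)
ids = encode("How many horses are in this image?", vocab)
print(ids)
```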

2-Channel VQA Model

[Figure: 2-channel VQA model]
Image channel: Image > Convolution Layer + Non-Linearity > Pooling Layer > Convolution Layer + Non-Linearity > Pooling Layer > Fully-Connected MLP > 4096-dim Embedding
Question channel: Question ("How many horses are in this image?") > 1024-dim Embedding
Output: both embeddings > Neural Network > Softmax over top K answers
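
A minimal sketch of how the two channels could feed a softmax over the top K answers follows. The slide does not say how the image and question embeddings are combined, so concatenation is an assumption here, as are the toy dimensions, the random weights, the hypothetical training answers, and every function name.

```python
import math
import random
from collections import Counter

def linear(x, weights, bias):
    """Fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (placeholder for trained parameters)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def top_k_answers(train_answers, k):
    """Keep the K most frequent training answers as the output classes."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]

def vqa_forward(image_emb, question_emb, hidden, output):
    """Fuse both embeddings, apply one tanh layer, then softmax over answers."""
    fused = list(image_emb) + list(question_emb)  # ASSUMPTION: concatenation
    h = [math.tanh(v) for v in linear(fused, *hidden)]
    return softmax(linear(h, *output))

rng = random.Random(0)
IMG_DIM, Q_DIM, HIDDEN = 32, 8, 16  # slide sizes are 4096 and 1024; scaled down here

# Hypothetical training answers, NOT data from the slides.
answers = top_k_answers(["2", "yes", "2", "no", "yes", "2", "white"], k=3)
hidden = init_layer(IMG_DIM + Q_DIM, HIDDEN, rng)
output = init_layer(HIDDEN, len(answers), rng)

image_emb = [rng.uniform(-1, 1) for _ in range(IMG_DIM)]
question_emb = [rng.uniform(-1, 1) for _ in range(Q_DIM)]
probs = vqa_forward(image_emb, question_emb, hidden, output)
predicted = answers[probs.index(max(probs))]
print(answers, predicted)
```

Restricting the output to the K most frequent answers turns open-ended answering into a K-way classification problem, which is what makes the softmax layer applicable.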

Ablation #1: Language-alone

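[Figure: language-alone ablation of the 2-channel model]
Image channel (drawn in the diagram): Image > Convolution Layer + Non-Linearity > Pooling Layer > Convolution Layer + Non-Linearity > Pooling Layer > Fully-Connected MLP > 1k output units
Question channel: Question ("How many horses are in this image?") > 1024-dim Question Embedding > Neural Network > Softmax over top K answers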

Ablation #2: Vision-alone

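[Figure: vision-alone ablation of the 2-channel model]
Image channel: Image > Convolution Layer + Non-Linearity > Pooling Layer > Convolution Layer + Non-Linearity > Pooling Layer > Fully-Connected MLP > 4096-dim Embedding > Neural Network > Softmax over top K answers
Question channel (drawn in the diagram): Question ("How many horses are in this image?") > Question Embedding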
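
The two ablations differ only in which embedding reaches the classifier. A small sketch of that switch is given below; the function name `classifier_input`, the `mode` strings, and the use of concatenation for the full model are assumptions for illustration, not details taken from the slides.

```python
def classifier_input(image_emb, question_emb, mode="both"):
    """Select which channel(s) feed the answer classifier.

    mode="language": question embedding only (Ablation #1)
    mode="vision":   image embedding only (Ablation #2)
    mode="both":     full model; ASSUMPTION: the two are concatenated
    """
    if mode == "language":
        return list(question_emb)
    if mode == "vision":
        return list(image_emb)
    if mode == "both":
        return list(image_emb) + list(question_emb)
    raise ValueError(f"unknown mode: {mode}")

# Zero vectors with the dimensions shown on the slides.
image_emb = [0.0] * 4096
question_emb = [0.0] * 1024

for mode in ("language", "vision", "both"):
    print(mode, len(classifier_input(image_emb, question_emb, mode)))
```

Comparing the accuracy of the three modes shows how much each modality contributes to the answer.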

Current Leaderboard

Questions & Discussion & Demo