Generation of Image Descriptions
Tambet Matiisen, 14.10.2015
Agenda
- Datasets
- Convolutional neural networks
- Neural language models
- Neural machine translation
- Generation of image descriptions
- Attention
- Metrics
A year ago
- Baidu/UCLA: Explain Images with Multimodal Recurrent Neural Networks
- Toronto: Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- Berkeley: Long-term Recurrent Convolutional Networks for Visual Recognition and Description
- Google: Show and Tell: A Neural Image Caption Generator
- Stanford: Deep Visual-Semantic Alignments for Generating Image Descriptions
- UML/UT: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
- Microsoft/CMU: Learning a Recurrent Visual Representation for Image Caption Generation
- Microsoft: From Captions to Visual Concepts and Back
123,287 images, 5 descriptions for each:
1. A woman with a bike walks by a blue bus with the bbc logo in the front of it.
2. A woman with a bike in front of a bus.
3. A girl walks her bicycle in front of a bus on a busy city street.
4. Young girl with bicycle in front of a public transportation bus and large group of people.
5. Woman with a bicycle wearing a helmet crossing the street in front of a blue bus.
Description vs Caption (SBU 1M Captions, BBC News)
"A woman with a bike walks by a blue bus with the bbc logo in the front of it."
vs
"Me and Lisa had a blast in London last weekend."
Convolutional neural networks
Learn layers of hierarchical features.
Transfer learning: discard the last classification layer and use the fixed network as a feature extractor.
Image: NVidia
Transfer learning
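The feature-extraction idea can be sketched in a few lines. A toy illustration with numpy, where the "pretrained" network is just a stack of random layers and the final classification layer is dropped; all layer sizes and weights are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained" network: feature layers followed by a classifier.
layers = [
    lambda x: np.maximum(0, x @ rng.standard_normal((4096, 512))),  # lower features
    lambda x: np.maximum(0, x @ rng.standard_normal((512, 256))),   # higher features
    lambda x: x @ rng.standard_normal((256, 1000)),                 # classification layer
]

def extract_features(image_vec):
    # Transfer learning: run every layer except the final classifier,
    # treating the truncated network as a fixed feature extractor.
    h = image_vec
    for layer in layers[:-1]:
        h = layer(h)
    return h

feat = extract_features(rng.standard_normal(4096))
```

The resulting 256-dimensional vector would then be fed to a new task-specific model instead of the discarded classifier.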
Language Model
Predict the next word using the previous words: THE CAT SAT ON A MAT???
- Classical N-gram model
- Feed-forward neural network
- Recurrent neural network
- Long Short-Term Memory (LSTM)
Tri-gram Model
P(w_t | w_{t-2}, w_{t-1}) = count(w_{t-2}, w_{t-1}, w_t) / count(w_{t-2}, w_{t-1})
Example: THE CAT SAT ON A → MAT
Simple to implement.
Huge memory needs in case of a bigger vocabulary and bigger N.
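The counting scheme above is straightforward to implement directly; a minimal Python sketch of the maximum-likelihood estimate (no smoothing):

```python
from collections import Counter

def train_trigram(tokens):
    # Count all trigrams and their bigram contexts in one pass.
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    # P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    if bi[(w1, w2)] == 0:
        return 0.0
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

tokens = "the cat sat on the mat".split()
tri, bi = train_trigram(tokens)
print(trigram_prob(tri, bi, "on", "the", "mat"))  # 1.0
```

The memory problem on the slide is visible here: the tables grow with the number of distinct n-grams, which explodes for large vocabularies and larger N.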
Feed-forward Neural Network
(Diagram: inputs THE CAT SAT ON THE → hidden layer H → output MAT)
Straightforward extension of the N-gram model, a more powerful model.
Still only a fixed context is considered.
Neural Language Model
- Softmax output layer (probabilities for word t): V nodes, H×V weights from the hidden layer
- Hidden layer: H nodes, D×H weights from each word representation
- Learned distributed representations of words t-2 and t-1: D nodes each
- 1-of-V representations of words t-2 and t-1: V nodes each, V×D embedding weights (shared)
Bengio et al. A Neural Probabilistic Language Model (2003)
Recurrent Neural Network
(Diagram: inputs <BOS> THE CAT SAT ON THE → hidden states H1..H6 → outputs THE CAT SAT ON THE MAT)
Theoretically retains context of any length.
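One step of such a recurrent language model can be sketched as follows; the dimensions and random weights are illustrative only, not trained values:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, H = 10, 8, 16          # vocabulary, embedding and hidden sizes (made up)
Wxh = rng.standard_normal((H, D)) * 0.1   # input-to-hidden weights
Whh = rng.standard_normal((H, H)) * 0.1   # hidden-to-hidden (recurrent) weights
Why = rng.standard_normal((V, H)) * 0.1   # hidden-to-output weights

def rnn_step(x, h_prev):
    # The new hidden state mixes the current word embedding with the
    # previous state, so context can in principle be unbounded.
    h = np.tanh(Wxh @ x + Whh @ h_prev)
    logits = Why @ h
    p = np.exp(logits - logits.max())
    return h, p / p.sum()    # next-word probability distribution (softmax)

h = np.zeros(H)
for _ in range(3):           # feed three (random) word embeddings
    h, p = rnn_step(rng.standard_normal(D), h)
```

In practice the tanh recurrence above suffers from vanishing gradients over long contexts, which is what motivates the LSTM on the next slide.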
Long Short-Term Memory
Able to retain context longer than a vanilla RNN.
Image: Wikipedia
Neural Machine Translation
(Diagram: encoder reads THE CAT SAT ON THE MAT <EOS> through hidden states H1..H7; decoder emits KASS ISTUS MATIL <EOS>, Estonian for "the cat sat on the mat")
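The encoder-decoder pattern shown above can be sketched with two toy recurrences: the encoder's final hidden state initializes the decoder, which then produces the target sequence. Weights and sizes are illustrative, and word embeddings are stand-in random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 12                     # embedding and hidden sizes (made up)
W_enc = rng.standard_normal((H, D + H)) * 0.1
W_dec = rng.standard_normal((H, D + H)) * 0.1

def step(W, x, h):
    # One recurrence step: combine the input embedding with the previous state.
    return np.tanh(W @ np.concatenate([x, h]))

def encode_decode(src_embs, tgt_embs):
    h = np.zeros(H)
    for x in src_embs:           # encoder reads the whole source sentence
        h = step(W_enc, x, h)
    states = []
    for x in tgt_embs:           # decoder continues from the encoder's state
        h = step(W_dec, x, h)
        states.append(h)         # each state would be projected to a target word
    return states

states = encode_decode(rng.standard_normal((7, D)), rng.standard_normal((3, D)))
```

The key design choice is that the only channel between the two halves is the single fixed-size vector h, which is exactly the bottleneck that attention later removes.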
Generating Descriptions for Images
(Diagram: hidden states H1..H7 map inputs THE CAT SAT ON THE MAT to outputs THE CAT SAT ON THE MAT <EOS>)
Vinyals et al. Show and Tell: A Neural Image Caption Generator (2014)
Demos
- http://nic.droppages.com/ : results for 1000 images from each dataset; in addition, one ground truth sentence is shown.
- http://cs.stanford.edu/people/karpathy/deepimagesent/rankingdemo/ : for every test set sentence below we retrieve the top images (from a set of 1000).
- http://deeplearning.cs.toronto.edu/i2t (Internal Server Error)
- https://www.youtube.com/watch?v=w2iv8gt5cd4&feature=youtu.be
Attention
"The concept of attention is the most interesting recent architectural innovation in neural networks." Andrej Karpathy
Two kinds of attention: soft attention and hard attention.
Soft Attention
A probability distribution is laid over the image. This distribution depends on higher-level features and is learned using backpropagation.
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
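The soft-attention step can be sketched as a softmax over per-region scores followed by a weighted average of region features. The bilinear scoring function here is a simplification for illustration, not the exact MLP used by Xu et al.:

```python
import numpy as np

def soft_attention(features, h, Wa):
    # features: (L, D) annotation vectors, one per image region
    # h: (H,) decoder hidden state; Wa: (D, H) scoring weights (assumed form)
    scores = features @ (Wa @ h)           # one relevance score per region
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # probability distribution over regions
    context = alpha @ features             # expected (softly weighted) feature
    return context, alpha

rng = np.random.default_rng(0)
context, alpha = soft_attention(rng.standard_normal((14 * 14, 512)),  # 196 regions
                                rng.standard_normal(256),
                                rng.standard_normal((512, 256)) * 0.01)
```

Because every operation is differentiable, the attention weights alpha can be learned with plain backpropagation, which is what distinguishes soft from hard attention.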
Correct examples
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
Incorrect examples
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
Hard Attention
At each timestep the network focuses only on a part of the image. Implemented using reinforcement learning.
Mnih et al. Recurrent Models of Visual Attention (2014)
Hard Attention Implementation
In the case of classification, the action is the class. The class of the last glimpse is the output of the network.
Mnih et al. Recurrent Models of Visual Attention (2014)
Soft vs Hard Attention
Soft attention: simpler to implement; doesn't scale to big images.
Hard attention: more complicated to implement; scales to big images and beats convolutional networks.
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
Metrics
- BLEU (Bilingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- CIDEr (Consensus-based Image Description Evaluation)
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- TER (Translation Error Rate)
BLEU
The closer a machine translation is to a professional human translation, the better it is.
N-gram overlap between machine translation output and reference translation.
Compute precision for n-grams of size 1 to 4; add a brevity penalty for too-short translations:
BLEU-4 = min(1, output length / reference length) × (∏_{i=1}^{4} precision_i)^{1/4}
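Following the slide's simplified formula (clipped n-gram precisions for n = 1..4, times a min-based brevity penalty, single reference), a sketch:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(zip(*(tokens[i:] for i in range(n))))

def precision(cand, ref, n):
    # Clipped n-gram precision: candidate n-grams also found in the reference,
    # with each n-gram counted at most as often as it occurs in the reference.
    c, r = ngrams(cand, n), ngrams(ref, n)
    total = sum(c.values())
    return sum(min(cnt, r[g]) for g, cnt in c.items()) / total if total else 0.0

def bleu4(cand, ref):
    brevity = min(1.0, len(cand) / len(ref))   # penalize too-short output
    precs = [precision(cand, ref, n) for n in range(1, 5)]
    if min(precs) == 0.0:
        return 0.0
    prod = 1.0
    for p in precs:
        prod *= p
    return brevity * prod ** 0.25              # geometric mean of the 4 precisions

ref = "a woman with a bike in front of a bus".split()
```

For example, bleu4(ref, ref) is 1.0, while a 5-word prefix of the reference scores 0.5: all its precisions are perfect, but the brevity penalty halves the result. (The original BLEU uses an exponential brevity penalty; the min form matches the slide.)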
Correlations with Human Judgement (Spearman's rho)
CIDEr: 0.581
Meteor: 0.560
BLEU-4: 0.459
ROUGE-SU4: 0.440
TER: -0.290
Desmond Elliott, https://github.com/elliottd/compareimagedescriptionmeasures
(Correlation scatter plots: Ideal World, BLEU4, METEOR)
Desmond Elliott, https://github.com/elliottd/compareimagedescriptionmeasures
Thank you!
Tambet Matiisen
tambet@ut.ee