Week 42: Siamese Network: Architecture and Applications in Visual Object Tracking
  Yuanwei Wu
  10-21-2016
Outline
  - Siamese Architecture
  - Siamese Applications in Computer Vision
  - Paper review: Visual Object Tracking using Siamese CNN
  - Future Work
What does Siamese mean?
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Architecture
  Source: Learning Hierarchies of Invariant Features. Yann LeCun. helper.ipam.ucla.edu/publications/gss2012/gss2012_10739.pdf
Siamese Architecture and loss function
  Source: Learning Hierarchies of Invariant Features. Yann LeCun. helper.ipam.ucla.edu/publications/gss2012/gss2012_10739.pdf
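The figure for this slide is not part of the text. As a minimal sketch, assuming the loss meant here is the contrastive loss commonly paired with Siamese networks (Hadsell, Chopra and LeCun), the two branches share one set of weights and the loss acts on the distance between the two embeddings. The one-layer `embed` function, the `margin` value and the `same` flag are illustrative choices, not taken from the slides.

```python
import numpy as np

def embed(x, W):
    """Shared embedding: both branches use the same weights W (a one-layer stand-in)."""
    return np.tanh(x @ W)

def contrastive_loss(x1, x2, same, W, margin=1.0):
    """Contrastive loss for one pair; `same` is 1 for a matching pair, 0 otherwise."""
    d = np.linalg.norm(embed(x1, W) - embed(x2, W))
    if same:
        # Matching pairs are pulled together.
        return 0.5 * d ** 2
    # Non-matching pairs are pushed apart until they are at least `margin` away.
    return 0.5 * max(0.0, margin - d) ** 2

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))          # hypothetical weights, shared by both branches
a = rng.normal(size=8)
b = a + 0.01 * rng.normal(size=8)    # a near-duplicate of a
c = rng.normal(size=8)               # an unrelated input
loss_similar = contrastive_loss(a, b, 1, W)
loss_dissimilar = contrastive_loss(a, c, 0, W)
```

Because `W` is passed to both calls of `embed`, any update to it changes both branches at once, which is what makes the architecture Siamese.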
Siamese Applications in Computer Vision: 1. Signature Verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 2. Dimensionality Reduction
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 3.1 Learning Image Descriptors, CNN Model
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 3.2 Learning Image Descriptors
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 4.1 Face Verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 4.2 Face Verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 4.3 Face Verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 4.4 Face Verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Siamese Applications in Computer Vision: 4.5 Face Verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Paper Review: Fully-Convolutional Siamese Networks for Object Tracking
  @article{bertinetto2016fully,
    title={Fully-Convolutional Siamese Networks for Object Tracking},
    author={Bertinetto, Luca and Valmadre, Jack and Henriques, Jo{\~a}o F and Vedaldi, Andrea and Torr, Philip HS},
    journal={arXiv preprint arXiv:1606.09549},
    year={2016}
  }
Architecture of Siamese CNN
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Details of the Architecture of Siamese CNN
  1. [Figure; see source 1]
  2. Cross-correlation layer
  Source:
    1: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
    2: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
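The cross-correlation layer can be sketched as sliding the exemplar embedding over the search embedding and taking an inner product at every offset. The function below is an illustrative NumPy version, not the authors' implementation; the 6 x 6 x 128 and 22 x 22 x 128 embedding sizes are assumptions based on the cited paper rather than on these slides, and `bias` is a hypothetical scalar offset.

```python
import numpy as np

def cross_correlate(exemplar_feat, search_feat, bias=0.0):
    """Slide the exemplar embedding over the search embedding (no padding).

    exemplar_feat: (h, w, c) array; search_feat: (H, W, c) array.
    Returns a (H - h + 1, W - w + 1) score map.
    """
    h, w, c = exemplar_feat.shape
    H, W, c_search = search_feat.shape
    if c != c_search:
        raise ValueError("both embeddings must have the same number of channels")
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            window = search_feat[i:i + h, j:j + w, :]
            # Inner product between the exemplar and this window of the search embedding.
            scores[i, j] = np.sum(window * exemplar_feat) + bias
    return scores

rng = np.random.default_rng(0)
z_feat = rng.normal(size=(6, 6, 128))    # exemplar embedding (assumed size)
x_feat = rng.normal(size=(22, 22, 128))  # search embedding (assumed size)
score_map = cross_correlate(z_feat, x_feat)
```

With these sizes the score map has 17 x 17 entries, one per candidate offset of the exemplar inside the search region.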
Training: dataset
  - ImageNet Video dataset of 2015: contains ~4000 videos with ~1 million annotated frames
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Training: preprocessing on the images
  - Preprocessing: 2820 videos
  - Exemplar image: 127 x 127
  - Search image: 255 x 255
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
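How a bounding box is turned into 127 x 127 and 255 x 255 crops is not spelled out on the slide. The sketch below assumes the context rule described in the cited paper: a margin p = (w + h) / 4 is added around the box, and the padded region is scaled so that its area equals 127 x 127. The function name `crop_side` and the proportional rule for the search crop are illustrative assumptions.

```python
import math

def crop_side(w, h, exemplar_size=127, search_size=255):
    """Return (exemplar crop side, search crop side, scale) for a w x h box.

    Assumed rule: pad each dimension by 2p with p = (w + h) / 4, take a square
    with the same area, then scale that square to exemplar_size pixels.
    """
    p = (w + h) / 4.0
    side_z = math.sqrt((w + 2 * p) * (h + 2 * p))   # exemplar crop side, in frame pixels
    scale = exemplar_size / side_z                  # resize factor applied to the crop
    side_x = side_z * search_size / exemplar_size   # search crop side at the same scale
    return side_z, side_x, scale

side_z, side_x, scale = crop_side(50, 50)
```

For a 50 x 50 box this gives a 100-pixel exemplar crop resized by a factor of 1.27, and a search crop of about 200.8 pixels that is resized by the same factor to 255 x 255.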
Training: recap the steps
  - ImageNet Video dataset of 2015: contains ~4000 videos with ~1 million annotated frames
  - Preprocessing: 2820 videos; exemplar image: 127 x 127; search image: 255 x 255
  - Training with a standard Stochastic Gradient Descent (SGD) solver using MatConvNet
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Training: loss function
  - Discriminative training on positive and negative pairs, using the logistic loss.
  - The loss of a score map is the mean of the individual losses.
  - SGD is applied to find the conv-net parameters θ that minimize this loss over the training pairs.
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
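As given in the cited paper, the logistic loss for one score v with label y in {+1, -1} is l(y, v) = log(1 + exp(-y v)), and the loss of a score map is the mean of l(y[u], v[u]) over all positions u. The sketch below implements these two expressions in NumPy. The label map with a positive disc around the center follows the paper's general idea, but the `radius` value and the 17 x 17 size are illustrative assumptions.

```python
import numpy as np

def logistic_loss(y, v):
    """Element-wise l(y, v) = log(1 + exp(-y * v)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -y * v)

def score_map_loss(labels, scores):
    """Mean of the individual losses over all positions of the score map."""
    return float(np.mean(logistic_loss(labels, scores)))

def make_labels(size=17, radius=2.0):
    """+1 within `radius` cells of the map center, -1 elsewhere (radius is illustrative)."""
    center = (size - 1) / 2.0
    rows, cols = np.mgrid[0:size, 0:size]
    dist = np.hypot(rows - center, cols - center)
    return np.where(dist <= radius, 1.0, -1.0)

labels = make_labels()
loss_at_zero = score_map_loss(labels, np.zeros_like(labels))  # uninformative scores
loss_confident = score_map_loss(labels, 5.0 * labels)         # scores that agree with the labels
```

An all-zero score map gives a loss of log 2 at every position, and scores that agree in sign with the labels drive the loss toward zero, which is the quantity SGD reduces during training.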
Tracking algorithm
  - Use a search image centered at the previous position of the target.
  - Only search for the object within a region of approximately four times its previous size.
  - A cosine window is added to the score map to penalize large displacements.
  - The position of the maximum score relative to the center of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame.
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
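The last two steps can be sketched as follows. This is a simplified illustration, not the authors' tracker: the network stride of 8 and the `window_influence` weight are assumed values, the cosine window is blended in as a weighted sum, and the resulting displacement is expressed in pixels of the resized search image.

```python
import numpy as np

def locate_target(score_map, stride=8, window_influence=0.3):
    """Return the (dy, dx) displacement of the target from the center of the search image."""
    size = score_map.shape[0]
    # 2-D cosine (Hann) window, largest at the center of the score map.
    hann = np.hanning(size)
    window = np.outer(hann, hann)
    window /= window.sum()
    # Normalize the scores so that they are comparable with the window.
    scores = score_map - score_map.min()
    total = scores.sum()
    if total > 0:
        scores = scores / total
    # Adding the window favors positions close to the previous target location.
    blended = (1.0 - window_influence) * scores + window_influence * window
    row, col = np.unravel_index(np.argmax(blended), blended.shape)
    center = (size - 1) / 2.0
    # Offset from the map center, converted to pixels with the network stride.
    return (row - center) * stride, (col - center) * stride

demo_map = np.zeros((17, 17))
demo_map[10, 6] = 1.0          # hypothetical peak two cells below and left of center
dy, dx = locate_target(demo_map)
```

A peak two cells away from the center of a 17 x 17 map corresponds to a 16-pixel displacement at stride 8, and a completely flat score map leaves the target where it was because only the window contributes.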
Experiments: training dataset size
  - Accuracy: calculated as the average Intersection-over-Union (IoU)
  - Robustness: measured as the total number of failures
  - Using a larger video dataset could increase the performance even further.
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
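The two measures can be illustrated with a short sketch. It assumes boxes given as (x, y, width, height) with (x, y) the top-left corner, and it counts a failure whenever the overlap drops to zero. The benchmark protocol itself (re-initialization after a failure, frames excluded from the average) is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average IoU over all frames and the number of frames with zero overlap."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    return sum(overlaps) / len(overlaps), failures

# Hypothetical three-frame sequence: perfect, half-shifted, and lost.
truth = [(0, 0, 10, 10)] * 3
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
accuracy, failures = accuracy_and_failures(preds, truth)
```

In this example the overlaps are 1, 1/3 and 0, so the accuracy is 4/9 and one failure is recorded.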
Experiments: OTB13 benchmark results
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Experiments: VOT15 benchmark results
  - Estimates the new position of the target object by merely cross-correlating the embeddings of two patches over three scales.
  - Achieves real-time performance and state-of-the-art results.
  Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Future work: How to improve the performance?
  - By augmenting the online tracking pipeline:
    - online model updating (e.g. tracking-by-detection)
    - bounding-box regression (e.g. YOLO, Faster R-CNN)
    - fine-tuning (e.g. correlation filters + CNN features)
    - memory (e.g. add RNN, LSTM)
Source: Guanghan Ning, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, Haohong Wang, Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking, arXiv preprint, 2016.
Future work: How to improve the performance?
  - By augmenting the online tracking pipeline:
    - online model updating (e.g. tracking-by-detection)
    - bounding-box regression (e.g. YOLO, Faster R-CNN)
    - fine-tuning (e.g. correlation filters + CNN features)
    - memory (e.g. add RNN, LSTM)
  - By introducing a new architecture within the Siamese CNN framework, which requires a closer look at the network structure (e.g. regression network, triplet network).
Triplet Network
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
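The figure for this slide is not part of the text. As a rough illustration of the idea, a triplet network embeds an anchor, a positive and a negative sample with shared weights, and is trained so that the anchor ends up closer to the positive than to the negative. The hinge-style loss with squared Euclidean distances and the `margin` value below are assumptions (a commonly used form), not necessarily the loss shown on the slide.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors (assumed form)."""
    d_pos = float(np.sum((anchor - positive) ** 2))  # anchor to matching sample
    d_neg = float(np.sum((anchor - negative) ** 2))  # anchor to non-matching sample
    # Zero once the negative is farther away than the positive by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
near = np.array([0.0, 0.1])
far = np.array([1.0, 0.0])
loss_good = triplet_loss(anchor, near, far)   # positive is near, negative is far
loss_bad = triplet_loss(anchor, far, near)    # roles swapped
```

The first triplet is already well separated and incurs no loss, while the swapped triplet is penalized by roughly the gap between the two distances plus the margin.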
Future work: How to improve the performance?
  - By augmenting the online tracking pipeline:
    - online model updating (e.g. tracking-by-detection)
    - bounding-box regression (e.g. YOLO, Faster R-CNN)
    - fine-tuning (e.g. correlation filters + CNN features)
    - memory (e.g. add RNN, LSTM)
  - By introducing a new architecture within the Siamese CNN framework, which requires a closer look at the network structure (e.g. regression network, triplet network).
  - By introducing a new loss function in the Siamese network.
Loss function used in face verification
  Source: http://vision.ia.ac.cn/zh/senimar/reports/siamese-network-architecture-and-applications-in-computer-vision.pdf
Thank you!