Pre-trained Deep CNN Models for Detecting Complex Events in Unconstrained Videos
Joseph P. Robinson and Yun Fu

Acknowledgements: The authors wish to thank Edward Scott, Kevin Brady, and Charlie K. Dagli for their technical contributions.

Abstract
Rapid event detection faces an emergent need to process large video collections;
whether surveillance footage or unconstrained web video, automatic recognition of
high-level, complex events is challenging. In response to pre-existing methods that
are complicated, computationally demanding, and hard to replicate, we designed a
simple system that is fast, effective, and carries minimal memory and storage
overhead. Most importantly, our system is clearly described, modular in nature, and
easily reproducible. We explored various pre-trained Convolutional Neural Networks
(CNNs) as off-the-shelf feature extractors, both stand-alone and fused with others,
on a large corpus of unconstrained, real-world video data. Frame-level features were
fused to form video-level descriptors by means of both early and late fusion.
Several insights emerged on using pre-trained CNNs as off-the-shelf feature
extractors for event detection. Fusing SVMs trained on different CNN features showed
some combinations to be complementary, and no single CNN works best for all events:
some events are more object-driven while others are more scene-based. Our top
performance resulted from learning event-dependent weights for the different CNNs.
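The event-dependent weighting mentioned above can be sketched as a per-event grid search over convex combinations of the per-CNN SVM scores on a validation split. This is a hedged illustration only, not the authors' exact procedure; the function name, the toy scores, and the labels below are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def event_weight(scores_a, scores_b, labels, grid=np.linspace(0, 1, 21)):
    """Pick the weight w maximizing AP of w*scores_a + (1 - w)*scores_b."""
    aps = [average_precision_score(labels, w * scores_a + (1 - w) * scores_b)
           for w in grid]
    return grid[int(np.argmax(aps))]

# Toy stand-ins: validation-split SVM scores from two CNNs for one event.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=50)              # binary event labels
scores_obj = labels + 0.8 * rng.normal(size=50)   # object-CNN SVM scores (toy)
scores_scn = labels + 1.5 * rng.normal(size=50)   # scene-CNN SVM scores (toy)
w = event_weight(scores_obj, scores_scn, labels)  # event-dependent weight
```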
[Fig 4 graphic: (a) video frames sampled from the input → (b) pre-trained CNN (the VGG-16 stack: 2×conv64, maxpool, 2×conv128, maxpool, 2×conv256, maxpool, 2×conv512, maxpool, 2×conv512, maxpool, FC-4096, FC-4096, FC-1000, softmax) → (c) frame features → (d) average pooling into a video feature → (e) one-vs-rest SVMs]
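Since the VGG-16 stack in the graphic above serves as the off-the-shelf extractor, a minimal sketch of pulling fc7/fc8 activations from a pre-trained VGG-16 follows. The poster used Caffe models; torchvision's VGG-16 is a stand-in here, and the preprocessing values are torchvision's ImageNet defaults, not necessarily the poster's.

```python
import torch
from torchvision import models, transforms

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

# Standard ImageNet preprocessing (torchvision defaults, assumed here).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc7_fc8(pil_image):
    """Return the 4,096-D fc7 and 1,000-D fc8 (pre-softmax) activations."""
    x = preprocess(pil_image).unsqueeze(0)   # 1 x 3 x 224 x 224
    x = vgg16.features(x)                    # conv/maxpool stack in the figure
    x = vgg16.avgpool(x).flatten(1)          # 1 x 25,088
    fc7 = vgg16.classifier[:5](x)            # through the second FC-4096 + ReLU
    fc8 = vgg16.classifier[5:](fc7)          # final FC-1000 layer
    return fc7.squeeze(0), fc8.squeeze(0)
```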
Fig 1 A single system fusing across modalities: multi-modal processing exploits subtle connections between modalities, enabling deeper inference.
Fig 3 List of events included in MED15 (a). Each comes with an event kit, e.g., event E021 (b), consisting of a text
description (top) and video exemplars (bottom). Note the ambiguity: intra-class variation is broad in scope. The
evidential description explicitly lists relevant scenes, objects, and activities; a listed entity may or may not be present in a given video.
E021
Event name: Attempting a bike trick
Definition: One or more people attempt to do a trick on a bicycle, motorcycle, or other type of motorized bike. To count as a
bike for the purposes of this event, the vehicle must have two wheels (excluding unicycles, ATVs, etc.).
Explication: Bikes are normally ridden with a person sitting on the seat, holding onto the handlebars, and steering with
their hands. Tricks consist of difficult ways of riding the bike, such as on one wheel, steering with the feet, or standing on the seat,
or of intentional motions made with the bike that are not simply slowing down/stopping the bike, propelling it forward, or steering
it as it moves. Steering around obstacles or steering a bike off of a jump and landing on the ground are generally not
considered tricks in and of themselves; however, if the jump is set up so that the person is jumping over something (e.g.,
people, vehicles, or a river), or if the person does a flip or other trick in the air, that would be considered a
trick.
Evidential description:
Scene: outside, often in a skate park, parking lot, or street
Objects/People: person riding a bike, bike, ramps, helmet, concrete floor, audience
Activities: riding the bike on one wheel, standing on top of the bike, jumping with the bike (especially over or onto objects),
spinning or flipping the bike
Audio: sounds of the bike hitting the surface during the trick, audience cheering
Results
• All experiments were conducted using the pipeline shown in Fig 4.
• Table 1 lists the description and MAP score of each run, revealing the following:
o The VGG-16 fc8 features (Run 2) outperform the fc7 features (Run 1).
o Modeling object distributions (Run 2) discriminates better than modeling scene classes (Run 3).
o Fusing object and scene features early (Run 6) yields the best performance.
o The Hybrid-CNN (Run 4) does not compare well to the two separate CNN models.
• Table 2 displays the class labels with maximum SVM weights for four extreme cases: two events where
object features did relatively well and two where they did not.
Motivation
• Multimedia analysis typically demands experts in each modality [see Fig 1].
• Internet archives continue to see a growing pool of accessible videos that demand
better technology for automatic tagging, search, and retrieval [see Fig 2]:
o YouTube has more than 1 billion users.
o Every day, people watch hundreds of millions of hours of video on YouTube and generate billions of views.
o 300 hours of video are uploaded to YouTube every minute.
• NEEDS: an efficient way to sort, search, and retrieve this huge amount of video based on CONTENT.
Fig 2 YouTube usage report for 2015.
• The MED task is supported by TRECVid in the form of an annual workshop, which provides the data
and metrics needed for a laboratory-style evaluation [see Fig 3].
Fig 4 The workflow of our MED system. Given an input video, frames are first sampled at approximately one fps (a). Sampled frames are then fed forward through pre-trained CNN models (b),
yielding a set of frame-level feature vectors (c). Next, a video-level descriptor is formed by average-pooling the frame-level features (d). Finally, the video descriptor is projected to a
non-linear space via an RBF kernel and passed to the one-vs-rest SVM models (e).
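A compact sketch of steps (a)-(e) under stated assumptions: OpenCV for the ~1-fps sampling, a caller-supplied `extract_feat` for the frame-level CNN features (e.g., the fc8 sketch above), numpy average pooling, and scikit-learn's one-vs-rest RBF SVMs. Names and parameter choices are illustrative, not the poster's exact settings.

```python
import cv2
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def video_descriptor(path, extract_feat):
    """Steps (a)-(d): sample ~1 fps, extract frame features, average-pool."""
    cap = cv2.VideoCapture(path)
    step = int(round(cap.get(cv2.CAP_PROP_FPS) or 30))  # frames per ~1 second
    feats, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:                      # (a) ~1-fps sampling
            feats.append(extract_feat(frame))  # (b)-(c) frame-level CNN feature
        i += 1
    cap.release()
    return np.mean(feats, axis=0)              # (d) average pooling

# (e) One-vs-rest SVMs with an RBF kernel over video-level descriptors.
# X: n_videos x D stack of descriptors; y: integer event labels (hypothetical).
# clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
# scores = clf.decision_function(X_test)
```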
Top Activation | 2nd Activation | 3rd Activation
E021: Unicycle (vehicle with one wheel, driven by pedals) | Tricycle (vehicle with three wheels, driven by pedals) | All-Terrain Bike (bike with a sturdy frame and fat tires, designed for mountainous country)
E031: Bee House (shed containing a number of beehives) | Honeycomb (tiny hex cells of beeswax used by bees to store honey and larvae) | Poncho (blanket-like cloak with a hole centered for the head)
E028: Theater Curtain (cloth at the front of the stage; opens at the start and closes at a break/the end) | Home Theater (TV and video equipment providing a theater movie experience at home) | Volleyball (an inflated ball used in playing volleyball)
E033: All-Terrain Bike (bike with a sturdy frame and fat tires, designed for mountainous country) | Disc Brake (brake that applies friction to a spinning disk via brake pads) | Tandem Bike (bike with two sets of pedals and two seats)
Table 2 Class labels with maximum SVM weights. E021 and E031 scored highest using VGG-16 (object) features (top two
rows); VGG-16 did not perform well on E028 and E033 (bottom two rows).
Fig 5 Frames nearest the maximum SVM weights: TP (green boxes)
and FP (red boxes), for E021 (a), E031 (b), E028 (c), and E033 (d).
Fig 6 Notional capability: an example interface to facilitate multimodal search and retrieval.
Conclusions/Future Work
• Incorporate multimodal capability for cross-media search [see Fig 6].
• Integrate temporal information into the models.
• Train a set of object classes that is event-specific.
Table 1 List of runs with descriptions and resulting MAP scores.
Run    Description                                                                            MAP (%)
Run 1  fc7 features, VGG-16 model (4,096-D)                                                   12.4
Run 2  fc8 features, VGG-16 model (1,000-D)                                                   13.3
Run 3  fc8 features, Places205 model (205-D)                                                   8.4
Run 4  fc8 features, Hybrid-CNN model (1,186-D)                                               11.7
Run 5  fc8 features, VGG-16 (1,000-D) + Places205 (205-D); averaged SVM scores (late fusion)  11.8
Run 6  fc8 features, VGG-16 + Places205 concatenated (1,250-D; early fusion)                  16.0
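To make Runs 5 and 6 concrete, a small sketch of the two fusion schemes: early fusion concatenates the 1,000-D VGG-16 and 205-D Places205 fc8 descriptors before a single SVM, while late fusion averages the decision scores of per-modality SVMs. All arrays below are toy stand-ins, not the poster's data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins: 8 videos, fc8 descriptors from each model, binary labels.
rng = np.random.default_rng(0)
obj = rng.normal(size=(8, 1000))   # VGG-16 fc8 (object) descriptors
scn = rng.normal(size=(8, 205))    # Places205 fc8 (scene) descriptors
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Run 6 (early fusion): concatenate into a 1,250-D descriptor, one SVM.
svm_early = SVC(kernel="rbf").fit(np.concatenate([obj, scn], axis=1), y)

# Run 5 (late fusion): one SVM per modality; average their decision scores.
svm_obj = SVC(kernel="rbf").fit(obj, y)
svm_scn = SVC(kernel="rbf").fit(scn, y)
late = 0.5 * (svm_obj.decision_function(obj) + svm_scn.decision_function(scn))
```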
This material is based upon work supported by the U.S. Department of Homeland Security, Science and Technology Directorate, Office of University Programs, under Grant Award 2013-ST-061-ED0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security. [10/2013]
