Breaking Through the Challenges of
Scalable Deep Learning for
Video Analytics
Steven Flores, sflores@compthree.com
Luke Hosking, lhosking@compthree.com
Use cases
A typical customer has a large amount of unannotated video whose content they
want annotated and indexed into a searchable database. For example,
โ— Media: video library going back decades.
โ— Research institutions: video from a lecture series.
โ— Management and HR: conference/meetings notes.
What info do we want from video?
โ— What and who is in the video?
โ— What happens in the video?
โ— What is the video about?
(Example here: https://www.youtube.com/watch?v=X3a-ZX6ObJU)
Information from audio
โ— Topic modeling speech transcripts.
โ— Sentiment analysis of speech transcripts.
โ— Hot language and/or loud sounds heat map.
โ— Keywords (named entities) from transcripts. The Federal Reserve is widely expected to
increase interest rates again Wednesday...
Politics and policy
Sports
Science and
Technology
Using keywords to extract info
Within transcripts, keywords such as people, locations, organizations, and
geo-political entities carry much of the latent information we seek from a video.
For example, a video transcript containing the excerpt
...probably confirm the North Korean side in its willingness...
should appear if we search for the term "North Korea." Also, the presence of
this term, along with other keywords, may support a topic assignment.
Keyword extraction
Keyword extraction can be a difficult problem. Free extractors always come
with their own rigid taxonomy and may not be production quality.
For example, with the Python Natural Language Toolkit (NLTK), the excerpt
...probably confirm the North Korean side in its willingness...
comes back tagged with labels such as "Geo-socio-political group" and "Geo-political entity."
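For reference, a minimal sketch of the NLTK step above; this is stock NLTK named-entity
chunking (one-time data downloads noted in comments), with the excerpt from the slide as input.

```python
# Minimal NLTK named-entity extraction, as described above.
# One-time setup: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("maxent_ne_chunker"), nltk.download("words")
import nltk

sentence = "...probably confirm the North Korean side in its willingness..."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Collect (entity text, label) pairs; the labels come from NLTK's own
# fixed taxonomy (GPE, GSP, PERSON, ORGANIZATION, ...), not ours.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(" ".join(word for word, tag in subtree.leaves()), subtree.label())
```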
Using a human-curated whitelist
We maintain a "whitelist" of extracted keywords. This solves two problems:
● Quality-control supervision of proposed keywords.
● Better custom keyword taxonomies are assigned to keywords on the list.
NLTK finds "North Korean" in the text, and we find it in the whitelist with its tag:
...probably confirm the North Korean side in its willingness... → Ethnicity
But we have two more problems (a lookup sketch follows this list):
● Human supervision is time-consuming (prohibitively so with a large list).
● This doesn't solve the case of a keyword phrase incorrectly split by NLTK.
Building a custom keyword extractor
The article Natural Language Processing (almost) from Scratch (R. Collobert et
al. 2011) introduces the "senna" named entity (keyword) extractor:
● A two-layer fully connected neural network.
● For each word, the input is its surrounding "context" words in the text.
● Input context words are mapped to 50-dim vectors in a word2vec model.
(Slide figure: the words "cat sat on the mat" labeled with the IOBES tag set: I, O, E, B, S.)
The senna architecture
Natural Language Processing (almost) from Scratch (R. Collobert et al. 2011)
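As a rough sketch (not the authors' code), the shape of such a tagger in TensorFlow/Keras:
embed the context window, concatenate, and classify into the five IOBES tags with two fully
connected layers. All sizes and the optimizer are illustrative; in practice the embedding
layer would be initialized from the 50-dim word2vec vectors.

```python
# A senna-style window tagger: context word ids -> IOBES tag distribution.
import tensorflow as tf

VOCAB, EMB_DIM, WINDOW, HIDDEN, TAGS = 50_000, 50, 5, 300, 5  # 5 = I,O,B,E,S

inputs = tf.keras.Input(shape=(WINDOW,), dtype="int32")   # context word ids
x = tf.keras.layers.Embedding(VOCAB, EMB_DIM)(inputs)     # init from word2vec in practice
x = tf.keras.layers.Flatten()(x)                          # concatenate context vectors
x = tf.keras.layers.Dense(HIDDEN, activation="tanh")(x)   # fully connected layer 1
outputs = tf.keras.layers.Dense(TAGS, activation="softmax")(x)  # layer 2: tag scores

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```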
Senna architecture advantages
โ— Results are often better than NLTK, thus requiring less human supervision.
โ— Minimal text preprocessing (for example, no chunking) is required.
โ— Because input is context-based, it may be possible to train a senna
network with automatically generated partially-annotated training data.
โ— With greater ease of generating training data, we can train keyword
extractors that are tailored to customer needs (taxonomy, jargon, etc.).
Sentiment heat maps
Sentiment heat maps indicate areas of potentially high interest in the video.
โ— Based on word sentiment and heated language.
โ— This may not be sufficient. We can also incorporate information from the
audio stream, such as loudness, to indicate areas of interest.
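As a sketch of the transcript side only, one could score time-aligned chunks with an
off-the-shelf analyzer and treat high-magnitude scores as hot regions; NLTK's VADER is used
here as a stand-in, since the deck does not name our analyzer.

```python
# Transcript sentiment "heat map": score each time-windowed chunk.
# One-time setup: nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
chunks = [  # (start_seconds, text) pairs from a time-aligned transcript
    (0, "The Federal Reserve is widely expected to raise rates."),
    (30, "Markets plunged sharply on the shocking news!"),
]
# |compound| near 1 marks strongly charged language in that window.
heat = [(start, abs(sia.polarity_scores(text)["compound"])) for start, text in chunks]
print(heat)
```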
Challenges and future work
Keyword extraction:
โ— Adapting the senna model for in-house custom keyword extractors.
โ— Improving keyword extraction for โ€œmessyโ€ spoken-language transcripts.
โ— How to quickly create training data for customer-dependent taxonomies?
Topic modeling:
โ— Supervised for customer-dependent topics?
โ— Unsupervised if the user wants to discover unknown information?
โ— How to do good topic modeling for โ€œmessyโ€ spoken-language transcripts?
Information from video
โ— Object detection
โ— Face recognition
โ— Scene recognition
Object detection
Performing object detection on frames tells you what objects appear in a video.
We use various pre-trained models from the TensorFlow detection model zoo.
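A hedged sketch of running one such pre-trained detector on a single frame; the SavedModel
path and the frame are placeholders, and the output keys follow the zoo's standard detection
signature.

```python
# Run a pre-trained detection-zoo model on one video frame.
import numpy as np
import tensorflow as tf

detector = tf.saved_model.load("ssd_resnet50_saved_model")  # path to a zoo model
frame = np.zeros((720, 1280, 3), dtype=np.uint8)            # stand-in video frame

# Zoo models take a batched uint8 image and return a dict of detections.
detections = detector(tf.expand_dims(frame, axis=0))
boxes = detections["detection_boxes"][0].numpy()    # normalized [y1, x1, y2, x2]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy()
for box, score, cls in zip(boxes, scores, classes):
    if score > 0.5:                                 # keep confident detections
        print(int(cls), float(score), box)
```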
Challenges with object detection
Freely-available object detection models based on ResNet and Inception
architectures are production quality. Nonetheless, there are some challenges:
โ— What objects do we want to detect? Is this customer dependent?
โ— How to we create enough training data to build custom models quickly?
Scene recognition
We train a wide-ResNet model (S. Zagoruyko et al. 2016) to recognize scenes.
We train the network using the Places365 dataset with consolidated scene
categories (for example, not distinguishing stores based on their interiors).
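A minimal inference sketch, assuming the trained network has been exported as a Keras model
and that a text file lists the consolidated category names (both filenames are hypothetical).

```python
# Scene inference with a trained wide-ResNet classifier.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("wide_resnet_places365.h5")  # hypothetical export
with open("consolidated_scene_labels.txt") as f:                # hypothetical label list
    labels = [line.strip() for line in f]

frame = np.zeros((1, 224, 224, 3), dtype=np.float32)  # one preprocessed frame
probs = model.predict(frame)[0]
print(labels[int(np.argmax(probs))])                  # top-1 scene category
```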
Face recognition
A face recognition model requires millions of faces for training and comprises
many steps: face detection, cropping and re-scaling, and classification.
Training such a model from scratch is very time-consuming. However, near-
state-of-the-art models are freely available. We are using dlib face recognition.
Face embeddings
Rather than simply recognize faces from a small list of people, most face
recognition models are trained to give good face-to-vector embeddings.
The model user then provides a list of images of faces to recognize, the model
maps the faces to vectors, and query faces are identified via k-NN search.
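A sketch of this enrollment-plus-query flow with dlib; the two weight files are dlib's
published models, the image filenames are placeholders, and the brute-force nearest neighbor
stands in for a real k-NN index.

```python
# dlib face embeddings with a brute-force nearest-neighbor gallery lookup.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
shaper = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def embed(path):
    """Detect the first face in an image and map it to a 128-d vector."""
    img = dlib.load_rgb_image(path)
    face = detector(img)[0]            # assumes at least one face per image
    return np.array(encoder.compute_face_descriptor(img, shaper(img, face)))

# Enroll a toy gallery, then identify a query face from a video frame.
gallery = {name: embed(f"{name}.jpg") for name in ["person_a", "person_b"]}
query = embed("frame_0042.jpg")
best = min(gallery, key=lambda name: np.linalg.norm(gallery[name] - query))
print(best)  # dlib suggests ~0.6 Euclidean distance as a same-person threshold
```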
Who should we recognize?
What faces should we recognize? The answer may be customer dependent:
In generic situations, we should recognize people who are "famous enough"
(well-known politicians, celebrities, artists, scientists, thought leaders, etc.).
What constitutes famous enough? How do we make a list of their names?
Given the list of names, how do we get enough pictures of their faces?
(Slide photos: Steven Flores, Engineer, Comp Three; Luke Hosking, Engineer, Comp Three.)
Famous enough?
Our criterion for "famous enough" is partly set by our need to get a list of names
of such famous people: famous = has a Wikipedia biography with a birthday.
We can easily pull this list of famous people from the Wikidata API. We record
each person's name, birthday, occupation(s), and Wikipedia page address.
Brad Pitt is in... Rich Skrenta is out (no birthday on Wikipedia).
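One way to pull such a list is Wikidata's SPARQL endpoint (a sketch of the idea; our
extraction code may differ): select humans with a date of birth and an English Wikipedia
article.

```python
# Query Wikidata for humans with a birthday and an English Wikipedia page.
import requests

query = """
SELECT ?personLabel ?birth ?article WHERE {
  ?person wdt:P31 wd:Q5;        # instance of: human
          wdt:P569 ?birth.      # date of birth (our "famous enough" filter)
  ?article schema:about ?person;
           schema:isPartOf <https://en.wikipedia.org/>.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "augi-demo/0.1"},  # Wikidata asks for a UA string
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["birth"]["value"])
# Occupation(s) could be added via wdt:P106 and its label.
```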
The gallery problem
Many state-of-the-art facial recognition systems are still not good at picking the
correct face from a large gallery of faces. They generate many false positives.
The rank-1 accuracy decreases as the gallery "distractor" face count increases (The MegaFace
Benchmark: 1 Million Faces for Recognition at Scale, I. Kemelmacher-Shlizerman et al. 2015).
A potential solution...
Given some faces each with a list of candidate names, use other information
(topic modeling, co-occurrence frequency) to find optimal name assignments:
On the left, Idina Menzel is correctly tagged. On the right, Amy Grant is wrongly
tagged "Fanny Cadeo"; her name is the second choice based on the image.
Use the fact that both are musicians to correct the second tag to "Amy Grant."
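A toy sketch of that re-ranking idea: each detected face carries ranked name candidates with
face-match scores, and a candidate gets a bonus when its occupation matches occupations
already assigned elsewhere in the video. Weights and data are purely illustrative.

```python
# Re-rank face-name candidates using occupation co-occurrence.
def rerank(candidates, context_occupations, occupation_of, bonus=0.3):
    """candidates: list of (name, face_score), best face match first."""
    def score(item):
        name, face_score = item
        boost = bonus if occupation_of.get(name) in context_occupations else 0.0
        return face_score + boost
    return max(candidates, key=score)

occupation_of = {"Fanny Cadeo": "actress", "Amy Grant": "musician"}
# Idina Menzel (a musician) was already tagged elsewhere in the video:
print(rerank([("Fanny Cadeo", 0.62), ("Amy Grant", 0.58)],
             context_occupations={"musician"}, occupation_of=occupation_of))
# -> ('Amy Grant', 0.58): the co-occurrence bonus overturns the face score
```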
Processing time considerations
โ— Estimated size of a โ€œlargeโ€ video cache: 40,000
โ— Number of frames in a typical 30 second video: 750
โ— Average video frame processing time (GTX 1080 GPU): about 1 second
โ†’ Estimated time to process the entire video cache: almost one year...
The long time to process this hypothetical video cache is way too long!
Solution: only sample video keyframes (frames at shot changes or high-action
moments). These may contain most of the relevant information. For example,
โ— https://www.youtube.com/watch?v=_7WZ74F3j_I: 2650 frames
โ— Number of โ€œirregularly spacedโ€ keyframes processed: 10 keyframes
Challenges and future work
Object detection and scene recognition:
โ— What do we want to detect? (Customer-dependent?)
โ— How to we generate enough training data quickly and efficiently?
โ— What benchmarks do we need to hit for production quality?
Face recognition:
โ— Who can we / do we want to detect? (Customer-dependent?)
โ— How can we use other information to improve face-to-name assignments?
โ— What benchmarks do we need to hit for production quality?
Scalability:
โ— How can we speed up the wait time for image evaluation?
โ— What tradeoffs must we make to minimize video processing time?
โ— What can we trim without compromising performance benchmarks?
Augi Demo
(Architecture diagram: a DigitalOcean instance runs a Docker host containing the Augi
real-time components. Nginx on port 80 serves index.html and bundle.js; the Augi backend
listens on port 5000, the Text Annotator on port 5001, and the Image Service on port 5002;
Elasticsearch listens on port 9200; the video object store lives on the file system under
/videos/.)
Real-time Technologies
Frontend
โ— React
โ— Apollo
โ— ChartJS
Backend
โ— Flask
โ— Graphene
โ— Elasticsearch Client
Microservices
Augi Preprocessing Pipeline
(Pipeline diagram: Python code loops over videos (LoopOverVideos); each video passes through
video frame sampling, audio extraction, transcript extraction, image classification (Classify
Image), and text annotation; a consolidation step (DataConsolidation) merges the results, and
ESDocumentInsert writes the enriched document to Elasticsearch, with the video store kept on
the file system.)
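A sketch of the consolidation-and-insert step at the end of the pipeline, assuming the
official Elasticsearch Python client; the index name and document layout are illustrative,
not the actual Augi schema.

```python
# Consolidate one video's annotations and insert them into Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "title": "Fed expected to raise rates",
    "transcript": "The Federal Reserve is widely expected to...",
    "keywords": [{"text": "Federal Reserve", "tag": "Organization"}],
    "scenes": ["newsroom"],
    "objects": ["person", "tie"],
    "faces": ["Janet Yellen"],
}
# (Older client versions take body= instead of document=.)
es.index(index="videos", id="video_0001", document=doc)
```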
Preprocessing Technologies
โ— Core pipeline
โ—‹ ffmpeg
โ—‹ Google Cloud Speech
โ—‹ Amazon S3
โ—‹ Elasticsearch
โ— Image classification
โ—‹ Tensorflow
โ—‹ dlib
โ—‹ flask
โ— Text annotation
โ—‹ pygtrie
โ—‹ flask
Where the magic happens
Augi Preprocessing Workflow
Python scripts
โ— download videos and video metadata (youtube, proprietary APIs)
โ— manage overall process for list of videos to be enriched
Docker
โ— text Annotator
โ— image Classifier
Modular architecture
โ— file system based cache
โ— orchestration with override flags
Challenges
Iterative development over tens to hundreds of thousands of videos
A file-system-based cache of the data produced by each preprocessing step,
along with granular overrides for each preprocessing method, allows for targeted
testing and implementation.
On-prem challenge: no internet access
We needed the architecture to be usable on-prem for clients that require data
security (confidential/healthcare sectors). The only external services currently used
are Google Cloud Speech and AWS S3; on-prem, local disk storage and products
like Nuance Dragon could take their place.
Questions?
