Breaking Through the Challenges of
Scalable Deep Learning for
Video Analytics
Steven Flores, sflores@compthree.com
Luke Hosking, lhosking@compthree.com
Use cases
A typical customer has a large amount of unannotated video whose content they
want annotated and indexed into a searchable database. For example,
โ— Media: video library going back decades.
โ— Research institutions: video from a lecture series.
โ— Management and HR: conference/meetings notes.
What info do we want from video?
โ— What and who is in the video?
โ— What happens in the video?
โ— What is the video about?
(Example here: https://www.youtube.com/watch?v=X3a-ZX6ObJU)
Information from audio
โ— Topic modeling speech transcripts.
โ— Sentiment analysis of speech transcripts.
โ— Hot language and/or loud sounds heat map.
โ— Keywords (named entities) from transcripts. The Federal Reserve is widely expected to
increase interest rates again Wednesday...
Politics and policy
Sports
Science and
Technology
Using keywords to extract info
Within transcripts, keywords such as people, locations, organizations, and
geo-political entities carry much of the latent information we seek from a video.
For example, a video transcript containing the excerpt
...probably confirm the North Korean side in its willingness...
should appear if we search for the term "North Korea." Also, the presence of
this term, along with other keywords, may support a topic assignment.
Keyword extraction
Keyword extraction can be a difficult problem. Free extractors always come
with their own rigid taxonomy and may not be production quality.
For example, with the Python Natural Language Toolkit (NLTK), the excerpt
...probably confirm the North Korean side in its willingness...
comes back tagged with labels such as "Geo-socio-political group" and "Geo-political entity."
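For reference, a minimal sketch of the NLTK step above; this is stock NLTK named-entity
chunking (one-time data downloads noted in comments), with the excerpt from the slide as input.

```python
# Minimal NLTK named-entity extraction, as described above.
# One-time setup: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("maxent_ne_chunker"), nltk.download("words")
import nltk

sentence = "...probably confirm the North Korean side in its willingness..."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Collect (entity text, label) pairs; the labels come from NLTK's own
# fixed taxonomy (GPE, GSP, PERSON, ORGANIZATION, ...), not ours.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(" ".join(word for word, tag in subtree.leaves()), subtree.label())
```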
Using a human-curated whitelist
We maintain a "whitelist" of extracted keywords. This solves two problems:
● Quality-control supervision of proposed keywords.
● Better custom keyword taxonomies are assigned to keywords on the list.
NLTK finds "North Korean" in the text, and we find it in the whitelist with its tag:
...probably confirm the North Korean side in its willingness... → Ethnicity
But we have two more problems (a lookup sketch follows this list):
● Human supervision is time-consuming (prohibitively so with a large list).
● This doesn't solve the case of a keyword phrase incorrectly split by NLTK.
Building a custom keyword extractor
The article Natural Language Processing (almost) from Scratch (R. Collobert et
al. 2011) introduces the "senna" named entity (keyword) extractor:
● A two-layer fully connected neural network.
● For each word, the input is its surrounding "context" words in the text.
● Input context words are mapped to 50-dim vectors in a word2vec model.
(Slide figure: the words "cat sat on the mat" labeled with the IOBES tag set: I, O, E, B, S.)
The senna architecture
Natural Language Processing (almost) from Scratch (R. Collobert et al. 2011)
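As a rough sketch (not the authors' code), the shape of such a tagger in TensorFlow/Keras:
embed the context window, concatenate, and classify into the five IOBES tags with two fully
connected layers. All sizes and the optimizer are illustrative; in practice the embedding
layer would be initialized from the 50-dim word2vec vectors.

```python
# A senna-style window tagger: context word ids -> IOBES tag distribution.
import tensorflow as tf

VOCAB, EMB_DIM, WINDOW, HIDDEN, TAGS = 50_000, 50, 5, 300, 5  # 5 = I,O,B,E,S

inputs = tf.keras.Input(shape=(WINDOW,), dtype="int32")   # context word ids
x = tf.keras.layers.Embedding(VOCAB, EMB_DIM)(inputs)     # init from word2vec in practice
x = tf.keras.layers.Flatten()(x)                          # concatenate context vectors
x = tf.keras.layers.Dense(HIDDEN, activation="tanh")(x)   # fully connected layer 1
outputs = tf.keras.layers.Dense(TAGS, activation="softmax")(x)  # layer 2: tag scores

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```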
Senna architecture advantages
โ— Results are often better than NLTK, thus requiring less human supervision.
โ— Minimal text preprocessing (for example, no chunking) is required.
โ— Because input is context-based, it may be possible to train a senna
network with automatically generated partially-annotated training data.
โ— With greater ease of generating training data, we can train keyword
extractors that are tailored to customer needs (taxonomy, jargon, etc.).
Sentiment heat maps
Sentiment heat maps indicate areas of potentially high interest in the video.
โ— Based on word sentiment and heated language.
โ— This may not be sufficient. We can also incorporate information from the
audio stream, such as loudness, to indicate areas of interest.
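As a sketch of the transcript side only, one could score time-aligned chunks with an
off-the-shelf analyzer and treat high-magnitude scores as hot regions; NLTK's VADER is used
here as a stand-in, since the deck does not name our analyzer.

```python
# Transcript sentiment "heat map": score each time-windowed chunk.
# One-time setup: nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
chunks = [  # (start_seconds, text) pairs from a time-aligned transcript
    (0, "The Federal Reserve is widely expected to raise rates."),
    (30, "Markets plunged sharply on the shocking news!"),
]
# |compound| near 1 marks strongly charged language in that window.
heat = [(start, abs(sia.polarity_scores(text)["compound"])) for start, text in chunks]
print(heat)
```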
Challenges and future work
Keyword extraction:
โ— Adapting the senna model for in-house custom keyword extractors.
โ— Improving keyword extraction for โ€œmessyโ€ spoken-language transcripts.
โ— How to quickly create training data for customer-dependent taxonomies?
Topic modeling:
โ— Supervised for customer-dependent topics?
โ— Unsupervised if the user wants to discover unknown information?
โ— How to do good topic modeling for โ€œmessyโ€ spoken-language transcripts?
Information from video
โ— Object detection
โ— Face recognition
โ— Scene recognition
Object detection
Performing object detection on frames tells you what objects appear in a video.
We use various pre-trained models from the TensorFlow detection model zoo.
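A hedged sketch of running one such pre-trained detector on a single frame; the SavedModel
path and the frame are placeholders, and the output keys follow the zoo's standard detection
signature.

```python
# Run a pre-trained detection-zoo model on one video frame.
import numpy as np
import tensorflow as tf

detector = tf.saved_model.load("ssd_resnet50_saved_model")  # path to a zoo model
frame = np.zeros((720, 1280, 3), dtype=np.uint8)            # stand-in video frame

# Zoo models take a batched uint8 image and return a dict of detections.
detections = detector(tf.expand_dims(frame, axis=0))
boxes = detections["detection_boxes"][0].numpy()    # normalized [y1, x1, y2, x2]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy()
for box, score, cls in zip(boxes, scores, classes):
    if score > 0.5:                                 # keep confident detections
        print(int(cls), float(score), box)
```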
Challenges with object detection
Freely-available object detection models based on ResNet and Inception
architectures are production quality. Nonetheless, there are some challenges:
โ— What objects do we want to detect? Is this customer dependent?
โ— How to we create enough training data to build custom models quickly?
Scene recognition
We train a wide-ResNet model (S. Zagoruyko et al. 2016) to recognize scenes.
We train the network using the Places365 dataset with consolidated scene
categories (for example, not distinguishing stores based on their interiors).
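A minimal inference sketch, assuming the trained network has been exported as a Keras model
and that a text file lists the consolidated category names (both filenames are hypothetical).

```python
# Scene inference with a trained wide-ResNet classifier.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("wide_resnet_places365.h5")  # hypothetical export
with open("consolidated_scene_labels.txt") as f:                # hypothetical label list
    labels = [line.strip() for line in f]

frame = np.zeros((1, 224, 224, 3), dtype=np.float32)  # one preprocessed frame
probs = model.predict(frame)[0]
print(labels[int(np.argmax(probs))])                  # top-1 scene category
```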
Face recognition
A face recognition model requires millions of faces for training and comprises
many steps: face detection, cropping and re-scaling, and classification.
Training such a model from scratch is very time-consuming. However, near-
state-of-the-art models are freely available. We are using dlib face recognition.
Face embeddings
Rather than simply recognize faces from a small list of people, most face
recognition models are trained to give good face-to-vector embeddings.
The model user then provides a list of images of faces to recognize, the model
maps the faces to vectors, and query faces are identified via k-NN search.
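A sketch of this enrollment-plus-query flow with dlib; the two weight files are dlib's
published models, the image filenames are placeholders, and the brute-force nearest neighbor
stands in for a real k-NN index.

```python
# dlib face embeddings with a brute-force nearest-neighbor gallery lookup.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
shaper = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def embed(path):
    """Detect the first face in an image and map it to a 128-d vector."""
    img = dlib.load_rgb_image(path)
    face = detector(img)[0]            # assumes at least one face per image
    return np.array(encoder.compute_face_descriptor(img, shaper(img, face)))

# Enroll a toy gallery, then identify a query face from a video frame.
gallery = {name: embed(f"{name}.jpg") for name in ["person_a", "person_b"]}
query = embed("frame_0042.jpg")
best = min(gallery, key=lambda name: np.linalg.norm(gallery[name] - query))
print(best)  # dlib suggests ~0.6 Euclidean distance as a same-person threshold
```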
Who should we recognize?
What faces should we recognize? The answer may be customer dependent:
In generic situations, we should recognize people who are "famous enough"
(well-known politicians, celebrities, artists, scientists, thought leaders, etc.).
What constitutes famous enough? How do we make a list of their names?
Given the list of names, how do we get enough pictures of their faces?
(Slide photos: Steven Flores, Engineer, Comp Three; Luke Hosking, Engineer, Comp Three.)
Famous enough?
Our criterion for "famous enough" is partly set by our need to get a list of names
of such famous people: famous = has a Wikipedia biography with a birthday.
We can easily pull this list of famous people from the Wikidata API. We record
each person's name, birthday, occupation(s), and Wikipedia page address.
Brad Pitt is in... Rich Skrenta is out (no birthday on Wikipedia).
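One way to pull such a list is Wikidata's SPARQL endpoint (a sketch of the idea; our
extraction code may differ): select humans with a date of birth and an English Wikipedia
article.

```python
# Query Wikidata for humans with a birthday and an English Wikipedia page.
import requests

query = """
SELECT ?personLabel ?birth ?article WHERE {
  ?person wdt:P31 wd:Q5;        # instance of: human
          wdt:P569 ?birth.      # date of birth (our "famous enough" filter)
  ?article schema:about ?person;
           schema:isPartOf <https://en.wikipedia.org/>.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "augi-demo/0.1"},  # Wikidata asks for a UA string
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["birth"]["value"])
# Occupation(s) could be added via wdt:P106 and its label.
```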
The gallery problem
Many state-of-the-art facial recognition systems are still not good at picking the
correct face from a large gallery of faces. They generate many false positives.
The rank-1 accuracy decreases as the gallery "distractor" face count increases (The MegaFace
Benchmark: 1 Million Faces for Recognition at Scale, I. Kemelmacher-Shlizerman et al. 2015).
A potential solution...
Given some faces each with a list of candidate names, use other information
(topic modeling, co-occurrence frequency) to find optimal name assignments:
On the left, Idina Menzel is correctly tagged. On the right, Amy Grant is wrongly
tagged "Fanny Cadeo"; her name is the second choice based on the image.
Use the fact that both are musicians to correct the second tag to "Amy Grant."
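A toy sketch of that re-ranking idea: each detected face carries ranked name candidates with
face-match scores, and a candidate gets a bonus when its occupation matches occupations
already assigned elsewhere in the video. Weights and data are purely illustrative.

```python
# Re-rank face-name candidates using occupation co-occurrence.
def rerank(candidates, context_occupations, occupation_of, bonus=0.3):
    """candidates: list of (name, face_score), best face match first."""
    def score(item):
        name, face_score = item
        boost = bonus if occupation_of.get(name) in context_occupations else 0.0
        return face_score + boost
    return max(candidates, key=score)

occupation_of = {"Fanny Cadeo": "actress", "Amy Grant": "musician"}
# Idina Menzel (a musician) was already tagged elsewhere in the video:
print(rerank([("Fanny Cadeo", 0.62), ("Amy Grant", 0.58)],
             context_occupations={"musician"}, occupation_of=occupation_of))
# -> ('Amy Grant', 0.58): the co-occurrence bonus overturns the face score
```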
Processing time considerations
โ— Estimated size of a โ€œlargeโ€ video cache: 40,000
โ— Number of frames in a typical 30 second video: 750
โ— Average video frame processing time (GTX 1080 GPU): about 1 second
โ†’ Estimated time to process the entire video cache: almost one year...
The long time to process this hypothetical video cache is way too long!
Solution: only sample video keyframes (frames at shot changes or high-action
moments). These may contain most of the relevant information. For example,
โ— https://www.youtube.com/watch?v=_7WZ74F3j_I: 2650 frames
โ— Number of โ€œirregularly spacedโ€ keyframes processed: 10 keyframes
Challenges and future work
Object detection and scene recognition:
โ— What do we want to detect? (Customer-dependent?)
โ— How to we generate enough training data quickly and efficiently?
โ— What benchmarks do we need to hit for production quality?
Face recognition:
โ— Who can we / do we want to detect? (Customer-dependent?)
โ— How can we use other information to improve face-to-name assignments?
โ— What benchmarks do we need to hit for production quality?
Scalability:
โ— How can we speed up the wait time for image evaluation?
โ— What tradeoffs must we make to minimize video processing time?
โ— What can we trim without compromising performance benchmarks?
Augi Demo
(Architecture diagram: a DigitalOcean instance runs a Docker host containing the Augi
real-time components. Nginx on port 80 serves index.html and bundle.js; the Augi backend
listens on port 5000, the Text Annotator on port 5001, and the Image Service on port 5002;
Elasticsearch listens on port 9200; the video object store lives on the file system under
/videos/.)
Real-time Technologies
Frontend
โ— React
โ— Apollo
โ— ChartJS
Backend
โ— Flask
โ— Graphene
โ— Elasticsearch Client
Microservices
Augi Preprocessing Pipeline
(Pipeline diagram: Python code loops over videos (LoopOverVideos); each video passes through
video frame sampling, audio extraction, transcript extraction, image classification (Classify
Image), and text annotation; a consolidation step (DataConsolidation) merges the results, and
ESDocumentInsert writes the enriched document to Elasticsearch, with the video store kept on
the file system.)
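A sketch of the consolidation-and-insert step at the end of the pipeline, assuming the
official Elasticsearch Python client; the index name and document layout are illustrative,
not the actual Augi schema.

```python
# Consolidate one video's annotations and insert them into Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "title": "Fed expected to raise rates",
    "transcript": "The Federal Reserve is widely expected to...",
    "keywords": [{"text": "Federal Reserve", "tag": "Organization"}],
    "scenes": ["newsroom"],
    "objects": ["person", "tie"],
    "faces": ["Janet Yellen"],
}
# (Older client versions take body= instead of document=.)
es.index(index="videos", id="video_0001", document=doc)
```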
Preprocessing Technologies
โ— Core pipeline
โ—‹ ffmpeg
โ—‹ Google Cloud Speech
โ—‹ Amazon S3
โ—‹ Elasticsearch
โ— Image classification
โ—‹ Tensorflow
โ—‹ dlib
โ—‹ flask
โ— Text annotation
โ—‹ pygtrie
โ—‹ flask
Where the magic happens
Augi Preprocessing Workflow
Python scripts
โ— download videos and video metadata (youtube, proprietary APIs)
โ— manage overall process for list of videos to be enriched
Docker
โ— text Annotator
โ— image Classifier
Modular architecture
โ— file system based cache
โ— orchestration with override flags
Challenges
Iterative development over tens to hundreds of thousands of videos
A file-system-based cache of the data produced by each preprocessing step,
along with granular overrides for each preprocessing method, allows for targeted
testing and implementation.
On-prem challenge: no internet access
We needed the architecture to be usable on-prem for clients that require data
security (confidential/healthcare sectors). The only external services currently used
are Google Cloud Speech and AWS S3; on-prem, local disk storage and products
like Nuance Dragon could take their place.
Questions?
