Real Time Machine Learning Proposers Day - v3

The document discusses Real Time Machine Learning (RTML) and background on DARPA's UPSIDE program. The UPSIDE program explored using emerging devices like analog CMOS and non-Boolean computation models to improve performance and power efficiency for real-time sensor applications. Selected results from UPSIDE demonstrated significant speed and power improvements over digital approaches. However, challenges remain in transitioning these technologies due to issues like a lack of data for comparisons, high design costs, manufacturing latency and scalability. The RTML program aims to address these challenges to enable embedded computing capabilities with over 1,000x faster processing speeds and 10,000x lower power consumption.


Real Time Machine Learning (RTML)

Andreas Olofsson
Program Manager
DARPA/MTO

Proposers Day

April 2, 2019

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited"


Background

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 2


DARPA UPSIDE program (2012-2018)
Unconventional Processing of Signals for Intelligent Data Exploitation

Objective: Exploit the physics of emerging devices, analog CMOS, and non-Boolean
computational models (low-precision, probabilistic computing algorithms) to achieve new
levels of performance and power for real-time sensor imaging systems. Benchmarked using
object classification and tracking applications.

Approach:

• TA1: Image Application for Benchmarking: Recreate a traditional image processing
pipeline (IPP) using UPSIDE compute models, showing no degradation in performance.

• TA2: MS CMOS Demonstration: Mixed-signal CMOS implementation of the computational
model and system test bed, showing a 1x10^5 combined speed-power improvement for
analog CMOS.

• TA3: Emerging Device Implementation: Image processing demonstration combining
next-generation devices (e.g., graphene memristors, analog floating-gate oscillators)
with the new computation model, for a projected 1x10^7 improvement.

[Figures: detected salient pixels and an extracted library (3x3-pixel features, e.g.,
edges) mapped into image pixels; a 0.9mm x 0.4mm mixed-signal CMOS die with DPU, RAC,
and TC blocks performing analog vector-matrix multiply and pattern matching across 7 nodes]

Goal: Demonstrate the capability and pathway toward embedded computing efficiency in ISR
applications with >1,000x processing speed and >10,000x improvement in power consumption.
Selected UPSIDE results

University of Michigan:
• Mixed signal processing (50 TOPS/W)
• Sparse image reconstruction in memristors
• Numerous publications (Nature, …)

UCSB:
• First memristor-based multilayer perceptron
• Flash-based 55nm analog computing (>10 TOPS/W)
• Numerous publications (Nature, …)

[Figures: (a) floating-gate array micrograph, an 18 μm x 16 μm functional array
surrounded by dummy arrays; (b) a 2-layer MLP neural network mapping an incoming image
through input neurons I1-I10 to output neurons O1-O10]

Key takeaway: Analog computing beats digital on VMMs

Challenges (each motivating RTML):
• Comparing results (lack of data)
• Transition valley of death
• High cost of design
• Manufacturing latency too long
• Manufacturability and scalability
Building a proper baseline for path-finding AI HW research

• Extreme expense of HW development means extremely sparse data
• How can we know if a new result is good without a baseline?
• A compiler would let us “paint the space” of possibilities
• Objective: Better science

Source: ISSCC 2019


Generating “right-sized” HW for SWaP constrained systems

• 10-100X network-level tradeoffs
• Additional micro-tradeoffs (bit-width, pruning, etc.); see the sketch below
• Having more accuracy than needed wastes energy, latency, and power
• A compiler would enable generation of right-sized HW
• Objective: Enable new applications

A. Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications”
Optimizing hardware for ultra-low latency

• Current HW optimized for throughput and programmability
• Extreme expense of HW development means latency of ASICs is unexplored
• Green-field: How low can we go?
• Objective: Enable new applications

Source: NVIDIA
Building bridges

[Diagram: RTML (new!) bridges Application Experts, ML Experts (TensorFlow, PyTorch), and Platform Experts]

Objective: Faster innovation

Source: NVIDIA, Getty Images, Wikipedia
Example of a low-latency application

Source: Qualcomm, 2017
DARPA RTML Program

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 10


DARPA RTML program details

The DARPA RTML program seeks to create no-human-in-the-loop hardware generators and
compilers to enable fully automated creation of ML Application-Specific Integrated
Circuits (ASICs) from high-level source code.


Phase 1: machine learning hardware compiler

• Develop a hardware generator that converts programs expressed in common ML frameworks (such as
TensorFlow or PyTorch) into standard Verilog code and hardware configurations (see the flow sketch below)

• Generate synthesizable Verilog that can be fed into layout generation tools, such as from DARPA IDEA

• Demonstrate a compiler that auto-generates a large catalog of scalable ML hardware instances

• Demonstrate generation of instances for a diversity of architectures
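A hedged sketch of that flow under Phase 1's assumptions: the network is authored in a standard framework and serialized to a portable graph (ONNX, one of the program's listed input formats); the generator call at the end is hypothetical, since building that tool is the Phase 1 deliverable.

```python
# Author a small model in PyTorch and export it to ONNX -- the real, existing half
# of the flow. The final step is the piece RTML would add.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.LazyLinear(10))
example = torch.randn(1, 3, 32, 32)
model(example)                                  # materialize the lazy layer's shape
torch.onnx.export(model, example, "classifier.onnx")

# Hypothetical generator call (the Phase 1 deliverable): ML graph in, synthesizable
# Verilog and hardware configuration out, with no human in the loop.
# rtml_generate("classifier.onnx", target="14nm", out="classifier.v")
```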

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 12


RTML general purpose generator

The RTML generator should support a diversity of ML architectures (two representative classes are sketched after this list). Architectures of interest include:

a) conventional feed-forward (convolutional) neural networks,
b) recurrent networks and their specialized versions,
c) neuroscience-inspired architectures, such as spike-timing-dependent neural nets, including their stochastic counterparts,
d) non-neural ML architectures inspired by psychophysics as well as statistical techniques,
e) classical supervised learning (e.g., regression and decision trees),
f) unsupervised learning (e.g., clustering) approaches,
g) semi-supervised learning methods,
h) generative adversarial learning techniques, and
i) other approaches such as transfer learning, reinforcement learning, manifold learning, and/or life-long learning.
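A minimal sketch, in PyTorch, of two of the classes above that the generator would have to ingest: (a) a small feed-forward convolutional network and (b) a recurrent network. These toy models are ours, for illustration only; the remaining classes (spiking, non-neural, generative adversarial, and so on) stress a compiler in different ways.

```python
import torch
import torch.nn as nn

# Class (a): conventional feed-forward (convolutional) network
cnn = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 10))

# Class (b): recurrent network (an LSTM, one of the "specialized versions")
rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

print(cnn(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
out, _ = rnn(torch.randn(1, 5, 16))          # (batch, sequence, features)
print(out.shape)                             # torch.Size([1, 5, 32])
```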

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 13


Phase 1 RTML generator metrics

Metric                                  Value
Type                                    Training and Inference
Peak Performance                        Scalable, configurable at generation, with support up to full reticle size at 14nm
Inference Energy Efficiency [1]         >10 TOPS/W
Min Number of Architectures [2]         10
Hardware Generation Automation          100% (ML to Verilog)
I/O Interface                           Highly efficient chip-to-chip interface (such as from the DARPA CHIPS program)
Design Input (source code)              High-level network description; support for TensorFlow, PyTorch, Caffe2, CNTK, MXNet, ONNX
Generator (Compiler Front-end) Output   Verilog
Deliverable                             Software, license [3], generator source code, flow scripts, documentation, GDSII for generated designs

[1] The program is interested in real work accomplished per Watt, not arbitrary peak mathematical ops/W. As general guidance, 10 TOPS/W at 14nm is the minimum threshold, with the understanding that efficiency numbers are tightly coupled to accuracy, data sets, and actual applications. The efficiency metric includes all SoC power, including IO power needed to sustain peak throughput, and is based upon normalized MACs for the proposed application.
[2] To demonstrate a general purpose ML compiler, teams are expected to complete GDSII implementation of multiple ML architectures.
[3] Delivered with a minimum of government purpose rights; open source licenses are preferred.
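A back-of-envelope reading of footnote [1], with our own assumed numbers (nothing here comes from the program): count the MACs the application actually needs, multiply by the sustained rate, and divide by total SoC power, IO included. The 1 MAC = 2 ops convention is an assumption; normalizations vary, so a proposal should state its own.

```python
# Effective TOPS/W from application-level MAC counts, per the footnote's intent.
macs_per_inference = 1.0e9   # assumed workload: 1 GMAC per inference
inferences_per_s = 10_000    # assumed sustained throughput
soc_power_w = 1.5            # assumed total SoC power, including IO

tops = macs_per_inference * inferences_per_s * 2 / 1e12  # 1 MAC = 2 ops
print(f"{tops / soc_power_w:.1f} TOPS/W")  # 13.3 TOPS/W -> clears the 10 TOPS/W floor
```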

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 14


An introduction to the IDEA silicon compiler (RTL/schematic to GDSII)

IDEA: a unified layout generator (chip, package, board), driven by data and trained
models; 24 hours, no human in the loop.

• 2018: Program kickoff (Jun)
• 2019: First integration exercise (Jan)
• 2019: Alpha code drop (Jun)
• 2020: A usable silicon compiler (50% PPA)
• 2022: A great silicon compiler (100% PPA)


An introduction to the CHIPS interface

• AIB (Advanced Interface Bus) is a PHY-level interface standard for high bandwidth, low power die-to-die communication
• AIB is a clock-forwarded parallel data transfer, like DDR DRAM
• High density with a 2.5D interposer (e.g., CoWoS, EMIB) for multi-chip packaging
• AIB is PHY level (OSI Layer 1); protocols like AXI-4 can be built on top of AIB
• AIB performance:
  • 1 Tbps/mm shoreline
  • ~0.1 pJ/bit
  • <5 ns latency
• Open source: standard and reference implementation at https://github.com/intel/aib-phy-hardware

AIB adopters: Boeing, Intrinsix, Synopsys, Intel, Lockheed Martin, Sandia, Jariet, NCSU, U. of Michigan, Ayar Labs

[Diagram: CHIPS platform with a Stratix 10 FPGA die (14nm, 56G PAM/28G NRZ Ethernet tile) connected over AIB to chiplets for machine learning, ADC/DAC, memory, processors, and adjacent IP, with open slots labeled "Your Chip Here"]

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 16


Phase 2: real time machine learning systems

• Design space exploration through circuit implementation of multiple ML architectures

• General purpose, tunable generator that can support optimization of ML hardware for specific
requirements

• Hardware demonstration of RTML for a particular application area

• Application areas:
• Future high bandwidth wireless communication systems, like the 60 GHz range of the 5G standard
• High bandwidth image processing in SWaP constrained systems

• DARPA will provide fabrication support through a number of separately funded multi-project or
dedicated wafer runs

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 17


Phase 2 RTML metrics

Phase 2 Hardware Guidelines [1]       Min          Max
Data Throughput                       400 Kbps     400 Gbps
Latency                               100 µs       100 s
Total Power [2]                       200 µW       200 W
Application-Specific Accuracy [3]     0.6          0.99

Dataset                               Proposer defined [4]
I/O Interface                         Highly efficient chip-to-chip interface (such as CHIPS)
Design Input (source code)            High-level network description; support for TensorFlow, PyTorch, Caffe2, CNTK, MXNet, ONNX
Design Output                         GDSII ready for manufacturing
Hardware Generation Automation        100%
Deliverables                          Software, license [5], design source code, flow scripts, documentation, GDSII, chiplet hardware

[1] Teams are expected to explore a wide trade space of power, latency, accuracy, and data throughput and show the ability to tune hardware over a large range of performance metrics. Max values are not expected to be achieved simultaneously.
[2] Power must include everything needed to operate, including power delivery, thermal management, external memory, and sensor interfaces.
[3] For example, ResNet152 has an accuracy of >0.96 on the ImageNet database: http://image-net.org/challenges/LSVRC/2015/results
[4] Proposals are expected to outline a clear plan for validating the quality of the compiler output, including details of the publicly available benchmarks and datasets from industry, government, and academia that will be used.
[5] Delivered with a minimum of government purpose rights; open source licenses are preferred.
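A hypothetical sketch of what "tunable" means against the Min/Max columns above: one compiler, swept across operating points, emitting a right-sized design per point. The generator call and its parameters are invented for illustration; only the corner values come from the table.

```python
# Sweep the Phase 2 trade space; each point would yield its own GDSII-ready design.
design_points = [
    dict(power_w=200e-6, latency_s=100.0, accuracy=0.60),  # sensor-node corner
    dict(power_w=2.0, latency_s=1e-3, accuracy=0.95),      # assumed mid-range point
    dict(power_w=200.0, latency_s=100e-6, accuracy=0.99),  # base-station corner
]
for p in design_points:
    # rtml_generate("model.onnx", **p)  # hypothetical tunable generator call
    print(f"target: {p['power_w']} W, {p['latency_s']} s, accuracy >= {p['accuracy']}")
```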

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 18


RTML schedule

• 0 months (Fall 2019): Kickoff workshop
• 9 months (Mid 2020): Alpha release of RTML generator at joint NSF/DARPA workshop
• 18 months (Spring 2021): Release of V1.0 RTML generator and demonstration with an RTML compiler flow
• 27 months (End 2021): Release of V1.5 tunable hardware generator
• 36 months (Fall 2022): Hardware demonstration of real-time machine learning for a specific application

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 19


RTML seeks answers for the following research questions

• Can we build an application specific silicon compiler for RTML?

• What subset of current ML frameworks' syntax/methods can be supported with a compiler?

• What needs to be added to current ML frameworks to support efficient translation?

• What hardware architectures are best suited for real time operation?

• What are the lower latency limits for various RTML tasks?

• What is the lowest SWaP feasible for various RTML tasks?

• What are the tradeoffs between energy efficiency, throughput, latency, area, and accuracy?

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 20


RTML does NOT seek proposals for these areas

• Investigatory research that does not result in deliverable hardware designs

• Circuits that cannot be produced in standard CMOS foundries (like 14nm)

• New Domain Specific Languages

• New approaches to physical layout (RTL to GDSII)

• Incremental efforts

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 21


Joint NSF collaboration

NSF and DARPA team to explore rapid development of energy-efficient hardware and
real-time machine learning architectures.

[Timeline: DARPA Phase 1 (18 mos): Alpha Release, then V1.0 Release & GDSII;
DARPA Phase 2 (18 mos): V1.5 Release & Tapeout, then Silicon Demo and Delivery;
NSF Phase 1 (36 mos) runs in parallel]

• NSF: Single phase, exploratory research into circuit architectures and algorithms

• DARPA:
• Phase 1: Fully automated hardware generators “compilers” for state of the art machine learning algorithms
and networks, using existing programming frameworks (TensorFlow, etc.) as inputs
• Phase 2: Deliver novel machine learning architectures and circuit generators that enable real time machine
learning for autonomous machines

• Joint solicitation release and workshops at 9 and 18 mos into each phase

• DARPA teams pull in NSF work during Phase 1 to Phase 2 transition

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 22


Collaboration and licensing

Required:

• Collaboration with other program performers

• Active participation in joint DARPA-NSF workshops every 9 months

• Open interfaces

Strongly encouraged:

• Publishing code and results early and often

• Permissive (non-viral, non-proprietary) open source licensing

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 23


Funding of DARPA RTML Phase 2

• RTML includes a base Phase 1 and option Phase 2

• The proposed planning and costing by Phase (and by Task) provides DARPA with convenient times to
evaluate funding options and technical progress

• Progression into Phase 2 is not guaranteed; factors that may affect Phase 2 funding decisions include:
• Availability of funding
• Cost of proposals selected for funding
• Demonstrated performance relative to program goals
• Interaction with government evaluation teams
• Compatibility with potential national security needs

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 24


Important dates

• BAA Posting Date: March 15, 2019


• Proposers Day: April 2, 2019
• FAQ Submission Deadline: April 15, 2019 at 1:00 PM
o DARPA will post a consolidated Question and Answer (FAQ) document on a regular basis. To
access the posting go to: http://www.darpa.mil/work-with-us/opportunities.
• Proposal Due Date: May 1, 2019 at 1:00 PM
• Estimated period of performance start: October 2019
• Questions: [email protected]

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 25


Evaluation criteria, in order of importance

1. Overall Scientific and Technical Merit


o Demonstrate that the proposed technical approach is innovative, feasible, achievable, and complete

o A clear and feasible plan for release of high quality software is provided

o Task descriptions and associated technical elements provided are complete and in a logical sequence, with all proposed research
clearly defined such that a final outcome that achieves the goal can be expected

2. Potential Contribution and Relevance to the DARPA Mission


o Note the updated wording, with an emphasis on contribution to U.S. national security and U.S. technological capabilities

3. Impact on Machine Learning Landscape


o The proposed research will successfully complete a fundamental exploration of the tradeoffs between system efficiency and
performance for a number of ML architectures

o The proposed research significantly advances the state of the art in machine learning hardware

4. Cost Realism
o Ensure proposed costs are realistic for the technical and management approach and accurately reflect the goals and objectives of the
solicitation

o Verify that proposed costs are sufficiently detailed, complete, and consistent with the Statement of Work

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 26


Agenda

RTML Proposers Day


DARPA - 675 N Randolph Street, Arlington, VA 22203
Tuesday, April 2, 2019
Start     End       Min   Session                          Speaker
8:00 AM   9:00 AM   60    Registration and Poster Setup
9:00 AM   9:15 AM   15    Welcome - Security, Logistics    Ron Baxter
9:15 AM   9:55 AM   40    DARPA RTML Program Overview      Andreas Olofsson
9:55 AM   10:15 AM  20    NSF RTML Collaboration Overview  Sankar Basu
10:15 AM  10:45 AM  30    Contracting Overview             Michael Blackstone
10:45 AM  11:00 AM  15    Break
11:00 AM  11:45 AM  45    Question and Answer Session      Andreas Olofsson
11:45 AM  1:00 PM   75    Lunch (On Your Own)
1:00 PM   3:00 PM   120   Poster and Networking Session    All
3:00 PM   3:00 PM   0     Conclude

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 27


www.darpa.mil

"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 28
