Real Time Machine Learning Proposers Day - v3
Andreas Olofsson
Program Manager
DARPA/MTO
Proposers Day
April 2, 2019
Approach:
• TA1: Image Application for Benchmarking: recreate a benchmark image application mapped into image pixels
• TA2: MS CMOS Demonstration: mixed-signal CMOS implementation of the computational model and system test bed showing a 1x10^5x combined speed-power improvement for analog CMOS
• TA3: Emerging Device Implementation: image processing demonstration combining next-generation devices with the new computation model; 1x10^7x improvement (projected)
[Figure: extracted library mapped onto an analog vector matrix multiply / pattern match; emerging devices include analog CMOS, analog floating gate, oscillators, graphene, and memristors; 0.9 mm x 0.4 mm test chip with DPU, RAC, and TC blocks; 7-node matrix multiply]
Goal: Demonstrate the capability and pathway toward embedded computing efficiency in ISR applications with >1,000x processing speed and >10,000x improvement in power consumption
"Distribution Statement "A" Approved for Public Release, Distribution Unlimited" 3
Selected UPSIDE results
• Analog computing beats digital on VMMs
• Manufacturing latency too long
[Figure: 16 μm functional crossbar array flanked by dummy arrays; an incoming image drives input neurons I1-I8 feeding the output neurons]

RTML
• Large design "space" of possibilities of right-sized HW and programmability
Source: NVIDIA, ISSCC 2019
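The UPSIDE comparison above centers on the vector-matrix multiply (VMM), where each output neuron is a weighted sum of the inputs. Below is a minimal Python sketch of the operation an analog crossbar computes in a single step; the array sizes are illustrative, not values from the program.

import numpy as np

# Illustrative sizes only: eight inputs (e.g., pixels driving I1-I8 in the crossbar figure)
# and four output neurons; real arrays are much larger.
num_inputs, num_outputs = 8, 4
x = np.random.rand(num_inputs)                # input pixel vector
W = np.random.randn(num_outputs, num_inputs)  # weights (crossbar conductances)

# Digital VMM: one multiply-accumulate per weight.
# An analog crossbar produces all of these weighted sums at once as summed currents,
# which is the source of the claimed speed-power advantage.
y = W @ x
print(y)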
Building bridges
• RTML (New!) bridges common ML frameworks such as TensorFlow and PyTorch to hardware
Source: Qualcomm, 2017
DARPA RTML Program
• Develop a hardware generator that converts programs expressed in common ML frameworks (such as TensorFlow, PyTorch) into standard Verilog code and hardware configurations (see the sketch below)
• Generate synthesizable Verilog that can be fed into layout generation tools, such as those from DARPA IDEA
The RTML generator should support a diversity of ML architectures of interest to the program.
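As a rough illustration of the generator's input side, the sketch below traces a tiny PyTorch model into a graph of primitive operations, the kind of intermediate form a hardware generator could walk when emitting Verilog. The model, its sizes, and the use of torch.jit.trace are illustrative assumptions, not the program's prescribed toolflow.

import torch
import torch.nn as nn

# Tiny stand-in for "a program expressed in a common ML framework" (hypothetical sizes).
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3)   # 28x28 input -> 4 x 26 x 26
        self.fc = nn.Linear(4 * 26 * 26, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.fc(x.flatten(1))

model = TinyNet().eval()
example = torch.zeros(1, 1, 28, 28)

# Trace into a static graph; a generator could walk these nodes to emit
# one Verilog module (or configuration) per primitive operation.
traced = torch.jit.trace(model, example)
for node in traced.graph.nodes():
    print(node.kind())   # convolution, relu, linear, plus attribute/constant nodes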
Metrics
  Type                              Training and Inference
  Scalable                          Configurable at generation with support up to full reticle size at 14 nm
  Peak Performance
  Inference Energy Efficiency¹      >10 TOPS/W
  Min Number of Architectures²      10
  Hardware Generation Automation    100% (ML to Verilog)
  I/O Interface                     Highly efficient chip-to-chip interface (such as from the DARPA CHIPS program)
  Training
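To make the >10 TOPS/W inference target concrete, here is a small back-of-the-envelope helper; the workload numbers in the example are hypothetical placeholders, not program requirements.

def inference_power_watts(ops_per_inference, inferences_per_sec, tops_per_watt):
    # Average compute power implied by an efficiency target.
    # ops_per_inference and inferences_per_sec are hypothetical workload numbers;
    # tops_per_watt is the efficiency floor (e.g., the program's >10 TOPS/W).
    ops_per_sec = ops_per_inference * inferences_per_sec
    return ops_per_sec / (tops_per_watt * 1e12)

# Hypothetical example: a 5-GOP model at 1,000 inferences/s needs 5 TOPS of throughput;
# at 10 TOPS/W that is about 0.5 W of compute power.
print(inference_power_watts(5e9, 1000, 10))  # -> 0.5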
DARPA IDEA Unified Layout Generator: First Integration Exercise (Jan 2019)
AIB (Advanced Interface Bus)
• AIB is a clock-forwarded parallel data transfer, like DDR DRAM
• High density with 2.5D interposer (e.g., CoWoS, EMIB) for multi-chip packaging
• AIB is PHY level (OSI Layer 1)
• Can build protocols like AXI-4 on top of AIB
• AIB Performance:
  • 1 Tbps/mm shoreline
  • ~0.1 pJ/bit
  • <5 ns latency
• Open Source! Standard and reference implementation
• https://github.com/intel/aib-phy-hardware
• AIB Adopters: Boeing, Intrinsix, Synopsys, Intel, Lockheed Martin, Sandia, Jariet, NCSU, U. of Michigan, Ayar Labs
[Figure: "Your Chiplet" connected over AIB to memory, processor, and adjacent-IP chiplets]
[Figure: CHIPS platform with "Your Chip Here" sites connected over AIB to a 14 nm Stratix 10 FPGA die with 56G PAM / 28G NRZ Ethernet tiles]
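A quick sanity check on the quoted AIB numbers (1 Tbps/mm of shoreline, ~0.1 pJ/bit); the 2 mm of die edge assumed below is an illustrative value, not from the slide.

# Quoted AIB figures; shoreline_mm is an assumed example value.
bandwidth_density_bps_per_mm = 1e12   # 1 Tbps per mm of shoreline
energy_per_bit_joules = 0.1e-12       # ~0.1 pJ/bit
shoreline_mm = 2.0                    # hypothetical die edge devoted to AIB

link_bandwidth_bps = bandwidth_density_bps_per_mm * shoreline_mm
link_power_watts = link_bandwidth_bps * energy_per_bit_joules

print(link_bandwidth_bps / 1e12)  # 2.0 Tbps of chip-to-chip bandwidth
print(link_power_watts)           # 0.2 W at ~0.1 pJ/bit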
• General purpose, tunable generator that can support optimization of ML hardware for specific
requirements
• Application areas:
• Future high bandwidth wireless communication systems, like the 60 GHz range of the 5G standard
• High bandwidth image processing in SWaP constrained systems
• DARPA will provide fabrication support through a number of separately funded multi-project or
dedicated wafer runs
¹ Teams are expected to explore a wide trade space of power, latency, accuracy, and data throughput and show the ability to tune hardware over a large range of performance metrics. Max values are not expected to be achieved simultaneously.
² Power must include everything needed to operate, including power delivery, thermal management, external memory, and sensor interfaces.
³ For example, ResNet152 has a top-5 accuracy of >0.96 on the ImageNet database: http://image-net.org/challenges/LSVRC/2015/results
⁴ Proposals are expected to outline a clear plan for validating the quality of the compiler output, including details of the publicly available benchmarks and datasets from industry, government, and academia.
• What hardware architectures are best suited for real time operation?
• What are the lower latency limits for various RTML tasks?
• What are the tradeoffs between energy efficiency, throughput, latency, area, and accuracy?
• Incremental efforts
learning architectures
NSF Phase 1 (36 mos)
• NSF: Single phase, exploratory research into circuit architectures and algorithms
• DARPA:
• Phase 1: Fully automated hardware generators “compilers” for state of the art machine learning algorithms
and networks, using existing programming frameworks (TensorFlow, etc.) as inputs
• Phase 2: Deliver novel machine learning architectures and circuit generators that enable real time machine
learning for autonomous machines
• Joint solicitation release and workshops at 9 and 18 mos into each phase
Required:
• Open interfaces
Strongly encouraged:
• The proposed planning and costing by Phase (and by Task) provides DARPA with convenient times to
evaluate funding options and technical progress
• Progression into Phase 2 is not guaranteed; factors that may affect Phase 2 funding decisions include:
• Availability of funding
• Cost of proposals selected for funding
• Demonstrated performance relative to program goals
• Interaction with government evaluation teams
• Compatibility with potential national security needs
o A clear and feasible plan for release of high-quality software is provided
o Task descriptions and associated technical elements provided are complete and in a logical sequence, with all proposed research clearly defined such that a final outcome that achieves the goal can be expected
o The proposed research significantly advances the state of the art in machine learning hardware
4. Cost Realism
o Ensure proposed costs are realistic for the technical and management approach and accurately reflect the goals and objectives of the
solicitation
o Verify that proposed costs are sufficiently detailed, complete, and consistent with the Statement of Work