Christopher Celio, Krste Asanovic, David Patterson

The document describes BOOM, a synthesizable, parameterized out-of-order RISC-V processor developed at UC Berkeley using Chisel. BOOM aims to serve as a platform for computer architecture research. It supports the RV64G ISA and can boot Linux. The document provides details on BOOM's design, parameters that can be varied, and Berkeley's methodology for evaluating different configurations using FPGA emulation.

The Berkeley Out-of-Order Machine (BOOM!):
Computer Architecture Research Using an Industry-Competitive, Synthesizable, Parameterized RISC-V Processor

Christopher Celio, Krste Asanovic, David Patterson
2015 June
UC Berkeley
[email protected]
Tuesday, June 30, 15

What is BOOM?
§ superscalar, out-of-order processor written in Berkeley's Chisel RTL
§ It is synthesizable
§ It is parameterizable
§ We hope to use it as a platform for architecture research

BOOM is a work-in-progress. Results shown in the talk are preliminary and subject to change!

Other Berkeley RISC-V Processors
§ Sodor Collection
- RV32I - tiny, educational, not synthesizable
§ Z-scale
- RV32IM - micro-controller
§ Rocket
- RV64G - in-order, single-issue application core
§ BOOM
- RV64G - out-of-order, superscalar application core

Why OoO?
§ Great for ...
- tolerating variable latencies
- finding ILP in code (instruction-level parallelism)
- complex method for fine-grain data prefetching
- plays nicely with poor compilers and lazily written code

Performance!

OoO widely used in industry
§ Intel Xeon/i-series (10-100W)
§ ARM Cortex mobile chips (1W)
§ Intel Atom
§ Sun/Oracle Niagara UltraSPARC
§ PlayStation

Academic OoO Research
§ general lack of effort in academia to build, evaluate OoO designs
§ most research uses software simulators
- cannot produce area, power numbers
- hard to trust, verify results
- McPAT is calibrated against 90nm Niagara, 65nm Niagara 2, 65nm Xeon, and 180nm Alpha 21364
- very slow
§ Other Academic OoO RTL efforts...
- Illinois Verilog Model, Princeton Sharing Architecture, NCSU FabScalar (Alpha, PISA)
- other ISAs can be very challenging to implement fully
- rely on SW simulators for performance numbers
- hopefully RISC-V can make everybody's lives easier!

Design-space exploration
[Figure: scatter plot of Perf (CoreMark/s) vs. Area (um2) with a Pareto curve; design points are grouped as 1wide, 2wide, and 4wide. Data collected by Orianna DeMasi.]
§ Very preliminary
§ Parameters
- fetch width
- issue width
- ROB size
- IW size
- LSU size
- Regfile size
- # of branch tags
§ 3x range in area
§ 2x range in performance

Research Methodology
§ Which benchmarks?
§ How many cycles do we need to run?
§ State of the art
- "SimPoints"
- run 4-10 snapshots per SPEC2000/2006 benchmark
- each snapshot runs for ~10M instructions
§ What other people do (ISCA 2014 results)
- ~50M instructions / workload
- ~200B instructions / paper
§ What we can do
- map design to an FPGA
- run at 50 MHz (~1T cycles/6hrs)
- run full reference benchmark (~2 trillion instructions avg)
- run on FPGA cluster (~1-2 weeks simulation in one day, or ~30-60T instructions/day)

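The throughput figures on the Research Methodology slide can be checked with back-of-the-envelope arithmetic. In the sketch below, only the 50 MHz clock, the 6-hour window, and the SimPoint snapshot sizes come from the slide; the reading of "1-2 weeks of simulation in one day" as 7 to 14 FPGA boards running in parallel, and the rough equivalence of one cycle to one instruction, are assumptions made for illustration.

```python
# Back-of-the-envelope check of the FPGA throughput figures quoted on the slide.
# Assumptions (not from the talk): 7-14 boards in the cluster, ~1 instruction/cycle.

FPGA_CLOCK_HZ = 50e6          # 50 MHz, from the slide
SECONDS_PER_HOUR = 3600


def cycles(hours: float, clock_hz: float = FPGA_CLOCK_HZ) -> float:
    """Cycles executed by one FPGA board in the given number of hours."""
    return clock_hz * hours * SECONDS_PER_HOUR


six_hours = cycles(6)         # slide quotes ~1T cycles per 6 hours
one_day = cycles(24)          # cycles per board per day

# "1-2 weeks of simulation in one day" read as 7-14 board-days of work.
cluster_low = 7 * one_day
cluster_high = 14 * one_day

# SimPoint-style sampling: 4-10 snapshots of ~10M instructions each.
simpoint_low = 4 * 10e6
simpoint_high = 10 * 10e6

print(f"one FPGA, 6 h : {six_hours:.2e} cycles")
print(f"one FPGA, 24 h: {one_day:.2e} cycles")
print(f"cluster, 24 h : {cluster_low:.2e} to {cluster_high:.2e} cycles")
print(f"SimPoints     : {simpoint_low:.0e} to {simpoint_high:.0e} instructions per benchmark")
```

Under these assumptions the cluster figure lands in the same 30-60T range the slide quotes, several orders of magnitude above the sampled-simulation budgets listed under "State of the art".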
Berkeley Architecture Research Infrastructure
§ RISC-V ISA
§ Chisel HCL (hardware construction language)
§ Rocket-chip SoC generator

The RISC-V ISA is easy to implement!
§ relaxed memory model
§ accrued FP exception flags
§ no integer side-effects (e.g., condition codes)
§ no cmov or predication
§ no implicit register specifiers
- JAL requires explicit rd
§ rs1, rs2, rs3, rd always in the same place in the instruction encoding
- allows decode, rename to proceed in parallel

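The point about register specifiers can be illustrated with a short sketch. In the 32-bit RISC-V base encoding, rd occupies bits 11:7, rs1 bits 19:15, rs2 bits 24:20, and rs3 bits 31:27, so the fields can be pulled out before the opcode is known. The helper names below are hypothetical and this is a software illustration, not BOOM's decode logic.

```python
# Register specifiers sit at fixed bit positions in every 32-bit RISC-V
# instruction format that uses them, so they can be extracted before the
# opcode has been decoded.


def field(inst: int, hi: int, lo: int) -> int:
    """Return bits inst[hi:lo] (inclusive) as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def register_specifiers(inst: int) -> dict:
    """Extract rd, rs1, rs2, rs3 without looking at the opcode.

    Whether each field is meaningful for this instruction is decided later
    by the decoder; a rename stage could already start its lookups.
    """
    return {
        "rd": field(inst, 11, 7),
        "rs1": field(inst, 19, 15),
        "rs2": field(inst, 24, 20),
        "rs3": field(inst, 31, 27),  # only used by fused multiply-add formats
    }


# Encoding of: add x3, x1, x2
ADD_X3_X1_X2 = 0x002081B3
regs = register_specifiers(ADD_X3_X1_X2)
print(regs)
```

Because the extraction does not depend on the opcode, the map-table lookups for the source registers can begin in the same cycle as instruction decode, which is the parallelism the slide refers to.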
The RISC-V ISA
§ BOOM supports "M" (mul/div/rem)
- imul can be either pipelined or unpipelined
§ BOOM supports "A"
- AMOs+LR/SC
§ BOOM supports "FD"
- single, double-precision floating point
- IEEE 754-2008 compliant FPU
- SP, DP FMA with hw support for subnormals
§ RV64G

Rocket-Chip SoC Generator
§ open-source
§ taped out 10 times by Berkeley
§ runs at 1.6 GHz in IBM 45nm
§ makes for a great library of processor components!

Supports Privileged ISA ("S"), Virtual Memory
§ boots Linux!
§ just released Privileged ISA v1.7
§ instant to update
- Privileged ISA nearly entirely isolated to Control/Status Register (CSR) File, TLBs
- updated git submodule pointers
- changed "tohost" to "mtohost" in one line

Chisel
§ Hardware Construction Language embedded in Scala
§ not a high-level synthesis language
§ hardware module is a data structure in Scala
§ Full power of Scala for writing generators
- object-oriented programming: factory objects, traits, overloading
- functional programming: higher-order functions, anonymous functions, currying
§ generated C++ simulator is 1:1 copy of Verilog designs

Chisel Hardware Construction Language
§ object-oriented, functional programming
§ powerful for writing hw generators
§ 12 days (+1092 loc) to add SP, DP floating point
§ 9 days (+900 loc) to go from no VM to booting Linux

BOOM
[Figure: pipeline block diagram showing Fetch, Decode & Rename, Issue Window, Unified Physical Register File, and Functional Unit, labelled with an in-order front-half and an out-of-order back-half.]

BOOM
[Figure: pipeline block diagram showing Fetch, Decode & Rename (Rename Map Tables & Freelist), Issue Window, Unified Physical Register File (PRF), ALU, FPU, ROB, and Commit.]
§ PRF
- explicit renaming
- holds speculative and committed data
- holds both x-regs, f-regs
§ Unified Issue Window
- holds all instructions
§ split ROB/issue window design

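Explicit renaming with a map table and a freelist, as named on the slide, can be sketched in software as follows. The class, its method names, and the register counts are hypothetical; the sketch ignores the hard-wired zero register, branch snapshots, and stalling on an empty freelist, all of which a real design has to handle.

```python
class Renamer:
    """Toy model of explicit register renaming (map table + freelist)."""

    def __init__(self, num_arch: int = 32, num_phys: int = 64):
        # Initially architectural register i lives in physical register i.
        self.map_table = list(range(num_arch))
        # Physical registers not currently mapped to any architectural one.
        self.freelist = list(range(num_arch, num_phys))

    def rename(self, rd: int, rs1: int, rs2: int):
        """Rename one instruction; returns (p_rd, p_rs1, p_rs2, stale_p_rd)."""
        # Sources read the current mapping before the destination is updated.
        p_rs1 = self.map_table[rs1]
        p_rs2 = self.map_table[rs2]
        # The old mapping of rd stays live until this instruction commits.
        stale = self.map_table[rd]
        p_rd = self.freelist.pop(0)
        self.map_table[rd] = p_rd
        return p_rd, p_rs1, p_rs2, stale

    def commit(self, stale: int) -> None:
        """At commit, the previous physical register of rd can be recycled."""
        self.freelist.append(stale)


r = Renamer(num_arch=4, num_phys=8)
first = r.rename(rd=1, rs1=2, rs2=3)
second = r.rename(rd=2, rs1=1, rs2=1)   # reads the newly renamed x1
r.commit(first[3])
print(first, second, r.freelist)
```

The second instruction picks up the new physical register for x1 through the map table, which is how a single physical register file can hold both speculative and committed values at once.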
Parameterized Superscalar

dual-issue (5r,3w)
[Figure: Issue Select, Regfile Read, bypass network, two execution units (ALU, FPU, imul; ALU, div, Agen, LSU, D$), Regfile Writeback, with bypassing.]

val exe_units = ArrayBuffer[ExecutionUnit]()
exe_units += Module(new ALUExeUnit(is_branch_unit = true
                                 , has_fpu        = true
                                 , has_mul        = true
                                 ))
exe_units += Module(new ALUMemExeUnit(fp_mem_support = true
                                    , has_div        = true
                                    ))

OR

Quad-issue (9r,4w)
[Figure: Issue Select, Regfile Read, bypass network, four execution units (ALU; ALU, FPU, imul; ALU, div; Agen, LSU, D$), Regfile Writeback, with bypassing.]

exe_units += Module(new ALUExeUnit(is_branch_unit = true))
exe_units += Module(new ALUExeUnit(has_fpu = true
                                 , has_mul = true
                                 ))
exe_units += Module(new ALUExeUnit(has_div = true))
exe_units += Module(new MemExeUnit())

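One way to read the "(5r,3w)" and "(9r,4w)" annotations is as register-file port totals summed over the execution units. The model below is a guess that happens to reproduce both figures, not a description of BOOM's generator: it assumes an FPU-capable unit needs three read ports (for fused multiply-add), every other unit needs two, and a unit that combines an ALU with the memory pipeline gets a second write port. The `ExeUnit` class is a hypothetical stand-in for the Chisel modules on the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExeUnit:
    """Hypothetical stand-in for one execution unit's configuration."""
    has_alu: bool = True
    has_fpu: bool = False
    has_mem: bool = False

    @property
    def read_ports(self) -> int:
        # Assumption: a fused multiply-add needs three source operands,
        # everything else needs two.
        return 3 if self.has_fpu else 2

    @property
    def write_ports(self) -> int:
        # Assumption: a unit combining an ALU with the memory pipeline gets
        # a second write port so loads and ALU results do not collide.
        return 2 if (self.has_alu and self.has_mem) else 1


def regfile_ports(units: List[ExeUnit]) -> Tuple[int, int]:
    """Total (read, write) ports the register file must provide."""
    reads = sum(u.read_ports for u in units)
    writes = sum(u.write_ports for u in units)
    return reads, writes


# Mirrors the two configurations on the slide.
dual_issue = [ExeUnit(has_fpu=True), ExeUnit(has_mem=True)]
quad_issue = [
    ExeUnit(),                              # ALU + branch unit
    ExeUnit(has_fpu=True),                  # ALU + FPU + imul
    ExeUnit(),                              # ALU + div
    ExeUnit(has_alu=False, has_mem=True),   # memory-only unit
]

print("dual-issue:", regfile_ports(dual_issue))
print("quad-issue:", regfile_ports(quad_issue))
```

The useful point is the shape of the approach: the machine width falls out of a list of unit descriptions, so changing the issue width is a matter of editing that list rather than rewriting the datapath.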
Full Branch Speculation Support
§ next-line predictor (NLP)
- BTB, BHT, RAS
- combinational
§ backing predictor (BPD)
- global history predictor
- SRAM (1 r/w port)
[Figure: front-end diagram with stages NPC, Fetch1, Fetch2 and signals TakePC, PC1, PC2; blocks include NLP, BHT, I$, µDec, Target, Fetch Buffer, ExeBrTarget, and BPD (Branch Prediction).]

Load/Store Unit
§ load/store queue with store ordering
- loads execute fully out-of-order wrt stores, other loads
- store-data forwarded to loads as required
§ non-blocking data cache

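Store-to-load forwarding, mentioned on the Load/Store Unit slide, can be illustrated with a small software model. This is a simplified sketch and not BOOM's LSU: it assumes all accesses are the same size and aligned, that every older store address is already known, and it uses hypothetical names throughout.

```python
from typing import Dict, List, NamedTuple


class Store(NamedTuple):
    addr: int
    data: int


def load_value(addr: int, older_stores: List[Store], memory: Dict[int, int]) -> int:
    """Return the value a load should observe.

    older_stores holds the uncommitted stores that precede the load in
    program order, oldest first. The youngest matching store wins; if no
    store matches, the value comes from memory (0 if never written).
    """
    for st in reversed(older_stores):
        if st.addr == addr:
            return st.data          # forwarded from the store queue
    return memory.get(addr, 0)      # no match: read the data cache / memory


memory = {0x100: 1}
stores = [Store(0x100, 2), Store(0x200, 9), Store(0x100, 3)]

print(load_value(0x100, stores, memory))
print(load_value(0x200, stores, memory))
print(load_value(0x100, [], memory))
```

A hardware store queue performs the same search with parallel address comparators rather than a loop, and must also cope with stores whose addresses are not yet resolved.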
Synthesizable
§ Runs on FPGA
- (Zynq zedboard and Zynq zc706)
§ 2GHz (30 FO4) in TSMC 45nm
- speed of logic (SRAM is slower)
[Figure: 2-wide BOOM layout, 1.7mm2 @ 45nm. Labelled blocks: I$ (32k), D$ (32k), LLC Data (256k), Uncore, Exe, Regfile, Rename, Issue, ROB, bpd. Preliminary results.]

Benefits of using Chisel
§ ~9,000 loc in BOOM github repo
§ additional ~11,500 loc instantiated from other libraries
- ~5,000 loc from Rocket core repository: functional units, caches, PTWs, etc.
- ~4,500 loc from uncore: coherence hubs, L2 caches, networks, host/target interfaces
- ~2,000 loc from hardfloat: floating point hard units

Feature Summary

Feature | BOOM
--- | ---
ISA | RISC-V (RV64G)
Synthesizable | √
FPGA | √
Parameterized | √
floating point | √
AMOs+LR/SC | √
caches | √
VM | √
Boots Linux | √
Multi-core | √
lines of code | 9k + 11k

That's BOOM!
[Figure: Quad-issue (9r,4w) pipeline with bypassing: Issue Select, Regfile Read, bypass network, execution units (ALU; ALU, FPU, imul; ALU, div; Agen, LSU, D$), Regfile Writeback.]

Comparison against ARM

Category | ARM Cortex-A9 | RISC-V BOOM-2w
--- | --- | ---
ISA | 32-bit ARM v7 | 64-bit RISC-V v2 (RV64G)
Architecture | 2 wide, 3+1 issue Out-of-Order 8-stage | 2 wide, 3 issue Out-of-Order 6-stage
Performance | 3.59 CoreMarks/MHz | 3.91 CoreMarks/MHz (+9%!)
Process | TSMC 40GPLUS | TSMC 40GPLUS
Area with 32K caches | ~2.5 mm2 | ~1.00 mm2
Area efficiency | 1.4 CoreMarks/MHz/mm2 | 3.9 CoreMarks/MHz/mm2
Frequency | 1.4 GHz | 1.5 GHz

[Figure: 2-wide BOOM layout. Note: not to scale. Preliminary results.]

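The derived rows of the comparison follow directly from the measured ones. The short check below recomputes area efficiency (CoreMarks/MHz divided by area) and the relative performance gain from the figures in the table; the variable names are illustrative only.

```python
def area_efficiency(coremark_per_mhz: float, area_mm2: float) -> float:
    """CoreMarks/MHz delivered per square millimetre."""
    return coremark_per_mhz / area_mm2


# Figures as quoted in the comparison (areas are approximate in the source).
cortex_a9 = {"coremark_per_mhz": 3.59, "area_mm2": 2.5}
boom_2w = {"coremark_per_mhz": 3.91, "area_mm2": 1.00}

a9_eff = area_efficiency(**cortex_a9)
boom_eff = area_efficiency(**boom_2w)

# Relative per-clock performance of BOOM-2w over the Cortex-A9, in percent.
gain_pct = (boom_2w["coremark_per_mhz"] / cortex_a9["coremark_per_mhz"] - 1) * 100

print(f"Cortex-A9 : {a9_eff:.1f} CoreMarks/MHz/mm2")
print(f"BOOM-2w   : {boom_eff:.1f} CoreMarks/MHz/mm2")
print(f"gain      : {gain_pct:.0f}%")
```

Since both areas carry a "~" in the source, the efficiency ratio should be read as approximate.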


Industry Comparisons
[Figure: bar chart of CoreMark/MHz (axis 0 to 6.00), grouped into out-of-order processors and in-order processors. Bars, in the order shown: Ivy Bridge, Cortex-A15, BOOM-4w, BOOM-2w, Cortex-A9, MIPS 74k, Cortex-A8, Rocket, Cortex-A5. Preliminary results.]

Industry Comparisons

Processor | Core Area | CoreMark/MHz | Freq (MHz) | IPC
--- | --- | --- | --- | ---
Intel Xeon E5 2668 (Ivy) | ~12 mm2@22nm | 5.60 | 3,300 | 1.96
ARM Cortex-A15 | 2.8 mm2@28nm | 4.72 | 2,116 | 1.50
BOOM-4wide | 1.1 mm2@45nm | 4.70 | 1,000 | 1.50
BOOM-2wide | 0.8 mm2@45nm | 3.91 | 1,500 | 1.26
ARM Cortex-A9 | 2.5 mm2@40nm | 3.59 | 1,400 | 1.27
MIPS 74K | 2.5 mm2@65nm | 2.50 | 1,600 | -
Rocket (RV64G) | 0.5 mm2@45nm | 2.32 | 1,500 | 0.76
ARM Cortex-A5 | 0.5 mm2@40nm | 2.13 | - | -

(slide annotation beside the table: 48x)
preliminary results

Ivy Bridge Tile Comparison
Ivy Bridge-EP Tile (32kB/32kB + 256kB caches): ~12mm2 @ 22nm
BOOM-2w Chip (32kB/32kB + 256kB caches): 1.7mm2 @ 45nm
BOOM-2w Chip scaled to 0.4mm2 @ 22nm
[Figure: 2-wide BOOM layout drawn at both sizes for comparison with the Ivy Bridge-EP tile. Preliminary results.]

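The "scaled to 0.4mm2 @ 22nm" figure is consistent with ideal quadratic area scaling, where area shrinks with the square of the feature-size ratio. The sketch below assumes that idealized rule; real processes, SRAM in particular, rarely scale this cleanly, so it is a sanity check rather than a prediction.

```python
def scale_area(area_mm2: float, from_nm: float, to_nm: float) -> float:
    """Scale an area between process nodes assuming ideal quadratic shrink."""
    return area_mm2 * (to_nm / from_nm) ** 2


# BOOM-2w chip area quoted for 45nm, rescaled to 22nm.
boom_2w_22nm = scale_area(1.7, from_nm=45, to_nm=22)
print(f"BOOM-2w at 22nm: {boom_2w_22nm:.2f} mm2")
```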


Synthesis Results
[Figure: stacked bar chart of Core Area (um^2), axis 0 to 700,000, for Rocket (No FPU, FPU) and BOOM (BOOM-1w, BOOM-2w, BOOM-4w). Components: Issue Unit, Rename Stage (maptables), RRd Stage (bypasses), Register File, ROB, LSU, Br Predictor, Freelist, BusyTable, FetchBuffer, FPU, Imul, Other. Preliminary results.]

Synthesis Results
[Figure, left panel: Core Area (um^2), axis 0 to 700,000, broken down by Issue Unit, Rename Stage (maptables), RRd Stage (bypasses), Register File, ROB, LSU, Br Predictor, Freelist, BusyTable, FetchBuffer, FPU, Imul, Other.]
[Figure, right panel: Tile Area (um^2), axis 0 to 1,200,000, for Rck-I, Rck-G, BOOM-1w, BOOM-2w, BOOM-4w, broken down by Core, I$ (16 KB), D$ (16 KB). Preliminary results.]

Lessons
§ RISC-V is a great ISA
- it gets out of your way
- the instruction count difference is greater between gcc versions than between ISAs
§ code-reuse is great
- leveraging existing Rocket-chip infrastructure
§ Way too much of my time is wasted on corralling benchmarks
- we should share our efforts
- https://github.com/ccelio/Speckle/
- make generating portable SPEC CPU2006 easy
§ Debugging is hard
- good verification tests are more valuable than good RTL
- use asserts EVERYWHERE
- use an ISA simulator in parallel with RTL simulation

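Running an ISA simulator in parallel with RTL simulation usually comes down to comparing the two commit streams and stopping at the first mismatch. The sketch below shows that comparison on in-memory traces. The `Commit` record and its fields are a hypothetical trace format chosen for illustration, not the output of any particular simulator.

```python
from itertools import zip_longest
from typing import Iterable, NamedTuple, Optional


class Commit(NamedTuple):
    """One committed instruction: its PC and the register write it made."""
    pc: int
    rd: Optional[int]      # destination register index, or None
    value: Optional[int]   # value written, or None


def first_divergence(rtl: Iterable[Commit], golden: Iterable[Commit]) -> Optional[int]:
    """Index of the first commit where the traces differ, or None if equal.

    A trace that ends early counts as a divergence at the point it ends.
    """
    for i, (a, b) in enumerate(zip_longest(rtl, golden)):
        if a != b:
            return i
    return None


golden = [
    Commit(0x80000000, 1, 5),
    Commit(0x80000004, 2, 7),
    Commit(0x80000008, 3, 12),
]
# Same program, but the third result is wrong in the "RTL" trace.
rtl = golden[:2] + [Commit(0x80000008, 3, 13)]

print(first_divergence(rtl, golden))
print(first_divergence(golden, golden))
```

Reporting the first diverging commit narrows the search to a handful of cycles, which is the main reason lockstep checking tends to pay off earlier than inspecting final program output.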
"Speckle" - a wrapper for SPEC CPU2006
§ SPEC is designed to be run natively
- a pain for cross-compiling, running on a simulator or FPGA
§ If you have a copy of CPU2006...
- modify the provided cfg file
- Speckle will compile and generate a portable directory of binaries, input files, and input arguments, and a run script
§ https://github.com/ccelio/Speckle/

Conclusion
§ BOOM supports full RV64G + privileged ISA (VM support)
§ Able to boot Linux and run CoreMark, SPECINT, and Dhrystone benchmarks
§ BOOM is 9,000 loc and 3 person-years of work

§ Future Work
- bring-up more interesting applications
- add ROCC interface
- explore new µarch designs
- tape-out this fall
- open-source by winter workshop

Questions?

Funding Acknowledgements
§ Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung.
§ Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.
§ Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.
