
Logic Synthesis

Part 1

Amr Adel Mohammady


Save The Palestinian Children

Israel has killed more than 13,000 children in Gaza since


October 7 while others are suffering from severe
malnutrition and do not “even have the energy to cry”,
says the United Nations Children’s Fund (UNICEF).

“Thousands more have been injured or we can’t even


determine where they are. They may be stuck under
rubble … We haven’t seen that rate of death
among children in almost any other conflict in
the world”
-UNICEF Executive Director

Till Nov,2023
Introduction
• Logic synthesis is the process of converting the HDL code into logic gates.
• The output of synthesis is called a netlist. A netlist is a textual representation of all the nets and cells in the design and their connections.
• The aim of this document is to explain the steps within the synthesis flow and show the possible optimizations that can be done.

Behavioral Verilog Code → Synthesized Logic Netlist → Synthesized Logic Schematic

Analyze

Analyze
• The first step in synthesis is to read and analyze the HDL code and the other inputs to ensure
nothing is missing or corrupted.
• The analyze step will check that:
o HDL codes contain no syntax issues.

o All modules/include files/functions/etc referenced inside the codes are available and nothing
is missing.

Example HDL1

[1] : https://github.com/pulp-platform/pulpino


Elaborate

Elaborate
• The second step is called elaborate and sometimes called translate. It’s the
process of converting the HDL codes into actual logic gates.
• The elaboration step outputs are:
o A technology-independent (generic) netlist without any optimizations done. The cells referenced in the netlist only have the functional information; no timing, power, or physical information exists.
o Reports about linting and design issues in the codes such as missing
modules, multi-driven nets, width mismatch, etc.
o Reports about the cells, their types, and counts.
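As a concrete illustration of the kind of lint issue the elaboration reports flag, here is a small, hypothetical Verilog snippet (module and signal names are assumptions) containing a multi-driven net:

  module bad_driver (input a, input b, output y);
    assign y = a;
    assign y = b;  // second continuous assignment to y: elaboration reports a multi-driven net
  endmodule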

Example Cell Report from Design Compiler

Map and Optimize
(Compile)

Compile
• Compile is the most interesting and important step. Here we map the generic netlist into a technology-dependent netlist and then do several optimizations
to meet the given constraints.
• The constraints are supplied by the user and fall into 3 categories:
o Timing constraints: The clock frequency, the input/output delays, false paths, multi-cycle paths, max transitions, etc.
o Power constraints: Max dynamic power consumption, max leakage power consumption, etc.
o Area constraints.
• Along with the constraints, the user enables or disables application settings that control the tool behavior and optimizations.
• The compile step outputs are:
o A technology-dependent and optimized netlist
o Reports about the design timing results, power consumption, area utilization, etc.

Compile – Map
• In the case of ASIC, the generic cells are mapped to the cells defined inside the standard cell libraries1.
• In the case of FPGAs, the cells are mapped into FFs, LUTs, DSP blocks, and memory blocks that are implemented within the FPGA fabric.
o The LUT (look-up table) is a small memory that can be configured to implement any basic logic gate.
In the table shown, if we consider the inputs a and b as the address, we can implement any function by storing its truth table inside the memory.

a b | AND OR NAND NOR XOR
0 0 |  0   0   1    1   0
0 1 |  0   1   1    0   1
1 0 |  0   1   1    0   1
1 1 |  1   1   0    0   0

o The DSP blocks are special and fast blocks that are used to implement arithmetic operations.

LUT as AND / LUT as OR

DSP48e1 from Xilinx/AMD

[1] : You can learn more about standard cell libraries here : Click Me
Compile – Optimize
• After mapping, the synthesis tool will do many optimizations and tradeoffs to fulfill the design constraints. We will go through the important ones.
Constant Propagation
• This optimization propagates constant values to remove redundant logic. For example, a 2-input AND where one of its inputs is a constant “0” will always produce “0”
regardless of the other input value.
• Consider the example below:
o The 1st XOR gate with both inputs having the same logic value will always produce “0”. Therefore we can remove the first XOR and propagate zero to the next XOR.
o The 2nd XOR has A in one input and “0” in the other. An XOR where one of its inputs is a “0” will pass the value of the other input as it is. Therefore we can remove the 2nd
XOR and propagate A to the next XOR.
o The 3rd XOR has A in both its inputs and so, will always produce 0. The entire circuit can be optimized away and a logic “0” be propagated to its output.
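A minimal Verilog sketch of the XOR chain described above (module and signal names are assumptions); constant propagation collapses the whole circuit to a constant 0:

  module xor_chain (input a, input b, output y);
    wire n1 = b ^ b;    // both inputs carry the same value, so n1 is always 0
    wire n2 = a ^ n1;   // XOR with 0 simply passes a through
    assign y = a ^ n2;  // a ^ a is always 0, so y can be tied to 1'b0
  endmodule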


Compile – Optimize

Cross-Boundary Optimization
• The RTL code consists of several modules that are connected to each other.
• By default, the synthesis tools will synthesize each module separately and then connect them at the top module.
• In the example below, the inverter pair can be removed. But since the inverters exist in different modules the synthesis tool won’t remove them.
• One solution to this is to enable cross-boundary optimization. This allows the tool to perform the optimizations and also propagate constants across the modules. The
drawback is an increase in runtime.
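A hedged sketch of the inverter-pair situation (module and signal names are assumptions): with per-module synthesis neither inverter is removed, but cross-boundary optimization can cancel the pair:

  module mod_a (input d, output q);
    assign q = ~d;            // output inverter inside the first module
  endmodule

  module mod_b (input d, output q);
    assign q = ~d;            // input inverter inside the second module
  endmodule

  module top (input d, output q);
    wire n;
    mod_a u_a (.d(d), .q(n));
    mod_b u_b (.d(n), .q(q)); // back-to-back inverters across the module boundary
  endmodule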

Compile – Optimize

Hierarchy Flattening
• The other solution is to remove the module boundaries altogether. This is called flattening and generally produces better results but takes higher runtime and makes post-
synthesis simulations and the PNR flow more difficult1.

No Flattening With Flattening

[1] : Because it makes tracing signals and referencing cells more difficult.
Compile – Optimize

NAND/NOR vs AND/OR
• In CMOS logic, NANDs and NORs have smaller area/power and shorter delay than ANDs or ORs. This is because an AND is implemented by connecting a NAND to an inverter, and likewise an OR is a NOR followed by an inverter.
• Because of this, ASIC1 synthesis tools will try to optimize the design to use NANDs and NORs when possible.
• The example on the right shows two circuits that perform the same function. However, the bottom one is better in all aspects (timing, area, power)

CMOS NAND CMOS Inverter

BAD

CMOS NAND CMOS AND Good

[1] : This is not the case in FPGAs because all gates are implemented with LUTs.
Compile – Optimize

Area vs Timing
• Synthesis tools have switches to control the tradeoff between area vs timing.
• Consider the example below; both circuits perform the same function, but:
o The circuit on the left has less area (115 𝜇𝑚²) but a longer critical path (1050 𝑝𝑠) (better for area)
o The circuit on the right has more area (180 𝜇𝑚²) but a shorter critical path (800 𝑝𝑠) (better for timing)
o The synthesis tool will choose which circuit to use based on the user settings.

MUX:   𝑇𝑐𝑜𝑚𝑏 = 100 𝑝𝑠, 𝐴𝑟𝑒𝑎 = 10 𝜇𝑚²
Adder: 𝑇𝑐𝑜𝑚𝑏 = 700 𝑝𝑠, 𝐴𝑟𝑒𝑎 = 75 𝜇𝑚²
Logic: 𝑇𝑐𝑜𝑚𝑏 = 250 𝑝𝑠, 𝐴𝑟𝑒𝑎 = 20 𝜇𝑚²

Compile – Optimize

Power vs Timing
• Similarly, synthesis tools have switches to control the tradeoff between power vs timing.
• For example, a timing-driven synthesis may use large cells that have smaller delays but higher power and area.

2-Input AND Size x2


2-Input AND Size x1

Timing Library from Skywater 130nm Open-source PDK

Example

Example – Memory
• To learn how synthesis works we will manually synthesize the code on the right.
• The code shows a memory of 16 locations, each location 8 bits wide. The memory is positive-edge triggered.
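The original code is shown only as an image, so here is a hedged reconstruction based on the description (16 locations of 8 bits, positive-edge triggered, no reset, and, importantly for the final optimization in this example, no read port); names are assumptions:

  module mem16x8 (
    input            clk,
    input      [3:0] addr,
    input      [7:0] data_in
  );
    reg [7:0] mem [0:15];
    always @(posedge clk)
      mem[addr] <= data_in;   // write the selected location on the clock edge
  endmodule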
• Let’s start with just a single bit element (1 Flip flop) and then build up the rest of the memory:
o The always block is positive edge triggered so we need a positive edge FF.
o We don’t have a reset. So we don’t need a FF with a reset pin.
o When the address corresponding to this FF is selected, the FF reads and stores the data, otherwise it
maintains the current stored value.
o To implement this we need a mux in front of the FF. When the MUX select signal is “1” it will read the
data, otherwise, it will keep the current value from the FF.
o FPGAs and some ASIC standard cell libraries have FFs with this MUX implemented inside the FF as a
single cell.
o Now we need to implement the circuit that will read the address and generate the enable signal

Example – Memory
o To generate the address we need a decoder that outputs 1 to the corresponding FF and 0 to the others.
o This is implemented using AND gates and inverters (the bubbles). The example below shows an AND gate that will produce “1” if the address is “0000” and
“0” otherwise.
o Now we will expand the FF into 8 FFs to store the 8-bit data. The entire row has the same enable/address signal.

Address decoder example: one AND gate with input inverters (bubbles) per address, covering addresses 0000 through 1111.

Example – Memory
o We have 16 rows/locations in the memory. We will repeat the last structure 16 times. Each row has its
unique address generator.
• We have now finished the elaboration step: we converted the behavioral HDL into logic gates, but we haven't done mapping and optimization yet.
• The initial cell count is:
o 16 × 8 𝐹𝐹𝑠 = 128 𝐹𝐹𝑠
o 16 4-Input ANDs
o 32 Inverters

Example – Memory
• We will now do the mapping. Assume we don't have 4-input AND cells; we will have to break each one into 2-input ANDs as shown.

• We have shown that NANDs and NORs are preferred over ANDs and ORs. We will use Boolean manipulations to convert the ANDs to NANDs and NORs as
shown below.

• The current cell count:


o 2 × 16 = 32 𝑁𝐴𝑁𝐷
o 16 𝑁𝑂𝑅

Example – Memory
• We have another optimization to reduce the cell count:
o Consider the last 4 addresses, the orange gates are all the same so we can remove them and create only one cell thus we save area.

Decoder truth table for addr[3:0] = 0000 through 1111; the orange gates shared by the last 4 addresses are identical and can be merged into one.

Example – Memory
• If we do the same to the entire address generator. We end up with 8 NAND gates instead of 32.

Decoder truth table for addr[3:0] = 0000 through 1111 after sharing the common gates across the whole address generator.

Example – Memory
• Consider the first row in the schematic below:
o We have a NAND with 2 bubbles/inverters at its inputs, a total of 3 gates. We can replace them with one OR gate that will perform the same function.

Example – Memory
• Our circuit now looks like this.
• Let's address the timing aspects of this circuit:
• Each NOR gate drives 8 FFs. This load might be too big for each gate. We can solve this by upsizing the NOR gates, to x4 for example.
• Each NAND drives 4 NOR gates. We might need to upsize it to x2.
• We replaced 2 NANDs with an OR (blue arrows). The speed of an OR is less than a NAND. We might need to convert the OR gates to low-VT cells.
• The red arrows point to the longest and most critical paths: each path goes through an inverter, a NAND, a NOR, then a FF. The cells on these paths need to be faster (if they cause setup violations). We can upsize them or change the VT flavor.

Example – Memory
• The final optimization: The circuit has no output, hence
we don’t need the entire circuit and we can remove it.
This way the area is reduced to 0 ☺.
• Such issues will get detected by post-synthesis
simulation, but you can save your co-workers lots of time
by doing sanity checks on the synthesis logs and
reviewing the modules and cells that got optimized
away.
• There are cases1 where we want to instruct the synthesis
to avoid optimizing certain cells even if they appear
useless to the synthesis tool. This is done using the
“set_dont_touch” command.

[1] : For example, if you have physical-only or analog cells inside your RTL.
Thank You!

Logic Synthesis
Part 2a – FPGA Fabric

Amr Adel Mohammady


/amradelm
Save The Palestinian Children

Israel has killed more than 13,000 children in Gaza since


October 7 while others are suffering from severe
malnutrition and do not “even have the energy to cry”,
says the United Nations Children’s Fund (UNICEF).

“Thousands more have been injured or we can’t even


determine where they are. They may be stuck under
rubble … We haven’t seen that rate of death
among children in almost any other conflict in
the world”
-UNICEF Executive Director

Till Nov,2023
Introduction
• In the previous part we discussed the main steps of logic synthesis and saw some of the optimizations that can be done on the design to meet the
constraints
• In this part we will discuss the synthesis flow on FPGAs and the possible optimizations and settings.
• In order to fully understand how FPGA synthesis works we need to learn what's inside an FPGA and how it operates.
• FPGAs differ from one vendor to another, so the following slides will focus on the XILINX/AMD Series 7 FPGAs.

Configurable Logic Block (CLB)
• An FPGA consists of a matrix of configurable blocks interconnected via a grid of programmable wires
• The most common and main block is the CLB which is used to create logic elements as simple as an inverter or as complex as a multiplier.
• Each CLB consists of 2 subblocks called Slices. The slices can communicate with each other using special vertical routes or using the general routes
(more on this later).

Special Routes

General Routes

Slice
• Each CLB consists of 2 subblocks called Slices. The slices can communicate with each other
using special and fast vertical routes or using general routes.
• The slice consists of:
o 4 Lookup tables (LUT) : Used to create any logic function or small memory elements
o MUX7, MUX8: Used to merge the outputs of multiple LUTs to create larger LUTs
o Carry Chain: Used to implement arithmetic functions
o Storage Elements (FFs/Latches)
LUT
Registers

MUX

Carry Chain (Adders)

Slice
Lookup Table
(LUT)

Lookup Table (LUT)
• The LUT is the main block inside the slice. It is a small memory that can be configured to perform any logic function.
• Inputs to the LUT serve as addresses to its memory, enabling the implementation of any logic function by pre-storing the corresponding output values.
• The smallest LUT in the 7 series XILINX FPGAs is a 6-input LUT. However, we can implement a smaller function with, for example, 3 inputs, by using the inputs we need and tying the rest to logic 0. However, we won't fully utilize the LUT1.

a b | AND OR NAND NOR XOR
0 0 |  0   0   1    1   0
0 1 |  0   1   1    0   1
1 0 |  0   1   1    0   1
1 1 |  1   1   0    0   0

LUT as AND / LUT as OR

[1] : This fact will be important when we discuss area optimizations in FPGA synthesis.
MUXES
(F7MUX, F8MUX)

LUT7 – LUT8
• A larger LUT can be constructed by combining the outputs of multiple smaller LUTs using multiplexers. The example below shows how two 2-input LUTs can be made into a 3-input LUT using a multiplexer.
• The multiplexers inside the 7 series FPGAs are:
o MUX7: combines two LUT_6 to form a LUT_7.
o MUX8: combines two LUT_7 to form a LUT_8.
• The largest combinational logic that can be implemented within a single slice is an 8-input function. If we want to implement larger logic we need to combine multiple slices.

LUT A (x1 x0 -> Out)      LUT B (x1 x0 -> Out)      Combined 3-input LUT (x2 x1 x0 -> Out)
0 0 -> 0                  0 0 -> 0                  0 0 0 -> 0
0 1 -> 0                  0 1 -> 1                  0 0 1 -> 0
1 0 -> 0                  1 0 -> 1                  0 1 0 -> 0
1 1 -> 1                  1 1 -> 1                  0 1 1 -> 1
                                                    1 0 0 -> 0
                                                    1 0 1 -> 1
                                                    1 1 0 -> 1
                                                    1 1 1 -> 1

3-Input LUT built from two 2-input LUTs and a MUX (the same structure is used inside the slice: MUX7 combines two LUT_6 into a LUT_7, MUX8 combines two LUT_7 into a LUT_8)
Slice
LUT Combining
• A LUT6 cell is internally constructed using two LUT5 elements and a multiplexer (MUX).
• If the input (I5) is connected to a dynamic input then LUT6 functions as a 6-input logic gate.
• If the input (I5) is tied to “logic 1” then LUT6 functions as two 5-input gates. However, both gates should have the same inputs.

6 Input Gate Two 5 Input Gates

Muxes
• Each LUT6 can be configured as a 4-to-1 Multiplexer (4 Data ports, 2 Selection Ports).
• If we combine 2 LUT6 with a MUX7 we get an 8-to-1 MUX. And if we combine all 4 LUT6
we get a 16-to-1 MUX.
• The largest MUX that can be implemented within one slice is a 16-to-1 MUX. If we
need to implement a larger MUX, we need to combine multiple slices.

LUT As Memory
(Distributed RAM)

LUT As Memory (Distributed RAM)
• The normal LUT is basically a memory so it can be used as a ROM (Read Only Memory). But can
it be used as a RAM that can be modified dynamically?
• The 7 series FPGAs contain special slices called SLICEM that contain special LUTs with a clock,
write_enable, and write_address pins. These pins allow the LUT to be used as a RAM
• The pins of each LUT are:
o DI: The data bit that will be written inside the LUT
o A[6:1] : Read address.
o WA[6:1] : Write address
o WE: Write enable
o O: The memory output

SliceM
Sync vs Async Operation
• By default, write operations to the LUTRAM are synchronous, occurring on the clock edge, while read operations are asynchronous.
• However, synchronous read operations can be enabled by utilizing a pipeline register at the LUT output, which introduces a one-cycle latency but improves
timing performance by reducing combinational delay.

Asynchronous Read
(Data changes with the address after a combinational delay)

Synchronous Read
(Data changes with the clock edge)

Multi Ports
• FPGAs allow for the creation of memories with multiple read and write ports. This is particularly useful in applications requiring simultaneous access to different memory locations.
• The diagram below shows how to have a dual port RAM:
o The write data and address are passed to 2 LUTs in parallel. Any write operation will write to both LUTs
simultaneously.
o The upper LUT: has the read address “A[6:1]” and write address “WA[6:1]” connected to each other1.
This means you can either read or write from this LUT at a time but not both simultaneously
o The lower LUT: has separate read and write address pins. This means you can read and write
simultaneously
o Based on the above, the operations we can do simultaneously are:
▪ Write (Both) and Read (Lower)
▪ Read (Upper) and Read (Lower)
• The slice has 4 LUTs which means it can be
configured to be up to a 4-port (quad) RAM.
o Similarly the write data and address goes to all LUTs
in parallel
o Under this configuration the operations that can
be done simultaneously are:
▪ Write and 3 Reads
▪ 4 Reads

Dual Port RAM / Quad Port RAM

[1] : The first LUT always has the write and read address connected. So under any configuration, it can either do a write or a read operation at a time but not both. The user guide doesn't mention why the LUT is made so, but it could be that there are no routing resources to accommodate 2 separate ports for read and write addresses.
Deeper Memories
• We can have smaller memories with, for example, 16 locations. We tie the unused most significant address bits to logic "0", but this way we don't fully utilize the LUT.
• The question now is, can we have deeper memories with
more than 64 locations in a single slice?
• We can combine 2 LUT6 (64x1 RAM) to construct a LUT7
(128x1 RAM) using MUX7 as shown in the figure below.
• If we combine all 4 LUT6 we can construct a LUT8 (256x1
RAM). This is the deepest memory that can be implemented
in a single slice. If we need a deeper memory we have to use
multiple slices.
• We can create a dual port 128x1 RAM by using two LUT7

128x1 RAM Dual Port 128x1 RAM 256x1 RAM

Multi-Bit RAMs
• So far all the memory configurations we saw are 1-bit wide. We will now learn how to construct wider (multi-bit) memories.
• To construct wider memories we use multiple memories in parallel. The address and write enable are passed to all RAMs and each RAM has dedicated data in
and data out ports to store one bit as shown in the diagram.
• Since LUT6 is built internally with 2 LUT5, each LUT6 can be used as a 2-bit wide, 32-deep (32x2) RAM. We can use 2, 3, or 4 LUT6 to build 32-deep by 4-, 6-, or 8-bit wide RAMs respectively.
• Other configurations include 64x2, 64x3, 128x2 RAMs etc.

4-Bit Wide RAM 32x2 RAM

Shift Register Logic
(SRL)

LUT As Shift Register Logic (SRL)
• The LUT can be configured as a shift register without using the FFs inside the slice.
• Each LUT6 can be configured to implement a chain of 32 shift registers1.
• The output of LUT6 (MC31) can be passed to the LUT next to it to form a longer
chain (64, 96, or 128).
• The address ports can be used to dynamically read out bits from the chain of
registers.

32-Bit Shift Register Reading Bits From the Shift Reg 96-Bit Shift Register

[1] : LUT6 can't implement 64 shift registers because it consists internally of 2 LUT5 and there is no way to connect the output of one LUT5 to the next LUT5.


Static vs Dynamic Configuration
• You can implement shift registers that are not a multiple of 32.
• The diagram below shows an 8-bit shift register:
o The address is tied to logic “00111”, which means the output Q will always read out bit 8 from the shift
register.
o Since the address port is tied we lose the ability to dynamically read bits from the shift register chain.
o Xilinx calls this a static shift register configuration.

8-Bit Static Shift Register 72-Bit Static Shift Register

Carry Chain

Carry Chain
• The slice contains primitive cells to implement efficient carry look-ahead adders.
The carry look-ahead algorithm calculates the sum as follows:
o Propagate term: Pᵢ = Aᵢ ⊕ Bᵢ
o Generate term: Gᵢ = Aᵢ ⋅ Bᵢ
o Carry term: Cᵢ₊₁ = Gᵢ + Pᵢ ⋅ Cᵢ
o Sum term: Sᵢ = Pᵢ ⊕ Cᵢ
• The main issue lies with the carry term because each bit (Cᵢ₊₁) depends on the previous one (Cᵢ). This creates a long chain of logic:
o C₁ = G₀ + P₀ ⋅ C₀
o C₂ = G₁ + P₁ ⋅ C₁ = G₁ + P₁ ⋅ (G₀ + P₀ ⋅ C₀)
o C₃ = G₂ + P₂ ⋅ C₂ = G₂ + P₂ ⋅ (G₁ + P₁ ⋅ (G₀ + P₀ ⋅ C₀))
o C₄ = G₃ + P₃ ⋅ C₃ = G₃ + P₃ ⋅ (G₂ + P₂ ⋅ (G₁ + P₁ ⋅ (G₀ + P₀ ⋅ C₀)))
• Xilinx 7 series aims to optimize the delay of this long chain using fast carry chain primitives.
• Each slice can implement up to a 4-bit adder. Larger adders will require multiple slices.
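For reference, a plain behavioral adder like the hedged sketch below is what synthesis typically maps onto these carry-chain primitives (widths and names are arbitrary assumptions):

  module add4 (input [3:0] a, input [3:0] b, input cin, output [4:0] sum);
    assign sum = a + b + cin;  // inferred adder, mapped onto the slice carry chain
  endmodule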

Registers

Registers
• There are eight1 storage elements per slice. Four of them can be configured as either
edge-triggered D-type flip-flops or level-sensitive latches.
• There are four additional storage elements that can only be configured as edge-
triggered D-type flip-flops
• Each storage element has the following pins:
o D : The input to the FF/Latch. It comes from a LUT or from the bypass signal 𝐴𝑥
that comes from outside the slice and doesn’t go through a LUT.
o Q: The output of the FF/Latch.
o CE: Write enable signal
o SR: Set/Reset pin.
• Each storage element can be configured during synthesis as follows:
o Is it a latch or FF?
o Does it contain 0 or 1 upon FPGA power-up? (INIT0/INIT1)
o Does the SR pin function as a set pin (1) or a reset pin (0)?
o Is the Set/Reset synchronous or asynchronous? This option affects all storage
elements in the slice and can’t be set individually.


[1] : There are 8 storage elements although there are just 4 LUTs because if we enable LUT combining, each LUT can provide 2 separate outputs (O5 and O6), so under this configuration we can have 8 different outputs from the slice.
Digital Signal Processing
(DSP48E1)

DSP48e1
• The 7 series FPGAs contain lots of DSP slices which are used to implement fast, low-power arithmetic blocks.
• The DSP slice consists of:
o 25-bit pre-adder: (𝐴 ± 𝐷)
o 25 x 18 Multiplier: (𝐴 ± 𝐷) × 𝐵
o Control MUXs
o ALU: ((𝐴 ± 𝐷) × 𝐵) + C
o Pattern Detector
o Optional pipeline registers: to improve timing and synchronize operands

Simplified DSP48e1 Block Diagram

Pre-adder
• The pre-adder adds A (the least significant 25 bits) to D (25 bits) and then passes the result to the multiplier (A MULT), or bypasses it by passing A concatenated with B (A:B) to the ALU (X MUX), or passes A to another DSP slice (ACOUT).
• The pipeline registers inside the pre-adder can be static or dynamic:
o Static means they are always used and can't be bypassed during runtime.
o Dynamic means they can be controlled during runtime by control signals to enable the pipelines or bypass them.
o The dynamic behavior is controlled with the INMODE[4:0] port. The port allows more dynamic operations other than bypassing the pipeline registers. The table below shows all the different configurations.

Static Bypass Muxes / Dynamic Bypass Muxes (figure labels)

INMODE[3] INMODE[2] INMODE[1] INMODE[0] USE_DPORT Multiplier Input


0 0 0 0 FALSE A2
0 0 0 1 FALSE A1
0 0 1 0 FALSE Zero
0 0 1 1 FALSE Zero
0 1 0 0 TRUE A2
0 1 0 1 TRUE A1
0 1 1 0 TRUE Zero
0 1 1 1 TRUE Zero
1 0 0 0 TRUE D + A2
1 0 0 1 TRUE D + A1
1 0 1 0 TRUE D
1 0 1 1 TRUE D
1 1 0 0 TRUE -A2
1 1 0 1 TRUE -A1
1 1 1 0 TRUE Zero
1 1 1 1 TRUE Zero

Control MUXs
• The inputs to the Adder/ALU pass through 3 MUXs. The selections of these MUXs can be controlled dynamically
• MUX X:
OPMODE MUX X Output
00 0
01 M (Output from the multiplier)
10 P (Final output from the same DSP slice to implement an accumulator)
11 A:B (Concatenated A and B)

• MUX Y Inputs:
OPMODE MUX Y Output
00 0
01 M (Output from the multiplier)1
10 48'hFFFF_FFFF_FFFF
11 C (Input port)

• MUX Z Inputs:
OPMODE MUX Z Output
000 0
001 PCIN (Final output from the adjacent DSP slice)
010 P (Final output from the same DSP slice to implement an accumulator)
011 C (Input port)
100 P
101 17-bit shift PCIN (To enable wider multiplier implementation)2
110 17-bit shift P (To enable wider multiplier implementation)2
111 Illegal selection

Inputs to The ALU

[1] : The multiplier output appears at both MUX X and MUX Y because the multiplier produces 2 partial products that should be summed together to get the final result of the multiplier. So each MUX gets one of the partial products.
[2] : Discussed in the next slide.
Wider Multipliers – Partial Products
• We can build wider multipliers using smaller multipliers, shifters, and adders.
• Consider the example on the right:
o We want to multiply two 2-digit numbers but only have a single-digit multiplier.
o We can decompose the 2-digit numbers into smaller single-digit numbers like so:
▪ 35 = 30 + 5
▪ 24 = 20 + 4
▪ 35 × 24 = (30 × 20) + (5 × 20) + (30 × 4) + (5 × 4)
▪ We will ignore the right-hand zeros but we should shift the results by the number
of zeros we ignored
❑ 3 × 2 = 6 then shift by 2 = 600
❑ 5 × 2 = 10 then shift by 1 = 100
❑ 3 × 4 = 12 then shift by 1 = 120
❑ 5 × 4 = 20
▪ Now sum all the terms and get the result = 840
• We can do the same thing in binary: 35 (100011) × 24 (011000), as shown in the example on the right.
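The same decomposition expressed as a hedged Verilog sketch: a 32x32 unsigned multiply built from four 16x16 partial products, shifts, and adds (widths and names are assumptions):

  module mult32x32 (input [31:0] a, input [31:0] b, output [63:0] p);
    wire [31:0] pp_ll = a[15:0]  * b[15:0];
    wire [31:0] pp_lh = a[15:0]  * b[31:16];
    wire [31:0] pp_hl = a[31:16] * b[15:0];
    wire [31:0] pp_hh = a[31:16] * b[31:16];
    assign p = {32'b0, pp_ll}          // no shift
             + {pp_lh, 16'b0}          // shifted left by 16
             + {pp_hl, 16'b0}          // shifted left by 16
             + {pp_hh, 32'b0};         // shifted left by 32
  endmodule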

ALU
• The ALU unit can be configured dynamically using OPMODE[3:2] and ALUMODE[3:0] to
function as:
o Logic unit: to implement all logic functions (AND, NAND, XOR, etc.). When this mode is used,
the multiplier should not be used.
o Adder/Subtractor
• Single Instruction, Multiple Data (SIMD) Mode:
o The ALU can be split into smaller ALUs to do multiple calculations at once.
o The ALU can be configured as two 24-bit ALUs or four 12-bit ALUs. However, the ALUs
should do the same operation (Single Instruction)
o This configuration is static

Simplified DSP48e1 Block Diagram SIMD Config

DSP48e1 – Pattern Detector
• The DSP contains a specialized comparator to detect patterns in the output of the last stage.
• The pattern detector works as follows:
1. Select bits to consider/Mask: a mask is applied on the output to select which bits we want to
check and which we want to ignore. When a MASK bit is set to 1, the corresponding pattern bit
is ignored. When 0, the pattern bit is compared
2. Select comparing pattern: the pattern we want to compare our output to. It can be set statically
as a constant or dynamically using the C port1
3. Select operation on detection: If the pattern is matched, we choose one of many operations.
Such as raising a flag upon pattern detection. Other operations include resetting the P register
when a match occurs or resetting if a match didn’t occur.
• The example on the right shows a 16-bit wide fixed-point number. We want to check whether
the result is both odd (least significant integer bit = 1) and an integer (all fractions = 0):
1. The mask: We only consider the least significant bit of the integer part plus all the fraction bits.
Therefore we apply a mask as shown.
2. Comparing pattern: We want to check that the least significant bit = 1 and all the fraction bits =
0 . Therefore we apply a pattern as shown.
3. Operation on detection: We want to raise a flag when this pattern is detected. By default, the
port “PATDET” is set to “1” when a match occurs so no more work is needed.

[1] : Dynamic mask can only be used when the DSP is configured as a multiplier.
Block RAMs
(BRAM)

Block RAM (BRAM)
• The block RAM in Xilinx 7 series FPGAs stores up to 36 Kbits of data and can be used as a single 36 Kb RAM or configured as two independent 18 Kb RAMs.
• The block RAM can be configured into different depths/widths.
• Since the depth can only be a power of 2, the possible configurations are1:

36 Kbit RAM                 18 Kbit RAM
Depth | Width (bits)        Depth | Width (bits)
32K   | 1                   16K   | 1
16K   | 2                   8K    | 2
8K    | 4                   4K    | 4
4K    | 9                   2K    | 9
2K    | 18                  1K    | 18
1K    | 36                  512   | 36²
512   | 72²

• The RAM sizes (36Kbits and 18Kbits) are not powers of 2 because the RAMs can include parity bits for each byte to allow error detection and
correction.
o 32 Bits (4 Bytes) + 4 Parity bits = 36
o 16 Bits (2 Bytes) + 2 Parity bits = 18
• Each block RAM can be configured as dual-port RAM to allow WRITE/WRITE, READ/READ, or WRITE/READ operations.
o In the case of collision, where both ports are trying to write to the same location, the memory location is written with non-deterministic data.
It’s up to the user to avoid or handle collisions.

[1] : The RAMs can be of any width or depth but won't be fully utilized.
[2] : This data width is only allowed with SDP configuration. See next slide for more info.
True Dual Port (TDP) vs Simple Dual Port (SDP)
• The RAM can be:
o True dual port: A total of 2 ports, each port can read or write this allows you to (READ/READ) or (WRITE/WRITE) or (READ/WRITE)
o Simple dual port: A total of 2 ports, one port can only read and the other can only write. This allows (READ/WRITE) only.
▪ This configuration avoids the (WRITE/WRITE) collision and leaves routing resources for other blocks (less nets and pins).
▪ This configuration allows the write and read ports to have different widths1. For example, write can be 18-bit wide, and read can be 36-bit wide.

Simple Dual Port

True Dual Port

[1] : Either the Read or Write port is a fixed width of x32 or x36 for RAM18 and x64 or x72 for RAM36.
Read and Write Operations
• The write operation is synchronous to the clock edge (can be configured to be positive or negative edge).
• By default the read operation is asynchronous to the clock. This mode is called “Latch mode”
o An optional register can be added to the output port to enable synchronous operation. This separates the long clock-to-output (𝑇𝑐𝑞) delay of the memory from the downstream combinational logic, allowing a higher operating frequency.

Asynchronous Read
(Data changes with the address after a combinational delay)

Synchronous Read
(Data changes with the clock edge)

Write Modes
• In true dual port mode each port can do a read or write operation. A question arises, what happens to the output read port (DO) if a write operation is done? Does
it change or remain unchanged?
• The BRAM can have one of 3 write modes:
o Write First: The data that is being written appears at the output read port.
o Read First: The old data that was stored in the same address that will get overwritten appears at the output read port.
o No Change: The output data from the last read operation remains at the port.

WRITE_FIRST READ_FIRST NO_CHANGE

Byte Wide Write Enable
• The Block RAM allows byte wide write enable where only 1 byte is written and the other adjacent bytes are left unchanged
• This is done using the WE (write enable) port. The port is 4 bits wide, where each bit enables writing of the corresponding byte.
• The example below shows a byte write in the least significant 2 bytes (0011). The data out (DO) port shows that the written data changed the least
significant 2 bytes only

Byte Write With WRITE_FIRST Mode

FIFO Configuration
• The Block RAM has dedicated logic (counters, synchronizers, etc) inside it to be used as a synchronous or an asynchronous FIFO
• The synthesizer will decide whether to configure the BRAM to a synchronous or asynchronous FIFO based on the timing constraints and the definition of
the clocks. You have to carefully check the synthesis log to make sure the BRAM has the required configuration

BRAM As FIFO

Conclusion – Where To Go From Here
• All the blocks we discussed have far more features that are explained in detail in the user guides. However, we now have enough knowledge to know how
synthesis works on FPGAs and understand the synthesis options and optimizations.
• The FPGA blocks can either be manually instantiated in the RTL code or inferred by the synthesis from the behavioral RTL code.
• FPGA vendors provide template RTL codes to help you infer the FPGA blocks with the intended functionality.
• In the next part we will discuss FPGA synthesis options and optimizations.
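As an example of the vendor-style inference templates mentioned above, here is a hedged sketch of a simple dual-port RAM with a registered (synchronous) read that synthesis tools typically map to a block RAM; parameter values and names are assumptions:

  module bram_sdp #(parameter AW = 10, parameter DW = 36) (
    input               clk,
    input               we,
    input      [AW-1:0] waddr,
    input      [DW-1:0] din,
    input      [AW-1:0] raddr,
    output reg [DW-1:0] dout
  );
    (* ram_style = "block" *) reg [DW-1:0] mem [0:(1<<AW)-1];
    always @(posedge clk) begin
      if (we)
        mem[waddr] <= din;
      dout <= mem[raddr];  // registered output (synchronous read)
    end
  endmodule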

Thank You!

Logic Synthesis
Part 2b – FPGA Synthesis Optimization and Settings

Amr Adel Mohammady


Save The Palestinian Children

Israel has killed more than 13,000 children in Gaza since


October 7 while others are suffering from severe
malnutrition and do not “even have the energy to cry”,
says the United Nations Children’s Fund (UNICEF).

“Thousands more have been injured or we can’t even


determine where they are. They may be stuck under
rubble … We haven’t seen that rate of death
among children in almost any other conflict in
the world”
-UNICEF Executive Director

Till Nov,2023
Introduction
• In the previous part we learned the FPGA fabric and the blocks inside it.
• In this part we will discuss the FPGA synthesis options, optimizations, and settings.
• FPGAs differ from one vendor to another, so the following slides will focus on the XILINX/AMD 7 Series FPGAs.

The Golden Rule For Optimization
• Consider the example below:
o The implementation on the top creates a large 8:1 MUX inside one slice:
▪ Lots of signals have to go to one place, which might cause routing congestion.
▪ Forces all the input logic cloud to be placed near the MUX, which might not be the optimal placement for these logic cells.
▪ We have a long route from the MUX to SLICE2.
o The implementation on the bottom breaks down the 8:1 MUX into smaller 2:1 MUXs implemented across multiple slices:
▪ The signals go to different slices so we have no routing congestion.
▪ Each logic cell can be placed individually in its optimal location.
▪ The long route was broken across different slices/MUXs1.
▪ Uses more area/slices as each 2:1 MUX is implemented in a different slice.
• From this example we can see that when we try to fully utilize a slice and pack as much logic as we can inside it, we save area but we might worsen timing or hinder routability.
• Many of the synthesis options we will discuss trade off between saving area by fully utilizing the slices and enhancing timing by dividing the logic across multiple slices to have higher placement and routing flexibility.

Optimize For Area / Optimize For Timing and Routability (figure labels)

[1] : Packing the logic inside a single slice may be better for routing delay since the interconnect delays inside the slice are much less compared to routing delays across different slices. However, this packing may result in long wires that overcome the benefit of containing things in a single slice.
FPGA Synthesis Options

LUT Combining
• In the previous part we saw how LUT6 can be divided into 2 LUT5 to implement 2 separate functions provided they have the same input. This method is called
LUT combining.
• Although this method reduces LUT utilization, it can cause congestion and might degrade timing as it increases the connectivity going to a single slice.
• Synthesis Commands:
o synth_design -no_lc #Disable LC on the entire design
o reset_property SOFT_HLUTNM \
[get_cells -hierarchical -filter {NAME =~ <module name> && SOFT_HLUTNM != ""}] #Disable LC on a specific RTL module

6 Input Gate Two 5 Input Gates

MUXF Mapping
• Using MUXF* primitives helps critical paths with many logic levels while also reducing power and LUT utilization.
• These MUXs are used to group larger logic within a single slice. This grouping forces high CLB input utilization with higher routing demand and limits
placement flexibility when the netlist connectivity is complex, leading to potential higher routing congestion and timing degradation.
• Synthesis Commands:
o opt_design -muxf_remap #Remap all MUXF cells into LUTs
o set_property BLOCK_SYNTH.MUXF_MAPPING 0 [get_cells inst_name] #Disable MUXF on specific RTL module

Wide MUX Implemented Within a Single Slice 16-1 MUX before and after the MUXF* Remapping

Max LUT Input
• You can instruct the tool to limit the number of inputs a LUT can take, resulting in smaller LUTs in the design.
o Advantages: Enhancing the placement flexibility and routability.
o Disadvantages: Higher LUT utilization.
• Synthesis Commands:
o set_property BLOCK_SYNTH.MAX_LUT_INPUT 4 [get_cells module_1] #Possible values are 4,5, and 6

Max LUT Input

SRL Mapping
• Shift registers can be implemented using either standard Flip-Flops (FFs) or utilizing Look-Up Tables
(LUTs) as Shift Register LUTs (SRLs). Each approach has its advantages and trade-offs.
• When Implemented with Flip-Flops (FFs):
o Higher FF Utilization: More flip-flops are used, leading to increased resource usage (less efficient in
terms of area).
o Improved Timing: The long wires needed for data transfer are divided between the flip-flops, which
can lead to better overall timing performance.
o Better Retiming: Flip-flop-based designs allow for retiming optimization, which can improve the
circuit's performance by redistributing the timing of operations.
• Using Shift Register Logic (SRL):
o Reduced FF Utilization: This approach replaces multiple FFs with one LUT, resulting in more efficient area utilization.
o Creation of Long Wires: Shift registers implemented as SRLs may introduce longer wires, which could potentially increase delay and complicate timing closure.
o Limited Retiming: Retiming might be restricted or more challenging, as SRLs are typically implemented as static configurations within LUTs, making them less flexible than flip-flop-based implementations.
• Synthesis Commands:
o synth_design -no_srlextract #Disable SRL usage on the entire design
o synth_design -shreg_min_size <int> #Minimum length for a chain of registers to be mapped onto an SRL
o set_property SHREG_EXTRACT YES [get_cells my_shreg*] #Instruct the synthesis tool to implement certain shift registers <my_shreg[*]> with SRLs

Shift Registers with SRL / Shift Registers with FFs (FFs: better retiming, shorter wires)

SRL Mapping – Cont’d
• Synthesis allows you to pull out one register from the SRL chain from either the input, output
or both. This configuration combines the benefits of both styles (SRL & Registers).
• This config is controlled with the SRL_STYLE1 property. The possible values are:
o register: The tool does not infer an SRL but instead only uses registers.
o srl: The tool infers an SRL without any registers before or after.
o srl_reg: The tool infers an SRL and leaves one register after the SRL.
o reg_srl: The tool infers an SRL and leaves one register before the SRL.
o reg_srl_reg: The tool infers an SRL and leaves one register before and one after the SRL. Shift Registers with SRL
• Synthesis Commands:
o set_property srl_style reg_srl_reg [get_cells my_shifter_reg*]

REG_SRL_REG Config / SRL_REG Config / Regs Extracted from SRL (figure captions)

[1] : Use care when using combinations of SRL_STYLE, SHREG_EXTRACT, and -shreg_min_size. The SHREG_EXTRACT attribute always takes precedence over the others. If SHREG_EXTRACT is set to "no" and SRL_STYLE is set to "srl", registers are used. The -shreg_min_size, being the global variable, always has the least amount of precedence.
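A hedged RTL sketch of a shift register that the tool can map to SRLs; the srl_style attribute shown matches the property discussed above, while the depth and names are assumptions:

  module my_shifter #(parameter DEPTH = 8) (
    input  clk,
    input  din,
    output dout
  );
    (* srl_style = "reg_srl_reg" *) reg [DEPTH-1:0] my_shifter_reg = 0;
    always @(posedge clk)
      my_shifter_reg <= {my_shifter_reg[DEPTH-2:0], din};  // shift in one bit per cycle
    assign dout = my_shifter_reg[DEPTH-1];
  endmodule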
Extract Enable
• To implement a register with enable we have 3 ways (see the sketch after this slide):
1. Adding logic in the data path to select between the new data and the currently stored value
2. Using flip-flops with enable pins1
3. Adding clock gating (not recommended in FPGAs)2
• The synthesis tool will choose between adding logic in the data path (1) or using the enable pin of the FF (2). Sometimes this choice might not be what we want, so we need to instruct the tool on which option to pick.
• Consider the examples below:
o On the left: the enable goes through a long data path, which might cause a timing violation.
o On the right: the enable goes to the enable pin, which is better for timing.
• Synthesis Commands:
o set_property EXTRACT_ENABLE yes [get_cells my_reg] #Use the EN pin for register my_reg
o set_property EXTRACT_ENABLE no [get_cells my_reg] #Use the data path for register my_reg

Data Path / Enable Pin / Clock Gate (figure labels)
Enable Goes Through a Long Data Path / Enable Goes to EN Pin
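A hedged sketch of a register with an enable (names are assumptions); synthesis may map en onto the FF CE pin or build a data-path mux, and EXTRACT_ENABLE lets you steer that choice:

  module en_reg (input clk, input en, input [7:0] d, output reg [7:0] q);
    always @(posedge clk)
      if (en)
        q <= d;   // may become a CE-pin enable (2) or a mux in the data path (1)
  endmodule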

[1] : These FFs implement the MUX within the FF cell layout.
[2] : FPGAs have dedicated routes and cells for the clock tree. Adding LUTs/cells in the clock tree paths will worsen timing and introduce skew.
Direct Enable
• Another similar attribute is “DIRECT_ENABLE”, which is applied on enable signals, compared to “EXTRACT_ENABLE”, which is applied on registers:
o EXTRACT_ENABLE: add any enable signal going to this register to its EN pin.
o DIRECT_ENABLE: add this enable signal to the EN pin of any register it goes to.
• Another useful property of the DIRECT_ENABLE attribute is that it can give a certain enable signal priority over other enable signals.
• Consider the example below:
o The FF has 2 enable signals. They both go through combinational logic and then to the EN pin.
o The signal “main_enable” goes through a long combinational path so it’s better to go
directly to the EN pin without passing the grey combinational logic.
o If we apply DIRECT_ENABLE on “main_enable” the synthesis will add it alone to EN pin
and add the other enable to the data path of the FF.
• Synthesis Commands:
o set_property direct_enable yes [get_nets -of [get_ports ena3]]
#Add this enable signal to EN pin of any register it goes to

Main Enable Goes With The Other Enable to EN Pin Main Enable Goes Alone to EN Pin

Extract Reset / Direct Reset
• Similar to the enable signal, the reset signal can be applied on reset pin or through a data path
• Synthesis Commands:
o set_property EXTRACT_RESET yes [get_cells my_reg] #Use SR Pin for register my_reg
o set_property direct_reset yes \
[get_nets -of [get_ports rst3]] #Add this reset signal to SR pin of any register it goes to

Active High Reset Active Low Reset Active High Set Active Low Set

Gated Clock Conversion
• FPGAs have dedicated routes and buffers to deliver the clocks to the clock sinks
(pins) with minimum skew and latency.
• Adding any logic in the clock path will force the clock to go through the general routes and the LUTs, which will degrade the skew and latency and might degrade timing.
• FPGA vendors therefore recommend converting the clock logic to control signals to
the FFs when possible.
• This conversion can be done automatically using the “gated_clock_conversion”
synthesis option
• Synthesis Commands:
o synth_design -gated_clock_conversion on/off
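A hedged illustration of the conversion (names are assumptions): the gated-clock style the tool detects, and the clock-enable style it converts it to:

  module gated_style (input clk, input en, input d, output reg q);
    wire gclk = clk & en;            // clock gate built from fabric logic
    always @(posedge gclk) q <= d;   // clock leaves the dedicated clock network
  endmodule

  module enable_style (input clk, input en, input d, output reg q);
    always @(posedge clk)
      if (en) q <= d;                // clock stays on the dedicated clock routes
  endmodule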

Gated Clock
Conversion

Clocking Resources in 7 series FPGAs

Memory Types
• FPGAs can implement memories with BRAMs, LUTs (distributed), or registers. Each approach has its advantages and disadvantages.
• BRAM:
o Advantages:
▪ High Storage Density: Block RAM (BRAM) provides higher storage capacity, making it suitable for large memory requirements.
▪ Predictable Performance: BRAMs offer consistent and predictable timing characteristics.
▪ Dedicated Resources: Utilizes dedicated RAM blocks in the FPGA, freeing up LUTs and flip-flops for other logic.
o Disadvantages:
▪ Fixed Size: BRAMs come in fixed sizes, leading to potential wastage if the required memory size doesn't align perfectly with available BRAM sizes.
▪ Limited Number: The number of BRAMs on an FPGA is limited, potentially restricting large or multiple memory implementations
• Distributed:
o Advantages:
▪ Fine-Grained Control: Distributed RAM uses LUTs, providing flexibility in size and configuration.
▪ Small Memory Footprint: Ideal for small memory requirements, allowing for efficient use of FPGA resources.
▪ Closer to Logic: Placed closer to the logic that uses it, potentially improving performance (timing) for small, frequently accessed memories.
o Disadvantages:
▪ Lower Storage Density: Distributed RAM offers less storage capacity compared to block RAM.
▪ Increased Resource Usage: Consumes LUTs, which might be needed for other logic functions.

Memory Types Cont'd
• Registers
o Advantages:
▪ High-Speed Access: Registers provide the fastest access times, suitable for very high-speed applications.
▪ Easy Timing Closure: Simplifies timing closure as registers are inherently faster and have predictable timing.
▪ Placement Flexibility: Each register/bit can be placed individually, providing greater flexibility for placement.
o Disadvantages:
▪ High Resource Usage: Registers consume many flip-flops and logic resources since each register stores one bit, compared to LUTs, which can store 64 bits in one LUT. This can be a limiting factor in resource-constrained designs.
▪ Limited Capacity: Suitable only for very small memory requirements due to high resource consumption.
• Synthesis Command
o set_property RAM_STYLE DISTRIBUTED [get_cells -hier mem_reg*] #Implement memory as LUTs (Distributed)
#Other attributes: BRAM, REGISTERS, AUTO

Figure annotations:
BRAM: can store up to 36 Kbits; footprint = 5 slices; has a built-in address decoder, output multiplexer, and all required logic for memory operation.
LUT6: can store up to 64 bits; footprint = 1/4 slice; has a built-in address decoder, output multiplexer, and all required logic for memory operation.
Register: can store just 1 bit; footprint = 1/8 slice; fastest operation; needs LUTs to implement the address decoder, output multiplexer, and all required logic for memory operation.

FPGA Floorplan
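A hedged inference sketch matching the set_property example above: a small memory with a ram_style hint forcing distributed (LUT) RAM; parameter values and names are assumptions:

  module dist_ram #(parameter AW = 6, parameter DW = 8) (
    input           clk,
    input           we,
    input  [AW-1:0] waddr,
    input  [DW-1:0] din,
    input  [AW-1:0] raddr,
    output [DW-1:0] dout
  );
    (* ram_style = "distributed" *) reg [DW-1:0] mem_reg [0:(1<<AW)-1];
    always @(posedge clk)
      if (we)
        mem_reg[waddr] <= din;
    assign dout = mem_reg[raddr];  // asynchronous read, typical for LUTRAM
  endmodule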

Cascading
• There are two ways to connect multiple FPGA primitives (LUTs, DSPs, BRAMs, etc.):
o General logic and routes
o Cascading: using special dedicated cells and routes inside the primitive to allow connecting adjacent cells
• Cascading, in most cases, provides better timing, power, and area results because cascaded paths use fast dedicated routes and sometimes have fast dedicated logic cells (such as MUXs) to enable cascading.
• Using these dedicated routes and cells also helps save resources for other logic.
• The example on the right shows two 36Kbit by 1-bit wide RAMs being cascaded together to form a larger 64Kbit RAM. The dedicated cells are implemented within the BRAMs and don't require additional LUT usage.
• Synthesis Command
o set_property CASCADE_HEIGHT <int> [get_cells my_RAM_reg] #Limit cascading to <int> stacked BRAMs

Dedicated Address Decoder Extension / Dedicated Routes / Dedicated Output MUX (figure labels)

BRAM Configuration and Cascading
• A maximum of 21 BRAMs can be cascaded if each BRAM is configured as 1-bit wide1.
Note that this doesn’t mean the memory has to be 1-bit wide since we can implement
multi-bit memories with an array of 1-bit wide BRAMs as shown in the figure.
• Consider the example on the right: Each configuration uses 16 BRAMs to
implement a 4-bit x 128K memory:
o The left config uses 4 columns of cascaded memory where
each is 1-bit x 32K
▪ Any memory access needs to access 4 BRAMs simultaneously.
(High power consumption)
▪ The BRAM 𝑇𝑐𝑞 delay is larger because the memory is deeper
(Worse timing)
▪ No needed logic to connect the cascaded BRAMs together
(Less area)
▪ The fast and special routes are used
(Better net delay)
o The right config uses non-cascaded BRAMs where each is 2-bit x 16K
▪ Any memory access needs to access 2 BRAMs simultaneously
(Low power consumption)
▪ The BRAM 𝑇𝑐𝑞 delay is smaller because the memory is shallower (Better timing)
▪ Needs logic to connect the BRAMs together
(More area)
▪ Uses general routes that are slower
(Worse net delay)

[1] : Because there are special routes between only every 2 adjacent BRAMs.
[2] : This condition doesn't apply to Xilinx UltraScale FPGAs.


BRAM Configuration and Cascading – Cont’d

4x128K memory using 4 columns of cascaded BRAMs (less area, higher power consumption, longer mem-to-out delay, uses the dedicated cascaded routes) vs. 4x128K memory using 2 columns of non-cascaded BRAMs (more area, lower power consumption, shorter mem-to-out delay).

[*] : By default, the tool (Vivado) used the case on the left, which produced worse timing and power. This is a clear example to those who claim FPGA synthesis doesn't require manual inspection and work.
DSP Tree vs Cascade
Serial (Cascaded) Implementation / Parallel (Tree) Implementation
• DSPs can be either cascaded serially or connected in a parallel tree fashion.
• Tree implementation:
o Advantages:
▪ Fewer stages and therefore less latency (if pipeline registers are added after each stage); in the example, 2 stages instead of 3, with each adder stage at 𝑇𝑐𝑜𝑚𝑏 = 100 𝑝𝑠.
▪ Better for ASIC designs because the total propagation delay is less.
o Disadvantages:
▪ Worse for FPGA designs because it doesn't use the dedicated cells/adders and routes, which have higher speed and less power consumption.
• Cascaded implementation:
o Advantages:
▪ Better for FPGAs
o Disadvantages:
▪ More latency (if pipeline registers are added)
• Synthesis Command
o synth_design -cascade_dsp <option> #Possible options:
• Force: the sum of the DSP outputs is computed using the block's built-in adder in a cascaded fashion
• Tree: forces the sum to be implemented in the fabric in a tree fashion
• Auto (default): lets the tool decide

[*] : For more info : DSP: Designing for Optimal Results (xilinx.com)
DSP Usage and Thresholds
• By default, multipliers (mults), mult-add, mult-sub, and mult-accumulate type structures are assigned to DSP blocks. Adders, subtractors, and accumulators can also go into DSP blocks, but by default they are implemented with logic (LUT/carry chain) instead. The USE_DSP attribute overrides the default behavior.
• Using DSPs has the following advantages and disadvantages
o Advantages
▪ High-Speed Operations: DSP blocks are optimized for high-speed arithmetic operations, significantly improving the performance of mathematical
computations.
▪ Efficient Resource Utilization: Utilizing DSP blocks frees up general-purpose logic resources (LUTs and FFs) for other parts of the design, enabling more
efficient use of the FPGA's resources.
▪ Lower Power Consumption: DSP blocks can be more power-efficient for arithmetic operations compared to implementing the same operations with
LUTs and FFs.
o Disadvantages:
▪ Limited Availability: The number of DSP blocks on an FPGA is limited, which may become a constraint if the design requires a large number of arithmetic
operations.
▪ Design Complexity: Using DSP blocks might require more careful design and planning to fully leverage their capabilities and ensure efficient utilization.
▪ Technology Dependency: Over-reliance on DSP blocks can make the design more dependent on specific FPGA architectures, potentially reducing
portability to other devices with different DSP block configurations.
• Synthesis Command
o synth_design -max_dsp1 <int> #Maximum number of block DSP allowed in design. (-1 means that the tool will choose the
maximum number allowed
o set_property USE_DSP YES/NO [get_cells hier my_arithmetic_module*] #Implement any arithmetic operations inside this
module with DSPs

[1] : Setting a limit on DSP blocks per module is useful for hierarchical OOC flow where each block is synthesized separately.
Flatten Hierarchy
• The RTL code consists of several modules connected to each other. By default, synthesis tools will synthesize each module separately and then connect them together in the top module, thus preserving the hierarchy and boundaries between the modules.
• Another approach is to remove the module boundaries and put all cells in one hierarchy. This is called flattening and generally produces better timing results. However, it makes PnR and post-synthesis simulation more difficult1.
• FPGAs have another approach which allows the synthesis to flatten the hierarchy, do synthesis, then try to rebuild the hierarchy again based on the original RTL. This lets you get the benefits of flattening while reducing the side issues it can create.
• Synthesis Command
o synth_design -flatten_hierarchy rebuilt #Other options are none and full.

No Flattening Full Flattening Rebuilt

[1] : Because it makes tracing signals and referencing cells more difficult.
FPGA Synthesis Settings – Preconfigured Strategies
• Xilinx Vivado offers several preconfigured strategies for synthesis and implementation, which help optimize various aspects of FPGA design such as
performance, area, power consumption, and compile time. Here’s an overview of some commonly used preconfigured strategies in Vivado:

Options \ Strategies       | Default | Flow_AreaOptimized_high | Flow_AreaOptimized_medium | Flow_AreaMultThresholdDSP | Flow_AlternateRoutability | Flow_PerfOptimized_high | Flow_PerfThresholdCarry | Flow_RuntimeOptimized
-flatten_hierarchy         | rebuilt | rebuilt | rebuilt | rebuilt | rebuilt | rebuilt | rebuilt | none
-gated_clock_conversion    | off | off | off | off | off | off | off | off
-directive                 | Default | AreaOptimized_high | AreaOptimized_medium | AreaMultThresholdDSP | AlternateRoutability | PerformanceOptimized | FewerCarryChains | RuntimeOptimized
-fsm_extraction            | auto | auto | auto | auto | auto | one_hot | auto | off
-resource_sharing          | auto | auto | auto | auto | auto | off | off | auto
-control_set_opt_threshold | auto | 1 | 1 | auto | auto | auto | auto | auto
-no_lc                     | unchecked | unchecked | unchecked | unchecked | checked | checked | checked | unchecked
-no_srlextract             | unchecked | unchecked | unchecked | unchecked | unchecked | unchecked | unchecked | unchecked
-shreg_min_size            | 3 | 3 | 3 | 3 | 10 | 5 | 3 | 3
-max_bram                  | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1
-max_dsp                   | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1
-max_b_cascade_height      | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1
-cascade_dsp               | auto | auto | auto | auto | auto | auto | auto | auto

FPGA Synthesis Settings – Out of Context Flow (OOC)
• In the OOC flow, the Vivado tools synthesize a certain RTL module and produce an OOC design checkpoint (DCP).
• When synthesizing the top-level design, an HDL stub module is provided with the DCP file, and causes a black box to be inferred for the OOC module
• During implementation, the netlists from the OOC module DCPs are linked with the netlist produced when synthesizing the top-level design files, and the
Vivado Design Suite resolves the module black boxes.
• The OOC flow has the advantages of:
o Providing better runtime when synthesizing the top module, because the module is re-synthesized only when changes to its
customization or version require it, rather than as part of every top-level synthesis run.
o Allowing options and strategies (such as limiting the DSP count) to be set separately for each module1.

(1) Module 1 is synthesized separately in the OOC flow   (2) The top module is synthesized while Module 1 is treated as a black box   (3) Module 1 is merged with the top during implementation (Place and Route)

[1] : Some of the synthesis settings can’t be applied as a property on a certain module. In such cases, the OOC flow may be useful.
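• Example: a minimal non-project-mode sketch of the OOC flow (module, file, and part names are illustrative assumptions; exact linking details can vary between Vivado versions):
  o read_verilog fir_core.v
    synth_design -top fir_core -part xc7k70tfbg676-2 -mode out_of_context #Synthesize the module stand-alone
    write_checkpoint -force fir_core_ooc.dcp
  o read_verilog top.v
    read_checkpoint fir_core_ooc.dcp #Add the OOC netlist so the black box can be resolved
    synth_design -top top -part xc7k70tfbg676-2 #fir_core is treated as a black box here and linked during implementation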
FPGA Synthesis Settings – Incremental Synthesis
• Vivado synthesis can be run incrementally. In this flow, the tool stores incremental synthesis information in the generated DCP file so that it can be referenced in later
runs.
• When the design changes, only the changed sections are re-synthesized.
• This significantly reduces runtime for designs with small changes. In addition, the QoR of the design fluctuates less when small changes are
made to the RTL.
• Synthesis Command
o read_checkpoint -incremental <path to dcp file> #Read the previous synthesis checkpoint
o synth_design #Then run synthesis
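• Example: a hedged two-run sketch (the checkpoint file name and part are illustrative assumptions):
  o synth_design -top top -part xc7k70tfbg676-2 #First run
    write_checkpoint -force run1.dcp #Reference checkpoint carrying the incremental synthesis info
  o read_checkpoint -incremental run1.dcp #Later run, after small RTL edits
    synth_design -top top -part xc7k70tfbg676-2 #Only the changed sections are re-synthesized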

Conclusion – Where To Go From Here
• This document discussed only a small part of the FPGA synthesis settings and properties. There is a lot more to learn in the vendor user guides.
• In the next part we will discuss the ASIC synthesis flow and settings.

Thank You!

Logic Synthesis
Part 3 – ASIC Synthesis

Amr Adel Mohammady


Save The Palestinian Children

Israel has killed more than 13,000 children in Gaza since


October 7 while others are suffering from severe
malnutrition and do not “even have the energy to cry”,
says the United Nations Children’s Fund (UNICEF).

“Thousands more have been injured or we can’t even


determine where they are. They may be stuck under
rubble … We haven’t seen that rate of death
among children in almost any other conflict in
the world”
-UNICEF Executive Director

Till Nov,2023
Introduction
• In the previous parts we covered the FPGA fabric and the FPGA synthesis flow.
• In this part we will discuss the ASIC synthesis flow.

The Inputs and Outputs
• The inputs to ASIC synthesis are:
o HDL: The Verilog or VHDL design files
o Constraints: The timing, power, and area constraints
o Timing Libraries: The standard cell libraries.
o Synthesis Commands:
▪ set target_library <STDCELL_LIBRARY>
set link_library "* $target_library io.db rams.db" #See note [1] below
read_verilog <RTL FILES LIST>
current_design <TOP_MODULE_NAME>
link
source <TIMING_CONSTRAINTS>
compile #Synthesize the design
• The outputs are:
o Design netlist
o Various reports about the design such as timing, power or area reports
o Synthesis Commands:
▪ write -f verilog -o ./netlist.v
report_timing
report_power
report_area

[1] : The target_library variable specifies the library that Design Compiler uses to select cells for optimization and mapping. The link_library variable specifies every
library that has cells referenced by the netlist, such as RAMs. The tool uses the libraries specified in the link_library variable for resolving references (linking).
Target Library
• Both FPGA and ASIC synthesis start by creating a tech-independent (generic) netlist.
• After that, the generic netlist is mapped to the target technology and optimized
to meet the constraints.
• The target in ASIC is called a standard cell library:
o It’s a collection of pre-designed and pre-characterized logic gates and other
digital functions used for VLSI design.
o The information can be timing, power consumption, physical layout, logic
functionality, etc
o This information is scattered into multiple files. For example, the timing and
power information exists in the timing lib/db file while the physical layout exists in
LEF/GDS files.
[1]
o These files are sometimes called “Views” (e.g. timing view) as they represent
the cell info from a certain point of view.
NAND Cell: Schematic View | Layout View
Example Timing View2 — cell delay as a function of input transition time (𝒕) and load capacitance (𝑪𝑳):

                 𝒕 = 1.1   𝒕 = 1.2   𝒕 = 1.3   𝒕 = 1.4
  𝑪𝑳 = 10          2.10      2.20      2.27      3.00
  𝑪𝑳 = 20          2.50      3.00      3.45      3.96
  𝑪𝑳 = 30          2.90      3.40      3.80      4.15

[1] : Reference: An Exploration of Applying Gate-Length-Biasing Techniques to Deeply-Scaled FinFETs Operating in Multiple Voltage Regimes. IEEE Transactions on Emerging Topics in Computing. PP. 1-1. 10.1109/TETC.2016.2640185.
[2] : These are arbitrary numbers for demonstration only.
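• Example: a hedged dc_shell sketch of loading and inspecting a standard cell library (the library and cell names are illustrative assumptions):
  o set target_library saed32_hvt.db
    set link_library "* $target_library"
    report_lib saed32_hvt #Report the library units, operating conditions, and cells
    get_lib_cells saed32_hvt/NAND2* #List the available NAND2 drive strengths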
Wire Load Model (WLM)
• For the synthesis tool to know the cell delay and power, it needs to know the input transition and the
capacitive load (in the figure: the INV cap, the wire cap, and the OR cap).
• Both values depend on the cell type and also the wires connecting the cells.
• The cell information is known from the standard cell library. So, the missing info is the wire
values (resistance and capacitance).
• In older technologies, the wire values were estimated using a wire load model (WLM).
• This model estimates the length of a wire (and therefore the resistance and capacitance) based
on the number of fanouts and the block size, as shown in the diagrams below.
• These estimations are based on results from previous designs.

More Fanouts => More Wire Length   |   Larger Block Size => More Wire Length
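• Example: a hedged dc_shell sketch of selecting a wire load model (the model name "8000" is an illustrative assumption taken from a typical vendor library):
  o set_wire_load_model -name "8000" [current_design] #Pick a WLM defined in the target library
    set_wire_load_mode enclosed #How WLMs apply to nets crossing hierarchies: top, enclosed, or segmented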

Wire Load Model (WLM) – Example
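• A worked sketch of how a WLM estimate is used (all numbers below are illustrative assumptions, not values from a real library):
  o Suppose the WLM for this block size maps fanout 1 → 20 µm, fanout 2 → 35 µm, and fanout 4 → 60 µm of estimated wire length.
  o With assumed unit parasitics of 0.2 fF/µm and 0.15 Ω/µm, a net with fanout 4 is estimated as:
    C_wire = 60 × 0.2 = 12 fF and R_wire = 60 × 0.15 = 9 Ω
  o The tool adds C_wire to the input pin capacitances of the fanout cells to get the total load used in the delay lookup.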

Physical Synthesis
• In newer tech nodes the WLM produced bad estimations so tool vendors tried another approach called Physical Synthesis.
• In this approach the floorplan and physical info (techfile, cell layout, parasitics, etc) are passed to the synthesis.
• This allows the synthesis to do cell placement along with logic synthesis and optimization.
• Since, it knows the distance between the cells, the synthesis can more accurately estimate the expected wire length.
• Physical synthesis produces much better results compared to the WLM approach but has a longer design time.
• Two-pass Synthesis: Tool vendors recommend doing physical synthesis in two steps:
1. Synthesize the design with an initial floorplan. The resulting netlist gives info about the cell counts, total area, and congestion, which enables us to create a
better floorplan.
2. Create a new floorplan, then redo the synthesis with the physical info.
• In the next slides we will see the other inputs needed (along with the floorplan) to do
physical synthesis.
One-Pass Synthesis (Not Recommended)

Two-Pass Synthesis (Recommended)
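• Example: a hedged dc_shell sketch of the two-pass topographical flow (the DEF file name is an illustrative assumption; the exact command set depends on the tool version):
  o compile_ultra -spg #First pass with the initial floorplan: placement-aware synthesis
  o extract_physical_constraints floorplan_v2.def #Second pass: read the refined floorplan
    compile_ultra -spg #Re-synthesize with the improved physical info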

Physical Synthesis Inputs – Tech File
• The tech file contains various info about the
technology like:
o The units and precision.
o The coloring of the metals in the GUI.
o The minimum standard cell height and width.
o The design rules such as the layers’ default width
and spacing, etc.
o Via definitions

Example Tech File

Physical Synthesis Inputs – ITF & TLUPLUS
• The ITF (Interconnect Technology File) is a text-based file that contains raw information about
each technology layer such as the thicknesses, resistivity, and dielectric constants
• These values are further processed to generate the TLU+ file, which contains tables of R and C
values as functions of metal layer widths and spacing. This is done while taking into account all
adjacent layers’ effects.
• The TLU+ contents are binary and only contain a text header showing the ITF that was used to
generate the TLU+ file.
• Along with the TLU+ files, we use a layer mapping file that maps the layer names between the tech file
and the TLU+.
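• Example: a hedged sketch of pointing the tool at the parasitic models (the file names are illustrative assumptions):
  o set_tlu_plus_files -max_tluplus max.tluplus -min_tluplus min.tluplus -tech2itf_map layer.map #Max/min corner TLU+ files plus the layer-name mapping file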

Example ITF File | CMOS Cross Section1

[1] : Reference : Okuno, Hanako & Fournier, Adeline & Quesnel, E. & Muffato, V. & Poche, Hélène & Fayolle, M. & Dijon, J. (2010). CNT integration on different materials suitable for VLSI interconnects. Comptes Rendus Physique - C R PHYS. 11. 381-388. 10.1016/j.crhy.2010.06.008.
Physical Synthesis Inputs – LEF (Library Exchange Format)
• The GDS file contains full data about the design layout and masks and is sent to
the fabrication plant to fabricate the chip.
• From a runtime and memory usage point of view, we don’t need all the info of the
GDS when doing placement. We only care about the cell boundary, pin shapes and
locations.
• The LEF file contains only the necessary info needed to perform placement and is
used during physical synthesis and across the PnR stages.
• Once PnR is finished, the LEF views are replaced by the GDS views to produce the
final GDS file that contains all the info needed by the fabrication plant.

[1] : Reference : Automated integrity checks stop out-of-sync data issues in parallel flows (techdesignforums.com)
ASIC Synthesis Options

Critical Range & TNS Optimization
• By default, the tool focuses on enhancing the worst negative slack (WNS).
• The tool optimizes the worst path plus near-critical paths within a window below it. This window is controlled with the critical range variable.
• A critical range of 0.0 means that only the most critical paths (the ones with the worst violation) are optimized. If you specify a nonzero critical range, near-
critical paths within that amount of the worst path will also be optimized, if possible.
• Also, you can instruct the tool to focus on enhancing the entire total negative slack (TNS)
at the cost of additional runtime.
• Synthesis Commands:
o set_critical_range 2.5 top
o set compile_timing_high_effort_tns true

Figure: path slacks showing the WNS, the TNS, and the paths covered with a critical range of 2.5

Arithmetic Blocks Architecture
• Digital blocks have a tradeoff between speed on one hand and power and area on the other. The designer might choose an implementation that consumes more power or has a larger area
but offers higher speed.
• For example, there are different ways to implement binary adders. One implementation is the ripple adder which has small area and power consumption but has
high 𝑇𝑐𝑜𝑚𝑏 , while a carry-look-ahead (CLA) adder has smaller 𝑇𝑐𝑜𝑚𝑏 but takes a larger area.
• The synthesis tool can automatically choose the best implementation to enhance timing or area.
• Synthesis Commands:
o set_dp_smartgen_options -optimize_for [area | speed | area,speed]

Ripple adder: 𝑇𝑐𝑜𝑚𝑏 = 700𝑝𝑠, 𝐴𝑟𝑒𝑎 = 75𝜇𝑚2   |   CLA adder: 𝑇𝑐𝑜𝑚𝑏 = 400𝑝𝑠, 𝐴𝑟𝑒𝑎 = 130𝜇𝑚2

Reference : Kamanga, Isaack. Design Optimization of the 64-Bit Carry Look-Ahead Adder Based on FPGA and Verilog HDL
Register Duplication
• By duplicating registers, the timing paths can be shortened, reducing the wire and
cell propagation delays.
• This also reduces the fanout on the register, which may improve the delay at
the register output.
• Consider the example on the right :
o By duplicating the green registers we managed to move each copy near one of
the blue register
o This, first, reduces the wire length between the green and blue registers and, second, allows us to remove the buffer and inverter pairs on the nets; both
effects reduce the total combinational delay.
o This shows that this method becomes more useful when the capture registers
(the blue ones) are placed far away from each other in the chip.
o However, FF1 now drives double the fanout so the delay of the timing path
between FF1 and FF2 is increased. We need to make sure this increase doesn’t
cause the path to violate setup timing.
• Duplication can be enabled globally or on a cell-by-cell basis
Before Duplication | After Duplication
• Synthesis Commands:
o set compile_register_replication true
#When this variable is set to true, compile tries to
identify registers in the current design that can be split
to balance the loads for better QoR.
o set_register_replication -num_copies 3 <REGISTER>
#Duplicate a certain register 3 times.

Register Merging
• Merging is the opposite of duplication and is done to reduce the area in the design
but might degrade timing.
• Merging can be enabled globally or on a cell-by-cell basis
• Synthesis Commands:
o set compile_enable_register_merging true
o set_register_merging <REGISTER_LIST> true #Merge certain
registers.

Before Merging | After Merging

Preferred MUX Implementation
Standard Cell
• Standard cell libraries have the basic cells needed to build a MUX (2 AND ,1 OR ,1 Inverter) but also have
integrated MUX cells.
• It’s better to use the basic cells to build a MUX because each cell can be placed and optimized individually,
allowing greater flexibility in placement and optimization, which produces better timing and area
results.
• The problem is this approach increases the number of pins. For example, a 2:1 MUX will have 11 pins (6 pins
for the 2 ANDs, 2 for Inverter, 3 for OR) compared to 4 pins for the integrated MUX (2 inputs, 1 output, 1
selection).
• This might create pin congestion and make routing difficult. In such cases, it’s better to use the MUX cells
• ASIC tools allow you to instruct the synthesis about which implementation it should prefer over the other.
• Synthesis Commands:
o set compile_prefer_mux true
#The default flow typically maps most multiplexers to and-or-invert (AOI)
logic in order to minimize area, but in some cases this can result in
congestion hotspots. With compile_prefer_mux enabled, multiplexing logic
that is likely to cause congestion is converted to MUX trees where possible.
o set hdlin_infer_mux all
set_size_only [get_cells -hier * -filter "@ref_name =~ *MUX_OP*"]
#These commands force the compiler to use MUX cells instead of the basic
gates. However, this restricts the tool and might degrade QoR.

Standard Cells

Multi-Bit Banking
• ASIC standard cell libraries contain special flip-flops that can store more than one bit. These FFs are called multi-bit banking registers.
• The area of a multi-bit register is less than the total area of the registers if implemented individually.
• Also, the clock tree has fewer buffers (less area and power) when multi-bit banking is enabled.
• One disadvantage is the limited placement flexibility, since all the bits are forced to be placed at the same location.
• The other disadvantage is the limited CTS flexibility, since all bits are forced to have the same clock latency, which limits fixing timing violations using local skew
optimizations.
• Synthesis Commands:
o set hdlin_infer_multibit [never | default_all | default_none]
#The never setting prevents inference of multibit components from HDL regardless of directives (Verilog) or attributes (VHDL).
#The default_all setting infers multibit components on all bused registers except where directives or attributes indicate otherwise.
#The default_none setting specifies that only attributes or directives are used to infer multibit components. This is the default for the hdlin_infer_multibit variable.

Thank You!

