Cnet Architecture in NetFPGA
Table of Contents
CNET microarchitecture
Clock domains
Write data path
Read Data Path
CNET/CPCI Bus protocol
    Write Transaction
    Read Transaction: Register read
    Read Transaction: Packet read
MAC/Core interface
    MAC Transmit logic
    MAC Receive logic
Ingress FIFO arbitration
SRAM interface
    Write protocol
    Read protocol
DMA FIFO
Ingress FIFO Controller
Diagnostics
    Clock checks
Appendix A: CNET Address Map
CNET Registers
Tx FIFOs
PHY Interface
    To write to a PHY register
    To read from a PHY register
CNET microarchitecture
The main block diagram for the CNET device is shown in figure 1 below. Subsequent sections provide more detail on each block.
[Figure 1: CNET block diagram. Blocks: MAC 0 to MAC 3, each with a Tx FIFO and an Rx FIFO; Ingress Arbiter; SRAM Interface; DMA FIFO.]
Clock domains
The CNET device is complicated by the presence of the four MACs. Each MAC requires five clocks, three of which are common to all MACs and two of which are unique to each MAC. See the section Multiple Cores in the chapter Special Design Considerations of [UG-138]. The clocking structure is shown in figure 2. (Note: the domains show up in color!)
[Figure 2: Clocking structure. Labels: MII_TX_CLK, IBUF, clock logic, and GMII_TX_CLK, GMII_RX_CLK and TXCORECLK for each MAC.]
Write data path
[Figure 3: Write data path. Labels: CNET, ADDR, DATA, CLK (62.5MHz), Registers and the MACs.]
Write transactions occur through the PCI bus and are terminated within the CPCI device. Note: write transactions may also occur as a result of a PCI DMA read (transferring a packet from kernel memory to the relevant MAC Tx FIFO). The CPCI needs to arbitrate between these accesses, though in practice they should be independent.
Internally, the CNET separates writes into two types: writes to FIFOs (packet data) and writes to everything else (registers). The CPCI should not initiate a write to a FIFO unless the relevant FIFO has space for a complete packet (indicated via the Programmable Nearly Full signals). Writes to FIFOs can therefore simply stream through the interface, as there is no conflict. In addition, the CPCI uses the address bits to carry additional information with each FIFO write: the number of valid bytes in the 32-bit word, and whether this is the last word in the packet (EOP). The address map is shown in Appendix A.
Writes will be stored in a FIFO (shown). In general this is not required; however, some of the write timing is not yet determined and writes to PHY registers (MDC/MDIO) are very slow. Before passing write data to the CNET, the CPCI device should check the Almost Full signal from the CNET. If Almost Full is asserted, the CPCI should discard the write transaction and signal an appropriate error (mechanism TBD: register, interrupt, or bus retry). The actual protocol between CNET and CPCI is described in a later section.
Read Data Path
[Figure: Read data path between the CNET and the CPCI device. Signals: CPCI_REQ, CPCI_RD_WR_L, CPCI_ADDR, CPCI_DATA, CPCI_RD_RDY, CPCI_DMA_DATA, CLK (62.5MHz).]
There are two read data paths: one for register reads and a separate physical bus for packet DMA transfers to kernel memory. The register read path uses the same address/data pins as for the write path described in the previous section. The read protocol is a simple request/grant handshake and is described in the following section.
Packet reads (DMA writes to kernel memory) require a separate path in order to make the transfer efficient. Once the CPCI has started a burst write to kernel memory, it must ensure that it does not underrun (for efficiency purposes). Because pull protocols are slow and push protocols are fast, the CNET device effectively pushes the packet to the CPCI. The kernel driver initiates the transfer by writing to a register in the CNET device. This causes the CNET to push the packet into the CPCI's PKT_DATA FIFO. The actual size of the various FIFOs is TBD but will depend on bus throughput between the devices.
CNET/CPCI Bus protocol
Write Transaction
Write transactions are optimized for burst writes to the packet Tx FIFOs. A single write is thus just a very short burst write. The waveform is shown in figure 5 below.
[Figure 5: Write transaction waveform. Signals: CLK62, CPCI_RD_WR, CPCI_REQ, CPCI_ADDR, CPCI_DATA, CPCI_WR_RDY over cycles T1 to T5, with address/data pairs A<N>/D<N> through A<N+3>/D<N+3>.]
All signals are CPCI->CNET except WR_RDY, which is CNET->CPCI, and DATA, which is bi-directional (but always driven by the CPCI during a write). From the writer's viewpoint (CPCI) this looks like a FIFO interface: provided that WR_RDY is high, the CPCI can write data. Data is accepted at every rising clock edge on which REQ and WR_RDY are both high. Note: WR_RDY may be de-asserted for many cycles.
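The acceptance rule above can be modelled in a few lines. This is only a behavioural sketch, not the CNET logic: the function name, the per-cycle sample lists and the address/data values are hypothetical, and it assumes the CPCI holds a word on the bus while WR_RDY is low.

```python
def accepted_writes(req, wr_rdy, addr, data):
    """Return the (addr, data) pairs accepted during a burst.

    Each list holds one sample per rising clock edge. A word is accepted
    on every edge where REQ and WR_RDY are both high.
    """
    accepted = []
    for cycle, (r, rdy) in enumerate(zip(req, wr_rdy)):
        if r and rdy:
            accepted.append((addr[cycle], data[cycle]))
    return accepted


# Hypothetical five-cycle burst in which WR_RDY drops for one cycle.
# Assumption: the CPCI keeps presenting the same word until it is accepted.
req = [1, 1, 1, 1, 0]
wr_rdy = [1, 1, 0, 1, 1]
addr = [0x10, 0x14, 0x18, 0x18, 0x00]
data = [0xA0, 0xA1, 0xA2, 0xA2, 0x00]

burst = accepted_writes(req, wr_rdy, addr, data)
print(burst)
```

The third word is presented twice but accepted only once, on the cycle where WR_RDY returns high.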
Read Transaction: Register read
There may be many cycles from issuing the request to receiving the data. The CPCI must not issue another read REQ until it has seen RD_RDY de-asserted at the end of the current transaction. The complete set of signals used for the CPCI bus interface is in the table below.
Signal        CPCI   CNET   Width  Description
CLK62         In     In     1      System Clock (62.5MHz)
CPCI_REQ      Out    In     1      Request
CPCI_RD_WR_L  Out    In     1      Read (1) or Write (0)
CPCI_ADDR     Out    In     24     Address
CPCI_DATA     InOut  InOut  32     Data
CPCI_TX_FULL  In     Out    4      Indicates if Tx FIFO has space for a max packet (0 = space, 1 = not enough space)
CPCI_RD_RDY   In     Out    1      Read data is ready
CPCI_WR_RDY   In     Out    1      Write is accepted
Read Transaction: Packet read
A packet read is initiated by the DMA controller in the CPCI. See the section DMA FIFO for more details of how packets are transferred from the CNET to the CPCI device.
MAC/Core interface
This section describes the interface exported to the core from each MAC. It is divided into transmit and receive sections. Note: this is the interface to be used in both Control and User applications. The Management interface (stats, configuration, PHY, etc) is not shown but will be driven from the CNET/CPCI bus interface.
MAC Transmit logic
The FIFO is written to from the core side with data 36 bits wide, with bits 35, 26, 17 and 8 being 1 if and only if the corresponding byte is the final byte of the packet. See the NetFPGA Architecture document for more details.
The NEARLY_FULL signal is asserted (high) when there remains insufficient space in the FIFO for a maximum sized packet. The actual value at which NEARLY_FULL is asserted is not yet decided, but it will be a maximum sized packet plus some extra to allow for latency between the CNET and CPCI, so it will be 1518 + latency_clocks*4 bytes. The NEARLY_FULL signals must be synchronized to the PCI clock domain inside the CNET (adding some latency).
The Transmit state machine (MAC_Tx_SM) will initiate packet transmission to the actual MAC once the EOP has been observed on the ingress side of the FIFO. Data will be read out until either EOP is observed on the egress side or an underrun occurs. Once a packet has been transmitted, either PKT_SENT_OK or PKT_UNDERRUN will be pulsed high for one clock.
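As a worked example of the threshold formula above (1518 + latency_clocks*4 bytes), the sketch below computes the required free space and the resulting NEARLY_FULL value. The function names and the latency of 8 clocks are hypothetical; the document leaves the actual latency undecided.

```python
MAX_PACKET_BYTES = 1518  # maximum sized packet, as stated in the text
BYTES_PER_CLOCK = 4      # one 32-bit word can arrive per clock


def nearly_full_margin_bytes(latency_clocks):
    """Free space (bytes) below which NEARLY_FULL should be asserted."""
    return MAX_PACKET_BYTES + latency_clocks * BYTES_PER_CLOCK


def nearly_full(free_bytes, latency_clocks):
    """True when the FIFO can no longer safely accept a maximum sized packet."""
    return free_bytes < nearly_full_margin_bytes(latency_clocks)


# Hypothetical latency of 8 clocks between the CNET and the CPCI.
margin = nearly_full_margin_bytes(8)
print(margin, nearly_full(2048, 8), nearly_full(1540, 8))
```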
MAC Receive logic
[Figure: MAC receive logic. Labels: MAC_Rx_SM, GOOD, BAD.]
Notes:
1. SM MUST store a multiple of 4 bytes.
2. SM MUST store EOP even in an overrun condition.
3. SM MUST NOT start to store a packet unless ALMOST_FULL == 0.
4. SM MUST always store one extra 36-bit word at the end of each packet: 0 = GOOD, 1 = BAD.
The MAC_Rx_SM state machine manages the receive FIFO. The same format is used as is used in the transmit direction - each byte is associated with an extra bit which, if set to one, indicates the last byte of the packet.
Again the FIFO is asymmetrical: it is 9 bits on the write side and 36 bits on the read side. Consequently the state machine must always write bytes in groups of four. If the last byte of the packet is byte 65, then three additional pad bytes must be written. Also, one extra full 36-bit status word is always stored after the last data word. This serves two purposes:
1. Bit 0 of the status word indicates if the packet was good (0) or bad (1). The read side must always read out an entire packet, and the status word will then indicate whether the packet should be kept or discarded.
2. It provides the read process with one extra clock cycle in which to de-assert RD_EN after seeing the EOP bit.
So, for example, if a bad 65-byte packet was received then the read process would see the last few words as:
Word      [35] [34:27]    [26] [25:18]    [17] [16:9]     [8] [7:0]
word 16   0    <byte 64>  0    <byte 63>  0    <byte 62>  0   <byte 61>
word 17   0    <PAD>      0    <PAD>      0    <PAD>      1   <byte 65>
word 18   0    <PAD>      0    <PAD>      0    <PAD>      0   <0x1>
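The word layout in the example can be reproduced with a small packing routine. This is a sketch of the format only, not of MAC_Rx_SM itself: the function name is hypothetical, and the values of the pad bytes and of the unused status-word bits are assumed to be zero because the document does not specify them.

```python
def pack_rx_packet(payload, bad):
    """Pack packet bytes into the 36-bit words seen on the FIFO read side.

    Each byte occupies a 9-bit lane: 8 data bits plus an EOP flag in the
    lane's top bit. The first byte of each group of four goes in bits 8:0.
    """
    if not payload:
        raise ValueError("payload must contain at least one byte")
    last = len(payload) - 1
    lanes = [(0x100 if i == last else 0) | b for i, b in enumerate(payload)]
    # Pad to a multiple of four bytes (pad value assumed to be 0).
    while len(lanes) % 4:
        lanes.append(0)
    words = []
    for start in range(0, len(lanes), 4):
        word = 0
        for lane, value in enumerate(lanes[start:start + 4]):
            word |= value << (9 * lane)
        words.append(word)
    # Extra status word: bit 0 is 0 for a good packet, 1 for a bad one.
    words.append(1 if bad else 0)
    return words


# A bad 65-byte packet in which byte k simply carries the value k.
words = pack_rx_packet(bytes(range(1, 66)), bad=True)
print(len(words), hex(words[16]), words[17])
```

With words counted from 1 as in the table, `words[15]` is word 16, `words[16]` is word 17 (EOP set in bit 8) and `words[17]` is the status word.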
On the read side the reading process should check PKT_AVAIL to see if there is at least one packet available. The PKT_AVAIL signal will go low (invalid) on the cycle after any of the four EOP bits go high, and will then be valid on the following cycle, as shown in figure 9.
[Figure 9: Timing of PKT_AVAIL relative to CLK and EOP.]
Note: there are two reasons for the status word indicating a bad packet. The first is that the MAC saw a bad FCS. The second is that there was insufficient space in the FIFO to store the packet and so some of the packet was discarded.
Ingress FIFO arbitration
[...] complete packet. Figure 10 shows the main signals used by the arbiter. A separate Ingress FIFO Controller manages the various queue pointers and provides relevant signals for each queue, where <N> represents the queue number (N=0..3). The arbiter reads a packet from the RX MAC FIFO and stores it in the appropriate location in the SRAM. The last word indicates if the packet was good or bad. If bad, then the WR_INCR signal is not pulsed.
[Figure 10: Ingress arbiter signals. Between each RX MAC <N> (N=0..3) and the arbiter: DOUT_<N>[35:0], RD_EN_<N>, PKT_AVAIL_<N>. Between the Ingress FIFO Controller and the arbiter: WR_INCR_<N>, WR_PTR_<N>[X:0], FULL_<N>. Between the arbiter and the SRAM interface: WR_DATA[35:0], WR_ADDR[X:0], WR_REQ, WR_RDY.]
The arbiter will store the packet in the SRAM at the offset specified by WR_PTR_<N>. The first word stored will be the length (in bytes) of the packet. This word will also contain the ID of the MAC from which it was received. The last word stored will contain the last byte of the packet (the good/bad indicator is not stored).
The FIFO SRAM is organized for simplicity rather than maximum utilization. It is split into 4 sections, one per MAC. Each section is then divided into chunks of 2KBytes, large enough to hold a maximum sized packet. The SRAM size is TBD, but assuming 2MB, each MAC section is 512KB and can store 256 packets.
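The capacity figure follows directly from the layout, as the sketch below shows. The function names are hypothetical, and the offset calculation assumes the four sections are laid out contiguously in MAC order with byte offsets, which the document does not state.

```python
NUM_MACS = 4
CHUNK_BYTES = 2 * 1024  # one chunk holds one maximum sized packet


def packets_per_mac(sram_bytes):
    """Number of 2KB chunks available to each MAC."""
    return sram_bytes // NUM_MACS // CHUNK_BYTES


def chunk_offset(sram_bytes, mac, slot):
    """Byte offset of a chunk (assumes contiguous sections in MAC order)."""
    section_bytes = sram_bytes // NUM_MACS
    if not 0 <= mac < NUM_MACS:
        raise ValueError("mac must be 0..3")
    if not 0 <= slot < section_bytes // CHUNK_BYTES:
        raise ValueError("slot out of range for this SRAM size")
    return mac * section_bytes + slot * CHUNK_BYTES


SRAM_BYTES = 2 * 1024 * 1024  # the 2MB assumption from the text
print(packets_per_mac(SRAM_BYTES), hex(chunk_offset(SRAM_BYTES, 1, 0)))
```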
SRAM interface
The SRAM is managed by the SRAM interface logic. This acts as an arbiter between two write ports and two read ports as shown in figure 11. The write and read ports are independent, though in practice the CPU interface will use one read port and one write port.
[Figure 11: SRAM interface. Write port signals: WR_0_DATA[35:0], WR_0_ADDR[X:0], WR_0_REQ, WR_0_RDY, WR_1_DATA[35:0], WR_1_ADDR[X:0], WR_1_REQ, WR_1_RDY.]
In order to achieve high throughput the interface logic might implement an internal FIFO of requests (reads and writes). Consequently read data might be delayed by several clocks after the read request is accepted.
Write protocol
The write protocol is shown in figure 12. The writer issues a request. The interface logic will store the write information on each rising clock edge that WR_REQ and WR_RDY are both asserted (T2, T3, T5, and T6 in the figure).
The actual write operation to SRAM may happen several clocks later. Writes and reads are performed in order.
Read protocol
The SRAM read protocol is shown in figure 13 below.
Note: the latency from when the address is latched to when the data is valid is shown as 2 cycles in the figure, but in practice it may be longer. RD_VLD must be used to determine when the read data is valid.
DMA FIFO
The Linux driver is notified (or can read) when a packet has been stored in the external SRAM (via the CPCI_DMA_PKT_AVAIL signal). The driver must then set up the DMA transfer from NetFPGA to kernel memory. The driver must write to internal CPCI registers, specifying which queue should be read from (it is always the packet at the head of the queue).
[Figure: DMA FIFO signals. Between the SRAM interface and the DMA FIFO: RD_ADDR[X:0], RD_REQ, RD_RDY, RD_DATA[35:0], RD_VLD. Between the DMA FIFO and the CPCI: CPCI_DMA_SEND[3:0], CPCI_DMA_PKT_AVAIL[3:0], CPCI_DMA_WR_EN, CPCI_DMA_DATA[31:0], CPCI_DMA_NEARLY_FULL.]
The CPCI must initiate the packet transfer by asserting the appropriate CPCI_DMA_SEND[X] signal. The DMA FIFO unit in the CNET will then transfer the packet at the head of queue X from the SRAM to the FIFO in the CPCI. The entire packet will be transferred, including the length (in bytes) in the first word. The CPCI can continue to assert CPCI_DMA_SEND[X] until it sees the first word transferred (CPCI_DMA_WR_EN = 1), at which time it should de-assert CPCI_DMA_SEND[X].
Once the CPCI has started to receive the packet from the CNET, it can start to DMA the packet into kernel memory. The details of the DMA operation are described in [CPCI-ARCH]. If the CPCI asserts CPCI_DMA_NEARLY_FULL then the CNET will de-assert CPCI_DMA_WR_EN within two clocks (so the CPCI needs to allow for this pipeline delay). See figure 15 for details.
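Ingress FIFO Controller
[Figure: Ingress FIFO Controller waveform. Signals: CLK, FULL_N, WR_PTR_N, WR_INCR_N, EMPTY_N, RD_PTR_N, RD_INCR_N, NUM_IN_Q. NUM_IN_Q is shown moving between MAX-1 and MAX and between 1 and 0 as the write and read pointers increment.]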
Diagnostics
Clock checks
The PCI clock (33.33 MHz) and sysclk (62.5 MHz) can be checked via the registers in the CPCI.
The Ethernet MACs use 5 further clocks: 1 common GTX_CLK used by all four MACs, and 1 receive clock per MAC. The MAC clocks and sysclk can be checked via two diagnostic registers in the CNET: MAC_CLK_CTRL and MAC_CLK_COUNTER. The CNET has a clock checker module (cnet_mac_clk_checker.v) that has a counter connected to each of these 6 clocks.
Before trying to check these clocks you must verify that the PCI clock is functioning (this should be obvious: if it is not, you will not be able to access the board). Then, to check the clocks, do the following:
1. Stop the counters (set RUN bits to 0).
2. Clear the counters (set CLEAR to 1 and then 0).
3. Start the counters (set RUN bits to 1).
4. Stop the counters (set RUN bits to 0 again).
5. Read the counters by setting the SELECT bits to the counter you want to read (per the MAC_CLK_CTRL description in Appendix A: 0 = SYSCLK, 1 = TX MAC CLK, 2 = RX for MAC 0, 3 = RX for MAC 1, etc.).
For example, run the counters for 1 msec and you should see a value of about 125,000 for the MAC clocks and about 62,500 for the sysclk. Allow for things such as clock precision (typically +/- 200ppm) and operating system timings.
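The expected counter values are just frequency multiplied by run time, so a check can be sketched as follows. The function names and the 5% tolerance are hypothetical; the tolerance is deliberately loose because the run time is set by software and so depends on operating system timing.

```python
SYSCLK_HZ = 62.5e6
MAC_CLK_HZ = 125e6
COUNTER_BITS = 24  # the clock check counters are 24 bits wide


def expected_count(freq_hz, interval_s):
    """Nominal counter value after running for interval_s seconds."""
    return round(freq_hz * interval_s)


def count_in_range(count, freq_hz, interval_s, tolerance=0.05):
    """True if count is within the given fraction of the nominal value."""
    nominal = freq_hz * interval_s
    return abs(count - nominal) <= tolerance * nominal


def overflow_time_s(freq_hz):
    """Run time after which a 24-bit counter wraps."""
    return 2 ** COUNTER_BITS / freq_hz


print(expected_count(MAC_CLK_HZ, 1e-3), expected_count(SYSCLK_HZ, 1e-3))
print(overflow_time_s(MAC_CLK_HZ))
```

A reading close to half the expected value would fail `count_in_range` and may indicate that the wrong counter was selected or that the clock is not running at the expected rate.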
Appendix A
Each NetFPGA board occupies a 16MB memory space. Within this space the CNET sub-divides the address space into different areas:
Address range         Size   Function
00_0000 to 3F_FFFF    4MB    CPCI
40_0000 to 4F_FFFF    1MB    CNET registers
50_0000 to 5F_FFFF    1MB    CNET Tx FIFOs
60_0000 to 6F_FFFF    1MB    PHY interface (MDC/MDIO)
70_0000 to 7F_FFFF    1MB    Not used
80_0000 to BF_FFFF    4MB    SRAM1 (Queue SRAM - only 2MB present)
C0_0000 to FF_FFFF    4MB    SRAM2 (Scratch SRAM - only 2MB present)
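The address map can be expressed as a lookup table. The sketch below decodes a board-relative address into a region name and an offset within that region; the function and table names are hypothetical, while the ranges are those listed above.

```python
# (first address, last address, function) for each area of the 16MB space.
ADDRESS_MAP = [
    (0x000000, 0x3FFFFF, "CPCI"),
    (0x400000, 0x4FFFFF, "CNET registers"),
    (0x500000, 0x5FFFFF, "CNET Tx FIFOs"),
    (0x600000, 0x6FFFFF, "PHY interface (MDC/MDIO)"),
    (0x700000, 0x7FFFFF, "Not used"),
    (0x800000, 0xBFFFFF, "SRAM1"),
    (0xC00000, 0xFFFFFF, "SRAM2"),
]


def decode_region(addr):
    """Return (function, offset within the area) for a board-relative address."""
    for first, last, name in ADDRESS_MAP:
        if first <= addr <= last:
            return name, addr - first
    raise ValueError("address is outside the 16MB board space")


print(decode_region(0x40000C))
```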
CNET Registers
The CNET registers are located in address range 40_0000 to 4F_FFFF within the 16MB address space allocated to the board.
Identifier
    Default: <version>c4e7.
    31:16 = Version. 15:0 = 0xc4e7.
Scratchpad
    Access: RW. Default: 0.
    Scratchpad (32 bit read/write).
MAC reset
    Access: RW. Default: 0.
    Write a 1 to the bit in [3:0] for each MAC you want to reset. The reset clears automatically, so you do not need to write a zero after a one. This register reads as zero.
Error (0x00C)
    Access: RW.
    Indicates when a hardware error occurred in the Tx FIFO:
    7:4 = 1 indicates a Tx packet underrun error for that MAC.
    3:0 = 1 indicates a Tx FIFO overrun for that MAC.
    These bits are sticky and will remain 1 until you overwrite them with a zero. Note: the error must be cleared (by resetting the MAC) before clearing the bit. The OR of this register is propagated to the CPCI via pin CNET_ERR.
Enable (0x010)
    Access: RW. Default: 0xFF07.
    Enables various subsystems within the CNET:
    15:12 = Enable RX FIFO output (if 0 then packets will remain in the RX FIFO).
    11:8 = Enable Tx MAC transmission (if 0 then packets will remain in the Tx FIFO).
    2 = Enable Debug bus tri-state.
    1 = Enable Ingress Arbiter.
    0 = Enable Rx DMA.
WR_SRAM1_EOP (0x0F0)
    Access: RW.
    The SRAMs are 36 bits wide. Write to bits 3:0 of this register to specify the data that will be written to bits 35:32 of SRAM1 whenever the CPU writes to SRAM1.
RD_SRAM1_EOP (0x0F4)
    Access: RO.
    3:0 contain the data from bits 35:32 of the last read from SRAM1.
SRAM2 write EOP bits
    The SRAMs are 36 bits wide. Write to bits 3:0 of this register to specify the data that will be written to bits 35:32 of SRAM2 whenever the CPU writes to SRAM2.
SRAM2 read EOP bits
    Access: RO.
    3:0 contain the data from bits 35:32 of the last read from SRAM2.
MAC FIFO Status for MAC 0
    Access: RO.
    25 = 1 if the Rx FIFO for this MAC is empty.
    24 = 1 if at least one packet is available to be read out of the Rx FIFO.
    23:16 = number of packets waiting in Rx FIFO.
    9 = 1 if the Tx FIFO for this MAC is completely full.
    8 = 1 if the Tx FIFO for this MAC cannot accept a maximum sized packet (1518B).
    7:0 = number of packets waiting in Tx FIFO.
MAC 0 packet counters
    Number of packets transmitted on this MAC (read and clear). You can write to this counter.
    Number of packets received by this MAC (read and clear). You can write to this counter.
    Number of packets lost at ingress by this MAC due to FIFO full (read and clear). You can write to this counter.
MAC_CONFIG_0 (0x110)
    Access: RW.
    Configuration for MAC 0.
    5 = 0 for full duplex, 1 for half duplex. DEFAULT: 0
    4 = 1 if you supply FCS bytes on Tx side, else 0. DEFAULT: 0
    3 = 1 if you want Rx to provide the FCS bytes, else 0. DEFAULT: 0
    2 = 1 if you want to enable Jumbo frames, else 0. DEFAULT: 0
    1:0 = MAC rate: 00 = 10 Mbit/s, 01 = 100 Mbit/s, 10 = 1000 Mbit/s (DEFAULT is 10 = 1Gb/s).
    NOTE: ALL four MACs use bits 1:0 of this register to specify the data rate.
MAC 0 additional counters
    Number of packets lost due to bad FCS.
    Number of packets lost due to the receive buffer being full.
    Number of packets received without any errors, i.e. the real number of packets received.
    Number of useful data bytes received.
    Number of bytes sent.
MAC 1
    Four registers: see description for MAC 0.
    Configuration for MAC 1.
    5 = 0 for full duplex, 1 for half duplex. DEFAULT: 0
    4 = 1 if you supply FCS bytes on Tx side, else 0. DEFAULT: 0
    3 = 1 if you want Rx to provide the FCS bytes, else 0. DEFAULT: 0
    2 = 1 if you want to enable Jumbo frames, else 0. DEFAULT: 0
    1:0 not used.
    Number of packets lost due to bad FCS (W, CoR).
    Number of packets lost due to the receive buffer being full (W, CoR).
    Number of packets received without any errors, i.e. the real number of packets received.
    Number of useful data bytes received.
    Number of bytes sent.
MAC 2
    Four registers: see description for MAC 0.
    Configuration for MAC 2.
    5 = 0 for full duplex, 1 for half duplex. DEFAULT: 0
    4 = 1 if you supply FCS bytes on Tx side, else 0. DEFAULT: 0
    3 = 1 if you want Rx to provide the FCS bytes, else 0. DEFAULT: 0
    2 = 1 if you want to enable Jumbo frames, else 0. DEFAULT: 0
    1:0 not used.
MAC 2 counters
    CNET_REG_MF_RX_PKTS_LOST_BAD_FCS_2 (0x194): Number of packets lost due to bad FCS.
    CNET_REG_MF_RX_PKTS_LOST_FULL_FIFO_2 (0x198): Number of packets lost due to the receive buffer being full.
    CNET_REG_MF_RX_GOOD_PKTS_RCVD_2 (0x19C): Number of packets received without any errors, i.e. the real number of packets received.
    CNET_REG_MF_RX_GOOD_BYTES_RCVD_2 (0x1A0): Number of useful data bytes received.
    Number of bytes sent.
MAC 3
    Four registers: see description for MAC 0.
    Configuration for MAC 3.
    5 = 0 for full duplex, 1 for half duplex. DEFAULT: 0
    4 = 1 if you supply FCS bytes on Tx side, else 0. DEFAULT: 0
    3 = 1 if you want Rx to provide the FCS bytes, else 0. DEFAULT: 0
    2 = 1 if you want to enable Jumbo frames, else 0. DEFAULT: 0
    1:0 not used.
    CNET_REG_MF_RX_PKTS_LOST_BAD_FCS_3 (0x1D4): Number of packets lost due to bad FCS.
    CNET_REG_MF_RX_PKTS_LOST_FULL_FIFO_3 (0x1D8): Number of packets lost due to the receive buffer being full.
    CNET_REG_MF_RX_GOOD_PKTS_RCVD_3 (0x1DC): Number of packets received without any errors, i.e. the real number of packets received.
    CNET_REG_MF_RX_GOOD_BYTES_RCVD_3 (0x1E0): Number of useful data bytes received.
    CNET_REG_MF_TX_BYTES_SENT_3 (0x1E4): Number of bytes sent.
Receive queue registers
    RXQ_NUM_PKTS_0 (0x200) is the packet-count register for MAC 0. Each MAC has a packet-count register and a queue pointer register:
    Packet count: 8:0 = Number of packets in SRAM from that MAC (0-256).
    Queue pointers: 23:16 = Write pointer. 7:0 = Read pointer.
MAC_CLK_CTRL
    18:16 = Counter read select: chooses which counter's current value will be read via the MAC_CLK_COUNTER address.
        5 = RX MAC 3
        4 = RX MAC 2
        3 = RX MAC 1
        2 = RX MAC 0
        1 = TX MAC CLK
        0 = SYSCLK
    13:8 = Clear counter. 1 = clear counter, 0 = no effect. This overrides the run bit.
    5:0 = Run counter. 1 = counter is running, 0 = stopped. This is overridden by the clear counter bit above.
MAC_CLK_COUNTER (0xf04)
    Access: RO.
    23:0 = current value of the counter selected by the Counter read select bits in the MAC_CLK_CTRL register.
    NOTE: counter 0 counts at 62.5MHz; the others count at 125MHz. They are only 24 bits wide, so a 125MHz counter overflows after about 134 msec and the 62.5MHz counter after about 268 msec.
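The bit fields of the MAC FIFO Status and MAC_CLK_CTRL registers translate directly into shifts and masks. The sketch below uses hypothetical function names and only handles the value encoding; how the register is actually read or written over PCI is outside its scope.

```python
def decode_mac_fifo_status(value):
    """Split a MAC FIFO Status register value into its documented fields."""
    return {
        "rx_empty": bool((value >> 25) & 1),
        "rx_pkt_avail": bool((value >> 24) & 1),
        "rx_pkts": (value >> 16) & 0xFF,
        "tx_full": bool((value >> 9) & 1),
        "tx_nearly_full": bool((value >> 8) & 1),
        "tx_pkts": value & 0xFF,
    }


def mac_clk_ctrl(select=0, clear_mask=0, run_mask=0):
    """Build a MAC_CLK_CTRL value: select in 18:16, clear in 13:8, run in 5:0."""
    if not 0 <= select <= 5:
        raise ValueError("select must be 0..5")
    if clear_mask & ~0x3F or run_mask & ~0x3F:
        raise ValueError("masks cover six counters (bits 5:0)")
    return (select << 16) | (clear_mask << 8) | run_mask


# Hypothetical status value: 3 packets waiting in Rx, 2 in Tx, Tx nearly full.
status = decode_mac_fifo_status((1 << 24) | (3 << 16) | (1 << 8) | 2)
# Run all six counters while selecting counter 2 (RX MAC 0) for reading.
ctrl = mac_clk_ctrl(select=2, run_mask=0x3F)
print(status, hex(ctrl))
```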
Tx FIFOs
Address bits are used to indicate metadata about each word when the CPCI transfers packets to the Tx FIFOs (for transmission by the CNET):
Bits      Meaning
Bit 7     1 = EOP, else 0
Bits 5:4  MAC Id (0-3)
Bits 3:2  Number of bytes in final word (0=1, 1=2, 2=3, 3=4); only valid when bit 7 (EOP) is set
Bits 1:0  Always 0 (no byte addressing)
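As an illustration of this encoding, the sketch below builds the address for one Tx FIFO word. The function name is hypothetical. It assumes the metadata bits are the low bits of an address inside the Tx FIFO area at 50_0000 from the address map, and it leaves bits 3:2 at zero for non-final words, since the document only defines them when EOP is set.

```python
TX_FIFO_BASE = 0x500000  # start of the CNET Tx FIFO area in the address map


def tx_fifo_address(mac, eop=False, valid_bytes=4):
    """Encode MAC Id, EOP and valid byte count into a Tx FIFO write address."""
    if not 0 <= mac <= 3:
        raise ValueError("mac must be 0..3")
    if not 1 <= valid_bytes <= 4:
        raise ValueError("valid_bytes must be 1..4")
    offset = mac << 4                     # bits 5:4 = MAC Id
    if eop:
        offset |= 1 << 7                  # bit 7 = EOP
        offset |= (valid_bytes - 1) << 2  # bits 3:2 = bytes in final word
    return TX_FIFO_BASE | offset          # bits 1:0 stay 0


# A mid-packet word for MAC 2, then a final word for MAC 1 with 3 valid bytes.
print(hex(tx_fifo_address(2)), hex(tx_fifo_address(1, eop=True, valid_bytes=3)))
```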
PHY Interface
The CNET device drives the MDC/MDIO pins that control the quad PHY device. The PHY itself has many internal registers; this section explains how to access these registers via the CNET. Access to the PHY is achieved via two registers: a command register (CMD) and a status register (STATUS). All PHY registers are 16 bits wide. The sequence of operations is:
Bits    Field
31      Command: 0=READ, 1=WRITE
25:24   PHY Channel (0-3)
20:16   PHY Register (0-31)
15:0    Write Data
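The command fields above can be packed into a 32-bit word as sketched below. The function name is hypothetical, and the sketch covers only the encoding of the value; the write and read sequences themselves, and the layout of the STATUS register, are not shown here.

```python
def phy_cmd(write, channel, register, data=0):
    """Pack a PHY command word: bit 31 command, 25:24 channel, 20:16 register."""
    if not 0 <= channel <= 3:
        raise ValueError("channel must be 0..3")
    if not 0 <= register <= 31:
        raise ValueError("register must be 0..31")
    if not 0 <= data <= 0xFFFF:
        raise ValueError("PHY registers are 16 bits wide")
    word = (1 if write else 0) << 31
    word |= channel << 24
    word |= register << 16
    if write:
        word |= data  # bits 15:0 carry the write data
    return word


# Write 0x1234 to register 5 of PHY channel 2, then read register 31 of channel 0.
print(hex(phy_cmd(True, 2, 5, 0x1234)), hex(phy_cmd(False, 0, 31)))
```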
Bibliography
[UG-138] Xilinx, Tri-Mode Ethernet MAC User Guide.
[CPCI-ARCH] Glen Gibb, CPCI Architecture.