John Hyde - SuperSpeed Device Design by Example (2014)

SuperSpeed Device

Design By Example
John Hyde
USB Design By Example
EZ-USB, FX3 and GPIF are trademarks of Cypress Semiconductor.
All other trademarks or registered trademarks referenced herein are
the property of their respective owners.

© 2010 The SuperSpeed USB Trident Logo used on the front cover
is a registered trademark of the USB Implementers Forum (USB-IF).

All of the Figures in Chapter 1 were provided by the USB 3.0
Promoters Group and are gratefully used with permission.

Some of the SuperSpeed Explorer board photographs were provided
by Cypress Semiconductor and are used with permission.

First Edition: August 2014

Disclaimers
The information in this document is subject to change without notice
and should not be construed as a commitment by USB Design By
Example or Cypress Semiconductor. While reasonable precautions
have been taken, the author assumes no responsibility for any errors
that may appear in this document. No part of this document may be
copied or reproduced in any form or by any means without the prior
written consent of the author.
USB DESIGN BY EXAMPLE MAKES NO WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, WITH REGARD TO THIS
MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE. USB Design By Example reserves the
right to make changes without further notice to the materials
described herein. USB Design By Example does not assume any
liability arising out of the application or use of any product or circuit
described herein.

Cypress does not authorize its products for use as critical
components in life-support systems where a malfunction or failure
may reasonably be expected to result in significant injury to the user.
The inclusion of Cypress’ product in a life support systems
application implies that the manufacturer assumes all risk of such
use and in doing so indemnifies Cypress against all charges.

Copyright © 2014 USB Design By Example

All rights reserved.

ISBN: 1500588059
ISBN-13: 978-1500588052
ACKNOWLEDGMENTS
I have been wanting to write a SuperSpeed USB book for
some time so I must first thank Cypress Semiconductor for giving me
this opportunity. Cypress provided excellent support as I worked
through the many examples in the book and I particularly wanted to
thank Dhanraj Rajput, Sai Krishna, Kailas Iyer, Karthic
Sivaramakrishnan, Jegannathan Ramanujam, Venkat
Pattabhiraman, Madhura Tapse, Ed Rebbelo, Mathu Mani,
Manaskant Desai, Akshay Singhal, Mudabir Kabir, Nikhil Naik, Anup
Shivakumar, Eddie Zelaya and Gayathri Vasudevan for their
excellent answers and explanations to my never-ending questions.
Their contributions were made possible by the high profile that this
project was given by Cypress management - Badrinarayanan
Kothandaraman, Mark Fu and Veerappan Rajaram.

I am fortunate to know a lot of experts and many helped me in
the preparation of this manuscript. I owe particular thanks to Lane
Hauck, Jan Axelson (author of USB Complete), Bob Beauchaine,
Dhanraj Rajput (again), Steve McGowan, Kosta Koeman and
Gordon Euki – their contributions improved the quality and accuracy
of the book and their support was greatly appreciated.

I would like to thank the folks at CreateSpace who made the
process of creating, proofing and printing this book elegant, low
stress and trouble-free. It was by far the best experience that I have
had with all of the books that I have written.
Writing a book, especially one on a technical subject with
numerous examples, is an enormous time commitment so I must
also thank my wife, Lorraine, and other members of my family, Ben,
CJ and Paige, who haven’t seen dad or grand dad for several
months. As you can see, it is now DONE, time to relax and
celebrate!
Introduction - How to read this book
As a USB design consultant I have been supporting many
customer designs in a variety of industry segments. Recently I have
had several clients designing SuperSpeed devices around Cypress’s
FX3 component. Cypress has a large volume of design
documentation that covers the many features of the FX3 family
parts and I found myself constantly explaining which pieces of
these various documents should be studied in detail and which could
be skimmed. Rather than describe everything that the FX3 family
can do, this book will explain what you need to know to design a
high-performance, low-power, standalone, SuperSpeed device. The
FX3 family is more capable than this but I believe that most of you
will be implementing the FX3 as a high-performance, low-power,
standalone, SuperSpeed device. This is the focus of this book.
I will admit that there is a lot that you have to know to be
successful. That is where this book comes in. I incrementally
describe aspects of the design in a series of easily consumable
chapters and I have examples that build throughout the book. You
will not be overwhelmed with data that you don't need; many details
are hidden either because you can't change them or because you
don't need to know them to be successful.

The book is divided into two sections: Chapters 1 through 11,
which you read, and then a Reference Section. While writing the book
I found myself often repeating the same instructions so I moved
these basic instructions, described more fully, to a Reference Section that
you can refer to. I then “call” these sections when needed. I hope
that you find this “subroutine” method better and not annoying. Also,
when I needed to discuss a topic in more detail, or out of order, and
this would disrupt the learning flow, I put this material in a Reference
Section. Finally, rather than duplicate a lot of Cypress
documentation, I will be referencing a selection of their documents
throughout the book.

Cypress already had a low-cost, easy-to-use, development
board, called the SuperSpeed Explorer Board, in the works when I
joined the team. I designed a CPLD-based add-on board to enable
the thorough discovery of the GPIF-II interface and Cypress decided
to productize this board alongside their other IO expansion boards.
I know that you are going to have a lot of fun with this kit. Before you
reach the end of the book you will have the skills and confidence to
design your own SuperSpeed device. You will discover that it wasn’t
that difficult after all!

To get the most value from this book you should have a
Cypress SuperSpeed Explorer Kit and a CPLD Accessory Board.
Install the FX3 Toolset as described in the SuperSpeed Explorer Kit
User Guide and work through the examples in this book. I designed
them to be reusable building blocks and I encourage you to "copy-
and-paste" to create your prototypes. The SuperSpeed Explorer Kit
Guide is an essential companion for this book – this is a free
download from www.cypress.com/fx3 and I suggest that you
download it to your Kindle now so that you will be able to refer to it
if you are on the road or reading this book on a plane.

I received a great deal of help from many talented people at
Cypress and also my technical reviewers; these are listed on the
next page. However, any errors that you find in the book, or in the
examples, are mine alone.

I am already working on the next revision of this book and
would be grateful for all comments and suggestions. What did you
expect to find in this book but was not there? Was something
unclear and in need of additional explanation? If you would like
additional examples please suggest some and I will implement the
most popular requests in the next edition.
Your one-stop-shop for additional information related to this
book is www.cypress.com/fx3book ; any errata or additional
information will be posted here. If you have any questions or
comments about this book or wish to report an error then please
send an email to [email protected] . This is an alias for a
group of people that includes myself; whoever can best answer your
question will reply.

Happy developing! John Hyde, USB Design By Example.


Table of Contents
Chapter 1: SuperSpeed USB is More Than Just Higher Speed
Dual bus architecture.
Review of USB 2.0 operation
USB 3.0 enhancements
USB 3.0 power management
Chapter 2: A SuperSpeed Device Hardware Platform
SuperSpeed Explorer Board
Chapter 3 A Robust Software Base
Multi-threading RTOS 101
Operation from RESET
API Overview
Key ThreadX features
Thread communication
Thread communications using a Queue
DMA Programming Model
Power Aware Programming
FX3 Power Mode Handling
Chapter 4 FX3 Firmware Development
Project Template
Adding Console_In
Adding Parameter Input
Display Program Threads
Display Stack Usage
Adding an Error Indicator
Adding RTOS Visibility
Chapter 5 Exploring the FX3 Low Speed Peripherals
Connecting the CPLD board
Dual Console Project
SPI Example
Chapter 6 SuperSpeed USB communications
Keyboard Example
CDC Example
Debug Console Over USB
Cypress USB examples
BulkLoop Firmware
Streamer firmware
Low speed IO examples
Other examples
Chapter 7 PC Host Software Development
CollectData
Cypress PC Utilities
BulkLoop Utility
Streamer utility
USB Control Center
Commercial USB Port Tester
Chapter 8 FX3 Throughput Benchmark
How Benchmark works
The Producer/Consumer model
The Low Level unmanaged C++ level
Producers
Consumers
OverlappedIO
Mid-level Managed C++ layer
USB Engine
Chapter 9 Getting Started With High-Speed IO.
First GPIF Project
Setting up GPIF II
Setting up a DMA Channel
Design Stage 1
Design stage 2
Design stage 3
Completed Design – a Logic Analyzer
Chapter 10 Moving Real Data, Part 1.
Chapter 11 Moving Real Data, Part 2.
Slave FIFO Design
Third Party Products
FIFO Master Design.
Combined master read and write
Master FX3 FIFO connected to a Slave FX3 FIFO
Load and Run
Programming the CPLD
How the CPLD Programmer Works
Developing your own CPLD Code
Introduction To Verilog
FX3 Lite (Boot) Firmware Library
Building an I2C Debug Console
FX3 Family Members
FX3S designed for storage application
CX3 designed for video capture applications
Chapter 1: SuperSpeed USB is More Than Just
Higher Speed

USB has come a long way since its introduction as a
desktop expansion bus in 1996. In those days USB supported low
speed transfers at 1.5 Mbps and full speed transfers at 12 Mbps
using an A connector at the host and B connectors at devices. The
second generation USB 2.0 introduced 480 Mbps high-speed
transfers in 2001. It is now 2014 and USB is in its third generation
supporting transfers of 5 Gbps using connectors that are backwards
compatible with the 1996 versions.

Everyone knows that USB 3.0 is 5 Gbps, now called
SuperSpeed, but USB 3.0 is more than this, much more. The
application area for USB has expanded well beyond the initial
desktop expansion model such that almost every piece of electronics
equipment manufactured today has some sort of USB connector.
This shifting application area has brought new requirements to USB
and USB 3.0 was also designed to address these. The biggest input
came from the portable electronics industries where battery life is a
key metric. USB 3.0 delivers ten times the throughput at lower power
levels than USB 2.0! This demanded a new low-level
communications implementation and this was done within a
compatible software framework without changing the base
architecture or usage model.

This Chapter presents an overview of SuperSpeed USB 3.0
and why it should be your first choice for all but the lowest
performance USB devices (such as a mouse or keyboard).

Dual bus architecture.


USB 3.0 is, in fact, two independent buses in a single cable.
Figure 1.1 shows a diagram of a USB 3.0 cable which includes an
obvious USB 2.0 bus with the familiar D+ and D- signals and the
new USB 3.0 section. Also shown is the cross-section of a cable; two
variants are supported, since the SuperSpeed signals can be implemented
as a twisted pair or as a micro coax. The designers were really
creative with the connectors at each end, shown in Figure 1.2, and
added the new USB 3.0 signals using contacts generally within the
same physical space or as an obvious connector extension. The
USB 3.0 standard B connector has an addition on the top and the
micro connectors have additions to the side. This has the benefit
that old cables still work if you only want USB 2.0 capability and the
user buys new cables to use new features.

Figure 1.1 USB 3.0 is a Dual Bus Architecture

Figure 1.2 USB 3.0 connectors are a superset of USB 2.0 connectors
accommodating either old or new cables


Note: USB 3.0 connectors are identified by blue inserts

The USB 3.1 Specification has been recently approved and
released and this includes a higher maximum speed attained with
different low level encoding methods. Since this will involve silicon
changes at the host controller and all devices, Cypress and other
silicon vendors are studying the impacts to their current designs, so
don't expect to see compliant products for a year or so yet. Cypress
tell me that they intend to maintain their leadership SuperSpeed
device position and will continue to invest in their FX3 product
family. I can assure you that almost all of what you will learn from
this book will be relevant to a USB 3.1 design. A dramatic change to
USB 3.1 is a new Type-C connector and that is discussed next.
At the time of writing (August 2014) the Specification for the
Type-C connector has just been released by the USB 3.0 Promoters
Group, and a representation (copied from the Specification) is shown
in Figure 1.3. This has been the first major shift in physical
connector in about 20 years since USB was first introduced. Has it
really been 20 years! My, time does fly when you're having fun.

Figure 1.3 Drawing of the new Type-C connector

Again driven by portable applications, the new Type-C
connector is smaller than the current A connector and is still
rectangular but it is now reversible so there isn't a way to plug this in
upside down. This will replace all of the configurations shown in the
previous Figure. There are 12 contacts that appear on both sides of
the connector and full details are described in the specification that is
downloadable from the Developers section at www.USB.org.
The adoption of the Type-C connector by PC OEMs is
expected to be faster than previous connector introductions. The
choice of which connector to use on your upcoming peripheral
device should be straightforward since you can expect cable
manufacturers to be introducing Type-A to Type-C cables quite soon.

Every USB 3.0 cable has two buses and only one of them is
ever operating at a time, the other is suspended and consuming very
little power. A USB 3.0 device is required to also operate as a USB
2.0 device if it is attached to a USB 2.0 only hub so, in this case, only
the USB 2.0 section would be active. Figure 1.5 is drawn to show
the two separate bus systems. We would prefer the USB 3.0 section
active and the USB 2.0 bus suspended. Let's look at the features of
both buses and determine why we should be using the USB 3.0
wires rather than the USB 2.0 wires. The USB 2.0 portion of USB
3.0 is USB 2.0, pure and simple. It operates the same as it has done
since 2001. We will review some of this operation and discover why
it is not well suited to an energy conserving solution for 2014 and
beyond.

Figure 1.5 USB 3.0 is implemented as two separate bus systems


Review of USB 2.0 operation
The initial design of USB assumed that the host was a PC
that was plugged into a power source. Since the existing PC
peripherals at that time were low-cost the decision was made to put
most of the communications intelligence in the USB host controller,
since there was typically only one, and thus allow the USB peripheral
devices to be relatively dumb and therefore cheap. This would help
migration to the ‘new’ USB but to encourage migration there had to
be some tangible consumer benefit. USB 1.0 was the first bus to
provide a specified power source that the USB peripheral device
could use and this eliminated many “power warts” commonplace in
the late 1990s. The single master plus multiple slave architecture
meant that all communications were host centric and devices were
polled to discover if they needed attention. A device could not
initiate a transfer and the model was akin to that of a good child;
listen attentively and only speak when spoken to.

I remember the early days of USB when we presented USB
device operation with polling packets to marketing. After explaining
it several times they did finally understand and were then horrified. “I
can't sell this”, “this doesn't sound very good at all”, “hey, if we
changed the name of polling packets to interrupt packets THEN I
could sell this, I mean, every bus has to have interrupts”. Despite
our insistence that the polling packets were nothing like interrupts,
the name was changed, satisfying marketeers and confusing
engineers. I think we also let marketing choose the colors of wires.

The polling packets, sorry, the interrupt packets and all other
packets are broadcast from the host and are repeated on all
downstream sections as shown in Figure 1.6. The USB 1.0 hub was
a basic repeater and it was assumed that it too was attached to a
power source. USB 2.0 followed the same model but bus traffic
between hubs was always high-speed and the USB 2.0 hub did a
store-and-forward of packets to full and low speed devices.
However packets were always broadcast on all high-speed
downstream ports. Every device on a USB has a unique address.
Each packet contains a device address and all devices on the bus
check this and absorb, or respond to, the packet if it is addressed to
them.

Figure 1.6 USB 2.0 is a broadcast bus


The broadcast approach, although simple, is very wasteful
from a power perspective. All of the devices, and all of the hubs
connecting the devices, have to be powered up and actively
checking every packet in case it is for them. The USB 2.0 situation
is, in fact, much worse than this. The standard method of talking to,
say, a mouse or keyboard, is to poll it at regular intervals (say 8, 16,
32 or 64 msec), with an interrupt packet to see if it has anything to
say. 99% of the time the mouse or keyboard will NAK the packet
since it does not have anything interesting to say. So there is a lot of
busy work going on for little gain.
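
A quick back-of-the-envelope calculation makes this busy work concrete. The sketch below is mine and not part of any Cypress tool: the function name `polling_stats` is hypothetical, and it simply takes the polling intervals and the 99% NAK figure quoted above at face value.

```python
def polling_stats(interval_ms: float, nak_fraction: float = 0.99) -> dict:
    """Return polls per second and how many of them are answered with a NAK."""
    polls_per_second = 1000.0 / interval_ms
    naks_per_second = polls_per_second * nak_fraction
    return {
        "polls_per_second": polls_per_second,
        "naks_per_second": naks_per_second,
        "useful_per_second": polls_per_second - naks_per_second,
    }


if __name__ == "__main__":
    # Polling intervals quoted in the text, in msec
    for interval in (8, 16, 32, 64):
        stats = polling_stats(interval)
        print(f"{interval:2d} msec: {stats['polls_per_second']:7.3f} polls/s, "
              f"{stats['naks_per_second']:7.3f} answered with NAK")
```

Even at the slowest interval every poll is still broadcast to, and checked by, every device and hub on the bus, so the cost is paid many times over.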

Devices such as flash drives, printers and scanners, use bulk
packets to move large amounts of data across USB 2.0. This is also
done in a power intensive, wasteful kind of way. Bulk packets are
scheduled last by the host controller and it is assumed that the host
controller can “fill up the remainder of the frame” with bulk packets.
This is efficient if the peripheral device has data to send or if it has
buffers to accept data being sent to it, however, if the peripheral
device is not ready then it will NAK and the host will reschedule the
transaction for later. This results in more packets that may be
NAKed and rescheduled then NAKed and rescheduled then . . . . . .
USB 2.0 was an improvement over USB 1.1 and added a “Ping”
mechanism (without the data) to check for readiness. This traffic-
saving evolution is taken much further in USB 3.0 as we shall see.

And power is not just wasted on devices NAKing and at non-addressed
devices. The host computer is doing a lot of busy work
too, and consuming more power, organizing and scheduling all of
these transfers. The periodic polling using interrupt transfers or
continued NAKs during bulk transfers creates a lot of memory
access activity that prevents the host computer from powering down
to conserve power.

On the bright side, USB 2.0 does include a SUSPEND feature
that allows the OS to turn off the peripheral (actually put it in a low
power mode) if it is not being used. It may be able to enable this
device to wake itself if this capability is built into the device. This is
an “all or nothing” approach since the device must act like it has just
been attached when it turns back on and this typically takes
hundreds of milliseconds.

USB 3.0 enhancements


Obviously the FIRST thing to do is to get rid of polling! USB
3.0 adds routing information in the packet address and packets are
only directed to the intended recipient. If the recipient happens to be
suspended as the packet is about to be delivered, the owning hub
holds the packet, wakes up the device, then delivers the packet.

Note too that the USB 1.1/2.0 1 msec Start Of Frame (SOF)
indicator is gone. It is no longer needed. Isochronous transfers that
used this SOF information for synchronization now use timestamp
headers within the data packets.

A device is now allowed to initiate an action on USB 3.0;
these device-initiated actions are called notifications.

As shown in Figure 1.1, USB 3.0 is implemented as two
twisted pairs of simplex (unidirectional) wires; one pair is used to
transmit and the other pair to receive. In contrast USB 2.0 uses one
pair of wires for half duplex communication which means that some
time is wasted turning around the bus. A greater advantage of dual
simplex is that the host can start transmitting a second and a third
and fourth etc packet before receiving an ACK for the first data
packet. A USB 2.0 transaction requires three packets, token, data
and handshake and typical bus turnaround is about 350 nsec while
USB 3.0 transactions only need two since two independent simplex
buses are involved.
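
The benefit of not waiting for each ACK can be sketched with a toy timing model. This is only an illustration under stated assumptions: the function names are hypothetical, the 2.0 µs per-packet time is an arbitrary figure chosen for the example, and the only number taken from the text is the roughly 350 nsec turnaround. It is not a model of the real protocol timing.

```python
def stop_and_wait_time_us(n_packets: int, packet_time_us: float,
                          turnaround_us: float) -> float:
    """Half-duplex style: every packet pays the turnaround before the next starts."""
    return n_packets * (packet_time_us + turnaround_us)


def pipelined_time_us(n_packets: int, packet_time_us: float,
                      ack_latency_us: float) -> float:
    """Dual-simplex style: packets go back to back, only the last ACK is waited for."""
    return n_packets * packet_time_us + ack_latency_us


if __name__ == "__main__":
    n = 16                 # packets in the transfer (arbitrary)
    packet_time = 2.0      # usec on the wire per packet (assumed, illustrative)
    turnaround = 0.35      # usec, the ~350 nsec figure quoted above
    print("stop-and-wait:", stop_and_wait_time_us(n, packet_time, turnaround), "usec")
    print("pipelined:    ", pipelined_time_us(n, packet_time, turnaround), "usec")
```

In this simple model the waiting cost grows with the number of packets in the first case but is paid only once in the second.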

Figure 1.7 shows the multiple levels that have been defined
and implemented by USB 3.0. I have included the diagram so that
you can appreciate the amount of thought and engineering effort that
many, many experts have put into this to define a robust high-
performance 5 Gbps bus. I should mention at this stage that a USB
3.0 implementation also involves a new eXtensible Host Controller
Interface (xHCI) Specification; this is designed to be event driven to match the power
conscious USB 3.0 specification. There are more technical details in
both 300+ page specifications than most people can absorb and
fortunately you don’t need to read either of them, unless you are a
silicon vendor or OS supplier, to be successful designing a
SuperSpeed device. For a review of USB hardware and protocols,
including the many high-level USB 2.0 protocols that also apply to
USB 3.0, I would recommend Jan Axelson’s USB Complete. If
you are a silicon vendor I would recommend that you start with
MindShare’s USB 3.0 Architecture book – it is over 650 pages but
is much easier to read.

The Cypress FX3 family of devices described in this book
have passed the rigorous USB Implementers Forum Compliance
Testing so you can be assured that everything in Figure 1.7 is
correctly implemented. You don't need to know the details of link
management or training or error recovery or a whole host of other
details. USB 3.0 is an approved standard that you cannot change.
Cypress have implemented this standard in silicon for you so that
you can focus upon your application.

Figure 1.7 SuperSpeed USB is specified in multiple layers

I'm not going to explain most of Figure 1.7 since there is little
that will help you use SuperSpeed USB. What I will discuss
however is the piece that you do need to know when implementing a
device which is the right-hand side of the Figure called power
management.
USB 3.0 power management
USB 3.0 power management affects all aspects of a USB
system. The host controller architecture hardware and software
drivers are now interrupt or event driven to eliminate polling and
other “busy” work. This is thoroughly documented in the 500-page
xHCI Specification which is a download from Intel's developer site.
The USB communications protocol has been overhauled to eliminate
polling but in an architecturally compatible way so that USB
application software does not have to be rewritten. And we can do
our part by designing power aware SuperSpeed peripherals and this
is described in detail in the next Chapter and throughout this book
using examples. Power management was a fundamental design
criterion of SuperSpeed USB so let's look at some of the details of
how power is conserved while maintaining responsiveness and
performance.

The SuperSpeed bus implements a 5 Gbps (or greater) serial
connection that requires the constant transmission of information
(the logical bus idle) to ensure that the links between devices are
ready to exchange packets with low latency. However this constant
transmission results in constant power consumption so the strategy
to conserve power is to aggressively put links into a standby mode
and only have them operational when data actually has to be
moved. Four link power states are defined, U0 through U3, and
Figure 1.8 (replicated from the USB 3.1 specification table C1)
shows characteristics of these four states.

Figure 1.8 Link States and Characteristics Summary

State  Description      Characteristics                               Transition Initiator          Device Clock Gen  Exit Latency
U0     Active           Link Operational State                        N/A                           On                N/A
U1     Idle, fast exit  Rx Tx Circuitry quiesced                      Hardware                      On or Off         µs
U2     Idle, slow exit  May also quiesce clocks                       Hardware                      On or Off         µs-ms
U3     Suspended        Interface (eg Physical Layer) may be removed  Entry: S/W; Exit: S/W or H/W  Off               ms

In U0 the link is fully operational and performing at maximum
throughput.

U1 is a power saving state that is characterized by fast
transition back to U0. The predominant latency is the time taken to
achieve signal lock between the two link partners. The upstream
device requests that the downstream device move to U1 by sending
a packet. The downstream device can reject the request if it prefers
to stay in U0, for example when it is almost finished preparing some
data that it needs to send upstream.

U2 will use less power than U1 but at the cost of increased
exit latency. As an example the device clock generation could be
disabled to reduce power.
U3 is a deep power saving state where some or all of the
device functionality is removed to save power. Host software is
required to move the device into U3 and the host will probably
enable some mechanism for the device to wake up again if the
device supports it. A new capability, Function Suspend, is applicable
to composite devices which will have multiple independent functions
within a single device. A Selected Suspend can be used to suspend
portions of the device if the device supports this. The host can also
initiate a transition out of U3.
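
The trade-off between the four states can be captured in a few lines. The sketch below is illustrative only: the `LinkState` class and `deepest_state` function are hypothetical names, and the exit latency numbers are placeholders that merely respect the orders of magnitude described above (microseconds for U1, microseconds to milliseconds for U2, milliseconds for U3). They are not values from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    name: str
    description: str
    exit_latency_us: float   # placeholder order of magnitude, not a spec value
    software_entry: bool     # True if host software must request entry


LINK_STATES = [
    LinkState("U0", "Active", 0.0, False),
    LinkState("U1", "Idle, fast exit", 1.0, False),
    LinkState("U2", "Idle, slow exit", 100.0, False),
    LinkState("U3", "Suspended", 10_000.0, True),
]


def deepest_state(tolerable_latency_us: float,
                  host_software_suspend: bool = False) -> LinkState:
    """Pick the lowest-power state whose exit latency is still acceptable.

    U3 is only considered when host software has asked for a suspend,
    since the text says entry to U3 is initiated by software.
    """
    choice = LINK_STATES[0]
    for state in LINK_STATES:
        if state.software_entry and not host_software_suspend:
            continue
        if state.exit_latency_us <= tolerable_latency_us:
            choice = state
    return choice


if __name__ == "__main__":
    for budget in (0.5, 5.0, 500.0):
        print(f"{budget:7.1f} usec budget -> {deepest_state(budget).name}")
```

A link that can tolerate only a few microseconds of wake-up delay stays in U1, while one that can wait longer can drop to U2; U3 is reached only when the host asks for it.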

Hubs play a more important role in a USB SuperSpeed
system. Inactive links are powered down to U3 and active links are
switched from U0 to U1 and U2 often to conserve system power and
it is the hub's role to implement this. Packets are now routed between
devices and the mechanism is shown in Figure 1. This allows the
hub to keep links without traffic in a U2 or U3 state.
As SuperSpeed hubs and devices are added to a
SuperSpeed host the xHCI driver builds an interconnection tree that
describes the physical topology of the SuperSpeed system. Routing
information consists of five nibbles which allow for a maximum hub
depth of five and a maximum of 15 devices per hub (sub-address 0
is used for packets targeted at the hub itself). The hub knows its hub
depth and will route incoming packets to the appropriate downstream
port. If a Header packet is sent to a downstream port that is
currently not in the active U0 state then the hub stores the packet while it
activates the link. Data packets are only sent by the host to active
links and these are forwarded immediately by the hub which must
include some buffering to maintain high throughput. The hub also
maintains activity timers on all downstream ports and once these
expire a transition to U1 or U2 is initiated. The hub ensures that link
power hierarchy is maintained so it will never allow an upstream port
to enter a power state lower than any of its downstream ports.
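
The routing and the power hierarchy rule can both be sketched in a few lines. Treat this as a minimal sketch under assumptions rather than a description of any real hub: all function names are hypothetical, I have assumed that the lowest nibble belongs to the hub closest to the host and that hub depth is counted from zero, and link states are represented by their U number (0 to 3), where a larger number means lower power.

```python
MAX_HUB_DEPTH = 5   # five nibbles of routing information
MAX_PORT = 15       # a nibble holds 0..15; 0 means "the hub itself"


def encode_route(ports):
    """Pack a list of downstream port numbers (host side first) into nibbles."""
    if len(ports) > MAX_HUB_DEPTH:
        raise ValueError("route is deeper than five hubs")
    route = 0
    for depth, port in enumerate(ports):
        if not 1 <= port <= MAX_PORT:
            raise ValueError("port numbers must be 1..15")
        route |= port << (4 * depth)
    return route


def next_port(route, hub_depth):
    """Return the downstream port a hub at this depth forwards to (0 = the hub)."""
    if not 0 <= hub_depth < MAX_HUB_DEPTH:
        raise ValueError("hub depth must be 0..4")
    return (route >> (4 * hub_depth)) & 0xF


def deepest_upstream_state(downstream_states):
    """Upstream port may not be in a lower power state than any downstream port."""
    return min(downstream_states, default=3)


if __name__ == "__main__":
    route = encode_route([2, 5])          # port 2 on the first hub, port 5 on the second
    print(hex(route))
    print(next_port(route, 0), next_port(route, 1), next_port(route, 2))
    print("upstream may go as deep as U%d" % deepest_upstream_state([3, 1, 2]))
```

A nibble of zero at a hub's own depth tells it that the packet is for the hub itself, which is why only 15 of the 16 possible values are available for downstream routing.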

Hubs also handle packet error detection and recovery. Hub
error detection focuses on verifying that Header packets have the
correct CRC. In addition, Header packet delivery must be
recognized within the logical idle datastream. The occasional
Header packet error is typically managed by port-to-port protocols
via the retransmission of the failed packet. A Data packet also
implements a CRC which the hub checks but retransmission is not
allowed; the hub reports errors back to the host driver, which must
implement some error recovery mechanism.

Also released in August 2014 was the USB Power Delivery
2.0 Specification which raises the power delivery capacity of a host
or other downstream port from less than 5 W to about 100 W. This
Specification is also downloadable from the Developers section of
www.USB.org. The obvious benefit is that your phone or tablet will
now charge faster but this is also a huge benefit for peripheral
designers who now will not have to add a power connector on most
devices. Expect new hub products soon that implement the power
delivery specification.
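
The two power figures follow from simple arithmetic. The voltages and currents below are not stated in this chapter; they are the commonly quoted figures (5 V at 900 mA for a USB 3.0 port, and 20 V at 5 A as the top Power Delivery level) and should be treated as my assumptions for the purpose of the example. The function name `watts` is hypothetical.

```python
def watts(volts: float, amps: float) -> float:
    """Power delivered by a port at the given voltage and current."""
    return volts * amps


if __name__ == "__main__":
    usb3_port = watts(5.0, 0.9)      # assumed USB 3.0 port limits
    pd_maximum = watts(20.0, 5.0)    # assumed top Power Delivery level
    print(f"USB 3.0 port:   {usb3_port:.1f} W")
    print(f"Power Delivery: {pd_maximum:.1f} W")
```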

This Chapter covered the essential differences that
SuperSpeed USB has over previous USB generations. The FX3
component described in this book manages the protocol complexity
so that you don't have to. You need to be aware of power
conservation and this is described in detail in the next chapter and
throughout the book using examples.
Chapter 2: A SuperSpeed Device Hardware
Platform

Cypress over-engineered the FX3 device to produce a family
of products that deliver on the requirements of Chapter 1. They
used their high transistor count budget wisely and built in a great
deal of “hardware assist” features to make the FX3 family of
products easy to program and able to deliver on SuperSpeed
throughput at power levels suitable for battery operation. Figure 2.0
shows the 1000 foot view of the FX3 family members - they contain
HUGE data pipes that allow real-world data to be moved in and out
of a host computer at 5 Gbps at “portable” power levels. The FX3
will not be the performance bottleneck in your design!

Figure 2.0 The FX3 is a HUGE portable data pipe


The focus of this book is on the base family member, the fully
programmable FX3 device. I will explain everything that you need to
know to be successful in building a high-performance, low-power,
stand-alone USB device. The FX3 can also operate as a
coprocessor and as a high-speed OTG host and these are covered
in the reference section. Two family members, the FX3S which is
optimized for storage applications and the CX3 which is optimized
for video applications are also covered in the reference section.

This Chapter provides an introductory tour of the FX3. Its
major features will be highlighted but I don't describe the details until
later Chapters since the resources of the FX3 are best explained
with the use of working examples.

A block diagram of the FX3 is shown in Figure 2.2.

Figure 2.2 FX3 Block Diagram


The heart of the FX3 is a sophisticated, distributed DMA
controller that is capable of moving data at 800 MBps. This DMA
controller is attached to the internal devices via sockets; these are
shown in green in Figure 2.2. A socket provides a consistent
interface to the DMA controller side and is customized on the device
side such that all internal devices look like “standard” block I/O
devices. This allows the hardware to manage continuous data
transfers between sockets without intervention from the CPU. This
includes multiple buffering schemes and this autonomous operation
is key to the FX3’s throughput. There is a lot more to say about the
DMA controller and this will be covered a little later in the Chapter
once we know which devices the DMA controller can talk to. For
now we will continue our overview tour of the block diagram starting
with the CPU block then moving clockwise around the diagram.

A detailed view of the CPU block is shown in Figure 2.3. As
seen, the FX3 family is built around a 200 MHz ARM9 processor with
integrated 8 KB ICache and 8 KB DCache, both of which are
typically enabled. The CPU can run from a 19.2 MHz crystal or can
be clocked from an external source. The CPU also includes a
standard PL192 Vectored Interrupt Controller and a standard JTAG
port for program download and debugging.

Figure 2.3 Detail of CPU Block


The CPU has three memories connected directly to it. 32 KB
of ROM holds the boot code for the device; the FX3 can boot from a
connected serial EEPROM (I2C or SPI), from USB or from various
GPIF II interface configurations. In this book I will boot from USB
most of the time and will implement an I2C EEPROM boot in
Chapter 5. The FX3 uses an internal preset VID and PID which is
recognized by the CyUsb3.sys driver. The default operation and
methods to change this are demonstrated in Chapter 7.

The CPU has 16 KB of tightly coupled instruction memory and
8 KB of tightly coupled data memory. The development tools put
interrupt service routines in the I-TCM and program stacks in the D-
TCM. This gives maximum performance to your program and it is
not recommended that you change this.

The CPU module is also in charge of the system clocks as
shown in Figure 2.4. A range of clocks can be generated and most
can be turned off to save power. The CPU has various power
conserving modes and discussion of these is deferred until later in
this Chapter.

Figure 2.4 System Clocks are configurable

Moving clockwise, the next block we encounter is the GPIO
(General Purpose Input Output) block. At power-on all 61 GPIO
lines are configured as inputs; however, the boot loader is the first
program to run, so it configures the GPIOs it needs to support the
selected boot mode. Rather than repeat a lot of detailed information
here I refer you to AN76405 FX3 Boot Options which describes
explicitly the state of each GPIO pin; note that some pin assignments
are not obvious so I recommend you study AN76405 before
assigning your IO pins.

The 61 GPIO lines each have the circuitry shown in Figure 2.5.

Figure 2.5 Each of 61 GPIO pins has this circuitry


Each GPIO pin has selectable drive strength (up to 20mA
source and sink in four steps), optional pullups and pulldowns and a
keeper circuit that maintains IO levels during power saving modes.
Each GPIO can be set up to generate a CPU interrupt on either level
or on either or both edges. After the boot loader has run, all 61 GPIOs
are available for general-purpose usage. Of these 61 GPIOs, 8 can
be configured as a complex IO. A complex IO is a timer or counter
and these are described in the next section. Note that the high-speed
GPIF IOs and the low-speed I2C, UART, I2S and SPI modules
also use the GPIO pins, so as these modules are enabled there are
fewer GPIOs for general-purpose use. However, you should not run
out of IOs.

Figure 2.6 shows a complex IO being used to provide an
output signal. You choose an input clock from one of the four
system clocks then set the threshold and period to generate a single
pulse, a PWM signal or a software timer. Once set up the PWM is
autonomous and its operation does not depend upon CPU action.
We will use this capability in Chapter 5 to generate a fault indicator.
Figure 2.6 Driving Complex IO as an Output
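
The threshold and period arithmetic can be sketched in a few lines of C. This is only an illustration that runs on a PC: the PwmSettings structure and PwmCompute function are hypothetical names of my own, not part of the FX3 API, and I have assumed a counter that wraps at the period and drives the pin active while the count is below the threshold.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the FX3 SDK: timer settings for a PWM output. */
typedef struct {
    uint32_t period;    /* counter ticks in one full PWM cycle  */
    uint32_t threshold; /* ticks for which the output is active */
} PwmSettings;

/* Assumes a counter clocked at clockHz that wraps at 'period' and drives
   the pin active while the count is below 'threshold'. */
static PwmSettings PwmCompute(uint32_t clockHz, uint32_t pwmHz, uint32_t dutyPercent)
{
    PwmSettings s;
    assert(pwmHz > 0u && pwmHz <= clockHz && dutyPercent <= 100u);
    s.period = clockHz / pwmHz;
    /* 64-bit intermediate so that period * dutyPercent cannot overflow */
    s.threshold = (uint32_t)(((uint64_t)s.period * dutyPercent) / 100u);
    return s;
}
```

For example, a 1 MHz input clock and a 1 kHz PWM at 25% duty cycle would need a period of 1000 ticks and a threshold of 250 ticks.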

Figure 2.7 shows a complex IO being used to monitor an input
signal. The timer and period can be set up to measure time between
input signal edges or can be set up to count input signal edges. This
too operates without CPU supervision and some input conditions can
be set up to cause a CPU interrupt.

Figure 2.7 Sampling Complex IO as an Input


The next stop on our tour is the GPIF II (General
Programmable Interface, Gen 2) block which is shown in Figure 2.8.
The GPIF II logic uses the same programmable philosophy as an
FPGA; it is RAM-based and consists of an uncommitted array of
logic elements that must be programmed following a power on (how
is described below). Much of the throughput capability of the FX3 is
due to this soft programmable state machine that can operate at up
to 100 MHz from an internal or external clock. The 32 high-speed IO
lines are controlled using up to 14 bidirectional control lines. My
GPIF II examples in Chapter 8 assume that you will be using all 32
IO lines so that you can achieve maximum throughput to your
hardware.

Figure 2.8 Overview of GPIF block


You use the graphical interface of an off-line tool, the GPIF II
Designer, to create state machines to control, and respond to, your
external hardware. GPIF II Designer compiles your state machines
into a .h file that is included in your project. At runtime this
configuration information is loaded into the GPIF II engine. The
GPIF II block includes 32-bit address, 32-bit data and 16-bit control
counters and comparators that can also be used to control state
transitions. The state machines also have access to socket flags
such as DMA channel ready and DMA watermark exceeded. 32
sockets are available for GPIF II use which means that up to 32
independent data transfers could be taking place at any one time. I
haven't used more than 8 concurrent transfers yet! I haven't even
used a quarter of the possible 256 states either so I doubt that you
are going to run out of headroom in this block. Cypress provides
examples of standard interfaces such as slave FIFO, Asynchronous
RAM and Multiplexed address and data, so if your external hardware
is similar to one of these then you will have a head start. Chapter 8
works through several custom examples of GPIF II use and we will
write Verilog code for an external CPLD to create some systems
solutions.

The next stop on our tour is the low-speed peripherals block
and this is shown in Figure 2.9.

Figure 2.9 Low speed peripherals block

These low-speed peripherals are used to connect devices
such as an EEPROM and a debug console and are also useful if
your external GPIF II hardware needs an I2C or SPI control path.
Each peripheral has two sockets to connect to the DMA fabric but
even the fastest block, SPI at 33 MHz, is not going to create much of
a load. These low-speed devices are typically accessed via their
internal registers, but DMA transfers may also be set up such that
the CPU need not be bothered with low-level character IO. The I2C
channel, for example, can include a setup preamble that the DMA
controller prepends to DMA transfers. If hardware could be added to
simplify and speed data transfers then Cypress added it! Examples
of how to use each peripheral interface are included in Chapter 5.

The next peripheral block on our tour is USB as shown in
Figure 2.10.

Figure 2.10 Overview of USB block

The FX3 has an on-chip SuperSpeed PHY and a Pipe 3.0
interface. Since all USB 3.0 devices must also operate at high
speed the FX3 also includes a USB 2.0 PHY. Additionally the FX3
can operate as a high-speed OTG host and this is described in the
reference section. The USB block implements all 32 possible
endpoints and each is paired with a socket such that 32 different
data transfers could be going on at the same time. The USB block
does not contain any endpoint buffering since this is done using
main memory and the DMA controller which can keep up with
SuperSpeed data transfers.

Alongside the USB block is the EZ-Dtect block that, when
enabled by the processor (there is an example in Chapter 6), allows
the USB-PHY to detect the presence of a connection to a USB
charger. In the OTG 2.0 specification, the OTG-ID line is a simple
on/off signal indicating whether the device is connected as a Host
(A-device, ID = 0) or as a Peripheral (B-device, ID = 1). The on-state
is generated through enabling a pullup resistor and detecting
whether the ID line is floating or terminated with a pulldown resistor.
In the Battery Charging Specification revision 1.2, the
functionality of this pin has been expanded in that the strength of the
pulldown resistor on this signal can indicate to the Device that an
Accessory Charger Adapter (ACA) is present. The ACA is a device
that enables a single USB port to be attached to a charger and also
to another device simultaneously. The strength of the pulldown
resistor also indicates to the Device its role (Host or Peripheral) and,
if it is a Peripheral, whether it is allowed to connect to the USB Bus
(by enabling a pullup resistor on D+) and initiate communication
with the USB Host.

The Battery Charging Specification revision 2.0 has three
resistor values:

Resistor      Value          Device Type
RID_A_CHG     102…114 kΩ     USB OTG Host (A Device)
RID_B_CHG     171…189 kΩ     USB OTG Peripheral (B Device), may not connect
RID_C_CHG     256…284 kΩ     USB OTG Peripheral (B Device), may connect

For the second case, where the Device is a Peripheral but
may not connect its data lines, the OTG Host conserves power by
not enabling VBUS, so the Peripheral needs to first attempt to
activate a session by initiating the Session Request Protocol (SRP),
as is given in the OTG 2.0 Specification.
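
As a small illustration of how the table above maps a measured ID resistance to a role, here is a sketch in C. The RidType enumeration and the ClassifyRid function are hypothetical names for this example only; the FX3 detects these ranges in hardware, so nothing like this appears in the FX3 API. The ranges are copied from the table.

```c
#include <stdint.h>

/* Hypothetical classification of the ID pulldown, using the table ranges. */
typedef enum {
    RID_UNKNOWN,
    RID_A_CHG,  /* USB OTG Host (A Device)                       */
    RID_B_CHG,  /* USB OTG Peripheral (B Device), may not connect */
    RID_C_CHG   /* USB OTG Peripheral (B Device), may connect     */
} RidType;

/* 'ohms' is the measured pulldown resistance on the ID line. */
static RidType ClassifyRid(uint32_t ohms)
{
    if (ohms >= 102000u && ohms <= 114000u) return RID_A_CHG;
    if (ohms >= 171000u && ohms <= 189000u) return RID_B_CHG;
    if (ohms >= 256000u && ohms <= 284000u) return RID_C_CHG;
    return RID_UNKNOWN; /* outside every range listed in the table */
}
```

A value that falls between two ranges is reported as unknown rather than being rounded to the nearest entry.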

Other charging specifications (e.g. Apple) use different resistor
values; the FX3 can detect the following values/value ranges:

Less than 10Ω, less than 1KΩ, 35KΩ to 39KΩ, 65KΩ to 72KΩ,
102KΩ +/- 2%, 119KΩ to 132KΩ, > 220KΩ and 440KΩ +/- 2%.

The next stop on the tour is power modes and the various
“power planes” of the FX3 are shown in Figure 2.11.

Figure 2.11 FX3 Power domains


The FX3 component gives you the flexibility of using different
supply voltages for different IO blocks depending upon the hardware
that the FX3 is connected to. The CPU core voltage, Vdd, must be
1.2V and the USB block operates at Vbus but all other VIO voltages
can be set to voltages between 1.8V and 5.0V. They could all be
connected to a single 3.3V supply in a low cost system. If a block is
not being used then it is not powered. Additionally the state of USB
3.0 is monitored such that the USB block can be switched to a lower
power link state or even suspended – this is described in more detail
in the next Chapter when I present the software API. The CPU too
can be slowed, halted or suspended to save power.

The last stop on our tour is the System RAM and the
Distributed DMA Controller. The data paths to and from RAM have
been designed with maximum throughput as the goal. Multiple
Advanced High-performance Buses (AHB, as defined by the ARM
System Architecture) are used to interconnect the system elements.
I have drawn a scale diagram in Figure 2.12 where the width of the
connection is used to show the data throughput available from the
various blocks of the FX3.

Figure 2.12 Data paths to and from RAM

Data access to RAM is zero wait state at 200 MHz and is
made 16 bytes at a time. The data path to and from the RAM is
therefore 3.2 GBps. The CPU also uses this data rate for cache line fills. The
high-speed interconnect bus has separate read and write paths each
supporting 3.2 GBps. USB also has separate read and write buses
and can support 100 MHz by 8 bytes wide (800 MBps)
simultaneously. Keeping up with SuperSpeed data transfers is not a
problem! The data path to GPIF II is more modest, “only” 4 bytes
wide at 100 MHz so data can be read or written to the outside world
faster than it can be written or read from USB. There is also a 400
MBps bus for the low speed peripheral block. This has bursty traffic
which handles individual FIFOs for each low speed device. The
MMIO (Memory Mapped IO) bus is used to read and write the
individual registers of each device and this bus does not support
DMA.
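
The bandwidth figures quoted above are simply clock rate multiplied by bus width. The sketch below reproduces that arithmetic; BusMBps is a hypothetical helper of my own and the numbers passed to it are the ones given in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bus bandwidth in megabytes per second: clock (MHz) x width (bytes).
   Hypothetical helper used only to check the figures quoted in the text. */
static uint32_t BusMBps(uint32_t clockMHz, uint32_t widthBytes)
{
    assert(clockMHz > 0u && widthBytes > 0u);
    return clockMHz * widthBytes;
}
```

With the values from the text, BusMBps(200, 16) gives the 3200 MBps (3.2 GBps) RAM path, BusMBps(100, 8) gives 800 MBps for each USB bus and BusMBps(100, 4) gives 400 MBps for GPIF II.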

The CPU is granted 50% of the high-speed interconnect bus if
it needs it. The CPU loads and runs its code out of system RAM but
in a well-designed system the CPU will be asleep a lot of the time
since the DMA hardware will be moving all of the data on its behalf.
Remaining high-speed interconnect bus bandwidth is shared round-
robin fashion between the other AHB bridges. These multiple, wide
buses consume a great deal of silicon real estate on the FX3 die but
the result is tremendous system throughput that allows your high-
speed IO device to use the maximum bandwidth that SuperSpeed
USB provides.
Figure 2.13 shows an overview of DMA operation. If DMA
transfers are set up between USB and GPIF II as shown, they will
run at the maximum SuperSpeed rate. If the DMA transfer involves
the CPU then, in general, the rate will be slower due to processing
overhead.

Figure 2.13 Overview of DMA operation


DMA transfers to and from the low-speed devices will run at
an average rate defined by the low-speed device. Note that all the
transfers shown in Figure 2.13 can be operating simultaneously and
this will have little effect on the maximum transfer rate over USB.
Also, the Figure only shows 6 sockets on each of USB and GPIF II; there
are, in fact, 32 sockets on each – this is a lot of capability!

The Cypress documentation has many pictures showing
different buffering schemes for moving data and discusses DMA
descriptors and different signaling methods. In general you can skip
this discussion since you will never have to set up or maintain these
data structures. The DMA device driver does this for you, allowing
you to focus on what you want to do rather than how. We won't
start using the DMA controller in earnest until Chapter 6 and then I
will describe it using a series of examples.

SuperSpeed Explorer Board


Figure 2.14 shows a photograph of the FX3 based
development board specially designed to give you easy access to
SuperSpeed USB technology. This eight layer board brings all of the
high speed GPIO signals and the low speed peripheral signals out to
0.1” pitch headers.

Figure 2.14 SuperSpeed Explorer board

Figure 2.15 is a close up of one corner of the board to show
that the 0.1” connectors have a 0.2” pin extension on the top side
so that you can connect a logic analyzer, jumper wires or a ‘scope.
The primary connection method used by the Cypress extension
boards is the two 40-pin sockets which therefore attach to the
‘bottom’ of the SuperSpeed Explorer board.

Figure 2.15 All high speed and low speed pins are accessible
Figure 2.16 shows a block diagram of this board. There is an
integrated debugger that includes a serial connection and a JTAG
port for debugging. There is also a user LED, a user button and an
I2C EEPROM so that we can do experiments with the basic board.

Cypress also supplies three adapter boards and one CPLD
board that plug onto the Explorer board to give you access to more
IO. These will be described as we use them throughout the book.

Figure 2.16 Block diagram of SuperSpeed Explorer board


This Chapter described the Cypress FX3 SuperSpeed device
component and attributes of all of its functional blocks. It was
specifically architected to support SuperSpeed data transfers at full
bus bandwidth and achieves this with a distributed DMA controller that
has multiple, wide, parallel data paths, 0 wait state memory at
200MHz and a 32 bit parallel programmable protocol bus. There is
also a selection of low-speed IO blocks to interface to external
components and an ARM CPU to coordinate all of this hardware.

The next Chapter will look at the software that needs to be
written to convert this hardware into a usable system.

Chapter 3 A Robust Software Base

Chapter 2 described a unique set of hardware designed to
enable maximum SuperSpeed throughput at ‘portable’ power levels.
The low-level timing details of the heavily-coupled units, especially
the distributed DMA channels, require a lot of detailed analysis and
timing tuning. Rather than burden the user with these intricate, low-
level details, Cypress provides an RTOS (Real Time Operating
System) and device drivers for all of this specialized hardware. The
RTOS is Express Logic’s ThreadX (Version 5.1) and all of its
features are imported into the FX3 environment. Figure 3.1 shows a
programmer's view of the FX3 family platform.

Figure 3.1 Programmer's view of the FX3 family platform

This chapter will cover the base software that is not specific
to any IO block: the RTOS itself, the API used to access the FX3 hardware,
DMA programming and power-aware programming. This is a
prerequisite for later chapters.

I’m sure that many readers will applaud discovering that they
will be writing their application on top of a robust RTOS – you may
skip the next section! For those of you who shuddered when you
read the word RTOS, let me describe why this is a good thing...

Multi-threading RTOS 101

You will have to learn some new words and concepts to be
successful with a multi-threading RTOS. This will take some effort so
let me first explain the benefit of becoming familiar with these new
terms.

You may have heard the terms task or multi-tasking; what then
is multi-threading? The term task is used in operating system
literature in a variety of ways; it sometimes means a separately
loadable program, and it sometimes refers to an internal program
segment. To avoid this confusion there are two terms that have,
more or less, replaced the use of task: process and thread. A
process is a completely independent program that has its own
address space (the Windows operating system uses this model),
while a thread is a semi-independent program segment that
executes within a shared address space. Most embedded
applications cannot afford the overhead (both memory and
performance) associated with a full-blown process-oriented
operating system. For these reasons, ThreadX implements a thread
model, which is both extremely efficient and practical for most real-
time embedded applications.

You probably write your code using flow charts or state
machines. Flow charts are good for describing sequential processes
while state machines are good if there are small numbers of possible
states with well-defined transition rules. However, both are poor at
describing more complex systems with several interdependent parts.
Multi-threading, on the other hand, is a good fit for such systems -
you define a thread to handle each part of the system then define
how the parts interact.

A significant weakness of the sequential and state machine
approaches is that they are inflexible. A good programmer can
initially create a workable solution using these approaches but as
requirements change and marketing demands (or oversells)
enhancements the workable design invariably turns into spaghetti
code that is difficult to debug and even worse to maintain. The
multi-threading RTOS approach forces you to structure code so that it
can grow and change easily. Changes are implemented by adding,
deleting or changing some threads while leaving other threads
unchanged. Since your code is compartmentalized into threads,
propagation of changes through the code is minimized. This also
reduces testing efforts. So, you have some hard work now to save a
lot of time and effort later - this is a good deal.

The first paradigm shift you will need to make is to partition
your program into a set of smaller pieces - each will do one job and
will do it very well. Once your application is divided into several
threads you will define how these threads interact. The primary
inter-thread communications mechanism is an event, and several
operations are defined for an event such as Create, Set and Get. A
thread that creates data will signal with an event when it has data,
while a thread that consumes data will wait until an event is
signaled. Figure 3.2 shows a simple embedded program split into
multiple threads, three in this case; SignalA would be a Set EventA
while WaitA would equate to a GetA.

Figure 3.2 A program converted into multiple threads
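
To make the Set and Get idea concrete before we meet the real API, here is a tiny model of an event that runs on a PC. Everything in it (EventModel, EventSet, EventTryGet, EVENT_A) is a hypothetical name of my own and it is single-threaded: where a real RTOS would suspend the waiting thread, this sketch just reports that the event has not been signaled yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of an RTOS event group: one bit per event. */
typedef struct { uint32_t flags; } EventModel;

#define EVENT_A (1u << 0)

/* Producer side: "SignalA" sets the bit to say that data is ready. */
static void EventSet(EventModel *ev, uint32_t mask)
{
    ev->flags |= mask;
}

/* Consumer side: "WaitA" succeeds only once the bit is set, then clears it.
   A real RTOS would suspend the calling thread instead of returning false. */
static bool EventTryGet(EventModel *ev, uint32_t mask)
{
    if ((ev->flags & mask) != mask) return false;
    ev->flags &= ~mask;
    return true;
}
```

The consumer sees each signal exactly once because a successful Get clears the bit; a second Get fails until the producer signals again.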

We will work through a real example in a moment using real
ThreadX code rather than the theoretical pseudo-code shown in
Figure 3.2 so don’t focus upon the details yet. All will become clear
with a few examples.

Each thread is written as if it has sole ownership of the CPU
and you must now consider that GetData() runs continuously.
(What did happen to input data while you were processing and
outputting data before?) You could now allocate the coding of each
thread to different programmers with different areas of expertise.
Also if a better data processing algorithm is discovered or an
improved output device becomes available then only one thread has
to be changed; you need not be concerned about the impact on the
other threads since they now operate independently.
Are you beginning to see some of the benefits of this
“divide-and-conquer” approach?

When you divide your program into multiple threads you will
decide that some are more important than others and you can assign
these a higher priority. Figure 3.3 shows a multi-threading RTOS
task state diagram (copied from the ThreadX User Guide). As
threads are Created they are placed on the Suspended list or on
the Ready list. The RTOS picks the highest priority thread on the
Ready list and makes this the Executing thread; execution of this
thread continues until it is blocked for some reason (waiting for a
resource, such as an event or a timer), when it is placed on the
Suspended list. The RTOS then makes the highest priority thread on
the Ready list the Executing thread, and so the process
continues. There is a system-defined thread, the IdleThread, which
has the lowest priority and is always ready to run; this typically
switches the CPU to a low power state, enables interrupts then halts.

Figure 3.3 A thread has five states
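
The selection rule described above can be sketched as a simple scan of the thread list. This is a model for illustration only, not how ThreadX is implemented: ThreadModel and PickNextThread are hypothetical names, only two of the states are modeled, and I have used the ThreadX convention that a smaller number means a higher priority.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { THREAD_READY, THREAD_SUSPENDED } ThreadState;

/* Hypothetical, simplified view of a thread as the scheduler sees it. */
typedef struct {
    const char *name;
    uint32_t    priority; /* ThreadX convention: 0 is the highest priority */
    ThreadState state;
} ThreadModel;

/* Returns the index of the ready thread with the highest priority, or -1 if
   nothing is ready (an always-ready idle thread prevents that in practice). */
static int PickNextThread(const ThreadModel *threads, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (threads[i].state != THREAD_READY) continue;
        if (best < 0 || threads[i].priority < threads[(size_t)best].priority)
            best = (int)i;
    }
    return best;
}
```

When the executing thread blocks, its state changes to suspended and the same scan chooses the next thread to run.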


Maybe we are getting a little too deep here. My dilemma is
that we have a chicken-and-egg situation here – I want to describe
the software environment for running an FX3 program but we haven’t
learnt how to develop software yet! I decided to describe what we
are going to do in this Chapter and how we are going to do it in the
next Chapter. This will allow you to focus upon the new key
concepts without getting distracted by the nuances of the
development environment. So, for this Chapter I will describe an
example program and then use pre-compiled object code which I will
load and run using the USB Control Center application.

Let’s start at the beginning which is a RESET.

Operation from RESET


Following a RESET the software environment for FX3 must
be set up; the steps taken during this initialization are shown in
Figure 3.4 where the colored blocks are handled by Cypress code
and we are responsible for the white blocks.

Figure 3.4 Setting up the FX3 software environment following a RESET

Cypress code initializes the ARM CPU environment (MMU,
VIC, core clocks etc), loads our program into RAM (including
initializing all interrupt vectors), initializes the C runtime environment
and finally calls our Main() routine. At this time the RTOS is not
running. We have the opportunity to adjust the CPU speed to match
our application's requirements, then optionally enable the ICache and
DCache. We then choose which IO devices will be initially
operational and then we start the RTOS. Figure 3.5 shows the
Main() routine for our first example; all FX3 programs start the same
way (but with different parameters).

Figure 3.5 The Main() routine for our first example


// Main sets up the CPU environment then starts the RTOS
int main(void)
{
    CyU3PIoMatrixConfig_t ioConfig;
    CyU3PReturnStatus_t Status;

    // Start with the default clock at 384 MHz
    Status = CyU3PDeviceInit(0);
    if (Status == CY_U3P_SUCCESS)
    {
        Status = CyU3PDeviceCacheControl(CyTrue, CyTrue, CyTrue);
        if (Status == CY_U3P_SUCCESS)
        {
            CyU3PMemSet((uint8_t *)&ioConfig, 0, sizeof(ioConfig));
            ioConfig.useUart = CyTrue;    // We'll use this in the next example
            ioConfig.lppMode = CY_U3P_IO_MATRIX_LPP_UART_ONLY;
            ioConfig.gpioSimpleEn[1] = 1 << (45 - 32);    // Button is on GPIO_45
            Status = CyU3PDeviceConfigureIOMatrix(&ioConfig);
            if (Status == CY_U3P_SUCCESS) CyU3PKernelEntry();    // This does not return
        }
    }

    while (1);    // Get here on a failure, can't recover, just hang here
                  // Later we shall do something more elegant here
    return 0;     // Won't get here but compiler wants this!
}

The RTOS does a great deal of initialization which includes
creating several threads to manage the IO blocks. It will eventually
call a known routine, CyFxApplicationDefine, where we add our
application functionality. My first example is the simplest that I could
think of: it uses the user button on the SuperSpeed Explorer Board
to change the blink rate of the user LED. Figure 3.6 shows the code
needed to implement this. As simple as it is, this example includes
many key concepts so it is worth an exhaustive look.

Figure 3.6 The Application Thread for our first example


void GPIO_InterruptCallback(uint8_t gpioId)
{
    if (gpioId == Button) Delay = (Delay == 1000) ? 100 : 1000;
}

void ApplicationThread(uint32_t Value)
{
    CyU3PGpioClock_t GpioClock;
    CyU3PGpioSimpleConfig_t GpioConfig;
    uint32_t Counter = 0;

    // Since this application uses GPIO then I must start the GPIO clocks
    GpioClock.fastClkDiv = 2;
    GpioClock.slowClkDiv = 0;
    GpioClock.simpleDiv = CY_U3P_GPIO_SIMPLE_DIV_BY_2;
    GpioClock.clkSrc = CY_U3P_SYS_CLK;
    GpioClock.halfDiv = 0;
    // Initialize the GPIO driver and register a Callback for interrupts
    CyU3PGpioInit(&GpioClock, GPIO_InterruptCallback);

    // Configure LED and Button GPIOs
    // LED is on UART_CTS (currently assigned to the UART driver), so claim it back
    CyU3PDeviceGpioOverride(LED, CyTrue);
    CyU3PMemSet((uint8_t *)&GpioConfig, 0, sizeof(GpioConfig));
    GpioConfig.outValue = 1;
    GpioConfig.driveLowEn = CyTrue;
    GpioConfig.driveHighEn = CyTrue;
    CyU3PGpioSetSimpleConfig(LED, &GpioConfig);
    CyU3PMemSet((uint8_t *)&GpioConfig, 0, sizeof(GpioConfig));
    GpioConfig.inputEn = CyTrue;
    GpioConfig.intrMode = CY_U3P_GPIO_INTR_NEG_EDGE;
    CyU3PGpioSetSimpleConfig(Button, &GpioConfig);

    Delay = 1000;
    while (1)
    {
        CyU3PThreadSleep(Delay);
        CyU3PGpioSetValue(LED, (1 & Counter++));
    }
}

// ApplicationDefine function called by RTOS to startup the application
void CyFxApplicationDefine(void)
{
    void *StackPtr = NULL;
    uint32_t Status;

    StackPtr = CyU3PMemAlloc(APPLICATION_THREAD_STACKSIZE);
    Status = CyU3PThreadCreate(&ApplicationThreadHandle,   // Handle to my Application Thread
        "15:Chapter3_Example1",         // Thread ID and name
        ApplicationThread,              // Thread entry function
        42,                             // Parameter passed to Thread
        StackPtr,                       // Pointer to the allocated thread stack
        APPLICATION_THREAD_STACKSIZE,   // Allocated thread stack size
        APPLICATION_THREAD_PRIORITY,    // Thread priority
        APPLICATION_THREAD_PRIORITY,    // = Thread priority so no preemption
        CYU3P_NO_TIME_SLICE,            // No time slice
        CYU3P_AUTO_START                // Start the thread immediately
        );
    if (Status != CY_U3P_SUCCESS) while (1);    // Get here on a failure, can't recover
    // Once the programs get more complex we shall do something more elegant here
}

The text coloring was done by the Eclipse editor (to be used in
the next chapter) – the interactive syntax highlighting helps reduce
errors. Language key words and known functions are highlighted in
purple and if you hover over a function name then its definition and
short description will appear in a window. Known structure members
and constants are highlighted in blue and comments are highlighted
in green.

As seen, CyFxApplicationDefine creates a user thread.
Typically we create all of the RTOS resources that we will need for
our application in this routine. This first example has just one user
thread and no other resources.

The ApplicationThread uses the GPIO block to access the
button and LED so it must start this up, which involves choosing a
variety of clock options as seen in CyU3PGpioInit(). I also register a
callback routine and this is declared at the top of Figure 3.6. This
routine is entered when a negative edge is detected on the button
input (the button is pressed) and this changes the variable called
Delay. Callbacks are an important construct within the RTOS
environment and they are described in more detail below.

Note in the Main() routine in Figure 3.5 that I told RTOS that I
will use the UART and an IO pin, GPIO_45, for the button. This is
the only example in which I don’t use the UART and it was simpler to
claim this from the outset. The user LED on the Explorer board is
connected to the UART signal CTS which the UART driver owns at
start up; I am only planning to use a 2 wire connection to a serial port
so I can give up the CTS control signal for LED use. I then configure
the LED as an output and configure the button as an input.
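
The expression 1<<(45-32) in Figure 3.5 follows from the way the enable bits are packed: 32 GPIOs per 32-bit word, so GPIO_45 is bit 13 of the second word. The sketch below shows the general calculation; SimpleGpioEnable is a hypothetical helper of my own that works on a plain two-word array laid out like the gpioSimpleEn field, so that it can be tried on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mark one GPIO in a two-word bitmap laid out like the
   gpioSimpleEn[] field used in Figure 3.5 (word 0 holds GPIO 0..31,
   word 1 holds GPIO 32..63). */
static void SimpleGpioEnable(uint32_t bitmap[2], uint32_t gpio)
{
    assert(gpio < 64u);
    bitmap[gpio / 32u] |= (uint32_t)1u << (gpio % 32u);
}
```

Using |= rather than = means that enabling a second GPIO in the same word does not clear the first one.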

We then run our main loop which waits for Delay msec then
toggles the LED. The changing Delay value will change the blink
rate of the LED.
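
The two pieces of logic in this example, the Delay toggle and the LED toggle, can be tried on a PC with the sketch below. ButtonPressed and NextLedLevel are hypothetical stand-ins for the callback body and the loop expression in Figure 3.6. I have also marked Delay as volatile here because it is written in the callback and read in the thread; that qualifier is my assumption and is not shown in the listing.

```c
#include <stdint.h>

/* Written by the callback, read by the thread (volatile is an assumption). */
static volatile uint32_t Delay = 1000u;

/* Same toggle as GPIO_InterruptCallback: each press swaps slow and fast blink. */
static void ButtonPressed(void)
{
    Delay = (Delay == 1000u) ? 100u : 1000u;
}

/* Same expression as the main loop: bit 0 of an incrementing counter
   alternates between 0 and 1 on successive calls. */
static uint32_t NextLedLevel(uint32_t *counter)
{
    return 1u & (*counter)++;
}
```

Each pass round the loop therefore flips the LED, and each button press switches the sleep time between 1000 and 100.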

Enough theory, it’s about time that we had a demo! Connect
the SuperSpeed Explorer board to your PC using a USB 3.0 cable
and run the USB Control Center application program to load and run
Chapter3Example1.img and then press the user button a few times.
We are using less than 0.01% of the capability of the FX3 and RTOS
to blink the LED but we have to start somewhere! See Load and
Run in the Reference Section for full details on this test sequence.

API Overview

We interact with the FX3 using the FX3 API described by the
EZ-USB® FX3 + FX3S SDK Firmware API Guide. This document is
about 600 pages long with no obvious structure or prioritization of
information. I don’t expect you to read it; instead, we will work
through a series of examples that highlight specific functions and
attributes of the API. Note that we DO NOT directly access registers
within the FX3 ARM CPU or IO blocks – peeking and poking around
these can lead to disaster. We also don’t need to program in
assembler, those days are gone.

The general format of an API call is:

Status = Function(Parameter, Parameter, &ReturnedParameter);

We should always check Status for CY_U3P_SUCCESS (=0)
to know that the function call was successful. The first routine that I
wrote was CheckStatus(Text, Status) so that it was easy, quick and
obvious to understand what was going on. In our early examples
CheckStatus simply returns if Status is zero; if Status is not zero it
displays Text and Status and then stops.
Your initial errors will be just like mine and will be due to
passing bad parameters.
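
A CheckStatus-style helper might look something like the sketch below. This is a PC-side illustration, not the version used later in the book: STATUS_SUCCESS stands in for CY_U3P_SUCCESS, the message goes to stderr rather than to the FX3 debug console, and instead of stopping it returns 0 so that the caller can decide what to do.

```c
#include <stdint.h>
#include <stdio.h>

#define STATUS_SUCCESS 0u /* stands in for CY_U3P_SUCCESS, which the text gives as 0 */

/* Hypothetical sketch of a CheckStatus-style helper. Returns 1 when the call
   succeeded; otherwise reports the failing step and its status code and
   returns 0 so that the caller can decide how to stop. */
static int CheckStatus(const char *text, uint32_t status)
{
    if (status == STATUS_SUCCESS) return 1;
    fprintf(stderr, "%s failed, Status = %u\n", text, (unsigned)status);
    return 0;
}
```

Passing a short description of the step, such as "Start GPIO clocks", makes it obvious from the output which call went wrong.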

We deal with a lot of data structures when using the API so
many parameters are pointers to data structures such as
&MyThread. The compiler can catch most of the times you forget
the & since the parameter type will not match what the function is
expecting. Sometimes a parameter is a function pointer. We
have already seen one example: CyU3PGpioInit (&GpioClock,
GPIO_InterruptCallback). Here we passed the address of a
callback function that the RTOS executes when appropriate. This
callback capability is used often by RTOS when an operation is
known to take a long, or unknown time to execute and we, the caller,
do not wish to wait. In this case we specify a callback function that
the RTOS should run whenever the specified event or condition
occurs.
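
The mechanics of registering a callback are plain C function pointers. The sketch below models them on a PC: DriverInit and DriverRaiseInterrupt are hypothetical stand-ins for a driver, not FX3 API calls, and MyCallback does nothing more than record which GPIO fired, in keeping with the advice to keep callbacks short.

```c
#include <stddef.h>
#include <stdint.h>

/* Type of the function the "driver" will call back. */
typedef void (*GpioCallback)(uint8_t gpioId);

static GpioCallback registeredCallback = NULL;
static uint8_t lastGpio = 0;

/* Hypothetical driver entry point: remember the caller's function address. */
static void DriverInit(GpioCallback cb)
{
    registeredCallback = cb;
}

/* Hypothetical driver internals: run the callback when the event occurs. */
static void DriverRaiseInterrupt(uint8_t gpioId)
{
    if (registeredCallback != NULL) registeredCallback(gpioId);
}

/* User callback: record the event and return immediately. */
static void MyCallback(uint8_t gpioId)
{
    lastGpio = gpioId;
}
```

Note that the function name is passed without parentheses: DriverInit(MyCallback) hands over the address of the function rather than calling it.
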
A callback routine is similar to an interrupt service routine –
when called we should implement our function as quickly as possible
and NOT call any blocking API functions. A blocking API function
will have a WAIT parameter and, unfortunately, these are not always
obvious to spot. You, like I, will find out the hard way which functions
should not be used in a callback routine, or an interrupt service
function, since your code will suddenly hang. I will create tools in the
next chapter that will minimize our ‘hang time’.

If the function needs to return a parameter then again we
need to specify a pointer to the parameter that is to be returned.
Often this parameter is a pointer to a structure. Fortunately C lets us
manipulate pointers with ease but unfortunately C lets us manipulate
pointers with ease often doing exactly what we asked for but not
what we meant! So we need to be extra careful when dealing with
pointer parameters!
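
The pattern of returning a status as the function result and the
real data through a pointer can be sketched in a few lines of portable
C. The structure and function names here are invented for the
illustration and are not part of the FX3 API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure used only for this sketch. */
typedef struct { uint32_t Count; uint32_t Limit; } SketchInfo_t;

/* Returns 0 on success and fills *Info; returns 1 if the pointer is bad. */
uint32_t SketchGetInfo(SketchInfo_t *Info)
{
    if (Info == NULL) return 1;   /* catch the most common bad parameter */
    Info->Count = 3;
    Info->Limit = 10;
    return 0;
}
```

The caller declares the structure itself and passes its address, for
example SketchInfo_t Info; Status = SketchGetInfo(&Info); so forgetting
the & is caught by the compiler as a type mismatch.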

In general, the RTOS has control of the CPU and calls our user code
when appropriate. We should do whatever work is needed then pass
control back to the RTOS as soon as possible. Remember that the RTOS
is running many other threads “underneath” our code. For example, if
our application needs to wait for, say, 100 msec then we SHOULD NOT
implement a decrementing loop counter! Instead we should call the RTOS
function CyU3PThreadSleep(100) which will give control back to the
RTOS so that it can get on with other work. The RTOS will return in
about 100 msec and give us control back. I say ‘about’ since the RTOS
may decide that something more important than the user thread should
be run at this time. We will see later that user threads have the
lowest priority so that the RTOS can guarantee servicing of the IO
block threads.

Key ThreadX features


An FX3 program is a collection of threads. The basic structure of a
thread consists of some initialization code followed by a do-forever
loop as shown in Figure 3.7.

Figure 3.7 All threads have the same structure

ThreadX uses this code and some allocated RAM for a stack
when starting a thread. All of the local variables are allocated on the
stack so the code is inherently re-entrant; this will enable you to use
the same code to be started by multiple threads if needed by the
application. Note that thread now has a specific meaning: it
consists of a collection of code bytes that is the program, a
collection of variables that are data bytes on the stack and a data
structure, also on the stack, called the thread context.

In this section we will discuss how ThreadX deals with a resource
that must be shared by several threads and how these threads can
communicate with each other. A shared resource, such as an I2C
communications port, must be protected from being accessed from
multiple threads at the same time. The mechanism that all RTOSs use
is called a mutex (a concatenation of MUTually EXclusive, meaning one
owner) and this is illustrated as a key in Figure 3.8.

Figure 3.8 A Mutex is used to protect a shared resource

Before a thread may access a shared resource it must first get the
mutex protecting it and once it has finished using the resource it
must put the mutex back. Note that there is no physical connection
between the mutex and the shared resource. A mutex is a programming
convention that must be followed by your code. There is nothing that
prevents you from accessing the shared resource without using the
mutex; however, this will more or less guarantee that your program
will fail! It may work for a while, but the odds of failure are
proportional to the number of executives present when you
demonstrate it.
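
To make the "convention" point concrete, here is a toy model of mutex
ownership in portable C. It is only a model and not how ThreadX
implements a mutex: there is no suspension, the thread IDs are
invented (1, 2, ...), and a failed get simply returns 0 where a real
RTOS would make the caller wait.

```c
#include <stdint.h>

#define NO_OWNER 0u

/* Toy mutex: just a record of which thread ID (1, 2, ...) owns it. */
typedef struct { uint32_t Owner; } SketchMutex_t;

/* Try to take the mutex; returns 1 on success, 0 if it is already owned.
 * A real RTOS get would suspend the caller instead of returning 0. */
int SketchMutexTryGet(SketchMutex_t *Mutex, uint32_t ThreadId)
{
    if (Mutex->Owner != NO_OWNER) return 0;
    Mutex->Owner = ThreadId;
    return 1;
}

/* Give the mutex back; only the owner may do so. Returns 1 on success. */
int SketchMutexPut(SketchMutex_t *Mutex, uint32_t ThreadId)
{
    if (Mutex->Owner != ThreadId) return 0;
    Mutex->Owner = NO_OWNER;
    return 1;
}
```

Notice that nothing in the model refers to the resource being
protected. The protection comes entirely from every thread agreeing to
call the get routine first.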

Figure 3.9 shows a code fragment from the Chapter3Example3 project.
This is a demonstration program that does no real work. It has been
written to illustrate the operation of a mutex. The full source code
for all of the examples in this Chapter is available in the Examples
folder if you would prefer to follow along on your PC screen.

Figure 3.9 Code example using a Mutex


void GetMutex(char * Name)
{
    CyU3PReturnStatus_t Status;
    Status = CyU3PMutexGet(&SharedMutex, CYU3P_WAIT_FOREVER);
    CheckStatus(8, "Get", Status);
    CyU3PDebugPrint(4, "\n%s has Mutex", Name);
}

void PutMutex(char * Name)
{
    CyU3PReturnStatus_t Status;
    // CyU3PDebugPrint(4, "\n%s returning Mutex", Name);
    Status = CyU3PMutexPut(&SharedMutex);
    CheckStatus(8, "Put", Status);
}

// Declare main application code
// Note that both threads use the SAME CODE; Value passed in determines
// the thread identity
void ApplicationThread(uint32_t Value)
{
    char * ThreadName;
    uint32_t StartTime;

    CyU3PThreadInfoGet(&ThreadHandle[Value], &ThreadName, 0, 0, 0);
    ThreadName += 3; // Skip numeric ID
    CyU3PDebugPrint(4, "\n%s started", ThreadName);
    // Now run forever
    while (1)
    {
        StartTime = CyU3PGetTime();
        DoWork(ActivityTime[Value][0], ThreadName);
        GetMutex(ThreadName);
        DoWork(ActivityTime[Value][1], ThreadName); // Work done with Mutex owned
        PutMutex(ThreadName);
        DoWork(ActivityTime[Value][2], ThreadName);
        GetMutex(ThreadName);
        DoWork(ActivityTime[Value][3], ThreadName); // Work done with Mutex owned
        PutMutex(ThreadName);
        DoWork(ActivityTime[Value][4], ThreadName);
        LoopCounter[Value]++; // Keep loop statistics
        TotalTime[Value] += CyU3PGetTime() - StartTime; // Keep loop statistics
    }
}

Thread A starts and soon needs to work with the shared resource so
it gets the mutex. Thread B is working and also needs the mutex but
its get will not succeed since thread A already owns it. So thread B
must wait until thread A returns the mutex before it can proceed with
the shared resource. This operation continues endlessly.

Connect another USB cable to the debug connector of the SuperSpeed
Explorer board and plug this into your PC too (see Figure 4.2 for
help). This connection will enumerate as a virtual COM port and you
can attach a terminal program such as ClearTerminal, CoolTerm,
TeraTerm or similar to this connection to view progress messages sent
from the FX3.

Using the USB Control Center, locate, load and run
Chapter3Example3.img and observe the operation via the messages in
the console window. Note that this process of running a program image
is fully described in the next Chapter and some readers may want to
read Chapter 4 now before returning to this point.

Figure 3.10 shows a timeline of the operation. Note that each thread
must wait while the other thread has ownership of the shared resource
via its possession of the mutex. Press the reset button on the
development board to stop the running program and return the board to
being a boot loader device.

Figure 3.10 Example showing ownership of a Mutex

Thread communication
Communication between threads is a fundamental requirement so
ThreadX includes three mechanisms to satisfy different application
needs. Figure 3.11 outlines a basic data collection and reporting
system and this will be used to illustrate the three thread
communication methods. The simplest method is a semaphore so this is
described first.

Figure 3.11 Using three threads to collect, process and save data

Thread communications using Semaphores

A semaphore, illustrated in Figure 3.12, is used by threads to signal
each other; one thread can wait for a semaphore to be set by another
thread before proceeding and this will synchronize the operation of
the threads.

Figure 3.12 A Semaphore is used to signal between threads

Figure 3.13 shows a code fragment from the Chapter3Example4 project.
I decided to use a system timer to generate a 3.5 second clock to
create/collect/find data. Once the data is prepared the input code
puts a DataAvailable semaphore (DataToProcess in the code) up to
indicate that valid data is available. I chose to implement the data
buffer as a global array so the data can be shared between threads
without copying. The processing thread waits on the DataAvailable
semaphore and then crunches the data to produce an output data buffer,
also global, then it puts the ProcessedDataAvailable semaphore
(DataToOutput in the code) up. The output thread is waiting on the
ProcessedDataAvailable semaphore and deals with the data once it is
signaled as available.

Figure 3.13 Code example of using a Semaphore


// Declare a helper routine so that I can simply add/remove progress messages
void DoWork (uint32_t Time, char * Name)
{
    CyU3PDebugPrint(4, "\n%s is busy working", Name);
    CyU3PThreadSleep(Time);
}

// Input data is created on a periodic basis using a System Timer
void CreateInputData (uint32_t InitialValue)
{
    // NOTE: a System Timer routine runs in ISR context so it CANNOT use
    // any blocking calls
    // CyU3PDebugPrint() is a blocking call :-(
    uint32_t i, CurrentValue;
    for (i = 0; i<Elements(InputDataBuffer); i++) InputDataBuffer[i] = TempCounter++;
    TotalData++;
    // Check that the previous data has been processed
    tx_semaphore_info_get(&DataToProcess, 0, &CurrentValue, 0, 0, 0);
    if (CurrentValue == 1) DataOverrun++;
    // Set a Semaphore to indicate that input data has been created/collected/found
    else CyU3PSemaphorePut(&DataToProcess);
}

void ProcessDataThread (uint32_t Value)
{
    char * ThreadName;
    uint32_t i, j;
    CyU3PThreadInfoGet(&ThreadHandle[Value], &ThreadName, 0, 0, 0);
    ThreadName += 3; // Skip numeric ID
    CyU3PDebugPrint(4, "\n%s started", ThreadName);
    while (1) // Now run forever
    {
        // Wait for some input data to process
        CyU3PSemaphoreGet(&DataToProcess, CYU3P_WAIT_FOREVER);
        for (i = 0; i<Elements(ProcessedDataBuffer); i++)
        {
            ProcessedDataBuffer[i] = 0;
            for (j = 0; j<10; j++) ProcessedDataBuffer[i] += InputDataBuffer[(10*i)+j];
        }
        DoWork(2000, ThreadName); // Pad the actual work for demonstration
        // Hand off the processed data to the Output thread
        CyU3PSemaphorePut(&DataToOutput);
        DoWork(100, ThreadName); // Do any tidy-up required
        // Go back and find more work
    }
}

void OutputDataThread (uint32_t Value)
{
    char * ThreadName;
    CyU3PThreadInfoGet(&ThreadHandle[Value], &ThreadName, 0, 0, 0);
    ThreadName += 3; // Skip numeric ID
    CyU3PDebugPrint(4, "\n%s started", ThreadName);
    while (1) // Now run forever
    {
        // Wait for some processed data to output
        CyU3PSemaphoreGet(&DataToOutput, CYU3P_WAIT_FOREVER);
        DoWork(1000, ThreadName); // Pad the actual work for demonstration
        // Go back and find more work
    }
}

Using the USB Control Center, locate, load and run the
Chapter3Example4.img file and observe the operation via the messages
in the console window. To stop the currently running example so that
new firmware can be loaded, press the reset button on the board (it’s
the button next to the USB 3.0 connector).

Semaphores are a good solution for this simple application since the
data rate is controlled by the slowest operation which, in this case,
is the input function. If you can guarantee that your processing and
output threads will always be faster than your input data rate then
use this simplest semaphore solution.

Some of you may be thinking, wait a moment, a mutex and semaphore
appear to be the same so why do we have two methods of doing the same
thing? In this two-thread example they do serve a similar purpose but
when more than two threads are involved the operation of the semaphore
is a little different.

A mutex is used to allow only one thread to access a protected
resource; this does not change.

Multiple threads can put to a single semaphore. ThreadX semaphores
are counting semaphores and can have any value from 0 to 0xFFFFFFFF.
A get from a semaphore will be successful if its value is currently 1
or greater. There will be several examples of counting semaphores in
later Chapters.
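
The counting behaviour can be modelled in a few lines of portable C.
This is a toy model for illustration, not the ThreadX implementation:
the names are invented, a failed get returns 0 where the RTOS would
make the caller wait, and refusing a put at 0xFFFFFFFF is simply a
choice made for this sketch.

```c
#include <stdint.h>

typedef struct { uint32_t Count; } SketchSemaphore_t;

/* Put: add one to the count. Returns 0 if the count is already at
 * 0xFFFFFFFF (a choice made for this sketch), 1 otherwise. */
int SketchSemaphorePut(SketchSemaphore_t *Sem)
{
    if (Sem->Count == UINT32_MAX) return 0;
    Sem->Count++;
    return 1;
}

/* Get without waiting: succeeds only if the count is 1 or greater. */
int SketchSemaphoreTryGet(SketchSemaphore_t *Sem)
{
    if (Sem->Count == 0) return 0;
    Sem->Count--;
    return 1;
}
```

With two puts outstanding, two gets succeed and a third does not,
which is the difference from a mutex: the count records how many
signals are pending, not who owns something.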

Thread communications using Event Flags


An alternative to a semaphore, which can only signal one event, is a
group of Event Flags which can signal up to 32 different events. This
is illustrated in Figure 3.14. Additionally a thread can set or check
on a collection of events in a single operation.

Figure 3.14 A set of Event Flags can signal multiple events
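
An event group is essentially a 32-bit word with one bit per event,
and that view can be sketched in portable C. The model below is an
illustration only: the names and bit assignments are hypothetical, the
get never waits, and returning just the matched bits in *Actual is a
choice made for this sketch.

```c
#include <stdint.h>

/* Hypothetical event bits for this sketch. */
#define EVT_INPUT_READY     (1u << 0)
#define EVT_PROCESSED_READY (1u << 1)

typedef struct { uint32_t Flags; } SketchEventGroup_t;

/* Set one or more events in a single operation (bits are ORed in). */
void SketchEventSet(SketchEventGroup_t *Group, uint32_t Mask)
{
    Group->Flags |= Mask;
}

/* Check without waiting. If WaitAll is non-zero every bit in Mask must be
 * set, otherwise any one bit is enough. On success the matched bits are
 * returned in *Actual and, if Clear is non-zero, removed from the group. */
int SketchEventTryGet(SketchEventGroup_t *Group, uint32_t Mask,
                      int WaitAll, int Clear, uint32_t *Actual)
{
    uint32_t Hit = Group->Flags & Mask;
    if (WaitAll ? (Hit != Mask) : (Hit == 0)) return 0;
    *Actual = Hit;
    if (Clear) Group->Flags &= ~Hit;
    return 1;
}
```

The WaitAll and Clear arguments play the role that options such as
CYU3P_EVENT_OR and CYU3P_EVENT_OR_CLEAR play in the example code that
follows.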

Figure 3.15 shows a code fragment from the Chapter3Example5 project.
This time I decided to use a thread for the input operation since I
wanted to put DebugPrint statements in the function and this was not
allowed when using a timer callback routine. I replaced the two
semaphores with two events so each thread is now setting or getting
events rather than putting or getting semaphores. The operation for
this Event Flags example is almost the same as the Semaphore example
but for more complicated programs the use of events can simplify your
code.

Figure 3.15 Code example of using Event Flags


// Declare some helper routines so that I can simply add/remove progress messages
void DoWork (uint32_t Time, char * Name)
{
    CyU3PDebugPrint(DID, "\n%s is busy working", Name);
    CyU3PThreadSleep(Time);
}

// Declare main application code
void InputDataThread (uint32_t Value)
{
    char * ThreadName;
    uint32_t ActualEvents, Status, i;

    CyU3PThreadInfoGet(&ThreadHandle[Value], &ThreadName, 0, 0, 0);
    ThreadName += 3; // Skip numeric ID
    CyU3PDebugPrint(4, "\n%s started", ThreadName);
    // Now run forever
    while (1)
    {
        // Gather some input data
        for (i = 0; i<Elements(InputDataBuffer); i++) InputDataBuffer[i] = TempCounter++;
        DoWork(SampleTime, ThreadName); // Pad the actual work for demonstration
        TotalData++;
        // Check that the previous data has been processed
        Status = CyU3PEventGet(&SharedEvent, INPUT_DATA_AVAILABLE,
                               CYU3P_EVENT_OR, &ActualEvents, CYU3P_NO_WAIT);
        if (Status == 0) DataOverrun++;
        else CyU3PEventSet(&SharedEvent, INPUT_DATA_AVAILABLE, CYU3P_EVENT_OR);
        // Go back and find more input
    }
}

void ProcessDataThread (uint32_t Value)
{
    char * ThreadName;
    uint32_t ActualEvents, i, j;

    CyU3PThreadInfoGet(&ThreadHandle[Value], &ThreadName, 0, 0, 0);
    ThreadName += 3; // Skip numeric ID
    CyU3PDebugPrint(4, "\n%s started", ThreadName);
    while (1) // Now run forever
    {
        // Wait for some input data to process
        CyU3PEventGet(&SharedEvent, INPUT_DATA_AVAILABLE,
                      CYU3P_EVENT_OR_CLEAR, &ActualEvents, CYU3P_WAIT_FOREVER);
        for (i = 0; i<Elements(ProcessedDataBuffer); i++)
        {
            ProcessedDataBuffer[i] = 0;
            for (j = 0; j<10; j++) ProcessedDataBuffer[i] += InputDataBuffer[(10*i)+j];
        }
        DoWork(2000, ThreadName); // Pad the actual work for demonstration
        // Hand off the processed data to the Output thread
        CyU3PEventSet(&SharedEvent, PROCESSED_DATA_AVAILABLE, CYU3P_EVENT_OR);
        // Do any tidy-up required
        DoWork(100, ThreadName);
        // Go back and find more work
    }
}

void OutputDataThread (uint32_t Value)
{
    char * ThreadName;
    uint32_t i, ActualEvents;

    CyU3PThreadInfoGet(&ThreadHandle[Value], &ThreadName, 0, 0, 0);
    ThreadName += 3; // Skip numeric ID
    CyU3PDebugPrint(4, "\n%s started", ThreadName);
    // Now run forever
    while (1)
    {
        // Wait for some processed data to output
        CyU3PEventGet(&SharedEvent, PROCESSED_DATA_AVAILABLE,
                      CYU3P_EVENT_OR_CLEAR, &ActualEvents, CYU3P_WAIT_FOREVER);
        DoWork(1000, ThreadName); // Pad the actual work for demonstration
        CyU3PDebugPrint(4, "\nOutput: ");
        for (i = 0; i<Elements(ProcessedDataBuffer); i++)
            CyU3PDebugPrint(4, "%d ", ProcessedDataBuffer[i]);
        // Go back and find more work
    }
}

Both the semaphore solution and event solution are successful since
the slowest operation is the inputting of data. Locate, load and run
Chapter3Example4.img or Chapter3Example5.img where I have changed the
input rate to 1.5 seconds. Observe that some input data is lost. This
is not tolerable in most systems and can be solved with the addition
of more data buffers. It would be possible to allocate more semaphores
or events to handle these additional buffers but a simpler solution
for the thread communications is the use of a queue, described next. I
chose to use fixed buffers for the semaphore and event flags examples
since the data flow and buffer usage are continuous and simple.

Thread communications using a Queue


A queue is illustrated in Figure 3.16 and it is a place where one
thread can send a message that another thread will wait for. You can
send messages to the front of a queue but typically messages are added
at the end of the queue. The message itself is small, ranging from one
32-bit word up to sixteen 32-bit words. If you need more, as we do in
this example, you pass pointers to larger buffers.

Figure 3.16 A Queue can hold many messages


In this queue example I use ThreadX to provide data buffers
since this is a better use of the memory resource for all but the
simplest of applications. Statically allocated buffers are a waste of
RAM if the thread that needs the buffer is not running continuously.
For these threads it is better to request a buffer from ThreadX and
return it when no longer needed.

Figure 3.17 shows the four queues defined for this example.
Memory buffers are created at step 1 and the memory address of
each buffer is the message that I pass around. Note that I am not
passing the whole data buffer. I assume that ownership of the buffer
address message is sufficient for a thread to own the buffer. This is
an important concept since it means that I can avoid copying large
amounts of data. One thread will fill the buffer then pass ownership
of the buffer via a send message to another thread that would use
the data. There is nothing, of course, preventing you from writing in
the buffer even if you do not own it; this is an easy way to create
‘hard-to-find’ bugs. Remember the convention: if you own the buffer
then you can use it, but once you send it to a queue you should not
read or write the buffer further.

Figure 3.17 Message flow using four queues
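
The idea that the message is the buffer address, not the buffer, can
be sketched with a small ring of pointers in portable C. This is a
host-side model for illustration only: the names and the depth of four
are invented, nothing waits, and it does not use the ThreadX queue
services.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u

/* A queue whose messages are buffer addresses, one pointer per message. */
typedef struct {
    uint8_t *Slot[QUEUE_DEPTH];
    uint32_t Head;    /* index of the oldest message */
    uint32_t Count;   /* number of messages waiting */
} SketchQueue_t;

/* Send a buffer address to the back of the queue; returns 0 if full. */
int SketchQueueSend(SketchQueue_t *Queue, uint8_t *Buffer)
{
    if (Queue->Count == QUEUE_DEPTH) return 0;
    Queue->Slot[(Queue->Head + Queue->Count) % QUEUE_DEPTH] = Buffer;
    Queue->Count++;
    return 1;
}

/* Receive the oldest buffer address; returns NULL if the queue is empty. */
uint8_t *SketchQueueReceive(SketchQueue_t *Queue)
{
    uint8_t *Buffer;
    if (Queue->Count == 0) return NULL;
    Buffer = Queue->Slot[Queue->Head];
    Queue->Head = (Queue->Head + 1) % QUEUE_DEPTH;
    Queue->Count--;
    return Buffer;
}
```

A thread that receives a pointer owns that buffer until it sends the
pointer on; the data written by one owner is seen by the next without
any copying.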

The message queues are primed during thread initialization and
“empty” messages are placed on the DataDone queue and the
ProcessedDataDone queue.

The code for this example is found in the Chapter3Example6 project.
Look now at the input thread. It starts by getting a buffer from the
DataDone queue (more accurately it gets the address of the buffer from
this queue) and it starts filling this buffer with input data. Once
full it passes this buffer to the DataAvailable queue. And this cycle
continues endlessly. If no message is waiting at the DataDone queue
when the input thread requires it then this is considered a fault
situation since input data will be lost. We will need to write code to
handle this fault condition.

In the meantime the processing thread is waiting at the DataAvailable
queue for work to do. Once a message arrives it gets an “empty”
message from the ProcessedDataDone queue in which it can write the
results, then it processes the input data. Once it has finished with
the input message it sends it to the DataDone queue for recycling. It
then sends the processed data message to the ProcessedData queue.

The output thread is waiting for messages at the ProcessedData queue
and once it is finished with the data it sends the used message to the
ProcessedDataDone queue for recycling.

In this example the input thread is generating data faster than the
processing thread can do its work. I therefore assigned two processing
threads to handle the workload. I use the same code and I start up two
threads to service the DataAvailable queue. They will take messages
alternately from the DataAvailable queue and send used messages to the
DataDone queue and will also send processed data messages to the
ProcessedData queue. Note that neither the input thread nor the output
thread knows that there are now two processing threads; they operate
as before. This separation and isolation of the inner workings of each
thread will allow your thread code to be reused in other projects that
have similar interfacing requirements.

As seen in Figure 3.17 our data buffers are effectively cycling around
the loops 2, 3, 4, 5 and 6, 7, 8, 9. We know that the buffers
themselves are not actually moving, just the references to them.
Figure 3.18 shows a time plot of the “movement” of buffers through the
system and the ownership of the buffers by each thread.

Figure 3.18 Timing to show buffer usage in Queue example


DMA Programming Model
As explained in Chapter 2, the distributed DMA controller moves data
between sockets. Each IO block has at least one producer socket and
one consumer socket while the high bandwidth IO devices, USB and GPIF,
each have 32 sockets to support multiple concurrent data transfers.
The connection between a producer socket and a consumer socket is
called a DMA channel and the FX3 has a great deal of dedicated
hardware to support data transfers exceeding 800 MBps.

The programming model for DMA sets up a DMA channel and then lets the
hardware take over. Figure 3.19 shows the basic categories of DMA
channels.

Figure 3.19 DMA channels are AUTO or MANUAL


To get the maximum throughput the DMA channel should be
set up for AUTO operation where data is moved from a producer
socket, or sockets, to a consumer socket, or sockets. The transfers
can be 1 to 1, 1 to many or many to 1 and there will be examples of
each throughout the book. If some CPU involvement is required
then a MANUAL channel is set up and the throughput will be less.
There is a variant of auto, called AUTO_WITH_SIGNAL, where the
DMA channel carries on at maximum speed and interrupts the CPU
with various events such as ‘buffer_consumed’ or ‘buffer_empty’; this
enables the CPU to keep track of data transfers without slowing
them down but note that if the CPU cannot service the event
callbacks fast enough then these event notifications will be lost.
Figure 3.20 shows the DMA channel operating states.

Figure 3.20 DMA Operating States


When a DMA channel is created, using CyU3PDmaChannelCreate, it moves
from Not_Configured to the Configured state. A call to
CyU3PDmaChannelSetXfer moves the channel to the Active state where
transfers will begin if the producer socket is ready. If no data is
available to read, or all data in a buffer is written, the channel
suspends. The application can force the channel to suspend, using
CyU3PDmaChannelSetSuspend; the DMA controller completes the current
buffer before suspending the channel. If needed the channel can be
resumed or aborted using CyU3PDmaChannelResume or
CyU3PDmaChannelAbort.
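
The state changes described above can be written down as a small
transition function. The sketch below is a paper model in portable C,
not SDK code: the enum and function names are invented, and abort is
left out because the text does not say which state it leads to.

```c
typedef enum {
    DMA_NOT_CONFIGURED,
    DMA_CONFIGURED,
    DMA_ACTIVE,
    DMA_SUSPENDED
} SketchDmaState_t;

typedef enum { OP_CREATE, OP_SET_XFER, OP_SUSPEND, OP_RESUME } SketchDmaOp_t;

/* Returns the state after Op, or the unchanged state if Op is not valid
 * in the current state. */
SketchDmaState_t SketchDmaNext(SketchDmaState_t State, SketchDmaOp_t Op)
{
    switch (Op)
    {
    case OP_CREATE:   return (State == DMA_NOT_CONFIGURED) ? DMA_CONFIGURED : State;
    case OP_SET_XFER: return (State == DMA_CONFIGURED)     ? DMA_ACTIVE     : State;
    case OP_SUSPEND:  return (State == DMA_ACTIVE)         ? DMA_SUSPENDED  : State;
    case OP_RESUME:   return (State == DMA_SUSPENDED)      ? DMA_ACTIVE     : State;
    }
    return State;
}
```

Walking a channel through create, set-transfer, suspend and resume
with this function reproduces the path through Figure 3.20.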

To set up an auto channel we specify the size and number of buffers
that the channel can use. Buffers are filled by a producer socket then
passed to a consumer socket once full. The consumer socket empties the
buffer then passes it back to the producer socket for refilling.
Typically multiple buffers are used so that the producer socket can be
filling one buffer while the consumer socket is emptying another. This
is all handled by the hardware “underneath” the API so there is no
need for the application program to be concerned about buffer
allocation, use or recycling.

If either the producer socket or the consumer socket is the CPU then
the DMA channel is a MANUAL transfer which means that the CPU can
create new data, modify current data such as adding a header and/or
footer, or consume the current data. There is a lot of flexibility
here and Figure 3.21 shows the additional operations available for a
manual channel. Again, there will be examples throughout the book
which demonstrate each of these transfer types.

Figure 3.21 Operations for a manual channel

Power Aware Programming


Chapter 1 explained the changes in the USB Specification which enable
lower power operation. The USB 3.0 specification introduced multiple
power states where the bus activity is reduced to save power while
retaining the capability to quickly resume data transfers. The
relevant power states are recapped here:

U0: Full power and operation
U1: Standby with fast recovery to U0
U2: Standby with slow recovery to U0
U3: Suspend with very long recovery time

The USB block and PHY on FX3 cannot be put to sleep while the link is
in the U0, U1 or U2 states. This means that there is no real
opportunity to save power in the system while the link goes to U1 or
U2. The only requirement from the firmware is to ensure that the
U0 <-> U1 and U0 <-> U2 transitions are handled properly.

As per the USB spec, both the host and the device are enabled to
initiate transitions from U0 to U1/U2 or back. Most USB 3.0 hosts
(Intel hosts, for example) will move the USB link to the U1 state when
they are expecting the next data transfer to be device initiated.

Note that while the host can do a direct U0 -> U2 transition, this is
not commonly used. U2 entry typically happens from the U1 state. Once
the link has been in U1 for longer than a host defined inactivity
timeout, the link will transition to U2 on both device and host sides.
This happens without any actual signaling on the USB bus.

The FX3 hardware does not initiate U1/U2 entry or exit automatically.
It needs firmware involvement for both steps. Also, there are no
registers that can be used by the firmware to identify whether the FX3
has any data ready to send to the USB host or not. These restrictions
mean that FX3 cannot automatically handle the U1/U2 transitions based
on actual transfer state. The following APIs are included to enable
the firmware to manage the transitions:

CyU3PUsbRegisterLPMRequestCallback: This function registers a
callback that will be called whenever the USB 3.0 link has moved into
the U1 or U2 states. The return value from this callback function will
indicate whether the FX3 should stay in the U1/U2 state (if return
value is CyTrue) or attempt to transition back to U0 (if return value
is CyFalse). In most of the firmware examples, the callback
implementation always returns CyTrue. This is because there is no
application-specific state which needs the system to stay in the U0
state. Some, however, return CyTrue if the application has a state
machine that is currently idle, and return CyFalse when the state
machine is active. Some hosts aggressively push the link to U1 and
this will have an impact on high throughput applications since
entering U1 and coming back to U0 will require anywhere between
10 us and 1 ms (depending on USB link quality) and no data can be
moved while this transition is happening.

CyU3PUsbLPMDisable, CyU3PUsbLPMEnable: You can disable the possible
transitions to U1/U2 if this has a large impact on your application.
The result will, of course, be higher power. Once
CyU3PUsbLPMDisable() is called, FX3 will reject any attempt by the USB
host to initiate U1/U2 entry until CyU3PUsbLPMEnable() is called or a
USB reset happens.

CyU3PUsbSetLinkPowerState: This API allows the USB 3.0 link to be
moved from U0 to U1/U2, or from U1/U2/U3 back into U0.
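
The idle/active policy described for the LPM request callback can be
sketched as a one-line decision. The code below is a host-side
illustration only: SKETCH_TRUE and SKETCH_FALSE stand in for CyTrue
and CyFalse, the function names and the activity flag are
hypothetical, and the real callback signature from the SDK is not
reproduced here.

```c
/* Stand-ins for CyTrue / CyFalse so the sketch builds without the SDK. */
#define SKETCH_TRUE  1
#define SKETCH_FALSE 0

/* Hypothetical application state, updated by the state machine. */
static int AppStateMachineActive = SKETCH_FALSE;

void SketchSetAppActive(int Active)
{
    AppStateMachineActive = Active;
}

/* Policy for the LPM request callback: allow the link to stay in U1/U2
 * only while the application is idle; ask for U0 when it is active. */
int SketchLpmRequestPolicy(void)
{
    return AppStateMachineActive ? SKETCH_FALSE : SKETCH_TRUE;
}
```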

The following algorithm is suggested for USB power state management
in FX3 firmware applications:
1. Keep LPM operation enabled by default, so that the link can
   move into U1/U2 and pass compliance tests when the system is
   idle.
2. Disable LPM operation at any time when the application state
   machine is active. The procedure to disable LPM operations is:
   a. Call CyU3PUsbLPMDisable() to disable U1/U2 transitions.
   b. Call CyU3PUsbSetLinkPowerState(CyU3PUsbLPM_U0) to ensure
      that the link is brought into U0 state if it was already in
      U1/U2.
3. Re-enable LPM operation once the application state machine is
   back in the idle state. This is done by calling
   CyU3PUsbLPMEnable().
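
The three steps above can be sketched as a pair of helper routines
called when the application state machine changes state. To keep the
sketch buildable on a PC the SDK calls are replaced by stubs that only
record what was requested; the stub names, helper names and flags are
all hypothetical.

```c
/* Stubs that record what would be requested. In firmware they would be
 * replaced by CyU3PUsbLPMDisable(), CyU3PUsbSetLinkPowerState() and
 * CyU3PUsbLPMEnable(); the stub names and flags are hypothetical. */
static int LpmEnabled = 1;        /* step 1: LPM enabled by default */
static int LinkForcedToU0 = 0;

static void StubLpmDisable(void) { LpmEnabled = 0; }
static void StubLpmEnable(void)  { LpmEnabled = 1; }
static void StubSetLinkU0(void)  { LinkForcedToU0 = 1; }

/* Step 2: call when the application state machine becomes active. */
void SketchOnAppActive(void)
{
    StubLpmDisable();   /* 2a: refuse further U1/U2 entry */
    StubSetLinkU0();    /* 2b: bring the link back to U0 if it had left */
}

/* Step 3: call when the application state machine returns to idle. */
void SketchOnAppIdle(void)
{
    LinkForcedToU0 = 0;
    StubLpmEnable();
}

int SketchLpmIsEnabled(void)  { return LpmEnabled; }
int SketchLinkWasForced(void) { return LinkForcedToU0; }
```

Keeping the two transitions in one place makes it harder to disable
LPM and then forget to re-enable it when the application goes idle.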

FX3 Power Mode Handling


System level power saving operations can be performed by
the FX3 firmware when the USB connection is suspended (USB 2.0
suspend or USB 3.0 U3 state), or when the USB connection is
broken (VBus is off, or user has disabled USB connection).

USB Suspend Handling


In the first case (USB connection is present, but suspended), the USB
block needs to be kept in the powered on and suspended state. This
happens automatically when the FX3 detects the USB suspend so no
firmware action is needed. The user application is notified about a
USB suspend through the CY_U3P_USB_EVENT_SUSPEND event.

On receiving this event, the user can take system-level power saving
actions like suspending or turning off external peripherals (image
sensors, other controllers). You can also place the FX3 device into
the low power suspend state (L1 or L2) at this stage.

In the L1/L2 states, the ARM core in the FX3 is placed into the
clock-gated Wait For Interrupt state. The clocks to the other
peripheral blocks on FX3 are also stopped. None of the blocks are
powered off at this stage. If any of the blocks (UART, SPI, I2C etc)
are to be powered off, you have to do this explicitly using the
corresponding de-init API calls.

The L1/L2 entry is achieved through the CyU3PSysEnterSuspendMode()
API call. If the USB connection was in 3.0 mode, this API will move
FX3 into the L1 mode. If the USB connection was in 2.0 mode, the API
will move FX3 into the L2 mode. This API returns only after the FX3
has woken up (returned to L0) from the suspend mode. If the suspend
mode entry fails for any reason, the API will immediately return with
an appropriate return code.

USB Disconnect Handling


In a self-powered system based on FX3, it may be desirable to power
off most FX3 blocks when the USB link is disconnected (no VBus
detected). This can be done by powering FX3 off completely (L4), but
this will require the FX3 to go through a full boot process when it is
powered ON again.

The Standby mode (L3) can be used to save more power as compared to
L1/L2 while avoiding the start-up latency associated with a power
down. In this case, most FX3 blocks (USB, GPIF II, Serial peripherals)
are powered off. Power is retained only to the System RAM and a select
portion of the device control logic (this is required to manage the
wake-up conditions).

The FX3 can be configured to wake from the L3 state when it detects a
valid VBus signal, or when it sees a transition on the GPIF II CE#
signal (or the UART CTS signal). When coming out of the L3 state, the
device acts as if it is recovering from a CPU reset. The firmware
starts running again from its entry location and goes through the
complete initialization sequence.

The CyU3PSysEnterStandbyMode() API is used to move FX3 into the L3
standby mode. The API ensures that all FX3 GPIOs retain their original
state as FX3 is going in and out of the Standby mode. As the I-TCM
block in FX3 loses power while in L3, the API also manages an
automatic save and restore of the I-TCM content.

Chapter Summary

This chapter included a brief look at some of the key features of the
ThreadX RTOS. The interested reader should review Ed Lamie's book
Real-time Embedded Multithreading using ThreadX, now in its second
edition. You will notice that my mutex example is a variation of the
book’s “Speedy and Slow” mutex example.

The Cypress code makes extensive use of ThreadX features to produce a
robust software base on which you can add your application threads.
The IO drivers, for example, each include a mutex to enforce single
thread access to all IO devices. Any of your threads can access these
devices and the Cypress drivers manage the exclusive access for you.
If you want to learn more about the specific ThreadX implementation on
the FX3 then read through the device driver code; Cypress provides
this as a download from their FX3 development page at
www.cypress.com/fx3. Cypress provides full source code for these
drivers for you to study; I would NOT recommend changing them!

One goal of this Chapter was to show you that all the RTOS
capability provided with ThreadX is available for your use when
writing your application. You can use as little or as much as you
like. Small applications would use a few threads while large
applications should use many threads. Cypress also supports
advanced users who wish to add their own device drivers.

For those of you who are still not convinced that programming
with an RTOS is a great idea you can, in fact, program the FX3
without using the RTOS. I don’t recommend it but a non-RTOS
example called FX3-Lite (Boot) Firmware Library is described in the
Reference Section.

This chapter also covered the DMA programming model and power aware
programming. These techniques are used project-wide, independent of
which IO blocks are being used. In the next chapter we shall develop
some code that uses some of the capabilities of this FX3 software.

Chapter 4 FX3 Firmware Development

In this Chapter we are going to write some code. It's going to be
easy code since the main goal of this Chapter is to introduce you to
the FX3 development environment including various debugging
strategies. The FX3 SDK includes comprehensive Windows-based toolsets
to implement each phase of your design and these are installed
separately. There are three pieces of software that must be written
for a complete FX3 solution and these are shown in Figure 4.1.

Figure 4.1 FX3 development uses three tool sets

Developing a host application will be covered in Chapter 7 while
creating a GPIF II state machine is well documented in Cypress
documents Getting Started with GPIF II Designer and GPIF II Designer
Users Guide; I also create several examples starting in Chapter 8.
This Chapter is about “Firmware runs here”. If you don't have a PC
with USB 3.0 connections yet, then I would recommend getting a new
laptop as your first choice. The USB 3.0 performance of all of the
laptops that I have used was always better than the USB 3.0 add-in
cards that would be used to update a tower. In Chapter 7 we will
develop a benchmark program that lets you compare the USB 3.0
performance of all Windows-based machines.

I assume that you have installed the SDK. If not, do this now
– instructions are in the SuperSpeed Explorer Kit Users Guide which
is included at the end of the Reference Section.

Cypress chose the open source Eclipse environment for the
graphical user interface (GUI) to the firmware development process.
This windowed application manages projects and includes an editor
with syntax highlighting and sophisticated search and reference
capabilities that enable you, for example, to look up where functions
are declared and referenced. Behind this human interface is a set of
GCC tools (compiler, assembler, linker, locator, etc.) that create FX3
object code for execution.

The GCC tools are enormously flexible and can create object
code for a large range of microprocessors and microcontrollers. The
downside of this flexibility is that the tools must be configured to
generate the correct object code! You can either configure the tools
yourself (not recommended but this is explained in FX3
Programmers Guide) or you can start from a pre-configured, working
project and edit it. I suggest that you create a workspace directory
now and import all of the book example code into it. This will give
you a working copy and should things go terribly wrong you will be
able to re-import the examples to restart any project.

The steps to do this are detailed in the SuperSpeed Explorer
Users Guide. Note especially the section describing auto-saving
prior to compile; setting this will save you a lot of frustration!

The Eclipse human interface also includes the capability of
attaching to a JTAG debugger and Cypress includes a Zylin and
openOCD plug-in for this. The openOCD plug-in talks directly to the
integrated debugger on the FX3 SuperSpeed Explorer board so
there is no need to purchase additional hardware. The FX3 includes
two hardware breakpoints and the JTAG debugger allows you to add
any number of software breakpoints. You can also view CPU
registers, IO block registers and memory. However, to be honest, I
found little use for this capability since its annoyance factor was
larger than its usefulness factor. My issue with breakpoints is that
the CPU stops when a breakpoint condition is met and this is the
LAST thing you want when debugging a real time OS-based
application! There are cases when the JTAG debugger must be
used (the software engineers at Cypress use it to debug some of the
FX3 device drivers) so, if you can't solve your debug problem any
other way, refer to Debugging with JTAG in the Explorer Kit User
Manual at the end of the Reference Section.

Debugging is very important to me as you shall discover
reading through the examples. The best technique I found with FX3
projects is a Debug Console with hardware assist from a logic
analyzer. Let’s first discuss the Console.

Cypress provides a debug thread and we used this in the
previous chapter. The function CyU3PDebugPrint() is used to
generate progress messages which are sent to the FX3 UART; this,
in turn, is connected to the serial port of the integrated debugger and
then via USB to a PC where a terminal program such as
ClearTerminal is used. This is good but I will expand this capability
in this chapter. Also included within the Cypress debug thread is a
logging function that writes coded messages to a memory buffer; this
is useful if you need to generate messages that would be too fast for
the UART or within an interrupt service routine or callback routine
where CyU3PDebugPrint cannot be used since it is a blocking
function.

From this point forward I shall use DebugPrint(), not just
because I got fed up of typing “CyU3P” but because this gives me
the ability to redirect console output without changing the example
code. In later examples I will redirect console output to the I2C
channel and to a USB interface and this can be done with minimal
effort and disruption. I will be adding a console_in capability so that
we can interact with a running program and this too will be
redirectable. I will also be adding RTOS-aware debugging
capabilities throughout this chapter. If you are developing your own
hardware it is ESSENTIAL that you include a debug console; the
UART is the simplest debug channel but I2C and USB also have
merit. I will present the features and benefits of all three alternatives
throughout the next few chapters.
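
To make the idea of a redirectable console concrete, here is a minimal host-side sketch of how such a wrapper could be structured. It is not the book's DebugPrint: the sink type, RedirectConsole, CaptureSink and the DebugLevel threshold are all hypothetical names introduced for illustration, and the rule that messages with a priority number above the threshold are dropped is an assumption modelled on the level passed to CyU3PDebugInit.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* A sink receives one formatted message; each channel (UART, I2C, USB) would supply its own */
typedef void (*ConsoleSink_t)(const char *Text, size_t Length);

char CapturedText[256];          /* Stand-in "channel" used only for this demonstration */

void CaptureSink(const char *Text, size_t Length)
{
    size_t Used = strlen(CapturedText);
    if (Used + Length < sizeof(CapturedText))
    {
        memcpy(&CapturedText[Used], Text, Length);
        CapturedText[Used + Length] = 0;
    }
}

void StdoutSink(const char *Text, size_t Length)
{
    fwrite(Text, 1, Length, stdout);
}

static ConsoleSink_t CurrentSink = StdoutSink;
static int DebugLevel = 9;       /* Assumption: messages with Priority above this are dropped */

/* Changing the sink redirects all console output without touching any caller */
void RedirectConsole(ConsoleSink_t NewSink)
{
    CurrentSink = NewSink;
}

/* Same calling shape as DebugPrint(Priority, Format, ...) used in the examples */
int DebugPrint(int Priority, const char *Format, ...)
{
    char Line[128];
    va_list Args;
    int Length;

    if (Priority > DebugLevel) return 0;
    va_start(Args, Format);
    Length = vsnprintf(Line, sizeof(Line), Format, Args);
    va_end(Args);
    if (Length < 0) return 0;
    if ((size_t)Length >= sizeof(Line)) Length = (int)sizeof(Line) - 1;   /* Output was truncated */
    CurrentSink(Line, (size_t)Length);
    return Length;
}
```

Because every caller goes through the one function pointer, moving the console from one channel to another is a single call to RedirectConsole; none of the example code needs to change.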

Another key element of the FX3 development environment is
the host application called USB Control Center. We used this in
Chapter 3 and, for the curious reader, Cypress includes the source
code for this application in the SDK download (look in
<installdirectory>/application/c_sharp/controlcenter). The USB
Control Center is a GUI on top of the CyUsb3.sys driver; it
can locate and display features of any CyUsb3.sys-matched device
and, since this includes the SuperSpeed Explorer board, we
shall be using it a great deal. At the base level, the USB Control
Center can be used to download a program generated by the Eclipse
toolset into FX3 RAM where it runs. Since this download is across
USB, the download-run-debug-fix-recompile loop is quite fast
and you don’t lose concentration as you move forward.

Let's now look inside one of the programs that we ran in
Chapter 3. Connect the SuperSpeed Explorer board to your PC with
two USB cables as shown in Figure 4.2.

Figure 4.2 Explorer board connected and ready to go

Open the Chapter4Example1 project using the Eclipse HI and
note the source files. This is the semaphore example that was
introduced in the previous chapter except that I moved Main() to its
own module since it changed little in all of the Chapter 3 examples.
Your display should look just like Figure 4.3 and the partitioning of
the code is shown in Figure 4.4.

I will not be including every listing of every module in the book
since this would make the book way too thick. I recommend that you
follow along on your screen if possible; however, I will list key
program elements as Figures.

Figure 4.3 Chapter4Example1 project source files

Figure 4.4 Start of a project template


Project Template
We will add files to the project template as the examples get
more complex. I don't like large modules so I partition my code once
the listing gets too long. My initial partitioning shown in Figure 4.4 is
based on how often I will look at the source file. Files whose names
start with CyFx were written by Cypress and in general these should
not be changed; however, if you look at the exception handlers in
cyfxtx.c you will find unhelpful routines that silently hang the
processor! These will be replaced before the end of this Chapter.

StartUp.c sets up the FX3 environment for the program. We
need to choose the processor's starting frequency, whether to enable
caches or not, set up the IO matrix then start ThreadX. All the
examples in this Chapter and the first few in the next Chapter use
just the CPU and the UART so this file will not be edited for a while.

Application.c contains all of our example code; it is small now
but will grow later. By convention ThreadX, which was started in
StartUp.c, will call CyFxApplicationDefine to get the application
running. It does this after it has started all of its threads to control
the underlying hardware. We shall have a peek at these before the
end of the Chapter.

The first thing CyFxApplicationDefine does is initialize a debug
console and this is described in DebugConsole.c. I use the UART
module because this is simple. Inside InitDebugConsole is a Cypress
routine that starts another thread that will allow multiple user threads
to send text to the UART - the source of this is downloadable from
Cypress’s FX3 website. We use DebugPrint to display text on
the UART console.

CyFxApplicationDefine creates an event for all threads to use
then starts all three user threads: Input, Processing and Output. I
fake most of the real work that these threads are doing using a
CyU3PThreadSleep function. This sleep routine gives control back
to ThreadX and allows it to schedule any other thread that has
real work to do. This gives the impression that all of the user threads
are running concurrently. Actually, they are, and there are many
other threads actively running “beneath” them as well. ThreadX is
also servicing interrupts and keeping statistics on everything that is
going on. Finally CyFxApplicationDefine sets up a reporting
mechanism which tells the developer what is going on within the
application. The program flow was described in Chapter 3 so it
should be straightforward to follow the code that is doing the work.

Using the USB Control Center, load and run the program and
observe in your terminal window what is going on.

In Chapter 3 I had separately compiled program images that
changed the input thread sample time from 3.5 seconds to 1.5
seconds. It will be useful to do this programmatically and to do so
we need a console input function. This is straightforward to add and
gives me the opportunity to describe a simple DMA operation.

Adding Console_In
Input characters from the UART arrive at an unpredictable
(but slow!) rate so we will set up a DMA channel to catch them and
deliver them to the CPU. This may seem like overkill but we have
way more DMA capability than we will ever use and since it is free
(well, included with the FX3) we may as well use it.

DMA channels are set up between sockets. We decide how
big a buffer is needed for the data and how many buffers should be
stacked to keep up with the data rate. One 16-byte buffer (the
smallest that can be allocated) will suffice for this application of
catching characters typed by a user. We then choose a mode; there
are several and these will be described as needs arise but for now
we choose the simplest, which is byte mode. We tell the CPU what
to do when a buffer arrives using a callback routine. For now, I'm
just going to echo the line typed in. Figure 4.5 shows the new
contents of DebugConsole.c. This is available as the
Chapter4Example2 project.

Figure 4.5 Adding a Console_In capability


void UartCallback(CyU3PUartEvt_t Event, CyU3PUartError_t Error)
// Handle characters typed in by the developer
{
    CyU3PDmaBuffer_t ConsoleInDmaBuffer;
    char InputChar;
    if (Event == CY_U3P_UART_EVENT_RX_DONE)
    {
        CyU3PDmaChannelSetWrapUp(&UARTtoCPU_Handle);
        CyU3PDmaChannelGetBuffer(&UARTtoCPU_Handle, &ConsoleInDmaBuffer, CYU3P_NO_WAIT);
        InputChar = (char)*ConsoleInDmaBuffer.buffer;
        DebugPrint(4, "%c", InputChar);    // Echo the character
        if (InputChar == 0x0d) DebugPrint(4, "\nInput: '%s'", ConsoleInBuffer);
        else if (ConsoleInIndex < sizeof(ConsoleInBuffer) - 1)
        {
            // Room for this character plus the terminator; extra characters are dropped
            ConsoleInBuffer[ConsoleInIndex++] = InputChar | 0x20;    // Force lower case
            ConsoleInBuffer[ConsoleInIndex] = 0;
        }
        CyU3PDmaChannelDiscardBuffer(&UARTtoCPU_Handle);
        CyU3PUartRxSetBlockXfer(1);
    }
}

// Spin up the DEBUG Console, Out and In
CyU3PReturnStatus_t InitializeDebugConsole(void)
{
    CyU3PUartConfig_t uartConfig;
    CyU3PDmaChannelConfig_t dmaConfig;
    CyU3PReturnStatus_t Status;

    Status = CyU3PUartInit();    // Start the UART driver
    CheckStatus("CyU3PUartInit", Status);
    CyU3PMemSet((uint8_t *)&uartConfig, 0, sizeof(uartConfig));
    uartConfig.baudRate = CY_U3P_UART_BAUDRATE_115200;
    uartConfig.stopBit = CY_U3P_UART_ONE_STOP_BIT;
    uartConfig.txEnable = CyTrue;
    uartConfig.rxEnable = CyTrue;
    uartConfig.isDma = CyTrue;
    Status = CyU3PUartSetConfig(&uartConfig, UartCallback);    // Configure the UART hardware
    CheckStatus("CyU3PUartSetConfig", Status);

    Status = CyU3PUartTxSetBlockXfer(0xFFFFFFFF);    // Send as much data as I need to
    CheckStatus("CyU3PUartTxSetBlockXfer", Status);
    // Attach the Debug driver above the UART driver
    Status = CyU3PDebugInit(CY_U3P_LPP_SOCKET_UART_CONS, 9);
    if (Status == CY_U3P_SUCCESS) DebugTxEnabled = CyTrue;
    CheckStatus("ConsoleOutEnabled", Status);
    CyU3PDebugPreamble(CyFalse);    // Skip preamble, debug info is targeted for a person

    // Now setup a DMA channel to receive characters from the Uart Rx
    Status = CyU3PUartRxSetBlockXfer(1);
    CheckStatus("CyU3PUartRxSetBlockXfer", Status);
    CyU3PMemSet((uint8_t *)&dmaConfig, 0, sizeof(dmaConfig));
    dmaConfig.size = 16;     // Minimum size allowed, I only need 1 byte
    dmaConfig.count = 1;     // I can't type faster than the Uart Callback routine!
    dmaConfig.prodSckId = CY_U3P_LPP_SOCKET_UART_PROD;
    dmaConfig.consSckId = CY_U3P_CPU_SOCKET_CONS;
    dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
    dmaConfig.notification = CY_U3P_DMA_CB_PROD_EVENT;
    Status = CyU3PDmaChannelCreate(&UARTtoCPU_Handle, CY_U3P_DMA_TYPE_MANUAL_IN, &dmaConfig);
    CheckStatus("CreateDebugRxDmaChannel", Status);
    if (Status != CY_U3P_SUCCESS) CyU3PDmaChannelDestroy(&UARTtoCPU_Handle);
    else
    {
        Status = CyU3PDmaChannelSetXfer(&UARTtoCPU_Handle, INFINITE_TRANSFER_SIZE);
        CheckStatus("ConsoleInEnabled", Status);
    }
    return Status;
}

Reset the Explorer board then load and run
Chapter4Example2.img and note that characters typed in at your
terminal application are now echoed. When you enter CR the input
line will be returned.
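
The bookkeeping around ConsoleInBuffer and ConsoleInIndex is easy to get wrong by one, so it can help to exercise it on a PC before it goes inside a callback. The following is a host-only sketch of the line accumulator with the FX3 calls removed; the buffer size and the helper names ConsoleInAddChar and ConsoleInReset are hypothetical, since the chapter does not list them.

```c
#include <stddef.h>

#define CONSOLE_IN_SIZE 32   /* Hypothetical; the chapter does not state the buffer size */

char ConsoleInBuffer[CONSOLE_IN_SIZE];
size_t ConsoleInIndex = 0;

/* Discard the current line so that the next character starts a new one */
void ConsoleInReset(void)
{
    ConsoleInIndex = 0;
    ConsoleInBuffer[0] = 0;
}

/* Add one received character. Returns 1 when CR ends the line, otherwise 0. */
int ConsoleInAddChar(char InputChar)
{
    if (InputChar == 0x0d) return 1;
    if (ConsoleInIndex < sizeof(ConsoleInBuffer) - 1)
    {
        ConsoleInBuffer[ConsoleInIndex++] = (char)(InputChar | 0x20);   /* Force lower case */
        ConsoleInBuffer[ConsoleInIndex] = 0;                            /* Keep it a valid string */
    }
    return 0;   /* Characters beyond the buffer capacity are silently dropped */
}
```

Testing the index before the store leaves the last byte free for the terminator, so an over-long line can never write past the end of the buffer. Note that OR-ing with 0x20 lower-cases letters and leaves digits and spaces unchanged, but it does alter some punctuation and control characters.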

Adding Parameter Input
The next step is to add some code to the console-in code to
do something useful with this input stream. Figure 4.6 shows the code
needed to dynamically change the input sample rate from 3500
msec to 1500 msec. I also added the capability of resetting the
Explorer board from the console. This is available as the
Chapter4Example3 project.

Figure 4.6 Adding parameter input to your program


CyBool_t ASCII_Digit(char Char)
{
    return ((Char >= '0') && (Char <= '9'));
}

uint32_t GetValue(char *CharPtr)
{
    uint32_t Value = 0;
    while (ASCII_Digit(*CharPtr)) Value = (10 * Value) + (*CharPtr++ - '0');
    return Value;
}

void ParseCommand(void)
{
    DebugPrint(4, "\n");
    if (strncmp("set", ConsoleInBuffer, 3) == 0)
    {
        SampleTime = GetValue(&ConsoleInBuffer[3]);
        DebugPrint(4, "\nSet SampleTime = %d", SampleTime);
    }
    else if (strcmp("reset", ConsoleInBuffer) == 0)
    {
        DebugPrint(4, "\nRESETTING CPU\n");
        CyU3PThreadSleep(100);
        CyU3PDeviceReset(CyFalse);
    }
    else DebugPrint(4, "\nUnknown Command: '%s'\n", ConsoleInBuffer);
    ConsoleInIndex = 0;
}

void UartCallback(CyU3PUartEvt_t Event, CyU3PUartError_t Error)
// Handle characters typed in by the developer, look for CR
{
    CyU3PDmaBuffer_t ConsoleInDmaBuffer;
    char InputChar;
    if (Event == CY_U3P_UART_EVENT_RX_DONE)
    {
        CyU3PDmaChannelSetWrapUp(&UARTtoCPU_Handle);
        CyU3PDmaChannelGetBuffer(&UARTtoCPU_Handle, &ConsoleInDmaBuffer, CYU3P_NO_WAIT);
        InputChar = (char)*ConsoleInDmaBuffer.buffer;
        DebugPrint(4, "%c", InputChar);    // Echo the character
        if (InputChar == 0x0d) ParseCommand();
        else if (ConsoleInIndex < sizeof(ConsoleInBuffer) - 1)
        {
            // Room for this character plus the terminator; extra characters are dropped
            ConsoleInBuffer[ConsoleInIndex++] = InputChar | 0x20;    // Force lower case
            ConsoleInBuffer[ConsoleInIndex] = 0;
        }
        CyU3PDmaChannelDiscardBuffer(&UARTtoCPU_Handle);
        CyU3PUartRxSetBlockXfer(1);
    }
}
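
GetValue in Figure 4.6 starts converting at the character immediately after "set", so, as I read the listing, a line typed with a space ("set 1500") would yield 0 because the space is not a digit. If you prefer a more forgiving parser, one possible variant is sketched below. GetValueSafe is a hypothetical name, and both the skipping of leading spaces and the saturation on overflow are my additions rather than part of the example project.

```c
#include <stdint.h>

/* Skip leading spaces, then convert decimal digits; saturate instead of wrapping around */
uint32_t GetValueSafe(const char *CharPtr)
{
    uint32_t Value = 0;
    while (*CharPtr == ' ') CharPtr++;
    while ((*CharPtr >= '0') && (*CharPtr <= '9'))
    {
        uint32_t Digit = (uint32_t)(*CharPtr++ - '0');
        /* 10*Value + Digit would exceed UINT32_MAX, so clamp */
        if (Value > (UINT32_MAX - Digit) / 10) return UINT32_MAX;
        Value = (10 * Value) + Digit;
    }
    return Value;
}
```

A call such as GetValueSafe(&ConsoleInBuffer[3]) would then accept both "set1500" and "set 1500".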

Display Program Threads


For those of you interested in what is going on inside the
RTOS you should now load and run Chapter4Example4.img. I
added a command that discovers all of the threads currently running
on the FX3 and displays them. Your display should look similar to
Figure 4.7 and the code is shown in Figure 4.8. The number of
threads will increase as we start using more of the FX3's devices
such as USB and GPIF.

Figure 4.7 Screenshot of the currently running threads

Figure 4.8 Display the currently running threads


void DisplayThreads(void)
{
    CyU3PThread *ThisThread, *NextThread;
    char *ThreadName;
    // First find out who I am
    ThisThread = CyU3PThreadIdentify();
    tx_thread_info_get(ThisThread, &ThreadName, 0, 0, 0, 0, 0, &NextThread, 0);
    // Now, using the Thread linked list, look for other threads until I find myself again
    while (NextThread != ThisThread)
    {
        tx_thread_info_get(NextThread, &ThreadName, 0, 0, 0, 0, 0, &NextThread, 0);
        DebugPrint(4, "\nFound: '%s'", ThreadName);
    }
}
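
The traversal in Figure 4.8 relies on the RTOS keeping its threads in a circular linked list: follow the "next" pointer until you arrive back where you started. The host-side sketch below shows the same pattern on a stand-in structure. DemoThread and WalkThreads are hypothetical names and the real thread control block has many more fields; unlike the listing, this version also visits the starting node.

```c
#include <stddef.h>

/* Stand-in for the RTOS thread control block: only the fields this sketch needs */
typedef struct DemoThread
{
    const char *Name;
    struct DemoThread *Next;   /* The last entry points back to the first */
} DemoThread;

/* Visit every node exactly once, starting at (and including) Start; returns the count */
size_t WalkThreads(DemoThread *Start, void (*Visit)(const DemoThread *))
{
    size_t Count = 0;
    DemoThread *Current = Start;
    if (Start == NULL) return 0;
    do
    {
        if (Visit) Visit(Current);
        Count++;
        Current = Current->Next;
    } while (Current != Start);
    return Count;
}
```

Because the list is circular, the walk can begin at any thread and still sees all of them.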

Display Stack Usage


One concern that all people starting with an RTOS have is
“how much stack space should I allocate to my thread?” Too much
means that you are wasting the RAM resource and too little will
cause the thread to crash. I added a stack used command and this
code is shown in Figure 4.9.

Figure 4.9 An algorithm to check stack usage.


void DisplayStacks(void)
{
    int i, j;
    char *ThreadName;
    for (i = 0; i < APP_THREADS; i++)
    {
        // Note that StackSize is in bytes but RTOS fill pattern is a uint32
        uint32_t *DataPtr = StackPtr[i];
        // Count untouched words from the start of the stack buffer, stop at the first used word
        for (j = 0; j < (APPLICATION_THREAD_STACK >> 2); j++) if (DataPtr[j] != 0xEFEFEFEF) break;
        CyU3PThreadInfoGet(&ThreadHandle[i], &ThreadName, 0, 0, 0);
        ThreadName += 3;    // Skip numeric ID
        DebugPrint(4, "\nStack free in %s is %d/%d", ThreadName, j << 2, APPLICATION_THREAD_STACK);
    }
    DebugPrint(4, "\n");
}

When ThreadX starts up a thread it initializes its stack using a
0xEFEFEFEF data pattern. The thread’s stack pointer is set to the
end of the allocated stack buffer and grows towards the start of this
stack buffer. This routine checks for this pattern to determine how
much of the stack has been used. Following this check I reduced
the stack size for each of the threads and saved 4KB of RAM! I
needed to edit most of the modules to implement this and the
project is now called Chapter4Example5.
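
The fill-pattern check can be tried on a PC with an ordinary array standing in for a thread stack. This is only a sketch of the technique: StackBytesFree and STACK_FILL_PATTERN are hypothetical names, and it assumes, as described above, that the stack grows from the end of the buffer towards its start.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xEFEFEFEFu   /* Pattern written into an unused stack */

/* Count the bytes at the low end of the stack buffer that still hold the fill pattern */
size_t StackBytesFree(const uint32_t *StackStart, size_t StackSizeBytes)
{
    size_t Words = StackSizeBytes / sizeof(uint32_t);
    size_t i;
    for (i = 0; i < Words; i++)
        if (StackStart[i] != STACK_FILL_PATTERN) break;   /* First word that was ever written */
    return i * sizeof(uint32_t);
}
```

The result is a high-water mark: it reports the deepest the stack has ever reached, not how much is in use right now. It can read slightly high if the deepest word written happened to equal the fill pattern, so leave some margin when trimming stack sizes.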

Adding an Error Indicator


Things are going well now but that is because we are using
pre-debugged code. The next thing we should think of is “what
happens when things start to go wrong”. The first thing we need to
do is improve the default exception handlers; sending an error
message to a debug console would help but if the CPU hangs we
will not see that message. What we need is an Error LED that we
can blink independently of the CPU if something goes terribly
wrong. The SuperSpeed Explorer board has one LED that we can
blink at different rates depending upon the discovered problem. Not
ideal, but if we also send a message to the debug console we may
see that too. Figure 4.10 shows the code needed to blink the LED
using a PWM timer.

Figure 4.10 Using a blinking LED as an Error Indicator


void IndicateError(uint16_t ErrorCode)
{
    // Setup a PWM to blink the SuperSpeed Explorer's only user LED at an "error rate"
    CyU3PGpioComplexConfig_t gpioConfig;
    // LED is on UART_CTS which has been assigned to the UART driver, claim it back
    CyU3PDeviceGpioOverride(UART_CTS, CyFalse);
    // Configure UART_CTS as PWM output
    CyU3PMemSet((uint8_t *)&gpioConfig, 0, sizeof(gpioConfig));
    gpioConfig.driveLowEn = CyTrue;
    gpioConfig.driveHighEn = CyTrue;
    gpioConfig.pinMode = (ErrorCode == 0) ? CY_U3P_GPIO_MODE_STATIC : CY_U3P_GPIO_MODE_PWM;
    gpioConfig.timerMode = CY_U3P_GPIO_TIMER_HIGH_FREQ;
    gpioConfig.period = PWM_PERIOD << ErrorCode;
    gpioConfig.threshold = PWM_THRESHOLD << ErrorCode;
    CyU3PGpioSetComplexConfig(UART_CTS, &gpioConfig);
    DebugPrint(1, "FATAL ERROR = %d", ErrorCode);    // This probably won't display but try
}

// Main sets up the CPU environment then starts the RTOS
int main(void)
{
    CyU3PGpioClock_t GpioClock;
    CyU3PIoMatrixConfig_t io_Config;
    CyU3PReturnStatus_t Status;

    // The default clock runs at 384MHz
    Status = CyU3PDeviceInit(0);
    if (Status == CY_U3P_SUCCESS)
    {
        // Startup the GPIO module clocks, needed for ErrorIndicator
        GpioClock.fastClkDiv = 2;
        GpioClock.slowClkDiv = 0;
        GpioClock.simpleDiv = CY_U3P_GPIO_SIMPLE_DIV_BY_2;
        GpioClock.clkSrc = CY_U3P_SYS_CLK;
        GpioClock.halfDiv = 0;
        Status = CyU3PGpioInit(&GpioClock, 0);
        if (Status == CY_U3P_SUCCESS)
        {
            Status = CyU3PDeviceCacheControl(CyTrue, CyTrue, CyTrue);
            if (Status == CY_U3P_SUCCESS)
            {
                CyU3PMemSet((uint8_t *)&io_Config, 0, sizeof(io_Config));
                io_Config.isDQ32Bit = CyTrue;
                io_Config.useUart = CyTrue;
                io_Config.lppMode = CY_U3P_IO_MATRIX_LPP_DEFAULT;
                Status = CyU3PDeviceConfigureIOMatrix(&io_Config);
                IndicateError(1);    // Turn on so we know if RTOS Start fails
                if (Status == CY_U3P_SUCCESS) CyU3PKernelEntry();    // This does not return
            }
        }
    }
    // Get here on a failure, can't recover, just hang here
    while (1);
    return 0;    // Won't get here but compiler wants this!
}

I decided to put this code in StartUp.c and start the LED
blinking immediately, and then turn it off in CyFxApplicationDefine if I
get there and all is well. I also edited cyfxtx.c (the linkage to the
ThreadX libraries) to put IndicateError(1) in the unhandled exception
routines (lines #117, #128 and #139). Now if something goes wrong
and the CPU hangs we will have a clue where to look! While
debugging this addition I was tripped up several times and saw errors
such as 64, 66 and 68. Rather than look these up every time I
decided to let the FX3 look them up for me and extended
CheckStatus() with an ErrorLookup routine and moved this to a new
Support.c module. I saved all of these additions as the
Chapter4Example6 project.
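
IndicateError scales the PWM period by shifting it left by the error code, so each increment of the code halves the blink rate. The arithmetic can be checked on a PC with the sketch below. The base period, the timer frequency used in the usage note and the function names are hypothetical values chosen for illustration, since the chapter does not list PWM_PERIOD or the resulting timer clock.

```c
#include <stdint.h>

#define PWM_PERIOD_BASE 20000000u   /* Hypothetical: timer ticks per blink for code 0 */
#define MAX_SAFE_SHIFT  7u          /* 20000000 << 7 still fits in 32 bits, << 8 does not */

/* Period for a given error code; the shift is clamped so the value cannot overflow */
uint32_t ErrorPeriod(uint16_t ErrorCode)
{
    uint32_t Shift = (ErrorCode > MAX_SAFE_SHIFT) ? MAX_SAFE_SHIFT : ErrorCode;
    return PWM_PERIOD_BASE << Shift;
}

/* Blink rate in thousandths of a hertz for a timer running at TimerHz */
uint32_t BlinkMilliHz(uint32_t TimerHz, uint32_t Period)
{
    if (Period == 0) return 0;
    return (uint32_t)(((uint64_t)TimerHz * 1000u) / Period);
}
```

With an assumed 100 MHz timer, code 0 gives 5 Hz, code 1 gives 2.5 Hz and code 2 gives 1.25 Hz. The clamp matters if a raw status value such as 64 were ever passed as the error code: shifting a 32-bit value by 32 or more bits is undefined in C.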

Adding RTOS Visibility


A running RTOS does not give you much indication that all is
going well; I would like more feedback. Express Logic have two
builds of their ThreadX kernel, the standard one and a performance
monitoring version that adds a lot of debug and monitoring features
to the standard kernel. This does mean that it runs a little slower but
the wealth of information collected is worth the small performance
hit. To date, Cypress have only shipped the standard kernel with the
FX3 DVK; I have persuaded them to add the performance monitoring
version in the next release which will be available by the time you
are reading this book. At my request, Cypress have added an RTOS
kernel hook that informs us when it starts up a new thread and when
this thread suspends. They added similar indication of when
Mutexes, Semaphores and Events change state. I specified a
hardware indication of what the RTOS software is doing and the
Cypress kernel writers provided me with four routines where I bind
an IO pin to a kernel event (the kernel does nothing extra if these
routines are not called) as follows:

UINT tx_thread_set_profile_gpio(TX_THREAD *thread_ptr, ULONG gpio_id);
UINT tx_mutex_set_profile_gpio(TX_MUTEX *mutex_ptr, ULONG gpio_id);
UINT tx_semaphore_set_profile_gpio(TX_SEMAPHORE *semaphore_ptr, ULONG gpio_id);
UINT tx_event_flags_set_profile_gpio(TX_EVENT_FLAGS_GROUP *group_ptr, ULONG gpio_id);

The FX3 has a lot of capability that we haven't used yet (well,
we are only in Chapter 4!) so I decided that allocating one IO pin per
Thread, User Mutex, User Semaphore or User EventGroup was not
being over-indulgent so this is what we shall move forward with for
this example.

We can now attach a logic analyzer onto these IO pins and
get a visual representation of which thread is running at any
particular moment in time and we can see changes in Mutexes,
Semaphores and EventGroups. Debugging RTOS code with a logic
analyzer, it works for me! For this example I chose the GPIF pins,
DQ[16..31], since we will not be using these until Chapter 9.

Open the Chapter4Example7 project into your Eclipse
workspace and I shall highlight some of the new additions over
Example6. There is a one line change in main to tell the RTOS that
we will use the DQ[16..31] pins as GPIOs as shown in Figure 4.11.

Figure 4.11 Adding visual display to thread operation


// Main sets up the CPU environment then starts the RTOS
int main(void)
{
    CyU3PIoMatrixConfig_t ioConfig;
    CyU3PReturnStatus_t Status;

    // Start with the default clock at 384 MHz
    Status = CyU3PDeviceInit(0);
    if (Status == CY_U3P_SUCCESS)
    {
        Status = CyU3PDeviceCacheControl(CyTrue, CyTrue, CyTrue);
        if (Status == CY_U3P_SUCCESS)
        {
            CyU3PMemSet((uint8_t *)&ioConfig, 0, sizeof(ioConfig));
            ioConfig.useUart = CyTrue;
            ioConfig.lppMode = CY_U3P_IO_MATRIX_LPP_UART_ONLY;
            ioConfig.gpioSimpleEn[1] = 0x0003DFFE;    // Set GPIF[16:31] as GPIOs
            Status = CyU3PDeviceConfigureIOMatrix(&ioConfig);
            if (Status == CY_U3P_SUCCESS)
            {
                // Need GPIO clocks working for Error Indicator and RTOS visibility
                Status = InitGpioClocks();
                IndicateError(1);    // Turn on Error Indicator
                // One of the first things ApplicationDefine should do is turn off ErrorIndicator
                CyU3PKernelEntry();    // Start RTOS, this does not return
            }
        }
    }
    while (1);    // Get here on a failure, can't recover, just hang here
    // Once the programs get more complex we shall do something more elegant here
    return 0;    // Won't get here but compiler wants this!
}
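
The constant 0x0003DFFE written to gpioSimpleEn[1] can be derived from the list of pins rather than worked out by hand. The sketch below assumes that word 0 of the mask covers GPIO 0 to 31 and word 1 covers GPIO 32 to 63, with bit n of word 1 standing for GPIO 32+n; this layout is my reading and should be checked against the SDK documentation. BuildSimpleGpioMask and TracePins are hypothetical names; the pin numbers are the ones used for tracing in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

/* GPIO numbers used for tracing in this chapter: 33..44 and 46..49 (45 is skipped) */
static const uint8_t TracePins[16] = { 33, 34, 35, 36, 37, 38, 39, 40,
                                       41, 42, 43, 44, 46, 47, 48, 49 };

/* Build the two 32-bit enable words from a list of GPIO numbers (assumed layout, see text) */
void BuildSimpleGpioMask(const uint8_t *Pins, size_t Count, uint32_t Mask[2])
{
    size_t i;
    Mask[0] = 0;
    Mask[1] = 0;
    for (i = 0; i < Count; i++)
        if (Pins[i] < 64) Mask[Pins[i] / 32] |= (uint32_t)1 << (Pins[i] % 32);
}
```

Under that assumption the sixteen trace pins produce 0x0003DFFE in word 1, matching the value in the listing, with the cleared bit 13 corresponding to the gap at GPIO 45.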

The key data structure is declared in DebugConsole.c along
with the SetupTrace() function that manipulates it and the
initialization code from CyFxApplicationDefine() that calls it. I have
collected these together and have repeated them in Figure 4.12. Let
us look at each element in turn.

Figure 4.12 New elements for visual thread tracing

struct { uint32_t Type; uint32_t ID; uint32_t GPIO; } RTOS_Trace[16] = {
    { TRACE_THREAD, 10, 33 }, { TRACE_THREAD, 11, 34 }, { TRACE_THREAD, 12, 35 },
    { TRACE_SEMAPHORE, 0, 36 }, { TRACE_SEMAPHORE, 1, 37 }, { 0, 0, 38 }, { 0, 0, 39 },
    { 0, 0, 40 }, { 0, 0, 41 }, { 0, 0, 42 }, { 0, 0, 43 }, { 0, 0, 44 },
    { TRACE_THREAD, 423, 46 }, { 0, 0, 47 }, { 0, 0, 48 }, { 0, 0, 49 } };

CyU3PReturnStatus_t SetupTrace(uint32_t Index)
{
    CyU3PReturnStatus_t Status = CY_U3P_SUCCESS;
    CyU3PThread *Thread;
    switch (RTOS_Trace[Index].Type)
    {
        case 0: break;    // Do nothing
        case TRACE_THREAD:
            Thread = FindThread(RTOS_Trace[Index].ID);
            if (Thread)
            {
                DebugPrint(8, "\nThread=%X using %d", Thread, RTOS_Trace[Index].GPIO);
                Status = CyU3PThreadSetActivityGpio(Thread, RTOS_Trace[Index].GPIO);
                CheckStatus("Register thread monitoring GPIO", Status);
            }
            break;
        case TRACE_SEMAPHORE:
            // Select the semaphore by its ID in the table, not by the table index
            if (RTOS_Trace[Index].ID == 0)
                Status = CyU3PSemaphoreSetActivityGpio(&DataToProcess, RTOS_Trace[Index].GPIO);
            if (RTOS_Trace[Index].ID == 1)
                Status = CyU3PSemaphoreSetActivityGpio(&DataToOutput, RTOS_Trace[Index].GPIO);
            CheckStatus("Register semaphore monitoring GPIO", Status);
            break;
        default:
            DebugPrint(4, "\nInvalid Type in SetupTrace = %d", RTOS_Trace[Index].Type);
            break;
    }
    return Status;
}

    // Excerpt from CyFxApplicationDefine()
    // Now setup the GPIO and the RTOS Trace
    // Setup GPIF[16:31], already allocated as simple IOs, to outputs, initial low
    CyU3PMemSet((uint8_t *)&GpioConfig, 0, sizeof(GpioConfig));
    GpioConfig.driveLowEn = CyTrue;
    GpioConfig.driveHighEn = CyTrue;
    Status = 0;
    for (i = 0; i < Elements(RTOS_Trace); i++)
    {
        if (RTOS_Trace[i].Type)
        {
            Status |= CyU3PGpioSetSimpleConfig(RTOS_Trace[i].GPIO, &GpioConfig);
            Status |= SetupTrace(i);
            DebugPrint(8, "\nIndex = %d Status = %d", i, Status);
        }
    }
    CheckStatus("Setup Trace GPIO pins", Status);
}

The RTOS_Trace data structure is 16 entries each of three
elements: the Type of the trace, the ID of this particular type and the
GPIO pin associated with this entry. This structure is initially filled
with { 0, 0, GPIO_pin[16..31]} – note that these are not sequential,
there is a gap at 45! You then decide which events you would like
to trace; in this example I chose the three user threads, the two user
semaphores and the SystemTimer thread.

We have used the threads command to view the running
threads (see Figure 4.7). I didn’t want to use the value of the thread
pointer as an ID since the system thread values typically change with
every program run and I wanted to give you the ability to trace
system threads as well as user threads. So I decided to use the
numeric value at the start of the name [now I know why Cypress
named the threads this way!]. The System Timer should be called
“00:System Timer Thread” and it may be fixed by the time you read
this, but note that, as is, its ID is calculated as 423.
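
The value 423 looks odd until you work through the arithmetic. FindThread is not listed here, so the following is only a guess at the conversion it performs: treat the first two characters of the thread name as decimal digits without validating them. Under that assumption a name beginning "Sy" gives 10 x ('S' - '0') + ('y' - '0') = 10 x 35 + 73 = 423. ThreadIdFromName is a hypothetical helper, and "10:Input" in the test is a made-up name in the NN: style.

```c
#include <stdint.h>

/* Treat the first two characters of a thread name as decimal digits, with no validation.
   Assumes the name has at least two characters. */
uint32_t ThreadIdFromName(const char *ThreadName)
{
    return (uint32_t)(10 * (ThreadName[0] - '0') + (ThreadName[1] - '0'));
}
```

A name with a proper two-digit prefix yields that number directly, while an unprefixed name such as "System Timer Thread" yields 423, which is why that value appears in the RTOS_Trace table.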

We have no visibility into the system Mutexes, Semaphores or
EventGroups so we can only trace user-created resources, so I use
the values returned by the Create functions.

Review now the code in SetupTrace() to see how it processes
each entry of the RTOS_Trace[] structure.

The last piece is the code in CyFxApplicationDefine that calls
SetupTrace for each of the 16 chosen IO signals. Note that if
RTOS_Trace[i].Type is zero then this GPIO is not used.

Connect your logic analyzer as shown in Figure 4.13 and run
the program. You will get a display similar to Figure 4.14. Pretty
cool, now this is what I wanted to see! Now, when your program
hangs, because an RTOS thread is waiting for some event that isn’t
going to occur and it is blocking all other threads, you can see the
culprit.

The trace wasn’t exactly what I was expecting – I was
expecting something more like Figure 3.18 but the captured thread
signals are pulses and not square waves. The culprit is my
DoWork() routine which I simulated using ThreadSleep; this, of
course, gives control back to the RTOS so my thread isn’t really
“working”, it is “pulsing” just like the logic analyzer says it is.

Figure 4.13 Connecting a logic analyzer to the GPIO trace pins


Figure 4.14 Logic Analyzer trace showing RTOS activity

The SuperSpeed Explorer board also contains a JTAG TAP
Controller and this can also be used to debug a program. I
personally did not find it useful since it halted the CPU but, since it is
free (well, included) and the Eclipse tools support it, it does
deserve some mention. The SuperSpeed Explorer Kit User Guide
included at the end of this book describes how to set up Eclipse to
communicate with the integrated debugger on the board. The
integrated debugger allows JTAG and the serial port to operate
concurrently so you will not lose your console while using JTAG.

Chapter Summary
This Chapter introduced the Firmware Development portion of
the FX3 Toolset; this uses an Eclipse GUI for program entry and
project management and “back-end” GCC tools to create program
image files that can be executed on an FX3.

The SuperSpeed Explorer Board is a convenient vehicle to
debug FX3 programs and this Chapter also discussed some useful
firmware routines that give insight into some RTOS constructs. One
routine showed how to give visibility to the internal operation of the
RTOS by indicating key events on GPIO pins that can be monitored
with a logic analyzer.

In the next Chapter we will look beyond the CPU at some of
the low-speed IO devices integrated into the FX3 component.

Chapter 5 Exploring the FX3 Low Speed Peripherals

We looked at most of the low speed GPIO capabilities in
Chapter 4. In this Chapter we will look at the I2C and SPI
peripherals. Honestly I couldn’t find anything to do with the I2S
module: it is transmit only and expects separate left and right data
streams (this is not how Windows does it). So there will be no I2S
example in this edition of the book (contributions welcomed for the
second edition).

In this Chapter we will be using the CPLD add-on board
shown in Figure 5.1 to gain access to more IOs. We will use the
low-speed connections to the board in this Chapter and the high-speed
connections in Chapter 9.

Figure 5.1 CPLD add-on board used in this Chapter


A block diagram of the board is shown in Figure 5.2 and you
may want to study the schematic that is within the download data
(the details were too small to read when I included it as a Figure in the
book).

Figure 5.2 Block diagram of the CPLD add-on board

Special (non-obvious) connections are listed below:

1) Note that I use the FX3’s I2S pins as GPIO lines to drive the
JTAG interface of the CPLD, and this allows the FX3 to reprogram
the CPLD. This is described in detail in the Reference Section in
the Programming the CPLD Chapter.

2) I2C and SPI are both fed into the CPLD and the SPI EEPROM
Chip select is derived from the CPLD.

3) I wanted to give you 8 un-committed CPLD pins so that you could
input or output your own data. I bring these to a 10 pin header
that also includes 3.3V and Gnd for your circuitry.

! IMPORTANT NOTE !
When using the CPLD board, jumper J5 on the SuperSpeed
Explorer board should be removed. This disconnects the
onboard SRAM, which is also connected to the FX3, from the
GPIF lines. J2 should be inserted to set VIO to 3.3V.
It may be interesting to note that the board in Figure 5.1 was
not used in the original draft of this book. I used discrete I2C
expander components from Philips as shown in Figure 5.3. The
CPLD board was designed for Chapter 8 but, by adding switches
and LEDs and then reprogramming the CPLD, I was able to produce
a solution for this Chapter that was both cheaper and easier to use.

Figure 5.3 Original I2C example circuit


Connecting the CPLD board
Disconnect the USB 3.0 cable, check that JP3 is Open (we
need 3.3V as VIO due to SPI Flash memory and LEDs), turn your
SuperSpeed Explorer board over and plug in the CPLD board; it is
keyed so will only attach one way. You should now reattach the USB
3.0 cable.

Depending upon the history of your CPLD board it may not
have the I2C slave image on it, but “there’s an app for that!” Locate
and load the CPLD_Programmer.img file onto your Explorer board
using the USB Control Center. The inner workings of the CPLD
programmer are described in the Reference Section but, for now, we
will just treat it as a helpful utility. Now use the PC Utility SendFile to
send I2C_Slave.xsvf to the FX3. The file will be downloaded and
programmed into the CPLD. CPLDs contain flash memory to
remember the image programmed into them after power has been
removed so you only have to do this programming once (or until we
need to change the program!).

Figure 5.4 Programming the CPLD


Many engineers are reluctant to use I2C since it is relatively
slow and needs CPU attention during operation. The FX3 can't do
much about the speed since this is an industry standard but its I2C
module adds buffers and hardware assistance to the I2C port which
enables the CPU to “set it and forget it”. The CPU sets up and starts
a transfer, and then receives an interrupt only when the I2C
operation is completed. Figure 5.5 shows the FX3 hardware added
to support its I2C port with the lsb of each register shown with a red
0.

Figure 5.5 I2C subsystem within FX3


A PreAmble buffer is used to set up commands and its
length can be up to 8 bytes. A 16-bit StopStartControl register
implements repeated START signaling required by some data
transfers. The API includes a pointer to a user buffer of data, or two
DMA sockets are available if large amounts of data are to be moved.
For all operations the I2C slave device address is written into
Preamble[0].bits[7..1] and bit 0 will determine if this transfer is a read
or write. If the I2C device has an address field, and many do, this is
written into Preamble[1] and maybe Preamble[2]. If the command is
a read then the I2C bus must issue another start and the I2C device
address again. Figure 5.6 shows the register setup used to initiate a
read or write with a typical I2C EEPROM.
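The preamble layout described above can be sketched in plain C. This is a minimal illustration and not the SDK's own code: the `I2cPreamble` structure is a hypothetical stand-in for `CyU3PI2cPreamble_t`, the two builder functions are names invented here, and the sketch assumes that bit n of the control mask requests a repeated START after preamble byte n.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the SDK's preamble structure, declared here
 * only so that the sketch compiles without the FX3 SDK. */
typedef struct {
    uint8_t  buffer[8];  /* PreAmble bytes, sent in order              */
    uint8_t  length;     /* Number of valid bytes in buffer (max 8)    */
    uint16_t ctrlMask;   /* Assumed: bit n = repeated START after byte n */
} I2cPreamble;

/* Write: device address with bit 0 = 0, then a 16-bit sub address. */
static void BuildWritePreamble(I2cPreamble *p, uint8_t devAddr7, uint16_t subAddr)
{
    assert(devAddr7 < 0x80);
    p->buffer[0] = (uint8_t)(devAddr7 << 1);        /* bit 0 = 0: write */
    p->buffer[1] = (uint8_t)(subAddr >> 8);
    p->buffer[2] = (uint8_t)(subAddr & 0xFF);
    p->length    = 3;
    p->ctrlMask  = 0x0000;                          /* no repeated START */
}

/* Read: as for a write, then a repeated START and the device address
 * again, this time with bit 0 = 1. */
static void BuildReadPreamble(I2cPreamble *p, uint8_t devAddr7, uint16_t subAddr)
{
    BuildWritePreamble(p, devAddr7, subAddr);
    p->buffer[3] = (uint8_t)((devAddr7 << 1) | 1);  /* bit 0 = 1: read */
    p->length    = 4;
    p->ctrlMask  = 0x0004;                          /* START after buffer[2] */
}
```

For a device at 7-bit address 0x50 and sub address 0x1234, the read preamble comes out as A0 12 34 A1 with a control mask of 0x0004.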

Figure 5.6 Setting up for a read (on the left) and write (on the
right)

If we are only transferring a few bytes then we use TransmitBytes
or ReceiveBytes; these are blocking calls that wait for the transfer to
complete. They block execution of your application program, but
not the underlying threads executing in the background. At 400
Kbps it takes about 25 µsec to transfer a byte on the I2C bus and the
CPU can get through about a thousand instructions in this time! When
transferring an array of data we will use a DMA socket as the source
or sink of the data and we initiate the transfer with SendCommand ;
we can choose to wait in line for the transfer to complete using
WaitForBlockXfer or we could set up a callback which executes on
completion.
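The per-byte figure quoted above can be checked with a little arithmetic. The sketch below assumes that each byte costs nine bus clocks (eight data bits plus one acknowledge bit) and ignores START and STOP overhead; `I2cTransferTimeNs` is a name invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Time, in nanoseconds, to move byteCount bytes on an I2C bus.
 * Assumes 9 clocks per byte (8 data + 1 ACK); START/STOP are ignored. */
static uint64_t I2cTransferTimeNs(uint32_t byteCount, uint32_t busHz)
{
    assert(busHz != 0);
    return ((uint64_t)byteCount * 9u * 1000000000u) / busHz;
}
```

Under these assumptions one byte at 400 kHz takes 22,500 ns, close to the 25 µsec quoted above, and a 256-byte block takes about 5.8 ms, which is why handing a block transfer to DMA is attractive.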

Another useful API command is WaitForAck; after writing
to an EEPROM it will not respond to a read while it is busy updating
its internal flash memory, so a CPU typically polls with a read
command and waits for an Ack. The FX3 CPU can set up this read
command in the preamble buffer, initiate a WaitForAck and then get
on with other work. The I2C subsystem will continue to poll on the
CPU's behalf and the WaitForAck will eventually return.
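The polling that the I2C subsystem performs can be modelled on a PC. This is only a simulation to show the idea: `SimEeprom`, `ProbeDevice` and `PollForAck` are hypothetical names, and the real hardware does this without any CPU loop.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated EEPROM: refuses to acknowledge while its write cycle is running. */
typedef struct {
    unsigned busyPolls;   /* Address probes that will still be NAKed */
} SimEeprom;

/* One address probe: returns 1 for ACK, 0 for NAK. */
static int ProbeDevice(SimEeprom *dev)
{
    if (dev->busyPolls > 0) {
        dev->busyPolls--;
        return 0;
    }
    return 1;
}

/* Probe until the device ACKs or maxProbes is reached.
 * Returns the number of probes used, or -1 if the device never answered. */
static int PollForAck(SimEeprom *dev, unsigned maxProbes)
{
    unsigned n;
    assert(dev != NULL);
    for (n = 1; n <= maxProbes; n++) {
        if (ProbeDevice(dev)) return (int)n;
    }
    return -1;
}
```

A device that stays busy for three probes is acknowledged on the fourth; a device that stays busy for longer than the retry limit produces a timeout that the caller must handle.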

Figure 5.7 shows the equivalent I2C circuit that is inside the
CPLD. It is an I2C slave with its I2C address set in the Verilog code
(I chose 56), an 8-bit input port where switches are attached and an
8-bit output port where LEDs are attached. This is a simple I2C
slave with no sub-addressing and is provided by Xilinx; their Verilog
code is described in the Reference Section in the Writing Your Own
CPLD Code Chapter.

Figure 5.7 I2C slave circuitry within the CPLD


Now open the I2C_Example project into the Eclipse
workspace. This contains all of the debug additions that we made in
the previous Chapter, including the ErrorLookup feature; hopefully
we are not going to have many of these! Note in Startup.c we tell
the GPIO manager that we are planning to use the I2C module. For
my example I am going to read the switches and transfer the pattern
to the LEDs. This operation is simple enough that both functions can
be included in the preamble buffer. You may want to do something a
little more exotic so feel free to attach your favorite I2C hardware to
the SuperSpeed Explorer board and edit the I2C_Example program.
I would suggest that you select and copy the I2C_Example project
then paste and rename it to MyI2C_Example (or similar) so that you
don't lose a working example.

Figure 5.8 Copying switches to LEDs using I2C


uint8_t ReadButtons(void )
{
CyU3PReturnStatus_t Status;
CyU3PI2cPreamble_t preamble;
uint8_t Value;
preamble.length = 1;
preamble.buffer [0] = (DeviceAddress<<1) | 1;
preamble.ctrlMask = 0x0000;

Status = CyU3PI2cReceiveBytes (&preamble, &Value, 1, 0);


CheckStatus("I2C_Read" , Status);
CyU3PDebugPrint (4, "\nButtons = %x" , Value);
return Value;
}
void WriteLEDs(uint8_t Value)
{
// CyU3PDebugPrint(4, "\nLEDs = %d, ", Value);
CyU3PReturnStatus_t Status;
CyU3PI2cPreamble_t preamble;
preamble.length = 1;
preamble.buffer [0] = DeviceAddress<<1;
preamble.ctrlMask = 0x0000;

Status = CyU3PI2cTransmitBytes (&preamble, &Value, 1, 0);


CheckStatus("I2C_Write" , Status);

/* Wait for the write to complete. */


Status = CyU3PI2cWaitForAck (&preamble, 10);
CheckStatus("I2C_WaitForAck" , Status);
}
void ApplicationThread(uint32_t Value)
{
int32_t Seconds = 0;
uint8_t Buttons = 0xAA;
CyU3PReturnStatus_t Status;

Status = SetupGPIO(); // Needed for CPLD_Reset = GPIF_CTRL[10]


CheckStatus("GPIO Initialized" , Status);

Status = InitializeDebugConsole(9);
CheckStatus("Debug Console Initialized" , Status);
// Remove Reset from the CPLD
CyU3PThreadSleep(10);
CyU3PGpioSetValue (CPLD_RESET, 0);

if (Status == CY_U3P_SUCCESS )
{
Status = I2C_Init();
CheckStatus("I2C_Init" , Status);

while (1) // Now run forever


{
CyU3PThreadSleep(100);
WriteLEDs(Buttons);
CyU3PThreadSleep(100);
Buttons = ReadButtons();
Seconds++;
}
}
DebugPrint (4, "\nApplication failed to initialize. Error code: %d.\n" , Status);
while (1); // Hang here
}
Before we leave I2C I have two other examples. We shall
read and write the I2C EEPROM on the SuperSpeed Explorer board
using DMA and we will read and write the I2C port of the
CY7C65215 component that implements the integrated debugger.
This will prove very useful, as you should see, when we move on to
SPI.

The on-board I2C EEPROM is at device address 010100xx.
The M24M02 is a 256KB device and therefore needs 18 address
lines; 16 of these are supplied as a sub address and the upper 2 bits
are part of the device address. The I2C bus therefore sees this
component as 4 separate devices. Now open the
I2C_EEPROM_Example project into the Eclipse workspace; this sets
up DMA channels to write into the EEPROM from ConsoleIn and
ConsoleOut and read from the EEPROM to ConsoleOut (with
logging turned off for this so we don't have an infinite loop). I am not
advocating that you implement this in your design but it is an
interesting demonstration of DMA buffering; the EEPROM page write
size is 256 bytes so I will use this as my DMA size too to simplify
buffering. Study the code and play with the example since it may
help you with a logging project.
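The address split described above can be sketched as follows. This is an illustration under stated assumptions: the base address 0x50 is my reading of the 10100xx pattern as a 7-bit I2C address, and the type and function names are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define EEPROM_BASE_ADDR7  0x50u      /* Assumed: 10100xx as a 7-bit address */
#define EEPROM_PAGE_SIZE   256u       /* Page write size from the text       */
#define EEPROM_SIZE        0x40000u   /* 256 KB = 2^18 bytes                 */

typedef struct {
    uint8_t  devAddr7;   /* 7-bit device address, includes A17..A16 */
    uint16_t subAddr;    /* A15..A0, sent as two sub address bytes  */
} EepromAddress;

/* Split an 18-bit byte address into device address and sub address. */
static EepromAddress SplitEepromAddress(uint32_t byteAddr)
{
    EepromAddress a;
    assert(byteAddr < EEPROM_SIZE);
    a.devAddr7 = (uint8_t)(EEPROM_BASE_ADDR7 | (byteAddr >> 16));
    a.subAddr  = (uint16_t)(byteAddr & 0xFFFFu);
    return a;
}

/* Bytes that can be written from byteAddr without crossing a page boundary. */
static uint32_t BytesLeftInPage(uint32_t byteAddr)
{
    return EEPROM_PAGE_SIZE - (byteAddr % EEPROM_PAGE_SIZE);
}
```

Byte address 0x2ABCD, for example, maps to device address 0x52 and sub address 0xABCD. Keeping each DMA buffer at the page size and page aligned means a write never has to be split across two pages.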

The USB Control Center allows you to copy a firmware image
directly into the EEPROM of the SuperSpeed Explorer board.
Vendor commands are used and the FX3 firmware must be written to
accept and act on these commands. The Cypress UsbI2cDmaMode
example reference does this for you. The FX3 can boot from the I2C
EEPROM; using the USB Control Center load the I2CExample.img
onto the board but this time select I2C EEPROM rather than RAM as
the download target. After downloading, press the RESET button
and note that the FX3 boots directly into the CopySwitchesToLEDs
application. Any .img file can, of course, be chosen and we will use
this feature in the next Chapter such that Cypress’s USB Vendor ID
(VID) and Product ID (PID) are not exposed to your end customer.

Dual Console Project


My next planned project example was to be the SPI block but
we have a problem. When in 32-bit GPIF mode, the SPI lines are on
the same pins as the UART lines. In order to enable the SPI
interface I must disable the UART interface, and this means that I
lose my debug console.

Since I don’t wish to operate without a Debug Console I
looked for alternate solutions. A Debug Console over USB looks
good but we haven’t learnt about USB yet; I shall implement a USB
Debug Console in Chapter 6. We have just learnt I2C and using the
I2C channel as a Debug Console is straightforward so I decided to
go in this direction. For safety, since I don’t like making too big steps
during development, I will first implement an example with both
UART and I2C consoles and then remove the UART console.

In addition to the EEPROM connection, the FX3 I2C controller
is also connected to channel 0 of the CY7C65215 USB serial
component that implements the integrated debugger and Cypress
provides a driver that will connect the I2C slave port of the
CY7C65215 to a Virtual COM port Driver. We can configure the
CY7C65215 channel 0 to connect to the FX3 UART or the FX3 I2C
bus so this project will need the temporary addition of a UART-to-
USB cable as shown in Figure 5.9 so that we can have access to
both interfaces.

Figure 5.9 Connecting the FX3 UART to a PC using a UART-to-USB cable

The UART-to-USB cable is connected to the TX and RX pins
of J7 and to ground. I used a CYUSB232 USB-UART LP Reference
Design Kit since I had one on hand but many others are available
that do the same function. Operation is the same as with all of the
previous examples; we are just using a separate cable instead of the
CY7C65215 connection.

The CY7C65215 connection will be running our I2C Debug
Console. Using the Cypress USB-Serial Configuration Utility,
reconfigure the SCB 0 connection to support I2C Slave as shown in
Figure 5.10.

Figure 5.10 Change Serial Channel 0 to be an I2C Slave

I chose an address of 0x42 since that is my lucky number.
After clicking the program button you need to select Device and
Reset it for the CY7C65215 to come up in its new configuration. The
I2C channel will automatically connect as a Virtual COM Port since
this is a recognized configuration within CyUsb3.sys. We now have
two consoles attached to the FX3 as shown in Figure 5.11.

Figure 5.11 Temporarily attaching another console for this example

I show the development PC running two instantiations of
Clear Terminal but you could also use two separate PCs running
their own terminal emulation program if you preferred.

Now load and run the DualConsoleExample.img and note that
Console Output is appearing on both screens and you can provide
Console Input on either keyboard. I needed to use many RTOS
features to develop this example and I describe the code in detail in
the Reference Section in Building an I2C Debug Console. The
reason that I initially chose the UART for a Debug Console is that it
was easy and this example shows that using an I2C Debug Console
is just as easy thanks to the USB-to-Serial converter, CY7C65215.

SPI Example
My next example uses SPI and here we HAVE to use I2C for
the console since the FX3 uses the same pins for SPI and the
UART. Well this isn't quite accurate; if we were using just 16-bits of
GPIF then both the UART and SPI ports are available but the UART
is moved to a different pin position which is not supported on the
SuperSpeed Explorer board. But remember that this book is about
building a high-performance, low-power, stand-alone, SuperSpeed
device and this would use a 32-bit GPIF connection where you
must decide between UART and SPI. So for the remainder of this
Chapter we shall use I2C for the debug console. You
will not notice the difference but when you design your own board
you will need to know this pin allocation limitation of the FX3 and
design around it as I have done.

The SPI module is a 4-wire, full duplex, master
communications channel that operates at up to 33 MHz. It is fully
configurable supporting data links from 4 to 32-bits, big or little
endian, all standard clock modes, and its slave chip select can be
auto-driven by the module. In the upcoming examples I have 3 SPI
devices connected and must therefore preselect which device I am
about to communicate with so that I can steer the Slave Select,
Negative true (SSN) signal. I do this inside the CPLD as shown
diagrammatically in Figure 5.12. This is simpler, faster and saves bit
banging of individual chip selects using GPIO pins. I use 2 GPIO
lines as addresses for my 3 SPI devices. If both address lines are
high then I2C owns the switches and LEDs. If you have more
devices in your design then you would use additional GPIO lines as
addresses.
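The address scheme above can be written down as a small C sketch. The code values follow the decode in Figure 5.13 (00 = Flash, 01 = switches/LEDs, 10 = user pins, 11 = nothing selected on SPI); the enum and function names are hypothetical, and reserving one code for "nothing selected" when there are more devices is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Address codes driven onto the two GPIO address lines. */
typedef enum {
    SPI_TARGET_FLASH = 0,   /* SPIADDR = 00: SPI Flash memory             */
    SPI_TARGET_SWLED = 1,   /* SPIADDR = 01: switch and LED registers     */
    SPI_TARGET_USER  = 2,   /* SPIADDR = 10: 8-bit user port              */
    SPI_TARGET_NONE  = 3    /* SPIADDR = 11: I2C owns switches and LEDs   */
} SpiTarget;

/* Split a target code into the levels for the two address GPIOs. */
static void SpiTargetToPins(SpiTarget target, int *addr1, int *addr0)
{
    assert(addr1 != NULL && addr0 != NULL);
    *addr1 = (int)(((unsigned)target >> 1) & 1u);
    *addr0 = (int)((unsigned)target & 1u);
}

/* Address lines needed for deviceCount devices plus one idle code. */
static unsigned AddressLinesNeeded(unsigned deviceCount)
{
    unsigned lines = 0;
    assert(deviceCount < (1u << 16));
    while ((1u << lines) < deviceCount + 1u) lines++;
    return lines;
}
```

With three devices two lines are enough; a fourth device would need a third line if the idle code is kept.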

Figure 5.12 SPI circuitry within CPLD

For those of you who are interested in what happens “behind-
the-scenes” Figure 5.13 shows the Verilog code used to implement
the SPI module (edited to fit on one page, see the actual project for
real code). The SPI_I2C CPLD project has a top module with both
the I2C module (from Figure 5.7) and the SPI module included. There is
no need to reprogram the CPLD at this time.

Figure 5.13 Verilog code for SPI module implementation


module spi_module (
input SPICLK, SPIMOSI, SPICS_N, [1:0] SPIADDR, [7:0] BUTTON, RESET,
output SPIMISO, FLASHCSN, [7:0] LED,
inout [7:0] USER_PIN
);
// Set parameter to determine the direction of the individual USER_PIN
parameter user_output = 8'b00000000; // 1 = output
// Declarations
wire SWLED_CS; // Switch/Led register selected
wire USER_CS; // User register selected
wire shift_mosi; // Shift the mosi shift register
reg [8:0] mosi_r; // Mosi shift register
wire load_miso; // Load the Miso shift register
wire [7:0] read_data; // Read of BUTTON or USER_PIN inputs
reg [7:0] miso_r; // Miso shift register
wire write_led; // Write the LED register
wire write_user; // Write the USER register
wire [7:0] write_data; // Data to write to registers
reg [7:0] led_r; // LED register
reg [7:0] user_r; // USER register
// Decode address to determine target of Spi operation
assign FLASHCSN = ~( (~SPICS_N) & (SPIADDR == 2'b00)); // Flash selected
assign SWLED_CS = ( (~SPICS_N) & (SPIADDR == 2'b01)); // Switch/Led selected
assign USER_CS = ( (~SPICS_N) & (SPIADDR == 2'b10)); // User pins selected
// MOSI shift register
assign shift_mosi = (SWLED_CS | USER_CS) & ~mosi_r[8];
always @ (negedge SPICLK or posedge SPICS_N) begin
if (SPICS_N) begin
mosi_r <= 9'h001;
end
else begin
mosi_r <= shift_mosi ? {mosi_r[7:0], SPIMOSI} : mosi_r;
end
end
// MISO shift register
assign load_miso = (mosi_r[8:1] == 8'h01);
assign read_data = SWLED_CS ? BUTTON : USER_CS ? USER_PIN : 8'h0;
always @ (negedge SPICLK or posedge SPICS_N) begin
if (SPICS_N) begin
miso_r <= 8'h0;
end
else begin
miso_r <= load_miso ? read_data : {miso_r[6:0], 1'b0};
end
end
assign SPIMISO = miso_r[7];
assign MISO_ENA = SWLED_CS | USER_CS;
// Drive LED and USER pins
assign write_led = SWLED_CS & (mosi_r[8:7] == 2'b01);
assign write_user = USER_CS & (mosi_r[8:7] == 2'b01);
assign write_data = {mosi_r[6:0], SPIMOSI};
always @ (negedge SPICLK or posedge RESET) begin
if (RESET) begin
led_r <= 8'h0;
user_r <= 8'h0;
end
else begin
led_r <= write_led ? write_data : led_r;
user_r <= write_user ? write_data : user_r;
end
end
assign LED = led_r;
assign USER_PIN[0] = user_output[0] ? user_r[0] : 1'bz;
assign USER_PIN[1] = user_output[1] ? user_r[1] : 1'bz;
assign USER_PIN[2] = user_output[2] ? user_r[2] : 1'bz;
assign USER_PIN[3] = user_output[3] ? user_r[3] : 1'bz;
assign USER_PIN[4] = user_output[4] ? user_r[4] : 1'bz;
assign USER_PIN[5] = user_output[5] ? user_r[5] : 1'bz;
assign USER_PIN[6] = user_output[6] ? user_r[6] : 1'bz;
assign USER_PIN[7] = user_output[7] ? user_r[7] : 1'bz;
endmodule

Now open the SPI_Example project into your Eclipse
workspace. Note in Startup.c that we are using SPI and I2C but not
the UART. To make this example more realistic I have 2 threads that
both wish to access SPI devices. The underlying SPI driver will
correctly permit only one thread to access the SPI module at a time
but I have added another mutex to control access to the SPI address
lines since these become a shared resource that must be protected
from being changed during SPI accesses. One thread periodically
reads the switches and copies them to the LEDs while the other
thread displays blocks of data from the SPI_Flash memory.

Load and run SPI_Example.IMG and enter Dnnn (nnn =
0x000 to 0xFFF) on the Debug Console to display blocks of data
from the SPI_Flash. The switches may be toggled while the display
is running and note that the LEDs are updated. Feel free to extend
the Verilog code and the application program to access the 8-bit
User port on the CPLD board.
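If you extend the console commands, a parser for the Dnnn command might look like the sketch below. It is not the code from the SPI_Example project: `ParseDisplayCommand` is a hypothetical name, and accepting one to three hex digits in either case is my assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Parse 'D' followed by one to three hex digits, ending at CR, LF or NUL.
 * Returns 1 and stores the block number (0x000..0xFFF) on success,
 * or returns 0 if the text is not a valid command. */
static int ParseDisplayCommand(const char *text, uint16_t *block)
{
    uint16_t value = 0;
    int digits = 0;
    assert(text != NULL && block != NULL);
    if (text[0] != 'D' && text[0] != 'd') return 0;
    for (text++; *text != '\0' && *text != '\r' && *text != '\n'; text++) {
        unsigned char c = (unsigned char)*text;
        if (!isxdigit(c) || ++digits > 3) return 0;
        value = (uint16_t)((value << 4) |
                (isdigit(c) ? c - '0' : toupper(c) - 'A' + 10));
    }
    if (digits == 0) return 0;
    *block = value;
    return 1;
}
```

Rejecting a fourth digit keeps the block number inside the 0x000 to 0xFFF range without a separate range check.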

Before leaving this Chapter you may want to switch your
debug console back to use the UART. Use the USB-Serial
Configuration Utility to reconfigure the SCB 0 connection back to a
UART with CDC driver and select 2 wire as the connection method.

This Chapter has covered the FX3’s “supporting cast” of low
speed peripheral devices. The FX3 hardware treats them all as
block orientated devices and this provides a consistent programming
interface where underlying DMA hardware does the heavy lifting of
data movement so that more CPU time is available to run your
application code. We also discovered that we can use the UART
channel or the I2C channel to run our Debug Console.

The next Chapter will look at USB and you will discover that
the DMA hardware also makes this easy for you and the CPU too.
Chapter 6 SuperSpeed USB communications

The USB block presents the same socket interface to the FX3
API so transferring data across the SuperSpeed USB interface is
fundamentally the same as transferring data across the UART or SPI
interfaces so there is not a lot more to learn here. There are 32
sockets that match up with the 32 endpoints so we can have several
conversations going on at the same time. To get maximum USB
transfer speed we need to generate or consume data at 400 MBps
and the UART, SPI or even CPU cannot maintain this rate. GPIF II
is designed to sustain this rate and we shall do this in the next
Chapter but I wanted to start easy and grow into the higher speed.

Connecting to USB has consequences for the overall design
of the project so I'd like you to step back and consider a few system
implications of this step. Many USB devices require a USB
connection to implement their task and without it they may do other
tasks or may do nothing at all. Also a USB connection may be lost
or suspended at any time and the application must deal with this. To
accommodate these system level issues I have split Application.c
into StartStopApplication.c and RunApplication.c. We also have two
new modules in our template to handle USB; these are
USB_Handler.c that manages USB traffic and USB_Descriptors.c
that defines the descriptors for this application and this template is
shown in Figure 6.1.

Figure 6.1 Project template moving forward


Keyboard Example
My first example is a keyboard. I am not expecting you to
build a SuperSpeed USB keyboard (unless you can type really, really
fast). I chose this example because it uses a pre-existing driver and
therefore we do not need to write any software on the host. We can
concentrate on the USB aspects of the FX3. A keyboard device is a
HID (Human Interface Device) and therefore the communications
protocol is predefined. We need to implement an Interrupt_IN
endpoint and handle class requests on the control endpoint, EP0.
All communication will go via the CPU so we will learn how to
implement various USB to and from CPU DMA transfers. I will use
the debug keyboard to generate keystrokes and will display
keyboard indicators sent from the host on the debug console as
shown in Figure 6.2. You can use two laptops as I have shown in
the figure or, if you have a large monitor, use two instantiations of
Clear Terminal (or TeraTerm or . . . ), one for each virtual port.

Figure 6.2 SuperSpeed USB keyboard example


Open the KeyboardExample project into your Eclipse
workspace since it will be easier to follow along looking at code on
your screen rather than the smaller Figures in this book. Before the
keystrokes can be sent the USB connection to the host must be
made. ApplicationDefine calls InitializeUSB() in USB_Handler.c and
this is shown in Figure 6.3.

Figure 6.3 USB initialization code


// Declare the callbacks needed to support the USB device driver
CyBool_t USBSetup_Callback (uint32_t setupdat0, uint32_t setupdat1)
{
CyBool_t isHandled = CyFalse;
CyU3PReturnStatus_t Status ;
uint16_t Count;
union { uint32_t SetupData [2];
struct { uint8_t Target :5; uint8_t Type :2; uint8_t Direction :1;
uint8_t Request ; uint16_t Value ; uint16_t Index ; uint16_t Length ; };
} Setup;
// Copy the incoming Setup Packet into Setup union which will "unpack" the variables
Setup. SetupData [0] = setupdat0;
Setup. SetupData [1] = setupdat1;
// USB Driver will send me Class and Vendor requests to handle
// I only have to handle three class requests for a Keyboard
if (Setup.Target == CLASS_REQUEST)
{
if (Setup.Direction == 0) // Host-to-Device
{
if (Setup.Request == HID_SET_REPORT)
{
CyU3PUsbGetEP0Data ( sizeof (glEP0Buffer), glEP0Buffer, &Count);
isHandled = CyTrue;
CyU3PDebugPrint (4, "\nSet LEDs = 0x%x\n" , glEP0Buffer[0]);
}
if (Setup.Request == HID_SET_IDLE)
{
CyU3PUsbAckSetup ();
isHandled = CyTrue;
CyU3PDebugPrint (4, "\nSent Ack to Set Idle" );
}
}
else // Device-to-Host
{
if ((Setup.Request == CY_U3P_USB_SC_GET_DESCRIPTOR ) &&
((Setup.Value >> 8) == CY_U3P_USB_REPORT_DESCR ))
{
Status = CyU3PUsbSendEP0Data (59, (uint8_t*)ReportDescriptor);
CheckStatus( "Send Report Descriptor" , Status);
isHandled = CyTrue;
}
}
}
return isHandled;
}
// For Debug and education display the name of the Event
const char * EventName[] = {
"CONNECT" , "DISCONNECT" , "SUSPEND" , "RESUME" , "RESET" ,
"SET_CONFIGURATION" , "SPEED" ,
"SET_INTERFACE" ,"SET_EXIT_LATENCY" , "SOF_ITP" ,
"USER_EP0_XFER_COMPLETE" , "VBUS_VALID" ,
"VBUS_REMOVED" , "HOSTMODE_CONNECT" , "HOSTMODE_DISCONNECT" ,
"OTG_CHANGE" ,
"OTG_VBUS_CHG" ,"OTG_SRP" , "EP_UNDERRUN" , "LINK_RECOVERY" ,
"USB3_LINKFAIL" ,
"SS_COMP_ENTRY" , "SS_COMP_EXIT" };
void USBEvent_Callback (CyU3PUsbEventType_t Event, uint16_t EventData )
{
DebugPrint (4, "\nEvent received = %s" , EventName[Event]);
switch (Event)
{
case CY_U3P_USB_EVENT_SETCONF :
/* Stop the application before re-starting. */
if (glIsApplicationActive) StopApplication();
StartApplication();
break ;
case CY_U3P_USB_EVENT_RESET :
case CY_U3P_USB_EVENT_CONNECT :
case CY_U3P_USB_EVENT_DISCONNECT :
if (glIsApplicationActive)
{
CyU3PUsbLPMEnable ();
StopApplication();
}
break ;

default :
break ;
}
}
CyBool_t LPMRequest_Callback (CyU3PUsbLinkPowerMode link_mode)
{
return CyTrue;
}
// Spin up USB, let the USB driver handle enumeration
CyU3PReturnStatus_t InitializeUSB (void )
{
CyU3PReturnStatus_t Status;
Status = CyU3PUsbStart ();
CheckStatus("Start USB Driver" , Status);
// Setup callbacks to handle the setup requests, USB Events and LPM Requests (for USB 3.0)
CyU3PUsbRegisterSetupCallback(USBSetup_Callback, CyTrue);
CyU3PUsbRegisterEventCallback (USBEvent_Callback);
CyU3PUsbRegisterLPMRequestCallback (LPMRequest_Callback);

// Driver needs all of the descriptors so it can supply them to the host when requested
Status = SetUSBdescriptors();
CheckStatus("Set USB Descriptors" , Status);
/* Connect the USB Pins with super speed operation enabled. */
Status = CyU3PConnectState (CyTrue, CyTrue);
CheckStatus("Connect USB" , Status);
return Status;
}
I start the USB driver using CyU3PUsbStart() and this will
set up all the USB hardware, several DMA buffers and a thread to
handle almost all of the work. We need to register three callback
routines with the driver where we can customize the operation of the
driver for this particular application. The
CyU3PUsbRegisterSetupCallback() tells the driver how to handle the
set up requests that will arrive on EP0; in general you set the second
parameter to CyTrue which tells the driver to handle all set up
requests, such as the many received during enumeration, itself. It
will then only call the USBSetup_Callback routine for Class and
Vendor requests that it cannot handle.

If you look now inside the USBSetup_Callback() you will see
that it is handling the class requests to set the keyboard LEDs,
respond to the HID idle requests and supply the report descriptor
when requested. This code implements the functions that a HID
class device is expected to handle.
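The same unpacking of the two setup words can be done with shifts and masks, which does not depend on how a compiler lays out bit-fields. This is a minimal sketch: the `SetupPacket` type and `DecodeSetup` function are hypothetical names, and the packing assumed is the one the union relies on (request type in the low byte of the first word, then request and value; index and length in the second word).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  recipient;  /* bmRequestType bits 4..0                        */
    uint8_t  type;       /* bmRequestType bits 6..5: 0 std, 1 class, 2 vendor */
    uint8_t  direction;  /* bmRequestType bit 7: 1 = device-to-host        */
    uint8_t  request;    /* bRequest */
    uint16_t value;      /* wValue   */
    uint16_t index;      /* wIndex   */
    uint16_t length;     /* wLength  */
} SetupPacket;

/* Unpack the two 32-bit setup words into their individual fields. */
static void DecodeSetup(uint32_t setupdat0, uint32_t setupdat1, SetupPacket *s)
{
    assert(s != NULL);
    s->recipient = (uint8_t)(setupdat0 & 0x1Fu);
    s->type      = (uint8_t)((setupdat0 >> 5) & 0x03u);
    s->direction = (uint8_t)((setupdat0 >> 7) & 0x01u);
    s->request   = (uint8_t)((setupdat0 >> 8) & 0xFFu);
    s->value     = (uint16_t)(setupdat0 >> 16);
    s->index     = (uint16_t)(setupdat1 & 0xFFFFu);
    s->length    = (uint16_t)(setupdat1 >> 16);
}
```

A HID Set Report request (request type 0x21, request 0x09, value 0x0200, length 1) arrives as the words 0x02000921 and 0x00010000 and decodes to recipient 1 (interface), type 1 (class) and direction 0 (host-to-device).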

The USBEvent_Callback() is used to notify us of important
state transitions on USB. I included a decoded display of these
events since I find it useful to know what the driver is doing and I am
sure that you will too. I handle events which let me know when to
start and stop the application.

The final callback is LPMRequest_Callback(). This callback
is made when the host wants to transition from the SuperSpeed active
state (U0) to a lower power state, U1 or U2. We return CyTrue if it is OK to
move to a low power state or CyFalse if we would prefer to stay fully
active at the highest power. In this application it is acceptable to go
to the lower power state since the delay in restarting from this state
is insignificant when compared with our data rate, so we return
CyTrue. In examples in later Chapters we will return CyFalse if our
application would prefer to keep USB active.

After registering the callbacks we call SetUSBdescriptors() to
provide the driver with all of our descriptor information. You should
take a quick look at these descriptors in USB_Descriptors.c; they
are not repeated here since the file is large and not very interesting;
this is standard USB stuff! They include everything needed to
declare a USB 3.0 device and a USB 2.0 device; all USB 3.0 devices
must also operate as USB 2.0 devices if they are attached to a USB
2.0 host. I put the routine that sets these descriptors in this
module too.

The last thing that InitializeUSB() does is connect the internal
USB module to the USB pins using CyU3PConnectState(). This will
result in the host seeing our device and starting the enumeration
sequence. Various events will be seen on our debug console and
once a Set Configuration event is received we can start the
application; the code to do this is within StartStopApplication.c and
repeated here in Figure 6.4 for discussion. The matching
StopApplication is also shown.

Figure 6.4 Starting (and Stopping) the keyboard application


void StartApplication(void )
// USB has been enumerated, time to start the application running
{
CyU3PEpConfig_t epConfig;
CyU3PDmaChannelConfig_t dmaConfig;
CyU3PReturnStatus_t Status;
// Display the enumerated device bus speed
DebugPrint(4, "\nRunning at %sSpeed" , BusSpeed[CyU3PUsbGetSpeed ()]);
// Configure and enable the Interrupt Endpoint
CyU3PMemSet ((uint8_t *)&epConfig, 0, sizeof (epConfig));
epConfig.enable = CyTrue;
epConfig.epType = CY_U3P_USB_EP_INTR ;
epConfig.burstLen = 1;
epConfig.pcktSize = REPORT_SIZE;
Status = CyU3PSetEpConfig (CY_FX_EP_CONSUMER, &epConfig);
CheckStatus("Setup Interrupt In Endpoint" , Status);
// Create a manual DMA channel between CPU producer socket and USB
CyU3PMemSet ((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
dmaConfig.size = 16; // Minimum size, I only need REPORT_SIZE
dmaConfig.count = 2;
dmaConfig.prodSckId = CY_FX_CPU_PRODUCER_SOCKET;
dmaConfig.consSckId = CY_FX_EP_CONSUMER_SOCKET;
dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE ;
Status = CyU3PDmaChannelCreate (&glCPUtoUSB_Handle,
CY_U3P_DMA_TYPE_MANUAL_OUT ,
&dmaConfig);
CheckStatus("CreateCPUtoUSBdmaChannel" , Status);
// Set DMA Channel transfer size = infinite
Status = CyU3PDmaChannelSetXfer (&glCPUtoUSB_Handle, 0);
CheckStatus("CPUtoUSBdmaChannelSetXfer" , Status);
glIsApplicationActive = CyTrue; // Now ready to run!
}

void StopApplication(void )
// USB connection has been lost, time to stop the application running
{
CyU3PEpConfig_t epConfig;
CyU3PReturnStatus_t Status;
glIsApplicationActive = CyFalse;
// Close down and disable the endpoint then close the DMA channel
CyU3PUsbFlushEp (CY_FX_EP_CONSUMER);
CyU3PMemSet ((uint8_t *)&epConfig, 0, sizeof (epConfig));
Status = CyU3PSetEpConfig (CY_FX_EP_CONSUMER, &epConfig);
CheckStatus("Disable Consumer Endpoint" , Status);
Status = CyU3PDmaChannelDestroy (&glCPUtoUSB_Handle);
CheckStatus("Close CPUtoUSB DMA Channel" , Status);
}

Our keyboard HID has one data endpoint and we set up a
manual DMA channel to get the report data from the CPU to this
endpoint and turn the channel on. OK, we can now send keyboard
reports! I generate keyboard reports in DebugConsole.c as shown in
the highlighted line in Figure 6.5 – I intercept characters that are
being typed on the debug console and convert these into keystrokes
which will be sent to the PC.

Figure 6.5  Collect ‘keystrokes’ from the user


void UartCallback (CyU3PUartEvt_t Event, CyU3PUartError_t Error)
// Handle characters typed in by the developer
{
CyU3PDmaBuffer_t ConsoleInDmaBuffer;
char InputChar;
if (Event == CY_U3P_UART_EVENT_RX_DONE )
{
CyU3PDmaChannelSetWrapUp (&glUARTtoCPU_Handle);
CyU3PDmaChannelGetBuffer (&glUARTtoCPU_Handle,
&ConsoleInDmaBuffer, CYU3P_NO_WAIT);
InputChar = (char )*ConsoleInDmaBuffer.buffer ;
CyU3PDebugPrint (4, "%c" , InputChar); // Echo the character
// The characters typed on the debug console are sent as keystrokes on the keyboard
SendKeystroke(InputChar);
CyU3PDmaChannelDiscardBuffer (&glUARTtoCPU_Handle);
CyU3PUartRxSetBlockXfer (1);
}
}
The translation of ASCII characters into keystrokes is
implemented using a lookup table and this is shown in Figure 6.6.
We need to send 2 reports, one for key down and the other for key
up. I only handle printable ASCII characters. If I had been building a
real SuperSpeed keyboard then I would have code that handled
function keys, special keys such as volume up and rewind and a
plethora of other keyboard goodies. The goal of this example was to
introduce USB data transfers and not build the world's best and
fastest keyboard. This is left to the reader who would like to expand
this example.
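The report format can also be illustrated without a lookup table. The sketch below swaps in simple arithmetic for letters and digits, which makes it easy to cross-check individual table entries; it is not the book's method, covers far fewer characters, and the function names and `MOD_LEFT_SHIFT` constant are introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REPORT_SIZE     8      /* Modifier, Reserved, UsageCode[6]   */
#define MOD_LEFT_SHIFT  0x02   /* Left Shift bit in the modifier byte */

/* Fill a key press report for a letter, digit, space or CR.
 * Returns 1 on success, 0 for characters this sketch does not cover. */
static int BuildKeyPress(char c, uint8_t report[REPORT_SIZE])
{
    assert(report != NULL);
    memset(report, 0, REPORT_SIZE);
    if (c >= 'a' && c <= 'z') {
        report[2] = (uint8_t)(0x04 + (c - 'a'));
    } else if (c >= 'A' && c <= 'Z') {
        report[0] = MOD_LEFT_SHIFT;
        report[2] = (uint8_t)(0x04 + (c - 'A'));
    } else if (c >= '1' && c <= '9') {
        report[2] = (uint8_t)(0x1E + (c - '1'));
    } else if (c == '0') {
        report[2] = 0x27;
    } else if (c == ' ') {
        report[2] = 0x2C;
    } else if (c == '\r') {
        report[2] = 0x28;      /* Enter */
    } else {
        return 0;
    }
    return 1;
}

/* A key release is an all-zero report. */
static void BuildKeyRelease(uint8_t report[REPORT_SIZE])
{
    assert(report != NULL);
    memset(report, 0, REPORT_SIZE);
}
```

An upper case 'G', for example, produces modifier 0x02 and usage 0x0A, followed by an all-zero release report.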

Figure 6.6 Ready to send keystrokes


// Create a lookup table that uses the ASCII character as an index and produces a Modifier/Usage Code pair
// Index = (character - 0x20) * 2; Modifier 2 = Left Shift
const uint8_t Ascii2Usage[] = {
0,0x2C,2,0x1E,2,0x34,2,0x20,2,0x21,2,0x22,2,0x24,0,0x34, // 20..27 space !"#$%&'
2,0x26,2,0x27,2,0x25,2,0x2E,0,0x36,0,0x2D,0,0x37,0,0x38, // 28..2F ()*+,-./
0,0x27,0,0x1E,0,0x1F,0,0x20,0,0x21,0,0x22,0,0x23,0,0x24, // 30..37 01234567
0,0x25,0,0x26,2,0x33,0,0x33,2,0x36,0,0x2E,2,0x37,2,0x38, // 38..3F 89:;<=>?
2,0x1F,2,0x04,2,0x05,2,0x06,2,0x07,2,0x08,2,0x09,2,0x0A, // 40..47 @ABCDEFG
2,0x0B,2,0x0C,2,0x0D,2,0x0E,2,0x0F,2,0x10,2,0x11,2,0x12, // 48..4F HIJKLMNO
2,0x13,2,0x14,2,0x15,2,0x16,2,0x17,2,0x18,2,0x19,2,0x1A, // 50..57 PQRSTUVW
2,0x1B,2,0x1C,2,0x1D,0,0x2F,0,0x31,0,0x30,2,0x23,2,0x2D, // 58..5F XYZ[\]^_
0,0x35,0,0x04,0,0x05,0,0x06,0,0x07,0,0x08,0,0x09,0,0x0A, // 60..67 `abcdefg
0,0x0B,0,0x0C,0,0x0D,0,0x0E,0,0x0F,0,0x10,0,0x11,0,0x12, // 68..6F hijklmno
0,0x13,0,0x14,0,0x15,0,0x16,0,0x17,0,0x18,0,0x19,0,0x1A, // 70..77 pqrstuvw
0,0x1B,0,0x1C,0,0x1D,2,0x2F,2,0x31,2,0x30,2,0x35,0,0x28 // 78..7F xyz{|}~ and 0x7F (used for CR = Enter)
};

void SendKeystroke (char InputChar)


// In this example characters typed on the debug console are sent as key strokes
// The format of a keystroke is defined in the report descriptor;
// it is 8 bytes long = Modifier, Reserved, UsageCode[6]
// A 'standard' keyboard can encode up to 6 key usages, this example only does 1
// A keyboard will send two reports, one for key press and one for key release
{
uint16_t Index;
CyU3PReturnStatus_t Status = CY_U3P_SUCCESS ;
CyU3PDmaBuffer_t ReportBuffer;
// The only Control character I handle is CR
if (InputChar == 0x0D) InputChar = 0x7F;
if (InputChar > 0x1F)
{
Index = (((uint8_t)InputChar & 0x7F)-0x20)<<1; // Each entry is two uint8_t
// First need a buffer to build the report
Status = CyU3PDmaChannelGetBuffer (&glCPUtoUSB_Handle, &ReportBuffer,
CYU3P_WAIT_FOREVER);
// CheckStatus("GetReportBuffer4KeyPress", Status);
// Most of this report will be 0's
ReportBuffer.count = REPORT_SIZE;
CyU3PMemSet (ReportBuffer.buffer , 0, REPORT_SIZE);
// Convert InputChar to a Modifier and a Usage
ReportBuffer.buffer [0] = Ascii2Usage[Index++];
ReportBuffer.buffer [2] = Ascii2Usage[Index];
// Send the Key Press to the host
Status = CyU3PDmaChannelCommitBuffer (&glCPUtoUSB_Handle,
REPORT_SIZE, 0);
// CheckStatus("Send KeyPress ", Status);
// Wait 10msec then send a Key Release
CyU3PThreadSleep(10);
Status = CyU3PDmaChannelGetBuffer (&glCPUtoUSB_Handle, &ReportBuffer,
CYU3P_WAIT_FOREVER);
// CheckStatus("GetReportBuffer4KeyRelease", Status);
ReportBuffer.count = REPORT_SIZE;
CyU3PMemSet (ReportBuffer.buffer , 0, REPORT_SIZE);
Status = CyU3PDmaChannelCommitBuffer (&glCPUtoUSB_Handle,
REPORT_SIZE, 0);
// CheckStatus("Send KeyRelease", Status);
}
}

In real time the last few pages took about 50 msec, so the
code in RunApplication.c has been waiting for us.
Build the project then use the USB Control Center to load and
run it. Characters now typed on your debug console are being sent
to your host. Open a word processor, or similar, and watch the
characters arrive. On your host keyboard try CAPS_LOCK and
NUM_LOCK a few times; these generate notifications that can be
viewed on the debug console as LED settings.
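
As a rough illustration of what those notifications carry: on a standard keyboard the host sends a one-byte output report whose low bits mirror the lock LEDs. The sketch below assumes the usual bit assignment (bit 0 Num Lock, bit 1 Caps Lock, bit 2 Scroll Lock) and a report descriptor that follows it; FormatLeds is a hypothetical helper, not a function from the example project.

```c
#include <stdint.h>
#include <stdio.h>

#define LED_NUM_LOCK    0x01   /* bit 0 of the LED output report */
#define LED_CAPS_LOCK   0x02   /* bit 1 */
#define LED_SCROLL_LOCK 0x04   /* bit 2 */

/* Hypothetical helper: render the LED byte as text for a debug console.
 * Returns the value from snprintf (characters needed, excluding the NUL). */
int FormatLeds(uint8_t Leds, char *Text, size_t Size)
{
    return snprintf(Text, Size, "NUM=%d CAPS=%d SCROLL=%d",
                    (Leds & LED_NUM_LOCK) != 0,
                    (Leds & LED_CAPS_LOCK) != 0,
                    (Leds & LED_SCROLL_LOCK) != 0);
}
```

With Num Lock and Caps Lock both on, the host would send 0x03 and the helper produces "NUM=1 CAPS=1 SCROLL=0".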

The USB driver does all of the heavy lifting for us and the
DMA driver moves data to where we need it. These two building
blocks make it straightforward to communicate with a host PC using
SuperSpeed USB, allowing you to focus on the requirements of your
application and not on the low-level details.

CDC Example
My next example builds a tool that we can use in later
examples. You may also find it useful while developing your
applications. This example also uses an OS-supplied class driver
and, from an implementation point of view, it is similar to the
keyboard example. The Communications Device Class (CDC) is
used by modems and terminal emulation programs such as Clear
Terminal and TeraTerm to move serial data between the host and a
serial device. By itself it is not too interesting but when combined
with other USB interfaces, as we shall do with the next example, it
will be a valuable addition to our tool chest. Figure 6.7 shows the
operation of this example.

Figure 6.7 Data communication in this CDC Example


A CDC interface uses two bulk endpoints, one OUT and one
IN, to move user data. It also uses a control endpoint to send
commands and an interrupt endpoint for returning status and
notifications from the device. Open the CDC_Example project into
your Eclipse workspace and review Descriptors.c; again nothing too
exciting but necessary for the application. These descriptors define
an abstract control device which is basically a bidirectional, serial
byte mover.

We have two choices on how to connect up the sockets and
these are shown in Figure 6.8; the example supports both with a
#define DirectConnect. If DirectConnect = 1 then I make the obvious
connections of joining the Producer Endpoint socket with the UART
Consumer socket and the UART Producer socket with the Consumer
Endpoint socket. This allows me to set up auto DMA channels
between the USB sockets and UART sockets. If all we wanted to do
was to create a USB-to-Serial connection then we are done;
however, a disadvantage of this scheme is that we lose our Debug
Console.

The right side of Figure 6.8 shows the connections with
DirectConnect = 0; here I am passing all the data through the CPU.
The sockets on the CPU appear to be shared – this is OK, the CPU
supports as many producer and consumer sockets as you need, and it
uses the callback routines to execute the correct code.

Figure 6.8 The example supports DirectConnect or CPU supervised

Figure 6.9 shows the StartApplication function that sets up the
USB endpoints and DMA channels and the StopApplication code
that tears them down. It is a little long since I included both
connection examples in the same code so that you could compare
and contrast the differences between the implementations. Setting
up the endpoints is the same in each case. The DirectConnect has
simpler DMA settings since the channel is an auto channel with no
CPU involvement. The CPU-supervised connection specifies two
manual channels each with a callback routine to manage routing
characters around the various DMA channels. The manual channel
between CPU and USB uses a local (non-DMA allocated) buffer to
send characters – I wanted to show an alternative approach to the
regular GetBuffer method. In all cases the DMA buffers are recycled
by the routines and data transfer continues until the FX3 is reset.

The comments and the text in the CheckStatus() calls give
you a good guide to what the code is doing.

Figure 6.9 Starting and Stopping the two examples

#if (DirectConnect)
CyU3PDmaChannel glUSBtoUART_Handle; // Handle needed for Bulk Out Endpoint
CyU3PDmaChannel glUARTtoUSB_Handle; // Handle needed for Bulk In Endpoint
#else
CyU3PDmaChannel glUSBtoCPU_Handle; // Handle needed for Bulk Out Endpoint
CyU3PDmaChannel glCPUtoUSB_Handle; // Handle needed for Bulk In Endpoint
CyU3PDmaBuffer_t UserBuffer; // Used for sending to EP Consumer

void GotCharactersFromHost (CyU3PDmaChannel *Handle, CyU3PDmaCbType_t Type,
CyU3PDmaCBInput_t *Input)
{
if (Type == CY_U3P_DMA_CB_PROD_EVENT )
{
uint8_t* BytePtr = Input->buffer_p .buffer ;
uint8_t* EndPtr = BytePtr + Input->buffer_p .count ;
*EndPtr = 0;
// Shouldn't call DebugPrint in a callback but this is a special case
DebugPrint(4, "\nReceived: %s" , BytePtr);
CyU3PDmaChannelDiscardBuffer (Handle);
}
}
void SentCharacterToHost (CyU3PDmaChannel *Handle, CyU3PDmaCbType_t Type,
CyU3PDmaCBInput_t *Input)
{
if (Type == CY_U3P_DMA_CB_CONS_EVENT )
CyU3PDmaChannelDiscardBuffer (Handle);
}
void SendCharacter (char InputChar)
{
*UserBuffer.buffer = InputChar;
UserBuffer.count = 1;
UserBuffer.status = 0;
CyU3PDmaChannelSetupSendBuffer (&glCPUtoUSB_Handle, &UserBuffer);
}
#endif
void StartApplication (void )
// USB has been enumerated, time to start the application running
{
CyU3PEpConfig_t epConfig;
CyU3PDmaChannelConfig_t dmaConfig;
CyU3PReturnStatus_t Status = CY_U3P_SUCCESS ;
uint16_t Size = EpSize[CyU3PUsbGetSpeed ()];
// Display the enumerated device bus speed
DebugPrint(4, "\nRunning at %sSpeed" , BusSpeed[CyU3PUsbGetSpeed ()]);

// Configure and enable the Consumer Endpoint


CyU3PMemSet ((uint8_t *)&epConfig, 0, sizeof (epConfig));
epConfig.enable = CyTrue;
epConfig.epType = CY_U3P_USB_EP_BULK ;
epConfig.burstLen = 1;
epConfig.pcktSize = Size;
Status = CyU3PSetEpConfig (CY_FX_EP_CONSUMER, &epConfig);
CheckStatus("Setup Consumer Endpoint" , Status);
// Configure and enable the Producer Endpoint
Status = CyU3PSetEpConfig (CY_FX_EP_PRODUCER, &epConfig);
CheckStatus("Setup Producer Endpoint" , Status);
// Configure and enable the Interrupt Endpoint
epConfig.epType = CY_U3P_USB_EP_INTR ;
epConfig.pcktSize = 64;
epConfig.isoPkts = 1;
Status = CyU3PSetEpConfig (CY_FX_EP_INTERRUPT, &epConfig);
CheckStatus("Setup Interrupt Endpoint" , Status);
#if (DirectConnect)
// Create an auto DMA channel between USB producer socket and UART Consumer
CyU3PMemSet((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
dmaConfig.size = Size;
dmaConfig.count = 2;
dmaConfig.prodSckId = CY_FX_EP_PRODUCER_SOCKET;
dmaConfig.consSckId = CY_U3P_LPP_SOCKET_UART_CONS;
dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
Status = CyU3PDmaChannelCreate(&glUSBtoUART_Handle,
CY_U3P_DMA_TYPE_AUTO,
&dmaConfig);
CheckStatus("CreateUSBtoUARTdmaChannel" , Status);
// Set DMA Channel transfer size = infinite
Status = CyU3PDmaChannelSetXfer(&glUSBtoUART_Handle, 0);
CheckStatus("USBtoUARTdmaChannelSetXfer" , Status);
// Create an auto DMA channel between UART producer socket and USB Consumer
CyU3PMemSet((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
dmaConfig.size = Size;
dmaConfig.count = 2;
dmaConfig.prodSckId = CY_U3P_LPP_SOCKET_UART_PROD;
dmaConfig.consSckId = CY_FX_EP_CONSUMER_SOCKET;
dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
Status = CyU3PDmaChannelCreate(&glUARTtoUSB_Handle,
CY_U3P_DMA_TYPE_AUTO,
&dmaConfig);
CheckStatus("CreateUARTtoUSBdmaChannel" , Status);
// Set DMA Channel transfer size = infinite
Status = CyU3PDmaChannelSetXfer(&glUARTtoUSB_Handle, 0);
CheckStatus("UARTtoUSBdmaChannelSetXfer" , Status);
#else
// Create a manual DMA channel between USB producer socket and CPU Consumer
CyU3PMemSet ((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
dmaConfig.size = 32; // I assume a person is typing
dmaConfig.count = 2;
dmaConfig.prodSckId = CY_FX_EP_PRODUCER_SOCKET;
dmaConfig.consSckId = CY_U3P_CPU_SOCKET_CONS ;
dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE ;
dmaConfig.notification = CY_U3P_DMA_CB_PROD_EVENT ;
dmaConfig.cb = GotCharactersFromHost;
Status = CyU3PDmaChannelCreate (&glUSBtoCPU_Handle,
CY_U3P_DMA_TYPE_MANUAL_IN , &dmaConfig);
CheckStatus("CreateUSBtoCPUdmaChannel" , Status);
// Set DMA Channel transfer size = infinite
Status = CyU3PDmaChannelSetXfer (&glUSBtoCPU_Handle, 0);
CheckStatus("USBtoCPUdmaChannelSetXfer" , Status);
// Create a manual DMA channel between CPU producer socket and USB Consumer
CyU3PMemSet ((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
dmaConfig.size = 32; // I assume a person is typing
dmaConfig.count = 0; // Don't assign any buffers here, will do manually
dmaConfig.prodSckId = CY_U3P_CPU_SOCKET_PROD ;
dmaConfig.consSckId = CY_FX_EP_CONSUMER_SOCKET;
dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE ;
dmaConfig.notification = CY_U3P_DMA_CB_CONS_EVENT ;
dmaConfig.cb = SentCharacterToHost;
Status = CyU3PDmaChannelCreate (&glCPUtoUSB_Handle,
CY_U3P_DMA_TYPE_MANUAL_OUT ,
&dmaConfig);
CheckStatus("CreateCPUtoUSBdmaChannel" , Status);

UserBuffer.buffer = CyU3PMemAlloc (32);


if (UserBuffer.buffer == NULL) Status = CY_U3P_ERROR_MEMORY_ERROR ;
CheckStatus("Get UserBuffer" , Status);
UserBuffer.size = 32;
#endif
glIsApplicationActive = CyTrue; // Now ready to run!
}

void StopApplication (void )


// USB connection has been lost, time to stop the application running
{
CyU3PEpConfig_t epConfig;
CyU3PReturnStatus_t Status = CY_U3P_SUCCESS ;

glIsApplicationActive = CyFalse;

// Close down and disable the endpoints then close the DMA channels
CyU3PUsbFlushEp (CY_FX_EP_CONSUMER);
CyU3PUsbFlushEp (CY_FX_EP_PRODUCER);
CyU3PUsbFlushEp (CY_FX_EP_INTERRUPT);
CyU3PMemSet ((uint8_t *)&epConfig, 0, sizeof (epConfig));
Status = CyU3PSetEpConfig (CY_FX_EP_CONSUMER, &epConfig);
CheckStatus("Disable Consumer Endpoint" , Status);
Status = CyU3PSetEpConfig (CY_FX_EP_PRODUCER, &epConfig);
CheckStatus("Disable Producer Endpoint" , Status);
Status = CyU3PSetEpConfig (CY_FX_EP_INTERRUPT, &epConfig);
CheckStatus("Disable Interrupt Endpoint" , Status);
#if (DirectConnect)
Status = CyU3PDmaChannelDestroy(&glUSBtoUART_Handle);
CheckStatus("Close USBtoUART DMA Channel" , Status);
Status = CyU3PDmaChannelDestroy(&glUARTtoUSB_Handle);
CheckStatus("Close UARTtoUSB DMA Channel" , Status);
#else
Status = CyU3PDmaChannelDestroy (&glUSBtoCPU_Handle);
CheckStatus("Close USBtoCPU DMA Channel" , Status);
Status = CyU3PDmaChannelDestroy (&glCPUtoUSB_Handle);
CheckStatus("Close CPUtoUSB DMA Channel" , Status);
CyU3PMemFree (UserBuffer.buffer );
#endif
}

Figure 6.10 shows the handling of CDC class commands
which are used to set up and report upon the communications
channel. These functions are required by the host CDC driver but
note that I don’t reconfigure the UART unless I am in DirectConnect
mode (the PC kept changing my baud rate settings!). Also note that
all of the DebugPrint() statements are commented out – the program
hangs with them in because I forgot my own advice, “don’t put
blocking calls in a callback routine”, and DebugPrint is blocking! I’ll
fix this in the next example.
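
A minimal sketch of how the two setup words can be unpacked with shifts and masks instead of a bit-field union, which avoids relying on compiler-specific bit-field layout. It assumes the 8 setup bytes arrive little-endian in the two 32-bit words, with bmRequestType in the lowest byte of the first word; SetupPacket_t and DecodeSetup are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical unpacked view of the 8-byte USB setup packet */
typedef struct {
    uint8_t  Recipient;   /* bmRequestType bits 4..0: 0 device, 1 interface, 2 endpoint */
    uint8_t  Type;        /* bmRequestType bits 6..5: 0 standard, 1 class, 2 vendor */
    uint8_t  Direction;   /* bmRequestType bit 7: 1 = device to host */
    uint8_t  Request;     /* bRequest */
    uint16_t Value;       /* wValue */
    uint16_t Index;       /* wIndex */
    uint16_t Length;      /* wLength */
} SetupPacket_t;

/* Assumes setupdat0 holds bytes 0..3 and setupdat1 holds bytes 4..7, little-endian */
SetupPacket_t DecodeSetup(uint32_t setupdat0, uint32_t setupdat1)
{
    SetupPacket_t s;
    uint8_t bmRequestType = (uint8_t)(setupdat0 & 0xFF);

    s.Recipient = (uint8_t)(bmRequestType & 0x1F);
    s.Type      = (uint8_t)((bmRequestType >> 5) & 0x03);
    s.Direction = (uint8_t)(bmRequestType >> 7);
    s.Request   = (uint8_t)((setupdat0 >> 8) & 0xFF);
    s.Value     = (uint16_t)(setupdat0 >> 16);
    s.Index     = (uint16_t)(setupdat1 & 0xFFFF);
    s.Length    = (uint16_t)(setupdat1 >> 16);
    return s;
}
```

For a CDC SET_LINE_CODING request (bmRequestType 0x21, bRequest 0x20, wLength 7) the words would be 0x00002021 and 0x00070000, which decode to Type = 1 (class), Recipient = 1 (interface) and Length = 7.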

Figure 6.10 Handling CDC class commands


// Declare the callbacks needed to support the USB device driver
CyBool_t USBSetup_Callback (uint32_t setupdat0, uint32_t setupdat1)
{
CyU3PReturnStatus_t Status;
union { uint32_t SetupData [2];
struct { uint8_t Target :5; uint8_t Type :2; uint8_t Direction :1;
uint8_t Request ; uint16_t Value ; uint16_t Index ; uint16_t Length ; };
} Setup;
struct {uint32_t DTE_Rate ;uint8_t StopBits ;uint8_t Parity ;uint8_t Length ;}
LineCoding;
uint16_t ReadCount;
// Copy the incoming Setup Packet into my Setup union which will "unpack" the variables
Setup.SetupData [0] = setupdat0;
Setup.SetupData [1] = setupdat1;

// USB Driver will send me Class and Vendor requests to handle


// I only have to handle three class requests for a CDC Device
if (Setup.Target == CLASS_REQUEST)
{
if (Setup.Request == SET_LINE_CODING)
{
Status = CyU3PUsbGetEP0Data (sizeof (LineCoding), (uint8_t*)&LineCoding,
&ReadCount);
// CheckStatus("Set Line Coding", Status);
#if (DirectConnect)
// Only reconfigure the UART if the line coding data was received
if (Status == CY_U3P_SUCCESS )
{
glUartConfig.baudRate = LineCoding.DTE_Rate ;
// Update other parameters only if I can support them
if (LineCoding.StopBits == 0) glUartConfig.stopBit =
CY_U3P_UART_ONE_STOP_BIT ;
if (LineCoding.StopBits == 2) glUartConfig.stopBit =
CY_U3P_UART_TWO_STOP_BIT ;
if (LineCoding.Parity == 0) glUartConfig.parity =
CY_U3P_UART_NO_PARITY ;
if (LineCoding.Parity == 1) glUartConfig.parity =
CY_U3P_UART_ODD_PARITY ;
if (LineCoding.Parity == 2) glUartConfig.parity =
CY_U3P_UART_EVEN_PARITY ;
Status = CyU3PUartSetConfig(&glUartConfig, NULL);
// CheckStatus("Change UART Configuration", Status);
return CyTrue;
}
#endif
if (Status == CY_U3P_SUCCESS ) return CyTrue;
}
if (Setup.Request == GET_LINE_CODING)
{
// DebugPrint(4, "\nGet Line Coding");
#if (DirectConnect)
LineCoding.DTE_Rate = glUartConfig.baudRate ;
if (glUartConfig.stopBit == CY_U3P_UART_ONE_STOP_BIT )
LineCoding.StopBits = 0;
if (glUartConfig.stopBit == CY_U3P_UART_TWO_STOP_BIT )
LineCoding.StopBits = 2;
if (glUartConfig.parity == CY_U3P_UART_NO_PARITY ) LineCoding.Parity = 0;
if (glUartConfig.parity == CY_U3P_UART_ODD_PARITY ) LineCoding.Parity = 1;
if (glUartConfig.parity == CY_U3P_UART_EVEN_PARITY ) LineCoding.Parity = 2;
#endif
Status = CyU3PUsbSendEP0Data (sizeof (LineCoding),
(uint8_t*)&LineCoding);
// CheckStatus("Report UART Configuration", Status);
return CyTrue;
}
if (Setup.Request == SET_CONTROL_LINE_STATE)
{
// DebugPrint(4, "\nSet Control Line State");
if (glIsApplicationActive) CyU3PUsbAckSetup ();
else CyU3PUsbStall (0, CyTrue, CyFalse);
return CyTrue;
}
}
return CyFalse;
}
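
One detail worth illustrating: the CDC line coding structure is 7 bytes on the wire (a 32-bit little-endian data rate followed by stop bits, parity and data bits), while a C struct with the same fields is padded to 8 bytes by most compilers. The sketch below packs and unpacks the 7 bytes explicitly so the result does not depend on struct layout. LineCoding_t, PackLineCoding and UnpackLineCoding are hypothetical names, not part of the example project.

```c
#include <stdint.h>

#define LINE_CODING_SIZE 7   /* bytes on the wire */

typedef struct {
    uint32_t DTE_Rate;   /* data rate in bits per second */
    uint8_t  StopBits;   /* 0 = 1 stop bit, 1 = 1.5, 2 = 2 */
    uint8_t  Parity;     /* 0 none, 1 odd, 2 even, 3 mark, 4 space */
    uint8_t  DataBits;   /* 5, 6, 7, 8 or 16 */
} LineCoding_t;

/* Convert the 7 wire bytes into the struct, independent of padding and endianness */
void UnpackLineCoding(const uint8_t Raw[LINE_CODING_SIZE], LineCoding_t *Out)
{
    Out->DTE_Rate = (uint32_t)Raw[0]
                  | ((uint32_t)Raw[1] << 8)
                  | ((uint32_t)Raw[2] << 16)
                  | ((uint32_t)Raw[3] << 24);
    Out->StopBits = Raw[4];
    Out->Parity   = Raw[5];
    Out->DataBits = Raw[6];
}

/* Convert the struct back into the 7 wire bytes */
void PackLineCoding(const LineCoding_t *In, uint8_t Raw[LINE_CODING_SIZE])
{
    Raw[0] = (uint8_t)(In->DTE_Rate & 0xFF);
    Raw[1] = (uint8_t)((In->DTE_Rate >> 8) & 0xFF);
    Raw[2] = (uint8_t)((In->DTE_Rate >> 16) & 0xFF);
    Raw[3] = (uint8_t)((In->DTE_Rate >> 24) & 0xFF);
    Raw[4] = In->StopBits;
    Raw[5] = In->Parity;
    Raw[6] = In->DataBits;
}
```

A host asking for 115200 baud, 8 data bits, no parity and one stop bit would send the bytes 00 C2 01 00 00 00 08.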

Load and run the CDC_Example.img file using the USB Control
Center and this will create a new virtual com port. You now have two
virtual com ports, one connected to the FX3 and one connected to
the debug console. Attach a terminal program such as Clear
Terminal to each of them and note that characters typed in one
window appear in the other. If you set DirectConnect to 0 then you
will get a lot more messages in the Debug Window. Not all that
exciting but this has big implications for the next example.

Debug Console Over USB


I called the previous example a CDC Interface because we
can now “copy and paste” this code into another project and easily
make a composite device. This example combines a Cypress
BulkLoop firmware example with a CDC interface as shown in Figure
6.11. We will also add code to redirect the Debug Console to the
CDC interface.

Figure 6.11 This example has two independent interfaces

Open the CDC_BulkLoop project into your Eclipse workspace
and we shall have a quick look at DebugConsole.c to discover how
this was accomplished. I added a new command, switch, to the
ParseCommand routine which calls SwitchConsoles(). The code is
significantly simpler than the console redirection to I2C since the
CDC driver looks like an infinite byte source and sink, just like a
UART. This means that we can use the DebugPrint routine
provided by Cypress to implement Console_Out and we just need to
connect to the socket created for CDC Out to implement our
Console_In routine. The important code is repeated in Figure 6.12
(edited slightly to fit on the page; the real code is in the
CDC_BulkLoop project folder).

Figure 6.12 Debug console redirection to CDC driver


static CyBool_t UsingUARTConsole;

CyU3PReturnStatus_t InitializeDebugConsoleIn (CyBool_t UsingUARTConsole)


{
CyU3PReturnStatus_t Status;
CyU3PDmaChannelConfig_t dmaConfig;
CyU3PMemSet ((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
dmaConfig.size = 64; // This should be way more than I need
dmaConfig.count = 2; // I probably won't need 2
dmaConfig.prodSckId =UsingUARTConsole ? CY_U3P_LPP_SOCKET_UART_PROD
:USB_CDC_OUT_SOCKET;
dmaConfig.consSckId = CY_U3P_CPU_SOCKET_CONS ;
dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE ;
dmaConfig.notification = CY_U3P_DMA_CB_PROD_EVENT ;
Status=CyU3PDmaChannelCreate (&DebugConsoleIn_Handle,
TYPE_MANUAL_IN ,&dmaConfig);
CheckStatus("CreateDebugRxDmaChannel" , Status);
if (Status != CY_U3P_SUCCESS ) CyU3PDmaChannelDestroy
(&DebugConsoleIn_Handle);
else
{
Status = CyU3PDmaChannelSetXfer (&DebugConsoleIn_Handle,
INFINITE_TRANSFER_SIZE);
CheckStatus("ConsoleInEnabled" , Status);
}
return Status;
}

void SwitchConsoles (void )


{
CyU3PReturnStatus_t Status;
DebugPrint(4, "Switching console to %s" , UsingUARTConsole ? "USB" : "UART" );
CyU3PThreadSleep(100); // Delay to allow message to get to the user
// Disconnect the current console
CyU3PDebugDeInit ();
CyU3PThreadSleep(100); // Delay to allow thread to complete and all buffers returned
// Connect up the new Console out
// The CDC interface is already running since USB has been enumerated
Status=CyU3PDebugInit (UsingUARTConsole?USB_CDC_IN_SOCKET:
LPP_SOCKET_UART_CONS ,6);
UsingUARTConsole = !UsingUARTConsole; // Logical NOT: ~ would leave CyTrue (1) non-zero
// Say hello on the new console
DebugPrint(4, "Console is now %s" , UsingUARTConsole ? "UART" : "USB" );
// Now connect up Console In
Status = InitializeDebugConsoleIn(UsingUARTConsole);
CheckStatus("InitializeDebugConsoleIn" , Status);
}

Load and run CDC_BulkLoop.img using the USB Control
Center. Connect a terminal emulation program to the virtual com
port that the program creates (this is the same com port as you used
in the previous example). Note that this terminal emulation program
is now your debug console! I redirected the debug console to the
CDC interface. From an operational point of view everything will
look the same as before but this console will run MUCH faster
when using SuperSpeed USB. I do hope that you can read quickly
or that your terminal program will save the console output into a file!

The USB Control Center should display the BulkLoop
interface of this example. You can use the center panel to copy
some data to and from the SuperSpeed Explorer board.

One small downside of using a debug console across USB is
that the USB connection must be operational to see any output. As
a safety net I still keep a serial console attached to the integrated
debugger port (UART in this example but it could have been the I2C
channel) at all times and switch back to it when the USB connection
is lost. The USB debug console also uses some of the SuperSpeed
bus bandwidth but the fundamental non-polling of bulk endpoints
implemented by USB 3.0 keeps this load to a minimum.

Cypress USB examples


When you installed the SuperSpeed SDK it came with many
Cypress examples. The USB examples are in five categories:

Class drivers: UVC, MSC, CDC, Audio

Bulk loop: USBBulkLoopXXXX


Streamer: USBXXXSourceSink

Low speed IO: SPI, I2C, I2S, FlashPrg, UART

Other: USBHost, OTG, SlaveFIFO, Boot, GPIF

I started this chapter with a USB class driver. I chose Human
Interface Device (HID) because it was easy. We then worked through a
Communications Device Class (CDC) interface and made this into a
composite device. You could also review the Cypress Mass Storage
Class (MSC), Audio and USB Video Class (UVC) examples. They
follow the same structure as my worked HID example: the
descriptors are pre-defined by the USB class specification and the
list of class commands that need to be implemented is also
documented.

All of the other examples use a Vendor Defined driver called
CyUsb3.sys. This is a Microsoft signed driver that supports all
versions of Windows.

I want to focus upon two of the categories here since they
could form the basis of your custom application: BulkLoop and
Streamer.

BulkLoop Firmware

BulkLoop forms the basis of moving bulk data across USB.
The host will send one chunk of data on an OUT endpoint and will
then wait for a similar chunk of data on an IN endpoint. If you don't
return some data then the host will not send any more. This is
why it is called a loop.

The firmware will decide how big a chunk is. To receive data
from an OUT endpoint we must set up a DMA channel to receive
data: we choose the USB OUT endpoint as a Producer Socket then
select a BufferSize and BufferCount. Chunk is measured in bytes
and is (BufferSize * BufferCount). Once the DMA controller has
accepted chunk bytes from the Producer Socket then further
attempts by the host to send data will be NAKed. What does the
DMA controller do with this data?

The DMA controller processes data in BufferSize pieces so
once the first of the buffers is filled, it will hand off the buffer to the
Consumer Socket so that it can start filling the second buffer. The
choice of Consumer Socket will determine what happens next.
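
The chunk arithmetic and the NAK behaviour can be modelled in a few lines. This is a deliberately simplified sketch, not FX3 code: it tracks only a byte count, ignores short-packet commits, and uses hypothetical names (Channel_t, ProducerWrite, ConsumerDrain) and assumed sizes of two 1024-byte buffers.

```c
#include <stdint.h>

#define BUFFER_SIZE  1024u   /* assumed size of one DMA buffer, in bytes */
#define BUFFER_COUNT 2u      /* assumed number of buffers in the channel */

/* Simplified model: just the number of bytes currently held by the channel */
typedef struct {
    uint32_t Stored;
} Channel_t;

/* Chunk = BufferSize * BufferCount, the most the channel can hold */
uint32_t ChunkSize(void)
{
    return BUFFER_SIZE * BUFFER_COUNT;
}

/* Host sends Bytes on the OUT endpoint.
 * Returns 1 if accepted, 0 if the transfer would be NAKed. */
int ProducerWrite(Channel_t *Ch, uint32_t Bytes)
{
    if (Ch->Stored + Bytes > ChunkSize()) return 0;   /* no room: NAK */
    Ch->Stored += Bytes;
    return 1;
}

/* Consumer empties one full buffer, which is then recycled to the producer.
 * Returns 1 if a buffer was drained, 0 if no full buffer was waiting. */
int ConsumerDrain(Channel_t *Ch)
{
    if (Ch->Stored < BUFFER_SIZE) return 0;
    Ch->Stored -= BUFFER_SIZE;
    return 1;
}
```

With these sizes the channel accepts two 1024-byte writes, refuses a third until ConsumerDrain has freed a buffer, and then accepts data again.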

If you choose the IN endpoint as the Consumer Socket then
the DMA controller will post the first buffer there and, as soon as IN
tokens arrive, the buffer will be emptied to USB. Once fully emptied,
the buffer will be given back to the Producer Socket where it joins
the other buffers waiting to be filled. This filling, emptying and
recycling of buffers will continue forever or until the PC stops sending
data. This is called an AUTO channel and the FX3 CPU is unaware
that all this data is moving in and out of the FX3 since it is all done
using DMA hardware. A small variant of this is AUTO_SIGNAL
where the CPU gets an interrupt when buffers fill or empty but it has
to be pretty idle to catch all of them. A 16KB buffer can be received
and resent within 40 µs.
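
To put that last figure in perspective, here is the arithmetic as a small sketch. It reads the statement as 16 KB crossing the bus in one direction in about 40 µs, which is an interpretation rather than a measured result, and compares the rate with the 500 MB/s that a 5 Gb/s SuperSpeed link carries after 8b/10b encoding. Both function names are hypothetical.

```c
/* Bytes per microsecond is numerically equal to megabytes (10^6 bytes) per second */
double ThroughputMBps(double Bytes, double Microseconds)
{
    return Bytes / Microseconds;
}

/* Fraction of the 500 MB/s available on a 5 Gb/s link after 8b/10b encoding */
double LinkFraction(double MBps)
{
    return MBps / 500.0;
}
```

ThroughputMBps(16.0 * 1024.0, 40.0) works out to roughly 410 MB/s, or about 82% of the encoded link rate, before protocol overhead is considered.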

Although very fast, this AUTO loopback is not useful except
for demonstration or testing. The PC is getting back the same data
that it sent.

If we choose GPIF as the Consumer Socket then an AUTO
channel could move the USB data out of the FX3 as fast as it was
coming from the PC. Now this is useful, and we will do this in
Chapter 9, but this is not a BulkLoop application since we are not
sending data back on the USB Consumer Socket.

If we choose the CPU as the Consumer Socket then we could
process the data and send modified data back to the PC host. This
is called a MANUAL channel since the CPU is involved and this will
not be as fast as an AUTO channel. We will also have to set up
another DMA channel between a CPU Producer Socket and the IN
endpoint Consumer Socket to return the data.

There are several Cypress examples of data processing
including adding or removing headers and footers to/from the
incoming data. They are all variations on the theme of receive data
in, process data, send data out. Using the PC as the source and
sink for the data makes it easier to develop PC applications code
while waiting for the device hardware to be ready.
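
The "receive, process, send" theme can be sketched with a plain C function that wraps a payload in a header and footer, which is the kind of work a MANUAL channel callback would do between its two DMA channels. The framing used here (a two-byte marker, a 16-bit length, and a 16-bit byte-sum footer) is invented for illustration and is not the format used by any Cypress example.

```c
#include <stdint.h>
#include <string.h>

#define HEADER_SIZE 4u   /* hypothetical: 0xA5 0x5A marker + 16-bit length */
#define FOOTER_SIZE 2u   /* hypothetical: 16-bit sum of the payload bytes */

/* Copies In[0..Count-1] into Out with a header in front and a footer behind.
 * Returns the total number of bytes written, or 0 if Out is too small. */
uint32_t AddHeaderFooter(const uint8_t *In, uint16_t Count,
                         uint8_t *Out, uint32_t OutSize)
{
    uint16_t Sum = 0;
    uint32_t Total = HEADER_SIZE + (uint32_t)Count + FOOTER_SIZE;

    if (Total > OutSize) return 0;

    Out[0] = 0xA5;                               /* marker */
    Out[1] = 0x5A;
    Out[2] = (uint8_t)(Count & 0xFF);            /* length, little-endian */
    Out[3] = (uint8_t)(Count >> 8);
    memcpy(&Out[HEADER_SIZE], In, Count);        /* payload */
    for (uint16_t i = 0; i < Count; i++) Sum = (uint16_t)(Sum + In[i]);
    Out[HEADER_SIZE + Count]     = (uint8_t)(Sum & 0xFF);   /* footer */
    Out[HEADER_SIZE + Count + 1] = (uint8_t)(Sum >> 8);
    return Total;
}
```

A three-byte payload of 1, 2, 3 becomes nine bytes: A5 5A 03 00 01 02 03 06 00. Because the output is longer than the input, the outgoing DMA buffers would need to be sized for the extra six bytes.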

Streamer firmware
There are three examples of Streamer firmware; they are the
same code but use bulk, isochronous or interrupt endpoints. A
Streamer is an infinite source and sink of data. It pre-fills the DMA
buffers with known data and keeps them staged at an IN endpoint
Consumer Socket. It fills buffers with the data that arrives at an OUT
endpoint Producer Socket then discards the data so that it can
recycle the buffer as fast as possible.
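
"Known data" is useful because the receiver can verify it. The sketch below assumes a pattern of 32-bit little-endian incrementing counters, which is a choice made for this illustration and not necessarily the pattern the Cypress firmware uses; FillPattern and CheckPattern are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* Fill Buffer with consecutive 32-bit little-endian counter values.
 * *Counter is advanced so the next buffer continues the sequence. */
void FillPattern(uint8_t *Buffer, size_t Size, uint32_t *Counter)
{
    for (size_t i = 0; i + 4 <= Size; i += 4) {
        uint32_t v = (*Counter)++;
        Buffer[i]     = (uint8_t)(v & 0xFF);
        Buffer[i + 1] = (uint8_t)((v >> 8) & 0xFF);
        Buffer[i + 2] = (uint8_t)((v >> 16) & 0xFF);
        Buffer[i + 3] = (uint8_t)((v >> 24) & 0xFF);
    }
}

/* Check a received buffer against the expected sequence.
 * Returns -1 if every word matches, else the index of the first bad word. */
long CheckPattern(const uint8_t *Buffer, size_t Size, uint32_t *Expected)
{
    for (size_t i = 0; i + 4 <= Size; i += 4) {
        uint32_t v = (uint32_t)Buffer[i]
                   | ((uint32_t)Buffer[i + 1] << 8)
                   | ((uint32_t)Buffer[i + 2] << 16)
                   | ((uint32_t)Buffer[i + 3] << 24);
        if (v != *Expected) return (long)(i / 4);
        (*Expected)++;
    }
    return -1;
}
```

A counter pattern has the advantage over a constant fill that a dropped or repeated buffer shows up as a jump in the sequence.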

The DMA channels are set up as MANUAL but since the CPU
is doing no real work on the data, and is just recycling buffers, it
can easily keep up with USB 3.0 data rates. It is expected that the
PC is using asynchronous, overlapped transfers to get the maximum
data throughput rates. This firmware is useful in initial testing of PC
application software before the device hardware is ready. Once the
GPIF interface is up and running it can stream data faster than
the PC can keep up with, and we will study this in the next chapter.

Low speed IO examples


Cypress includes a selection of USB to low speed IO
examples. These will be useful for learning the FX3 and for
development.

Other examples
I put USB-to-GPIF in this category and we will cover this in
later chapters. There are also examples that I don't cover in this
SuperSpeed Device book. The FX3 also runs at high speed (480
Mb/s) and at this speed also supports USB host mode including
OTG. Cypress has worked examples covering these modes and
application notes that describe the implementation details (see
AN77960 Introduction to EZ-USB® FX3™ High-Speed USB Host
Controller). Note that the FX3 does not include a root hub which
means that it can only talk to a single device (and that device
cannot be a hub). For point-to-point communications, such as being
an OTG host to an Android phone, this is fine. You can find
several FX3 to Android projects in Unboxing Android USB by
Rajaram Regupathy.

Chapter Summary
In this Chapter we looked at connecting the FX3 to a
SuperSpeed USB bus and we wrote several firmware programs that
used Class Drivers on the PC host; this enabled us to focus on the
FX3’s device role and the firmware we needed to write for successful
communications. The USB and DMA device drivers do most of the
‘grunt’ work and we used a high level API to access these drivers –
this enabled us to focus on the application function and not on low
level USB issues. Cypress supplies a collection of example
programs which will help you with your projects.

We look at the PC Host side of a USB solution in the next
chapter and write a series of programs which exercise the USB and
DMA blocks of the FX3.

Chapter 7 PC Host Software Development

Recall from Chapter 4 that there are three pieces of software
that must be written for a complete FX3 solution. These are shown
in Figure 7.1 which is a duplicate of Figure 4.1. In this Chapter we
are going to write some Windows-based applications programs,
using Visual Studio and Cypress’s USB Suite, that interact with the
FX3 firmware that was discussed in the previous chapter.

Figure 7.1 FX3 development uses three tool sets

Figure 7.2 shows the Visual Studio projects available with the
SDK that we will be exploring in the next two chapters. Cypress-
supplied software and drivers are shown in red and my examples are
shown in blue.

Figure 7.2 Visual Studio projects available to study


Cypress wrote a CyUsb3.sys driver to give full access to the
capabilities of the SuperSpeed FX3 component. This driver has
been WHQL certified and is accessed from a C++ application using
CyAPI.lib or from a managed Microsoft .NET application using
CyUSB.dll. I will cover C++ examples in this chapter and a C#
application in the next chapter.

CyAPI.lib is a C++ class library that provides a high-level
interface to the CyUsb3.sys kernel mode driver. Working with
Windows applications can get very deep, very fast due to the human
interface wrappers that the Visual Studio applications wizard builds
around your code. To enable you to focus upon the capabilities that
CyAPI.lib exposes I decided that my first example would be a
Windows console application. This minimal human interface has
almost no Windows GUI distractions.

A Windows console application opens a DOS box (I envy you
if you had to google DOS, when I was young . . .) and simple printf
statements can be used to send messages to the user. This DOS
box is a bona fide window that can receive OS messages such as
notifications of device attachments or cursor activity. I decided to
forego this capability since it is a distraction from my goal of showing
the capabilities of CyAPI.lib. Instead I shall use a timer-based poll to
look for an FX3 device. This makes the example much smaller and
simpler.
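
The shape of a timer-based poll can be sketched independently of any USB library: try a probe once per second and give up after a fixed number of attempts. The version below is plain C with the probe and the sleep passed in as function pointers so that it can be exercised without hardware; PollForDevice, FakeProbe and NoSleep are hypothetical names, and the last two are stand-ins for a real device search and a real one-second delay.

```c
/* Returns the number of matching devices currently attached */
typedef int  (*ProbeFn)(void *Context);
/* Blocks for the given number of milliseconds */
typedef void (*SleepFn)(unsigned Milliseconds);

/* Probe once per second, counting down from TimeoutSeconds.
 * Returns the countdown value at which a device was found, or 0 on timeout. */
int PollForDevice(ProbeFn Probe, void *Context, SleepFn SleepMs, int TimeoutSeconds)
{
    for (int Seconds = TimeoutSeconds; Seconds > 0; Seconds--) {
        if (Probe(Context) > 0) return Seconds;   /* found at least one device */
        SleepMs(1000);                            /* wait before trying again */
    }
    return 0;
}

/* Stand-in probe: Context points at an int holding how many calls are needed
 * before a device "appears" */
int FakeProbe(void *Context)
{
    int *CallsLeft = (int *)Context;
    return (--(*CallsLeft) <= 0) ? 1 : 0;
}

/* Stand-in sleep that returns immediately */
void NoSleep(unsigned Milliseconds)
{
    (void)Milliseconds;
}
```

Polling costs up to a second of latency after the device is plugged in, which is the price paid for not handling window messages.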

The first example will write a file to an FX3 attached device.
Two methods of writing are supported, fast and very fast. A fast
transfer will use standard, blocking Windows functions while very
fast will use non-blocking, overlapped data transfers for maximum
throughput. Let's start with fast data transfer and then move on to
very fast.

Figure 7.3 shows the code used to look for an FX3 device
(actually any device that the CyUsb3.sys driver recognizes by its
GUID). The important line is number 16; here we will create a new
instance of CCyUSBDevice. The CyAPI library searches all
attached USB devices for those that match the CYUSBDRV_GUID
and populates USBDevice objects with information it extracts from
each device. The device object is extensive and its structure can be
viewed in CyAPI.h; it contains properties and methods that we can
access. The method USBDevice->DeviceCount() retrieves a count
of the matching devices.

Figure 7.3 Polling for an FX3 device


7 int _tmain(int argc, _TCHAR* argv[])
8 {
9 CCyUSBDevice *USBDevice;
10 int Seconds;
11
12 printf("\nPoll for FX3 device V0.3\n" );
13
14 for (Seconds=30; Seconds>0; Seconds--)
15 {
16 USBDevice = new CCyUSBDevice(NULL, CYUSBDRV_GUID, true );
17 if (USBDevice->DeviceCount())
18 {
19 printf("\nFound %d device(s)" , USBDevice->DeviceCount());
20 break ;
21 }
22 else
23 {
24 delete USBDevice;
25 printf("%d \r" , Seconds);
26 }
27 Sleep(1000);
28 }
29 if (Seconds == 0) printf("Sorry, no FX3 devices found" );
30 printf("\nUse CR to EXIT\n" );
31 // The DOS box typically exits so fast that the developer doesn't see any messages
32 // Hold the box open until the user enters a character, any character
33 while (!_kbhit()) { }
34 return 0;
35 }

Copy the contents of the Visual Studio Examples directory
into your Visual Studio working area. Note that I have included all
the Cypress .h and .lib files in this structure so that the projects can
easily reference and find them. Build and run this Poll4FX3 project.
If you have your SuperSpeed Explorer board attached then this
program will find it. The program will report three devices if you also
have the serial debugger attached.

Figure 7.4 shows the few lines added to check that the
discovered device is running bulkloop firmware and then uses the
USBDevice->BulkOutEndPt->XferData method to download some
test data to the board. We have discovered, opened, written to and
closed the FX3 device in less than 10 lines of code. I said that this
wouldn’t be difficult!

Figure 7.4 Identify and write to a BulkLoop device


unsigned char TestData[] = "Hello World";

int _tmain(int argc, _TCHAR* argv[])
{
    CCyUSBDevice *USBDevice;
    int Seconds, i;
    LONG BytesWritten = 0;

    printf("\nPoll for FX3 device and Write V0.3\n");

    for (Seconds = 30; Seconds > 0; Seconds--)
    {
        USBDevice = new CCyUSBDevice(NULL, CYUSBDRV_GUID, true);
        for (i = 0; i < USBDevice->DeviceCount(); i++)
        {
            USBDevice->Open(i);
            // A BulkLoop device will have a VID = 0x04B4 and a PID = 0x00F0
            if ((USBDevice->VendorID == 0x04B4) && (USBDevice->ProductID == 0x00F0))
            {
                BytesWritten = sizeof(TestData);
                USBDevice->BulkOutEndPt->XferData(TestData, BytesWritten);
                printf("\nSent %d bytes to FX3", BytesWritten);
            }
            USBDevice->Close();
        }
        if (BytesWritten) break;
        delete USBDevice;
        printf("%d \r", Seconds);
        Sleep(1000);
    }
    if (Seconds == 0) printf("Sorry, no FX3 devices found");
    printf("\nUse CR to EXIT\n");
    // The DOS box typically exits so fast that the developer doesn't see any messages
    // Hold the box open until the user enters a character, any character
    while (!_kbhit()) { }
    return 0;
}

Figure 7.5 shows the complete listing for the file download
program - it fits on fewer than two pages! And to make it even simpler
to use, I start off by assuming that the SuperSpeed Explorer board
has just been reset so is in BootLoader mode. I then look for a
CCyFX3Device, which is an extension of the CCyUSBDevice, and
download the BulkLoop firmware onto it and then download the data
file.

Figure 7.5 Program to download a file to the Explorer board


// SendFile.cpp : This program looks for a "BootLoader" FX3 then loads
// USBBulkLoopAuto.img onto it, then downloads the requested data file
//

#include "stdafx.h"

unsigned char FileBuffer[128 * 1024];

int _tmain(int argc, _TCHAR* argv[])
{
    CCyUSBDevice *USBDevice;
    CCyFX3Device *FX3Device;
    HANDLE FileHandle;
    int i, Success, Seconds;
    int BulkLoopDevice = -1;
    DWORD FileSize;
    LONG BytesWritten;
    bool Continue = false;

    printf("\nSendFile V0.3\n");

    // Get the name of the file that needs to be downloaded
    if (argc != 2) printf("\nUsage: SendFile <filename>");
    else {
        FileHandle = CreateFile(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL,
                                OPEN_EXISTING, 0, NULL);
        if (FileHandle == INVALID_HANDLE_VALUE) printf("\nCould not open %s", argv[1]);
        else {
            Success = ReadFile(FileHandle, FileBuffer, sizeof(FileBuffer), &FileSize, 0);
            // Close the file on every path, not only after a successful read
            CloseHandle(FileHandle);
            if (!Success) printf("\nCould not read from %s", argv[1]);
            else {
                if (FileSize == sizeof(FileBuffer)) printf("\nInternal Buffer too small");
                else {
                    printf("\n%d bytes read from %s\n", FileSize, argv[1]);
                    Continue = true;
                }
            }
        }
    }
    if (Continue) {
        // Look for a BootLoader device
        FX3Device = new CCyFX3Device();
        FX3_FWDWNLOAD_ERROR_CODE Status = FAILED;
        for (Seconds = 30; Seconds > 0; Seconds--)
        {
            printf("Waiting for a BootLoader %d \r", Seconds);
            if (FX3Device->Open(0))
            {
                if (FX3Device->IsBootLoaderRunning())
                {
                    printf("Waiting for a BootLoader found, downloading");
                    Status = FX3Device->DownloadFw("USBBulkLoopAuto.img", RAM);
                    break;
                }
            }
            Sleep(1000);
        }

        if (Status != SUCCESS) printf("\nFirmware download failed (%d)", Status);
        else
        {
            // Wait for the FX3 to come back as a Bulk Loop device (0x04B4, 0x00F0)
            USBDevice = new CCyUSBDevice(NULL, CYUSBDRV_GUID, true);
            BytesWritten = 0;
            for (Seconds = 0; Seconds < 10; Seconds++)
            {
                for (i = 0; i < USBDevice->DeviceCount(); i++)
                {
                    USBDevice->Open(i);
                    if ((USBDevice->VendorID == 0x04B4) && (USBDevice->ProductID == 0x00F0))
                    {
                        BytesWritten = (LONG)FileSize;
                        USBDevice->BulkOutEndPt->XferData(FileBuffer, BytesWritten);
                        printf("\nSent %d bytes to FX3", BytesWritten);
                    }
                    USBDevice->Close();
                }
                if (BytesWritten) break;
                Sleep(1000);
            }
        }
    }

    printf("\nUse CR to EXIT\n");
    // The DOS box typically exits so fast that the developer doesn't see any messages
    // Hold the box open until the user enters a character, any character
    while (!_kbhit()) { }

    return 0;
}

Most of the heavy lifting for this SendFile application is within
the CyAPI library. The CyAPI library has many, many features and
these are documented in the 200+ page Cypress API Programmers
Reference Manual. This console program lets you drag and drop a
file onto the SendFile icon and it sends the file to the Explorer
board. If the file download does not start immediately then reset the
Explorer board to put it into BootLoader mode.

The ProgramCPLD program that we have been using in
previous chapters has the same format as this example. The
firmware looks like a bulkloop device and it uses the downloaded file
to program the CPLD via its JTAG port. This is described in depth in
the Reference Section.

CollectData
Our next example is a GUI-based CollectData program,
shown in Figure 7.6, and this will use overlapped USB transfers to
get maximum throughput from the SuperSpeed Explorer board.

Figure 7.6 CollectData reads data from the FX3 as fast as possible.

Let me first explain what the program does and then we will
study how it does it. CollectData uses the same technique as
SendFile to identify FX3-based devices; however, this time we are
looking for a Streamer interface rather than a bulkloop interface.
The program discovers any device that matches the CyUsb3.sys
GUID but it is designed to operate with a streamer interface.

We can choose to receive the data and discard it and this will
give us maximum throughput numbers; I included this as a debug
aid. I intend to save the data from the FX3 device into a file that we
can later examine. Writing the data to a disk file will not be able to
keep up with SuperSpeed data transfer rates and some data will be
dropped; we will study which data is dropped and why in a later
chapter. There is a time limit for file transfers since this
program can quickly fill up your hard drive if left running for a few
minutes – I suggest setting this to 30 seconds. When the Start
button is clicked the program gets ready to receive data then signals
the FX3 to send data. Data is then received and saved, as best it
can, to disk. The program calculates and displays the rate of data
saved (not data received) and this value will be a performance metric
for your hard disk system.

CollectData is going to use multiple buffers as shown in
Figure 7.7.

Figure 7.7 CollectData reads from FX3 and writes to disk.


The source code for this example is in the Visual Studio
directory and rather than repeat all 11 pages in this book I decided to
describe the code at a high level such that you can more easily
follow along in the source code.

When the Start button is pressed the PerformDataCollection
thread is started which collects data from the FX3. Conceptually the
PC will collect data from USB and start filling buffer #1. Once buffer
#1 is full the PC will start a DiskWrite thread that will write buffer #1
to the specified disk file. In the meantime, the PC has been filling
buffer #2 then buffer #3 etc. Once the PC has filled buffer #N it
moves back to filling buffer #1 and this goes on until the stop button
is pressed or the time limit expires.

The DiskWrite thread writes buffer #1 then #2 etc. and once it
gets to buffer #N it starts again at #1 and continues until there is no
more data to write. The buffers are declared within the
PerformDataCollection thread so a linked list of buffers is used so
that the DiskWrite thread can get access to the data.

We know that the DiskWrite thread will not be able to keep
up. This application has no protection against the
PerformDataCollection thread overwriting a buffer that the DiskWrite
thread has not saved to disk yet. This means, of course, that the
data file will become corrupted. For this application this is OK;
we will remedy this in the next example in Chapter 8. The purpose
of this example is to explain how to do overlapped reads from USB.
The data that the FX3 sends will be an incrementing counter
generated by the CPLD and we will study this in Chapter 9. We can
gain useful information from this example even though we know that
it generates corrupted data files.

I declare all of the metrics for the buffering at the beginning of
the Fx3ReceiveDataDlg.cpp source file. A SuperSpeed bulk
endpoint has a maximum packet size of 1024 bytes and with a
maximum burst length of 16 this gives me a packet size of 16 KB.
So with 256 packets per transfer each buffer in Figure 7.7 is
256 x 16 KB = 4 MB. I have MAX_QUEUE_SZ of 64 so there is a total
of 64 x 4 MB = 256 MB of buffering. This sounds a lot but the FX3
will fill this in less than a second! If you would like to change these
constants and look at the effects of different transaction sizes and
queue depths feel free to edit the file and rebuild the example. In
fact, so many people thought that this was a good idea that these
values are user selectable using the GUI in the next example.
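
The buffer arithmetic above can be checked at compile time. This is
a minimal sketch, not the declarations from Fx3ReceiveDataDlg.cpp:
only MAX_QUEUE_SZ is named in the text, and the other constant
names and the fillSeconds() helper are hypothetical.

```cpp
#include <cstddef>

// Only MAX_QUEUE_SZ is named in the text; the other names are illustrative.
constexpr std::size_t kMaxPacketBytes    = 1024;  // SuperSpeed bulk maximum packet size
constexpr std::size_t kMaxBurst          = 16;    // maximum burst length
constexpr std::size_t kPacketsPerXfer    = 256;   // 16 KB packets per transfer
constexpr std::size_t MAX_QUEUE_SZ       = 64;    // number of buffers in the queue

constexpr std::size_t kPacketBytes = kMaxPacketBytes * kMaxBurst;    // 16 KB
constexpr std::size_t kBufferBytes = kPacketBytes * kPacketsPerXfer; // 4 MB
constexpr std::size_t kTotalBytes  = kBufferBytes * MAX_QUEUE_SZ;    // 256 MB

// Seconds needed to fill every buffer at a sustained rate given in bytes/second.
constexpr double fillSeconds(double bytesPerSecond) {
    return static_cast<double>(kTotalBytes) / bytesPerSecond;
}
```

As an illustration, at an assumed sustained rate of 400 MB/s (a figure
chosen for this example, not a measurement),
fillSeconds(400.0 * 1024 * 1024) evaluates to 0.64, which is
consistent with the buffers filling in less than a second.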

Figure 7.8 shows the flowchart of the PerformDataCollection
thread which is divided into three sections: Ready, Steady and GO.

Figure 7.8 PerformDataCollection Thread Flow Chart


Once started the PerformDataCollection thread does some
checks on the data file it is about to generate then looks for an FX3
configured as a Streamer device. It expects this device to have a
bulk in endpoint that will be the source of its data. As seen in the
flowchart, any errors will take us to CleanUp code.

BeginDataXfer() is repeatedly called to set up overlapped
reads for the incoming data. Windows stages these reads in the
USB host controller driver.

We are now ready.

The thread sends a vendor command to the SuperSpeed
Explorer board which enables the CPLD to generate an incrementing
32-bit counter at 100 MHz using the firmware GPIF_Counter.IMG.
The DiskWrite thread is also started and it waits for the first buffer to
be filled.

We have just passed steady and we are into GO.

The thread waits at WaitForXfer() for Windows to inform it that
the first buffer has been filled and it then calls FinishDataXfer() to
collect the data. A pointer to this buffer is passed via a linked list to
the DiskWrite thread so that it can deal with saving the buffer to disk.

Assuming that the stop button has not been pressed nor the
time limit expired, the thread immediately resubmits the buffer with
another BeginDataXfer(). This keeps the queue of overlapping
transactions full so that all layers of the USB driver stack stay busy
with work. We continue around and around this loop until the stop
condition is true. Once stop is requested we send another vendor
command to the SuperSpeed Explorer board so that it can stop
generating data, then we wait for the DiskWrite thread to finish the
backlog of buffers and give control back to the user.

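
The Ready, Steady, GO loop can be modelled without any hardware.
The following is a minimal sketch in standard C++, not the code from
the CollectData project: std::async stands in for BeginDataXfer(),
future::get() stands in for the WaitForXfer() and FinishDataXfer()
pair, and fakeRead(), collect() and Result are hypothetical names
invented for this illustration. No CyAPI calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <future>
#include <vector>

using Buffer = std::vector<std::uint32_t>;

// Hypothetical stand-in for the device: returns an incrementing 32-bit counter.
Buffer fakeRead(std::uint32_t first, std::size_t words) {
    Buffer b(words);
    for (std::size_t i = 0; i < words; ++i) b[i] = first + static_cast<std::uint32_t>(i);
    return b;
}

struct Result { std::size_t transfers; bool inOrder; };

// Keep up to queueDepth reads outstanding until totalXfers have completed.
Result collect(std::size_t queueDepth, std::size_t wordsPerXfer, std::size_t totalXfers) {
    assert(queueDepth > 0);
    std::deque<std::future<Buffer>> pending;
    std::uint32_t nextStart = 0;
    std::size_t submitted = 0;

    auto submit = [&] {   // plays the role of BeginDataXfer()
        pending.push_back(std::async(std::launch::async, fakeRead, nextStart, wordsPerXfer));
        nextStart += static_cast<std::uint32_t>(wordsPerXfer);
        ++submitted;
    };

    // Ready: queue the initial set of reads before any data is consumed.
    while (submitted < queueDepth && submitted < totalXfers) submit();

    // GO: wait for the oldest read, consume it, then resubmit immediately.
    Result r{0, true};
    std::uint32_t expected = 0;
    while (!pending.empty()) {
        Buffer b = pending.front().get();   // WaitForXfer() + FinishDataXfer()
        pending.pop_front();
        for (std::uint32_t w : b)
            if (w != expected++) r.inOrder = false;
        ++r.transfers;
        if (submitted < totalXfers) submit();
    }
    return r;
}
```

Calling collect(4, 1024, 16) keeps four reads in flight and completes
sixteen transfers; because the oldest request is always serviced
first, the counter values arrive in order.
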
Cypress PC Utilities
Within the original Cypress SDK installation, three PC utilities
(BulkLoop, Streamer and USB Control Center) were installed. The
source code for these utilities is also included for your review; in fact,
both C++ and C# implementations of BulkLoop and Streamer are
available. USB Control Center makes heavy use of forms so is only
supplied in C#. BulkLoop uses synchronous transfers, Streamer
uses asynchronous, overlapped transfers and the USB Control
Center can talk to all devices supported by CyUsb3.sys.

BulkLoop Utility
A BulkLoop device is identified by a VID_PID combination of
0x04B4_0x00F0 and its base structure and human interface are
shown in Figure 7.9.

Figure 7.9 Structure and GUI of BulkLoop

There are several examples of the FX3 firmware that use
different DMA options and these were discussed in the previous
chapter. From the PC’s perspective, all of these equate to a
Bulk_OUT endpoint into which the BulkLoop application writes data
and a Bulk_IN endpoint from which the application reads data. More
or less data can be received than sent depending upon the loaded
FX3 firmware. Both the C++ and C# sources are well documented
but too long to include as a Figure here. Use Visual Studio 2008 or
2010 to open the projects and investigate the code.

Streamer Utility
A Streamer device is identified by a VID_PID combination of
0x04B4_0x00F1 and its base structure and human interface are
shown in Figure 7.10.

Figure 7.10 Structure and GUI of Streamer

There are three variants of Streamer firmware; one uses bulk
endpoints, one uses isochronous endpoints and the third uses
interrupt endpoints. Any data that the Streamer application writes to
the OUT endpoint is consumed and fake data is constantly created
at the IN endpoint for the application to read. Data is transferred
using overlapping transfers and the Streamer application may be
used to measure the performance of your USB connection. The
benchmark example in the next chapter measures the system
performance of your PC including the disk subsystem.

You can vary the packet size in the source code if desired; it
is set to the maximum supported according to enumerated bus
speed. For SuperSpeed this will be (PacketMax * BurstSize) =
(1 KB * 16) = 16 KB. The GUI enables you to combine multiple
packets into a buffer for transfer and then create a circular queue of
buffers to keep the Windows USB 3.0 stack busy. Data for writes is
created in the memory buffers and data read from the FX3 endpoint
is discarded; no attempt is made to save to a disk file as in the
CollectData example. Both the C++ and C# sources are well
documented but too long to include as a Figure here. Use Visual
Studio 2008 or 2010 to open the projects and investigate the code.

You should also review Cypress Note AN86947, Optimizing
USB 3.0 throughput with EZ-USB FX3. It runs through a series of
experiments using different transfer sizes and buffer counts for all
three USB packet types. The big takeaway from this app note is that
not all USB 3.0 host controllers are created equal; their
performance varies all over the map.

USB Control Center


The USB Control Center is an extensive forms application that
includes multiple windows. It can talk to all devices supported by
CyUsb3.sys which includes BulkLoop, Streamer and another
VID_PID combination of 0x04B4_0x00F3 which identifies a
BootLoader device. The default power-on firmware in an FX3 is a
BootLoader and the USB Control Center enables you to download
new firmware into RAM for the FX3 to execute. It also displays the
full enumeration context of the device for study; an example is shown
in Figure 7.11.

Figure 7.11 USB Control Center can talk to all CyUsb3.sys devices

The USB Control Center is an alternate method to copy data
to and from BulkLoop and Streamer devices. The source code is
well commented and describes both the synchronous and
asynchronous data transfer methods.
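
The three VID_PID combinations mentioned in this chapter can be
collected into one small lookup. This is a minimal sketch; the
Fx3Firmware enumeration and the classify() function are hypothetical
names for illustration and are not part of CyAPI or the Cypress
utilities. Only the numeric values come from the text.

```cpp
#include <cstdint>

// Hypothetical labels for the firmware personalities named in this chapter.
enum class Fx3Firmware { BulkLoop, Streamer, BootLoader, OtherCypress, NotCypress };

constexpr std::uint16_t kCypressVid = 0x04B4;

// Map a VID/PID pair onto one of the personalities described in the text.
constexpr Fx3Firmware classify(std::uint16_t vid, std::uint16_t pid) {
    return vid != kCypressVid ? Fx3Firmware::NotCypress
         : pid == 0x00F0      ? Fx3Firmware::BulkLoop
         : pid == 0x00F1      ? Fx3Firmware::Streamer
         : pid == 0x00F3      ? Fx3Firmware::BootLoader
                              : Fx3Firmware::OtherCypress;
}
```

A discovery loop like the one in Figure 7.4 could pass
USBDevice->VendorID and USBDevice->ProductID to such a function
instead of comparing against literal constants in several places.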

Commercial USB Port Tester


When visiting a customer who is reporting performance
problems with an FX3 system the first thing I do is test their PC with
a great tool that I got from PassMark. A photograph of their USB
3.0 Loopback plug is shown in Figure 7.12 (it is in focus, the plastic
box is frosted) and a block diagram provided by PassMark is shown
in Figure 7.13. The hardware is basically an FX3 configured to run
the LoopBack test firmware with analog circuitry added to monitor
USB voltage, some LEDs and an LCD display, packaged in a robust
plastic case so that you can take it wherever you need to go.

Figure 7.12 USB 3.0 LoopBack Plug from PassMark

When attached to a USB socket on a PC the LoopBack Plug
identifies the speed of the connection: note that some laptops use
blue inserts on their USB sockets even for USB 2.0 speed ports –
VERY naughty; when you find one you should send an email to
[email protected] where we are keeping a list and also advising the
manufacturer of their error.

The Loopback plug also displays the current voltage on the
USB socket. As we know, USB 3.0 is quite sensitive to low USB
voltage. Several PCs that I have tested had values quite close to
the USB minimum (4.423V); this causes problems that are typically
intermittent, and these are the worst to track down. Remember
Step 1: check the volts and amps before looking at protocol!

Figure 7.13 Block Diagram of the USB 3.0 LoopBack Plug

The USB 3.0 LoopBack plug comes with test software similar
to that described in this chapter but with added features such as
logging as shown in Figure 7.14. Multiple units can be attached to a
PC so that all the USB ports can be tested at the same time.
PassMark has stress test software that supports development and
burn-in testing.

Figure 7.14 LoopBack Plug includes testing software


Chapter Summary

This chapter has demonstrated fast data transfer with a
SendFile example and a BulkLoop example. It demonstrated very
fast data transfer with a CollectData example and a Streamer
example. The comprehensive USB Control Center application was
also presented. Source code for all of these PC utilities is available
for your review.

A commercial tool that the author has found invaluable for
providing a base level of confidence in system operation was also
presented.

Fast data transfer used BulkLoop firmware and blocking
transfers while very fast data transfer used Streamer firmware and
overlapped transfers. The corrupted data files saved by CollectData
will be analyzed in Chapter 9 within the context of FX3 operation.
Reliable, error-free, fast data transfers are the subject of the next
chapter which describes an FX3 Benchmark application program.
Chapter 8 FX3 Throughput Benchmark

As we were writing the CollectData example program it
became clear that an enhanced version which did not discard data
when there was insufficient time to write it to disk would be very
useful. When testing CollectData on various versions of Windows
and on different PCs with different host controllers, it became evident
that the actual performance of an FX3 peripheral device was more
dependent on the PC than on the FX3 firmware. This chapter
describes a Benchmark program (a screenshot is shown in Figure
8.1) that will characterize your Windows platform; I first describe how
to use this program then go into detail on how it works.

Figure 8.1 The Benchmark program can characterize your PC platform

Benchmark is a forms-based Windows program which is
used to determine maximum throughput performance for any
combination of reading and writing from USB 3.0, Memory, or Disk.
Using the SuperSpeed Explorer board and appropriate compiled
FX3 application images, this program will drive your system to
determine your peak and sustained data processing rates.

This throughput analyzer builds on the functionality of
CollectData adding bi-directional data transfers and reliable buffering
between the PC and the FX3 with selectable data sources and
destinations. It uses a multi-threaded producer/consumer
programming model for driving the hardware. A large thread-safe
circular buffer is allocated for one producer and one consumer, each
of which then operates independently in its own thread. The
producer puts data into the buffer as fast as possible while the
consumer simultaneously pulls data out. If the buffer becomes full,
the producer stalls waiting for buffer space. If the buffer empties, the
consumer blocks awaiting more data from the producer. This large
buffer handles the fluctuations in data production/consumption rates
or mismatches in producer/consumer relative speeds.

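
The stall-when-full and block-when-empty behaviour described above
can be sketched in a few lines of standard C++. This is an
illustration under stated assumptions, not the Benchmark source: the
BoundedRing class and the runPair() helper are hypothetical names,
and the sketch moves one element at a time rather than whole blocks.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fixed-capacity ring shared by one producer and one consumer.
template <typename T>
class BoundedRing {
public:
    explicit BoundedRing(std::size_t capacity) : data_(capacity) { assert(capacity > 0); }

    void push(const T& value) {                    // producer side
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [this] { return count_ < data_.size(); });  // stall while full
        data_[(head_ + count_) % data_.size()] = value;
        ++count_;
        notEmpty_.notify_one();
    }

    T pop() {                                      // consumer side
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [this] { return count_ > 0; });            // block while empty
        T value = data_[head_];
        head_ = (head_ + 1) % data_.size();
        --count_;
        notFull_.notify_one();
        return value;
    }

private:
    std::vector<T> data_;
    std::size_t head_ = 0;
    std::size_t count_ = 0;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};

// Run one producer thread and one consumer thread; true if all data arrived in order.
bool runPair(std::size_t capacity, std::size_t total) {
    BoundedRing<std::uint8_t> ring(capacity);
    bool ok = true;
    std::thread producer([&] {
        for (std::size_t i = 0; i < total; ++i) ring.push(static_cast<std::uint8_t>(i));
    });
    std::thread consumer([&] {
        for (std::size_t i = 0; i < total; ++i)
            if (ring.pop() != static_cast<std::uint8_t>(i)) ok = false;
    });
    producer.join();
    consumer.join();
    return ok;
}
```

Even with a capacity of one element, runPair() delivers every byte in
order; a larger capacity only reduces how often either thread has to
wait, which is the shock-absorber role of the large buffer.
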
From the perspective of the program, producers put data into
the circular buffer and consumers remove data. This is similar to the
naming convention adopted for USB, where IN and OUT endpoints
always refer to their direction relative to the host computer. In the
FX3 benchmark, a USB producer is a thread attached to an IN
endpoint, taking data from the Explorer board. This requires an
active IN endpoint on the Explorer board pushing data down the
USB hose. A USB consumer takes data from the buffer and places
it into an OUT endpoint, requiring an active OUT endpoint listener to
act as the sink. Likewise, file producers open existing disk files and
write data into a buffer, while file consumers take data from the buffer
and stream it to disk. Memory producers are infinite sources of data,
perpetually writing to the circular buffer, while memory consumers
are infinite sinks, reading data from the queue and immediately
discarding the contents.

USB 3.0 specifies several types of endpoints, including
Control, Interrupt, Isochronous, and Bulk. This program only
supports Bulk endpoints, as the goal is to measure the maximum
possible throughput for your system, and the Bulk endpoints are
overall the fastest given an otherwise quiescent bus.

Let’s familiarize ourselves with the interface. Figure 8.2
highlights the FX3 device detected by the program.

Figure 8.2 Benchmark will identify a connected FX3 device

In the upper left of the GUI is the USB information section.
The SuperSpeed Explorer board needs to be pre-loaded with the
Cypress Streamer firmware (USBBulkSourceSink.img). The
program expects the Vendor ID 0x04B4 and Product ID 0x00F1 which
identifies a Streamer device. The program looks for the default IN
endpoint 0x81 and OUT endpoint 0x01 as defined by the Cypress
implementation. If the USB device cannot be located or either
endpoint is missing, the information area will display the missing
functionality, which makes it unavailable for testing. No IN endpoint,
and you cannot create a USB producer. No OUT endpoint, no USB
consumer. The program listens for device attach/detach events from
the driver, so you can start the program and
disconnect/program/reconnect without restarting.

You choose the data source and sink as shown in Figure 8.3.

Figure 8.3 Selecting the data source and sink

On the left you choose the producer, on the right the
consumer. A single producer/consumer pair is created and tested
based on your choices, with one exception – if you choose USB 3.0
as both the source and sink, two producer/consumer pairs are
created. The first is a memory producer/USB 3.0 consumer pair for
sending data to the FX3, the other is a USB 3.0 producer/memory
consumer pair for reading data back from the FX3.

Figure 8.4 highlights the parameter fields. You can change
some of the parameters of the benchmark which may impact system
performance. The default values shown have been proven to be
close to optimal for most systems, but it is instructive to understand
what they are and how they impact program operation.

Figure 8.3 Selecting the data source and sink


USB packets per transfer determines how many USB
packets will be bundled into a single transaction. USB supports a
burst mode where several packets are transmitted without
intervention by the controller. This parameter controls how many
such packets are placed in the USB driver output buffer (for a USB
consumer) or pulled from the driver (for a producer) for every USB
transaction. Valid values are 1-64 in powers of 2, but the actual
number used by the program will be limited to remain under the
maximum USB transaction size.
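
The two rules in this paragraph (a power of two between 1 and 64,
then a limit set by the maximum transaction size) can be sketched as
follows. The text does not state the maximum USB transaction size,
so maxTransferBytes is a caller-supplied assumption, and both
function names are hypothetical rather than taken from the
Benchmark source. Halving is one plausible way to apply the limit
while staying on a power of two.

```cpp
#include <cstdint>

// True for 1, 2, 4, 8, 16, 32 and 64.
constexpr bool isValidPacketCount(std::uint32_t n) {
    return n >= 1 && n <= 64 && (n & (n - 1)) == 0;
}

// Halve the requested count until the transfer fits within maxTransferBytes
// (an assumed limit supplied by the caller); never goes below one packet.
constexpr std::uint32_t limitPacketCount(std::uint32_t requested,
                                         std::uint64_t packetBytes,
                                         std::uint64_t maxTransferBytes) {
    return (requested <= 1 || requested * packetBytes <= maxTransferBytes)
               ? requested
               : limitPacketCount(requested / 2, packetBytes, maxTransferBytes);
}
```

For example, with 16 KB packets and an assumed 512 KB limit, a
request for 64 packets per transfer would be reduced to 32.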

USB Interleaved transfers: The program uses asynchronous I/O,
which can in some circumstances provide a substantial throughput
improvement. If this value is set to 1, then the benchmark program
fills a buffer, sends the data, and waits for confirmation from the
driver that the data has been received or transmitted before starting
the next transaction. This is how SendFile operates. When this
value is > 1, the system queues each read or write to the USB driver
to finish asynchronously. This allows the program to execute up to
“USB interleaved transfers” parallel I/O operations. The reads and
writes do not actually happen in parallel, as this would intermix the
data between reads and writes and create garbled data. What it
does allow is maximum use of the I/O channel, which is never left
idle since as soon as one operation completes, another is
immediately ready to execute. Historic data shows that values
around 8 for parallel I/O operations work well. Too few interleaved
transfers leave the USB channel underutilized, whereas too many
add control overhead without gaining any channel capacity. To see
how much of an improvement asynchronous I/O can bring to your
throughput numbers, try running the benchmark with a value of ‘1’
and compare to higher values.

Both the USB packets per transfer and interleaved transfer
settings are USB specific – they are not used for memory or disk
producers/consumers.

Buffer read/write block size: This value determines
how many bytes of data are committed to the circular buffer by
producers and pulled from the buffer by consumers in a single
transfer. This number is important for two reasons. First, each
producer and consumer operates in its own thread of execution. If
the read and write block sizes are too small, the operating system
will have to switch contexts often to service ready producers and
consumers, and the time used in actually pushing data gets dwarfed
by the overhead of switching threads and setting up the transfer.
Second, hard drives (particularly spinning versions) are much more
efficient at file reading and writing when the size of the operation is
well matched to the physical media and/or operating system
buffering scheme. A sequence of file writes when the size of the
data is not well matched to the media buffering will run at a fraction
of the maximum capacity of the drive. This is true of SSDs as
well. The default for this program is a good minimum for most
Windows systems. Values from 64 KB to 8 MB in powers of two are
available.

Choose source file/Choose destination file are used
to point to files on the system for receiving or supplying data to
consumers or producers, respectively. Click on the “Choose file…”
button for each to be presented with the standard Windows Open
File/Save File dialogs. Choosing a destination file requires you to
have file open and write authorization to the file’s location. Read
access is all that is required for the source file.

If you attempt to run the program with a file producer or
consumer and you have not previously made selections for the file
type required, you will receive an error and will not be able to
continue until a choice is made. Note that files generated from USB
or memory can grow very large very fast, and the file is not
automatically deleted when the program exits. Likewise, input files
will be read quite quickly. If you want an actual measure of your
system’s throughput capabilities, choose a large file for a File
producer, several gigabytes in length, similar to a DVD image. In
order to get sufficient run time, the program will read the entire
contents of the input file and, when the end is reached, rewind and
start from the front of the file all over again, indefinitely. Small files
will suffer a throughput degradation commensurate with the extra
work required to continuously rewind and re-read the file.
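
The rewind-and-re-read behaviour of a file producer can be sketched
with a standard stream. This is a hedged illustration, not the
Benchmark implementation (which may well use different file APIs);
readLooping() is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>

// Fill dst with `want` bytes from `in`, rewinding to the start of the file
// each time the end is reached. Returns false if the file yields no data.
bool readLooping(std::ifstream& in, char* dst, std::size_t want) {
    assert(dst != nullptr);
    std::size_t got = 0;
    bool rewound = false;
    while (got < want) {
        in.read(dst + got, static_cast<std::streamsize>(want - got));
        const std::size_t n = static_cast<std::size_t>(in.gcount());
        if (n == 0 && rewound) return false;   // empty or unreadable file
        got += n;
        if (got < want) {                      // short read: end of file reached
            in.clear();                        // clear the eof/fail flags
            in.seekg(0, std::ios::beg);        // start again from the front
            rewound = true;
        }
    }
    return true;
}
```

With a five-byte file containing "abcde", a request for twelve bytes
needs two rewinds, which illustrates why a small input file spends
proportionally more of its time on the extra seek and read calls.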

The Go button starts testing with the current input/output pair
selected. Parameters are checked; for instance, if one end of the
transfer is a file, a file must have been specified with the source
file/destination file choosers. After a press, the benchmark runs
continuously until the Stop button is pressed. There is no
fundamental limit on how long the test will run with the exception that
writing to a disk file can fill up your hard drive very quickly. I did not
include a time limit as I did with CollectData; this will be in the next
revision!

Once every second, the instantaneous throughput is reported
to the GUI for display and this updates the gauge and the strip chart
as shown in Figure 8.4. All values are in MB.

Figure 8.4 Clicking GO starts the program and reporting

The gauge on the left shows the last instantaneous throughput
value. This will vary from second to second, sometimes significantly
depending on the source or sink (notice in the above demonstration
how file system buffering for the file consumer over-inflates the
throughput early in the testing process). The chart on the right is a
strip chart monitoring the last 30 seconds of throughput reporting.
The Total MB transferred text box lists the total amount of data
successfully transferred from producer/buffer/consumer for the
duration of the test. The throughput displays remain on the screen
after a test until a new test is executed.

The Characterize System button runs an automated test with
the following pairs of producer/consumer: USB 3.0 IN -> Memory,
Memory->USB 3.0 Out, USB 3.0 IN -> File, File -> USB 3.0 Out, and
USB 3.0 -> USB 3.0 bidirectional. Obviously, you must supply an
input and output file before executing this test. Each
producer/consumer pair is created and run for 30 seconds in order to
settle out any transients (disk files are notorious for this). The
display updates during the test, allowing you to monitor the
operation. After 30 seconds, the last five seconds of transfer
throughput for each pair is averaged. During a system
characterization, the user inputs are locked out, so plan on about 2.5
minutes to run to completion.
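
The timing and averaging described above are easy to verify: five
producer/consumer pairs at 30 seconds each is 150 seconds, or 2.5
minutes. The sketch below shows the averaging step only, assuming
one throughput sample per second; averageOfLast() and the constant
names are hypothetical and not taken from the Benchmark source.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr int kPairsTested    = 5;    // the five pairs listed in the text
constexpr int kSecondsPerPair = 30;
constexpr int kTotalSeconds   = kPairsTested * kSecondsPerPair;  // 150 s = 2.5 minutes

// Average the final n samples (or all of them if fewer than n were recorded).
double averageOfLast(const std::vector<double>& samples, std::size_t n) {
    assert(n > 0);
    if (samples.empty()) return 0.0;
    const std::size_t used = n < samples.size() ? n : samples.size();
    const double sum = std::accumulate(
        samples.end() - static_cast<std::ptrdiff_t>(used), samples.end(), 0.0);
    return sum / static_cast<double>(used);
}
```

Averaging only the last five samples ignores early transients, such
as the inflated figures caused by file system buffering at the start
of a run.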

Once finished, a report is presented looking something like
Figure 8.5 (values are representative only). This gives you a
concise one-stop location for determining approximate throughput
capabilities of your system.

Figure 8.5 Characterize System reports your PC’s capabilities

How Benchmark works
Benchmark is a mixed mode C#/C++/CLI application intended
to provide a guideline for some of the key performance issues one
would want to test in determining the fitness of a given compute
platform for use with FX3 high data throughput needs. It is not a full-
featured production quality application, and as demonstration code,
suffers from the usual lack of flexibility, and reduced error handling,
with the intent of making the function clearer. The source code is
provided for you to study and/or modify.

The application is divided into two main components: the user
interface, which is a standard Windows Forms C# application, and
the Engine, which is a C++/CLI driver for all of the performance
critical tests of system speed. This overall architecture was chosen
because while C# is the natural choice for doing Windows GUI
programming, it does not in general produce code that can match
C++ in performance. Cypress explicitly warns us that throughput
using their .NET library will be reduced compared to using the native
C++ version. This is one case where you really do want to eke every
last microsecond of performance out of the application, and C++ is
still the speed champ, so as we have the luxury of living in a world
where we can mix both to satisfy our program requirements, we can
continue in the fine tradition of using the right tool for a given job.

The Engine, which sets up and manages program execution
but is not directly involved in the speed tests, is exposed as a C++
managed class. This allows the C# GUI to consume its services
directly and cleanly, while simultaneously providing a host and
execution context for the native C++ code. It’s the best of both
worlds. The highest level is shown in Figure 8.6.

Figure 8.6 High level architecture of Benchmark program

Each level can directly “talk” to the level immediately adjacent,
and all three levels interface with the Cypress supplied libraries. The
GUI connects to the USB connect and disconnect notifications from
the driver to determine which device is attached to the system, to
display the attached device name, and to verify the existence of the
appropriately numbered endpoints. The engine instantiates
endpoints for bulk in/out transfers that are passed on to the Native
C++ components for execution. The Native C++ layer drives the
entire test at speed, offering query points to the Engine to
periodically determine throughput.

The Producer/Consumer model


At the heart of the application is an implementation of a
standard multi-threaded producer/consumer model for sourcing and
sinking data. In a producer/consumer system, one thread of
execution is responsible for generating data, which it writes into a
shared container. In parallel, on another thread, a consumer has the
responsibility of reading data from that shared container and doing
something useful with it. That relationship can be modeled using
standard UML notation as shown in Figure 8.7. This portion of the
code is written in unmanaged native C++ for best performance.

Sidenote: To refresh on UML class notation, each box has
three sections. The top section is the class name with any
decorations. The middle section holds the attributes – the data
members like POD (plain ol’ data) or other internal classes that make
up the class state. The third section contains the operations – the
methods and properties that make up the functional interface to the
class. In all cases, attributes and operations are prefixed with either
a “+”, “-”, or “#”, indicating public, private, and protected access to
those attributes/operations respectively. Italics when used on an
operation or class name imply that this entity is abstract; that is, it
cannot be instantiated directly in the code but must be subclassed to
provide an implementation that matches the indicated signature.
There are many kinds of relationships between classes, but the most
common you will see in this document is the named diamond-ended
line between two classes. This indicates aggregation, which is just
another way of stating that these classes are attributes as well but
the author wished to make the relationship more visual. The
diamond end of the line indicates the owner of the relationship.

Figure 8.7 Benchmark uses a Producer/Consumer model

Central to this relationship is the CircularBuffer<T>. This is a
parameterized (template) class that can hold a defined number of
elements of a specific type T. In this application, T is always a
byte since we are dealing with byte oriented interfaces like USB,
memory, and disk files. CircularBuffer is a fully reentrant class
that supports multi-threaded insertion and extraction of its contents
by a single producer and a single consumer. All of the internal data
is protected by appropriate mutual exclusion locks and condition
variables, so a producer will never accidentally write over data a
consumer is using, and a consumer can never pull data out of the
buffer before the producer has fully finished placing it there. The
circular buffer also acts as a shock absorber for variations in data
rates and thread context switching. The benchmark program uses
circular buffers of hundreds of megabytes to keep buffer thrashing
down to a minimum and to move the data at USB 3.0 rates, which can be
substantial.

Blocking Read and Write operations are exposed for the producers and
consumers. When a write is requested, if sufficient space exists, the
data is placed in the circular buffer and the current write location
is updated for the next attempted write. The success or failure of
the write is returned to the caller. If there is not sufficient space
in the buffer, the write blocks until space becomes available or the
requested timeout expires. The read operation is symmetric: it blocks
until the number of elements requested becomes available, copies
those elements out of the buffer to the provided storage, and updates
the read location for subsequent calls.
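
The blocking behavior described above can be sketched in a few dozen
lines of standard C++. This is a minimal illustration and not the
benchmark's CircularBuffer source: the class name BlockingRing, its
member names and the millisecond timeout parameter are assumptions
made for the example, and it copies one element at a time where a
tuned implementation would copy in blocks.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical single-producer/single-consumer ring (illustration only).
template <typename T>
class BlockingRing {
public:
    explicit BlockingRing(std::size_t capacity) : data_(capacity) {}

    // Copies n elements in. Returns false if space did not appear in time.
    bool Write(const T* src, std::size_t n, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_);
        if (n > data_.size() ||
            !notFull_.wait_for(lock, timeout,
                               [&] { return data_.size() - used_ >= n; }))
            return false;
        for (std::size_t i = 0; i < n; ++i) {
            data_[writePos_] = src[i];
            writePos_ = (writePos_ + 1) % data_.size();
        }
        used_ += n;
        assert(used_ <= data_.size());
        notEmpty_.notify_one();          // wake a reader waiting for data
        return true;
    }

    // Copies n elements out. Returns false if data did not arrive in time.
    bool Read(T* dst, std::size_t n, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_);
        if (n > data_.size() ||
            !notEmpty_.wait_for(lock, timeout, [&] { return used_ >= n; }))
            return false;
        for (std::size_t i = 0; i < n; ++i) {
            dst[i] = data_[readPos_];
            readPos_ = (readPos_ + 1) % data_.size();
        }
        used_ -= n;
        notFull_.notify_one();           // wake a writer waiting for space
        return true;
    }

private:
    std::vector<T> data_;
    std::size_t readPos_ = 0, writePos_ = 0, used_ = 0;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};
```

Both calls hold the same lock, so the read and write positions are
never updated while the other side is part way through a copy.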

In the concrete implementations of producers and consumers used in
this application, the producers and consumers all run as fast as
possible. Since the system is almost always asymmetric in that one
end of the producer/consumer pair is faster than the other, one of
the ends of the pipeline will be more frequently blocked awaiting
either space or data in the buffer while the other is furiously
processing its end.

The AbstractProducer and AbstractConsumer classes dictate the
interface for all concrete producers and consumers. Each has an
internal counter of how many units it has produced or consumed.
These counts are used by the engine to determine throughput. Each
contains an abort flag, which is a pointer to a Boolean variable
placed in each class at construction so that it knows when to stop.
Each producer/consumer pair contains an internal variable telling it
how many elements to read or write per operation. This is important
because some concrete producers and consumers are very throughput
sensitive as to how much data is gathered into a single operation.

Finally, each concrete producer has to define a Produce() operation
that generates data, and each consumer must define a Consume()
operation. When running, the engine deals only with these abstract
interfaces, so it is blind to which actual type of producer and
consumer is behind the pair.
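
As a rough picture of what such an abstract interface can look like,
here is a minimal sketch. It is not the benchmark's source: the names
AbstractProducerSketch, NullProducer, Step() and Run() are
hypothetical, and std::atomic<bool> is assumed for the abort flag
where the text only says "pointer to a Boolean variable". A consumer
would mirror it with a Consume() operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical base class modeled on the description above.
class AbstractProducerSketch {
public:
    AbstractProducerSketch(const std::atomic<bool>* abortFlag,
                           std::size_t bytesPerWrite)
        : abort_(abortFlag), bytesPerWrite_(bytesPerWrite) {
        assert(abortFlag != nullptr);
    }
    virtual ~AbstractProducerSketch() = default;

    // One pass of the worker loop. Returns false once the abort flag is set.
    bool Step() {
        if (abort_->load()) return false;
        produced_ += Produce(bytesPerWrite_);
        return true;
    }
    // Thread body: keep producing until told to stop.
    void Run() { while (Step()) {} }

    // Running total that an engine could poll to work out throughput.
    std::uint64_t Produced() const { return produced_.load(); }

protected:
    // Concrete producers generate up to 'bytes' and return the amount made.
    virtual std::size_t Produce(std::size_t bytes) = 0;

private:
    const std::atomic<bool>* abort_;
    std::size_t bytesPerWrite_;
    std::atomic<std::uint64_t> produced_{0};
};

// Trivial concrete producer that only pretends to fill a buffer.
class NullProducer : public AbstractProducerSketch {
public:
    using AbstractProducerSketch::AbstractProducerSketch;
protected:
    std::size_t Produce(std::size_t bytes) override { return bytes; }
};
```

Code that holds only an AbstractProducerSketch pointer can call Run()
and Produced() without knowing which concrete type is behind it.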

The Low Level unmanaged C++ level


The low level code has two enumeration types as shown in
Figure 8.8.

Figure 8.8 Producer/Consumer Enumerations

Producers
Whereas AbstractProducer defines the interface for
producers, it doesn’t define any actual capability. That is left to the
three concrete producers, shown in Figure 8.9. Throughout the code
and this document, a Producer or source always puts data into a
buffer (writes) and a Consumer or sink always removes data
(reads) from a buffer.

Figure 8.9 Concrete Producers

MemoryProducer is the simplest of the three and the
fastest. Upon construction, it simply fills a buffer of the appropriate
size (determined by the BytesPerWrite variable) and constantly stuffs
that memory block into its circular buffer, as fast as possible. “As
fast as possible” depends on many things, including the specifics of
your system, the number of bytes written per write, and the size of
the buffer. The latter is a consequence of the overhead associated
with putting data into the synchronized circular buffer – locking the
buffer, writing one byte, updating the state, and unlocking the buffer
will generate much worse performance than writing a much larger
chunk. Write sizes of megabytes per operation perform significantly
better.

FileProducer is a producer that reads from a disk file and places
that data into the buffer. Disk operations are the most sensitive to
the BytesPerWrite parameter. This is because reading from a disk is
most efficient when those reads are aligned on the natural geometry
of the disk itself. The FileProducer object uses a couple of
optimizations when issuing the Windows CreateFile call to tell the OS
to not allow file sharing and to optimize for sequential reads. Also,
since you might not have disk files large enough to test your system
for the 30 seconds or so required to get good statistics, the file
producer automatically wraps back around to the beginning of the file
when it reaches the end and starts over. Obviously, this will impact
throughput, so the larger a file you have for testing the file
producer, the better your performance will be.

USBProducer is the most complicated of the concrete producers. It
holds information relevant to the USB connection, all of which is
directly or indirectly exposed through the GUI to allow performance
tweaking. It also has by far the most complex Produce() function,
having to manipulate multiple asynchronous I/O operations
simultaneously in order to achieve the absolute best overall
performance. Notice in the diagram the overlaps association, which
manages a list of OverlappedIO objects. The OverlappedIO class is an
important piece of the USB producer and consumer. It will be
discussed in a section by itself, below.

Consumers
Consumers, as shown in Figure 8.10, are completely
symmetric with producers in function. Consumers pull data out of
circular buffers and do something with the data. MemoryConsumers
simply drop the data on the floor. They typically have the capacity to
stay well ahead of most producers. FileConsumers take incoming
data and write it to disk. Due to system buffering on files,
FileConsumers can often sustain very high throughputs for short
durations of a few seconds, but ultimately slow down as buffers fill
faster than disks can write – unless you have a very fast SSD
device, a FileConsumer typically cannot keep up with a USB 3.0
producer.

Figure 8.10 Concrete Consumers

USBConsumer is the consumer that tests the bulk out capabilities of
your machine. Remember – producers always put data into buffers,
while consumers always remove data, which means that a USB producer
uses a bulk IN endpoint to pull data from USB and place it in a
buffer, while a USB consumer removes data from a buffer and places it
into a USB bulk OUT endpoint.

OverlappedIO
The OverlappedIO class, shown in Figure 8.11, is central to
throughput maximization with the Cypress library. When top speed
is of no concern, I recommend that you use the blocking XferData()
function to read or write to an endpoint as I did in SendFile since this
is MUCH easier to use and is still fast. For many applications, the
throughput and latency involved in using synchronous I/O is
sufficient. However, if your processing needs are such that
XferData() is not good enough, and it certainly isn’t for system
benchmarking, then asynchronous I/O is necessary.

Figure 8.11 OverlappedIO Class

The OverlappedIO class is an encapsulation of a single such
asynchronous transfer. It uses the Cypress
BeginDataXfer()/WaitForXfer()/FinishDataXfer() functions. When you
call BeginDataXfer(), the Cypress library queues up the operation for
asynchronous completion, then returns immediately while the I/O sits
in a queue waiting for its turn to run. It becomes the responsibility
of the programmer to then later call WaitForXfer() to determine if
the transfer succeeded and finally FinishDataXfer() to extract the
data at a later time, presumably while other I/O operations are
executing.

OverlappedIO takes some of the difficulty out of using this model.
The examples provided by Cypress show some of the complexities
involved in starting, stopping, managing, and cleaning up these
asynchronous operations. This benchmark takes a more object oriented
approach to the problem to make sure that the system properly manages
asynchronous I/O. As an example, the Cypress help clearly states that
every BeginDataXfer() call MUST be matched by a subsequent
FinishDataXfer() call to properly manage the destruction of system
resources, and the author can attest to all of the memory corruption
issues you will encounter if you fail to strictly follow this advice.
This class manages all of those complexities transparently – you
cannot call BeginTransfer() on a buffer currently in use.
AbortTransfer() handles properly terminating a queued transfer.
FinishTransfer() only runs when a previous call to BeginTransfer()
succeeded, and is guaranteed to be called in the destructor of the
OverlappedIO object if any resources are still unclosed.
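
The pairing rule can be illustrated without any USB hardware. The
sketch below is not the OverlappedIO class and makes no Cypress
library calls: TransferGuard is a hypothetical name, and the two
callables passed to it merely stand in for whatever starts and
completes a real transfer. It only shows the bookkeeping, namely that
a second begin is refused while one is outstanding and that the
destructor finishes anything left open.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical guard that keeps begin/finish calls strictly paired.
class TransferGuard {
public:
    TransferGuard(std::function<bool()> begin, std::function<void()> finish)
        : begin_(std::move(begin)), finish_(std::move(finish)) {
        assert(begin_ && finish_);
    }
    TransferGuard(const TransferGuard&) = delete;
    TransferGuard& operator=(const TransferGuard&) = delete;

    // Never leave a started transfer unfinished.
    ~TransferGuard() { Finish(); }

    // Refuses to start a second transfer while one is outstanding.
    bool Begin() {
        if (active_) return false;
        active_ = begin_();
        return active_;
    }

    // Runs the completion step only if a Begin() succeeded and is still open.
    void Finish() {
        if (!active_) return;
        finish_();
        active_ = false;
    }

    bool Active() const { return active_; }

private:
    std::function<bool()> begin_;
    std::function<void()> finish_;
    bool active_ = false;
};
```

Because copying is disabled, exactly one object is responsible for
each outstanding transfer.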

OverlappedIO functions:
BeginTransfer() and BeginTransfer(begin, end) – These functions
initiate a USB transaction. An OverlappedIO object can be attached to
either a bulk IN endpoint or a bulk OUT endpoint – the library keeps
track of which is active through the internal CCyUSBEndPoint data
member with which the OverlappedIO class was constructed. Calling the
version with no arguments initiates an asynchronous read from USB,
while calling the iterator version first copies the sequential chunk
of data from [begin, end) (using the same half-open range notation as
the C++ standard library, where begin points to the first byte of
data and end points to one past the last byte) and starts a write
transfer. Internal variables are set such that the OverlappedIO
instance knows that a transfer has started, hanging on to references
to the Windows overlapped I/O structure and the Cypress completion
token necessary to abort or finish the transfer.

FillBuffer is a convenience function used by BeginTransfer(begin,
end), allowing the user to explicitly place data into the buffer for
transfer without starting the actual transaction.

AbortTransfer() attempts to abort this asynchronous I/O transfer.

WaitForTransfer() waits for the asynchronous I/O to complete or time
out, whichever happens first. You cannot extract data from a read
operation until WaitForTransfer() succeeds, as the I/O is not
guaranteed to be complete, although this is not enforced by the
OverlappedIO class.

FinishTransfer() completes the transfer, deallocates library
resources associated with the operation, and returns the data to the
buffer. It takes a reference to a long to indicate the actual number
of bytes transferred as the Cypress library indicates that this might
be less than was requested.

Figure 8.11 shows a diagram of the continuous overlapped I/O within
the Cypress library with error conditions omitted for clarity.

Figure 8.11 Continuous Overlapped I/O Activity

Mid-level Managed C++ layer

USB Engine
The USBEngine class is the “traffic-cop” for the application. It
manages the creation of producers and consumers, links them up to
their respective buffers, starts the underlying threads on which each
runs, monitors the throughput, and periodically reports to any
interested party on the progress of the test. The USBEngine has the
most complex class diagram because of these multiple
responsibilities, as shown in Figure 8.12.

Figure 8.12 Class diagram of the USB Engine


USBEngine is a managed C++/CLI class. As such, it exposes
methods and properties that can be used directly by the GUI through
the common .NET framework. From the perspective of the GUI, the
USBEngine is a relatively simple class. It exposes two methods –
Run() and Stop(), one Event FactoryUpdate(), and a handful of
hierarchical structures that the GUI uses to populate the user
configurable combination of producers and consumers required to
run a test.

To accomplish those ends, several auxiliary classes are aggregated by
USBEngine. The first is the CCyUSBDevice itself. This is generated
anew every time Run() is called in case the device has been unplugged
or changed. Once the device is created, its endpoints are passed on
to any USBProducers or USBConsumers that need them.

Two .NET classes are used to facilitate threading – ProducerThread
and ConsumerThread. These are populated with an AbstractProducer and
AbstractConsumer pointer, respectively. They are then used by
System::Threading to place each created producer and consumer in its
own thread of execution. In the case where a one-way test is being
executed such as USB-Memory, there will be two such threads, one for
the producer and one for the consumer. In the case of a USB-USB test,
there will be four, two for each direction (this requires a USB
loopback image or a separate source/sink image).

The USBParameters class exposes the USB information needed by the
Engine to create the appropriate library endpoints. It includes the
Vendor and Product ID, the number of USB packets to send per
transaction, and the number of Overlapped I/O operations to execute
in parallel. It is intended that this class be created and filled by
the GUI prior to calling Run() against the engine as the engine has
no means of defaulting these values. A code snippet from the GUI
shows how this structure is filled:

theEngine.LinkParameters = new USBParameters();
theEngine.LinkParameters.VendorID = VENDOR_ID;
theEngine.LinkParameters.ProductID = PRODUCT_ID;
theEngine.LinkParameters.PacketsPerTransfer =
    Convert.ToInt32(PPXComboBox.SelectedItem);
theEngine.LinkParameters.ParallelTransfers =
    Convert.ToInt32(InterleavedComboBox.SelectedItem);

Two SourceSinkPair classes are exposed to the GUI as well. These are
the description of the producer/consumer pair(s) to be used for the
test. The GUI creates, fills, and sets these properties from user
input to the main form, as shown below:

theEngine.DownloadPipe = new SourceSinkPair();
theEngine.DownloadPipe.bufferSize = 250 * 1024 * 1024;

theEngine.DownloadPipe.producer = new Source();
theEngine.DownloadPipe.producer.bytesPerWrite =
    Convert.ToInt32(FileBlockSizeComboBox.SelectedItem) * 1024;
theEngine.DownloadPipe.producer.producer =
    (ProducerTypes)SourceListBox.SelectedIndex;
theEngine.DownloadPipe.producer.fileName = openFileDialog1.FileName;

theEngine.DownloadPipe.consumer = new Sink();
theEngine.DownloadPipe.consumer.consumer =
    (ConsumerTypes)SinkListBox.SelectedIndex;
theEngine.DownloadPipe.consumer.bytesPerRead =
    Convert.ToInt32(FileBlockSizeComboBox.SelectedItem) * 1024;
theEngine.DownloadPipe.consumer.fileName = saveFileDialog1.FileName;

These code snippets show the dependency on having a driver
application and the simple integration of the GUI with the managed
code engine.

The Engine exposes one .NET Event (delegate) to which the GUI can
subscribe. It is called FactoryUpdate, and returns a single 64-bit
long to the subscriber that informs it of how many bytes have been
processed by the consumer since the program started. Even though
there are up to four producers/consumers running at any one time in
the program, the slowest of these will always ultimately set the
overall throughput for all of the rest, and so once the system hits
steady state operation, which depends on the buffer size and the
speed of the producers and consumers, one number is a reasonable
measure of overall system performance.
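
Since the event delivers a running total and not a rate, a subscriber
has to difference successive totals to obtain throughput. The sketch
below shows that arithmetic only. ThroughputMeter is a hypothetical
helper and not part of the benchmark, the caller is assumed to supply
the time between updates, and a megabyte is taken here as 1,000,000
bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts cumulative byte counts into a rate.
class ThroughputMeter {
public:
    // totalBytes is the cumulative count reported so far; seconds is the
    // time elapsed since the previous update. Returns megabytes per second
    // for that interval.
    double Update(std::uint64_t totalBytes, double seconds) {
        assert(seconds > 0.0 && totalBytes >= lastTotal_);
        const double mbps =
            static_cast<double>(totalBytes - lastTotal_) / seconds / 1.0e6;
        lastTotal_ = totalBytes;
        return mbps;
    }

private:
    std::uint64_t lastTotal_ = 0;
};
```

Feeding it the totals from two consecutive one-second updates gives
the rate achieved during the second interval alone, which is what a
strip chart needs.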

Every time Run() is called on the Engine, the sequence of events
shown in Figure 8.13 is initiated.

Figure 8.13 USB Engine Run Sequence

At startup, the main Windows form subscribes to two Cypress events
exposed through the .NET interface to the driver. The first is the
USBDeviceAttached event, the other the USBDeviceRemoved event. These
fire every time their respective activity occurs, whereupon the main
form checks the device vendor and product IDs and the exposed
endpoints for validity.

Much of the WinForms code is involved in servicing the throughput
callback generated by the USBEngine when the program is executing. In
that code, the gauge, throughput strip chart, and total transferred
megabytes are updated (see the ThroughputCallback method for
details).

The Cypress device is closed and recreated for a fresh run. Then,
using the data supplied by the GUI in the SourceSinkPair objects, a
new Producer and a new Consumer are created. Disk files are opened,
USB endpoints attached, and buffers are allocated (depending on the
choices made) and passed through constructors to the appropriate
classes. A new ConsumerThread and a new ProducerThread are allocated
and passed pointers to the consumer and producer just created,
whereupon their respective threads begin executing. One more thread
is produced, the MonitorThread, whose job is to wake up once a
second, poll the consumer from the SourceSinkPair, and fire the .NET
Event to any subscribers who are listening. This
produce/consume/monitor operation then runs until the user calls
Stop() on the Engine class.

Once Stop() is executed, the AbortFlag of each producer and each
consumer is set true. The Stop() method then blocks on a thread
join(), waiting for the ConsumerThread and the ProducerThread to
finish execution, tear down the respective buffer and connections,
and return. No state is retained for the next execution, which starts
fresh from a clean slate.

Acknowledgement:

Conspicuous on the user interface is the gauge control that monitors
throughput on the GUI. That control was downloaded in binary form
from the good folks over at CodeProject.com from the (AGauge)
WinForms Gauge Control page through their generous code sharing
license. It was truly a joy to find something so simple and yet so
useful for this application.

Chapter 9 Getting Started With High-Speed IO.

The last major FX3 block to learn is the GPIF II block, which I
shall abbreviate to just ‘GPIF’. I have left this until last since it takes
a while to get your head around its basic functionality, let alone its
amazing capabilities. By now you should be comfortable with the
FX3’s DMA engine and the API used to control it. You will learn a
little more about the DMA engine in this Chapter as I expose some
more features that it has but we haven't used until now.

The GPIF's primary function is to interface to the outside world and
efficiently move data into and out of the FX3. What this means is
that we need some external hardware for the GPIF block to interface
to. The Cypress documentation explains that this could be an ASIC,
FPGA or even a processor. Because I like to start simply and then
move forward once we understand the basic theory, I shall be using a
CPLD as my external hardware. A CPLD is a much simplified version of
an FPGA with the added benefit that it is implemented with flash
memory so once programmed it will operate the same through power
cycles. It does not have to be reprogrammed as an FPGA does following
a reset.

Figure 9.1 shows the CPLD board that was designed for this
Chapter and was repurposed in Chapter 5 to demonstrate low
speed IO capability. The board contains a Xilinx XC2C128 CPLD
connected to all of the FX3 high-speed DQ and DQ control lines. I
have included a CPLD programmer project in the Reference Section
that enables the FX3 to reprogram the CPLD using its JTAG
connection. No other hardware is required.

Development tools for the Xilinx CPLD are a free download from
www.xilinx.com; refer to Developing Your Own CPLD Code in the
Reference Section for installation instructions. This is the same
toolset that is used for Xilinx FPGAs so if your design exceeds the
capacity of the 128 macrocell CPLD you can move your code into an
FPGA with little effort. Cypress also provides adapter boards that
enable you to connect the SuperSpeed Explorer board to the Xilinx
SP601 FPGA development board; they also have an adapter board that
connects to the Altera Cyclone III development board.

Figure 9.1 CPLD board connects to all high-speed IOs

So designing with the GPIF II interface requires an additional skill;
you need to program the CPLD or FPGA and we do this in a hardware
description language such as VHDL or Verilog. I have included a
tutorial on Verilog in the Reference Section since this is what I
will be using in all of the examples in this Chapter. There are also
many tutorials available on the web for Verilog.

First GPIF Project


Figure 9.2 shows our first GPIF project. It consists of a 32-bit
counter in the CPLD that provides data to the GPIF block at 100
MHz. The GPIF block fills DMA buffers with this data which will be
transported across USB where the data is stored in a file. We
already know from the benchmark program in the previous Chapter
that some data will be lost since it is being generated at 400 MBps
and SuperSpeed USB cannot run this fast. We shall study where
data is dropped and gradually change the design until no data is lost.

Figure 9.2 Our first high-speed IO project

In the first example the CPLD operates as a master by controlling the
data transactions. The CPLD needs a clock source and, instead of
having to include an oscillator on the CPLD board, the FX3 can output
a clock for this use. The FX3 can generate this clock AND be a slave
and this added flexibility saved me the cost of an oscillator. Figure
9.3 shows the Verilog code for the counter that we will load into the
CPLD for this example.

Figure 9.3 32-bit Verilog counter loaded into CPLD


`timescale 1ns / 1ps

module Counter1(
    input PCLK,
    input RESET,
    output reg WR_n,
    output [31:0] DQ,
    output [7:0] LED
    );
    reg [31:0] Counter;
    assign LED = ~Counter[31:24];
    assign DQ = Counter;

    always @(posedge PCLK or posedge RESET) begin
        if (RESET) begin
            WR_n <= 1;      // Disable writes
            Counter <= 0;
        end
        else begin
            WR_n <= 0;      // Enable writes
            Counter <= Counter + 1;
        end
    end
endmodule

The counter stays at 0 while the RESET input is active then drives
WR_n low and an incrementing count on DQ[31:0] when RESET is
released. I chose to keep the PCLK clock running at all times since
this makes debug simpler but you may wish to suppress clocks if there
is no data collection.

Before we can start data collection we must prepare the FX3 and this
involves setting up the GPIF block and some DMA channels.

Setting up GPIF II
GPIF II is a soft-loaded state machine that powers up in the off
state. To get GPIF to do useful work we must program it and this is
done using an external tool called GPIF II Designer. Creating a state
machine for the GPIF is at the opposite end of the scale from writing
equations with Verilog. With GPIF II Designer you create state
machine pictures using a graphical editor where states are drawn in
boxes and transitions are drawn as lines between these boxes.
Actions are assigned to each state. GPIF has a 32-bit address
counter, a 32-bit data counter and a 16-bit control counter, with
matching comparators that you can use. A state action could include
incrementing one of these counters, setting an IO pin, reading or
writing from the GPIF pins, reading or writing from a DMA socket or
interrupting the CPU. A state transition could be triggered by a
comparison from one of these counters, the value of an IO pin, the
value of a DMA flag (we will add this in the next iteration of the
example) or a signal from the CPU. You can define up to 256 states
which should be way more than anyone will use. You can also implement
several independent state machines within the structure providing
that there are no more than 256 total states. One implementation
restriction that we will hit later in this Chapter is that only one
or two conditions can be evaluated for a state transition. You can
design around this using extra or mirror states and the tool will
help you construct these. There is a lot more that I could say but
rather than repeat a lot of text here I refer you to Getting Started
with GPIF II Designer.

The state machine we need to input data from the CPLD counter and
save it to the GPIF socket is shown in Figure 9.4.

Figure 9.4 Saving GPIF data to a DMA buffer

GPIF state machines always start in the START state and immediately
transition to the next state once the GPIF block is enabled (how this
is done is covered in the next Figure). We idle at WAIT until the
CPLD drives WR low, at which point we move to SAVE. Data is strobed
in at SAVE and we stay in SAVE strobing data in on every PCLK until
WR returns high. Note that if there is not a DMA buffer waiting then
this data falls on the floor and is lost. We will look at flags to
detect or prevent data loss in the next example. The GPIF II Designer
compiles this state machine into a header file which is included in
your project and StartApplication loads and starts the GPIF engine at
the highlighted lines shown in Figure 9.5.

Figure 9.5 DMA channel and GPIF initialization


void StartApplication(void)
// USB has been enumerated, time to start the application running
{
    uint16_t size;
    CyU3PEpConfig_t epConfig;
    CyU3PDmaChannelConfig_t dmaConfig;
    CyU3PReturnStatus_t Status;
    CyU3PUSBSpeed_t usbSpeed = CyU3PUsbGetSpeed();

    CyU3PDebugPrint(4, "\n@StartApplication, running at %sSpeed",
        BusSpeed[usbSpeed]);
    CyU3PMemSet((uint8_t *)&epConfig, 0, sizeof(epConfig));
    epConfig.enable = CyTrue;
    epConfig.epType = CY_U3P_USB_EP_BULK;
    epConfig.burstLen = (usbSpeed == CY_U3P_SUPER_SPEED) ?
        (ENDPOINT_BURST_LENGTH) : 1;
    epConfig.pcktSize = EpSize[usbSpeed];
    // Setup and flush the Consumer endpoint
    Status = CyU3PSetEpConfig(CONSUMER_ENDPOINT, &epConfig);
    CheckStatus("CyU3PSetEpConfig_Enable", Status);
    CyU3PUsbFlushEp(CONSUMER_ENDPOINT);

    // Create a DMA AUTO channel for the GPIF to USB transfer
    CyU3PMemSet((uint8_t *)&dmaConfig, 0, sizeof(dmaConfig));
    dmaConfig.size = EpSize[usbSpeed] * ENDPOINT_BURST_LENGTH;
    dmaConfig.count = 2; // Increase this if I have available memory later
    dmaConfig.prodSckId = (CyU3PDmaSocketId_t)GPIF_PRODUCER_SOCKET;
    dmaConfig.consSckId = (CyU3PDmaSocketId_t)CONSUMER_ENDPOINT_SOCKET;
    dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
    dmaConfig.notification = CY_U3P_DMA_CB_CONS_SUSP;
    dmaConfig.cb = GpifToUsbDmaCallback;
    Status = CyU3PDmaChannelCreate(&glGPIF2USB_Handle,
        CY_U3P_DMA_TYPE_AUTO, &dmaConfig);
    CheckStatus("DmaChannelCreate", Status);

    // Start the DMA Channel with an infinite transfer size
    Status = CyU3PDmaChannelSetXfer(&glGPIF2USB_Handle, 0);
    CheckStatus("DmaChannelStart", Status);

    // Load, configure and start the GPIF state machine
    Status = CyU3PGpifLoad(&CyFxGpifConfig);
    CheckStatus("GpifLoad", Status);
    CyU3PGpifRegisterCallback(Gpif2CpuIntrCallback);
    Status = CyU3PGpifSMStart(0, 0); //START, ALPHA_START);
    CheckStatus("GpifStart", Status);
    glIsApplicationActive = CyTrue;
}

Setting up a DMA Channel


The DMA channel is set up with four 16 KB buffers and no
CPU involvement. Once the GPIF socket fills the first buffer it starts
filling the second buffer. At this instant the endpoint socket starts
emptying the first buffer to USB. Once the endpoint socket has
emptied the first buffer then this buffer is given back to the GPIF
socket to use. If the data cannot be sent to USB (the application
reading this data may not be active) then all four DMA buffers will
stack up at the endpoint socket and none will be returned to the
GPIF socket which will then start dropping data. A 16KB buffer fills
in 40 usec and the FX3 will fill all four in 160 usec so the
application should be ready to receive data before it enables the
data to start coming!
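
The timing figures above follow from simple arithmetic, worked here
as a small sketch. The function names are hypothetical and the
inputs are taken from the text: a 100 MHz PCLK, a 32-bit (4 byte)
bus, and buffers assumed to be exactly 16,384 bytes. With those
numbers one buffer fills in a little under 41 microseconds and four
in a little under 164, which the text rounds to 40 and 160.

```cpp
#include <cassert>
#include <cstdint>

// Microseconds for a source clocked at pclkMHz, moving busBytes per clock,
// to fill one DMA buffer of bufferBytes.
double FillTimeUs(std::uint32_t bufferBytes, double pclkMHz,
                  std::uint32_t busBytes) {
    assert(pclkMHz > 0.0 && busBytes > 0);
    // MHz times bytes per clock gives bytes per microsecond.
    return bufferBytes / (pclkMHz * busBytes);
}

// Microseconds until every buffer is full if nothing is being drained.
double TimeToStallUs(std::uint32_t bufferCount, std::uint32_t bufferBytes,
                     double pclkMHz, std::uint32_t busBytes) {
    return bufferCount * FillTimeUs(bufferBytes, pclkMHz, busBytes);
}
```

Halving PCLK doubles both figures, which is one way to see why the
later design stages lower the clock.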

This filling, emptying and recycling of buffers is all handled by the
DMA hardware since the channel was set up for AUTO operation. A small
variation of AUTO mode is AUTO_SIGNAL which signals the CPU on chosen
events so that it can keep track of buffer transfers. This does not
slow the data transfers. We will add this feature later.

Reprogram the CPLD using the Counter1.xsvf file then load and run
GPIF_Example1.img on your SuperSpeed Explorer board. This implements
the code that we have been discussing over the last several pages
such that the FX3 is now ready to start data collection. The last
link in this data collection chain is a host application program that
kickstarts the whole process and saves the data to a file on your
host computer. Locate and run the CollectData application on your
host computer; this is a variant of the Benchmark program that
collects data continuously as fast as it can until it reaches the set
time limit. This data is saved in a file called CollectData.bin.
WARNING: don't run this program for too long since it is writing 1 GB
of data to your hard drive about every three seconds! The source code
for CollectData, and all other programs used in this book, is
included within the examples code.

Design Stage 1
Set the limit to 10 seconds and click START as shown in
Figure 9.6.

Figure 9.6 Running the CollectData host application


On completion the program has saved the data and it reports
the average data rate. Open CollectedData.bin with a hex editor, or
run ConvertData which reads DataCollected.bin as an array of 32-bit
integers and creates an ASCII file, CollectedData.txt, which can be
opened with any editor or even Excel. Review the list of numbers
and note that it is not monotonic; there are small gaps at some 4K
sample boundaries (samples are 32-bits) and some larger gaps at
other 4K boundaries.

Most PCs can keep up with the initial data but then buffers in
the PC get full and start to be over-written (CollectData was
designed this way, it collects data as fast as it can from USB at the
expense of over-writing buffers before they have been written to
disk). You need to run for 20-30 seconds to see the typical
throughput but this generates enormous files that Excel can’t open.
So locate and run a utility called CheckData – this looks through the
CollectData.bin file and reports discontinuities in the data. Click,
drag and drop the data file onto CheckData.exe.
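
The kind of check involved is easy to picture. The sketch below is a
hypothetical re-creation of the idea and not the CheckData source:
the Gap structure and FindGaps function are names invented here, and
the samples are assumed to be already loaded into memory as 32-bit
counter values in capture order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Gap {
    std::size_t index;      // position of the first sample after the gap
    std::uint32_t missing;  // how many counts were skipped
};

// Reports every place where consecutive samples do not differ by exactly 1.
std::vector<Gap> FindGaps(const std::uint32_t* samples, std::size_t count) {
    assert(samples != nullptr || count == 0);
    std::vector<Gap> gaps;
    for (std::size_t i = 1; i < count; ++i) {
        // Unsigned subtraction stays correct when the 32-bit counter wraps.
        const std::uint32_t step = samples[i] - samples[i - 1];
        if (step != 1) gaps.push_back({i, step - 1});
    }
    return gaps;
}
```

Dividing each reported index by the number of samples per DMA buffer
shows whether a gap lines up with a buffer boundary.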

The large gaps are caused by the USB transfer not being able
to keep up with the CPLD’s 400 MBps data rate. The only solution is
to reduce the data rate so that it is less than the average data rate as
reported by the CollectData application. We could put some data
compression at our data source; this is viable in an FPGA design but
there isn't the capacity in our small CPLD so I will take the simpler
approach of reducing PCLK.

At the debug console enter the keyword PCLK and the FX3
will display its current value. You can now enter PCLK+ or PCLK- to
increase and decrease the clock driving the CPLD state machines
which will have the effect of changing the data rate of the
incrementing counter. Figure 9.7 shows the code behind this PCLK
command. You should run this example several times until the data
rate is low enough that your host computer can keep up. Use
different save filenames to collect the data if you would like to
compare results.

Figure 9.7 The CPLD data clock can be changed


if (strncmp("pclk", glConsoleInBuffer, 4) == 0)
{
    CyU3PPibClock_t pibClock;
    // A larger divider gives a slower clock, so "pclk-" increments Clock
    if (glConsoleInBuffer[4] == '-') Clock++;
    if ((glConsoleInBuffer[4] == '+') && (Clock > 0)) Clock--;
    Status = CyU3PPibDeInit(); // Turn off GPIF so that I can start it again
    pibClock.clkDiv = Clock + 4;
    pibClock.clkSrc = CY_U3P_SYS_CLK;
    pibClock.isHalfDiv = (Clock & 1);
    pibClock.isDllEnable = CyFalse;
    Status = CyU3PPibInit(CyTrue, &pibClock);
    CheckStatus("Change GPIF Clock", Status);
    // 32-bit bus = 4 bytes per clock, so the second figure is in MBytes/s
    DebugPrint(4, "\nGPIF Clock = %d MHz = %d MBps ",
               400/(Clock+4), 1600/(Clock+4));
}
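
The arithmetic in that DebugPrint call can be pulled out into a small host-side sketch to see which PCLK step brings the CPLD data rate under what your PC can sustain. The helper names below are hypothetical and the sketch only mirrors the 400/(Clock+4) and 1600/(Clock+4) expressions from Figure 9.7; it does not model any limit the hardware places on the divider.

```c
#include <assert.h>
#include <stdint.h>

#define SYS_CLK_MHZ   400u  /* clock value used in the Figure 9.7 arithmetic */
#define BYTES_PER_CLK 4u    /* 32-bit GPIF bus moves 4 bytes per PCLK */

/* GPIF clock in MHz for a given PCLK step (the variable Clock in Figure 9.7) */
uint32_t pclk_mhz(uint32_t clock_step)
{
    return SYS_CLK_MHZ / (clock_step + 4u);
}

/* CPLD data rate in MBytes/s for a given PCLK step */
uint32_t pclk_mbytes_per_s(uint32_t clock_step)
{
    return (SYS_CLK_MHZ * BYTES_PER_CLK) / (clock_step + 4u);
}

/* Smallest step whose data rate does not exceed what the host sustained.
 * Assumption: the real divider range is limited by the hardware, so a
 * result from this loop must still be checked against what PCLK- accepts. */
uint32_t first_safe_step(uint32_t host_mbytes_per_s)
{
    uint32_t step = 0;

    assert(host_mbytes_per_s > 0);
    while (pclk_mbytes_per_s(step) > host_mbytes_per_s) step++;
    return step;
}
```

For example, if CollectData reported an average of 250 MBytes/s then first_safe_step(250) returns 3, which corresponds to entering PCLK- three times from the default setting.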

Design stage 2
The small gaps of non-monotonic counter data in
CollectedData.txt are due to the latency when switching DMA buffers
at the GPIF block. We drop between 50 and 150 counts as the
buffers are switched. The FX3 solves this with more hardware which
Cypress unfortunately calls a thread. To distinguish this feature from
the RTOS threads that were described in Chapter 3, I shall refer
to these new threads as hardware threads throughout this Chapter.

A GPIF state machine has access to up to 4 hardware
threads per socket as shown in Figure 9.8. Only one hardware
thread can be active at a time but switching between hardware
threads is done with zero latency.

Figure 9.8 Hardware threads allow zero latency multiplexing

The hardware thread address is provided by the GPIF state
machine or the external hardware. We will start with the GPIF state
machine then move this to the CPLD in the next example. Figure
9.9 shows the GPIF state machine extended to provide hardware
thread addressing.

Figure 9.9 Hardware thread addressing controlled by the state
machine

We must now extend our application to listen to two GPIF
sockets which will be accepting data alternately. The DMA API
includes a MANY_TO_ONE construct to allow collection from
multiple sockets and a ONE_TO_MANY construct to allow
distribution of data to multiple sockets. This is a small change in
StartApplication.c and this is highlighted in Figure 9.10.

Figure 9.10 Collecting data from two GPIF sockets


void StartApplication(void)
// USB has been enumerated, time to start the application running
{
    uint16_t size;
    CyU3PEpConfig_t epConfig;
    CyU3PReturnStatus_t Status;
    CyU3PDmaMultiChannelConfig_t dmaMultiConfig;
    CyU3PUSBSpeed_t usbSpeed = CyU3PUsbGetSpeed();

    // Display the enumerated device bus speed
    CyU3PDebugPrint(4, "\n@StartApplication, running at %sSpeed",
                    BusSpeed[usbSpeed]);
    // Based on the Bus Speed configure the endpoint packet size
    size = EpSize[usbSpeed];

    CyU3PMemSet((uint8_t *)&epConfig, 0, sizeof(epConfig));
    epConfig.enable = CyTrue;
    epConfig.epType = CY_U3P_USB_EP_BULK;
    epConfig.burstLen = (usbSpeed == CY_U3P_SUPER_SPEED) ?
                        (ENDPOINT_BURST_LENGTH) : 1;
    epConfig.pcktSize = size;

    // Setup and flush the Consumer endpoint
    Status = CyU3PSetEpConfig(CONSUMER_ENDPOINT, &epConfig);
    CheckStatus("CyU3PSetEpConfig_Enable", Status);
    CyU3PUsbFlushEp(CONSUMER_ENDPOINT);

    // Create a multi-DMA AUTO channel for the GPIF to USB transfer
    CyU3PMemSet((uint8_t *)&dmaMultiConfig, 0, sizeof(dmaMultiConfig));
    dmaMultiConfig.size = (size * ENDPOINT_BURST_LENGTH);
    dmaMultiConfig.count = DMA_BUFFER_COUNT;
    dmaMultiConfig.validSckCount = 2; // Number of producer sockets
    dmaMultiConfig.prodSckId[0] = (CyU3PDmaSocketId_t)PING_PRODUCER_SOCKET;
    dmaMultiConfig.prodSckId[1] = (CyU3PDmaSocketId_t)PONG_PRODUCER_SOCKET;
    dmaMultiConfig.consSckId[0] =
        (CyU3PDmaSocketId_t)CONSUMER_ENDPOINT_SOCKET;
    dmaMultiConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
    dmaMultiConfig.notification =
        CY_U3P_DMA_CB_CONS_EVENT | CY_U3P_DMA_CB_PROD_EVENT;
    dmaMultiConfig.cb = DualGpifToUsbDmaCallback;
    Status = CyU3PDmaMultiChannelCreate(&glDualGPIF2USB_Handle,
                 CY_U3P_DMA_TYPE_AUTO_MANY_TO_ONE, &dmaMultiConfig);
    CheckStatus("DmaMultiChannelCreate", Status);

    // Start the DMA Channel with transfer size to Infinite and with PING (Offset = 0)
    Status = CyU3PDmaMultiChannelSetXfer(&glDualGPIF2USB_Handle, 0, 0);
    CheckStatus("DmaMultiChannelStart", Status);

    // Load, configure and start the GPIF state machine
    Status = CyU3PGpifLoad(&CyFxGpifConfig);
    CheckStatus("GpifLoad", Status);
    CyU3PGpifRegisterCallback(Gpif2CpuIntrCallback);
    Status = CyU3PGpifSMStart(0, 0); //START, ALPHA_START);
    CheckStatus("GpifStart", Status);

    // OK, Application can now run
    glIsApplicationActive = CyTrue;
}

Locate and load GPIF_Example2.img into your
SuperSpeed Explorer board and adjust PCLK to the safe value
determined in the previous example. Now run CollectData for 20
seconds or so, then run CheckData to check that the collected data is
monotonic.

Design stage 3
You may want more proof that no data is being lost so in this
stage we give the CPLD access to the DMA flags. This simplifies
the GPIF state machine as shown in the top portion of Figure 9.11
but shifts the complexity to the CPLD as shown in the bottom portion
of Figure 9.11. Design of a GPIF interface to external hardware is an
iterative process since these two units cooperate in solving the
problem.

Figure 9.11 The CPLD and GPIF state machine cooperatively
implement a solution

`timescale 1ns / 1ps

module CPLDinControl(
input ClockIn, // From FX3
input RESET, // 0 = resets counter, 1 = Supply data
input DMA0_Ready, // 0 = can accept data, 1 = busy and samples will be missed
input DMA1_Ready, // 0 = can accept data, 1 = busy and samples will be missed
output [31:0] GPIF, // CPLD drives a counter onto GPIF
output WR_N, // 0 = no data sent, 1 = sample data being sent
output SelectDMA, // CPLD chooses FX3 DMA Buffer (actually, Thread)
output [7:0] LED // Some user feedback
);

reg [31:0] Counter; // Counts samples sent to FX3
reg WR_N; // Determines if sample will be stored in FX3

// Define some 'continuously calculated' signals
assign GPIF = RESET ? Counter : 32'hZZZZZZZZ;
assign LED = Counter[31:24]; // Top 8 bits, for display
assign SelectDMA = Counter[12]; // Swap DMA Buffers every 4096 samples

always @ (posedge ClockIn) begin
if (RESET) begin
WR_N <= 1;
Counter <= -1;
end
else begin
// First manage the control of WR_N
if ((Counter[11:0] == 4095) && (SelectDMA ? !DMA0_Ready : !DMA1_Ready))
WR_N <= 1;
else if (WR_N && (SelectDMA ? DMA1_Ready : DMA0_Ready)) begin
WR_N <= 0;
Counter <= Counter + 1;
end
end

end
endmodule

Note that it is now the CPLD that is choosing the hardware
thread and therefore the DMA socket that is receiving the counter
data; the Thread Number selection in Figure 9.11 is greyed out
meaning that an external address is selecting the hardware thread.
The CPLD must also keep track of the sample count so that it knows
when to switch buffers. As one DMA buffer is about to be filled the
CPLD checks that the next DMA buffer is available using a DMA_Ready
flag. If the buffer is available then the CPLD starts to fill it but if it's
not available then it raises WR_N to indicate that data is being lost.
This is shown in Figure 9.12. This is far superior to dropping the
data on the floor as the previous examples have done.

Figure 9.12 Operation for DMA_Ready and Not Ready
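
The buffer bookkeeping that the CPLD does with its sample counter can be modelled in C for experimenting on a PC. This is only a sketch of the two counter-derived decisions in the Verilog above (SelectDMA = Counter[12] and the test on Counter[11:0]); the function names are my own and it does not model the DMA_Ready handshake.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SAMPLES_PER_BUFFER 4096u  /* 16 KB DMA buffer / 4 bytes per sample */

/* Hardware thread (0 or 1) that receives a given sample: bit 12 of the
 * counter toggles every 4096 samples, as SelectDMA does in the CPLD. */
unsigned thread_for_sample(uint32_t counter)
{
    return (counter >> 12) & 1u;
}

/* True when this sample is the last one that fits in the current buffer,
 * which is when the CPLD must know whether the next buffer is ready. */
bool is_last_sample_in_buffer(uint32_t counter)
{
    assert(SAMPLES_PER_BUFFER == (1u << 12));
    return (counter & (SAMPLES_PER_BUFFER - 1u)) == (SAMPLES_PER_BUFFER - 1u);
}
```

Samples 0 to 4095 go to thread 0, samples 4096 to 8191 go to thread 1, and so on; the readiness check happens on samples 4095, 8191, and every 4096 samples after that.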


The GPIF state machine sees the loss indication and moves
to an ALERT state which interrupts the CPU. The CPU can then
stop the data collection and inform the user that data has been lost
and probably suggest a slower sample rate. Figure 9.13 shows the
final interface signals shared by the GPIF state machine and the
CPLD.

Reprogram the CPLD with Counter3.xsvf then load and run
GPIF_Example3.img on your SuperSpeed Explorer board and
adjust PCLK to the safe value determined in the previous example.
If the PC cannot keep up with the data rate then this will be indicated
in the Debug Console window.

Figure 9.13 Final interface design for this real-time data
collection example

Completed Design – a Logic Analyzer


What we have just built is a simple application that reliably
and accurately saves the value of 32 data pins at a reasonably high
clock rate. This is the basis of a logic analyzer! The good folks at
Saleae used this basic concept then added a great deal of software
and hardware around the FX3, including ADCs, and wrote a friendly
human interface to produce their Logic Pro 8 and Logic Pro 16; the
latter is shown in Figure 9.14.

Figure 9.14 The Saleae Logic Pro 16 is based on FX3


A block diagram provided by Saleae is shown in Figure 9.15.
Note that Saleae detects and records only signal edges and can
therefore handle more data at a higher rate since the data is
compressed before being passed onto the FX3. Their FPGA does a
lot of the front-end work then uses the FX3 to quickly move this data
to the PC host for display and storage.
This is a great tool that I have used since its first prototype.
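
To see why recording only edges reduces the data volume, here is a minimal sketch of the idea in C. It is my own illustration and makes no claim about how Saleae's FPGA actually encodes its data; the record layout and the function name record_edges are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One record per change on the 32 input pins (hypothetical layout) */
typedef struct {
    uint32_t sample_index;  /* which sample the change was seen on */
    uint32_t value;         /* new state of the 32 pins */
} edge_record_t;

/* Copies the first sample and every sample that differs from its
 * predecessor into out[]. Returns the number of records written and
 * stops early if out[] is full. */
size_t record_edges(const uint32_t *samples, size_t n,
                    edge_record_t *out, size_t out_cap)
{
    size_t used = 0;

    assert(out != NULL || out_cap == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || samples[i] != samples[i - 1]) {
            if (used == out_cap) break;
            out[used].sample_index = (uint32_t)i;
            out[used].value = samples[i];
            used++;
        }
    }
    return used;
}
```

Slowly changing signals produce far fewer records than samples, which is the source of the saving; our incrementing counter is the worst case since it changes on every sample and each record is twice the size of a raw sample.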

Figure 9.15 Block diagram of the Saleae Logic Pro 16

Chapter Summary

This Chapter has introduced the GPIF interface and we
investigated it by transferring real-time data from the outside world
into the PC. There were issues to resolve to assure reliable data
collection and these were solved. We successfully built an example
that reliably collects real time data at the highest data rate supported
by the host computer.

The next Chapter will also look at streaming data but this is
from a video/audio source at a slower rate; the challenge here will be
fitting the FX3 implementation between two pre-defined standards.

Chapter 10 Moving Real Data, Part 1.
My original plan, as I was designing the learning flow of the
book, was to cover a video application here and explain how to
interface between two known interfaces, the USB Video Class
Specification and Video Hardware. This would include how to add
headers to real time data as it passes through the FX3. However,
Cypress Application Note AN75779 does such a good job of this
that I couldn't think of anything to add :-).

If your application has anything to do with capturing data from a
video sensor then you should read AN75779 since it covers all
aspects of a design. The good folks at Lattice Semiconductor have
implemented a comprehensive reference design based upon the
USB Video Class Specification and the FX3 and this is shown in
Figure 10.1.

Figure 10.1 A Video Reference Design from Lattice

The Lattice USB3 Audio/Video Bridge Development Kit is a
production-ready High Definition video capture and conversion
system based on the LatticeECP3™ FPGA family, designed by
Mikroprojekt. Supplied with Lattice's Video to USB3 Bridge reference
design and USB 3.0 UVC video class firmware by Mikroprojekt, the
kit works out of the box and can be easily demonstrated on USB 3.0
hosts running Windows, MacOS, or Linux. The board is recognized
as a standard video capture device and operates with any
commercial or open source software.

The solution, based on the LatticeECP3, supports high speed
reception and packing of video and audio data into USB 3.0 UVC
and UAC data frames without the use of external memory buffers. A
block diagram is shown in Figure 10.2. The Cypress EZ-USB FX3
USB 3.0 interface provides 5 Gbps streaming links to the USB host.
The Analog Devices ADV7611 provides HDMI 1.4a capture
capabilities (with optional HDCP decoding), and the Lattice Tri-Rate
SDI PHY enables reception of professional audio and video signals
over the LatticeECP3 SERDES interface. An expansion connector
allows the connection of a camera or sensor over either the MIPI
CSI-2 interface, or SubLVDS differential lines, quickly transforming
the board into a USB 3.0 High Definition camera suitable for
industrial vision applications.

Figure 10.2 Block Diagram of Lattice FPGA Solution


Chapter 11 Moving Real Data, Part 2.

In this Chapter we will move a lot of data as fast as a USB 3.0
host computer can support but with the assumption that the data
source or data sink can request that the data flow be stopped
and restarted with no consequences. This is typical operation for
storage devices, printers, scanners and many others that must
process some of the data before continuing.

The traditional structure used for adjusting data flow between
two devices is a FIFO as shown in Figure 11.1. The data source
stops writing when it sees a Full signal and the data sink stops
reading when it sees an Empty signal. There will be timing
constraints on these signals which will make our job more
“interesting”.

Figure 11.1 A FIFO can be used to adjust data flow
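
The Full and Empty behavior described above can be sketched as a small software FIFO. This is a generic ring buffer written for illustration; it is not how the FX3 DMA buffers are implemented, and the depth of 8 entries is an arbitrary assumption chosen to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 8u   /* arbitrary depth for this illustration */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head;    /* next slot to write */
    size_t tail;    /* next slot to read */
    size_t count;   /* entries currently stored */
} fifo_t;

void fifo_init(fifo_t *f) { f->head = f->tail = f->count = 0; }

/* The two flow-control flags: the source watches Full, the sink watches Empty */
bool fifo_full(const fifo_t *f)  { return f->count == FIFO_DEPTH; }
bool fifo_empty(const fifo_t *f) { return f->count == 0; }

/* Data source side: refuses the write when the FIFO is Full */
bool fifo_write(fifo_t *f, uint32_t value)
{
    assert(f != NULL);
    if (fifo_full(f)) return false;
    f->data[f->head] = value;
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Data sink side: refuses the read when the FIFO is Empty */
bool fifo_read(fifo_t *f, uint32_t *value)
{
    assert(f != NULL && value != NULL);
    if (fifo_empty(f)) return false;
    *value = f->data[f->tail];
    f->tail = (f->tail + 1u) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In software the flags are always up to date when they are tested. In hardware they arrive some clocks late, and handling that latency is the subject of the rest of this Chapter.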

We have the flexibility to make the GPIF interface a master or
a slave. Typically the external device that the FX3 is connecting to is
already defined to be a master or a slave so the GPIF implements
the matching interface. A synchronous RAM interface, for example,
is a slave so the GPIF would need to be a master. In this application
the CPLD can be a master or a slave so I decided to implement both
sides so that you could compare and contrast the two choices.

I will implement a synchronous interface since this is MUCH
simpler than an asynchronous interface. I have an asynchronous
interface planned for inclusion in the second edition of this book. To
meet the setup and hold times of the FX3 and the Xilinx XC2C128
CPLD, both devices are clocked on the positive edge of PCLK. This
means that signals driven on this edge cannot be used until the next
edge.

Slave FIFO Design


In the first example of this Chapter I shall have the real world
(the CPLD in our case) be the source of data. In the second
example it will be the data sink, then in the third example I shall
make it bi-directional. I first use the FX3 as a slave device then I
repeat these examples with the FX3 as a master device. We can get
a long way into the theory of moving SuperSpeed data using the
CPLD but to test this at maximum speed we need a real-world,
SuperSpeed data source and sink so the last example in this
Chapter will cross-connect two SuperSpeed Explorer boards and we
will transfer files between two host computers as fast as the PCs
allow.

Figure 11.2 shows our first example which implements a slave
FIFO read. The data source is our 32-bit counter again so that we
can test the reliability of the transfer. I call it a Read example since
that is what a host computer will be doing; this is consistent with
USB naming conventions. The CPLD writes into the FX3 and treats
it like a FIFO. The CPLD controls the data transfers and the FX3 is
a slave device. I generate FIFO_Full from DMA_Ready signals from
the FX3 as I did in Chapter 9 but this time we use the information to
throttle the data flow rather than just complaining to the FX3.

Figure 11.2 This Slave FIFO Read example is similar to Figure
9.13

Closer inspection of the DMA0_Ready timing shows that it is
not a good FIFO_Full signal. It has a 3 clock latency on reporting
that the DMA buffer is full which is too late to stop the CPLD which is
writing 32-bit data on every clock edge. A reasonable approach is to
let the CPLD count the WR signals that it is generating and stop
when it gets to DMA_BUFFER_SIZE; this ensures that there is no
overflow. However, I am NOT going to recommend this strategy
since I have now debugged two customer systems that did it this
way and their systems stopped working!

The story at both customers was the same: it was close to the
end of the project and the FX3 firmware writer was running short of
RAM so he changed the endpoint burst size from 16 to 8 and this
reduced DMA_BUFFER_SIZE from 16KB to 8KB. But he didn't tell
the FPGA designer. The system now started to have data errors and
they looked everywhere. Nobody suspected the FPGA since “that
has been working correctly for months”. Having a software
dependency in the hardware is never a good thing.
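
The hidden dependency in that story is simple arithmetic and can be sketched as follows. The helper names are hypothetical; the 1024-byte figure is the SuperSpeed bulk packet size, and the buffer size expression is the one used for the DMA channel configuration in this book (packet size times burst length).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SS_BULK_PACKET_BYTES 1024u  /* SuperSpeed bulk max packet size */

/* DMA buffer size as the firmware computes it: packet size x burst length */
uint32_t dma_buffer_bytes(uint32_t packet_bytes, uint32_t burst_len)
{
    return packet_bytes * burst_len;
}

/* Number of bus writes that fill one DMA buffer */
uint32_t words_per_buffer(uint32_t buffer_bytes, uint32_t bus_bits)
{
    assert(bus_bits == 8 || bus_bits == 16 || bus_bits == 32);
    return buffer_bytes / (bus_bits / 8u);
}

/* Does a write limit hard-coded in the CPLD/FPGA still fit the buffer? */
bool cpld_limit_is_safe(uint32_t cpld_word_limit, uint32_t packet_bytes,
                        uint32_t burst_len, uint32_t bus_bits)
{
    uint32_t bytes = dma_buffer_bytes(packet_bytes, burst_len);
    return cpld_word_limit <= words_per_buffer(bytes, bus_bits);
}
```

With a burst length of 16 the buffer holds 4096 32-bit writes, so a CPLD limit of 4096 is safe. After the firmware change to a burst length of 8 the buffer holds only 2048 writes and the same hardware overruns every buffer.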

If you have good documentation and a good team process
then using a counter in the CPLD/FPGA is a solid, simple approach.
However, there is a better way.

The DMA controller has a DMA_Watermark flag that can be
set when the DMA buffer is almost full. For a WR on a 32 bit bus
there is a 4 clock delay [for other bus sizes this increases by
(32/BusSize - 1) clocks] and since the CPLD needs a 1 clock
warning I set the Watermark to 5 using
CyU3PGpifSocketConfigure (described in the next section). We
therefore need to use two DMA flags to generate a robust FIFO_Full
signal (WR = ~FIFO_Full) as shown in Figure 11.3.

Figure 11.3 Generating FIFO_Full from two DMA flags
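
The Watermark arithmetic can be written down as a small helper so that the value is recalculated whenever the bus width changes. This is a sketch with a hypothetical function name. The 4 clock WR delay and the (32/BusSize - 1) adjustment come from the text above; the read case uses the "1 clock less" figure given later in this Chapter, and applying the same bus-size adjustment to reads is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Watermark value to program for a given bus width and direction.
 *   bus_bits              GPIF data bus width: 8, 16 or 32
 *   is_write              true for WR cycles, false for RD cycles
 *   master_warning_clocks clocks of notice the external master needs
 */
uint32_t watermark_setting(uint32_t bus_bits, bool is_write,
                           uint32_t master_warning_clocks)
{
    uint32_t latency;

    assert(bus_bits == 8 || bus_bits == 16 || bus_bits == 32);

    latency = is_write ? 4u : 3u;      /* flag delay on a 32-bit bus */
    latency += (32u / bus_bits) - 1u;  /* extra delay on narrower buses */
    return latency + master_warning_clocks;
}
```

For the 32-bit WR case with a 1 clock warning this gives 5, which is the value passed to CyU3PGpifSocketConfigure in this example.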


The first timing diagram shows the start of a transfer with the
DMA_Ready indicating that the next DMA buffer is available and the
second timing diagram (on this page) shows DMA0_Watermark
giving early warning that the DMA buffer is soon to be filled. Writes
into the DMA buffer are shown as green circles.

The state machines for this example are shown in Figure
11.4. I will always keep the GPIF state machine on the left and the
CPLD state machine on the right to match the block diagram Figures
(such as Figure 11.2). The GPIF state machine is “copied and
pasted” from GPIF Designer and I drew the CPLD state machine with
square, colored boxes and straight transitions so that it is easy for
you to differentiate between the two state machine diagrams; I also
display any CPLD asserted signal inside the state machine box. I
have lined up matching states as best I could. In this example the
FX3 GPIF block is a slave and the CPLD is the master.

The FX3 starts the process by releasing reset from the CPLD
(how is explained in a few paragraphs). The GPIF slave, which is
using socket 0, has already started and it is in the WAIT4WR state
waiting to be told that there is valid data on the DQ data lines that it
should capture. So let's look at how the CPLD master does this.

Once reset is removed the CPLD moves to the READY state
and waits for RUN = 1. Note in Figure 11.2 that the user button on
the Explorer board is connected to the CPLD and pressing this
toggles the RUN signal. The CPLD moves to the WAIT4DMA state
where it checks if the FX3 is ready to accept data. The FX3 output
signal DMA0_Ready will be high indicating that the DMA buffer is not
full. The CPLD now moves to the WRITE state where it asserts the
WR signal. The GPIF state machine seeing WR high moves to its
READ state. Both state machines stay in these states for a while
transferring 32 bits of counter data at 100 MHz. When
DMA0_Watermark goes high the CPLD moves to the WAIT4DMA
state, where WR is no longer asserted, and waits for another DMA
buffer to become available.

Figure 11.4 GPIF and CPLD State Machines for a Slave FIFO
Read

We measured the time it takes for a DMA buffer to become
available in Chapter 9; this was about 70 clocks at 100 MHz. We also
learnt in Chapter 9 that using two DMA sockets and hardware
threads could reduce this delay to 0 and this includes the 3 clock
delay on DMA0_Ready. I decided to keep this first example simple
so I am only using one socket. Note that a 70 clock delay with a 4096
clock burst is less than 2% degradation.
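
The cost of the single-socket approach can be estimated with a couple of lines of arithmetic. The function names below are hypothetical; the inputs are the figures quoted in this Chapter (a 16 KB buffer on a 32-bit bus at 100 MHz, and about 70 clocks to switch buffers).

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of bus clocks lost while waiting for the next DMA buffer */
double stall_fraction(uint32_t burst_clocks, uint32_t stall_clocks)
{
    assert(burst_clocks + stall_clocks > 0);
    return (double)stall_clocks / (double)(burst_clocks + stall_clocks);
}

/* Time in microseconds to fill one DMA buffer at one bus word per clock */
double buffer_fill_time_us(uint32_t buffer_bytes, uint32_t bus_bytes,
                           double pclk_mhz)
{
    assert(bus_bytes > 0 && pclk_mhz > 0.0);
    return ((double)buffer_bytes / (double)bus_bytes) / pclk_mhz;
}
```

With 4096 clocks of data followed by a 70 clock stall, stall_fraction(4096, 70) is roughly 0.017, or about 1.7% of the available bandwidth. buffer_fill_time_us(16384, 4, 100.0) gives about 41 microseconds per buffer.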

The state machines continue around their main loops until the
user button is pressed again which toggles RUN. The CPLD state
machine moves to the STOP state where it also asserts LastData.
The GPIF state machine sees LastData asserted so then moves to
its SIGNAL state where an interrupt into the FX3 CPU is generated
which results in tidying up the final transfer.

The Verilog code for the CPLD state machine is shown in
Figure 11.5. Note that I edited white space out of the listing so that it
would fit better on the page; see the actual Xilinx project for the real
code.

Figure 11.5 Verilog code for the CPLD counter and state
machine

Code in top.v
module top(
inout [12:0] CTRL, [31:0] DQ,
input PCLK,
inout [7:0] User, I2C_SCL, I2C_SDA,
input [7:0] Button, GPIO45_n, SPI_SCK, SPI_SSN, RX_MOSI,
output [7:0] LED, TX_MISO, FlashCS_n, INT, TP_2
);

// Need to assign inputs else they get optimized away
assign TP_2 = RX_MOSI & SPI_SCK & SPI_SSN & I2C_SCL & I2C_SDA;

// Assign fixed outputs not used in this example
assign FlashCS_n = 1'b1;
assign INT = 1'b0;
assign TX_MISO = 1'bZ;

// Include the Counter
FifoMasterCounter Counter (
.PCLK(PCLK),
.RESET(CTRL[10]),
.DMA0_Ready(CTRL[4]),
.DMA0_Watermark(CTRL[5]),
.PushButton(GPIO45_n),
.WR(CTRL[0]),
.LastData(CTRL[2]),
.DQ(DQ),
.LED(LED),
.User(User)
);

endmodule

Code in FifoMasterCounter.v
module FifoMasterCounter(
input PCLK, RESET, DMA0_Ready, DMA0_Watermark, PushButton,
output WR, LastData, [31:0] DQ, [7:0] LED, [7:0] User
);

// Define our counter which will provide data
reg [31:0] Counter;
// Drive the bus only while writing, otherwise leave it tri-stated
assign DQ = WR ? Counter : 32'hzzzzzzzz;
// Display data movement during transfers
assign LED = ~DQ[31:24];

// Generate a RUN signal from the PushButton; PushButton presses toggle RUN
// Note that this creates a different clock domain but this is OK
reg RUN;
always @ (negedge PushButton or posedge RESET) begin
if (RESET) RUN<=0; else RUN <= ~RUN;
end

// Define a State Machine for CPLD as FIFO Master, use one hot encoding
reg [4:0] CurrentState, NextState;
parameter IDLE = 5'b00001;
parameter WAIT4DMA = 5'b00010;
parameter WRITE = 5'b00100;
parameter PAUSE = 5'b01000;
parameter STOP = 5'b10000;
// Display internal variables on User port for debug
assign User = { CurrentState, RUN };
// Output signals are dependent upon the state machine
assign WR = (CurrentState == WRITE);
assign LastData = (CurrentState == STOP);

always @ (posedge PCLK or posedge RESET) begin
if (RESET) begin
CurrentState <= IDLE;
Counter <= 0;
end
else begin
CurrentState <= NextState;
if (WR) Counter <= Counter + 1;
end
end

// Calculate next state using combinational logic
always @ (*) begin
// Default is to stay in the current state
NextState = CurrentState;
case (CurrentState)
IDLE: if (RUN) NextState = WAIT4DMA; //else NextState = IDLE;
WAIT4DMA: if (DMA0_Ready) NextState = WRITE; else
if (~RUN) NextState = IDLE; // else NextState = WAIT4DMA;
WRITE: if (~RUN) NextState = STOP; else
if (DMA0_Watermark) NextState = PAUSE; // else NextState = WRITE;
PAUSE: if (~DMA0_Ready) NextState = WAIT4DMA; // else NextState = PAUSE;
STOP: if (!RUN) NextState = IDLE;
default: NextState = IDLE; // Should never get here
endcase
end

endmodule
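
If you would like to step through the CPLD state machine without a simulator, its next-state logic translates directly into C. This is a host-side model written for illustration and is not part of the CPLD project; the enum and function names are my own. The transitions follow the case statement in FifoMasterCounter.v.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ST_IDLE, ST_WAIT4DMA, ST_WRITE, ST_PAUSE, ST_STOP } cpld_state_t;

/* Next state of the CPLD FIFO master given the current inputs */
cpld_state_t next_state(cpld_state_t s, bool run, bool dma_ready,
                        bool dma_watermark)
{
    assert((int)s <= (int)ST_STOP);
    switch (s) {
    case ST_IDLE:      /* wait for the user button to set RUN */
        return run ? ST_WAIT4DMA : ST_IDLE;
    case ST_WAIT4DMA:  /* wait for a DMA buffer, or give up if RUN drops */
        if (dma_ready) return ST_WRITE;
        return run ? ST_WAIT4DMA : ST_IDLE;
    case ST_WRITE:     /* stream data until RUN drops or watermark warns */
        if (!run) return ST_STOP;
        return dma_watermark ? ST_PAUSE : ST_WRITE;
    case ST_PAUSE:     /* wait for DMA_Ready to drop before re-arming */
        return dma_ready ? ST_PAUSE : ST_WAIT4DMA;
    case ST_STOP:      /* LastData is asserted here */
        return run ? ST_STOP : ST_IDLE;
    }
    return ST_IDLE;    /* should never get here */
}

/* Output decoding: WR is asserted only in WRITE, LastData only in STOP */
bool wr_asserted(cpld_state_t s)       { return s == ST_WRITE; }
bool lastdata_asserted(cpld_state_t s) { return s == ST_STOP; }
```

Calling next_state once per simulated PCLK edge with the flag values from the timing diagrams reproduces the WAIT4DMA, WRITE, PAUSE loop described above.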

Figure 11.6 shows the DMA initialization which is set up in
MANUAL mode and includes a callback routine that monitors all data
transfers. Each transfer must be monitored since the FX3 does not
know when the CPLD will stop sending data so it must manually
commit every packet to the USB socket so that it can wrap up the
last partially filled DMA buffer. The overhead in checking this is
small compared with the 40 µs it takes to fill a 16 KB buffer. The
extra step of setting up the DMA_Watermark is highlighted in red.

Figure 11.6 DMA initialization for Slave FIFO Read example


#include "SlaveFifoRead.h" // File generated by GPIF Designer
char *CyFxGpifConfigName = { "SlaveFifoRead" };

void GpifToUsbDmaCallback(CyU3PDmaChannel *chHandle, CyU3PDmaCbType_t type,
                          CyU3PDmaCBInput_t *input)
{
    CyU3PReturnStatus_t Status = CY_U3P_SUCCESS;
    glCounter[0]++;
    if (type == CY_U3P_DMA_CB_CONS_SUSP) glChannelSuspended = CyTrue;
    if (type == CY_U3P_DMA_CB_CONS_EVENT) glCounter[1]++;
    if (type == CY_U3P_DMA_CB_PROD_EVENT)
    {
        glCounter[2]++;
        // MANUAL channel: pass each filled or partially filled buffer on to USB
        Status = CyU3PDmaChannelCommitBuffer(chHandle, input->buffer_p.count, 0);
        if (Status != CY_U3P_SUCCESS) DebugPrint(4,
            "CyU3PDmaChannelCommitBuffer failed, Error code = %d\n", Status);
    }
}

CyU3PReturnStatus_t StartGPIF(void)
{
    CyU3PReturnStatus_t Status;
    Status = CyU3PGpifLoad(&CyFxGpifConfig);
    CyU3PDebugPrint(7, "\nUsing GPIF:%s", CyFxGpifConfigName);
    CheckStatus("GpifLoad", Status);
    // Extra step: set DMA_Watermark to 5 for thread 0 / PIB socket 0
    Status = CyU3PGpifSocketConfigure(0, CY_U3P_PIB_SOCKET_0, 5, CyFalse, 1);
    CheckStatus("Set DMA_Watermark", Status);
    Status = CyU3PGpifSMStart(0, 0); //START, ALPHA_START);
    return Status;
}

void StartApplication(void)
// USB has been enumerated, time to start the application running
{
    CyU3PEpConfig_t epConfig;
    CyU3PDmaChannelConfig_t dmaConfig;
    CyU3PReturnStatus_t Status;
    CyU3PPibClock_t pibClock;

    // Start GPIF clocks, they need to be running before we attach a DMA channel to GPIF
    pibClock.clkDiv = 4;
    pibClock.clkSrc = CY_U3P_SYS_CLK; // 400/4 = 100MHz
    pibClock.isHalfDiv = 0;
    pibClock.isDllEnable = CyFalse;
    Status = CyU3PPibInit(CyTrue, &pibClock);
    CheckStatus("Start GPIF Clock", Status);

    CyU3PUSBSpeed_t usbSpeed = CyU3PUsbGetSpeed();

    // Display the enumerated device bus speed
    CyU3PDebugPrint(4, "\n@StartApplication, running at %sSpeed",
                    BusSpeed[usbSpeed]);
    // Based on the Bus Speed configure the endpoint packet size
    CyU3PMemSet((uint8_t *)&epConfig, 0, sizeof(epConfig));
    epConfig.enable = CyTrue;
    epConfig.epType = CY_U3P_USB_EP_BULK;
    epConfig.burstLen = (usbSpeed == CY_U3P_SUPER_SPEED) ?
                        (ENDPOINT_BURST_LENGTH) : 1;
    epConfig.pcktSize = EpSize[usbSpeed];

    // Setup and flush the Consumer endpoint
    Status = CyU3PSetEpConfig(CONSUMER_ENDPOINT, &epConfig);
    CheckStatus("CyU3PSetEpConfig_Enable", Status);

    // Create a MANUAL channel since I need to look for the last packet
    CyU3PMemSet((uint8_t *)&dmaConfig, 0, sizeof(dmaConfig));
    dmaConfig.size = (EpSize[usbSpeed] * ENDPOINT_BURST_LENGTH);
    dmaConfig.count = 4;
    dmaConfig.prodSckId = GPIF_PRODUCER_SOCKET;
    dmaConfig.consSckId = CONSUMER_ENDPOINT_SOCKET;
    dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
    dmaConfig.notification = CY_U3P_DMA_CB_CONS_SUSP |
                             CY_U3P_DMA_CB_CONS_EVENT |
                             CY_U3P_DMA_CB_PROD_EVENT;
    dmaConfig.cb = GpifToUsbDmaCallback;
    Status = CyU3PDmaChannelCreate(&glGPIF2USB_Handle,
                 CY_U3P_DMA_TYPE_MANUAL, &dmaConfig);
    CheckStatus("DmaChannelCreate", Status);

    Status = CyU3PUsbFlushEp(CONSUMER_ENDPOINT);
    CheckStatus("CyU3PUsbFlushEp", Status);
    // Start the DMA Channel with transfer size to Infinite
    Status = CyU3PDmaChannelSetXfer(&glGPIF2USB_Handle, 0);
    CheckStatus("DmaChannelStart", Status);
    // Load, configure and start the GPIF state machine
    Status = StartGPIF();
    CheckStatus("GpifStart", Status);

    // OK, Application can now run
    glIsApplicationActive = CyTrue;
}

First load the SlaveFIFO.xsvf code into the CPLD (refer to the
Programming the CPLD Chapter). Then set Switches 1 through 7 to
OFF and set Switch 8 to ON.

Load and run GPIF_Example4.img using the USB Control
Center. We are now ready to test the example code so run the
CollectData host application with the time limit above 30 seconds
and click START. Since the CPLD is the master I wanted to let it
start the data transfer; you do this by pressing the User Button on
the SuperSpeed Explorer board. The LEDs on the CPLD board are
connected to the top 8 bits of the counter and these now show an up
counter. Wait a few seconds (remember we are transferring about
1GB of data every 3 seconds to the PC) then press the user button
again to terminate the transfer. If you now use CheckData on
CollectedData.bin you will discover that there are no steps indicating
missing data as was found in the Chapter 9 examples; this data
transfer is a little slower but now it is reliable!

The second example is a Slave FIFO Write and Figure 11.8
shows the interface signals which use USB as the data source and
the CPLD as the data sink. Note that I chose to use Socket 1 for this
example since this makes the next example simpler. As in the
previous example, the FIFO_Empty signal is created using
DMA1_Ready and DMA1_Watermark; one difference however is
that for a RD cycle the signal delays are 1 clock less so I set my
Watermark to 4. Since the Watermark is checking on the space
remaining in the DMA buffer it also asserts on the last, short
packet sent from USB.

I don't have anywhere to save the data in the CPLD so I just
drop it on the floor (BB in the figure means Byte Bucket, it's where all
unwanted data is placed). This means that the CPLD is an infinite
data sink and will not be the cause of any reduction in the data rate.
You can, of course, change this by using the PCLK command to
reduce the data sink rate of the CPLD.

Figure 11.8 Interface signals for Slave FIFO Write

The GPIF state machine and CPLD state machine are
shown in Figure 11.9; as expected they are similar to Figure 11.4
so I don't need to go into detail.

Figure 11.9 GPIF and CPLD State Machines for a Slave
FIFO Write

The one difference is that the FX3 asserts LastData (which is a
GPIO pin) when the data transfer from USB has completed and this
moves the CPLD state machine to the STOP state. In this example
the CPLD doesn't actually do anything with the LastData signal but
in later examples it will. The Verilog code that implements the CPLD
state machine is shown in Figure 11.10. There is no counter since
this Slave FIFO Write case just discards the input data. A later
example will save it.

Figure 11.10 Verilog code for the CPLD

Code in top.v
module top(
inout [12:0] CTRL, [31:0] DQ,
input PCLK,
inout [7:0] User, I2C_SCL, I2C_SDA,
input [7:0] Button, GPIO45_n, SPI_SCK, SPI_SSN, RX_MOSI,
output [7:0] LED, TX_MISO, FlashCS_n, INT, TP_2
);
// Need to assign inputs else they get optimized away
assign TP_2 = RX_MOSI & SPI_SCK & SPI_SSN & I2C_SCL & I2C_SDA;
// Assign fixed outputs not used in this example
assign FlashCS_n = 1'b1;
assign INT = 1'b0;
assign TX_MISO = 1'bZ;
// Include the Counter
FifoMasterCounter Counter (
.PCLK(PCLK),
.RESET(CTRL[10]),
.DMA1_Ready(CTRL[6]),
.DMA1_Watermark(CTRL[7]),
.PushButton(GPIO45_n),
.RD(CTRL[1]),
.LastData(CTRL[3]),
.DQ(DQ),
.LED(LED),
.User(User)
);

endmodule

Code in FifoMasterCounter.v
module FifoMasterCounter(
input PCLK, RESET, DMA1_Ready, DMA1_Watermark, PushButton, LastData, [31:0]
DQ,
output RD, [7:0] LED, [7:0] User
);
// Define our counter which will provide data
reg [31:0] Counter;
// Display data movement during transfers
assign LED = ~DQ[31:24];
// Generate a RUN signal from the PushButton; PushButton presses toggle RUN
reg RUN;
always @ (negedge PushButton or posedge RESET) begin
if (RESET) RUN<=0; else RUN <= ~RUN;
end
// Define a State Machine for CPLD as FIFO Master, use one hot encoding
reg [4:0] CurrentState, NextState;
parameter IDLE = 5'b00001;
parameter WAIT4DMA = 5'b00010;
parameter READ = 5'b00100;
parameter PAUSE = 5'b01000;
parameter STOP = 5'b10000;
// Display internal variables on User port for debug
assign User = { CurrentState, RUN };
// Output signals are dependent upon the state machine
assign RD = (CurrentState == READ);
always @ (posedge PCLK or posedge RESET) begin
if (RESET) begin
CurrentState <= IDLE;
Counter <= 0;
end
else begin
CurrentState <= NextState;
end
end
// Calculate next state using combinational logic
always @ (*) begin
// Default is to stay in the current state
NextState = CurrentState;
case (CurrentState)
IDLE: if (RUN) NextState = WAIT4DMA; //else NextState = IDLE;
WAIT4DMA: if (DMA1_Ready) NextState = READ; else
if (~RUN) NextState = IDLE; // else NextState = WAIT4DMA;
READ: if (~RUN) NextState = STOP; else
if (DMA1_Watermark) NextState = PAUSE; // else NextState = READ;
PAUSE: if (~DMA1_Ready) NextState = WAIT4DMA; // else NextState = PAUSE;
STOP: if (!RUN) NextState = IDLE;
default: NextState = IDLE; // Should never get here
endcase
end
endmodule

The DMA initialization for this Slave FIFO write example is
almost the same as the read example except, of course, the DMA
channel is from USB to GPIF Socket1 so it is not shown here. You
can review the code in the GPIF_Example5 project folder.

There is no need to reprogram the CPLD at this stage since I
included the write code into the previously loaded image. I did this to
save you time and disruption in this Chapter. Now load and run
GPIF_Example5.img using the USB Control Center. The example
enumerates as a USB Streamer device so you can open it using the
USB Control Center and then send a file using the center panel as
described in the SuperSpeed Explorer Users Guide.

The last step in this Slave FIFO example set is to make the
data transfer bi-directional which means combining the two
examples; I had to extend the names of a few of the signals. Figure
11.11 shows the combined interface signals and Figure 11.12 shows
the combined state machines. The matching Verilog code is not
shown since it is a concatenation of the two previous examples but
note that the state machine had to be extended to 7 bits to allow for
the additional states. The code is available in the CPLD Code
examples folder for review. There are two sockets needed for the bi-
directional transfer and Figure 11.11 shows that I am using sockets 0
and 1 and the matching hardware threads 0 and 1. The combined
DMA initialization did not include anything special so I decided to
save space and not present a Figure. It too is available in the
examples directory as the GPIF_Example6 project for review.

Figure 11.11 Interface signals for Slave FIFO Read and Write
Figure 11.12 State machines for Slave FIFO Read and Write
The CPLD already includes the bi-directional code so there is
no need to reprogram it at this time. Load and run
GPIF_Example6.img using the USB Control Center. I tried to
demonstrate bi-directional data transfer by interleaving CPLD reads
and writes on alternate 32-bit data samples but the performance was
so poor due to the many additional states needed to turn the bus
around that I was embarrassed to include it as an example. The
typical use of this bi-directional interface is to move a lot of data
in one direction and then move a lot of data in the other direction.
This is how a hard disk drive or a multifunction device, such as a
printer/scanner, operates, and the performance in this mode is
exceptional; we shall see this in a moment.

You can use CollectData to read from the CPLD or the USB
Control Center to write to the CPLD as in the previous two
examples. The FX3 project can run both Slave FIFO Read and
Write cycles, but you need to select which one the CPLD is going to
do; I did this with Switch 6.

Third Party Products


For those of you who would like to see really fast bi-directional
data movement now, I have a solution. Figure 11.13 shows the
SuperSpeed Explorer board connected via a Cypress adapter board
to a Xilinx Spartan SP 601 board and another Explorer board
connected via a different Cypress adapter board to an Altera
Cyclone III board. These boards are FPGA development boards
used for serious product development. I used the Xilinx board to
develop many of the GPIF examples in this book but did not want
you, the reader, to have to buy one of these boards to learn about
the FX3. So the CPLD board was born. I added LEDs, switches
and other IO for the same price as a Xilinx or Altera adapter board.

Figure 11.13 Adapter boards allow the SuperSpeed Explorer
board to connect to commercial FPGA development boards
If you have a Xilinx or Altera FPGA board then you can run
this Slave FIFO example now – the pin assignment is different due to
the different hardware construction but full instructions are included
in Cypress Note AN65974.

FIFO Master Design.


The first three examples in this Chapter had the CPLD in
control of the data transfers while the FX3 was a slave device. We
now redo these examples with the FX3 as the master and the CPLD
as a slave. What we shall see is that the complexity of the CPLD
state machine moves into the GPIF state machine and the CPLD,
now being a slave device, has simpler state machines.

Figure 11.14 shows the interface signals for an FX3 Master
FIFO Read of a CPLD FIFO slave. Note that I am using socket (and
hardware thread) 2 for this example; this makes a later example
easier. The CPLD drives FIFO_Empty high when there is no data to
read and drives LastData high on the last data value read. Our
CPLD will not be toggling FIFO_Empty since it will always be ready
once started, but the GPIF state machine honors this signal since a
later example will toggle it.
Figure 11.14 FX3 is a master reading the CPLD

Figure 11.15 shows the GPIF state machine and CPLD state
machine. For this example I decided to let the GPIF state machine
count cycles to determine when the DMA buffer is full so that you
could see an alternative solution. Since DMA_BUFFER_SIZE is a
global FX3 project constant we should not get tripped up as
with the Slave FIFO case.
Figure 11.15 GPIF and CPLD state machines for Master FIFO Read

Following a RESET the CPLD drives FIFO_Empty high since
it is not ready. It waits for the user pushbutton to be pressed, which
sets RUN, then it moves to the WAIT4RD state and deasserts
FIFO_Empty (it now has data available) so the FX3 can proceed to
read the data. The GPIF state machine needs a DMA buffer to write
the data into and, since the app hasn't started yet, it waits. Once a
DMA buffer is available we move to the READ state where we stay
until the buffer is full. I count data writes to the DMA buffer to
determine when it is full. Once the DMA buffer is filled we move to
WAIT4BUFFER for another buffer to be available. GPIF Designer
truncates the transition names (I have asked Cypress not to!) so you
cannot see the actual transitions without opening GPIF Designer; if
you open the GPIF_Example8 project then you will see that it is our
old friend DMA2_Ready that causes a transition back to the READ
state.

When the user pushbutton is pressed again the RUN signal
deasserts and the CPLD moves to the STOP state where it drives
LastData. The GPIF state machine sees this and goes to the
SIGNAL state where it interrupts the FX3 CPU so that it can
COMMIT the last partial buffer.
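
The GPIF half of Figure 11.15 is drawn in GPIF Designer rather than written as code, but its behavior as described above can be sketched as a small C model. This is only an illustration under stated assumptions: the state names follow the text, while the type and function names, the 16 KB DMA_BUFFER_SIZE value, and the choice to count the word that arrives with LastData are hypothetical and are not taken from the generated GPIF configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only; the real DMA_BUFFER_SIZE is set in the FX3 project. */
#define DMA_BUFFER_SIZE  16384u
#define WORDS_PER_BUFFER (DMA_BUFFER_SIZE / 4u)   /* the GPIF bus is 32 bits wide */

typedef enum { GPIF_WAIT4BUFFER, GPIF_READ, GPIF_SIGNAL } gpif_state_t;

typedef struct {
    gpif_state_t state;
    uint32_t     words;   /* data writes counted into the current DMA buffer */
} gpif_master_read_t;

/* Advance the model by one PCLK cycle using the sampled input signals. */
static void gpif_master_read_step(gpif_master_read_t *m, bool dma_ready,
                                  bool fifo_empty, bool last_data)
{
    switch (m->state) {
    case GPIF_WAIT4BUFFER:            /* wait for an empty DMA buffer */
        if (dma_ready) {
            m->words = 0;
            m->state = GPIF_READ;
        }
        break;
    case GPIF_READ:
        if (fifo_empty)               /* slave has no data: do not read */
            break;
        m->words++;                   /* one 32-bit word written to the buffer */
        if (last_data)
            m->state = GPIF_SIGNAL;   /* CPU must commit the partial buffer */
        else if (m->words == WORDS_PER_BUFFER)
            m->state = GPIF_WAIT4BUFFER;
        break;
    case GPIF_SIGNAL:                 /* model stays here until restarted */
        break;
    }
}
```

Stepping the model with dma_ready high and fifo_empty low for WORDS_PER_BUFFER cycles returns it to GPIF_WAIT4BUFFER, which is the count-based "buffer full" detection used in this example.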

The FX3 code is the same as the Slave FIFO Read example;
the only difference is that a different GPIF state machine is
included. So neither the FX3 firmware nor the PC knows that the
GPIF interface is now operating as a master.

The CPLD code is shown in Figure 11.16; the CPLD is now a
FIFO slave so the code is a little simpler, although this is not evident
from the figure due to the infrastructure involved in the approach.
The subtle difference is the change in direction for many signals.
For larger projects the reduction in complexity can be dramatic. I
edited the code to fit in the figure.

Figure 11.16 Verilog code for the CPLD

Code in FifoMasterCounter.v
module top(
    inout  [12:0] CTRL,
    inout  [31:0] DQ,
    input         PCLK,
    inout  [7:0]  User,
    inout         I2C_SCL, I2C_SDA,
    input  [7:0]  Button,
    input         GPIO45_n, SPI_SCK, SPI_SSN, RX_MOSI,
    output [7:0]  LED,
    output        TX_MISO, FlashCS_n, INT, TP_2
);
    // Need to assign inputs else they get optimized away
    assign TP_2 = RX_MOSI & SPI_SCK & SPI_SSN & I2C_SCL & I2C_SDA;
    // Assign fixed outputs not used in this example
    assign FlashCS_n = 1'b1;
    assign INT = 1'b0;
    assign TX_MISO = 1'bZ;
    // Include the Counter
    FifoSlaveCounter Counter (
        .PCLK(PCLK),
        .RESET(CTRL[10]),
        .PushButton(GPIO45_n),
        .RD(CTRL[1]),
        .LastData(CTRL[3]),
        .DQ(DQ),
        .LED(LED),
        .FIFO_Empty(CTRL[6]),
        .User(User)
    );
endmodule

Code in FifoSlaveCounter.v
module FifoSlaveCounter(
    input         PCLK, RESET, PushButton, RD,
    output        LastData, FIFO_Empty,
    output [31:0] DQ,
    output [7:0]  LED,
    output [7:0]  User
);
    // Define our counter which will provide data
    reg [31:0] Counter;
    assign DQ = Counter;
    // Display data movement during transfers
    assign LED = ~DQ[31:24];
    // Generate a RUN signal from the PushButton; PushButton presses toggle RUN
    reg RUN;
    always @ (negedge PushButton or posedge RESET) begin
        if (RESET) RUN <= 0; else RUN <= ~RUN;
    end
    // Define a State Machine for CPLD as FIFO Slave, use one hot encoding
    reg [4:0] CurrentState, NextState;
    parameter IDLE    = 5'b00001;
    parameter WAIT4RD = 5'b00010;
    parameter READ    = 5'b01000;
    parameter STOP    = 5'b10000;
    // Display internal variables on User port for debug
    assign User = { CurrentState, RUN };
    // Output signals are dependent upon the state machine
    assign FIFO_Empty = (CurrentState == IDLE);
    assign LastData = (CurrentState == STOP);

    always @ (posedge PCLK or posedge RESET) begin
        if (RESET) begin
            CurrentState <= IDLE;
            Counter <= 0;
        end
        else begin
            CurrentState <= NextState;
            Counter <= Counter + 1;
        end
    end
    // Calculate next state using combinational logic
    always @ (*) begin
        // Default is to stay in the current state
        NextState = CurrentState;
        case (CurrentState)
            IDLE:    if (RUN) NextState = WAIT4RD; // else NextState = IDLE;
            WAIT4RD: if (RD) NextState = READ; else
                     if (~RUN) NextState = IDLE; // else NextState = WAIT4RD;
            READ:    if (!RD) NextState = WAIT4RD; else
                     if (!RUN) NextState = STOP; // else NextState = READ;
            STOP:    if (!RUN) NextState = IDLE;
            default: NextState = IDLE; // Should never get here
        endcase
    end

endmodule

We do need, however, to reprogram the CPLD so locate
FIFO_SLAVE.xsvf and copy it into the CPLD. Load and run
GPIF_Example8.img using the USB Control Center. We are now
ready to test the example code so run the CollectData host
application with the time limit above 30 seconds and click START.
Now press the user button on the SuperSpeed Explorer board to
start the transfer. I again copy the high counter bits to the LEDs so
that you can readily see that data is being collected. Press the user
button again after a few seconds to terminate the transfer. The logic
traces looked just like the slave example and, since you cannot
determine from the trace which side of the interface is driving RD, I
decided not to include them as a figure.
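
If you want to check the CPLD transitions of Figure 11.16 without hardware, the combinational next-state logic can be mirrored in C and exercised on a PC. The sketch below assumes the transitions exactly as listed in the Verilog case statement; the enum and function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* One-hot state values copied from the Verilog parameters. */
typedef enum {
    CPLD_IDLE    = 0x01,
    CPLD_WAIT4RD = 0x02,
    CPLD_READ    = 0x08,
    CPLD_STOP    = 0x10
} cpld_state_t;

/* Software mirror of the "always @ (*)" next-state block. */
static cpld_state_t cpld_next_state(cpld_state_t current, bool run, bool rd)
{
    switch (current) {
    case CPLD_IDLE:
        return run ? CPLD_WAIT4RD : CPLD_IDLE;
    case CPLD_WAIT4RD:
        if (rd)
            return CPLD_READ;
        return run ? CPLD_WAIT4RD : CPLD_IDLE;
    case CPLD_READ:
        if (!rd)
            return CPLD_WAIT4RD;
        return run ? CPLD_READ : CPLD_STOP;
    case CPLD_STOP:
        return run ? CPLD_STOP : CPLD_IDLE;
    default:
        return CPLD_IDLE;   /* should never get here */
    }
}

/* Output decoding: both outputs depend only on the current state. */
static bool cpld_fifo_empty(cpld_state_t s) { return s == CPLD_IDLE; }
static bool cpld_last_data(cpld_state_t s)  { return s == CPLD_STOP; }
```

For example, cpld_next_state(CPLD_READ, false, true) yields CPLD_STOP, the state in which LastData is driven for the final word.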

Master FIFO write


Figure 11.17 shows the interface signals for an FX3 acting as a
FIFO master implementing a write. Data is sent via USB to socket
(and hardware thread) 3 and the GPIF state machine waits for
FIFO_Full to be low so that it can start writing data. Even though our
CPLD will not drive FIFO_Full high (it drops all of the data on the
floor so is an infinite data sink) I include logic which stops FX3 writes
if the line is driven high since we will need this for the next section.

Figure 11.17 Interface signals for Master FIFO write

Figure 11.18 shows the GPIF state machine and CPLD
state machine for this Master FIFO write example.
Figure 11.18 GPIF and CPLD state machines for Master FIFO write
We have the same issue with GPIF Designer in that the whole
equation for the state transition is not shown; we again count cycles
to determine when the DMA buffer is empty then wait on
DMA3_Ready for more data to become available. When USB stops
sending data, as detected by a short packet, the FX3 COMMITs
the final partial buffer then drives LastData, a GPIO line, to signal to
the CPLD that it can stop. The FX3 code is the same as the Slave
FIFO write example and the Verilog code for this Master FIFO write
example is almost identical to Figure 11.16 so I decided not to make
it into another figure; it is available in the CPLD projects folder.
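
The two tests mentioned above, "short packet" and "how many cycles empty this buffer", come down to simple arithmetic. The following is a minimal sketch, assuming a SuperSpeed bulk endpoint (1024-byte maximum packet size) and the 32-bit GPIF bus used in this Chapter; the function names are hypothetical and are not part of the FX3 SDK.

```c
#include <stdbool.h>
#include <stdint.h>

#define SS_BULK_MAX_PACKET 1024u   /* SuperSpeed bulk maximum packet size in bytes */
#define GPIF_BUS_BYTES     4u      /* 32-bit GPIF data bus */

/* A packet shorter than the maximum packet size (including a
 * zero-length packet) marks the end of a USB transfer. */
static bool is_short_packet(uint32_t bytes_in_packet)
{
    return bytes_in_packet < SS_BULK_MAX_PACKET;
}

/* Number of GPIF write cycles needed to empty a DMA buffer holding
 * byte_count bytes; a partial last word still costs one cycle. */
static uint32_t gpif_cycles_for_buffer(uint32_t byte_count)
{
    return (byte_count + GPIF_BUS_BYTES - 1u) / GPIF_BUS_BYTES;
}
```

A full 16384-byte buffer (an assumed size) needs 4096 cycles, while a final partial buffer of 10 bytes needs only 3, which is why the last buffer has to be handled separately from the fixed cycle count.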

Load and run GPIF_Example9.img using the USB Control
Center and open the Streamer device that is presented following
enumeration or use the SendFile application to send a file to the
CPLD. The GPIF state machine waits until the CPLD is ready and
you indicate this by pressing the user pushbutton. The file data is
copied to the LEDs so that the data transfer is obvious. When the
file transfer completes you can push the user pushbutton again to
prepare the CPLD for the next data transfer.

Combined master read and write


Figure 11.19 shows the combined master read and write
interface signals. Again this was a copy-and-paste with some signal
name extensions. The diagram looks almost the same as Figure
11.12; in fact, from the host computer's point of view the operation
is the same.

Figure 11.19 Combined Master FIFO Read and Write interface


The inner implementation details of the GPIF interface are not
visible to the host computer. The GPIF state machine and matching
Verilog code are not shown but are, of course, available within the
examples folder. You can load and run GPIF_Example10.img to test
the combined read and write functions if desired. The results will be
the same as the individual examples previously implemented.

Master FX3 FIFO connected to a Slave FX3 FIFO


We now move on to the more interesting example of the FX3
master FIFO that we have just created being connected to the FX3
slave FIFO that we started this Chapter with. Figure 11.20 shows the
interface signals needed to connect a Master FIFO to a Slave FIFO;
I have expanded out the FIFO_Full and FIFO_Empty signals that we
derived from the two DMA flags while creating the Slave FIFO
examples.

Figure 11.20 Connecting Master FIFO FX3 to a Slave FIFO FX3

Figure 11.21 shows two SuperSpeed Explorer boards plugged
together such that their GPIF interfaces are connected. The
interposing connectors (SAMTEC Part Number SSW-120-03-G-D)
serve two functions: they physically keep the two boards spaced
apart so that the USB 3.0 connectors do not interfere and, more
importantly, they do not connect the voltage supplies or the low
speed peripherals of the two boards.

Figure 11.21 Interconnecting two SuperSpeed Explorer boards


Figure 11.22 shows the pins that should be clipped on each of
the J6 and J7 connectors [on the interposing connectors, NOT
the Explorer board]. The grounds of both boards are connected
together and, of course, the CPLD board should be disconnected.

Figure 11.22 Clip the white pins on the interposing connectors


You can now connect each Explorer board, using two USB 3.0
cables, to different PCs. You could use the same host computer but
the demonstration is not as dramatic. Locate and load
MasterFIFO_Example.img onto one of the Explorer boards and
locate and load SlaveFIFO_Example.img on the other. I also had
two terminal emulation programs monitoring each FX3 in my setup.
On the master host computer run CollectData with a time of say 60
seconds. On the other host computer run SendData with a file size
of 100 GB. The transfer should take about X seconds depending
upon the USB 3.0 performance of each computer. If you put a logic
analyzer on the GPIF interface you will see it running at 100 MHz
most of the time and it is not the bottleneck in the system.
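
The 100 MHz figure can be turned into a peak rate with a little arithmetic. The sketch below is only bus arithmetic under the assumption of one 32-bit word per PCLK cycle; it gives an upper bound for the GPIF link, not a measured transfer time, and the function names are hypothetical.

```c
#include <stdint.h>

/* Peak GPIF throughput, assuming one bus-wide word moves every PCLK cycle. */
static uint64_t gpif_peak_bytes_per_second(uint32_t pclk_hz, uint32_t bus_width_bits)
{
    return (uint64_t)pclk_hz * (bus_width_bits / 8u);
}

/* Shortest possible time, in whole milliseconds (rounded down), to move
 * 'bytes' at a given rate. Real transfers take longer because of USB
 * protocol overhead and host performance. */
static uint64_t min_transfer_ms(uint64_t bytes, uint64_t bytes_per_second)
{
    return (bytes * 1000u) / bytes_per_second;
}
```

With a 100 MHz clock and a 32-bit bus the peak is 400,000,000 bytes per second, so 1 GB (10^9 bytes) cannot cross the GPIF interface in less than 2500 ms.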

To demonstrate this to your colleagues or boss it would be a
good idea to copy the two program images into the on-board I2C
EEPROM of each Explorer board so that there is less setup and less
to go wrong in the demo. The Programming the I2C EEPROM
Chapter explains step-by-step how to do this.
This is the end of the ‘book’
We have come a LONG way since Chapter 1, especially if you
have worked through some of the examples. You probably started
the book with a little fear of what a SuperSpeed Device Design
would entail, but, by now, you should be feeling much more
confident. There is a lot you have to know to be successful and I
hope that I have partitioned the design problem into manageable
sections such that you could understand the issues and know where
to go for help. You may have to read some chapters again. You
may want to redo some of the examples and improve upon them. If
Verilog was new to you and you tried it then you probably enjoyed it
– writing lines of code that get “compiled” into a hardware schematic
is GREAT; you can now make better GPIF interfacing decisions
since you now know that some things would be better handled by a
CPLD/FPGA rather than inside the GPIF block.

If I have done my job correctly then you will even be finding
bugs in my code and I’m sure that you will email me to let me know!
[email protected] is the best email to use since I may need help
answering the hard questions :-)

There were some topics that I did not cover, such as
Designing Your Own Hardware. Cypress have an
Applications Note on this, AN70707, to get you started. They also
have applications engineers world-wide who will review your
schematics and layout BEFORE you build 1000 boards. They have
found it easier to fix problems at this stage rather than when you are
two days from product ship.

A BIG topic that I did not cover was the use of an FX3 as an
attached processor. You can connect a “main” CPU directly onto the
GPIF interface and use the FX3 as an intelligent sub-system. This is
a LARGE topic and the schedule did not allow for this to be included
in this First Edition; it will be in the Second Edition.

What else would you like to see? Let me know at
[email protected]
Happy developing, John
Load and Run

The USB Control Center is used to “Load and Run” a
program.
The Eclipse toolset builds FX3 object code and creates an
.IMG file. An .IMG file is basically a downloadable memory
image and its format is known by the USB Control Center.
Following a RESET the SuperSpeed Explorer board will
enumerate as a BootLoader device (see top of Figure 1) and
if you select this and then choose FX3 from the Program menu
you have the choice of downloading this .IMG file into FX3
RAM or to an attached I2C EEPROM or SPI FLASH
component. Selecting RAM will transfer the memory image
to the FX3 and give it control, thus the Loaded program will
Run.
Figure 1 Load an IMG file into FX3 RAM to Run it
Programming the CPLD

There are two methods of programming the CPLD.


The preferred method is to use the PC utility ProgCPLD
and drag and drop the appropriate .xsvf file onto the
ProgCPLD.exe icon. This will find the SuperSpeed Explorer
Board, load CPLDProgrammer.img into it, run this, then
download the xsvf file to it which will be copied to the CPLD.
Progress will be displayed in an attached Debug Console. If
the Explorer board is not immediately found then RESET it
(ProgCPLD searches for a BootLoader device).
You can also program the CPLD manually:
Load and Run CPLDProgrammer.img on the Explorer
board, it will enumerate as a BulkLoop device; select the
BulkLoop device then open interfaces until you get to
BulkOut Endpoint (0x01), shown in Figure 1 below; click
“Transfer FileOUT” and select the required xsvf file in the
dialog box.
Figure 1 CPLD Images can also be programmed
manually
Programming the CPLD takes about 3 seconds.
How the CPLD Programmer Works
Xilinx provide a portable C application that can be used
to program Xilinx programmable devices (CPLD, FPGA)
using a microcontroller. They describe how to download
this program and have instructions to help port the code to a
target microcontroller in their App Note XAPP058, Xilinx
In-System Programming Using an Embedded Microcontroller.
The application program is basically a translator – it reads
encoded JTAG TAP instructions from an xsvf file and toggles
the JTAG lines to implement a programming algorithm. The
only issue I had with porting was the printf statements, which
used %ld, which my DebugPrint could not handle. Glue
routines had to be written to link in my hardware with their
generic code. Since I was not using the I2S interface I used
the GPIO lines to drive the JTAG interface on the CPLD.
The resulting program was self-contained and ran correctly
with little effort.
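
The "glue" between such a translator and the hardware is essentially a set of routines that set TMS and TDI and pulse TCK. The sketch below shows the idea against a simulated pin structure so that it can run on a PC; every name in it is hypothetical, and it is neither the XAPP058 port layer nor the code from the Examples directory. On real hardware the structure writes would become GPIO writes.

```c
#include <stdint.h>

/* Simulated JTAG pins; on a real target these would be GPIO lines. */
typedef struct {
    int      tck, tms, tdi;     /* current pin levels */
    uint32_t rising_edges;      /* number of TCK rising edges generated */
    uint32_t tdi_capture;       /* TDI as sampled on the first 32 edges, LSB first */
} jtag_pins_t;

/* One TCK pulse: the target samples TMS and TDI on the rising edge. */
static void jtag_pulse_clock(jtag_pins_t *p)
{
    p->tck = 0;
    p->tck = 1;
    if (p->rising_edges < 32u)
        p->tdi_capture |= (uint32_t)p->tdi << p->rising_edges;
    p->rising_edges++;
    p->tck = 0;
}

/* Shift nbits (1..32) of data out on TDI, least significant bit first.
 * TMS is raised on the final bit, which is how a JTAG TAP leaves its
 * Shift-DR or Shift-IR state. */
static void jtag_shift_out(jtag_pins_t *p, uint32_t data, unsigned nbits)
{
    for (unsigned i = 0; i < nbits; i++) {
        p->tdi = (int)((data >> i) & 1u);
        p->tms = (i == nbits - 1u) ? 1 : 0;
        jtag_pulse_clock(p);
    }
}
```

Shifting the byte 0xA5 produces eight clock pulses and leaves TMS high after the last one, ready for the TAP to exit the shift state.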
I used the BulkLoop framework to get data from the PC
into a buffer where the translator program could access it.
Note that I don’t follow the BulkLoop rules and send data
back so I will only be sent one “chunk” of data – this is
128KB in my case and since the xsvf files were all about
67KB, I was fine. After programming the CPLD I reset the
FX3 so that it is ready to download the next program image.
The source code of this CPLD programmer is in the
Examples directory. If the FX3 system that you are building
has an FPGA then this application program will save the cost
and hassle of a Xilinx download cable and save the board
space for the connectors.
Developing your own CPLD Code
What I found when designing GPIF interfaces is that I
was always talking to some external logic that was also
running a state machine. The GPIF and external logic are so
intertwined that they need to be designed together. The
state machines that you draw using GPIF Designer make it
pretty straightforward to design reasonably complex logic,
then GPIF Designer “compiles” your diagrams to make a
list of constants that are loaded into the RAM of the GPIF
block. The low-level detail of what these constants do is not
exposed since you don't need to know this.
The external logic is designed using a hardware
description language (HDL) that looks similar to C. There
are two popular HDLs, VHDL and Verilog. I tried both and
preferred Verilog so all my examples are in Verilog. If you
already know VHDL then you will be able to easily read my
Verilog code but if you don't then the next chapter will
describe the basic concepts of Verilog to enable you to read
my code. If a helpful VHDL expert would like to translate my
examples then I will happily post them on the book website
for other readers to benefit from.
I chose a Xilinx CPLD and the tools are a free
(enormous) download from the Xilinx site at http://www.xilinx.com.
The links move around so it is safer to give you navigation
details: choose Support then Downloads then ISE Design
Tools. The latest Design Suite is Version 14.7 and it is a
HUGE 6GB download since it includes all the bells and
whistles needed to create code for all of Xilinx’s range of
products. We only need a small subset of this but you can’t
just download the CPLD tools. I used Version 13.4 to create
all my examples and also tried their oldest download version
(10.1 which is ‘only’ 2.2GB) and all of the examples worked
since I use so few features of these tools. You should also
download XAPP058 and XAPP503 from the CoolRunner II
Application Notes page since these explain how to install and
run the tools. Xilinx also have an ISE Tutorial (UG695)
and a Programmable Logic Design Overview
(UG500) that includes a chapter on Implementing CPLD
designs. I found their documentation excellent.
The compiled code is typically programmed into a
CPLD or FPGA using a Xilinx download cable but we won’t
be using a cable. Xilinx provides a portable utility that
describes how to build a loader using a microcontroller. I
ported this loader onto the FX3 and wrote a ProgCPLD PC
utility to control it; these are described in the Programming
the CPLD chapter.
The tool flow is shown in Figure 1; we need to create
two, or more, text files that the Xilinx toolchain will use to
create a binary file that can be loaded into some
programmable logic, the CPLD in our case.
Figure 1 Overview of Xilinx Toolchain

I used the standard module method to build my
examples. I created a ‘top’ module that defined all of the
signals on the CPLD board, then I created separate modules
for each example that are instantiated inside the top module
as shown in Figure 2; a new ‘top’ is created for each project.
I also created a UCF file that defined the pin locations and
their characteristics and this is shown in Figure 3; this file is
constant and does not change.
Figure 2 Top module with enclosed modules
Figure 3 User Constraints File defines our CPLD pins
# This file defines the pin assignments for the CPLD board
# First define the 32-bit GPIF bus
NET "DQ[0]" LOC = P50;
NET "DQ[1]" LOC = P49;
NET "DQ[2]" LOC = P46;
NET "DQ[3]" LOC = P43;
NET "DQ[4]" LOC = P41;
NET "DQ[5]" LOC = P39;
NET "DQ[6]" LOC = P36;
NET "DQ[7]" LOC = P34;
NET "DQ[8]" LOC = P32;
NET "DQ[9]" LOC = P28;
NET "DQ[10]" LOC = P23;
NET "DQ[11]" LOC = P11;
NET "DQ[12]" LOC = P9;
NET "DQ[13]" LOC = P7;
NET "DQ[14]" LOC = P4;
NET "DQ[15]" LOC = P2;
NET "DQ[16]" LOC = P52;
NET "DQ[17]" LOC = P53;
NET "DQ[18]" LOC = P54;
NET "DQ[19]" LOC = P55;
NET "DQ[20]" LOC = P56;
NET "DQ[21]" LOC = P58;
NET "DQ[22]" LOC = P59;
NET "DQ[23]" LOC = P60;
NET "DQ[24]" LOC = P61;
NET "DQ[25]" LOC = P63;
NET "DQ[26]" LOC = P92;
NET "DQ[27]" LOC = P93;
NET "DQ[28]" LOC = P94;
NET "DQ[29]" LOC = P95;
NET "DQ[30]" LOC = P96;
NET "DQ[31]" LOC = P97;
# Now the GPIF Control Signals
NET "CTRL[0]" LOC = P44 | PULLUP;
NET "CTRL[1]" LOC = P42 | PULLUP;
NET "CTRL[2]" LOC = P40 | PULLUP;
NET "CTRL[3]" LOC = P37 | PULLUP;
NET "CTRL[4]" LOC = P35 | PULLUP;
NET "CTRL[5]" LOC = P33 | PULLUP;
NET "CTRL[6]" LOC = P27 | PULLUP;
NET "CTRL[7]" LOC = P24 | PULLUP;
NET "CTRL[8]" LOC = P10 | PULLUP;
NET "CTRL[9]" LOC = P8 | PULLUP;
NET "CTRL[10]" LOC = P6 | PULLUP;
NET "CTRL[11]" LOC = P3 | PULLUP;
NET "CTRL[12]" LOC = P1 | PULLUP;
NET "INT" LOC = P99 | PULLUP;
NET "PCLK" LOC = P22; # This is up to 100 MHz
# Now the Low Speed IO
NET "I2C_SDA" LOC = P29;
NET "I2C_SCL" LOC = P30;
NET "RX_MOSI" LOC = P78; # FX3 can configure as UART or SPI
NET "TX_MISO" LOC = P77; # FX3 can configure as UART or SPI
NET "SPI_SCK" LOC = P76;
NET "SPI_SSN" LOC = P74; # Explorer board LED too
NET "FlashCS_n" LOC = P73;
NET "GPIO45_n" LOC = P79 | SCHMITT_TRIGGER; # User Pushbutton
NET "TP_2" LOC = P80; # Spare inout Signal
# User LEDs
NET "LED[0]" LOC = P81;
NET "LED[1]" LOC = P82;
NET "LED[2]" LOC = P85;
NET "LED[3]" LOC = P86;
NET "LED[4]" LOC = P87;
NET "LED[5]" LOC = P89;
NET "LED[6]" LOC = P90;
NET "LED[7]" LOC = P91;
# User Buttons (switches)
NET "Button[0]" LOC = P72 | PULLUP;
NET "Button[1]" LOC = P71 | PULLUP;
NET "Button[2]" LOC = P70 | PULLUP;
NET "Button[3]" LOC = P68 | PULLUP;
NET "Button[4]" LOC = P67 | PULLUP;
NET "Button[5]" LOC = P66 | PULLUP;
NET "Button[6]" LOC = P65 | PULLUP;
NET "Button[7]" LOC = P64 | PULLUP;
# Uncommitted User Port
NET "User[0]" LOC = P12;
NET "User[1]" LOC = P13;
NET "User[2]" LOC = P14;
NET "User[3]" LOC = P15;
NET "User[4]" LOC = P16;
NET "User[5]" LOC = P17;
NET "User[6]" LOC = P18;
NET "User[7]" LOC = P19;
After creating, or obtaining, a User Module, a
declaration for it is pasted into the ‘top’ module then you
“hook-up” the wires by defining the input, output and inout
connections. You then save this new ‘top’ in the CPLD
Projects folder under the example name.
Let’s look at an example where I include an I2C Slave
module and a Counter module to create an
I2C_Slave_Counter module. I copied the I2C_Slave code
from the Xilinx web site and since it is 4 pages long I decided
not to include the listing here; it is in the I2C_Slave project
folder. Xilinx have a lot of examples but I did have to extend
this one since their reference code only did 2 bits of input
and I needed 8. The Counter_Module is shown in Figure 9.3
and not repeated here to save space. The new piece is top
and this is shown in Figure 4 (edited a little to fit on the
page) so that you see how the connections are made.
It is just like wiring up hardware!
Figure 4 User Counter module and ‘top’ instantiating it
`timescale 1ns / 1ps
module top(
inout [12:0] CTRL, inout [31:0] DQ, input PCLK,
inout I2C_SCL, inout I2C_SDA,
input [7:0] Button, output [7:0] LED, inout [7:0] User,
input SPI_SCK, input SPI_SSN, input RX_MOSI, output TX_MISO, output FlashCS_n,
output INT, input GPIO45_n, output TP_2
);

// Need to assign all the inputs else they get optimized away
wire UnusedUser1 = User[0] & User[1] & User[2] & User[3];
wire UnusedUser2 = User[4] & User[5] & User[6] & User[7];
wire UnusedCtrl1 = CTRL[0] & CTRL[1] & CTRL[2] & CTRL[3] & CTRL[4] & CTRL[5];
wire UnusedCtrl2 = CTRL[6] & CTRL[7] & CTRL[8] & CTRL[9] & CTRL[11] & CTRL[12];
wire UnusedOther = RX_MOSI & SPI_SCK & SPI_SSN & GPIO45_n;
assign TP_2 = UnusedCtrl1 & UnusedCtrl2 & UnusedOther & UnusedUser1 &
UnusedUser2;
// Assign fixed outputs not used in this example
assign FlashCS_n = 1'b1;
assign INT = 1'b0;
assign TX_MISO = 1'bZ;
// Using CTRL[10] to RESET the CPLD
assign RESET = CTRL[10];
// Both modules output to the LEDs, use Button[7] to select which module has control
wire [7:0] I2C_LEDs;
wire [7:0] Counter_LEDs;
assign LED = Button[7] ? I2C_LEDs : Counter_LEDs;

// Include the Counter1 counter
Counter1_Module Counter (
.PCLK(PCLK),
.RESET(RESET),
.WR_n(CTRL[0]),
.DQ(DQ),
.LED(Counter_LEDs)
);

// Include the I2C Port Expander module
wire sda_out;
wire out_en;
wire ack_out;

// tri-state output vs. drive 1 for logic 1 (i2c specs pullup)
wire sda_in = I2C_SDA;
assign I2C_SDA = (ack_out || (out_en & ~sda_out)) ? 1'b0 : 1'bz;

i2c_module i2c_slave (
.scl(I2C_SCL),
.i2c_rst(RESET),
.sda_in(sda_in),
.gpio_input_pins(Button),
.ack_out(ack_out),
.out_en(out_en),
.sda_out(sda_out),
.gpio_output_pins(I2C_LEDs));

endmodule

The non-obvious point to note is that you need to refer
to all signal names else the tools will optimize your signals
away and therefore not assign the proper termination to
these input and output signals. You set the default
termination in the Process Properties menu: expand
Implement Design, right click on Fit and choose Process
Properties. My selections are shown in Figure 5; this is
where you also select LVCMOS33 as the IO voltage.
Figure 5 Selecting Process Properties for the CPLD
projects

Feel free to edit my current modules or create your
own, then “include” them (the proper term is instantiate)
in the top module and click Implement Design. The tool
creates a bit file (top.jed) which will be loaded into the CPLD
but my process requires an additional step.
Under the Tools menu choose iMPACT and this will
open in a new window. If the right pane does not open and
show top.jed then right-click in the pane and add it. Then
select the CPLD icon and you will be offered a choice of
operations. But first, under the Output menu, choose XSVF
File and choose a name such as projectname.xsvf.
Programming instructions will now be copied into this file
rather than using a (costly) Xilinx download cable. Choose
Erase, Program, Verify then use the Output menu again to
Close the file. Copy the .xsvf file into the
CPLD_Programming folder and now you have an image that
can be loaded into the CPLD as described in the
Programming the CPLD Reference Section.
Pretty straightforward. Have fun!
Introduction To Verilog

Yes, there was supposed to be a Verilog section here
but I was responsible for a faux pas with the book. The page
count of a book has to be set surprisingly early in the process
since it defines the width of the book spine and other
production parameters. The proof book was 300 pages and I
had to stay with this limit. I created more pages than this so
something had to go! Even after removing all the facing
pages from the Reference Section there was still more that
had to be removed. This section was the safest to remove
since this material is not unique to this book and is widely
available on the web; a google search for “Verilog Tutorial”
will get many hits.
My Verilog code is straightforward and so you should
be able to read my examples with understanding. I would
strongly recommend that you learn Verilog (or VHDL) since
this is a very valuable skill to have!
FX3 Lite (Boot) Firmware Library

Some FX3 applications may require a very low memory
footprint because of one or more of the following reasons:

The application is targeted at the FX3 parts with lower RAM availability.
The application wants to use most of the RAM for data buffering.
The application needs to co-exist in RAM alongside another application.

Such use cases are facilitated by the FX3 Lite (Boot)
firmware library. This is a scaled down version of the FX3
firmware library which does not make use of the RTOS. While this
library is mainly targeted at custom boot-loader applications, it can
also be used to implement complete firmware applications.
This firmware library provides a set of APIs and drivers
which support the following features, also shown in the Figure
below:

1. Full USB device (peripheral) mode support: USB 2.0 and 3.0
2. GPIO support (simple GPIOs only).
3. I2C, SPI and UART support.
4. GPIF-II and PMMC support for connection to external devices.
5. DMA support: Low level DMA access without any direct DMA
channel support.
A separate naming convention (CyFx3Boot) is used for APIs
in this library to distinguish them from the full RTOS based
firmware solution. As the emphasis is on low memory footprint,
the APIs provided are low-level calls that will require the user to do
most of the application implementation. Two application examples
are provided in the Cypress examples installation; they are
prefixed with Fx3BootApp .
As this library does not make use of any RTOS or threads, it
expects that the user will call the relevant APIs from the main
processing loop. The drivers for each module register interrupt
handlers for the corresponding interrupts, and provide callbacks to
notify the application about events of interest. These callbacks are
invoked from the ISR itself, and the user application will need to
defer their processing to the main loop as and when required.
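
The "record in the ISR, process in the main loop" pattern described above can be sketched in a few lines of C. This is a generic illustration only: the event names and both functions are hypothetical, no CyFx3Boot API is called, and on real hardware the read-and-clear step would need interrupts masked around it.

```c
#include <stdint.h>

/* Hypothetical event bits; a real application would define its own. */
enum {
    APP_EVT_USB_RESET = 1u << 0,
    APP_EVT_USB_SETUP = 1u << 1
};

/* Written from ISR context, read and cleared from the main loop. */
static volatile uint32_t app_pending_events;

/* Hypothetical callback, invoked from the ISR: do no real work here,
 * just remember that the event happened. */
static void app_event_callback(uint32_t event)
{
    app_pending_events |= event;
}

/* Called from the main processing loop: returns the events recorded
 * since the last call and clears exactly those bits. */
static uint32_t app_take_pending_events(void)
{
    uint32_t events = app_pending_events;
    app_pending_events &= ~events;
    return events;
}
```

The main loop would call app_take_pending_events() on each pass and handle whichever bits are set, keeping the time spent inside the interrupt handler to a minimum.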
USB API
Only the USB device (peripheral) mode of operation is
supported in this firmware library. The USB APIs provide full-
featured USB 2.0 and 3.0 device support.
This library also supports a seamless transition to a full FX3
library based application, without a USB re-connect. This feature
(no re-enumeration) facilitates use cases where the system
requires firmware loading through the USB host without the
overhead of multiple USB connections and driver binding.

GPIO APIs
The library supports selection of any FX3 IO as a simple
GPIO, configuring the IO pin and IO state get/set functionality.
Complex GPIOs and GPIO interrupts are not supported.

UART APIs
The UART APIs in the FX3 lite library support UART
transmit and receive operations in register and DMA modes. A
DebugPrint equivalent function is provided for logging as well.
UART interrupts are not enabled, and the user will need to poll for
events of interest.
I2C APIs
The I2C APIs in the FX3 lite library support I2C functionality
similar to that provided in the full FX3 firmware library, with the
exception of I2C interrupt support.
SPI APIs
The SPI APIs in the FX3 lite library support SPI functionality
similar to that provided in the full FX3 firmware library, with the
exception of SPI interrupt support.

GPIF-II APIs
The GPIF-II APIs in the FX3 lite library support GPIF-II
configuration and access functions similar to those provided in the
full FX3 firmware library. These APIs use the same structure
definitions that are used in the full library, so that the GPIF-II
designer generated configurations can be used directly.

PMMC APIs
The library also supports the PMMC (Pseudo MMC or MMC
Slave) mode of operation of the FX3’s P-Port block. The selection
between GPIF-II and PMMC mode has to be made prior to
initializing the PIB block.

DMA APIs
The FX3 Lite library provides a set of low level DMA
configuration and access functions that can be used to implement
complex DMA use cases. The APIs provided deal with the
configuration and access of DMA building blocks like descriptors
and sockets. No channel level APIs are provided in order to keep
the memory footprint low.

The firmware examples provided show how these low level
APIs can be used to implement AUTO and MANUAL DMA
channels.
Building an I2C Debug Console
When the FX3 UART lines are not available for a
Debug Console, a useful alternative on the
SuperSpeed Explorer board is to use the I2C channel. Both
the FX3 UART lines and the FX3 I2C lines are connected to
the CY7C65215 which is used to implement the integrated
debugger. Within this implementation the CY7C65215
supports two channels; one connects the UART or I2C to a
virtual com port driver on the host computer and the other
connects the FX3 JTAG to an OpenOCD driver (on Windows
only at this time). The CY7C65215 has more capability than
is used on the SuperSpeed Explorer board so if you are
designing your own hardware then this would be a good
choice for your debug console and you could add other
features.
The I2C Debug Console is based on the Cypress
module cyu3debug.c which is downloadable from the FX3
software development page. This code supports two
functions: the sending of ASCII data, via CyU3PDebugPrint,
to a byte sink (typically the UART) and the logging of binary
data into memory using CyU3PDebugLog. Most of the
complexity of the module is due to the binary logging function
and I will not be implementing this in the I2C Debug
Console. In Chapter 4 I added a ConsoleIn function and this
was independent of the CyU3PDebugPrint function since it
worked with the UART producer socket. Figure 1 shows the
basic functionality of the ASCII part of the Cypress-supplied
UART Debug Console.
The array of buffers is set up when DebugInit() is
called. The CyU3PDebugPrint function can be called from
any thread; note that it cannot be called from an ISR or timer
routine since some of the internal functions are blocking (an
internal check returns an error in this case). A DMA buffer is
taken from the array of buffers and the function basically
does an snprintf operation using the format string
and the supplied parameters to create a text string in the
buffer. The buffer is then committed to the UART consumer
socket that sends out the characters using DMA. It takes
some time to send out a long string of characters and
another CyU3PDebugPrint call may be made before that
data stream has been completed. This is not a problem
since another buffer will be used and these just stack up at
the UART consumer socket waiting to be processed. If all of
the buffers are in use then the function waits at the GetBuffer
step until a buffer becomes available. This wait at GetBuffer
is the blocking call that could hang your code (HINT: don’t
use DebugPrint from an ISR/Callback).
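The buffer handling described above can be modelled in a few lines of plain C. This is a simplified sketch and not the Cypress implementation: the pool size, buffer size and all function names are hypothetical, and where the real function would block at the GetBuffer step this model returns -1 so that the behavior is easy to observe.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define POOL_BUFFERS 4    /* hypothetical number of debug buffers */
#define BUFFER_SIZE  64   /* hypothetical size of each buffer */

typedef struct {
    char text[BUFFER_SIZE];
    int  in_use;
} print_buffer_t;

static print_buffer_t pool[POOL_BUFFERS];

/* Returns a free buffer index, or -1 where the real code would block. */
int pool_get_buffer(void)
{
    for (int i = 0; i < POOL_BUFFERS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

/* Called once the consumer has finished sending the buffer. */
void pool_release_buffer(int index)
{
    assert(index >= 0 && index < POOL_BUFFERS);
    pool[index].in_use = 0;
}

/* Formats a message into a pool buffer; returns its index or -1. */
int model_debug_print(const char *format, ...)
{
    int index = pool_get_buffer();
    if (index < 0) return -1;   /* every buffer is still waiting to be sent */

    va_list args;
    va_start(args, format);
    vsnprintf(pool[index].text, BUFFER_SIZE, format, args);
    va_end(args);
    return index;               /* the buffer is now "committed" */
}
```

With four buffers, four back-to-back calls succeed and a fifth finds the pool exhausted until the consumer releases a buffer. In an ISR there is nothing that can run to release one, which is why the blocking version must not be called from that context.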
Console In characters are collected in an ISR_Buffer by
a Callback routine that was registered with the UART_RX
ISR.
Figure 1 The UART Debug Console functionality.
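One common way to collect received characters in a callback and read them later from a thread is a small ring buffer. The sketch below illustrates that idea only; it is not the code from the Chapter 4 project, and the buffer size and function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define ISR_BUFFER_SIZE 16u   /* hypothetical size; must be a power of two */

static volatile uint8_t  isr_buffer[ISR_BUFFER_SIZE];
static volatile uint32_t head = 0;   /* advanced only by the RX callback */
static volatile uint32_t tail = 0;   /* advanced only by the reader */

/* Called from the receive callback. Returns 0 if the character was dropped. */
int console_in_put(uint8_t ch)
{
    if (head - tail == ISR_BUFFER_SIZE) return 0;   /* buffer is full */
    isr_buffer[head % ISR_BUFFER_SIZE] = ch;
    head++;
    return 1;
}

/* Called from thread context. Returns the next character, or -1 if empty. */
int console_in_get(void)
{
    if (head == tail) return -1;
    uint8_t ch = isr_buffer[tail % ISR_BUFFER_SIZE];
    tail++;
    return ch;
}
```

Because the callback writes only head and the reader writes only tail, this single-producer, single-consumer arrangement needs no lock on a single-core device. The unsigned counters are allowed to wrap; the difference head - tail stays correct because the buffer size divides 2^32.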
An I2C Debug Console is a little more complicated than
the UART Debug Console since I2C has a protocol where
data reads and writes use the same I2C wires. In contrast
the UART channel has no protocol and data reads and writes
use independent wires. Figure 2 shows the I2C Debug
Console functionality.
Handing the print buffer to the I2C module, as we did
with the UART module, is insufficient since we also have to
tell the I2C module what to do with this buffer. I therefore
added a Queue between the I2C_DebugPrint function and
the I2C block and then needed a thread to manage this
Queue. This I2C_Thread waits at the Queue for filled
buffers. There is no trace on the SuperSpeed Explorer
board between the CY7C65215 and the FX3 that can be
used as an interrupt to alert the FX3 that characters have
been typed by the user (and the schedule wouldn’t allow one
to be added, but you could on your board!) so I had to poll
the CY7C65215 for input. I set a timeout of 1 second on the
“Wait At Q” and check for user input at that time.
If one or more characters are retrieved from the
CY7C65215 then the routine echoes them, which means a
call to I2C_DebugPrint() that generates another filled
buffer; this is processed the next time around the
I2C_Thread loop (or soon after if there are other filled
buffers in the Queue). If a filled buffer was received from
the Queue then it is sent to the I2C block along with an I2C write
command that also specifies the I2C slave address. I wait
for the transfer to complete so that I know that the I2C
channel is not left busy.

Figure 2 The I2C Debug Console functionality.
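Every I2C transfer starts with an address byte that carries the 7-bit slave address in its upper bits and the read/write flag in bit 0, which is how a command "specifies the I2C slave address". A minimal sketch of building that byte follows; the function name is hypothetical and the address used in the usage note is a placeholder, not the address of the CY7C65215.

```c
#include <assert.h>
#include <stdint.h>

#define I2C_READ_BIT 0x01u   /* bit 0 set = read, clear = write */

/* Builds the first byte of an I2C transfer from a 7-bit slave address. */
uint8_t i2c_address_byte(uint8_t address7, int is_read)
{
    assert(address7 < 0x80);                   /* must fit in 7 bits */
    uint8_t byte = (uint8_t)(address7 << 1);   /* address in bits 7..1 */
    return is_read ? (uint8_t)(byte | I2C_READ_BIT) : byte;
}
```

For a placeholder address of 0x50, a write starts with 0xA0 and a read with 0xA1. Keeping the shifted address in one variable and toggling only bit 0 lets the same preamble be reused for the poll (read) and the print (write).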


Figure 3 shows the interesting code for this project. I
edited it down to 3 pages for the Figure by removing some
error checking and comments but all of the code is in the
DualConsole project folder. I included it here since it is a
good RTOS programming example and I would suggest
following it through.
Figure 3 Highlights of the I2C Debug Console code
CyU3PReturnStatus_t I2C_DebugPrint (uint8_t Priority, char * Message, ...)
{
    // This takes the same parameters as CyU3PDebugPrint which it is modeled upon
    CyU3PReturnStatus_t Status = CY_U3P_SUCCESS;
    va_list argp;
    CyU3PDmaBuffer_t DMABuffer;

    // First do some error checking
    if (!I2C_DebugEnabled) return CY_U3P_ERROR_NOT_STARTED;
    // Compare against the trace level that was set by I2C_DebugInit
    if (Priority > I2C_DebugTraceLevel) return CY_U3P_SUCCESS;
    if (CyU3PThreadIdentify () == NULL) return CY_U3P_ERROR_INVALID_CALLER;
    // OK to proceed
    CyU3PMutexGet (&I2C_DebugLock, CYU3P_WAIT_FOREVER);
    // Allocate the buffer for formatting the string.
    DMABuffer.buffer = CyU3PDmaBufferAlloc (CY_U3P_DEBUG_DMA_BUFFER_SIZE);
    // Do not format into a buffer that could not be allocated
    if (DMABuffer.buffer == NULL) Status = CY_U3P_ERROR_MEMORY_ERROR;
    if (Status == CY_U3P_SUCCESS)
    {
        DMABuffer.count = DMABuffer.size = CY_U3P_DEBUG_DMA_BUFFER_SIZE;
        DMABuffer.status = 0;
        va_start (argp, Message);
        Status = MyDebugSNPrint (DMABuffer.buffer, &DMABuffer.count, Message, argp);
        va_end (argp);
        // Increment the count to include the NULL character also.
        DMABuffer.count++;
    }
    if (Status == CY_U3P_SUCCESS)
    {
        // Now queue this message to be displayed on the I2C console
        Status = CyU3PQueueSend (&I2C_DebugQueue, &DMABuffer, CYU3P_WAIT_FOREVER);
        CheckStatus ("QueueSend", Status);
    }
    if ((Status != CY_U3P_SUCCESS) && (DMABuffer.buffer != NULL))
    {
        // The message was not queued so release its buffer
        CyU3PDmaBufferFree (DMABuffer.buffer);
    }
    CyU3PMutexPut (&I2C_DebugLock);
    return Status;
}

void I2C_ConsoleThread (uint32_t Value)
{
    // Value is a Semaphore that this thread should signal once it is ready to process buffers
    CyU3PReturnStatus_t Status, Q_Status;
    CyU3PDmaBuffer_t FilledBuffer, ConsoleIn;
    CyU3PI2cPreamble_t Preamble;
    int32_t retryCount = I2C_RETRY_COUNT;

    // Get an aligned buffer to collect I2C Console Input
    ConsoleIn.buffer = CyU3PDmaBufferAlloc (I2C_READ_SIZE);
    ConsoleIn.size = I2C_READ_SIZE;
    // Preset fixed data
    Preamble.buffer[0] = CY7C65215_DeviceAddress << 1;
    Preamble.length = 1;
    Preamble.ctrlMask = 0;
    // Tell I2C_DebugInit that the thread is ready for work
    Status = CyU3PSemaphorePut ((CyU3PSemaphore *)Value);

    // Now wait for filled buffers to be sent to the Queue and forward them to the I2C Block
    while (1)
    {
        Q_Status = CyU3PQueueReceive (&I2C_DebugQueue, &FilledBuffer, PollDelay);

        // It is recommended to read from the I2C device before transmitting anything.
        if ((Q_Status == CY_U3P_ERROR_QUEUE_EMPTY) || (Q_Status == CY_U3P_SUCCESS))
        {
            // Poll I2C for console in
            Preamble.buffer[0] |= 1; // Setup a Read
            Status = CyU3PI2cSendCommand (&Preamble, I2C_READ_SIZE, CyTrue);
            //CheckStatus("CyU3PI2cSendCommand", Status);
            if (Status == CY_U3P_SUCCESS)
            {
                // Preset the buffer to 0xFF so that unused bytes can be recognized
                CyU3PMemSet (ConsoleIn.buffer, 0xFF, I2C_READ_SIZE);
                ConsoleIn.count = ConsoleIn.status = 0;
                Status = CyU3PDmaChannelSetupRecvBuffer (&I2C_DebugRXHandle, &ConsoleIn);
            }
            if (Status == CY_U3P_SUCCESS)
            {
                Status = CyU3PDmaChannelWaitForCompletion (&I2C_DebugRXHandle, 100);
            }
            if (Status == CY_U3P_SUCCESS)
            {
                //CyU3PDebugPrint(4, ",CI=%x", *ConsoleIn.buffer);
                uint32_t i;
                for (i = 0; i < I2C_READ_SIZE; i++)
                {
                    if (ConsoleIn.buffer[i] != 0xFF) GotConsoleIn (ConsoleIn.buffer[i]);
                    else break;
                }
            }
            else Restart_I2C (); // Read failed
        }
        if (Q_Status == CY_U3P_SUCCESS)
        {
            // There was a buffer waiting, send it to the I2C Block
            Status = CyU3PDmaChannelSetupSendBuffer (&I2C_DebugTXHandle, &FilledBuffer);
            if (Status != CY_U3P_SUCCESS) CheckStatus ("SetupSendBuffer", Status);
            // Now tell the I2C Block what to do with this buffer of data
            Preamble.buffer[0] &= 0xFE; // Setup a Write
            if (Status == CY_U3P_SUCCESS)
            {
                Status = CyU3PI2cSendCommand (&Preamble, FilledBuffer.count, CyFalse);
            }
            // Wait for the I2C Block to be done
            if (Status == CY_U3P_SUCCESS)
            {
                Status = CyU3PI2cWaitForBlockXfer (CyFalse);
            }
            if (Status != CY_U3P_SUCCESS)
            {
                // The write failed: restart I2C and put the buffer back at the front of the Queue
                Restart_I2C ();
                CyU3PThreadSleep (50);
                if (retryCount > 0)
                {
                    retryCount--;
                    Status = CyU3PQueuePrioritySend (&I2C_DebugQueue, &FilledBuffer, NO_WAIT);
                    if (Status != CY_U3P_SUCCESS)
                    {
                        CyU3PDebugPrint (4, "Unable to re-queue data");
                    }
                }
            }
            else retryCount = I2C_RETRY_COUNT;
        }
    }
}

CyU3PReturnStatus_t I2C_DebugInit (uint8_t TraceLevel)
{
    CyU3PDmaChannelConfig_t dmaConfig;
    CyU3PReturnStatus_t Status;
    CyU3PSemaphore ThreadSignal;
    void * StackPtr;

    if (I2C_DebugEnabled) return CY_U3P_ERROR_ALREADY_STARTED;
    Status = Restart_I2C ();
    // Create MANUAL DMA channels to send and receive data from the I2C IO block
    CyU3PMemSet ((uint8_t *)&dmaConfig, 0, sizeof (dmaConfig));
    // Get a set of buffers to output debug messages
    dmaConfig.size = CY_U3P_DEBUG_DMA_BUFFER_SIZE;
    dmaConfig.count = 0;
    dmaConfig.prodSckId = CY_U3P_CPU_SOCKET_PROD;
    dmaConfig.consSckId = CY_U3P_LPP_SOCKET_I2C_CONS;
    dmaConfig.dmaMode = CY_U3P_DMA_MODE_BYTE;
    Status = CyU3PDmaChannelCreate (&I2C_DebugTXHandle, DMA_TYPE_MANUAL_OUT, &dmaConfig);
    // Console In Buffer will be assigned manually
    dmaConfig.size = I2C_CONSOLEIN_BUFFER_SIZE; // 0 should work here
    dmaConfig.count = 0;
    dmaConfig.prodSckId = CY_U3P_LPP_SOCKET_I2C_PROD;
    dmaConfig.consSckId = CY_U3P_CPU_SOCKET_CONS;
    Status = CyU3PDmaChannelCreate (&I2C_DebugRXHandle, DMA_TYPE_MANUAL_IN, &dmaConfig);
    // Create a Mutex and a Queue for the I2C_Console to use
    Status = CyU3PMutexCreate (&I2C_DebugLock, CYU3P_NO_INHERIT);
    Status = QueueCreate (&I2C_DebugQueue, sizeof (CyU3PDmaBuffer_t), Queue, sizeof (Queue));
    // I need to create a thread that will manage the Queue
    // I also need a signal to let me know that this thread is running
    Status = CyU3PSemaphoreCreate (&ThreadSignal, 0);
    CheckStatus ("ThreadSignal SemaphoreCreate", Status);
    StackPtr = CyU3PMemAlloc (DEBUG_THREAD_STACK_SIZE);
    Status = CyU3PThreadCreate (&I2C_DebugThread,   // Handle to my Application Thread
            "30:I2C_Debug",                         // Thread ID and name
            I2C_ConsoleThread,                      // Thread entry function
            (uint32_t)&ThreadSignal,                // Parameter passed to Thread
            StackPtr,                               // Pointer to the allocated thread stack
            DEBUG_THREAD_STACK_SIZE,                // Allocated thread stack size
            DEBUG_THREAD_PRIORITY,                  // Thread priority
            DEBUG_THREAD_PRIORITY,                  // = Thread priority so no preemption
            CYU3P_NO_TIME_SLICE,                    // Time slice not supported
            CYU3P_AUTO_START                        // Start the thread immediately
            );
    // Wait for the thread to be set up
    Status = CyU3PSemaphoreGet (&ThreadSignal, CYU3P_WAIT_FOREVER);
    I2C_DebugTraceLevel = TraceLevel;
    I2C_DebugEnabled = CyTrue;
    return Status;
}
FX3 Family Members
The FX3 component has been available for some
time now and early adopters were storage and video
developers. The FX3’s architecture allows all of USB 3.0’s
bandwidth to be delivered to the GPIF interface where it is
used to implement high-performance interfaces. These early
adopters wanted even more performance so Cypress worked
with them to produce two components targeted at these
industries. Figure 1 shows the block diagram of the FX3S,
designed for storage applications, and the CX3, designed for
video capture applications. The GPIF block already has
multiple counters and comparators for general 32-bit
applications and two different sets of extra circuitry were
added in the GPIF block, shown in red, to better serve these
applications. As you can see, most of the FX3 is unchanged
so 95% of this book is also relevant to FX3S and CX3
applications.
Figure 1 FX3S and CX3 share the same block diagram
FX3S designed for storage applications
The FX3S has an integrated storage controller that can
support two independent mass storage devices on its
storage ports. The interfaces are built around industry
standards: JEDEC standard JESD84-A44 for Embedded
Multimedia Cards (eMMC) Version 4.41, Micro Secure Digital
Specification (SD) Version 3.0 and SDIO Version 3.0. You
can attach 2TB SDXC cards giving huge data capacity or, by
using two eMMC cards in parallel, developers can
theoretically support up to 208 MBps of data storage
bandwidth. To free up pins for the storage ports, the
GPIF interface is limited to a maximum width of 16 bits.
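The 208 MBps figure can be checked with a little arithmetic. The sketch below assumes that each eMMC 4.41 device runs in its fastest mode, an 8-bit bus clocked at 52 MHz with data transferred on both clock edges; these bus parameters are an assumption and are not given in the text above, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Peak transfer rate of a parallel bus in megabytes per second. */
uint32_t bus_megabytes_per_second(uint32_t clock_mhz,
                                  uint32_t bus_bits,
                                  uint32_t edges_per_clock)
{
    assert(bus_bits % 8 == 0);   /* whole bytes per transfer only */
    return clock_mhz * edges_per_clock * (bus_bits / 8);
}
```

With these assumptions one device peaks at 52 x 2 x 1 = 104 MBps, so two devices operating in parallel give the theoretical 208 MBps. Real throughput will be lower once command overhead and the write speed of the media are included.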
Drivers are provided within the ThreadX framework to
support the storage devices including built-in RAID support
for RAID0 and RAID1 using either SD or eMMC cards.
Cypress provides four FX3 application examples, including
full source code, that implement basic file storage and
RAID0. Cypress application note AN89661 USB RAID1 Disk
Design Using EZ-USB FX3 explains all of the details that you
need to be successful. The toolset described in this book
creates FX3S code but you will need FX3S hardware to
debug it; Figure 2 shows two FX3S development boards from
Pactron, a silver partner of the Cypress Partner
Developer Program.
Figure 2 FX3S Development Board and Raid-On-A-Chip
Dongle from Pactron

CX3 designed for video capture applications


The CX3 is also built around an industry standard
interface: the Mobile Industry Processor Interface (MIPI)
Alliance works on many interfaces that are important to the
entire mobile device – “from the antenna and peripherals to
the modem and application processor”. The Camera Serial
Interface Specification (CSI-2) defines standard data
transmission and control interfaces between a transmitter
and a receiver. Data transmission is a unidirectional,
differential, serial interface with data and clock signals and
supports 1 to 4 Data Lanes as shown in Figure 3. The
control interface is a 400 kHz implementation of I2C.
Figure 3 CSI-2 Specification defines Data Lanes

The CX3 supports up to 4 data lanes at speeds up to
1 Gbps per lane (a 2-lane MIPI version is also available). A
single lane will support a wide variety of industry standard
formats including RAW8/10/14, YUV422/444 and
RGB888/666/565. This translates into the CX3 being able to
capture uncompressed streaming video at 4K UHD @ 15fps,
1080p @ 30fps, and 720p @ 60fps. A dedicated GPIF
configuration is required for the MIPI interface so no GPIF
pins are available for the application; however, there are 12
pins that may be used as GPIOs to support lighting, sync-in,
sync-out or other camera features not supported by the I2C
control interface.
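A quick calculation shows how the quoted video modes compare with the CSI-2 link rate. The sketch below assumes 16 bits per pixel (as in 8-bit YUV422) and ignores blanking and protocol overhead; the pixel format is an assumption since the text does not state one, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Raw payload rate of an uncompressed video stream in bits per second. */
uint64_t video_bits_per_second(uint32_t width, uint32_t height,
                               uint32_t bits_per_pixel, uint32_t fps)
{
    return (uint64_t)width * height * bits_per_pixel * fps;
}

/* Aggregate CSI-2 link rate for a given number of data lanes. */
uint64_t csi2_link_bits_per_second(uint32_t lanes, uint64_t lane_bps)
{
    assert(lanes >= 1 && lanes <= 4);   /* CSI-2 as described: 1 to 4 lanes */
    return lanes * lane_bps;
}
```

At 16 bits per pixel, 4K UHD (3840 x 2160) at 15fps needs about 1.99 Gbps, 1080p at 30fps about 1.0 Gbps and 720p at 60fps about 0.88 Gbps. All three fit within the 4 Gbps available from four lanes running at 1 Gbps each.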
Drivers are provided within the ThreadX framework to
support Data Lanes and, of course, the I2C interface.
Cypress provides four CX3 application examples, including
full source code, that implement the USB Video Class (UVC)
interface for a range of color and frame resolutions. Cypress
application note AN75779 How to Implement an Image
Sensor Interface Using EZ-USB FX3 in a USB Video Class
(UVC) Framework explains all of the details that you need to
be successful. The toolset described in this book creates
CX3 code but you will need CX3 hardware to debug it; Figure
4 shows a Denebola Reference Design from e-con Systems,
a silver partner of the Cypress Partner Developer
Program.
Figure 4 Denebola Reference Design from e-con
Systems
