Sponsors
Editors
Jorge M. Finochietto
Gustavo D. Sutter
Orlando Micolini
Pablo Recabarren
Proceedings of the
2011 VII Designer Forum
Córdoba, Argentina
April 13 – 15, 2011
Organized by
Digital Communications Research Lab
School of Exact, Physical and Natural Sciences
National University of Córdoba
ISBN: 978-84-614-7682-4
Preface
These Proceedings contain the technical papers presented at the 2011 VII Designer Forum,
organized within the 2011 VII Southern Conference on Programmable Logic (SPL), held in
Córdoba, Argentina, from April 13 to 15, 2011. The SPL Conference is the Southern Hemisphere's
largest and most comprehensive conference focused on reconfigurable technology (i.e., FPGAs)
and its applications.
The history of SPL started in 2005. The Joint Latin American FPGA Laboratories Project
(SURLAB) was financed by Banco Santander Central Hispano of Spain. Its aim was to create
a network of Latin American laboratories to spread FPGAs as a key technology for industry and
to update university curricula to include related subjects. The original partners were the
Universidad Autónoma de Madrid, the Instituto Tecnológico de Monterrey, the University of Lima
in Peru, and the Argentinean universities of Mar del Plata, Salta, Tandil, and CAECE.
Held in March 2005, the first SPL Conference was attended by more than 60 people from
Argentina, Brazil, Costa Rica, and Peru. This 5-day workshop, set in the unique atmosphere of
the one-hundred-year-old CAECE University building, introduced students, professors, and
engineers to the FPGA state of the art.
In 2006, more than 80 engineers attended the 2nd SPL, and more than 50 papers from
Argentina, Brazil, Costa Rica, Peru, Spain, the United Kingdom, Uruguay, and the USA were selected.
In 2007, the 3rd SPL Conference was sponsored by IEEE for the first time, receiving more than
90 papers from 24 countries: Argentina, Australia, Bangladesh, Belgium, Brazil, Colombia, Costa
Rica, Czech Republic, France, Germany, Greece, Hong Kong, India, Italy, Mexico, Netherlands,
Paraguay, Peru, Portugal, Singapore, Spain, Taiwan, UK, and USA.
In 2008, the 4th SPL Conference moved from Mar del Plata to San Carlos de Bariloche,
situated in the Andean foothills. A total of 29 full papers, 23 short papers, and 20 Designer
Forum papers were selected from around one hundred submissions, with authors from the following
countries: Argentina, Australia, Brazil, China, Canada, Colombia, France, Germany, Hong Kong,
Mexico, Peru, Portugal, Romania, Spain, United Kingdom, and USA.
In 2009, the 5th SPL Conference, sponsored again by IEEE, moved out of Argentina to São
Carlos, Brazil. Ninety papers were submitted from many countries; 26 were accepted as full
papers, 12 as short papers, and 8 as Designer Forum papers.
In 2010, the 6th SPL Conference, sponsored by IEEE, moved to the northeastern coast of
Brazil, to the well-known Porto de Galinhas beach near the city of Recife. This central location
and relaxed atmosphere, combined with the fast-paced economic growth in this part of Brazil, made
it a great site to discuss advanced technology. SPL 2010 received submissions from Argentina,
Brazil, Canada, China, France, Iran, Italy, Mexico, Netherlands, Pakistan, Peru, Poland, Portugal,
Spain, United Kingdom, and United States. A total of 53 papers were selected: 22 full papers,
13 short papers, and 18 Designer Forum papers.
In 2011, the 7th SPL Conference, sponsored as usual by IEEE, has moved to Córdoba, the
second-largest city in Argentina, and will be hosted at the National University of Córdoba,
one of the oldest universities in the Americas. Paper submissions were received from the
following countries: Argentina, Belgium, Brazil, Colombia, Finland, France, Germany, Greece,
India, Mexico, Portugal, Spain, Sweden, United Kingdom, United States of America, and Uruguay.
From 99 submissions, a total of 50 regular papers were selected: 24 for oral presentation and 21
for poster presentation.
A total of 25 papers were selected for inclusion in the Proceedings of the Designer Forum, which
demonstrates the increasing relevance of this forum within the SPL conference. The goal of the
Designer Forum is to give exposure to ongoing research, academic experiences, and industrial
designs in order to get feedback from experienced researchers and industrial partners. The
Designer Forum was born with the Southern Conference on Programmable Logic (SPL) in 2005 and
has become an important part of it. It promotes the participation of new researchers and advanced
students from the conference region. Because of the regional scope of the Designer Forum, its
papers may also be written in Spanish or Portuguese.
This year, two one-week intensive courses were held to foster digital hardware design skills
among advanced students and professionals, thus maintaining the spirit of spreading FPGA
technology knowledge in the Southern Hemisphere. In addition, four tutorials, given by both
industry and academic experts, have been organized for conference attendees.
This year, over 150 participants are expected from more than 40 universities, technological
institutions, and companies from all around the world.
The topics in this year's program include: Embedded Processors and IP Cores, System-on-Chip,
Computer Arithmetic, Image Processing and Vision, FPGA Architectures for Specific Applications,
Fault Tolerance, and Test & Verification. SPL has an excellent track record and is becoming an
important forum for discussion on FPGA technology and its applications.
We would like to express our gratitude to the many people who have contributed to the high
quality of the technical program. Special thanks go to those who chaired or were members of the
various committees, particularly the Program Committee, whose careful review has helped to
maintain the high quality of SPL.
Finally, we would like to thank our sponsors: Altera, ClariPhy Argentina, Fundación Tarpuy,
National Agency for Scientific and Technological Promotion (Agencia), National Scientific and
Technical Research Council (CONICET), and Synopsys.
Special thanks also go to the School of Exact, Physical and Natural Sciences (National University
of Córdoba) and Universidad Autónoma de Madrid for their support.
The Editors
Córdoba, Argentina, April 2011
Table of Contents
Executive Committee
Poster Session 1
Poster Session 2
Digitally Configurable Platform for Power Quality Analysis
Bruno Falduto, Ricardo Cayssials, Edgardo Ferro
Solar Tracker for Compact Linear Fresnel reflector using PicoBlaze
Maiver Villena, Daniel Hoyos, Carlos Cadena, Victor Serrano, Telmo Moya, Marcelo Gea
Toolbox NURBS and Visualization System Via FPGA
Luiz Marcelo Silva, Maria Paiva
Una Metodología para el Desarrollo de Sistemas en Chip de Alta Performance
Marcos Oviedo, Pablo Ferreyra
High Throughput 4x4 and 8x8 SATD Similarity Criteria Architectures for Video Coding Applications
Luciano Agostini, Julio Saracol Domigues, Dieison Soares Silveira, Leomar Soares da Rosa, Vinicius Possani
Adquisición de Vídeo Bajo Estándar ITU-R BT.656-4 Mediante Lógica Programable
Juan Carlos Contreras, Guillermo Gutierrez, Emilio Kowalski, Rodolfo Cavallero
Executive Committee
General Chairs
Jorge M. Finochietto
Universidad Nacional de Córdoba – CONICET, Argentina
Gustavo Sutter
Universidad Autónoma de Madrid, Spain
Forum Chairs
Orlando Micolini
Universidad Nacional de Córdoba, Argentina
Pablo Recabarren
Universidad Nacional de Córdoba – CONICET, Argentina
Tutorial Chair
Graciela Corral-Briones
Universidad Nacional de Córdoba, Argentina
Local Chair
Carmen Rodríguez
Universidad Nacional de Córdoba, Argentina
Financial Chair
Ramiro Calderón
Fundación Tarpuy, Argentina
Executive Secretary
Publicity Chair
Eduardo Boemo
Universidad Autónoma de Madrid, Spain
Edval Santos
Universidade Federal de Pernambuco, Brazil
Valentin Obac Roda
Universidade de Sao Paulo, Brazil
Elias Todorovich
Universidad Nacional del Centro, Argentina
Luciano Agostini
Universidade Federal de Pelotas, Brazil
Forum Committee
IP CORE MAC ETHERNET
ABSTRACT
Ethernet technology provides communication between PCs and stand-alone devices, over
local networks or across the Internet. In this work we present a core that implements the
Ethernet MAC layer; it is simple to use, offers several configurations, and occupies few
FPGA resources. The design was simulated with free software tools and verified in
hardware on a Virtex 4 FPGA.
1. INTRODUCTION
Registers 2 and 3: MAC address.
Register 4: control/status of the MDIO interface.
Registers 5 and 6: memory addresses of the transmit and receive descriptor tables.
The descriptors are 32-bit words transferred over AHB. For both transmission and reception
there are two contiguous descriptors:
Descriptor 0: holds control and status bits. It uses 11 bits to specify the number of bytes
to transfer.
Descriptor 1: a 30-bit pointer to the memory area where the data are stored or fetched.
2.4.1. Transmission
The data are placed, through AHB, starting at the address pointed to by descriptor 1. They
must include the destination and source MAC addresses and the type/length field. The 4-byte
CRC (cyclic redundancy check) is appended automatically.
Next, the address of descriptor 0 is written to register 5. GReth starts the transmission
when instructed to do so through register 0.
When the transmission finishes, GReth writes status information to register 1 and to
descriptor 0. Finally, it points to the next pair of descriptors and is ready for the next
operation.
2.4.2. Reception
The address of descriptor 0 is written to register 6. GReth reads the descriptors when
instructed to do so through register 0 and waits for an incoming packet. The packet is
accepted when its destination MAC address is the one held in registers 2 and 3 or the
broadcast address, or when the core has promiscuous mode enabled. In any other case it is
discarded.
When reception finishes, status information is written to register 1 and to descriptor 0, and
the received data are available starting at the address pointed to by descriptor 1.
2.4.3. MDIO
This interface gives access to between 1 and 32 PHYs, each containing between 1 and 32
16-bit registers. Its control and status are accessible through register 4.
A write is started by specifying the data, the PHY number, and the register number, and
setting the write bit to '1'. A read requires the PHY number and the register number, and is
started by setting the read bit to '1'.
3. TESTING GRETH
To detect any error introduced while simplifying the core, a testbench was designed for
GReth. This also gave us a better understanding of how it works, in particular given the use
of the two-process method in GReth.
The test consisted of instantiating GReth together with a description called FakePHY, which
emulates a PHY; performing, from the AMBA interfaces, MDIO writes and reads and MII
transmissions and receptions; and checking that what was sent matched what was received,
aborting otherwise.
To drive AMBA, a library called AMBA Handler was developed for simulation purposes. It
implements eight procedures that cover the combinations of write or read, to a master or a
slave, over APB or AHB.
4. THE DEVELOPED CORE: MAC Ethernet
4.1. Introduction
Fig. 2 shows two summarized diagrams of the component instantiation of the core we started
from (left) and of the core we obtained (right).
Fig. 2. Instantiation diagrams of GReth (left) and MAC (right).
The top level of GReth instantiates the transmit and receive FIFOs and the ethc0 component,
and implements descriptor-based management of the core, MDIO communication and part of the
AMBA communication, the EDCL interface, and synchronization between clock domains. The ethc0
component instantiates the components that handle transmission and reception through
MII/RMII and a component that handles the remaining part of the AMBA communication.
The developed MAC core has a purely structural top level, which only instantiates the
so-called transmit and receive channels and, optionally, the MDIO interface. These channels
internally instantiate dual-port RAMs, the components that handle transmission and reception
through MII, and components for synchronization between clock domains.
4.2. Implementation
The developed core was written in standard VHDL 93. It was developed with the tools and
guidelines recommended by the FPGALibre project [7].
With respect to GReth, some features were removed, some descriptions were replaced, and
others were partly or fully modified.
The following features were removed:
Use of AMBA buses.
Descriptor-based management.
EDCL interface.
RMII support.
The generic FIFOs used in GReth were replaced by the laboratory's own FIFOs, implemented
with dual-port RAM. In addition, they are now instantiated [...]
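The acceptance rule that Section 2.4.2 describes for GReth reception can be written as a small predicate. The following is only an illustrative sketch in Python, not part of either core: the function name, its signature, and the example addresses are assumptions made here, while the rule itself (own address, broadcast, or promiscuous mode) comes from the text.

```python
BROADCAST_MAC = bytes([0xFF] * 6)  # Ethernet broadcast address FF:FF:FF:FF:FF:FF


def accepts_frame(dest_mac: bytes, own_mac: bytes, promiscuous: bool = False) -> bool:
    """Return True when a received frame should be kept.

    Hypothetical helper: the name and signature are illustrative only.
    own_mac stands for the address held in registers 2 and 3.
    """
    if len(dest_mac) != 6 or len(own_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes long")
    return promiscuous or dest_mac == own_mac or dest_mac == BROADCAST_MAC


own = bytes.fromhex("020000000001")    # made-up station address
other = bytes.fromhex("020000000002")  # made-up foreign address
print(accepts_frame(own, own))                      # frame addressed to this station
print(accepts_frame(BROADCAST_MAC, own))            # broadcast frame
print(accepts_frame(other, own))                    # foreign frame, normal mode
print(accepts_frame(other, own, promiscuous=True))  # foreign frame, promiscuous mode
```

Only the third call should report that the frame is discarded, since it is neither addressed to the station nor broadcast and promiscuous mode is off.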
4.3. Architecture
Fig. 3 shows a block diagram of the core, in which the three clock domains the system works
with can be seen.
Transmission consists of an FSM that, depending on its input signals, writes data to a FIFO
implemented with a dual-port RAM. When the transfer of data to the FIFO ends, the wr_end
signal is generated; after being synchronized, it is detected by the FSM that reads the data
from the FIFO and transmits them through MII. Once all the data have been read, the rd_end
signal returns the write FSM to its initial state.
Reception is similar to transmission, except that the data written to the FIFO are those
obtained from MII, and the data read from the FIFO are made available for use. To avoid
losing packets because the application has not finished retrieving the received data, a
multiple-FIFO scheme was implemented. The number of FIFOs is configurable, and they are
managed exclusively by the core.
[...] has characteristics similar to those of GReth, but a new interface. It has signals for
specifying the PHY number and the register number, and separate input and output data.
Individual signals indicate whether the operation is a write or a read. Finally, it has a
busy signal and a communication-failure signal.
5. VALIDATION OF THE DEVELOPED CORE
5.1. Simulation
GHDL [8] 0.28 was used for simulation.
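The transmit handshake described in Section 4.3 (write FSM, FIFO, wr_end, read FSM, rd_end) can be pictured with a small software model. This is a rough sketch under stated assumptions, not the VHDL of the core: the class name, method names, and state names are invented for illustration, and the synchronization between clock domains is deliberately left out.

```python
from collections import deque


class TxChannel:
    """Software model of the transmit path: write FSM -> FIFO -> read FSM.

    State names are hypothetical; clock-domain synchronization is not modelled.
    """

    def __init__(self) -> None:
        self.fifo = deque()          # stands in for the dual-port RAM FIFO
        self.write_state = "IDLE"    # state of the write-side FSM
        self.wr_end = False          # raised when the frame is fully written
        self.mii_out = []            # bytes the read-side FSM sent over MII

    def write_frame(self, frame: bytes) -> None:
        """Write-side FSM: store a frame, then raise wr_end."""
        if self.write_state != "IDLE":
            raise RuntimeError("previous frame has not been read out yet")
        self.fifo.extend(frame)
        self.wr_end = True
        self.write_state = "WAIT_RD_END"

    def read_side_step(self) -> None:
        """Read-side FSM: after seeing wr_end, drain the FIFO and signal rd_end."""
        if not self.wr_end:
            return
        while self.fifo:
            self.mii_out.append(self.fifo.popleft())
        self.wr_end = False
        self.write_state = "IDLE"    # effect of rd_end on the write FSM


tx = TxChannel()
tx.write_frame(b"\x01\x02\x03")
tx.read_side_step()
print(bytes(tx.mii_out), tx.write_state)
```

The model makes one property of the scheme visible: a second frame cannot be written until the read side has emptied the FIFO and returned the write FSM to its initial state.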
A testbench was built in which the FakePHY core is again instantiated, this time together
with our MAC. Unlike the GReth test, this one is more rigorous and includes features such as:
It implements separate processes for transmission and reception, instead of a single
sequential one.
It verifies that error indication works.
The three clocks it uses are not exact multiples of one another, which allows a better
simulation of the synchronization between signals.
In addition, a core called Replies was developed, which answers ARP (Address Resolution
Protocol) and ICMP (Internet Control Message Protocol) requests. Note that the mechanisms it
uses for this purpose do not follow what is specified for these two protocols; they are
shortcuts intended for testing. This core was used in a testbench together with real Ethernet
frames captured with the wireshark software [9], to recreate the execution of the ping
command and to view the waveforms and the data packets exchanged.
5.2. Hardware validation
Hardware validation was carried out using a Xilinx Virtex 4 FPGA and the ISE WebPack
11.3 - L.57 software. The host was a personal computer running the Debian [10] GNU [11]/Linux
operating system.
The Replies core, which is synthesizable, was used as the application. Once the core passed
the testbench without reporting any error, multiple tests were run with the ping command,
lasting from hours to more than a week, with zero lost packets in every case. The wireshark
software was used again, in this case to verify that the received packets were correctly
formed.
The external PHY used was the DP83847 from National Semiconductor. The tests were performed
over a 100 Mb/s full-duplex link.
6. RESULTS
Table 1 shows the synthesis results of the GReth and MAC cores for a Virtex 4.
Table 1. Synthesis results
GReth core
Configuration        LUTs   FFs   Slices   BRAMs
Without MDIO         1814   775   1099     2
With MDIO            2011   834   1220     2
MAC core
Configuration        LUTs   FFs   Slices   BRAMs
1 RX without MDIO     823   333    491     2
2 RX without MDIO     872   341    516     3
2 RX with MDIO       1016   381    591     3
For GReth, the most common configurations were synthesized, with and without the MDIO
interface, in both cases with the EDCL interface disabled. For the MAC, the same options were
synthesized with two receive channels, which is the most common use case, and also with a
single receive channel, which may be sufficient in many applications that do not require a
continuous data flow.
7. CONCLUSIONS
Comparing the synthesis results shows that the implementation obtained is more compact than
the one we started from. For equivalent configurations, our core uses less than 50 % of the
FPGA area used by GReth. It should also be considered that the GReth core requires memory
accessible through AMBA, as well as all the support for descriptor management, whereas our
core includes everything needed to be used directly.
As for usage, the developed core is simpler and does not depend on a particular bus, although
it can easily be adapted to whichever is needed, be it AMBA, WISHBONE [12], or another. The
simpler usage and the change of architecture are the main reasons for the lower FPGA resource
usage.
The use of standard VHDL 93 allows the core to be synthesized on an FPGA from any vendor.
The tools proposed by the FPGALibre project proved to be adequate for a project of this kind.
Future work could involve lower layers, such as the implementation of an Ethernet PHY, as
well as higher-level applications that handle the IP (Internet Protocol) protocol.
8. REFERENCES
[1] S. E. Tropea and R. A. Melo, “USB framework - IP core and related software,” in XV
Workshop Iberchip, vol. 1, Buenos Aires, 2009, pp. 309–313.
[2] GRLIB IP Core User’s Manual, 1.0.19 ed. Gaisler Research, 2008, pp. 324–336.
[3] J. Gaisler, “An open-source VHDL IP library with plug&play configuration,” in IFIP
Congress Topical Sessions, R. Jacquart, Ed. Kluwer, 2004, pp. 711–718.
[4] ARM. (2010, Jun.) AMBA - Advanced Microcontroller Bus Architecture. [Online].
Available: http://www.arm.com/products/system-ip/amba/amba-open-specifications.php
[5] Free Software Foundation, Inc., “GNU General Public License,”
http://www.gnu.org/copyleft/gpl.html.
[6] J. Gaisler, “A structured VHDL design method,” http://www.gaisler.com/doc/vhdl2proc.pdf,
Jun. 2010.
[7] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, “FPGAlibre: Herramientas de software
libre para diseño con FPGAs,” in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL,
2006, pp. 173–180.
[8] T. Gingold. (2010, Jun.) A complete VHDL simulator. [Online]. Available:
http://ghdl.free.fr/
[9] G. Combs and contributors. (2010, Jun.) Network protocol analyzer. [Online]. Available:
http://www.wireshark.org/
[10] I. Murdock et al. (2010, Jun.) Debian GNU/Linux operating system. [Online]. Available:
http://www.debian.org/
[11] R. M. Stallman et al. (2010, Jun.) The GNU project. [Online]. Available:
http://www.gnu.org/
[12] Silicore and OpenCores.Org. (2010, Jun.) WISHBONE System-on-Chip (SoC) interconnection
architecture for portable IP cores. [Online]. Available:
http://prdownloads.sf.net/fpgalibre/wbspec_b3-2.pdf?download
AUTONOMOUS WIRELESS INTELLIGENT NETWORK ACCESSIBLE VIA IP

ABSTRACT

An autonomous wireless intelligent network is presented.

An autonomous wireless intelligent network (AWIN) is presented. It is defined as a wireless Ethernet local area network. All communications, internal and external, are carried over the Internet Protocol (IP). Remote stations can access the network via wireless Ethernet for the reset process or for data gathering. The protocol for wireless Ethernet networks is defined in the IEEE 802.11 standard [1] [2]. Its rules are independent of technology and internal structure. The minimum necessary subset of the standard rules was selected to implement the node communication module. The network has one IP address; all nodes share this IP address and each has its own physical (MAC) address.

Internal network intelligence is centered on dynamic reconfiguration of the architecture according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration. BGP was developed to allow effective all-to-all interconnection between autonomous systems via IP [3]. As the capabilities of BGP exceed the needs of the autonomous network, only the capabilities needed for this specific application were selected. Keeping dynamic reconfiguration simple, so that nodes can be added or removed and the communication path changed without affecting network performance, presented an interesting trade-off. The goals were high performance, low cost and minimum power consumption. Figure 1 shows a fourteen-node network before the communication architecture has been built.

The network is identified by a single IP address that all nodes share. The nodes have the same structure and capabilities, but each of them is identified by a different MAC address.

The network builds its communication architecture autonomously. As each wireless node can communicate only with the nodes that are within range of its transmitter, communication inside the network must go from neighbor node to neighbor node, or “mouth to mouth”. Once the communication path is defined, as shown in figure 2, the network is ready and the programmed process starts.

Node deployment is not fixed and may change over time. Nodes are battery powered, so the transmitter range is affected by the state of battery charge. This or other causes of failure, such as environmental or electronic hazards or accidental destruction, can put some nodes out of service. If one or more nodes stop working, the network must reconfigure itself to keep communication alive, as shown in figure 3. An architecture check is performed periodically and, when necessary, the communication path is reconfigured.

When external access is required, the request can be received by many nodes; the first node that answers assumes the role of hub node. The hub node is responsible for wireless communication with the external Ethernet network, and all other nodes must report to it using intermediate nodes as repeaters.
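The path-building idea described above, in which every node reports to the hub through neighbors that are within radio range, can be sketched as a breadth-first search over the neighbor relation. This is only an illustrative model under stated assumptions: the adjacency matrix `in_range`, the `alive` flags and the names `build_routes`, `next_hop` and `NO_ROUTE` are hypothetical and do not come from the paper, whose nodes discover routes by exchanging messages rather than by running a centralized search.

```c
#include <assert.h>

#define MAX_NODES 14    /* network size used in the paper's figures */
#define NO_ROUTE  (-1)  /* hypothetical marker: node cannot reach the hub */

/* Fill next_hop[n] with the neighbor that node n uses to reach the hub.
 * in_range[a][b] != 0 means a and b hear each other (assumed symmetric);
 * alive[n] == 0 marks a node that is out of service.
 * Returns the number of nodes that can reach the hub (hub included). */
int build_routes(int in_range[MAX_NODES][MAX_NODES],
                 const int alive[MAX_NODES],
                 int hub, int next_hop[MAX_NODES])
{
    int queue[MAX_NODES];   /* each node is queued at most once */
    int head = 0, tail = 0, reached = 0;

    assert(hub >= 0 && hub < MAX_NODES);
    for (int n = 0; n < MAX_NODES; n++)
        next_hop[n] = NO_ROUTE;
    if (!alive[hub])
        return 0;

    next_hop[hub] = hub;    /* the hub is its own next hop */
    queue[tail++] = hub;
    reached = 1;

    while (head < tail) {
        int cur = queue[head++];
        for (int n = 0; n < MAX_NODES; n++) {
            if (alive[n] && in_range[cur][n] && next_hop[n] == NO_ROUTE) {
                next_hop[n] = cur;   /* n reports to the hub through cur */
                queue[tail++] = n;
                reached++;
            }
        }
    }
    return reached;
}
```

Calling `build_routes` again after clearing `alive[n]` for a failed node models the periodic architecture check: nodes that used the failed node as a repeater receive a new next hop if another neighbor is in range, and are marked `NO_ROUTE` otherwise.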
[Diagram of nodes A, C, D, E, F, H, J, K, L, M, N, O, P and Q]
Figure 2. Fourteen-node network communication path

[Diagram of nodes A, C, D, E, F, H, J, K, L, M, N, O, P and Q]
Figure 3. Fourteen-node network communication path with node C out of service

2. NODE DESCRIPTION

A typical network node block diagram is shown in figure 4. Two subsystems can be distinguished: one for communication and the other to manage sensor activity and configuration.

[Block diagram: COMMUNICATION SUBSYSTEM (TRANSMITTER/RECEIVER to/from the Ethernet network, PROTOCOL CODE/DECO, COMMUNICATION MEMORY) and SENSOR SUBSYSTEM (SENSOR SUBSYSTEM CONTROL, SENSOR MEMORY, SENSOR)]
Figure 4. Network node block diagram

The communication subsystem has three blocks. The first is a wireless Ethernet compatible transmitter/receiver. The second (PROTOCOL CODE/DECO) is a dedicated communication module. In a reception process it interprets the message according to the IP protocol, stores in memory the fields it needs to keep and transmits the data to the sensor subsystem; in a transmission process it shapes the frame according to the Ethernet protocol, retrieving from memory the fields needed to build the outgoing message. The last is a memory block (COMMUNICATION MEMORY).

The sensor subsystem is composed of three blocks: one that manages all subsystem activities (SENSOR SUBSYSTEM CONTROL), a memory block that stores data and configuration parameters (SENSOR MEMORY), and the sensor itself (SENSOR).

The transmitter/receiver to be used in this application will be a wireless Ethernet IEEE 802.11 compatible transmitter/receiver; its description is outside the scope of this paper.

The dedicated communication module block (PROTOCOL CODE/DECO) was designed on the basis of earlier works [4] [5]. The internal working frequency of the system was set to 100 MHz, and part of the Ethernet manager works at 50 MHz. It is a bidirectional block that manages data transmission and reception. As a receiver, it recognizes, decodes and processes the incoming frame according to the Ethernet rules. In data transmission, it manages the reverse process.

The block selects between a transmission and a reception process. In the transmission process, the output frame is shaped by assembling the data coming from the sensor subsystem with the destination and origin MAC and IP addresses and the control bits. Before transmission starts, channel occupancy is checked; transmission is enabled when the channel is free.

In the reception process, reception starts when a valid data frame is detected. The incoming frame is processed according to the protocol, and the destination IP address is checked against the network address; if they do not match, the frame is discarded. If the origin MAC address matches one of the MAC addresses of the network nodes, the frame is identified as an internal network message; otherwise an external communication is detected.

In both cases, the frame is decoded and redundancies are checked through a feedback shift register that was proposed in a XILINX application note [6]. The origin and destination MAC and IP addresses are extracted and stored in COMMUNICATION MEMORY to be used when building the answer message, and the data is submitted to the sensor subsystem with a special bit code that identifies the communication as external or internal. COMMUNICATION MEMORY was implemented as a memory with two read/write ports.
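The redundancy check mentioned above can be illustrated in software. The sketch below is neither the paper's hardware design nor the circuit of the cited application note; it is a bit-serial model of the standard IEEE 802.3 CRC-32, written as a feedback shift register that consumes one bit per step. The function names `crc32_shift_bit` and `crc32_frame` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advance the 32-bit feedback shift register by one input bit.
 * Bits enter least significant bit first; 0xEDB88320 is the
 * bit-reversed form of the IEEE 802.3 generator polynomial 0x04C11DB7. */
uint32_t crc32_shift_bit(uint32_t reg, unsigned bit)
{
    unsigned feedback = (reg ^ bit) & 1u;

    reg >>= 1;
    if (feedback)
        reg ^= 0xEDB88320u;
    return reg;
}

/* CRC-32 of a byte buffer: register preset to all ones, final complement. */
uint32_t crc32_frame(const uint8_t *data, size_t len)
{
    uint32_t reg = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        for (unsigned b = 0; b < 8; b++)
            reg = crc32_shift_bit(reg, (data[i] >> b) & 1u);
    return ~reg;
}
```

For the nine ASCII characters "123456789" this function returns 0xCBF43926, the usual CRC-32 check value. A receiver can recompute the value over the frame contents and compare it with the frame check sequence it received.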
The sensor subsystem has three blocks: the SENSOR SUBSYSTEM CONTROL (SSC), a memory block that stores sensor data, addresses and configuration parameters (SENSOR MEMORY), and the sensor itself. The SSC is responsible for managing all sensor subsystem activities.

3. NETWORK OPERATION

Network operations are divided into five categories. Three of them are defined for external communication (shown in figure 1) and are identified as Network Set Up, Network Programming and Data Gathering. The fourth category corresponds to an internal communication process of the network and is defined as Network Configuration. The last, identified as Data Recollecting, is defined for storing the data collected by the sensor in the sensor memory.

Network Set Up (NSU) is the starting process. Assuming the network has a predefined number of nodes, each identified by a different MAC address, and each node has stored the addresses of all the others, an external NSU message is required to start network operation. When the NSU message is received, the node that first answers the request assumes the role of hub node, and the Network Configuration process (NCP) is started (figure 5).

A dedicated protocol based on BGP was developed for NCP. Devices that can communicate directly are defined as neighbors, and the first step is to detect them. The hub node sends a START message to all the other nodes; the nodes that answer are taken as neighbors and their MAC addresses are stored as neighbor addresses. After a preset time without receiving answers, the hub node assumes that its neighbor table is complete, sends an OPEN message to each of its neighboring nodes, and waits for a KEEPALIVE message that only includes the BGP header. Each node carries out the same procedure to identify its neighbors.

[Figure: REMOTE STATION and HUB NODE of the fourteen-node network (nodes A, C, D, E, F, H, J, K, L, M, N, O, P and Q); labels: 1 IP address, 14 MAC addresses]

Once the KEEPALIVE message has been received, the hub node emits an UPDATE message to announce the MAC addresses of its neighbors. The neighbor nodes receive the message and emit an UPDATE message to announce their own neighbor addresses and the route to reach the hub node. Every node that receives the message repeats the operation, announcing the MAC addresses of its neighbors and the route to reach the hub node, and so the information spreads through the network.

When all nodes have been reached and the communication path information has been stored in all of them, the network architecture is completely configured and the sensors start the DR process. KEEPALIVE messages are exchanged periodically to ensure that the relationship remains established. If a node goes out of service, a communication break is reported and the routes that include this node are reconfigured by generating UPDATE messages.

The Network Programming (NP) and Data Gathering (DG) processes start with the corresponding external messages. When an external NP or DG message is received, all the nodes are enabled to receive it; the one that first answers the request assumes the role of hub node to receive and retransmit information.

NP is the process that programs the sensor parameters. The information spreads through the network, and all the sensors are reprogrammed once it is stored in the sensor memory of each node. DG is the process that transfers the data stored in the sensors out of the network. When the hub node sends a data request message, the data travel from node to node until they reach the hub node and are transmitted to the external network.

Data Recollecting (DR) is an internal node process whose period is programmed during the NP process.

4. CONCLUSIONS

The node structure and operation of an autonomous wireless intelligent network reachable remotely via Internet were presented. The specific application is sensing meteorological data in the field.

The structure is the same for all nodes. All nodes have the same capabilities, share the same IP address and have different MAC addresses. The minimum necessary subset of the IEEE 802.11 standard rules was selected to implement the node communication module. Internal network intelligence is centered on dynamic topology reconfiguration according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration.
Two prototype nodes were implemented on XILINX SPARTAN III field programmable logic devices available on Digilent S3 SKB development boards [7]. The design was validated with successful communication tests carried out in the laboratory. For the tests, the connection between nodes was implemented as a wired 10BASE-T connection synchronized at 10 Mb/s. Current work is the analysis and selection of an RF transmitter to implement wireless communication.

5. REFERENCES

[1] IEEE, IEEE STD 802.11-2007, “Revision of IEEE STD 802.11-1999”, June 2007.
[2] Waisbrot, J. “Request For Comments: 791”, http://www.rfc-es.org/rfc/rfc0826-es.txt
[3] Rekhter Y., Li T., Hares S. “Request for Comments 4271: A Border Gateway Protocol 4 (BGP-4)” http://www.ietf.org/rfc/rfc4271.txt
[4] Schiavon M. I., Crepaldo D., Martín R. L., Varela C. “Dedicated system configurable via Internet embedded communication manager module”, V Southern Conference on Programmable Logic, San Carlos, Brasil (2009) pp 193-197.
[5] Schiavon M. I., Crepaldo D., Martín R. L. “Wireless Internet configurable network module”, VI Southern Conference on Programmable Logic, Puerto Galhinas, Brasil (2010) pp
[6] Borrelli C. “IEEE 802.3 cycle redundancy check”, XILINX, App. Note XAPP209. March, 2001.
[7] Digilent S3 SKB development boards, SPARTAN 3 FPGA, and ISE platform, http://www.xilinx.com
Multi-Level Synthesis on the Example of a Particle Filter

Jan Langer, Daniel Froß, Enrico Billich, Marko Rößler, Ulrich Heinkel
Chemnitz University of Technology
Chemnitz, Germany
{laja,daf,ebi,marr,heinkel}@hrz.tu-chemnitz.de

This research work was supported in part by the German Federal Ministry of Education and Research (BMBF) in the project HERKULES under the contract number 01 M 3082 and the project InnoProfile under contract number 03 IP 505.

Abstract: In this paper we compare two high level synthesis approaches on the example of a particle filter design. First, a C synthesis is used to transform C code into RT level VHDL. The second method employs the tool vhisyn to compile a set of operation properties written in ITL into RTL code. A particle filter component has been implemented using both methods and the resulting designs were synthesized and run on an FPGA board. The corresponding synthesis results have been compared to a hand coded design.
This work focuses on the comparison of two high level design methods starting from different levels of abstraction and hand coded VHDL. As a result, the resource utilization and timing of the high level designs are not prohibitively high. In particular, it is interesting to classify operation properties as an efficient prototyping and design method in certain application areas. In general, high-level design methods are applied when a more abstract, concise and maintainable system description is required and only a short design time is allowed. Operation properties represent a compromise between abstract C based methods and classical RT design.

I. INTRODUCTION

High level synthesis (HLS) raises the level of designing a system from the traditional register transfer (RT) level up to higher levels of abstraction. This step helps to improve both design productivity and achieved verification quality. In this paper, two very different approaches to HLS and a hand coded design on RT level are evaluated by means of a case study in performance and efficiency. A particle filter algorithm is used as an application example. The particle filter is an estimation technique for Bayesian models that is primarily well suited for localization purposes. Furthermore, the particle filter is a good example to illustrate certain aspects of the different design approaches of this work.

The first HLS approach is the generation of RT hardware based on a system description written in an augmented C language that will be translated into synthesizable VHDL. The resulting hardware implementation exploits coarse-grained parallelism on process level and low level parallelism on instruction level.

A fundamentally different approach is to utilize the InTerval Language (ITL), which was originally used as a formal verification technique. A system description is created as a set of Operation Properties that split the system’s behavior into operations of fixed length, which are connected by a property graph. Using ITL has been proposed as an intermediate HLS methodology that compensates specific drawbacks of the previous approach.

This paper is structured as follows. First, an overview of previous work in the field of HLS is given. The second section describes the specification of the particle filter design. In Sections IV and V, we provide some details about the high level design methodologies we have used. The paper concludes with a presentation of the design results and the respective performance of the two implementations compared to a hand coded VHDL design.

II. PREVIOUS WORK

High-level synthesis raises the design level with the objective to improve verification and system design productivity. Related work dates back 30 years, starting from algorithmic level [1] and moving up to system level. ANSI C/C++ and derivatives of them like SystemC, Single Assignment-C (SA-C) [2] and Handel-C [3] provide functionality similar to languages like Verilog and VHDL and aim at a unified hardware-software representation. Commercial and academic C to VHDL compilers like CatapultC, C-to-Silicon [4], Cyber [5] and others generate intermediate RT level code, which can be processed by logic synthesis tools afterwards [6]. C2H [7], Streams-C [8] and CoDeveloper [9] combine HLS and hardware software co-design. Tools for compiling other languages like Java [10] or Matlab to hardware appeared recently.

In general, it is a well understood process to generate executable and even synthesizable models from single temporal properties or sets of properties. Those models can be either used as monitors in system simulation and emulation or they form abstractions for early prototypes in system verification. Synthesizing temporal properties has mostly focused on Linear Time Logic (LTL) as implemented in PSL or SVA [11]–[14]. However, all those methods can only handle a subset of the operators of the property language or they can only process problems of very small complexity. Another problem
is ambiguity. In most cases, a property or a set of properties is satisfied by more than one exact behaviour. Thus, the synthesis method can either create a general solution that contains all consistent behaviour or an arbitrarily chosen specific solution.

In contrast to PSL or SVA, the synthesis of models from complete sets of ITL properties can profit from additional constraints that are not present in pure LTL properties. For one, the property graph connecting the operations imposes structural information that is used during synthesis. Furthermore, the special syntax of ITL (in many aspects more restricted than general LTL) and the assertions obtained during the check for completeness simplify the synthesis process and allow a much higher complexity to be handled. Thus, in [15] a tool vhisyn has been proposed to translate ITL descriptions to VHDL. This work uses the tool to generate the operation property based design to be compared to the other two design approaches.

Similar to this paper, [16] also uses ITL properties to generate executable models, called Cando objects. The algorithm does not employ the property graph structure, and on one hand is more general than our approach, but on the other hand less able to handle complex property descriptions.

Case studies of HLS tools are available (e.g. in [6], [17]–[19]), but limit the comparison exclusively to either programming language based HLS approaches or to RT level designs. To the best of our knowledge, there is presently no comprehensive case study available that comparatively qualifies the results of synthesizing a complete design of a complex algorithm at these levels of abstraction.

III. PARTICLE FILTER

This section presents a particle filter for localization estimation as a possible specification for a hardware implementation. The filter estimates an object’s three-dimensional position by incorporating distance measurements to reference points of known position. The localization problem is similar to that of the global positioning system (GPS).

The particle filter has been chosen as an example for this comparative work because it can be described as a short, well-understood piece of C code that will be used as a starting point for C based synthesis. Furthermore, the particle filter’s behavior can be split into meaningful operational properties, making it a feasible target of property based synthesis. However, despite these characteristics, a specific hardware implementation of this design on register-transfer level requires a lot of work. Considering these facts, the particle filter appears as an ideal candidate for a study to compare the design approach using operational properties with both a higher level method based on C and a lower level manual implementation.

A particle filter is a nonparametric implementation of the Bayes filter algorithm, where the posterior distribution is approximated by a set of random state samples (particles). The likelihood that the true system state lies in a region of the state space is proportional to the density with which that region is populated by particles. See [20] for a comprehensive introduction to particle filters. For reasons of approximation accuracy the number of particles has to be large, depending on the problem to be estimated. As a consequence, a software implementation on an embedded microprocessor platform is infeasible due to low update rates. This has made a hardware implementation necessary. In our case, the state to be estimated is the unknown position (x, y, z) of the object. Thus, every particle represents one possible position hypothesis

$$p^{[m]} = \left(x^{[m]}, y^{[m]}, z^{[m]}\right), \qquad (1)$$

where $m$ is the running index in the particle set. A filter update at time $t$ consists of the following steps:

1) Prediction. A hypothetical position $p_t^{[m]}$ for each particle is predicted at the current time step $t$ based on its former position $p_{t-1}^{[m]}$. To this end, every new particle is sampled from a proposal distribution that is based on a given state transition or motion model. In our case, the mobile node is assumed to move without any favored direction. Hence, this distribution is modeled symmetrically around $p_{t-1}^{[m]}$ as a three-dimensional normal distribution with identical variances $\sigma_p^2 \Delta t$ for x, y and z. Because positional uncertainty increases with time, the variance values are scaled with the time difference $\Delta t$ between the current time and the time of the last filter update.

2) Weight Calculation. The next step consists of calculating a weight $w_t^{[m]}$ for each particle $p_t^{[m]}$ by incorporating a distance measurement $d_t$ between the object and an anchor position $p_a$. This weight is the probability of the distance measurement under the particle $p_t^{[m]}$. In our case the weight is given by

$$w_t^{[m]} = \frac{k}{k + |\Delta d|}, \quad k > 0 \qquad (2)$$

$$\Delta d = d_t - \left|p_t^{[m]} - p_a\right| \qquad (3)$$

where $\Delta d$ is the difference between the measured distance $d_t$ and the expected distance (the Euclidean distance between particle and anchor position). The scaling constant $k$ characterizes the quality of the distance information. If predicted and measured distance match exactly, the weight reaches its maximum of one. With increasing difference the weight decreases asymptotically to zero at a rate set by the value of $k$.

3) Resampling. The final particle set is generated through a resampling procedure of the hypothetical set from step 1). The probability of drawing each particle from the set is given by its weight. The resulting particle set possesses duplicates of particles with large weights while particles of lower weight have been replaced. Thus, the resulting particle set focuses on regions with high posterior probability. In our implementation, a so-called low variance sampler from [20] is deployed. In
a first step, a single random number $r$ in the interval $[0; \bar{w}_t)$ is chosen, where $\bar{w}_t$ is the arithmetic mean of all particle weights. In the following steps the algorithm selects particles by repeatedly adding $\bar{w}_t$ to $r$ and by choosing the particle that corresponds to the resulting value. Figure 1 illustrates this resampling method.

4) Density Extraction. Finally, based on the discrete particle set maintained by the filter, a continuous density is estimated. We compute the mean and the covariances over all particles, assuming them to be normally distributed. The probability density at any position can then be calculated by a normal distribution using the obtained mean vector and covariance matrix.

[Diagram: the particle weights $w_t^{[1]}, w_t^{[2]}, \ldots$ laid end to end, sampled at the points $r$, $r + \bar{w}_t$, $r + 2\bar{w}_t$, ...]
Fig. 1. Low variance resampling procedure

[Block diagram with the blocks Weight Calculation, Weight FIFO, Position FIFO, Resampling, Prediction, Statistics and Power PC; labels: measurement, weights, particles, mean / covariance, at least M]
Fig. 2. Block diagram of the particle filter design.

To compare both high level design approaches to a hand coded VHDL design, the particle filter has been implemented using all three methods. All designs are structured similarly, as shown in Fig. 2. The three blocks prediction, weight calculation and resampling correspond to the update rules 1) to 3) above. The resampling block will not start operating until the cumulative sum of all particle weights is available. Therefore, the weights and positions of one complete set (M = 8192) of particles need to be stored in a FIFO that is located at the input of the resampling block. As soon as the resampled particles drop out of the resampler, they are processed by the prediction and weight calculation and again pushed into the FIFOs. The statistics block corresponds to update step 4) and calculates mean and covariance parameters over all particles.

To synthesize hardware from derivatives of the sequential software programming language C, several problems have to be considered. The programming model of pure C does not define certain aspects of the concurrency model, data types, timing specifications, memories, communication patterns and other constraints [21]. However, these aspects are crucial to synthesize the corresponding hardware structures. Handling of these issues differs between the available C synthesis tools and there appears to be no clear winning solution.

Nevertheless, all tools share a more or less semi-automated way to handle the various levels of parallelism to generate hardware with reasonable performance. For the work in this paper, the tool CoDeveloper by Impulse Accelerated Technologies has been used. It is the commercial successor of the Streams-C compiler. In general, the principles described in this paper also apply to other synthesis tools based on C that do not depend on explicit annotation of concurrency on a fine grained level.

On the lowest level, blocks of C code, bounded by control statements (e.g. case, if, for, ...), are automatically processed to exploit parallelism. Data dependencies between instructions are analyzed to extract implicit concurrency. Simple operations (e.g. addition of fixed point values) are directly mapped to the corresponding HDL statement, whereas more complex instructions are mapped to specific components from a library. The following allocation step decides how many operators will be instantiated and how memory access and data operations are scheduled into fixed time slices according to their estimated execution time.

The automatic transformation of loops and control structures generally results in state machines. Loops are either unrolled, so that each step is executed concurrently to minimize computational delay, or the steps are pipelined for area efficiency. Unrolling and pipelining span a rather large design space bound by the required speed (frequency) and size (area) of the chip. A constraint driven synthesis process explores solutions to meet the restrictions defined by the designer.

The original resampling algorithm of the particle filter is shown in the left part of Fig. 3. The resulting scheduling is annotated in the right part. The initialization phase takes two cycles due to a memory read. Loop conditions consume one cycle and the loop bodies two cycles each due to data dependencies and memory accesses.

C-Code                      Cycle   Block
U = rand() % step;          0       Block1
i = 0;                      0
j = 0;                      0
c = M*weight[i];            0-1
for (j=0; j<=M; j++)        2       Loop1
  while (U>=c)              3       Loop2
    i++;                    4
    c += M*weight[i];       4-5

Fig. 3. C code of the low variance resampling algorithm.

Pure C language is not especially well-suited to specify hardware. Therefore, a designer is forced to guide the synthesis
process in order to achieve the best possible performance. Guidelines by tool vendors and the research community include combining loops, combining or splitting memories, and marking loops for pipelining or unrolling. In general, it is necessary to review the synthesis results in order to optimize critical code sections. The resulting C code might be less efficient when run in software but more suitable for hardware synthesis. Fig. 4 shows an optimized version of the resampling algorithm of the particle filter. Rewriting the algorithm and advising loop pipelining reduced the latency in each path to two cycles.

C-Code                      Cycle   Block
U = rand() % step;          0       Block1
i = 0;                      0
j = 0;                      0
c = M*weight[i];            0-1
while (j < M)               1       Loop1
#pragma CO PIPELINE
  if (U>=c)                 1
    c += M*weight[i++];     2-3     Block2
  else {
    U += step;              4
    state2[++j] = state1[i]; 4-5    Block3
  }                         3

Fig. 4. Optimized C code of the resampling algorithm

All C-synthesis tools require a manual definition of parallelism on the coarse grained level. This is often achieved by processes or threads. In particular, the fundamental unit of concurrently executed computation in CoDeveloper is called process. Streams, signals, registers and shared memories are provided to synchronize processes and to extract the global data path. The implementation of the particle filter in Fig. 2 uses processes for weight calculation, prediction and resampling. Global arrays are used to buffer the particles between the processes, whereas all remaining communication utilizes streams.

V. OPERATION PROPERTIES

The commercial tool 360MV™ by OneSpin Solutions [22] introduces a Gap Free Verification methodology based on operation properties. It provides a special property syntax known as InTerval Language (ITL). A set of additional rules helps to write a complete set of properties that explicitly covers the design intent for every valid sequence of input values. The tool employs a powerful engine to prove the completeness of the property set as well as the correctness of each individual property with respect to the design. A property set is complete if the conjunction of the properties alone is able to map every valid sequence of input data to exactly one corresponding sequence of output data [23]. The completeness of a property set can be proven without the need of an actual design.

[Property graph with the nodes reset, idle, start, read and write]
Fig. 5. Property graph of the resampler.

property read is
assume:
  at t   : U >= c;
  at t   : i < M;
prove:
  at t+2 : i = prev(i,2)+1;
  at t+2 : c = prev(c + weight,2);
  at t+2 : U = prev(U,2);
  during[t+1,t+2] : wr_en = '0';
  during[t+1,t+2] : state2 = 0;
  at t+1 : rd_en = '1';
  at t+2 : rd_en = '0';
  ...
end property;

[Timing diagram of the signals above over the cycles t, t+1 and t+2]
Fig. 6. ITL code and timing diagram of the read property.

... needed, as shown in Fig. 5. The reset property sets the component’s state variables to defined values after a system reset has occurred. Furthermore, it defines the values of all output signals in this phase. The idle property is activated in the time between subsequent update cycles of the filter and sets the output values to zero. In case a new update cycle is started, the start property applies and prepares the internal variables for the following resampling process. The two properties read and write alternate according to the received particle weights. As soon as all particles have been read and as many particles have been written, the idle property is activated again.

The resampling component’s read operation picks the state and weight of the next particle from the FIFO and does not write a new particle to the output. This operation is shown in Fig. 6, together with a timing diagram that visualizes the behavior. The expressions in the assume part of the operation form the antecedent of the property and indicate the activation conditions. In this case, the read property is executed as long as variable U is greater than or equal to variable c and the particle read count i is smaller than the total number of particles M. The prove part forms the consequent of the operation and sets the output and internal signals to their new values.

In contrast to high level synthesis approaches based on algorithmic descriptions like the C language, the properties contain no loops. The user has to encode loop-like behavior implicitly in the sequence of the operations allowed by the property graph. Furthermore, the properties are designed such that they can partly overlap and therefore exploit a pipelining behavior in the resulting design.
To illustrate the property-based design, we want to show The length of the read operation is two cycles. So, during
one property of the resampling component. The resampler’s read’s third cycle at t + 2, the following property can be
behavior is first split into distinct operations based on the activated and the two properties overlap for one cycle. In
specification. It turns out, that exactly five operations are general, the use of operations is more beneficial for properties
[Figure: position plot; legend: real position, estimated position, anchors, covariance ellipsis]

TABLE I
DESIGN DESCRIPTION AND SYNTHESIS RESULTS

lines of code               2138 (vhdl)    1243 (vhi)     447 (C)
estimated design effort     1-2 weeks      3 days         2 days
slices                      3855 (28%)     6011 (43%)     4603 (33%)
slice FF                    5924 (21%)     5120 (18%)     5286 (19%)
4 input LUT                 3552 (12%)     8930 (32%)     6387 (23%)
BRAM                        70 (51%)       69 (50%)       82 (60%)
MULT18x18                   18 (13%)       23 (16%)       29 (21%)
max. freq. (in MHz)         182            25             113
avg. cycles per particle    2              3              66
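The last two rows of Table I can be combined into a rough particle throughput per design. The following is a minimal sketch, assuming that each design is actually clocked at its reported maximum frequency and that the average cycle count already includes any overlap between particles; the function and dictionary names are illustrative and do not come from the paper.

```python
# Values copied from the last two rows of Table I.
TABLE_I = {
    "vhdl": {"max_freq_mhz": 182, "cycles_per_particle": 2},
    "vhi": {"max_freq_mhz": 25, "cycles_per_particle": 3},
    "C": {"max_freq_mhz": 113, "cycles_per_particle": 66},
}


def particles_per_second(max_freq_mhz, cycles_per_particle):
    """Upper bound on throughput: clock rate divided by cycles per particle."""
    return max_freq_mhz * 1e6 / cycles_per_particle


def throughput_table(table=TABLE_I):
    """Return the estimated particles per second for every design in the table."""
    return {name: particles_per_second(**row) for name, row in table.items()}
```

Under these assumptions the ordering of the three designs follows directly from the table: the cycle count per particle dominates for the C design, while the low clock frequency dominates for the property-based design.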
... Xilinx Coregen Tool. This applies for example to the various arithmetic operators with large bit widths such as division, square root and multipliers.

2) The design generated by CoDeveloper is of moderate size and reasonably fast but it needs about 66 cycles to process one particle. In particular, CoDeveloper fails to implement a pipelined division and employs a sequential component that needs 64 cycles for one operation.

The runtime of the synthesis tools themselves has been negligible. The vhisyn tool runs for about 16 seconds to generate the particle filter design. The time scales linearly with the amount of hardware generated. It has been intended as a prototyping platform and offers a lot of room for speed improvements. In general, when developing vhisyn, it has been a major point not to include algorithms that do not scale well with big, industrial strength design blocks. By far the largest runtime is consumed by the tools that process the generated VHDL code and generate a bitstream file for the FPGA.

VII. CONCLUSION

In this paper, we used two high level design approaches to implement a particle filter design. We compared the two generated designs to a hand coded VHDL design of the same functionality. As expected, the hand coded design leads in terms of resource utilization and frequency requirements. However, when considering the improved ease of use and much lower code maintenance costs of both the property and the C code approach, the higher resource requirements and lower maximum frequency seem to be acceptable.

Furthermore, it can be seen that the C based methodology is more abstract than the property based method, which results in a very low implementation effort but reduced control over the cycle accurate behaviour.

One of the most important aspects of the property based design effort has been the constant use of formal verification, which provides the designer with information about the design quality. Such measures are the determination of all output signals at each time step, the absence of deadlocks in the control flow automaton and the unambiguous design behavior for every possible sequence of valid input data.

The paper classifies operation properties as an intermediate level of description for hardware blocks that offers a valuable design approach for certain applications. In a future development environment or even single hardware description language, algorithmic descriptions, operations and traditional RT level design will coexist and the developer chooses the most appropriate design method for each individual block. In certain cases, even a mixture of different methods might be applied.

REFERENCES

[1] M. C. McFarland, A. C. Parker, and R. Camposano, “Tutorial on high-level synthesis,” in Design Automation Conference (DAC). Los Alamitos, CA, USA: IEEE Computer Society Press, 1988, pp. 330–336.
[2] W. A. Najjar, W. Böhm, B. A. Draper, J. Hammes, R. Rinker, J. R. Beveridge, M. Chawathe, and C. Ross, “High-level language abstraction for reconfigurable computing,” Computer, vol. 36, no. 8, pp. 63–69, Aug. 2003.
[3] Celoxica Limited, “Handel-C Language Reference Manual,” 2005. [Online]. Available: www.celoxica.com
[4] Cadence Design Systems Inc., “Cadence C-to-Silicon Compiler Delivers On The Promise Of High-level Synthesis,” 2008.
[5] K. Wakabayashi, “C-based synthesis experiences with a behavior synthesizer, ‘Cyber’,” in Design, Automation, and Test in Europe (DATE). Munich: IEEE Comput. Soc, 1999, pp. 390–393.
[6] O. Hammami, Z. Wang, V. Fresse, and D. Houzet, “A Case Study: Quantitative Evaluation of C-Based High-Level Synthesis Systems,” EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[7] Altera Corporation, “Nios II C2H Compiler User Guide,” 2009. [Online]. Available: www.altera.com
[8] M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski, “Stream-Oriented FPGA Computing in the Streams-C High Level Language,” in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM). Washington, DC, USA: IEEE Computer Society Press, 2000, p. 49.
[9] M. Rößler, H. Wang, N. Engin, W. Drescher, and U. Heinkel, “Rapid Prototyping of a DVB-SH Turbo Decoder Using High-Level-Synthesis,” in Forum on Specification & Design Languages (FDL), Sophia Antipolis, France, Sep. 2009.
[10] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah, “Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary,” in Object-Oriented Programming (ECOOP). Springer, 2008, pp. 76–103.
[11] Y. Abarbanel, I. Beer, L. Gluhovsky, S. Keidar, and Y. Wolfsthal, “FoCs - Automatic Generation of Simulation Checkers from Formal Specifications,” in Computer Aided Verification. Berlin / Heidelberg: Springer, 2000, pp. 538–542.
[12] M. Boule and Z. Zilic, “Efficient Automata-Based Assertion-Checker Synthesis of SEREs for Hardware Emulation,” in Asia South Pacific Design Automation Conference (ASP-DAC). IEEE, 2007, pp. 324–329.
[13] R. Bloem, S. Galler, B. Jobstmann, N. Piterman, A. Pnueli, and M. Weiglhofer, “Specify, Compile, Run: Hardware from PSL,” Electronic Notes in Theoretical Computer Science, vol. 190, no. 4, pp. 3–16, 2007.
[14] K. Morin-Allory and D. Borrione, “Proven correct monitors from PSL specifications,” in Design, Automation, and Test in Europe (DATE), 2006, pp. 1246–1251.
[15] J. Langer and U. Heinkel, “High Level Synthesis Using Operational Properties,” in Forum on Specification & Design Languages (FDL), Sep. 2009, pp. 1–6.
[16] M. Schickel, “Applications of Property-Based Synthesis in Formal Verification,” Ph.D. thesis, Technische Universität Darmstadt, 2009.
[17] E. El-Araby, M. Taher, M. Abouellail, T. El-Ghazawi, and G. B. Newby, “Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study,” in Southern Conference on Programmable Logic (SPL). Mar del Plata: IEEE, Feb. 2007, pp. 99–106.
[18] S. Ahuja, S. T. Gurumani, C. Spackman, and S. K. Shukla, “Hardware Coprocessor Synthesis from an ANSI C Specification,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 58–67, Jul. 2009.
[19] L. Piga and S. Rigo, “Comparing RTL and high-level synthesis methodologies in the design of a theora video decoder IP core,” in Southern Conference on Programmable Logic (SPL). Sao Carlos: IEEE, Apr. 2009, pp. 135–140.
[20] S. Thrun, W. Burgard, and D. Fox, “The Particle Filter,” in Probabilistic Robotics. MIT Press, 2005, ch. 4.3, pp. 96–113.
[21] S. A. Edwards, “The Challenges of Synthesizing Hardware from C-Like Languages,” IEEE Design & Test of Computers, vol. 23, no. 5, pp. 375–386, 2006.
[22] (2010) OneSpin Solutions. [Online]. Available: http://www.onespin-solutions.com
[23] J. Bormann, “Vollständige funktionale Verifikation,” Ph.D. thesis, Universität Kaiserslautern, 2009.
LAYERED TESTBENCH FOR ASSERTION BASED VERIFICATION
Departamento de Computación
Facultad de Ciencias Exactas y Naturales
Universidad de Buenos Aires
email: [email protected], {spedre, patricia}@dc.uba.ar
Fig. 1. Design Under Verification: AC97 controller.

We have also introduced properties to verify that the design does everything it is supposed to do. The collection of these additional verification properties represents the functional coverage model of the DUV. The properties covered during the simulation provide a metric of verification progress.

... tested functionality during the simulation. Section 5 describes assertions and coverage points in more detail.

4.2.1. Driver

The driver translates the commands received from the Agent into the proper stimulus, and notifies back the execution of each command based on the “AC97_SYNC” signal.
The driver was divided internally into two sub-drivers, the “ac97_driver” and the “fifo_driver”. The latter sets/resets the “FIFO full” signal and the former serially injects the 256 bit audio frame into the DUV.

4.2.2. Monitor

The monitor takes the 8 bit “DATA_OUT” signal each time the “LOAD” signal is asserted, and reports the resulting 20 bits of sample data back to the Checker.
Fig. 2. Internal modules of layered testbench

5. ASSERTIONS AND COVERAGE

The PSL specification defines four layers: the Boolean layer, which contains HDL boolean expressions; the Temporal layer, which is the core of PSL and provides temporal relationships between boolean expressions; the Verification layer, which directs the use of properties for coverage or assertions; and the Modeling layer, which has statements to model the environment.
We have written PSL embedded in Verilog comments as a method to introduce assertions in the simulation environment.
We have introduced assertions at the AC97 interface based on the AC97 protocol specification; hence we have added properties to verify the duration of a frame (Example 1) and the duration of the “AC97_SYNC” signal in case of “Valid Frame” and “Valid Time Slot”.

Example 1:
    // psl property frame_len = always
    {rose(ac97_sync)} |->
    {1'b1[*256];rose(ac97_sync)};
    // psl assert frame_len;

At the FIFO interface we have added properties to verify that no new data is loaded if the FIFO is full (Example 2), and to ensure the right duration of the restart signal.

Example 2:
    // psl property fifo_full_load = always
    {full==1'b1}|=>{!rose(load)};
    // psl assert fifo_full_load;

We have introduced coverage points to verify that the design does everything it is supposed to do. Based on the DUV’s specification and the list of directed-testcases, we have created a set of properties that reports the functionality tested.
The AC97 controller module is based on the FSMD methodology, i.e. it consists of a data path controlled by an FSM. So, we have added coverage to each of the possible states (Example 3 shows some covered states) to ensure that the simulation covers all its possible states. Also, we have added assertions to verify that the control signals are valid at the right moment.

Example 3:
    // psl sequence fsm_idle = {state_reg==Idle_state};
    // psl cover fsm_idle;
    // psl sequence fsm_sync =
    {state_reg==Sync_state};
    // psl cover fsm_sync;
    // psl sequence fsm_valid_frame =
    {state_reg==ValidFrame_state};
    // psl cover fsm_valid_frame;
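To make the meaning of the two assertions of Examples 1 and 2 concrete, the sketch below checks them on recorded traces, one 0/1 sample per clock cycle. It is only an illustration in Python, not the simulator's PSL engine: the function names are hypothetical, rose() is assumed to be false in the first cycle, and an obligation that extends past the end of a finite trace is assumed not to count as a failure.

```python
def rose(sig, t):
    """True when sig changes from 0 to 1 at cycle t (assumed False at t == 0)."""
    return t > 0 and sig[t - 1] == 0 and sig[t] == 1


def check_frame_len(sync, frame_len=256):
    """Model of {rose(sync)} |-> {1'b1[*frame_len]; rose(sync)}.

    Returns the cycles at which a rising edge of sync is not followed by
    another rising edge exactly frame_len cycles later.
    """
    failures = []
    for t in range(len(sync)):
        if rose(sync, t):
            end = t + frame_len
            # Obligations beyond the recorded trace are left undecided.
            if end < len(sync) and not rose(sync, end):
                failures.append(t)
    return failures


def check_fifo_full_load(full, load):
    """Model of {full == 1} |=> {!rose(load)}.

    Returns the cycles at which full is high and load rises one cycle later.
    """
    failures = []
    for t in range(len(full) - 1):
        if full[t] == 1 and rose(load, t + 1):
            failures.append(t)
    return failures
```

For instance, with a shortened frame of 4 cycles, check_frame_len([0, 1, 0, 0, 0, 1, 0, 0, 0, 1], frame_len=4) reports no failure, while a trace whose second rising edge arrives one cycle early is flagged at both edges.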
6. CONCLUSION

This paper moves in the direction of adopting innovative tools and methodologies applied to testing and verification.
We have found that the time spent implementing the layered testbench environment is of the same order as that of each testcase in the monolithic approach. Hence, the automation of directed-testcases is reflected as a productivity increase in the verification process.
Assertions and coverage properties offer a higher level of abstraction because they are closer to the specification than traditional testbenches. They not only increase productivity, but also improve the robustness of verification.
Having implemented our own testbench framework entirely in Verilog, as a next step we are going further in the adoption of innovative tools and methodologies, such as SystemVerilog and constrained-random testcases, with the intention of adopting the OVM/UVM framework in the future.
DEVELOPMENT AND IMPLEMENTATION OF AN ADAPTIVE NARROWBAND
ACTIVE NOISE CONTROLLER
Fig. 1. Adaptive feedback ANC system with FXLMS.
Fig. 2. Block diagram of the experimental model.

2. SYSTEM IMPLEMENTATION

The ANC applied to a headset was implemented using a high performance DSP, the StarCore MSC7116 from Freescale Semiconductor Inc. The StarCore MSC7116 is a low cost, 16 bit word-length, fixed point DSP with four Arithmetic and Logic Units (ALU). It can produce 1000 MMACS at 266MHz. Due to its high processing power, complex calculations such as those required by the adaptive filters of Fig. 1 for both audio channels can be completed within a sampling period (“single-sample real time processing”).
The DSP program runs over SmartDSP, the Real Time Operating System (RTOS) designed specifically for the StarCore family. The SmartDSP Application Program Interface (API) [6], made up of functions developed in the C language, allows an easy configuration and utilization of the DSP peripherals. The API has a driver for every peripheral type, allowing the application program to communicate with it. The Time Division Multiplexing (TDM) peripheral driver was used for input and output of both DSP audio channels. Data come in and out at the sampling rate, which was selected to be 8kHz. This sampling rate was considered enough for this application, which aims to produce ANC at frequencies below 500Hz.
The application program was fully developed in the C language, using the so called “intrinsic” functions to optimize the adaptive filter routines. These functions are also written in C and belong to the compiler’s domain [7] rather than to the RTOS’s API. They are designed to perform fractional operations and take advantage of the DSP parallel processing capabilities. The intrinsic functions are directly inserted within the C language code, allowing the programmer to closely match the efficiency of the DSP assembler language.
The data precision was defined to be 16 bits word length for most of the data, using the fixed point format. Most data multiplications were then 16 by 16 bits, which is optimized on the DSP architecture.
The only exception was in the filter adaptation process, whose precision was improved by using 32 bits for the result of µ(n) times e(n), and then performing a 32 by 16 bits multiplication for the update of the W(z) coefficients in (2). This prevented the adaptation process from stopping for lack of precision, resulting in a performance improvement. The block diagram of one ANC audio channel is shown in Fig. 2.

2.1. DSP evaluation board

The MSC711xEVMT kit [8] is an evaluation board for applications using the DSP StarCore MSC711x. It was used to schedule and evaluate the program on the DSP from a PC. The board also integrates the stereo 16 bit CODEC AK4554 from AKM Semiconductor Inc. It was used to handle the analog input and output signals of the electro-acoustic transducers. Besides the ADC and DAC for both channels, the CODEC provides the required anti-aliasing and reconstruction low-pass filters. The TDM peripheral inside the DSP communicates with the CODEC to produce the input and output of both data channels.

2.2. The acoustic system

The headset used was the circumaural stereo SHP1900 from Philips. On circumaural headsets, the user’s ears are covered by the ear-cup, leaving small acoustic cavities near each ear. The Electret type omnidirectional microphone ECM-30 was used as error sensor microphone. It has sensitivity, bandwidth, signal to noise ratio and physical size appropriate for ANC applications.
The error microphone position within the headset determines the “quiet zone”. The selected position follows the ideal placement suggested by the authors of [9] and [10]. This place is the nearest possible to the user’s ear canal, and produces the flattest possible frequency response of the secondary path.
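The effect of keeping the product of µ(n) and e(n) in 32 bits, as described in Section 2, can be illustrated with a small fixed-point model. This is a minimal sketch and not the authors' DSP code: it assumes Q15 fractional data, truncation by arithmetic right shift, and a coefficient increment rescaled to Q15; the function names and the numeric values are hypothetical, and the sign convention of update (2) is left out.

```python
def q15_mul(a, b):
    """Multiply two Q15 integers and truncate the product back to Q15."""
    return (a * b) >> 15


def coefficient_step_16bit(mu, e, x):
    """Increment computed with the mu*e product truncated to 16 bits (Q15)."""
    step = q15_mul(mu, e)      # the low 15 bits of mu*e are lost here
    return q15_mul(step, x)


def coefficient_step_32bit(mu, e, x):
    """Increment computed with mu*e kept as a 32-bit Q30 product."""
    step32 = mu * e            # Q30, fits in 32 bits for Q15 operands
    return (step32 * x) >> 30  # 32 by 16 multiplication, rescaled to Q15


# Hypothetical small step size and error, with a large reference sample.
MU, E, X = 251, 248, 29491
```

With these values the 16-bit path yields an increment of 0, so the coefficient would stop changing, while the 32-bit path still yields an increment of 1. This is the kind of stall that the wider intermediate product avoids.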
2.3. The amplifier’s board
3. RESULTS
For the different signals tested, the user reports comfortable levels of residual noise inside the headset shell cavity, significantly lower than those without the ANC.
Future work will focus on improving the feedback ANC performance, and on broadband noise ANC within a headset. Different learning algorithms will also be investigated, analyzed and implemented under real conditions with commercially available components.
5. ACKNOWLEDGMENTS
6. REFERENCES
BIO-INSPIRED HARDWARE SYSTEM BASED IN ANIMALS OF COLD AND HOT BLOOD
... parameters: phase and frequency. Since the signal is digital, its amplitude can take only two values, 0 and 1, and therefore carries no useful information. In this case the frequency of the signal is used, so that it varies with the chosen physical magnitude. A circuit that satisfies these specifications is the so-called Ring Oscillator (RO).

2. RING OSCILLATOR

The RO, in its simplest form, is a combinational digital circuit consisting of a NOT gate with a feedback loop closed between its output and its input. After power-up it begins to oscillate, delivering a signal whose frequency depends on the delay time of the gate, and this delay varies as a function of the temperature.
If this circuit is described and implemented on a reconfigurable hardware device, the obtained behavior, according to our experience, will not be the expected one. Therefore, another description of the RO, with a two-input NAND gate, is used. One of the inputs is used as the “enable” line and the other is connected to the output. In addition, this version has the advantage that it can be controlled by the sub-system via “enable”.
To describe this circuit in VHDL, it is necessary to define an entity with an input port and an output port. The output port must be of the buffer type, because it is both read and written. The architecture is as simple as: “output <= nenable nand output;”.
If the FPGA architecture of [1] is taken as a design reference, where each island is a LAB (Logic Array Block) formed by a set of ALMs (Adaptive Logic Modules), the implementation of an RO occupies only one of these basic blocks. The feedback is built by employing one of the LCs (Local Connections) of the LAB, as shown in Fig. 2.

Fig. 2. Implementation of the RO. In blue, the LCs used for the feedback. In green, the ALM used by the logic.

All outputs used to implement the RO are combinational ones, so the frequency of oscillation will be higher than the maximum specified by the maker of the FPGA. That specification states the maximum frequency allowed for a clock signal. Its calculation takes into account the transit time through the sequential output, including the delay introduced by the flip-flop of this block. Hence, the output is used directly if the sub-system is combinational. However, if it is sequential, the frequency should be decreased to the value recommended in the device datasheet. One or more counters can be used to divide the signal and thus obtain the desired value. Furthermore, if the transducer is built with an odd number of linked gates, a lower frequency is obtained because of the additional delay inserted by these gates.
Finally, as shown in [2], the relationship between frequency and temperature is linear, approaching a straight line with negative slope within the working range specified by the manufacturer. In [3] it is observed that when the number of gates of the ring is increased, the rate of change of the output diminishes along with the decrease in frequency.

3. BEHAVIOR

Following the proposal above, a bio-inspired hardware system that is sensitive to temperature and self-contained in a device, its body, is built. This section describes the observable behavior that appears with changes of temperature in the animal kingdom, and the way to emulate such behavior with the proposed system.
In Nature, animals tend to keep the temperature of their own bodies constant, and in the same way, a reconfigurable device can operate within a closed interval that depends on its fabrication techniques. It therefore seems reasonable to explore the mechanisms used by animals for their thermal adaptation [4].
In the animal kingdom, the behaviors related to temperature are scattered over a wide spectrum, whose extremes are called cold blood and hot blood. The first alludes to the lack of internal mechanisms to stabilize the body temperature. The second refers to the capability of keeping it constant by using such mechanisms. Thus, animals that are closer to one extreme than to the other behave differently when the environment changes.
Near the cold-blooded end, a significant part of the time is spent searching for different places of the habitat in which to remain at particular hours of the day, and thus to keep the temperature regulated. Near the other extreme, time is spent on such activities only occasionally. Further, the former do not use their metabolism to cool down or warm up, while the latter do. Hence, for animals with identical body weight but at opposite extremes of the spectrum, the cold-blooded ones need less energy than the warm-blooded ones, due to the lower energy consumption of the former.
Close to the cold extreme, the metabolism is composed of various reactions that are activated at different temperature thresholds. Near the hot side, on the other hand, only one reaction or a few of them are needed. Thereby, hot-blooded animals stabilize their temperature to optimize their simple metabolism. Cold-blooded ones possess a complex metabolism composed of several reactions that are optimal at different temperatures. Thus, metabolic complexity is traded for energy consumption, and this trade-off confers advantages and disadvantages on different animals in specific situations.
With the presented observations, two plausible systems will be considered: one of them near the cold extreme and the other one close to the hot extreme.

3.1. Cold Blood System

3.2. Hot Blood System

If, on the contrary, the system has mechanisms to heat up or to cool down, it consists of three parts: the transducer, the sub-system (SS), and a circuit that varies the temperature of the device. Its diagram is shown in Fig. 3.2. The behavior is as follows: changes in the frequency of the transducer's output signal are due to temperature changes. Such changes influence the thermal control circuit, modifying its set-point in order to cause an effect contrary to that initiated by the environment. In this way, the working conditions of the sub-system are kept constant, ensuring the maximum performance.

[Figure: hot blood system; block labels: FPGA, Sensor(TE), Sub-System]
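The chain described in Sections 2 and 3.2 (ring oscillator frequency, counter division, linear frequency-to-temperature relation, opposing thermal action) can be sketched numerically. The following is only a behavioral illustration under stated assumptions: the ideal relation f = 1/(2·N·t_d) for a ring of N inverting stages with gate delay t_d is the textbook model rather than a measurement from this work, the proportional control law stands in for the unspecified thermal control circuit, and all function names, calibration values and gains are hypothetical.

```python
def ring_oscillator_frequency(n_stages, gate_delay_s):
    """Ideal RO frequency: one period takes two passes through all stages."""
    return 1.0 / (2.0 * n_stages * gate_delay_s)


def divided_frequency(freq_hz, counter_bits):
    """Frequency seen at the most significant bit of a binary counter."""
    return freq_hz / (2 ** counter_bits)


def temperature_from_frequency(freq_hz, f_ref_hz, t_ref_c, slope_hz_per_c):
    """Invert the linear model f = f_ref + slope * (T - T_ref), with slope < 0."""
    return t_ref_c + (freq_hz - f_ref_hz) / slope_hz_per_c


def control_action(temp_c, set_point_c, gain):
    """Proportional action opposing the deviation: > 0 heats, < 0 cools."""
    return gain * (set_point_c - temp_c)
```

As a usage example with made-up calibration data, a reading of 99 MHz against a reference of 100 MHz at 25 °C and a slope of -50 kHz/°C corresponds to 45 °C, and a set-point of 25 °C with gain 0.5 gives a cooling action of -10.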
... in other applications, was used. A way to lead the biological inspiration toward the emulation of the behavior of animals near the two extremes of the spectrum of their kingdom, the cold and the hot, was also shown. For this reason, it is believed that this work has shown concrete options that can be useful when implementing bio-inspired hardware systems.

5. ACKNOWLEDGMENTS

This work was carried out during 2010 with the partial support of BINID – UTN. Thanks to Ángel C. Veca for inviting me to take part in research and development, and to Eduardo Zavalla, INAUT – FI – UNSJ, for collaborating in the grammatical revision of this paper.

6. REFERENCES

[1] Altera Corp., Stratix II Device Handbook, vol. 1, sec. 2, pp. 1–106, May 2007.
[2] S. Lopez-Buedo, J. Garrido, and E. I. Boemo, “Dynamically Inserting, Operating, and Eliminating Thermal Sensors of FPGA-Based Systems”, IEEE Trans. Components and Packaging Technologies, vol. 25, no. 4, pp. 561–566, Dec. 2002.
[3] S. K. Yoo, D. Karakoyunlu, B. Birand, and B. Sunar, “Improving the Robustness of Ring Oscillator TRNGs”, ACM Trans. Reconfigurable Technology and Systems, vol. 3, no. 2, art. 9, pp. 1–30, May 2010.
[4] M. S. Blumberg, Body Heat: Temperature and Life on Earth, Cambridge, MA: Harvard University Press, pp. 1–69, 2002.
COMPARATIVE AND QUALITATIVE ANALYSIS OF FPGA DEVELOPMENT TOOLS

ABSTRACT

This work presents a study of the development tools of the main FPGA manufacturers currently on the market, in order to carry out a comparative and qualitative analysis of them. The study is based on an undergraduate research project implemented on FPGA that covered synthesis, simulation and IP-core generation tools.

1. INTRODUCTION

In recent years, the growth of reconfigurable devices and of their respective development tools, both in diversity and in density, has favored the implementation of complex and complete systems in integrated programmable logic (SoC – System on Chip). Altera [1], Lattice [2] and Xilinx [3] are examples of companies that develop solutions in the area of digital reconfigurable systems, each with its own development tool: Quartus II, Diamond and ISE, respectively.
The advantages of working with FPGAs [4] lie in the possibility of developing soft-cores [5] that are reusable (the same soft-core can be used in several projects, with no additional cost or design time) and portable (it can be adapted to several development platforms for reconfigurable devices). For this reason, the study of hardware description languages and of the platforms available on the market is extremely important.
The Verilog hardware description language was chosen for the implementation of the undergraduate research project because it is easier to learn than VHDL, since it closely resembles the widely known C language, while the platforms were chosen from the companies that currently stand out in the field.

2. LOGIC ANALYZER AND FIFO MEMORY

The undergraduate research project mentioned above comprised the development of a logic analyzer [7] for on-chip analysis of digital systems implemented on FPGA. It consists of an open-source device for the on-chip analysis of digital signals, intended to have an unrestricted number of input channels, which allows it to work with more complex circuits, and not to be tied to the FPGAs of a particular manufacturer.
Fig. 1 illustrates the block diagram of a logic analyzer, representing its main functions.

Fig. 1. Block diagram of the logic analyzer

The Time Base block defines whether data acquisition and storage are performed with a clock signal coming from the analyzed device or with an external one. The Trigger Stage block starts the data capture process and has two options: internal trigger, in which the acquired data are compared with a previously determined information word (input data), and external trigger, in which the procedure starts after a pulse is recognized on a specific external input. The Memory block represents a FIFO (First In, First Out) memory, responsible for storing the acquired data. The last block, Interface, responsible for the way the data are presented to the user, was not addressed by this project.
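The interplay between the trigger stage and the FIFO memory of the logic analyzer described in Section 2 can be modeled in a few lines. This is a behavioral sketch in Python rather than the project's Verilog: the function name, its parameters, and the choice to store the sample that fires the trigger are assumptions made only for illustration.

```python
from collections import deque


def capture(samples, trigger_word, depth, external_trigger=None):
    """Store up to `depth` samples in a FIFO once the trigger has fired.

    Internal trigger: fires when a sample equals `trigger_word`.
    External trigger: fires when `external_trigger[n]` is 1 (if provided).
    """
    fifo = deque()
    triggered = False
    for n, word in enumerate(samples):
        if not triggered:
            if external_trigger is not None:
                triggered = bool(external_trigger[n])
            else:
                triggered = word == trigger_word
        # Assumption: the sample that fires the trigger is stored as well.
        if triggered and len(fifo) < depth:
            fifo.append(word)
    return list(fifo)  # first in, first out order
```

For example, capture([1, 2, 3, 4, 5, 6], trigger_word=3, depth=2) returns [3, 4], and passing external_trigger=[0, 1, 0, 0] with samples [1, 2, 3, 4] and depth 3 returns [2, 3, 4].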
3. DEVELOPMENT TOOLS
... simulate them, to guarantee that they work correctly. For this purpose an IDE (Integrated Development Environment) is used: a development tool containing the applications responsible for the desired processes (design, synthesis, place-and-route and verification), as illustrated in Fig. 2.

... family contains a two-dimensional array of LABs (Logic Array Blocks), each containing 16 logic elements (LEs), small logic units responsible for implementing the user's logic functions, each with a four-input LUT (Look-Up Table), a programmable register, etc. This architecture also includes memory blocks called M4K, able to implement several types of memory (single-port RAM, ROM, FIFO), and multiplier blocks optimized for digital signal processing (DSP).
traces that interconnects these functional elements and carries signals between them. Each of these elements has an associated switch matrix that allows multiple connections in the routing.

3.3. Lattice Diamond 1.0

Lattice Diamond 1.0 belongs to Lattice Semiconductor, pioneer of the ISP programming system and one of the three largest manufacturers of reconfigurable ICs in the international market.

This IDE includes Synopsys Synplify Pro as its integrated synthesis tool which, unlike in the other IDEs, is an application from another company, Synopsys [8]. Its advantage is support for synthesizing mixed Verilog and VHDL designs.

For simulation, this IDE uses an external tool that requires its own project, Active-HDL Lattice Web Edition 8.2, an application that, like the synthesis tool, belongs to another company, Aldec [9]. It also stands out for its simulation of mixed VHDL and Verilog code, as well as advanced verification and many debugging features.

Like the other IDEs, this one has its own IP-core generator, IPexpress. This application gathers several functional modules that help generate VHDL or Verilog code, which can be reused as the user needs, speeding up the project and improving its results. The modules provide I/O, arithmetic and memory functions, etc.

Each device of the LatticeXP2 family, the family representing Lattice in this project, has an array of logic blocks surrounded by PICs (Programmable I/O Cells). Between the rows of logic blocks lie rows of EBRs (Embedded Block RAM), 18-Kbit memory blocks (RAM, ROM or FIFO), and a row of DSP (Digital Signal Processing) blocks. There are two types of logic block: the PFU (Programmable Functional Unit), responsible for logic, arithmetic, RAM and ROM functions, and the PFF (Programmable Functional Unit without RAM), responsible for logic, arithmetic and ROM functions. Both have four interconnected slices (four-input LUTs and two registers, or LUTs only).

4. RESULTS

The synthesis process, in all of the IDEs, is responsible for checking the syntax of the code, compiling it (translating and optimizing it into a set of recognizable components) and mapping it (converting the components from the compilation phase into primitive components of the target technology).

When synthesizing the soft-core implemented during the undergraduate research project, one notes that, when the implemented memory is chosen, the logic analyzer module ends up limited to small input parameters, data width and number of words. This does not happen with the memory obtained through the IP core, because the synthesis process then uses memory blocks instead of logic elements.

To compare the three synthesis processes, the reports they produced were examined, using both the implemented soft-core and the generated IP core. This analysis is difficult because of the different architectures adopted by each device. With the reports duly analyzed, Table 1 was built. It presents comparative data on the two types of memory synthesized by the three tools covered in this project. To make the table easier to understand, some remarks on the reports and on the items presented follow.

The report produced by the Altera IDE, the Analysis & Synthesis Summary report, includes, among other information, the total number of logic elements (including the total combinational functions and dedicated logic registers), the total number of registers, and the total number of memory bits used and available. Table 1 lists the data on logic elements, registers and memory bits.

The report produced by the ISE IDE, the Synthesis Report, takes a different approach: it analyzes cell usage in the synthesis process, dividing the cells into BELS (basic logic elements such as inverters, LUTs and muxes), flip-flops/latches and buffers. Throughout this document, flip-flops/latches are taken to be registers, while LUTs are taken to be the logic elements. To analyze memory-block usage, the reports generated for the Map process must also be examined. Table 1 lists the data on LUTs, registers and memory blocks. The reports of this tool define a memory block as a set of 18 Kbits of memory.

The Diamond IDE provides the Resource Usage Report, which also expresses the items to be compared in terms of LUTs and register bits. As with ISE, the reports generated in the Map process were used to analyze the memory blocks. Table 1 lists the data on only those LUTs that can be used as RAM, the register bits and the memory blocks. The reports of this tool define a memory block as a set of 18 Kbits of memory. Mainly because it has these two types of logic block, with and without RAM, the synthesis tool achieves very good results, optimizing the use of registers and other elements.
Table 1. Comparison chart (parameters: 8-bit data width and 1024 data words in total)
5. ACKNOWLEDGEMENTS
6. REFERENCES
AUTOMATIC VHDL GENERATION FROM A PETRI NET. COMPARATIVE ANALYSIS OF
THE SYNTHESIS RESULTS
Roberto Martínez, Javier Belmonte, Rosa Corti, Estela D’Agostino, Enrique Giandoménico
Fig. 1. PIPE screen with the MakeVHDL module.

this last paper is the possibility of applying activity control to the VHDL components, to save the energy consumed by the device, using the principle of "activity propagation". In [7], the authors decompose the model into basic structural blocks of a Petri net, each made up of one place and one transition, and then implement each block in a configurable logic block (CLB) of an FPGA. The authors of [8] report the development of Animator4FPGA, a closed-source tool that allows controllers to be described with Petri nets and then generates the corresponding VHDL. In [9], it is proposed that Petri nets be used as a specification language in the hardware/software co-design of embedded systems, subject to conditions, among them that code for different platforms can be generated from this specification and used for simulation, verification and implementation. The work described in [10] reports a comparative study of the resources used in the synthesis of finite state machines (FSMs), for different machine description styles and state encoding methods. It proposes a methodology based on analyzing the synthesis reports, counting slices, flip-flops and LUTs. The theoretical maximum clock frequency estimated in the reports is also analyzed.

The work presented here is based on a matrix description of the Petri net model and, unlike the works described above, builds on a freely available open-source tool.

3. AUTOMATIC VHDL GENERATION

Building a Petri net model of a system, whatever its nature, consists of drawing graphs or diagrams of different styles, according to the type of Petri net used for modeling. A Petri net can also be used directly as a way of specifying the system. In either case, underlying the diagram are the mathematical rules that support the description and also make it possible to run simulations to verify its behavior.

One of the many existing graphical tools for building and simulating Petri nets is PIPE (Platform Independent Petri Net Editor) [11], which is open source and written in Java. PIPE is structured so that specific features can be added through modules that plug into its interface. For VHDL generation, we implemented a module (MakeVHDL) that directly translates the Petri net drawn in PIPE into VHDL code, following the method described in [4]. It performs the translation from a global view of the system, starting from the matrix representation of the associated Petri net, which bounds the complexity of the resulting VHDL description. The implementation of the net architecture consists of three blocks that communicate through signals. The first determines which transitions are enabled to fire, the second defines the new marking and the third assigns the outputs.

Fig. 1 shows a PIPE screen with the MakeVHDL module added. The module includes facilities for identifying the inputs and outputs of the system. It also allows logic conditions to be added to the transitions and conditional outputs to be defined. The translation methodology proposed in [4] was extended with these elements. As a result, the complete VHDL code is generated from the Petri net, including entities, architectures, signals, input and output ports and the other elements needed to obtain a synthesizable VHDL description. The VHDL code generated by MakeVHDL can be saved as a file or copied and pasted into the chosen design environment. In our case, Xilinx ISE 8.2i was used to verify the generated code and to analyze the synthesis results.
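The three-block structure described above (enabled transitions, new marking, outputs) can be illustrated with a small software model. The following Python sketch is not the code produced by MakeVHDL. It is a minimal illustration under stated assumptions: the net (two parallel processes that restart once both have finished), the place and transition names, the output names, and the rule of firing every enabled transition in the same step are all hypothetical, and input conditions on transitions are left out.

```python
# Hypothetical example net (not taken from the paper): two parallel
# processes A and B that restart only after both have finished.
PLACES = ["a_run", "a_done", "b_run", "b_done"]
TRANSITIONS = ["t_a", "t_b", "t_sync"]

# PRE[p][t]: tokens that transition t removes from place p.
PRE = [
    [1, 0, 0],  # a_run
    [0, 0, 1],  # a_done
    [0, 1, 0],  # b_run
    [0, 0, 1],  # b_done
]
# POST[p][t]: tokens that transition t adds to place p.
POST = [
    [0, 0, 1],  # a_run
    [1, 0, 0],  # a_done
    [0, 0, 1],  # b_run
    [0, 1, 0],  # b_done
]


def enabled(marking):
    """Block 1: flag every transition whose input places hold enough tokens."""
    return [
        all(marking[p] >= PRE[p][t] for p in range(len(PLACES)))
        for t in range(len(TRANSITIONS))
    ]


def next_marking(marking, fire):
    """Block 2: apply the incidence matrix (POST - PRE) of the fired transitions."""
    return [
        marking[p]
        + sum(POST[p][t] - PRE[p][t] for t in range(len(TRANSITIONS)) if fire[t])
        for p in range(len(PLACES))
    ]


def outputs(marking):
    """Block 3: derive outputs from the marking (output names are made up)."""
    return {"a_busy": marking[0] > 0, "b_busy": marking[2] > 0}


def step(marking):
    """One clock cycle: fire all enabled transitions (safe here: no conflicts)."""
    return next_marking(marking, enabled(marking))
```

Starting from the marking `[1, 0, 1, 0]`, one call to `step` moves both processes to their finished places, and a second call fires the synchronizing transition and restores the initial marking. Firing all enabled transitions at once is only safe for nets without conflicts; a net with shared resources would need an arbitration rule.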
Fig. 2. (a) Concurrent processes. (b) Shared resources.
Fig. 3. Resources reported for parallel processes (FFs and slices, Petri vs. FSM; 2, 3, 6, 8 and 10 processes).
Fig. 4. Maximum frequency (MHz) for parallel processes (Petri vs. FSM; 2, 3, 6, 8 and 10 processes).
Fig. 5. Resources reported for shared resources (FFs and slices, Petri vs. FSM; 2, 3 and 6 processes).
4. COMPARATIVE ANALYSIS OF THE SYNTHESIS RESULTS

Two case studies were analyzed in which modeling with Petri nets is more advantageous than with FSMs. On the other hand, the XST synthesis tool optimizes the implementation of designs that are described in the format recommended for FSMs. Our aim in comparing the synthesis results for the VHDL code obtained from both models was to measure the impact of using Petri nets on the operating frequency and on resource usage. The analysis was based on the synthesis reports, since they are a key indicator of how the tool interprets the code. The VHDL code was obtained with MakeVHDL for the Petri nets, while for the FSMs it was written following the two-process format proposed by XST.

4.1. Concurrent processes

A Petri net is advantageous for modeling a system that evolves in parallel and is made up of several processes that cooperate towards a common goal. Fig. 2 (a) shows the Petri net diagram of two parallel processes that restart once both have finished executing.

The FSM-based model of the same problem was modularized, using one machine per process. The problem was solved for two, three, six, eight and ten processes. Fig. 3 shows the number of flip-flops (FFs) and slices used in synthesis for both models, with the area optimization option. Fig. 4 shows the corresponding maximum operating frequencies. In the worst case the difference in maximum frequency reaches about 25%, which is not significant in industrial systems.

Regarding resource usage, synthesis of the Petri model uses three times as many FFs as the FSM, while the number of slices used is similar for both.

4.2. Shared resources

Systems in which several processes share one or more resources can be represented with a Petri net
as shown in Fig. 2 (b), which depicts two processes A and B that share resource R.

This type of system was modeled for two, three and six processes sharing a single resource. Fig. 5 compares the number of FFs and slices inferred by XST during synthesis for the two models. Fig. 6 shows the maximum frequency values. With shared resources, the FSM uses 50% fewer slices. As for FFs, the comparison shows that the FSM adds one more FF than the Petri model for each process added.

Fig. 6. Maximum frequency (MHz) for shared resources (Petri vs. FSM; 2, 3 and 6 processes).

5. CONCLUSIONS

The analysis shows that using Petri nets to model the proposed systems has a synthesis cost, in resources used and operating speed, that is in general higher than with FSM modeling. However, the methodology proposed in this work, outlined in Fig. 7, automatically translates the graphical Petri model into synthesizable code in all cases and eliminates any possibility of coding errors. Moreover, when the physical system is modified, updating its description is considerably simpler with Petri nets than with FSMs. The software module developed is based on an open-source tool and is therefore freely available. Finally, it can be concluded that when the design requirements on chip resource usage are not critical, the proposed method is very convenient.

Fig. 7. Automatic generation of VHDL code.

6. REFERENCES

[1] M. Uzam and A.H. Jones, "Design of a Discrete Event Control System for a Manufacturing System Using Token Passing Ladder Logic", Proc. of the CESA'96 IMACS Multiconference, Symposium on Discrete Events and Manufacturing Systems, July 1996, pp. 513-518.
[2] I. Viskic, D. Rainer, "A Flexible, Syntax Independent Representation (SIR) for System Level Design Models", 9th EUROMICRO Conference on Digital System Design (DSD'06), 2006, pp. 288-294.
[3] K. Keutzer, S. Malik, R. Newton, J. Rabaey and A. Sangiovanni-Vincentelli, "System level design: Orthogonalization of concerns and platform-based design", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 19 (12), Dec. 2000.
[4] R. Martínez, J. Belmonte, R. Corti, E. D'Agostino, E. Giandoménico, "Descripción en VHDL de un sistema digital a partir de su modelización por medio de una red de Petri", in Proc. V Southern Conference on Programmable Logic, Apr. 2009, pp. 7-11.
[5] L. Gomes, A. Costa, J.P. Barros, P. Lima, "From Petri net models to VHDL implementation of digital controllers", The 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON), Taiwan, Nov. 2007, pp. 94-99.
[6] D. Andreu, G. Souquet, T. Gil, "Petri Net Based Rapid Prototyping of Digital Complex System", 2008 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2008, pp. 405-410.
[7] E. Soto, M. Pereira, "Implementing a Petri net specification in a FPGA using VHDL", Int. Workshop on Discrete-Event System Design, Przytok, Poland, June 27-29, 2001.
[8] F. Moutinho, L. Gomes, "From Models to Controllers Integrating Graphical Animation in FPGA through Automatic Code Generation", IEEE International Symposium on Industrial Electronics (ISIE 2009), 2009, pp. 712-717.
[9] L. Gomes, J.P. Barros, A. Costa, R. Pais, F. Moutinho, "Towards Usage of Formal methods within Embedded Systems Co-design", Proc. of the 2005 IEEE Conference on Emerging Technologies and Factory Automation, Vol. 2, pp. 284-287.
[10] N. I. Rafla, B. L. Davis, "A Study of Finite State Machine Coding Styles for Implementation in FPGAs", 2006 IEEE International Midwest Symposium on Circuits and Systems, pp. 337-341.
[11] P. Bonet, C.M. Llado, R. Puijaner and W.J. Knottenbelt, Platform Independent Petri net Editor 2, http://pipe2.sourceforge.net/ (accessed 10/10/10).
USING A WII REMOTE AND A FPGA TO DRIVE A MECHANICAL ARM TO
AID PHYSICALLY CHALLENGED PEOPLE
ABSTRACT
The DE2 board is configured with the Nios II soft-core, a general-purpose processor, along with the needed peripherals, such as memory chips and communication port controllers. Given the overall complexity of the system, it was more efficient to build it in layers, configuring the board with an operating system (uClinux) and running on top of it the programs that deal with communication and data treatment.

The mechanical arm used is a simple model manufactured by Lynxmotion, which has four degrees of freedom and takes commands over a serial link.

A broad search through many paper databases turned up plenty of works on using the Wiimote as a motion capture device for a variety of purposes. However, we found that none of these works uses an FPGA system to gather and interpret the data. Thus, we believe that the research done to accomplish this kind of system, integrating an FPGA system with a Bluetooth device, is completely new.

2. PROPOSED SYSTEM

The proposed system developed in this paper was built using the DE2 board. Fig. 3 depicts the overview of the board with the peripherals attached and the hardware and software configured. Note that the USB HUB, Bluetooth USB adapter and USB Flash drive boxes represent physical devices, but the C Software, uClinux OS and Nios II Processor boxes represent logical layers, as they are either stored in the SDRAM memory chip, in the case of the OS and the C programs, or configured in the FPGA chip, in the case of the processor.

attached to the internal bus. This process was done within the Quartus II software provided by Altera. The focus of the proposed system was to establish communication with the Wiimote controller, therefore only a few devices were required. The process of choosing and connecting modules using the SOPC Builder tool inside the Quartus II software is described in [2].

The main devices used by the system were: the FPGA chip, to host the Nios II processor; the SDRAM memory chip, which is loaded with the uClinux operating system; the USB controller, to attach the Bluetooth adapter and the flash drive; and the serial UART controller, to drive the mechanical arm.

The Nios II processor version used was the fast (/f core) version, which provides the best performance but costs more in FPGA usage [3].

After setting up the system hardware, the uClinux distribution was configured. A Linux system was chosen because it is open source, highly configurable and actively maintained by its developers. Also, a Linux system allows one to read the full source code and to make suitable modifications. The version of uClinux-dist used in the system is from July 30th, 2009, and is hosted at the Nios II Community's FTP site [4]. This distribution was made by the community of Nios II users and targets Altera boards (including the DE2 board). The whole set of tools and source code is called uClinux-dist.

The uClinux compilation parameters are set using the make menuconfig command, which opens a screen containing all the programs, libraries and options available to compile with or change. All of the settings are divided into two categories: Kernel Settings and Application/Library Settings. Once the configuration is finished, a simple make command will start compiling the source code into an image file. Fig. 4 shows a typical configuration menu screen.
With the board attached to the computer, the image is uploaded through the Nios II Embedded Design Suite software, but only after the .sof file is configured. The process is illustrated in Fig. 5.

use, and discovered one called Wiiuse [6], which, like BlueZ, is also open source. This library is well documented and well written, allowing painless modifications to be made. With Wiiuse, we were able to run the sample program included with it and modify it to perform some tests.

3. RESULTS

Fig. 10. With the controller just past the center position (when it is facing up), the corresponding LED turns on (3rd LED).

The proposed system is currently under development. The mechanical arm that will be driven by the system is under study, and the software to control it is being developed. Our goal is to allow physically challenged people to control a robotic arm with ease, in order to make the arm perform simple tasks, like pushing a heavy object around or reaching objects that are normally out of reach. Although this idea is not new [7], the use of an FPGA to gather the data produced by the Wiimote and to drive the mechanical arm is original. The use of embedded devices instead of personal computers for such a task represents a new branch of research, allowing real-time responsiveness and portability for the system.

5. ACKNOWLEDGEMENTS

We wish to thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for its financial and institutional support to this research, registered under the process number 2010/07179-8. Emerson C. Pedrino is also grateful to FAPESP for the process 2009/17736-4.

6. REFERENCES

[4] "Nios II Community FTP", Accessed Oct 14, 2010. [Online]. Available: http://www.niosftp.com/pub/
[5] "BlueZ", Accessed Oct 15, 2010. [Online]. Available: http://www.bluez.org/
[6] "wiiuse - The Wiimote C Library", Accessed Oct 15, 2010. [Online]. Available: http://www.wiiuse.net/
[7] C. Smith, H. I. Christensen, "Wiimote Robot Control Using Human Motion Models", The 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, USA (2009).
SYSTOLIC MATRIX-VECTOR MULTIPLIER FOR A HIGH-THROUGHPUT
N-CONTINUOUS OFDM TRANSMITTER
processing time is given by

\[ T_{seq} = \frac{\rho N^2}{f} \qquad (2) \]

where $\rho$ represents the processing time in the elemental multiplier, which may be fixed at one if pipelining is applied, and $f$ represents the system clock rate.

The N-Continuous technique has been proposed for out-of-band power reduction in OFDM systems [5], and it is based on a correction vector obtained by means of a matrix-vector multiplication. We analyze the sequential operation in the correction calculation. In this case, $N$ is the number of subcarriers of the OFDM system. Based on the 3GPP E-UTRA/LTE specification for wireless communications, $N = 300$ is chosen [6]. Then, a typical clock rate for wireless communication architectures implemented in FPGA is considered, $f = 50$ MHz. A typical complex multiplier scheme operates in four cycles; with straightforward pipelining, $\rho = 1$. So, the global processing time is $T_{seq} = 1.8$ ms. If a Guard Interval (GI) is applied with fraction $GI = 22/300$ [6], the transmitter can achieve a final bandwidth of

\[ B = \frac{(1 + GI)\,N}{T_{seq}} = 179\ \text{kHz} \qquad (3) \]

at most. Unfortunately, this bandwidth is less than the one specified in [6]. Also, a practical OFDM transmitter includes other operations that can further reduce the presented speed performance.

2.2. Parallel Operation Performance

In the application considered, the timing requirement may not be achieved by using the single-multiplier scheme discussed above. An alternative is to use $K$ multipliers and synchronize them for simultaneous operation. Based on this approach, (1) can be turned into an $L$-element addition

\[ r_i = \sum_{k=0}^{L-1} r'_{i,k}, \qquad i = 0, \ldots, N-1 \qquad (4) \]

where $L = N/K$, and $r'_{i,k}$ represents the $k$-th partial addition

\[ r'_{i,k} = \sum_{j=0}^{K-1} M_{i,j+kL}\, v_{j+kL}, \qquad i = 0, \ldots, N-1. \qquad (5) \]

The matrix element selection in column-sense is obtained from $j + kL$, where $k = 0, \ldots, L-1$. Then, the model in (4) indicates $L$ steps and $K$ elemental multipliers for the proposed architecture.

The processing time for the complete calculation is reduced depending on the $K$ parameter. So, it follows the expression $N^2/K$ as

\[ T_{par} = \frac{\rho N^2}{K f}. \qquad (6) \]

For the N-Continuous OFDM transmitter described for analysis we can express the bandwidth as

\[ B = \frac{(1 + GI)\,N}{T_{par}}. \qquad (7) \]

By selecting 32 parallel multipliers, $K = 32$, the obtained bandwidth is 5.72 MHz, according to the resulting processing time of 56.25 µs for the matrix-vector multiplier. These values achieve the required bandwidth in [6]. Also, the processing period for the matrix-vector multiplier does not cover the whole OFDM symbol duration; a fraction of the symbol transmission period may therefore be used for other OFDM-specific processing.

3. ARCHITECTURE DESIGN

A processing unit fed by the $v$ vector is considered. We suppose that the elements of the $M$ matrix have been previously stored in an internal memory. Then, the objective is to present the result of the calculation as fast as possible. According to (5), we can build a parallel multipliers bank composed of $K$ elemental multipliers. Since each one has two inputs, $a$ and $b$, we can join all the multiplier inputs and form two buses, A and B, which are fed by $v$ and $M$, respectively. This scheme is depicted in Fig. 1. According to the parallel concept, bus A is fed by the $k$-th fraction of the $v$ vector in each cycle. In this way, the complete load of $v$ requires $L$ cycles. As stated in (5), this fraction of the $v$ vector is multiplied by the $k$-th fraction of the $i$-th row of $M$.

Fig. 1. Functional Diagram

Note that the multipliers bank output represents the elements to be summed in (5), from which $r'_{i,k}$ is computed. Since the values of $r'_{i,k}$ are sequentially generated, it is necessary to store each one of the set $k = 0, \ldots, L-1$. The value of $r_i$ may be computed by means of a new addition once the $L$ partial additions are obtained.

A special consideration is the need to feed the two $K$-length vectors represented by $\{M_{i,j+kL}\}$ and $\{v_{j+kL}\}$ for $j = 0, 1, \ldots, K-1$ simultaneously. This requirement allows the calculation of every element of $r$ in only $L$ clock cycles. However, this configuration implies special memory units to accomplish the described behavior for $v$ in the input bus. Also, it is observed that after $L$ cycles, each fraction of the vector $v$, i.e. $\{v_{j+kL}\}$ for $j = 0, 1, \ldots, K-1$, is required again for calculation. Then, we can feed it back by means of a simple circular buffer connected to A. This is a consequence of the systolic approach in the proposed system and allows an important simplification in the design.

If we select $R$ bits for resolution, then the buses A and B must be sized as $R \times K$ bits. In turn, bus B needs to be connected to a memory where the $N^2/K$ words of $R \times K$ bits are allocated to completely represent $M$.

This analysis remains valid whether fixed-point or floating-point number representation is used. For complex number representation, the storage units and the add stages may process real and imaginary parts independently.

Fig. 2. Cascaded addition for Adder 1
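The timing figures quoted in Section 2 and the partial-sum decomposition of (4) and (5) can be reproduced with a short script. The Python sketch below is only an illustration, not the authors' hardware description: the function names are made up, and it assumes that K divides N. It also indexes columns as kK + j so that the L blocks of K columns cover every column exactly once; the text writes the index as j + kL, which gives the same partition only when K = L.

```python
def t_seq(n, f_hz, rho=1):
    """Sequential processing time, following (2): rho * N^2 / f."""
    return rho * n * n / f_hz


def t_par(n, k_mult, f_hz, rho=1):
    """Processing time with K parallel multipliers, following (6)."""
    return rho * n * n / (k_mult * f_hz)


def bandwidth(n, gi, t_proc):
    """Achievable bandwidth, following (3) and (7): (1 + GI) * N / T."""
    return (1 + gi) * n / t_proc


def blocked_matvec(m, v, k_mult):
    """Compute r = M v as L partial sums of K products each.

    Assumption of this sketch: K divides N and block k covers the
    columns k*K .. k*K + K - 1.
    """
    n = len(v)
    if n % k_mult != 0:
        raise ValueError("this sketch assumes that K divides N")
    steps = n // k_mult  # L in the text
    result = []
    for row in m:
        # One partial sum per step: K multiplications done "in parallel".
        partial = [
            sum(row[k * k_mult + j] * v[k * k_mult + j] for j in range(k_mult))
            for k in range(steps)
        ]
        # Final addition of the L partial sums, as in (4).
        result.append(sum(partial))
    return result
```

With N = 300, f = 50 MHz and GI = 22/300, `t_seq` gives about 1.8 ms and `bandwidth` about 179 kHz, while `t_par` with K = 32 gives 56.25 µs and about 5.72 MHz, matching the values in the text.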
3.1. Data Propagation Optimization

Although the former section presents system-level requirements, this section discusses the datapath in the design. If we consider the $r'_{i,k}$ calculation, the implementation may be based on tree adders, as shown in Fig. 2. We can use $K - 1$ two-input adders and finally obtain $r'_{i,k}$ by performing a cascaded connection. Unfortunately, it was shown that this scheme may strongly affect the global timing performance because of the critical path extension. This is a consequence of the extensive combinatorial logic inferred by the adders, which defines a long propagation path. According to [4], the critical path in a VLSI circuit is established by the interconnection between latches. Then, as $K$ increases, more combinatorial adders are inserted into the path between two latches and the performance becomes poor.

Although the technique in the previous section was to increase $K$ to improve the speed performance, it is possible to obtain the opposite effect because of the critical path extension. An appropriate parameter selection criterion may be stated. On the one hand, $K$ may be chosen as large as possible, limited by the area resources. On the other hand, this selection affects the speed performance in a negative way if the critical path extension becomes too high. This behavior was observed in the wireless communication application considered, for $K = 32$. Nevertheless, other $(N, K)$ settings may be located at a beneficial point of the parameter space.

Based on [4], we include flip-flops which interrupt the propagation paths and shorten them, producing a pipelined architecture for our design. Although delay cycles are introduced into the system as a result of this technique, the global performance is improved since $K$ is sufficiently large and the operation of the transmitter is periodic.

3.2. Final Settings

The complete design is depicted in Fig. 3, where a pipelined architecture is used for Adder 1. This way, the speed bottleneck imposed by high values of $K$, as desired for high-parallelism operation, is improved and the required operation frequency is allowed.

In this scheme, $K - 2$ delay units are inserted into the connection between the multipliers bank output and the elemental adders of the tree. The delay value is fixed at one for the output in position $K - 3$ and it is increased by one as the position decreases down to position 0 in Fig. 3. This tree in Adder 1 represents the simplest approach and achieves a good performance in the mentioned application; however, it may be improved further by defining a symmetric tree adder [4].

In our design, $r'_{i,k}$ is available $K - 2$ cycles after the multipliers bank produces its output. In turn, once the first $r'_{i,k}$ element is calculated, it is necessary to store it until the complete set for $k = 0, \ldots, L-1$ is available. As $K$ increases, $L$ becomes lower. Then, if the value of $L$ is small, we can synchronize the $r'_{i,k}$ contributions to $r_i$ by means of a new set of delay units without affecting the area performance. In other cases, a memory-based subsystem may replace it, and address logic needs to be appended. In the case of delay units, their values are represented as

\[
\begin{array}{ccccc}
r'_{i,0} & r'_{i,1} & \cdots & r'_{i,L-2} & r'_{i,L-1} \\
X & r'_{i,0} & \cdots & r'_{i,L-3} & r'_{i,L-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
X & X & \cdots & r'_{i,0} & r'_{i,1} \\
X & X & \cdots & X & r'_{i,0}
\end{array}
\qquad (8)
\]

where each column represents a different clock cycle $t_0, t_1, \ldots, t_{L-1}$ from left to right, so it is a classical S/P unit. After this operation, we use a new tree adder fed by the entire set $r'_{i,k}$ in parallel. In this case, the propagation path extension does not significantly affect the performance because of the small value of $L$. Nevertheless, a more sophisticated critical path treatment is still possible. Depending on whether a specific system achieves its timing constraints or not, a pipelined tree adder similar to Adder 1 may replace Adder 2.

According to the error computation for the analyzed OFDM transmitter, where fixed-point number representation is chosen, $2R$ bits are used for real and imaginary parts independently in the adders input. Then, truncation is not applied in the multipliers output. Based on numerical simulation, the adder outputs are defined as $2R$-bit. A truncation unit is placed in the last stage, and the output bus represents the results in $R$ bits for real and imaginary parts, independently.

Fig. 3. Complete matrix-vector multiplier architecture

4. SIMULATION RESULTS

The proposed architecture has been tested on an Altera® EP2C70F672C6 device where a VHDL specification was developed. Debugging was performed by means of a fixed-point simulator built on Matlab®. It was complemented by a special unit for connecting the test board with a PC through an Ethernet port. The final performance is summarized in Table 1.

Table 1. Synthesis Results
Resource               Utilization   %
LEs                    3129          4.6
LABs                   215           5
Registers              530           0.88
Memory Bits            5490          0.49
Hardware Multipliers   191           0.64

5. CONCLUSION

OFDM signal generation, and performance achieved is sufficient for implementing an N-Continuous OFDM transmitter by following the LTE standard.

6. REFERENCES

[1] T. Onizawa, A. Ohta, and Y. Asai, "Experiments on FPGA-implemented eigenbeam MIMO-OFDM with transmit antenna selection," IEEE Transactions on Vehicular Technology, vol. 58, no. 3, pp. 1281-1291, Mar. 2009.
[2] P.-Y. Chen, C.-Y. Lien, and C.-P. Lu, "VLSI implementation of an edge-oriented image scaling processor," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, pp. 1275-1284, Sept. 2009.
[3] L. Androuchko and I. Nakajima, "Developing countries and e-health services," in Proc. 6th International Workshop on Enterprise Networking and Computing in Healthcare Industry (HEALTHCOM 2004), 2004, pp. 211-214.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
[5] J. van de Beek and F. Berggren, "N-continuous OFDM," IEEE Communications Letters, vol. 13, no. 1, pp. 1-3, 2009.
[6] Physical Channels and Modulation (Release 8), 3GPP Std. TSG RAN TS 36.211, v8.4.0, 2008.
SYNTHESIS OF THE HARTLEY TRANSFORM WITH A HADAMARD-BASED MATRIX
ARCHITECTURE
Gilson J. Alves, Member, IEEE, and Edval J. P. Santos, Senior Member, IEEE
\[ h_n = \frac{1}{N} \sum_{k=0}^{N-1} H_k \, \mathrm{cas}\!\left(\frac{2\pi k n}{N}\right), \qquad n = 0, 1, \ldots, N-1 \tag{6} \]
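As an illustration of Eq. (6), the following minimal Python sketch evaluates the inverse transform directly from its definition. The forward transform used for the round trip, H_k = sum over n of h_n cas(2πkn/N), is assumed here as the usual unnormalized convention, and the function names `cas`, `dht` and `idht` are hypothetical. This is a floating-point reference model only, not the layered hardware architecture.

```python
import math


def cas(x):
    """cas(x) = cos(x) + sin(x), the Hartley kernel."""
    return math.cos(x) + math.sin(x)


def dht(h):
    """Forward DHT by direct summation (assumed unnormalized convention)."""
    n_points = len(h)
    return [
        sum(h[n] * cas(2 * math.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def idht(H):
    """Inverse DHT as written in Eq. (6): same kernel, scaled by 1/N."""
    n_points = len(H)
    return [
        sum(H[k] * cas(2 * math.pi * k * n / n_points) for k in range(n_points)) / n_points
        for n in range(n_points)
    ]


# Example: a 16-sample gate function survives the round trip.
gate = [1] * 4 + [0] * 12
recovered = idht(dht(gate))
```

Because the kernel is its own inverse up to the 1/N factor, the same routine structure serves both directions.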
The Turbo Hartley Transform (THT) for short block lengths presented by De Oliveira, Cintra and Campello [18] was used for this approach. This method uses the technique of decomposition into layer matrices [17].
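\[ \mathrm{cas}\!\left(\frac{2\pi k \left(n + \frac{N}{2}\right)}{N}\right) = \mathrm{cas}\!\left(\frac{2\pi k n}{N} + \pi k\right) = (-1)^{k} \, \mathrm{cas}\!\left(\frac{2\pi k n}{N}\right) \tag{10} \]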
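The half-period symmetry of Eq. (10) can be checked numerically. The short Python sketch below is an illustrative check, not part of the design flow; it relies on the fact that adding πk to the argument of cas changes only the sign, according to the parity of k. The helper name `half_shift_error` is hypothetical.

```python
import math


def cas(x):
    """cas(x) = cos(x) + sin(x)."""
    return math.cos(x) + math.sin(x)


def half_shift_error(n_points):
    """Largest deviation between cas(2*pi*k*(n + N/2)/N) and the
    sign-flipped cas(2*pi*k*n/N) over all k, n in 0..N-1."""
    worst = 0.0
    for k in range(n_points):
        sign = -1 if k % 2 else 1  # shifting by pi*k flips the sign for odd k
        for n in range(n_points):
            shifted = cas(2 * math.pi * k * (n + n_points / 2) / n_points)
            reference = sign * cas(2 * math.pi * k * n / n_points)
            worst = max(worst, abs(shifted - reference))
    return worst
```

For N = 16 the deviation stays at floating-point rounding level, which is why only half of the kernel values need to be stored or computed.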
The design was developed in the following steps: Specification, HDL Description, Behavioral Simulation and Hardware Implementation. The next subsections refer to the first approach, the Hadamard-based matrix architecture.

4.1. Specification

The project aims to implement the Discrete Hartley Transform of length 16, 16-DHT, in a low-cost FPGA module, in accordance with Fig. 4 and Fig. 3. Fig. 2 summarizes the design conception.

The discrete Signal-In is a vector of sixteen 14-bit samples. The samples are stored serially in an input memory. After that, the memory vector is multiplied by the elements of the Hartley transformation matrix, in a 4-layer operation according to the scheme of Fig. 4 and the explanation in the previous section (3). The 16-DHT response is the discrete Signal-Out, a vector of length 16, where each component is represented as a 12-bit word.

4.2. HDL Description

The Matlab® Simulink software can be used to simulate a Hartley transform execution system [19]. In this design, Matlab® was used to implement the length-16 Hartley matrix with a Hadamard-based matrix architecture, which was converted into Simulink blocks, as shown in Fig. 4. The computation of the Hartley transform is executed according to

4.3. Behavioral Simulation

The simulation was carried out to check the system response. The simulation environment used was ModelSim-XE® with the Xilinx® ISE tool. Tests were carried out with a vector simulating the input signal, in two situations. In the first, the input signal simulates a rectified sine wave, and the response is shown in Fig. 6 and Fig. 7. The Signal-In vector (entrada_th) is an integer approximation of a sine wave with amplitude 10 and positive rectification, and the response, the Signal-Out (saida_th), is presented as integer numbers due to tool limitations. In the second situation, the input signal is a gate function, and the response is shown in Fig. 5. The time to compute the 16-DHT of a signal is 3 μs. This makes it feasible for a range of applications, such as audio and image processing [20].

4.4. Hardware Implementation

As the simulation results were as expected, the synthesis was executed on a Xilinx® Spartan-3E, XC3S500E-4fg320, via Xilinx® ISE 11.1, after RTL (Register Transfer Level) generation. Due to characteristics of the test platform used, the auxiliary clock frequency was set to 6.25 MHz, but it can be set up to 50 MHz, depending on the FPGA platform.

For comparison purposes, another synthesis of the 16-DHT was carried out, via its matrix definition algorithm, ac-
Fig. 3. 16-DHT block conception
[Plots: "16-DHT of a gate function (synthesized)" and "16-DHT of a Rectified Sine Wave"; vertical axis: 16-DHT amplitude, horizontal axis: t.]
SBrT 2000 - XVIII Simpósio Brasileiro de Telecomunicações, Sep. 2000.
[19] R. C. de Oliveira, H. M. de Oliveira, R. Campello, and E. Santos, "A flexible implementation of a matrix Laurent series-based 16-point fast Fourier and Hartley transforms," in Proceedings of the VI Southern Programmable Logic Conference, pp. 175-178, Mar. 2010.
[20] S. A. Parthasarathy Ranganathan and N. P. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions," Proceedings of the IEEE, Aug. 2002.
MODBUS IMPLEMENTATION ON FPGA USING VHDL - DATA LINK LAYER -
1.2. ASCII Encoding
2.2. Transmitter and Receiver

Functionally, the Transmitter must generate the frame to be sent, both in the Master and in the Slaves. A state machine is designed that waits for the application layer to complete its write to the RAM block.

The state machine performs successive reads from the RAM block until the characters have been sent one by one, respecting the start-of-frame and end-of-frame markers.

In reception, as in transmission, a state machine is again used, which must comply with the specifications of the encoding mode. In this case the information arrives in serial form from the "Physical" layer. The data are stored in the RAM block, and during that time the reception block has exclusive write control over the memory.

For these reasons, access control for the RAM block is necessary, since several components need to write to and/or read from this block.

2.3. UART

For layers 1 and 2 of the OSI model, MODBUS defines the "MODBUS over Serial Line" protocol [3]. This implies the use of a UART (Universal Asynchronous Receiver Transmitter) to transmit and receive the data serially.

The UART thus constitutes the connection between the "Data Link" layer and the "Physical" layer. The latter can be any serial communication standard, such as RS232 or RS485, the one adopted in the present development.

Like the other blocks, this one is described in VHDL. In general terms, it presents the serially received data as a parallel output. Similarly, it receives the data to be transmitted in parallel and sends the information bits serially, according to the selected speed settings and the conditions pre-established by the MODBUS protocol for the composition of the word to be sent: start, data, parity and stop bits [3].

3. SYNTHESIS AND IMPLEMENTATION

• Clock distribution: DLL (Delay-Locked Loop).
• Boundary Scan.

With the design guidelines already presented, and with the different blocks that make up our description identified, the synthesis result is presented in Table 1.

Table 1. Resource utilization summary
FPGA device: 2S200EPQ208-6Q

Resource          Used   Available   Percentage
Slices            194    2352        8%
Flip Flops        239    4704        5%
LUTs (Logic)      334    4704        7.1%
LUTs (RAM)        8      4704        0.1%
Inputs/Outputs    37
Connected IOBs    37     142         26%
GCLKs             1      4           25%

Table 1 shows that few resources are used, given that the device has a large number of CLBs (Configurable Logic Blocks). Even so, simulation, verification and subsequent simplification of the description are of great importance in order to use resources more efficiently when the design is implemented on different logic devices.

A more detailed analysis highlights the importance of not instantiating primitive elements in the current project. In this regard, the behaviorally described RAM block needs only a small, fixed number of elements, which allows it to be instantiated even on smaller logic devices, for example CPLDs. If a larger RAM block is needed, the use of primitive RAM memory blocks should be considered, naturally after a prior study of the device to be used. As with the previous consideration, all the resources required by the project and those available in the target hardware must be taken into account.
Using a single clock to synchronize the CLBs makes the design more flexible than having several external clocks connected to the FPGA. However, it must be kept in mind that this comes at the cost of physical resources, since a clock divider implemented with logic blocks is synthesized as a logic counter.
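The counter behind such a clock divider can be modeled in a few lines. The Python sketch below is a behavioral illustration only, not the VHDL of this work; the class name `ClockDivider` and the restriction to even division ratios are assumptions made for the example.

```python
class ClockDivider:
    """Behavioral model of a counter-based clock divider.

    The output toggles every ratio/2 input ticks, so one full output
    period spans `ratio` input ticks (even ratios only, by assumption).
    """

    def __init__(self, ratio):
        if ratio < 2 or ratio % 2:
            raise ValueError("ratio must be an even integer >= 2")
        self.half_period = ratio // 2
        self.count = 0
        self.out = 0

    def tick(self):
        """Advance one input clock edge and return the divided clock level."""
        self.count += 1
        if self.count == self.half_period:
            self.count = 0
            self.out ^= 1  # toggle the divided clock
        return self.out


# Example: dividing by 4 gives an output period of four input ticks.
divider = ClockDivider(4)
waveform = [divider.tick() for _ in range(8)]
```

The register that holds `count` is what consumes flip-flops and LUTs in the synthesized design, which is the resource cost the paragraph above refers to.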
The RTL (Register Transfer Level) view provides a graphical representation of the design described in VHDL, showing the final components (Fig. 6).
Fig. 8. Simulation of a transmission and reception frame in the MODBUS Data Link layer.
5. CONCLUSION
MUSIC SEQUENCER ON AN FPGA BOARD
[Figure: data path block diagram. Blocks: Metronome (50 MHz clock), PC, Score Table, four voices (Voice 1 to Voice 4), each with a Tone Table and an NCO on 22-bit buses, Mixer (22-bit output) and PWM (1-bit output).]
In our implementation we do not focus on how the performance data are entered. We store in a table the data needed to play a musical piece, and this table stands in for a real input. This leaves the way open to entering data in other ways (in real time via the serial port from a PC or a MIDI controller, through a PS/2 keyboard, etc.).

We refer to this table, which stands in for the input, as the "Score Table".

2.2. Audio output

There are two audio outputs, which carry the digital audio resulting from playing the score taken at the input.

One of these outputs is a discrete digital sawtooth signal on a 22-bit bus, which cycles at the frequency associated with the tone to be played. In our project this output is not used, but it remains available for any other digital-to-analog conversion that may be desired.

The other output is the pulse-width modulated (PWM) version of the output mentioned above. It is 1 bit wide and alternates between 0 and 1 in cycles with the frequency associated with the tone to be played. This output is used: it is received by the PMOD-AMP1 and converted into audio that can be played by any speaker.

3. DESIGN

The Data Path (see Fig. 1) of the project is composed of the following modules:

The Metronome, which is a divider of the board clock frequency.

The PC Register, which indexes the Score Table.

The Score Table, which contains the performance data of the musical piece.

The Tone Table, which translates each musical note to be played into the value the Oscillator needs to generate the corresponding signal.

The Oscillator (NCO), which generates a sawtooth signal with numerically controlled frequency.

The Mixer, which combines the signals of the different oscillators into a single one.

4. IMPLEMENTATION

4.1. Metronome module

This module is simply a divider of the board clock frequency. Its output goes from 0 to 1 to indicate that one time unit of the musical performance has elapsed.

In future implementations, it will be possible to change the tempo of the musical piece during playback simply by changing the value by which the frequency is divided in this module.

4.2. PC Register

This register works as a counter of the number of pulses emitted by the Metronome. This value is used to index the Score Table.
4.3. Score Table

This table simulates an actual input. As mentioned in 2.1, in future versions it could be replaced by an input in another format.

Each of its rows (see Fig. 2) represents a musical performance event for a given moment, which we call the timestamp. The timestamp comes from the value stored in the register of 4.2.

An event consists of: the note involved; a voice (through which the sound will be generated); and a binary value indicating whether it represents the start or the end of the sound of that note on that voice (NoteOn/NoteOff).

[Fig. 2: row layout with the fields timestamp, note number, NoteOn/NoteOff and voice number.]

In this first stage the application can only play up to 4 notes simultaneously. They start and stop sounding at the timestamp indicated in the score.

4.4. Tone Table

This table stores the precalculated values that serve as the upper limit for the oscillator counters in order to obtain the desired frequencies. Since the counters are discrete, the generated frequencies may have an error, but it is negligible to the human ear.

The operation is simple to explain: since the internal clock runs at 50 MHz, the question is how many ticks we should count to slow this frequency down to that of the desired note. We then count from 0 up to that number over and over, obtaining a signal at the frequency of that note.

Number of FPGA clock ticks:

\begin{align}
50\,\mathrm{M}\ \text{cycles} &= 1\ \mathrm{s} \tag{1a} \\
1\ \text{cycle} &= \frac{1}{50\,\mathrm{M}}\ \mathrm{s} \tag{1b} \\
1\ \mathrm{tic}_{FPGA} &= \frac{1}{50\,\mathrm{M}}\ \mathrm{s} \tag{1c}
\end{align}

Number of ticks of the middle "A":

\begin{align}
440\ \text{cycles} &= 1\ \mathrm{s} \tag{2a} \\
1\ \text{cycle} &= \frac{1}{440}\ \mathrm{s} \tag{2b} \\
1\ \mathrm{tic}_{LA} &= \frac{1}{440}\ \mathrm{s} \tag{2c}
\end{align}

FPGA clock ticks needed to obtain an A:

\begin{align}
\frac{1}{50\,\mathrm{M}}\ \mathrm{s} &= 1\ \mathrm{tic}_{FPGA} \tag{3a} \\
\frac{1}{440}\ \mathrm{s} &= x = \frac{50\,\mathrm{M}}{440}\ \mathrm{tics}_{FPGA} \tag{3b}
\end{align}

In this way we store the precalculated values for the 128 notes of the musical range covered by the MIDI protocol.

Fig. 3. Tick counter for obtaining a middle "A" (the counter value v ramps from 0x0 up to 0x1BBE4).

Fig. 4. Tick counter centered on 0x200000 for obtaining a middle "A" (the counter value v ramps from 0x1F220D up to 0x20DDF2).

4.5. Oscillator module (NCO)

This module is responsible for generating the signal that represents a given frequency. The output is a sawtooth signal that oscillates at a rate determined by the input. To generate the sawtooth signal, the module counts ticks of the FPGA internal clock. The input of this module is therefore the number corresponding to the desired note according to 4.4, and the output is the counter value (see Fig. 3).

When we say that a voice is activated on a note, we mean that one of the four Oscillators starts to output a sawtooth signal that "oscillates digitally" at the frequency associated with that note (for example, 440 Hz for the middle A). When a voice is deactivated, the Oscillator associated with that voice holds its output constantly at zero.

This module also centers the signal. By centering we mean that the midpoint of our 22-bit representation is always reached at the midpoint of the range being traversed. Therefore, instead of counting from 0 to n we count from p to q, with p < q, where p is the bitwise negation of q and \( q - p \cong n \) (see Fig. 4).
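The tick count of Eq. (3b) and the centered counting range can be reproduced with a short Python sketch. It is a hedged illustration, not the VHDL of the project: the equal-tempered mapping from MIDI note numbers to frequencies (note 69 = 440 Hz) is the standard MIDI convention and is assumed here, the division is truncated, and the names `note_frequency`, `tick_count` and `centered_bounds` are hypothetical.

```python
CLOCK_HZ = 50_000_000          # board clock, as stated in the text
COUNTER_BITS = 22              # width of the oscillator bus
MASK = (1 << COUNTER_BITS) - 1
MIDPOINT = 1 << (COUNTER_BITS - 1)   # 0x200000


def note_frequency(midi_note):
    """Equal-tempered frequency of a MIDI note (assumes note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def tick_count(frequency_hz):
    """Clock ticks per period of the note, Eq. (3b), truncated to an integer."""
    return int(CLOCK_HZ / frequency_hz)


def centered_bounds(n):
    """Counting range (p, q) centered on the midpoint, with p = NOT q."""
    q = MIDPOINT + n // 2
    p = ~q & MASK              # bitwise negation within 22 bits
    return p, q


# Middle A: 50 MHz / 440 Hz gives 113636 ticks (0x1BBE4).
n_a4 = tick_count(note_frequency(69))
p, q = centered_bounds(n_a4)
```

For the middle A this yields the values shown in Fig. 3 and Fig. 4: a tick count of 0x1BBE4 and a range from 0x1F220D to 0x20DDF2, whose width differs from n by a single tick.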
6. SYNTHESIS

Table 1 shows the synthesis result of our project, with no score loaded, for the Spartan-3E (XC3S500E) board, synthesized with XST. We present the synthesis without the score because the idea is for the score to stop residing on the board and become an input of another kind, as explained in 2.1.

Table 1. Synthesis table.

Component           Used        Percentage
Slices:             150/4656    3%
Slice Flip Flops:   107/9312    1%
4 input LUTs:       264/9312    2%
IOs:                -           21 %
Bonded IOBs:        17/232      7%

7. CONCLUSION
FLEXIBLE PLATFORM FOR REAL-TIME VIDEO AND IMAGE PROCESSING
Paulo da Cunha Possa, Zied El Hadhri, Laurent Jojczyk and Carlos Valderrama
simplified diagram of the video framework architecture.

Fig. 1. Altera DE2 Development and Education Board.

[Block diagram: Video Camera, TV Decoder ADV7181, Video Input, Video Processing (Mirroring, Background Subtraction, Tracking), Video Output, Video DAC ADV7123 and VGA Monitor, with the video modules inside the Cyclone II 2C35.]

generates two signals corresponding to the pixel coordinates.

Fig. 3. Diagram of the Video Input module.

Customized video processing modules can be easily placed between these two modules (Video Input and Video Output). A basic scalable architecture was utilized to create a complete video application. Fig. 4 shows a diagram of the video processing module created to evaluate our platform.
FPGA device. The dual-port RAM allows data to be stored at one address and read from another address at the same time. Using the LIFO structure, the Mirroring block creates a mirror effect in the output video.

2.2. Background Subtraction

The Background Subtraction module extracts the background of a frame in real time, highlighting new objects in the frame. Background subtraction is a commonly used class of techniques for segmenting out objects of interest in a scene for applications such as surveillance [6]. In our approach, we store a specific part of a frame in an external SRAM. After that, we compare each pixel from the following frames with the buffered pixels. As a result of the comparison algorithm (1), each pixel is classified as a background pixel or a foreground pixel. In the output, the background pixels appear black and the foreground pixels appear as in the input, i.e. the output shows only what is new in the frame.

support a large number of real-time video applications in a VGA standard resolution (640 × 480 pixels @ 60 fps). In terms of internal memory resources, our design reaches 15% utilization. Table 1 summarizes the FPGA resource usage of our system and Fig. 5 shows the Cyclone II EP2C35 floorplan after the fitting process, with the location of the main blocks.

Table 1. FPGA resource usage by the Video Processing Platform.

Modules             Logic Elements   Memory Bits    Embedded Multipliers   PLLs
Video In            1550             53184          9                      1
Processing Module   671              19200          0                      0
Video Out           86               0              0                      0
Total               2307/33216       72384/483840   9/35                   1/4
Percentage          7%               15%            26%                    25%
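The mirroring and background subtraction steps described in Sections 2.1 and 2.2 can be sketched in software as follows. This Python model is an assumption-laden illustration: the comparison algorithm (1) is not reproduced in this text, so a simple absolute-difference threshold on grayscale pixels is used in its place, and the names `mirror_line`, `subtract_background` and the default `threshold` are hypothetical.

```python
def mirror_line(pixels):
    """Mirror one video line by pushing pixels onto a LIFO and popping them."""
    stack = []
    for pixel in pixels:
        stack.append(pixel)            # write phase: store in arrival order
    return [stack.pop() for _ in range(len(stack))]  # read phase: last in, first out


def subtract_background(frame, background, threshold=16):
    """Keep pixels that differ from the buffered background, blank the rest.

    `frame` and `background` are lists of rows of grayscale values.
    The threshold test is an assumed stand-in for the paper's algorithm (1).
    """
    return [
        [pixel if abs(pixel - ref) > threshold else 0
         for pixel, ref in zip(row, ref_row)]
        for row, ref_row in zip(frame, background)
    ]


# Example: one new bright object appears against a flat background.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 200, 12], [11, 10, 10]]
foreground = subtract_background(frame, background)
```

In the example only the pixel with value 200 survives; small differences such as 11 or 12 fall under the assumed threshold and are rendered black, mirroring the behavior described for the output video.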
Fig. 6. Video Processing Module results: (a) bypassing the background subtraction/tracking block; (b) bypassing the mirroring block without foreground objects; (c) bypassing the mirroring block with a foreground object.
Regarding resource usage and system performance, Altera's PowerPlay tool estimated a power dissipation of 235.7 mW for the FPGA device in our system.

The experimental results demonstrate the effectiveness of our platform. Fig. 6 illustrates the platform video output with different multiplexer settings in the Video Processing Module. Fig. 6a shows the mirroring block result without background subtraction/tracking. In Fig. 6b and 6c, only the mirroring block is bypassed and we can see the result of the background subtraction/tracking blocks. In Fig. 6b, the output shows an empty space in the centre of the image. This space is where the background subtraction/tracking block is active. In the next image (Fig. 6c), an object is added to the environment. We can see the new object without the background information and also a square enclosing it. At the same time, the on-board seven-segment display shows the centre position of the square and the object area in the image.

4. CONCLUSIONS

In this paper, we present a platform for real-time image and video processing applications. The objective of this framework is to allow engineering students to design, explore and evaluate different image and video processing modules.

As mentioned before, we used a low-cost Altera FPGA device, the EP2C35 from the Cyclone II family. This device is embedded in a development board that is also low cost, the DE2. The DE2 has the advantage of containing a video input and output based on a TV input decoder (ADV7181B) and a Video DAC output (ADV7123). Also, the DE2 was developed especially for educational purposes, which is our focus.

We implemented three basic processing algorithms in our system in order to validate the entire system and test its flexibility and performance. The results showed that even a relatively small FPGA device can support a large number of real-time video applications in a VGA standard resolution.

In future work, we intend to add extra memory to the DE2 board through daughter boards connected to its expansion connector. This will allow us to implement the multiple frame buffers required for more complex algorithms. Also, we want to use a digital video source (for example the Terasic D5M digital camera) instead of the analog one that we used. This will simplify the Video Input Module and save FPGA resources. Moreover, we will migrate our platform to a more powerful development board, aiming at applications in Full HD resolution.
ACKNOWLEDGEMENT

This work is supported by the French Community of Belgium under the Research Action ARC-OLIMP (Optimization for Live Interactive Multimedia Processing 2008-2013). Also, we would like to thank Altera University Program for providing the development boards.

REFERENCES

[1] M. Akil, "Special issue on reconfigurable architecture for real-time image processing," Journal of Real-Time Image Processing, volume 3(3), pages 117-118, 2008.

[2] J.A. Kalomiros, J. Lygouras, "Design and evaluation of a hardware/software FPGA-based system for fast image processing," Microprocessors & Microsystems, volume 32(2), pages 95-106, 2008.

[3] S. Asano, T. Maruyama, Y. Yamaguchi, "Performance comparison of FPGA, GPU and CPU in image processing," International Conference on Field Programmable Logic and Applications, pages 126-131, 2009.

[4] N. Lawal, B. Thornberg, M. O'Nils, "Power-aware automatic constraint generation for FPGA based real-time video processing systems," Norchip, 2007.

[5] J. Li, H. He, H. Man, S. Desai, "A general-purpose FPGA-based reconfigurable platform for video and image processing," International Symposium on Neural Networks, pages 299-309, 2009.

[6] A.M. McIvor, "Background subtraction techniques," In Proc. of Image and Vision Computing, 2000.
SOPC PLATFORM FOR REAL-TIME DVB-T MODULATOR DEBUGGING
∗ This work has been partially supported by the research program DIPE-BEAZ 2009 (DIPE09/02).
† This work has been partially supported by the Government of the Basque Country within the research program NETS (project IN-2010/0000012).

ABSTRACT

The debugging of DVB-T FPGA-based systems is not a trivial task. The large bandwidth requirements, in combination with the massive storage needed for further analysis of the video frames, require an ad hoc solution. This article presents a SoPC architecture specifically designed to capture frames of a Digital Television modulator IP core in real time. All the required processing (video, communications, TCP-IP encapsulation, etc.) is managed by the FPGA, and the frames can be captured between any stages of the hardware processing pipeline of the DVB-T modulator IP core. As a result, a powerful tool for Digital Television hardware debugging is obtained.

1. INTRODUCTION

Recent years have witnessed the development of technology in several digital areas. This evolution has led to the need to replace existing technology in the field of broadcasting, which has been mostly analog until recently. This evolution concerns not only TV and radio end users but also RF links between intermediate equipment. An example is the communication between a camera and the production center during the broadcast of a sports event.

To solve the shortcomings of previous analog systems, a digital broadcasting service for TV and radio has emerged. In order to organize this evolution, a European standard for digital television [1] has been set.

The basis of the new digital technology is digital compression of the image. Digital sound was addressed early, but real-time moving images pose many more problems, which have held back digital video systems. The transmission of digitized images without compression at the speed required by television requires too much bandwidth, something intolerable given the congested spectrum. It was therefore necessary to compress the digital signal, sending no more than what is necessary to reconstruct the image at the receiver. This compression technique was developed by MPEG (Moving Picture Experts Group). In this regard, the MPEG2 image compression system is used as a reference for the European Digital TV standard [2].

The flexibility and computing power required for Digital Television hardware processing are best addressed using reconfigurable logic. In fact, the state of the art in hardware processing modules for DTT (DVB-T modulator/demodulator cores) shows that many companies offer specialized IP cores for integration into FPGAs. Additionally, the latest platform FPGAs have enabled the integration of whole digital systems in a single device [3]: hardware cores, microprocessors, on-chip buses, etc. G. Martin, in the chapter "The History of the SoC Revolution" (2003) [4], emphasized that core-based design with commercial reconfigurable FPGA platforms was a strong reality in System-on-Chip (SoC) [5] design, and that it would continue in the future. This prediction has been fulfilled, and nowadays SoCs are widespread, especially SoCs implemented in reconfigurable logic: the SoPCs. Regarding methods and tools for debugging high-performance systems, much work has been done in recent years. FPGAs have become popular as a valuable resource for the debug and verification of such high-performance, complex embedded systems. With current FPGA technology, it has become possible to control and manage several different real-time and high-bandwidth interfaces simultaneously. Along these lines, in [6] an FPGA is used to provide a general-purpose, full-observability cosimulation platform. As another example, in [7], a JTAG-compatible logic analyzer core is presented, which
is necessary to design SoPC architectures and appropriate technologies. This paper presents a solution based on a SoPC system that can extract real-time information to a host via a 1 Gbps Ethernet TCP-IP connection. The useful information throughput will be above 200 Mbps (payload).

To meet this challenge, the key technological elements in the system are:

• High-end Virtex-5 FPGA (XC5VFX70TFF-1136).

• A hard core Power-PC processor 440, integrated into the FPGA silicon.

• A hard core high-performance Gigabit Ethernet controller integrated into the FPGA silicon.

Table 1. Input and output data bus width of the DVB-T IP internal modules.

Module name            Input bus data width   Output bus data width
transport_stream_if    8                      9
randomized             9                      9
reed_salomon           9                      9
external_interleaver   9                      8
viterbi_puncture       8                      2
internal_interleaver   2                      33
pilot_and_tps          33                     33
ifft_ig                33                     32
dac_core               32                     16 (DDR)
[Block diagram of the DVB-T modulator inside the FPGA: MPEG2 input, 1 TRANSPORT_STREAM_IF, 2 RANDOMIZED, 3 REED_SALOMON, 4 EXTERNAL_INTERLEAVER, 5 VITERBI_PUNCTURE, 6 INTERNAL_INTERLEAVER, 7 PILOT_AND_TPS, 8 IFFT_IG, 10 DAC_CORE, output to the DACs. Consecutive stages are linked by master (M) and slave (S) FIFO interfaces; block 11 CTRL provides the Debug WB (master) and UART interfaces.]
• TCP-IP and lwIP parameter optimization: Substantial improvements in communication performance can be achieved by modifying some parameters of the TCP-IP stack in combination with some size optimizations of the transmission and reception FIFOs. The most significant parameters are the following:

  – Maximum Segment Size (TCP_MSS): 1460 bytes.
  – TCP Transmission Buffer (TCP_SND_BUF): 16384 bytes.
  – TCP Window (TCP_WND): 4096 bytes.
  – TMAC transmission and reception FIFO: 4096 bytes.

Figure 2 shows the block diagram of the proposed SoPC for real-time debug. It has been implemented on a Virtex-5 FPGA. In addition to the critical modules mentioned above, the following additional cores are present in the system:

• 16 Kbytes of internal RAM memory built using dedicated block RAM modules.

• SRAM controller for external memory.

• Auxiliary modules for clock, reset and JTAG management.

3. IMPLEMENTATION RESULTS

In order to obtain the debug system as fast as possible, both the IP core and the SoPC have been implemented on an ML507 Xilinx Virtex-5 evaluation board. This board carries an XC5VFX70T-FFG1136 device and has all the means needed for real-time operation: DDR2 external memory, SRAM memory and a Gigabit Ethernet physical link.

Figure 3 shows the block diagram of the whole system. The IP and the SoPC have been implemented inside the FPGA. In this set-up, the SoPC captures the data between the output of the FFT and the input of the DAC module. FSM Ctrl. is the Finite State Machine that controls the data transfer between the DVB-T modulator IP core and the FIFO stored in the CAPTURE_FSL_MASTER_OUT IP. Table 2 summarizes the implementation results. The first column describes the FPGA resource type. Columns 2 and 3 summarize, respectively, the FPGA occupation for the SoPC alone and in combination with the IP core under test, in this case the DVB-T modulator. It is worth noting that a huge FPGA like the one used for this implementation allows easy
fast-prototyping for complex debug systems. Only 33% of the general purpose resources of the FPGA are used and all timing constraints are easily met.

Table 2. Implementation results of the SoPC designed for real-time debug of a DVB-T transmitter IP core (data for a Virtex-5 XC5VFX70T-FFG1136 FPGA).

FPGA resource type        SoPC system    IP core under analysis and SoPC system
4 input LUTs              4,850 (10%)    5,762 (12%)
Slice Flip-Flops          5,221 (11%)    6,851 (15%)
Virtex-5 Slices           3,008 (26%)    3,762 (33%)
36K BlockRAM              17 (11%)       23 (15%)
Hard Power-PC processor   1 (100%)       1 (100%)
TMAC Gigabit Ethernet     1 (50%)        1 (50%)

Figure 4 shows a screenshot of the real-time communication between the ML507 evaluation board used to implement the presented platform and a PC, through a point-to-point Gigabit Ethernet link. The PC runs an Iperf server, which evaluates the actual data flow in transfer. The program used to capture the TCP-IP packets is Wireshark. It is in charge of saving the reconstructed frames on the PC hard disk for further analysis. Those frames are captured and stored in real time; however, they are analyzed off-line, when they are compared with the ones generated by the DVB-T modulator reference model (implemented in C language). As it can be noticed, for the chosen commu-

------------------------------------------------------------
Server listening on TCP port 2000
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1856] local 192.168.1.50 port 2000 connected with 192.168.1.105 port 4097
[ ID] Interval Transfer Bandwidth
[1856] 0.0- 2.0 sec 30.8 MBytes 129 Mbits/sec
[1856] 2.0- 4.0 sec 30.8 MBytes 129 Mbits/sec
[1856] 4.0- 6.0 sec 36.5 MBytes 153 Mbits/sec
[1856] 6.0- 8.0 sec 35.0 MBytes 147 Mbits/sec
[1856] 8.0-10.0 sec 37.1 MBytes 156 Mbits/sec
[1856] 10.0-12.0 sec 37.3 MBytes 156 Mbits/sec
[1856] 12.0-14.0 sec 33.6 MBytes 141 Mbits/sec
[1856] 14.0-16.0 sec 31.1 MBytes 130 Mbits/sec
[1856] 16.0-18.0 sec 31.1 MBytes 130 Mbits/sec
[1856] 18.0-20.0 sec 35.7 MBytes 150 Mbits/sec
[1856] 20.0-22.0 sec 34.2 MBytes 144 Mbits/sec
[1856] 22.0-24.0 sec 35.5 MBytes 149 Mbits/sec
[1856] 24.0-26.0 sec 32.0 MBytes 134 Mbits/sec
[1856] 26.0-28.0 sec 36.6 MBytes 154 Mbits/sec
[1856] 28.0-30.0 sec 36.5 MBytes 153 Mbits/sec
[1856] 30.0-32.0 sec 29.9 MBytes 126 Mbits/sec
[1856] 32.0-34.0 sec 33.1 MBytes 139 Mbits/sec
[1856] 34.0-36.0 sec 36.4 MBytes 153 Mbits/sec
[1856] 36.0-38.0 sec 37.2 MBytes 156 Mbits/sec
[1856] 38.0-40.0 sec 34.3 MBytes 144 Mbits/sec

Fig. 4. Communication performance between the fast prototyping board (ML507) and the PC host. Data provided by the Iperf tool.
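To summarize an Iperf report like the one in Fig. 4, the per-interval bandwidth figures can be extracted and averaged. The Python sketch below is an illustrative post-processing step, not part of the described platform; the regular expression is written against the line layout shown in the figure, and the names `parse_bandwidths` and `average_bandwidth` are hypothetical.

```python
import re
from statistics import mean

# Matches interval lines such as "[1856] 0.0- 2.0 sec 30.8 MBytes 129 Mbits/sec".
INTERVAL = re.compile(
    r"\[\s*\d+\]\s+([\d.]+)-\s*([\d.]+)\s+sec\s+([\d.]+)\s+MBytes\s+([\d.]+)\s+Mbits/sec"
)


def parse_bandwidths(log_text):
    """Return the bandwidth column (Mbits/sec) of every interval line."""
    return [float(match.group(4)) for match in INTERVAL.finditer(log_text)]


def average_bandwidth(log_text):
    """Mean bandwidth over all reported intervals, in Mbits/sec."""
    return mean(parse_bandwidths(log_text))


SAMPLE = """\
[ ID] Interval Transfer Bandwidth
[1856] 0.0- 2.0 sec 30.8 MBytes 129 Mbits/sec
[1856] 2.0- 4.0 sec 30.8 MBytes 129 Mbits/sec
[1856] 4.0- 6.0 sec 36.5 MBytes 153 Mbits/sec
"""
sample_average = average_bandwidth(SAMPLE)
```

Header and connection lines are skipped because they do not match the interval pattern, so the full log of Fig. 4 can be passed in unchanged.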
[Block diagram: the DVB-T modulator IP core (stages 1 TRANSPORT_STREAM_IF to 10 DAC_CORE and 11 CTRL, each with a Wishbone slave interface) fed by the MPEG2 transport stream; FSM Ctrl. and CAPTURE_FSL_MASTER_OUT (FIFO, MFSL/SFSL), PLB2FSL bridge, PPC440, internal RAM memory, interrupt controller, SRAM interface to the on-board SRAM, UART, GP IOs (LEDs, buttons), LLDM and the Ethernet PHY link to the PC host.]

Fig. 3. Block diagram of the SoPC in combination with the DVB-T transmitter IP core for real-time debug.
4. CONCLUSIONS
5. REFERENCES
HIGH RELIABILITY CAPTURE CORE FOR DATA ACQUISITION IN SYSTEM ON
PROGRAMMABLE CHIPS
Jesús Lázaro, Armando Astarloa, Aitzol Zuloaga, Jaime Jimenez, Unai Bidarte, José Luis Martı́n
2. OVERALL STRUCTURE

The Capture core is in charge of receiving data from the ADC, deciding the correct value, and making it available in a PLB-compatible core. The SoPC is composed of several cores (one being the capture core) and a microprocessor that will use the captured data.

[Figure: external sensor and redundant ADC feeding the FPGA, which contains the Capture Core and the PowerPC connected through the PLB.]

• Spectrum analyzer. This block is used to compare the different outputs: ideal, output and single hardware filter output.

3.2. Structure

The block in charge of combining the filter outputs in order to give the correct answer is built around the following blocks:
• Voter. This block is in charge of deciding which filter, if any, is giving a corrupted output.
• Fault counter. This block counts how many errors are found in each of the filters in a given time.
• Disabling circuit. Knowing the number of errors of each filter, this circuit disables the one with the most errors (if all have the same number of errors, circuit C is disabled).
Fig. 2. Overall system, depicting inputs, filters and voting circuitry.
Fig. 3. Voter and mean calculator. A majority voter decides which filter output is probably failing. Several counters count how many failures happen in a given time. A third block disables a core if an anomalous condition is found. The fourth block forces the output of the failing filter to 0, while the last block calculates the mean of the outputs of the filters.
a thing is not possible, bits with lower binary weight should be used.

Contrary to a conventional voting circuit, the current output of the voter is not used; the average of errors is used instead. This way the voting circuit does not need to be perfectly tuned, since there is margin for spurious outputs.

hardware cost, the circuit has been designed to add the three outputs from the filters and divide them by two.
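The interplay of voter, fault counter, disabling circuit and mean calculator can be sketched in software. This is only an illustrative model, not the authors' hardware: the names `flag_faulty`, `count_faults`, `select_disabled` and `combine`, the `TOLERANCE` threshold, the median-based disagreement test and the handling of partial ties are all assumptions. What comes from the text is that the filter with the most faults is disabled (filter C on a full tie), that its output is forced to 0, and that the three outputs are added and divided by two.

```python
from statistics import median

TOLERANCE = 4  # hypothetical agreement threshold, in output LSBs


def flag_faulty(outputs, tolerance=TOLERANCE):
    """Voter (assumed rule): index of the output farthest from the median, or None."""
    mid = median(outputs)
    deviations = [abs(value - mid) for value in outputs]
    worst = max(range(len(outputs)), key=lambda i: deviations[i])
    return worst if deviations[worst] > tolerance else None


def count_faults(window):
    """Fault counter: faults flagged per filter over a window of output triples."""
    counts = [0, 0, 0]
    for outputs in window:
        faulty = flag_faulty(outputs)
        if faulty is not None:
            counts[faulty] += 1
    return counts


def select_disabled(counts):
    """Disabling circuit: filter with the most faults; ties resolve towards C (index 2)."""
    worst = max(counts)
    return max(i for i, count in enumerate(counts) if count == worst)


def combine(outputs, disabled):
    """Force the disabled filter to 0, add the three outputs and halve the sum."""
    masked = [0 if i == disabled else value for i, value in enumerate(outputs)]
    return sum(masked) >> 1  # divide by two with a shift, as cheap hardware would


window = [(100, 100, 100), (100, 250, 101), (99, 100, 100)]
disabled = select_disabled(count_faults(window))
result = combine(window[1], disabled)
```

Because one filter is always disabled in this model, the sum contains exactly two live outputs, so halving it yields their mean without needing a divider.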
Fig. 5. Spectrum result of the filters (magnitude-squared in dB versus frequency in kHz, 0 to 50 kHz). Channel 1 depicts the ideal floating point filter, channel 2 the output of the real system, and channel 3 the output of the non-working filter.

Table 3. Resource summary report for a Spartan 3A-DSP. Timing constraints set to 64 MHz, allowing a 1 MHz input sampling rate.

          Quantity   % of FPGA
DSP48As   3          3%
Slice     648        3%

in a worst case scenario, the hardware overhead is limited to 2 DSP48A blocks and less than 2% of the FPGA. This implementation is useful for a standalone version of the core. The resources used in the final FPGA are slightly larger, since the interconnection logic has to be added. It may seem that it requires fewer slices, but Virtex-5 slices are twice the size of those of a Spartan-3.

4. SOPC STRUCTURE

The capture core seen in the previous section can be used standalone but, using the export Pcore feature, it can also be used inside a SoPC. In this kind of system, all the elements of the circuit are integrated inside an FPGA. The capture core has to be slightly modified; specifically, the interconnection interface with the PLB bus has to be defined. In our case, this connection is made through a FIFO-style shared memory. This memory is written by the core and read through the PLB.

4.1. Hardware structure

• Capture core
• TEMAC: Hard Ethernet MAC
• Memory cores: DDR2 interface core, Flash interface core

The capture core has been explained in previous sections, so we will focus on the rest of the cores.

4.2. PowerPC hard processor

The IBM PowerPC 440 core is a hard 32-bit RISC CPU block designed into the fabric of select Virtex series FPGAs to implement high performance embedded applications. The combination of hard cores with integrated co-processing capability enables a wide range of performance optimization options.

The PowerPC 440 processor is supported in Virtex-5 FXT FPGAs by a sophisticated CPU/APU controller and a high-bandwidth crossbar switch. The crossbar switch enables high-throughput 128-bit interfaces and point-to-point connectivity. Integrated DMA channels, a dedicated memory interface, and Processor Local Bus (PLB) interfaces minimize logic utilization, reduce system latency and optimize performance. Simultaneous I/O and memory access maximizes data transfer rates.

4.3. TEMAC: Hard Ethernet MAC

TEMAC is an acronym for Tri-Mode Ethernet Media Access Controller and refers to the three speeds (10, 100, and 1000 Mb/s) supported by the Ethernet MAC function available in this core. This core is based on the Xilinx hard silicon Ethernet MAC in the Virtex-5 FXT.
This core provides some very advanced capabilities:
Table 5. Resource summary report for a Virtex5 70fxt. Timing constraints set for a 100 MHz bus speed to allow high speed communications.

          Quantity   % of FPGA
PPC440    1          100%
TEMAC     1          50%
Slice     3495       31%
BRAM      15         10%
DSP48As   3          2%

4.4. Memory cores

The system has two memory interfaces, one for DDR2 and another for Flash. The combination of these memories allows the use of complex software schemes such as operating systems, IP stacks, etc., allowing the system to transfer any data using standard protocols.

4.5. Hardware results

Table 5 presents a summary of the required resources. The system is built around the high performance Virtex5 70fxt. The PowerPC runs at 400 MHz to provide maximum performance. The presented system has only a single capture core, but there is plenty of room both for more capture cores and for a more complex SoC.

5. CONCLUSIONS AND FUTURE WORK

The present paper presents both a simulation framework and a practical implementation of a high reliability filter. The implementation uses FIR filters, although it can be extended to IIR filters or any other kind of mathematical circuit.
In systems where FPGA failure is of concern, the vote and mean circuitry should also be triplicated, as well as any following signal processing circuitry.
The system can be upgraded to detect an error both in the input (analog to digital converter) and in the output (result of the filtering). This way action can be taken to try to solve the problem. If the error is in the input, not much can be done but, if the error is inside the FPGA, some action can be taken. This can range from resetting the offending circuit to full FPGA reconfiguration, with partial reconfiguration as the middle point.

6. REFERENCES

[1] B. McHarg, “Control, data acquisition, and remote participation for fusion research,” Fusion Engineering and Design, vol. 71, no. 1-4, pp. 1–3, 2004, 4th IAEA Technical Meeting on Control, Data Acquisition, and Remote Participation for Fusion Research. [Online]. Available: http://www.sciencedirect.com/science/article/B6V3C-4CGNSF2-1/2/bfd1aabcaa30ed6414008b4742affb1a
[2] K. Nurdan, H. Besch, B. Freisleben, T. Conka-Nurdan, N. Pavel, and A. Walenta, “Development of a Compton Camera Data Acquisition System Using FPGAs,” in Proceedings of the 2003 International Signal Processing Conference, 2003.
[3] H. I. Schlaberg, D. Li, Y. Wu, and M. Wang, “FPGA Based Data Acquisition and Processing for Gamma Ray Tomography,” AIP Conference Proceedings, vol. 914, no. 1, pp. 831–837, 2007. [Online]. Available: http://link.aip.org/link/?APC/914/831/1
[4] P. Adell and G. Allen, “Assessing and mitigating radiation effects in Xilinx FPGAs,” JPL, Tech. Rep., 2008. [Online]. Available: http://hdl.handle.net/2014/40763
[5] R. Baumann, “Soft errors in advanced semiconductor devices-part I: the three radiation sources,” Device and Materials Reliability, IEEE Transactions on, vol. 1, no. 1, pp. 17–22, Mar. 2001.
[6] R. Baumann and E. Smith, “Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices,” 2000, pp. 152–157.
[7] M. Bellanger, Digital Processing of Signals: Theory and Practice. John Wiley & Sons Ltd., 2000.
[8] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Prentice Hall, 2009.
[9] Xilinx, “TMRTool Product Brief,” http://www.xilinx.com/publications/prod_mktg/XTMRTool_ssht.pdf.
[10] Xilinx, “Xilinx System Generator for DSP,” http://www.xilinx.com/tools/sysgen.htm.
[11] Xilinx, “Xilinx Platform Studio,” http://www.xilinx.com/tools/xps.htm.
[12] The MathWorks, “Simulink - Simulation and Model-Based Design,” http://www.mathworks.com/products/simulink/.
[13] Xilinx, “Processor Local Bus (PLB) v4.6.” [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/ds531.pdf
[14] R. Perez, “Methods for Spacecraft Avionics Protection Against Space Radiation in the Form of Single-Event Transients,” Electromagnetic Compatibility, IEEE Transactions on, vol. 50, no. 3, pp. 455–465, Aug. 2008.
[15] Xilinx, “Spartan-3A DSP FPGA Family: Complete Data Sheet,” http://www.xilinx.com/support/documentation/data_sheets/ds610.pdf, Mar. 2009.
[16] Xilinx, “XtremeDSP DSP48A for Spartan-3A DSP FPGAs User Guide,” http://www.xilinx.com/support/documentation/user_guides/ug431.pdf, Jul. 2008.
DEVELOPMENT OF A GENERIC PLATFORM FOR VISION SYSTEMS
BASED ON THE CORECONNECT ARCHITECTURE
INCA/INTIA
Universidad Nacional del Centro de la Provincia de Buenos Aires
Paraje Arroyo Seco, Tandil, Buenos Aires Province, Argentina
email: [email protected], {lleiva,mvazquez}@exa.unicen.edu.ar
ABSTRACT
Fig. 2. System architecture

werPC. This architecture is based on the use of cores and buses. The architecture version is 4.6 [3], which is integrated in EDK 10.1 [4].

The system has a core in charge of controlling the camera, receiving its data and writing them to memory. The camera is a 5 MP CMOS sensor manufactured by Micron, part number MT9P001 [5]. The sensor is mounted on a development board (headboard) fitted with a Navitar lens that allows both the aperture and the focal length to be adjusted. Its resolution is 2592 (horizontal) x 1944 (vertical) pixels. Each pixel has a depth of 12 bits. The sensor uses the Bayer pattern [6].

Image processing is carried out on the embedded PowerPC (or MicroBlaze) microprocessor by running a program written in C. The processor is also in charge of transmitting the images both to a monitor and to the serial interface.

The processed images are stored in external memory, in three distinct areas. One memory area belongs to the video core, a second area is where the camera core writes the images, and a third one holds the processed image for later display or transmission.

3. SOC ARCHITECTURE

Figure 2 shows the architecture with the components involved in the system.

The IP cores used are mainly block RAM controllers (bram_block and xps_bram_if_cntlr), RAM memory controllers (mpmc), a UART controller (xps_uartlite), a video driver (xps_tft), and the PowerPC controller (ppc405). The bus used for communication is PLB version 4.6.

The peripherals in charge of the UART, external memory, RAM blocks and external switches are connected to one given bus.

The video controller IP core is of both "master" and "slave" type. It connects to two different buses: the master connects to its own bus to communicate with memory, and the slave connects to the bus shared by the rest of the peripherals. It is connected to the memory controller through a dedicated bus because the peripheral needs high bandwidth. In this way it avoids sharing the bus with other peripherals and always has access to it.

The IP core in charge of controlling the PowerPC has four ports, two dedicated to data and two to instructions, named D0 and D1 for data, and I0 and I1 for instructions. Ports I0 and D0 are connected to the same bus as the rest of the peripheral cores so that the processor can interact with them, while D1 and I1 are connected to memory through a dedicated bus, giving fast access without competing for the bus with other cores.

3.1. Camera Core

As mentioned above, the core (Fig. 3) was developed following the layered architecture for core development [7]. The lowest layer of this architecture (called IPIF) communicates with the PLB bus and thus provides a simplified interface, called IPIC, to the upper layer, called User Logic. The User Logic is where the logic of the core is placed.

The core was developed in VHDL. It is portable to other systems, as long as they use version 4.6 of the PLB bus.

It consists mainly of two modules: one in charge of receiving the data and configuring the camera (driver), and another in charge of sending the data over the bus to memory for later processing.

The driver handles the configuration of the camera and obtains the pixel intensity values with a resolution of 8 bits. The camera is configured through the I2C protocol.

The User Logic implements the controller logic. This entity communicates with the camera driver and sends the data to memory over the bus.

This component passes the data from the driver to the IPIF by setting the signals and addresses for the data transfer. Because the driver works with 8 bits of resolution per pixel and the bus is 32 bits wide, the data are stored in a buffer and sent in 32-bit packets (4 pixels). The start address of the write area is configurable through core parameters.
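The buffering of four 8-bit pixels into one 32-bit bus word can be illustrated with a short software model. This is a sketch under stated assumptions, not the VHDL of the core: the function name `pack_pixels`, the choice of placing the first pixel in the most significant byte, and the example base address are all hypothetical.

```python
def pack_pixels(pixels, base_address):
    """Group 8-bit pixels four at a time into (byte_address, 32-bit word) bus writes.

    Assumption: the first pixel of each group lands in the most significant byte.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("pixel count must be a multiple of 4")
    writes = []
    for offset in range(0, len(pixels), 4):
        word = 0
        for pixel in pixels[offset:offset + 4]:
            if not 0 <= pixel <= 0xFF:
                raise ValueError("pixels must fit in 8 bits")
            word = (word << 8) | pixel
        # Four pixels occupy four bytes, so the byte offset equals the pixel offset.
        writes.append((base_address + offset, word))
    return writes


# Hypothetical start address of the write area (a core parameter in the text).
bus_writes = pack_pixels([0x11, 0x22, 0x33, 0x44, 0xAA, 0xBB, 0xCC, 0xDD], 0x1000)
```

Packing in this way means one bus transaction carries four pixels, which is why the 8-bit driver output is buffered before being handed to the IPIF.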
4. EXPERIMENTAL RESULTS

Table 1. Execution times

Algorithm          Without cache   With cache   Speed-up
Simple interp.     251 ms          225 ms       11.6 %
Bilinear interp.   276 ms          270 ms       2.2 %
Gradient interp.   647 ms          602 ms       7.5 %
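The percentages in the last column of Table 1 appear to be consistent with the time saved expressed relative to the cached execution time. That reading is inferred from the numbers and is not stated in the text; the helper name `cache_speedup_percent` is hypothetical.

```python
def cache_speedup_percent(t_no_cache_ms, t_cache_ms):
    """Time saved by the cache, as a percentage of the cached execution time."""
    return 100.0 * (t_no_cache_ms - t_cache_ms) / t_cache_ms


# (without cache, with cache) in milliseconds, taken from Table 1.
timings = {"simple": (251, 225), "bilinear": (276, 270), "gradient": (647, 602)}
speedups = {
    name: round(cache_speedup_percent(no_cache, cache), 1)
    for name, (no_cache, cache) in timings.items()
}
```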
implemented on the microprocessor or as a core connected to the bus. Another advantage is the ease of adding further cores to the platform. It is possible to work with more than one camera on the platform, by adding one camera core per connected camera and configuring them to write to different memory areas.

A core in charge of controlling the camera and receiving its data was developed. It is portable, as long as version 4.6 of the PLB bus is used. The write start address is parameterizable.

Measurements were made on the algorithms executed on the microprocessor. The difference in execution times between the algorithms was found to be due to the different number of memory accesses each of them performs per pixel. Execution times were reduced by adding a cache to the microprocessor.

7. REFERENCES

[5] Micron, 1/2.5-Inch 5-Megapixel CMOS Digital Image Sensor, 2005.
[6] S. Imaging, RGB Bayer Color and MicroLenses, 2010.
[7] Xilinx, PLB IPIF (v1.00f), 2007.
[8] Xilinx, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, 2007.
[9] Xilinx, Xilinx University Program Virtex-II Pro Development System - Hardware Reference Manual, 2005.
[10] Xilinx, PowerPC 405 Processor Block Reference Guide, 2010.
[11] Xilinx, Embedded Processor Block in Virtex-5 FPGAs, 2010.
RAPID PROTOTYPING OF AN IP CORE FOR APPLYING THE WAVELET TRANSFORM TO
IMAGES
MELO Hugo Maximiliano PEREZ Alejandro
email: [email protected] email: [email protected]
better concentration of time and frequency information [1].

Wavelet transforms are classified into Discrete Wavelet Transforms (DWT) and Continuous Wavelet Transforms (CWT).

2.1. Discrete Wavelet Transform in 2-D

Analysis by Discrete Wavelet Transform (DWT) can be implemented with banks of low-pass and high-pass filters followed by down-sampling stages. For synthesis, filter banks and up-sampling of the signal are also used. "Fig. 1" is a diagram of the analysis process.

Decimation (down-sampling) and interpolation (up-sampling) denote a decrease or an increase, respectively, in the number of samples, which is achieved by discarding one sample or by inserting a zero between samples [2].

The Wavelet energy feature {Eni}, n = 1...d, i = H, V, D, reflects the distribution of energy along the frequency axis for a given scale and orientation.

The energy of images is concentrated in the low frequencies. An image has a spectrum that decays as frequency increases. These properties are reflected in the Discrete Wavelet Transform of the image [3].

In compression and in some other applications of the transform, a multilevel technique is required. It is obtained by successively applying the transform to the approximation part of the previous stage. "Fig. 3" shows a classic representation of the result of the multilevel Wavelet transform, where the dimensions of the matrix are the same as those of the original image.

The nomenclature is read as follows: the first letter indicates the direction of the detail or the approximation (V = Vertical, D = Diagonal, H = Horizontal, A = Approximation); the number indicates the transform level to which it corresponds.
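The analysis stage (low-pass and high-pass filtering followed by down-sampling by two) and its multilevel repetition on the approximation part can be sketched in one dimension. For brevity this sketch assumes the two-coefficient Haar filters rather than the Daubechies 3 filters used in this work, and the function names `analysis_step`, `synthesis_step` and `multilevel` are hypothetical.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # Haar filter coefficient (assumed wavelet)


def analysis_step(signal):
    """One level: Haar low/high-pass filtering fused with down-sampling by 2."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approximation = [(a + b) * SCALE for a, b in pairs]
    detail = [(a - b) * SCALE for a, b in pairs]
    return approximation, detail


def synthesis_step(approximation, detail):
    """Inverse of analysis_step: rebuilds the two samples behind each coefficient pair."""
    signal = []
    for a, d in zip(approximation, detail):
        signal.extend([(a + d) * SCALE, (a - d) * SCALE])
    return signal


def multilevel(signal, levels):
    """Apply analysis_step repeatedly to the approximation of the previous level."""
    approximation, details = list(signal), []
    for _ in range(levels):
        approximation, detail = analysis_step(approximation)
        details.append(detail)
    return approximation, details


samples = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx1, detail1 = analysis_step(samples)
rebuilt = synthesis_step(approx1, detail1)
approx2, details2 = multilevel(samples, 2)
```

For an image, the same step would be applied along rows and then along columns, which yields the approximation and the horizontal, vertical and diagonal detail sub-bands named above.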
3.1. Features
'db3' tells the MatLab function that the mother Wavelet is a Daubechies 3. The resulting filters for this Wavelet are of order 5, with a total of 6 coefficients.

Next, the following schematic ("Fig. 4") is built in the SimuLink environment, using native SimuLink blocks and System Generator blocks.

the next stage, which resulted in an alteration of the reconstructed image, since the zeros remained embedded in the analysis.

b) It was decided to truncate the number of data, considering 5000 data valid. During the tests it was found that data cannot be discarded arbitrarily, since this affects the results of the subsequent reconstruction. For each single decomposition, the first N data were taken as valid and M data were removed from the decomposition, where M is obtained by truncating the following ratio to its integer part:

$$M = \left\lfloor \frac{\text{filter order}}{2} \right\rfloor$$
where entrada (the input) is a vector of finite values, db3 is the type of wavelet used to compute the coefficients, and the outputs Ap and De are the Approximation and the Detail respectively.

3.4. Implementation results

When the image reconstructed with the implemented method was compared with the original image, errors were found in the upper left corner of the image, more specifically in an n x n submatrix, where "n" is the number of coefficients of the filter implemented for Daubechies 3.

To solve this problem, frames were added to the original image. In this rapid prototyping experiment a frame with a numeric value of one was used, which made it possible to see where the image begins once the frame ends; although no boundary extension was applied, a perfect reconstruction was achieved. This is because the errors caused by truncating information when building the auxiliary traversal matrices, described above, fall within the perimeter of the frame ("Fig. 5"), which is later removed.

Instantiate the sources of the files generated by System Generator in the files “user:logic.hdl” and “top_entidad.hdl.”.

Once the integration has been verified by synthesizing the sources, the IP is added from the repository in the EDK environment, connected to the corresponding bus, and the memory map of the system is generated.

Figure 5 EDK.

With the filter integrated into the system, a routine will be created in SDK that writes to the input register of the filter the data sent over the serial port of a PC connected to the system. The resulting data in the output register are sent to the PC to be compared with the results of the simulations run in SimuLink.

4. CONCLUSION AND FUTURE WORK
Cortex-M0 Implementation on a Xilinx FPGA
Pedro Ignacio Martos and Fabricio Baglivo
ABSTRACT
Verilog, and test-bench code. The test-bench instantiates the Cortex-M0 DS module and connects it in a minimal way to a memory model and to clock and reset generators. It also provides a means of outputting information from the processor to the Verilog simulator's console.

The aim of this work is to synthesize the Cortex-M0_DS in a real FPGA, connect it through the AMBA-Lite interface to a memory holding a small program, evaluate how many FPGA fabric resources are needed to do so, and assess its applicability in small-footprint systems.

1.2. AHB AMBA-Lite overview

This protocol defines the data and address buses and all the control signals for high-performance synthesizable designs requiring high bandwidth. First, the address bus, HADDR[31:0], is a MASTER output to a SLAVE device. Data transfers are performed using two buses: a read bus, HRDATA[31:0], which is a SLAVE output to a MASTER input, and a write bus, HWDATA[31:0], which is a MASTER output to a SLAVE. In this work we use only HRDATA. Finally, the protocol specifies control signals. HBURST[2:0] indicates whether the transfer is a single transfer or forms part of a burst. HMASTLOCK indicates whether the current transfer is part of a locked sequence. HPROT[3:0] provides additional information about a bus access and is primarily intended for use by any module that wants to implement some level of protection. HSIZE[2:0] indicates the size of the transfer, e.g., byte, half word, or word. HTRANS[1:0] indicates the transfer type of the current transfer (IDLE, BUSY, NONSEQUENTIAL, SEQUENTIAL). The HWRITE signal indicates the transfer direction: when HIGH it indicates a write transfer, and when LOW, a read transfer.

2. THE IMPLEMENTATION

from the Digilent web page. This is very important for our purpose because it makes it possible to program the board and observe its state easily.

The Xilinx S3E500-4 is the FPGA included on the board. It has 500K gates, 10,500 logic cells, 20 hardware multipliers, 360 Kbits of dedicated RAM, 73 Kbits of distributed RAM, 4 clock managers, and a maximum clock frequency of 300 MHz.

2.2. Software tools

We used the Xilinx® Integrated Software Environment (ISE™) as our design suite. The ISE project navigator allowed us to manage the project and to synthesize it. Core Generator is the tool we used to generate the ROM memory and the reset generator. ISIM was used for simulation. We also used Chipscope Pro to perform online debugging; this program let us observe the state of the system. For compiling C code, we used the ARM Microcontroller Development Kit by Keil™.

The ARM deliverables package contains a logical folder with synthesizable code and test-bench code. The test-bench project has a Verilog implementation of the processor, Cortexm0ds_tb.v, prepared for simulation. It also includes a HelloWorld.c program with a makefile containing the compilation parameters. The result of this compilation is a .bin file. This file is a memory
image that is loaded by the Verilog test-bench at the beginning of the simulation.

Figure 3 Cortex M0 Design test-bench Schematic

2.3. System Implementation

The first step of the implementation process was the replication of the Hello World project. We realized that the makefile did not work correctly with Keil or IAR, so we decided to begin a new project with Keil. We ran the compilation and compared the resulting .bin file with the one ARM provides in the test-bench package. After that, we began the simulation with ISIM. In it we saw that the data bus did not contain valid values. The .bin file was not correctly loaded by the Verilog code because it was assumed that the Verilog instruction $fread() made double word accesses, whereas it actually made byte accesses. After this modification the test-bench worked as expected.

The next step was the implementation of the test-bench code as synthesizable VHDL code (see Figure 3). An external 50 MHz oscillator was used as the external clock. We used a synthesized DCM to generate the 10 MHz system clock. The reset generator was implemented using the Xilinx architecture wizard. A reset pulse of two clock cycles is needed to initialize the processor. The pre-initialized 1Kx32-bit RAM was created using the Core Generator tool. The processor performs 32-bit data accesses on byte addresses, so we had to shift the RAM address bus by 2 positions: processor addr[2] is connected to RAM addr[0]. Some LEDs were connected to internal signals, namely LOCKED and SLEEPING. The address and data buses of the AMBA-Lite interface and HWRITE were connected directly to the memory. HREADY was fixed to ‘1’ and HRESP was fixed to ‘0’, so that every transfer completes immediately with an OKAY response. Other signals of the interface were connected to internal signals for debugging purposes. A bus signal detector was developed to compare HWRITE bus information with two patterns, one to turn the LED on and the other one to turn it off. The user constraint file (.ucf) was defined using the Plan Ahead tool.

Figure 4 FPGA use with the Project Implementation.

A “Toggle LED” project was created. Its aim was to turn one of the board's LEDs on and off with the Cortex processor. The .bin file was generated using the Keil tools. The Core Generator tool allowed us to fill the memory with an image. The file format required was .coe, so a script was written to transform the .bin file into a .coe file. Once the project was synthesized, IMPACT was used to program the board.

3. RESULTS

As mentioned before, the Chipscope Pro package was used to see the transitions on the AMBA AHB-Lite interface signals and for debugging. With this tool, we could verify that the interface worked as expected. All memory accesses were correctly synchronized and we saw that the LED on the board blinks. We therefore conclude that the implementation is correct and functional.

Figure 4 shows the FPGA usage of the project implementation. Figure 5 shows the simulation of the memory accesses in the Keil software. Figure 6 shows the Chipscope Pro capture of the AMBA bus. Timing reports (post place & route) gave a maximum clock speed of 40 MHz. This value could be improved by applying timing and placement constraints.

Figure 5 Simulation in Keil
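Two details of Section 2.3 lend themselves to a short sketch: converting the .bin memory image into a .coe initialization file, and mapping the processor byte address onto the word address of the 1Kx32 RAM. The authors' script is not shown in the paper, so the following is only an assumed equivalent. The names `bin_to_coe` and `ram_word_index` are hypothetical, the little-endian word assembly is an assumption about how the image is laid out, and the sample bytes are invented for illustration.

```python
import struct

RAM_WORDS = 1024  # 1Kx32-bit RAM, as described in the text


def bin_to_coe(image, depth_words=RAM_WORDS):
    """Render a binary image as .coe text: one 32-bit hex word per line."""
    padded = image + b"\x00" * (-len(image) % 4)  # pad to a whole number of words
    words = [word for (word,) in struct.iter_unpack("<I", padded)]  # little-endian
    if len(words) > depth_words:
        raise ValueError("image does not fit in the RAM")
    words += [0] * (depth_words - len(words))  # fill unused locations with zeros
    header = "memory_initialization_radix=16;\nmemory_initialization_vector=\n"
    return header + ",\n".join(f"{word:08x}" for word in words) + ";\n"


def ram_word_index(haddr):
    """Drop the two byte-select bits of the byte address and wrap to the RAM depth."""
    return (haddr >> 2) & (RAM_WORDS - 1)


# Invented six-byte image, only to show the output shape.
coe_text = bin_to_coe(bytes([0x00, 0x10, 0x00, 0x20, 0xC1, 0x00]), depth_words=4)
```

Dropping the two low address bits is the software counterpart of wiring processor address bit 2 to RAM address bit 0: consecutive 32-bit words sit four byte addresses apart.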
4. CONCLUSIONS

The most remarkable conclusion is that it is possible to implement the M0DS in a low-range FPGA. With this result, Xilinx, Actel and Altera (the three most important FPGA manufacturers) can support this core, making it a considerable alternative when portability between these three FPGA types is a requirement for a design.

As an improvement, it would be useful to have a complete test bench that allows us to generate the .bin file from the source code. That is not possible right now, and it would shorten the development time.

In future work, other peripherals will be connected to the AMBA bus in order to increase the processor's capabilities. Going further, a Linux operating system can be investigated on this processor, obtaining a Linux implementation in a small-footprint design.

[3] ARM Ltd, “AT510-DC-80001-r0p0-00-rel0 ARM Cortex M0 DesignStart Release Note”, August 2010.
[4] ARM Ltd, “ARM DDI 0432C Cortex M0 Revision r0p0 Technical Reference Manual”, November 2009.
[5] ARM Ltd, “ARM DUI 0497A Cortex M0 Devices Generic User Guide”, October 2009.
[6] Xilinx, “DS312 Spartan-3E FPGA Family: Datasheet”, August 2009.
[7] Digilent, “Digilent Spartan 3E Starter Kit Reference Manual”, June 2008.

To William Hohl, Joe Bungo, Fiona Cole, the people at the ARM University Program and the people at the Xilinx University Program (XUP) for their support and cooperation.

III. TRADEMARKS AND COPYRIGHTS

The information about ARM processor families was mainly extracted from the ARM Ltd web site (www.arm.com), as published in October 2010.

ARM, CORTEX, CORTEX-M, AMBA, AMBA-LITE, and other designated brands included herein are trademarks of ARM Limited.

Xilinx, Spartan, ISE, and other designated brands included herein are trademarks of Xilinx Inc.

Digilent, Spartan3E Starter Kit, Adept, and other designated brands included herein are trademarks of Digilent Inc.
DIGITALLY CONFIGURABLE PLATFORM FOR POWER QUALITY ANALYSIS
the FFT analysis of harmonics and interharmonics waves. In Section 5, we describe the proposed platform. In Section 6, the architecture is analysed. Conclusions are drawn in Section 7.

2. POWER QUALITY CONCEPTS

Power quality is determined by a set of different measurements performed on the voltage and current waveforms. The main purpose of these measurements is to determine both (1) how efficiently the electrical energy is utilized and (2) how good the energy provision is.

Early electrical applications consisted of linear and balanced loads. For such loads, power quality analyses were confined to determining the phase angle between the voltage and current waveforms. The cosine of this phase angle, denoted cos(φ), gives the ratio between the electrical energy effectively utilized (active energy) and the electrical energy supplied (apparent energy).

Nowadays, the characteristics of the loads are different. Most electrical loads use semiconductor devices that produce non-linear behaviour and consequently introduce perturbations into the power line that worsen the PQ of the system.

The perturbations defined in [3], and supported by most modern PQMS, are:
• Swell: an increase in the A/C voltage, with a duration which may range from half a cycle to a few seconds.
• Sag: the same as a swell, but with a reduction of the voltage.
• Flicker: a momentary interruption of the electrical energy.
• Undervoltage: a reduction of the voltage lasting more than 1 minute.
• Overvoltage: an increase of the voltage lasting more than 1 minute.
• Interruption: a reduction of the voltage below 10%.
• Harmonics: voltage and current components with a frequency different from the frequency of the power line.
• Frequency deviation: the difference between the measured frequency and the theoretical frequency of the power line.

Similarly, and according to IEC61000-4-30, a Power Quality Analyzer should analyze and evaluate these quantities: power frequency, magnitude of the supply voltage, flicker, harmonics and interharmonics, supply voltage unbalance, rapid voltage changes, and voltage dips, swells and interruptions. The standard suggests 0.1% for the complete instrument including sensors (e.g. current clamps).

An adequate measurement is the basis for any other power quality device. Modern PQ monitoring systems range from traditional watt-hour meters or digital protection relays in which the PQ analysis algorithms are inserted, to complex devices that deal with PQ parameters and events. Most of these devices have to be configured in order to implement adequate strategies to guarantee an acceptable power quality. PQ strategies determine the actions to take for the different PQ events. Possible actions include modifying the regime of the electrical loads, connecting compensators, switching off secondary components, etc.

3. VARIABLE SIGNAL SAMPLING

An important parameter for power quality is the harmonic content of the power line supply and load. The specifications of the measurement and analysis are well defined in [5] and [8].

In [8], it is specified that the measurement interval shall be 10 cycles for 50 Hz systems or 12 cycles for 60 Hz systems. The standard defines the time period that needs to be measured and how the measured values are aggregated. The interval is not fixed but varies in time as the fundamental frequency of the power line changes. This kind of measurement requires synchronization with the power line in order to adapt the sampling interval accordingly. The easiest way to achieve an adaptive sampling frequency is to use a PLL (Phase-Locked Loop).

A PLL is an electronic feedback system that generates a signal whose phase is locked to the phase of an input or reference signal. This is accomplished in a common negative feedback configuration by comparing the output of a voltage controlled oscillator with the input reference signal using a phase detector.

Analog PLLs are generally built from a phase detector, a low pass filter and a voltage-controlled oscillator (VCO) placed in the forward path of a negative feedback closed-loop configuration. Figure 1 shows a block diagram of a basic PLL structure.

[Figure 1: basic PLL block diagram. The reference input feeds the phase detector (PD), followed by the low pass filter (LPF) and the voltage controlled oscillator (VCO), whose output is fed back to the phase detector.]
92
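The link between the tracked line frequency and the sampling frequency can be illustrated with a short sketch. It assumes that a fixed number of samples must fit exactly into the 10-cycle or 12-cycle measurement window; the default of 2048 samples is the window length adopted later in this paper, while the function and parameter names are illustrative and not part of the original design.

```python
def sampling_frequency(line_freq_hz, cycles_per_window, samples_per_window=2048):
    """Sampling frequency (Hz) that fits samples_per_window samples
    into exactly cycles_per_window periods of the power line."""
    if line_freq_hz <= 0 or cycles_per_window <= 0 or samples_per_window <= 0:
        raise ValueError("all arguments must be positive")
    # window duration = cycles / f0, so fs = N / window = N * f0 / cycles
    return samples_per_window * line_freq_hz / cycles_per_window


def frequency_resolution(line_freq_hz, cycles_per_window):
    """Spacing (Hz) between DFT bins for a window of whole line cycles."""
    if line_freq_hz <= 0 or cycles_per_window <= 0:
        raise ValueError("all arguments must be positive")
    return line_freq_hz / cycles_per_window


if __name__ == "__main__":
    # Illustrative line frequencies around the 50 Hz nominal value.
    for f0 in (49.5, 50.0, 50.5):
        fs = sampling_frequency(f0, cycles_per_window=10)
        print(f"f0 = {f0} Hz -> fs = {fs} Hz")
```

Under these assumptions a nominal 50 Hz line with a 10-cycle window and a nominal 60 Hz line with a 12-cycle window both give 10240 Hz, and the sampling frequency scales in proportion to the measured line frequency.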
Analogue PLL circuits should be calibrated in order to achieve adequate response times when the input frequency changes. Moreover, including additional analogue components in a system requires a more careful design to avoid interference between the analogue and digital circuits.
On the other hand, software PLL circuits require strict temporal requirements to be met. The processing time required by this kind of algorithm is very demanding and hard to meet on most embedded processors without exclusive or prioritised utilisation. When these temporal constraints are not met, spurious harmonic components are introduced in the frequency analysis and, consequently, measurement errors arise.
In this paper, we utilised the digital PLL circuit for power line applications proposed in [9]. This PLL is implemented as a digital circuit that produces a fast response time when the power frequency changes. This PLL circuit does not require processing time from the system processor.
The PLL synchronises with the power frequency and generates the adequate sampling frequency to meet the harmonic analysis specified in [5] and [8]. The goal of an adequate synchronization for the harmonics analysis is to reduce the spectral leakage effect. Besides, the PLL utilised computes the sine and cosine of the power line frequency, which are used to easily detect voltage and current perturbations as well as to determine reactive, active and apparent powers and energies.

4. FFT FOR HARMONICS AND INTERHARMONICS ANALYSIS

The power frequency is called the fundamental frequency. A sinusoidal wave with a frequency k times the fundamental (k is an integer) is called a harmonic wave, or harmonic for short. A sinusoidal wave whose frequency cannot be expressed as an integer multiple of the fundamental is called an interharmonic wave, or interharmonic for short.
The principle of harmonics and interharmonics measurement is specified in [6]: a 200 ms window (10 periods of a 50 Hz signal or 12 periods of a 60 Hz signal) is used in the DFT calculation, resulting in 5 Hz increments in the frequency spectrum.
For power quality measurements, the analysis of harmonics is usually limited to the 50th harmonic (i.e. to 2500 Hz for a 50 Hz signal).
The Fourier transform converts a time-domain signal into its frequency spectrum. For sampled (discrete-time) signals the appropriate transform is the DFT (Discrete Fourier Transform), which the FFT (Fast Fourier Transform) computes efficiently.
With these specifications, the sampling frequency must be at least 100 samples per period for the 50th harmonic to be the highest frequency that can be measured. Besides, the 10/12 cycle windows require at least 1200 samples per window. Because of the radix-2 factor of the DFT transform, a length of 2048 was chosen, with a sampling frequency varying from 9 kHz to 13 kHz.

5. PLATFORM PROPOSED

Power quality measurement and analysis requires strict temporal constraints to be met. On the other hand, it also requires processing and storing a large amount of information. We can define two kinds of functions that a power quality device has to carry out: (1) power quality synchronization, measurement, monitoring and analysis and (2) processing, storage and communication of the data and information of the system.
When the first kind of function is implemented in software, perturbations are introduced because the processor's time is shared among all the functions of the system. These perturbations are produced because the real-time features of a Real-Time Operating System are not well matched to the temporal constraints that power quality analysis requires.
We propose a platform based on an FPGA device that implements two main units: (1) the Power Quality unit (PQU) and (2) the SoPC unit with Linux. Both units communicate through a bus that maps the PQU into the memory map of the SoPC unit.
The PQU contains the acquisition, synchronisation and DFT stages. Voltage and current signals are attenuated, isolated, filtered, converted from analogue to digital and transformed. Figure 2 shows a scheme of the PQU.
Figure 2: PQU Architecture (blocks shown: analogue stage with attenuation and isolation for I1, I2, I3 and V1, V2, V3; ADC LM12L458; LP filter and PLL fed from V1; sampling generator providing clk; DFT; dual-port memory; bus interface)
Whilst the power quality measurement needs dedicated hardware to meet the strict timing constraints, the rest of the functions required to communicate, store or process a great deal of information may be
implemented on a processor. For this reason, a soft-processor FPGA-based system was found to be a suitable alternative. The System on Programmable Chip gives the system great flexibility as well as a friendly environment in which to build the embedded system architecture. We use the NIOS II soft-processor from Altera, implemented on a Cyclone III FPGA device. The system runs a µCLinux operating system to support the software applications.
Figure 3 shows the architecture of the FPGA-Linux board. µCLinux was chosen because of its wide support for communication and storage. The embedded system offers native server communications through Ethernet and serial interfaces.
Figure 3. Architecture of the FPGA-Linux board (blocks shown: Modbus RTU protocol application, waveform conversion, perturbations-to-events processing and Modbus TCP protocol application exchanging data on top of the µCLinux drivers; serial interface, NIOS processor, Ethernet interface and PQU interface; secondary communication link; link to a PC with Matlab/Simulink; voltages and currents)
The interface between the NIOS processor and the PQ unit is the Avalon interface. A µCLinux device driver was written to give software applications easy access to the PQ unit.
The FPGA-Linux Board was implemented on a DE2 Altera board with a Cyclone® II 2C35 FPGA. The board includes Ethernet and serial ports used to communicate with a supervisor PC. The drivers and protocols for these communication links are easily implemented as Linux applications.

6. RESULT ANALYSIS

Power Quality standards do not prescribe protocols or experiments that have to be carried out to meet the different Class requirements. Instead, they define the measurements and parameters for power quality analysis and monitoring. This makes it difficult to assure that a certain instrument, device or platform meets the power quality specification of the standards.
Several simulations have been carried out considering different perturbation scenarios, finding processing errors within the boundaries of the standard. However, we cannot assure that this performance is achieved for all the possible perturbations considered for power quality analysis. We need to define a testing protocol to compare the architecture with a Class A certified analyser.
However, we can assure that the digital architecture of the proposed PQ unit allows us to configure the synchronisation, transform and analysis parameters to optimise the performance of the unit for different power line perturbations. This feature helps to improve the flexibility of the architecture.
The use of an FPGA device running µCLinux reduces the design complexity. Linux reduces the complexity of implementing the data processing and data communication functions, since they are programmed as applications that use already tested drivers.
The processing and communication speed reached is adequate to measure and analyse the power quality parameters with harmonics up to the 60th order.

7. CONCLUSIONS

Modern loads utilize semiconductor devices whose non-linear behaviour introduces perturbations in the power line. Such perturbations could reduce the efficiency of energy utilization as well as cause damage to the equipment connected to the power line. Several standards have been published to define the different parameters to take into account to assure a good quality of service.
Power quality requires the processing of the voltage and current of the power line. Analogue and software approaches have been proposed for this purpose. Whilst the former require precise tuning and calibration for each device, the latter require a great deal of processing time from the system processor.
We proposed a power quality platform implemented on an FPGA. The power quality measurement, synchronization and analysis are performed by the Power Quality unit. This unit may be changed and modified in order to incorporate new power quality specifications.
On the other hand, the processing, storage and communication are implemented on a NIOS II soft-processor running a Linux version for software support.
In this way, the platform is highly flexible in both the power quality unit and the SoPC unit. Changes made in one unit do not affect the other, making the design and adaptation easy.

8. REFERENCES

[1] B. H. Chowdhury, "Power Quality," IEEE Potentials, vol. 20, pp. 5-11, 2001.
[2] I.-Y. C.-J. Won, J.-M. Kim, S.-J. Ahn, S.-I. Moon, J.-C. Seo, and J.-W. Choe, "Development of Power Quality Diagnosis System for Power Quality Improvement," presented at Power Engineering Society General Meeting, 2003.
[3] IEEE Std 1100-1992, "IEEE Recommended Practice for Powering and Grounding Sensitive Electronic Equipment" (IEEE Emerald Book), ISBN 1-55937-231-1, 1992.
[4] D.-J. Won, I.-Y. Chung, J.-M. Kin, S.-J. Ahn, S.-I. Moon, J.-C. Seo, and J.-W. Choe, "Power Quality Monitoring System with a New Distributed Monitoring Structure," KIEE International Transactions on PE, vol. 4A, pp. 214-220, 2004.
[5] EN 50160: Voltage characteristics of Electricity supplied by Public Distribution Systems.
[6] IEC 61000-4-7 Amend.1 to Ed.2: Electromagnetic compatibility (EMC): Testing and measurement techniques - General guide on harmonics and interharmonics measurements and instrumentation, for power supply systems and equipment connected thereto.
[7] IEC 61000-4-15 Electromagnetic compatibility (EMC): Testing and measurement techniques - Flickermeter - Functional design specifications.
[8] IEC 61000-4-30 Electromagnetic compatibility (EMC): Testing and measurement techniques – Power quality measurement methods.
[9] Ricardo Cayssials, Omar Alimenti, Edgardo Ferro, "A Digital PLL Circuit for AC Power Lines with Instantaneous Sine and Cosine Computation", IV IEEE Southern Conference on Programmable Logic, San Carlos de Bariloche, Argentina, ISBN 978-1-4244-1992-0, March 26-28, 2008.
SOLAR TRACKER FOR COMPACT LINEAR FRESNEL REFLECTOR USING PICOBLAZE
Daniel Hoyos, Maiver Villena, Carlos Cadena ∗ Victor Serrano, Telmo Moya, Marcelo Gea †
ABSTRACT
This paper describes a distributed control system for a Compact Linear Fresnel Reflector using a combination of chronological and light-sensing tracking techniques. The system uses LabVIEW at the controller stage, ZigBee for wireless communications and Spartan 3 FPGAs at the input/output stages.
1. INTRODUCTION
Table 1. Protocol codes.
Order               Code      Operation
Start               11110000  Start daily controller routine
Time update         11110001  Set controller time
Time check          11110010  Check controller's time
Position check      11110011  Check controller's position
Position change     11110100  Order position change
Save                11110101  Save
Relocate            11110110  Relocate
Echo order          11110111  Echo request
Status order        11111000  Status request
Time between steps  11111001  Set time between steps
Id request          11111010  Request controller identification

Fig. 2. LabVIEW communications subroutine.

\[ d = 23.45 \sin\!\left( 360 \, \frac{284 + n}{365} \right) \tag{1} \]

Solar time does not coincide with local clock time. Converting standard time to solar time takes two corrections. First, there is a constant correction for the difference between the observer's longitude and the standard longitude of the country. The second correction comes from the equation of time, which takes into account the perturbations of the rotation of the earth, as shown in (2):

\[ \text{Solar time} - \text{Standard time} = 4 (L_{st} - L_{loc}) + E \tag{2} \]

where E is given by (3) and B by (4):

\[ E = 229.2 \, (0.000075 + 0.001868 \cos B - 0.032077 \sin B - 0.014615 \cos 2B - 0.04089 \sin 2B) \tag{3} \]

\[ B = (n - 1) \, \frac{360}{365} \tag{4} \]

where n is the day number of the year, L_st is the standard longitude of the country and L_loc is the longitude of the place in question [2].
To protect the mirrors at night they must be placed facing down, so the device must travel over 135 degrees (7500 steps). In order to go back to the start position at sunrise it must return 12500 steps. The speed of this movement is limited by the motor's maximum speed and the system's inertia. It was experimentally determined that 100 Hz pulses give fault-free motor operation, so the time needed to put the system in rest mode is roughly one minute. Afterwards, at sunrise, the system is relocated in roughly two minutes.

3. SYSTEM CONTROL

The system has a central control, a communications network and one controller for each mirror. The central control performs the more complex calculations, such as sunrise time, sunset time and day duration, and sends them to the set of controllers. It also verifies controller operation, updates the system time and orders the system protection mode in bad weather. This central control was implemented with LabVIEW running on an embedded PC (PXI8155) sending data through a serial port [3]. The communications subroutine is shown in Fig. 2.
A simple three-byte protocol was defined for control orders, containing the instruction in the first byte and data in the others. The instruction byte is composed of 0xF in the high nibble and the instruction code in the low nibble. The instruction set is shown in Table 1.
The controller located at each mirror drives its movement according to the orders received from the central control and the position sensor data. The controllers are independent of each other. IEEE 802.15.4 (ZigBee) modules are used for RF communications, working in the 2.4 GHz band [4]. One module configured as Coordinator is connected to the PC serial port, and one module configured as End Device is connected to each controller.

3.1. Mirror Control

The controller stage uses a Xilinx FPGA with PicoBlaze, an embedded soft processor that performs overall control. The tasks required by the controller are implemented on the FPGA, including motor control, real-time clock, analog-to-digital converters for the sensors and a UART to drive the ZigBee module. As those devices are connected to PicoBlaze's input/output ports, it can access them using configuration registers.
Motor control is implemented as a state machine that compares the position register data with the internal current position and sets the movement direction and number of steps. This con-
Fig. 3. Controller System.
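Equations (1) to (4) of this paper can be evaluated with a few lines of code. The sketch below is only illustrative: it assumes that the angles in (1) and (4) are in degrees, that E and the result of (2) are in minutes, and that both longitudes are given in degrees with the same sign convention; the function names are hypothetical and do not come from the LabVIEW implementation.

```python
import math


def declination_deg(n):
    """Equation (1): d in degrees for day number n of the year."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + n) / 365.0))


def equation_of_time_min(n):
    """Equations (3) and (4): E for day number n (assumed to be in minutes)."""
    b = math.radians((n - 1) * 360.0 / 365.0)  # B from equation (4)
    return 229.2 * (0.000075
                    + 0.001868 * math.cos(b)
                    - 0.032077 * math.sin(b)
                    - 0.014615 * math.cos(2 * b)
                    - 0.04089 * math.sin(2 * b))


def solar_minus_standard_min(n, l_st_deg, l_loc_deg):
    """Equation (2): solar time minus standard time (assumed minutes)."""
    return 4.0 * (l_st_deg - l_loc_deg) + equation_of_time_min(n)


if __name__ == "__main__":
    # Hypothetical site: standard meridian 45 degrees, local longitude 65 degrees.
    for day in (1, 81, 172, 355):
        print(day,
              round(declination_deg(day), 2),
              round(solar_minus_standard_min(day, 45.0, 65.0), 2))
```

A central controller could call such functions once per day and send the derived times to the mirror controllers, which is the division of work the paper describes.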
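The three-byte order format of Table 1 (0xF in the high nibble of the first byte, the instruction code in the low nibble, then two data bytes) can be sketched as follows. The dictionary keys, the function names and the way a step count is split over the two data bytes are assumptions made for illustration; the paper only states that the data travel in the last two bytes.

```python
# Low-nibble instruction codes taken from Table 1; the key names are illustrative.
OPCODES = {
    "start": 0x0,
    "time_update": 0x1,
    "time_check": 0x2,
    "position_check": 0x3,
    "position_change": 0x4,
    "save": 0x5,
    "relocate": 0x6,
    "echo": 0x7,
    "status": 0x8,
    "time_between_steps": 0x9,
    "id_request": 0xA,
}


def encode_order(name, data=(0, 0)):
    """Build a 3-byte frame: instruction byte followed by two data bytes."""
    if name not in OPCODES:
        raise ValueError(f"unknown order: {name}")
    if len(data) != 2 or any(not 0 <= b <= 0xFF for b in data):
        raise ValueError("data must be two values in the range 0..255")
    return bytes([0xF0 | OPCODES[name], data[0], data[1]])


def decode_order(frame):
    """Return (order name, data bytes) or raise ValueError on a bad frame."""
    if len(frame) != 3 or frame[0] >> 4 != 0xF:
        raise ValueError("not a valid order frame")
    code = frame[0] & 0x0F
    for name, value in OPCODES.items():
        if value == code:
            return name, (frame[1], frame[2])
    raise ValueError(f"unknown instruction code: {code:#x}")


if __name__ == "__main__":
    # Assumption: a 16-bit step count sent high byte first (7500 = 0x1D4C).
    steps = 7500
    frame = encode_order("position_change", (steps >> 8, steps & 0xFF))
    print(frame.hex(), decode_order(frame))
```

Keeping the high nibble fixed at 0xF gives the receiver a cheap validity check on the first byte before it interprets the data bytes.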
ture of the absorber if necessary, for example by blurring a
mirror.
6. REFERENCES
5. CONCLUSION
TOOLBOX NURBS AND VISUALIZATION SYSTEM VIA FPGA
\[ N_{i,0}(u) = \begin{cases} 1 & \text{if } u_i \le u < u_{i+1} \\ 0 & \text{otherwise} \end{cases} \tag{2b} \]

\[ N_{i,p}(u) = \frac{u - u_i}{u_{i+p} - u_i} \, N_{i,p-1}(u) + \frac{u_{i+p+1} - u}{u_{i+p+1} - u_{i+1}} \, N_{i+1,p-1}(u) \tag{2c} \]

We assume throughout this paper that the knot vector has the following form:

\[ U = \{ \underbrace{a, \dots, a}_{p+1}, u_{p+1}, \dots, u_{m-p-1}, \underbrace{b, \dots, b}_{p+1} \} \tag{3} \]

where, in most practical applications, a = 0 and b = 1.
A NURBS surface of degree (p, q) is defined similarly as:

\[ S(u,v) = \sum_{i=0}^{n} \sum_{j=0}^{m} w_{i,j} \, P_{i,j} \, N_{i,p}(u) \, N_{j,q}(v) \tag{4} \]

where u and v are the parameter values in the longitudinal

Fig. 1. Data points, junctions (knots), distance and tangent vectors in a NURBS curve.

knots is calculating initial parameters values given by the chord length method:

\[ t_0 = 0 \tag{6a} \]

\[ t_k = \frac{1}{L} \sum_{i=1}^{k} \lvert D_i - D_{i-1} \rvert \tag{6b} \]
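The chord length method of (6a) and (6b) can be sketched in a few lines. The sketch assumes that L is the total chord length of the polyline through the data points, so that the last parameter equals 1; the function name is illustrative and is not taken from the toolbox.

```python
import math


def chord_length_parameters(points):
    """Parameters t_0..t_n from equations (6a)-(6b) for a list of data points.

    Each point is a sequence of coordinates (2D or 3D). L is assumed to be
    the sum of all chord lengths.
    """
    if len(points) < 2:
        raise ValueError("at least two data points are required")
    chords = [math.dist(points[i], points[i - 1]) for i in range(1, len(points))]
    total = sum(chords)
    if total == 0:
        raise ValueError("all data points coincide")
    params = [0.0]          # equation (6a)
    accumulated = 0.0
    for chord in chords:    # equation (6b), accumulated chord by chord
        accumulated += chord
        params.append(accumulated / total)
    return params


if __name__ == "__main__":
    print(chord_length_parameters([(0, 0), (3, 4), (3, 10)]))
```

Accumulating the chords in a running sum mirrors how a hardware core would produce one parameter per data point as the points stream in.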
Fig. 4. Chord length and averaging cores for the knot vector generation.
Fig. 5. Core for the generation of one point in the degree 3 NURBS curve of p=3 layers.
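The Cox-de Boor recursion of (2b) and (2c), which the cores evaluate layer by layer, can be written as a short recursive function. This is a plain software sketch and not the FPGA or GPU implementation; it uses the usual convention that a term with a zero denominator is taken as zero, and the clamped knot vector in the example is an illustrative choice.

```python
def basis(i, p, u, knots):
    """B-spline basis function N_{i,p}(u) by the Cox-de Boor recursion."""
    if p == 0:
        # Equation (2b): half-open interval [u_i, u_{i+1})
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = 0.0
    right = 0.0
    d_left = knots[i + p] - knots[i]
    if d_left > 0:
        left = (u - knots[i]) / d_left * basis(i, p - 1, u, knots)
    d_right = knots[i + p + 1] - knots[i + 1]
    if d_right > 0:
        right = (knots[i + p + 1] - u) / d_right * basis(i + 1, p - 1, u, knots)
    # Equation (2c): sum of the two lower-degree contributions
    return left + right


if __name__ == "__main__":
    degree = 3
    knots = [0, 0, 0, 0, 0.5, 1, 1, 1, 1]   # clamped, a = 0 and b = 1
    count = len(knots) - degree - 1          # number of basis functions
    values = [basis(i, degree, 0.25, knots) for i in range(count)]
    print(values, sum(values))
```

Each recursion depth corresponds to one layer of the algorithm, and the values inside a layer are independent of one another, which is what makes a per-layer parallel evaluation possible. Because of the half-open interval in (2b), the right end u = b needs separate handling in a complete implementation.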
- dividing the task into blocks, each of them consisting of a thread, defining 8 to 32 threads for each block;
- passing the serial processing to the computer processor;
- 16-bit data size (compatible with the FPGA system).
The GPU used has 16 multicores with 4 processors each (enabling at most 64 processors). The NURBS is implemented by multithreading each layer of the Cox-de Boor algorithm, just like the FPGA implementation, within the hardware limitations.

6. RESULTS

The NURBS local interpolation algorithm and the visualization system for FPGAs are compared with a single GPU, comparing the number of clock cycles, separated
The GPU processing still results in a more efficient manner to deal with the data interpolation, being a dedicated circuit, designed to optimally perform graphic functions. The cores are synthesized to perform functionality similar to the GPU, with 16 simultaneous threads and an independent clock counter synchronized with the beginning and end of the process.

7. CONCLUSIONS

The use of FPGAs in computer graphics is still incipient, despite the existence of circuits dedicated to this aim, leading to the following items:
1. while the GPU is a highly specialized processor that can achieve great performance (for a specific subset of problems), most GPUs are not as suitable as FPGAs for embedded applications because of their power dissipation, sometimes requiring more cooling than computer processors;
Table 1. Total clock cycles for 32 interpolation points and 100 parameters (p is the NURBS degree).
                          CPU (vectorized code)  GPU (CUDA)  FPGA
NURBS interpolation       47p(p+1)               16p         8p
Visualization pipeline    12p+2                  4p          10p+20
Knot addition (16 knots)  10p                    16p

Table 2. Cores and respective number of logic elements.
Core  LEs

In fact, General Purpose GPUs (GPGPUs) provide more flexibility to the system designer, although they remain locked to the hardware architecture, while some operations, such as fixed-point operations, are done efficiently in FPGAs. Future work aims to combine reconfigurable systems with GPGPUs.

8. REFERENCES

[1] M. C. Tsai, C. W. Cheng, M. Y. Cheng, “A real-time NURBS surface interpolator for precision three-axis CNC machining,” International Journal of Machine Tools & Manufacture, vol. 43, no. 12, pp. 1217–1227, May 2003.
[14] J. E. Bresenham, "Algorithm for Computer Control of a Digital Plotter", IBM Systems Journal, vol. 4(1), pp. 25-30, 1965.
[15] OpenCores Organization, “WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores”, Revision: B.3, 2002.
[16] B. Cope, P. Y. K. Cheung, W. Luk, S. Witt, "Have GPUs made FPGAs redundant in the field of Video Processing?", Proc. IEEE International Conference on Field-Programmable Technology, pp. 111-118, Dec. 2005.
[17] M. L. Stokes, “A Brief Look at FPGAs, GPUs and Cell Processors,” ITEA Journal, pp. 09 – 11, Jun./Jul. 2007.
[18] L. W. Howes, P. Price, O. Mencer, O. Beckmann, “FPGAs, GPUs and the PS2 - A Single Programming Methodology,” 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 313 – 314, Apr. 2006.
A METHODOLOGY FOR THE DEVELOPMENT OF HIGH-PERFORMANCE SYSTEMS ON CHIP
ABSTRACT
2. LIMITATIONS OF PROCESSOR-BASED SYSTEMS

Processors were conceived to perform general-purpose computing. This design decision means that processors are not efficient at specific computing tasks and therefore cannot deliver the processing performance that some current embedded systems demand.
Following Moore's law, alternatives have been sought over the years to improve the performance of processor-based computing platforms. However, these alternatives have been neither efficient nor applicable in many scenarios where performance was a requirement. As stated in [1], this is mainly because processors have inherent physical limitations that, in many cases and given current technology, prevent these alternatives from being applied arbitrarily:

- Increasing the number of transistors and the frequency at which they operate introduces serious heat dissipation problems (power wall).

- The frequency cannot be increased arbitrarily, not only because of the power wall but also because of an inherent physical limitation on the switching times of the transistors used in the microprocessor design (frequency wall).

- In a current computing system, the bandwidth of the microprocessor is generally 70 times higher than that of the external memory, which turns memory access into a bottleneck. The use of complex hierarchies of memories local to the microprocessor (caches) considerably reduces data access time, but since it is technologically impossible to increase the cache size arbitrarily, memory access remains a real problem (memory wall).

- Finally, processors themselves have a fundamental limitation: a design based on serial execution, which makes it extremely difficult to extract parallelism from an instruction execution stream. As mentioned in [2], current processor architectures include complex designs and techniques that try to extract instruction-level parallelism and mitigate this limitation.

The use of programmable logic and the construction of an HPSoC is a valid alternative for achieving substantial performance gains in systems where performance is the main requirement. This work demonstrates the implementation of an HPSoC that makes it possible to create a specific, application-oriented computing platform in order to optimise the data execution path (by extracting and implementing parallelism), optimise memory usage (increasing locality and access), reduce power dissipation (specific hardware requires fewer transistors) and lower the operating frequency (possible because multiple operations are performed in each clock cycle).

3. DEVELOPMENT METHODOLOGY FOR AN HPSOC

In the proposed methodology, the design of an HPSoC consists of two separate areas that nevertheless need to interact. One of these areas is the creation of the support needed to implement a microprocessor-based embedded system on the FPGA, and the other is the optimisation of the application with a view to a later implementation based on hardware-software codesign. In this codesign, the hardware component is implemented as an accelerator component that communicates with the software component through the embedded design.
First, embedded systems are developed using EDA tools that interconnect, through a hierarchy of interconnection buses, a microprocessor (which can be a soft core or a hard core, as mentioned in [3]) with a set of devices that turn the embedded system into a functional computing platform. Embedded system development is not standardised and varies depending on the FPGA vendor used. This work used Xilinx FPGAs, so the Xilinx development ecosystem was used to implement the embedded system. This consisted of using the EDK and ISE tools and the XilinxProcessorIPLib hardware component libraries.
Second, work must be done on the application to be optimised. To this end, a software prototype of the application or algorithm to be implemented on the HPSoC must be built. This prototype is then characterised and evaluated with tools such as profilers and code analysers in order to detect the segments or areas where most of the processing takes place (performance-critical sections). With this information, and following a top-down approach, the algorithm that defines the application is studied in order to refactor it so that the critical sections can be optimised and isolated for implementation in hardware. Implementing the critical sections of the application in hardware allows the computational operations to be represented in
hardware description languages and, through a level-based optimisation strategy, makes it possible to implement hardware accelerator components, that is, application-specific processing hardware that performs highly efficient, high-performance computation.
The accelerator component can be developed with a hardware description language such as VHDL or with an ESL tool such as ImpulseC [4]. The accelerator component must also be integrated into the embedded design of the HPSoC, so a high-speed communication channel between hardware and software must be developed as well.
It is also worth mentioning that the software component of the HW/SW codesign can run directly on the processor or under the control and support of an operating system (as one more user-space application). Given the benefits that an operating system provides, our methodology provides operating system support for the software component.
Once these two parts of the HPSoC are complete, the hardware design has to be translated and mapped onto the fabric of an FPGA, and the corresponding software binary images have to be stored in the appropriate memories for later evaluation.

4. PERFORMANCE OPTIMISATION IN AN HPSOC DESIGN

Several factors can be modified and several techniques can be applied to the architecture of an HPSoC in order to increase its overall performance. These factors can be grouped into three areas, which we call optimisation levels.

4.1. System-level optimisations

We define the system as the physical platform on which the application is implemented. System-level optimisations are tied to the way applications can be implemented on this platform and to the modifications that can be made to it so that applications run faster and throughput is higher.
The trivial optimisation is to modify the hardware design that makes up the computing platform so that its components run at the maximum admissible speed. In addition, it is best to establish high-speed communication channels between the components that the processor uses frequently, for example the memory banks or the communication with the FPGA fabric. The use of memory caches (preferably block RAM) can increase data locality and thus improve performance.
Likewise, the synthesis tools that synthesise the hardware design allow timing constraints to be configured and the effort level to be raised, in order to improve performance by reducing the data propagation time through the hardware.
In addition, at system level, data processing can be parallelised at the accelerator component level. Whenever the algorithm allows it, that is, when the processing algorithm works on data sets that are independent of one another, and provided that resources are available in the FPGA used, more than one accelerator component can be implemented so that several data sets are processed in parallel.

4.2. Application-level optimisations

We define the application as the computational algorithm that fulfils a certain number of requirements in order to implement, in software or hardware, the main functionality of the HPSoC.
The goal of application-level optimisations is to study the algorithm that defines the performance-critical operations of the application in order to detect its inherent parallelism and optimise the processing.
It should be noted that the optimisations are applied to the high-level details of the algorithm and not to the low-level details that define its implementation. Then, if possible parallelisations are detected, and always complying with the initial functional requirements, the necessary modifications are made to the algorithm's code so that it abandons its serial execution flow and adopts a parallel operating model.
Besides detecting levels of parallelisation and optimisations in the algorithm's execution flow, another interesting technique for optimising performance at application level is data precomputation. This basically consists of bounding the algorithm's range of action by making assumptions about its working space, so as to precompute and simplify its operations and thereby speed up its execution.

4.3. Microarchitecture-level optimisations

We define the microarchitecture as the programmable logic components that implement the low-level details of the algorithm defining the application that will run on the HPSoC. Some techniques for improving the performance of the accelerator component's microarchitecture are the following:

1) Replicating the arrays or memory banks that hold the data: one of the most important advantages that hardware programming offers is the
possibility of accessing multiple memory banks in a single clock cycle. Unlike a software implementation, in which a CPU is always connected to one or more physical memory devices through a single bus, a hardware implementation offers the flexibility to build an arbitrary interconnection topology in which a set of operations can, when executed, access data distributed across several memory banks in a single clock cycle. For this reason, an important factor to bear in mind is that, to achieve optimal results, the data set must be replicated in different memory banks. This yields separate memory banks, each with its own read/write port, which can be accessed in parallel for subsequent operation or processing.

2) Loop operations: in an algorithm, loops are among the constructs with a high degree of inherent parallelism and are therefore among the constructs targeted for optimisation. Loops generally perform repetitive operations on a data set. If each of the loop's operations does not depend on data computed in previous iterations, that is, if each iteration can operate on independent data sets, the degree of parallelism that can be obtained is high. There are two techniques for optimising loop operations: loop unrolling and the generation of "assembly lines", better known as pipelines. The

HPSoC with hardware acceleration. In these last two cases, the accelerator component was developed in VHDL and with the ImpulseC high-level synthesis tool, respectively.
The cryptographic application consisted of obtaining a data set from memory and encrypting it with the TripleDES symmetric cipher. TripleDES was used in ECB mode, following the guidelines given in [7] and [8]. Following the methodology proposed in this work, after designing and implementing the embedded system on the SoC, a non-optimised software prototype of the algorithm was developed. This prototype served to study and characterise the algorithm. With the data obtained, and evaluating the optimisation techniques listed in [9], the accelerator components were developed and the optimisation levels described above were applied.
It should be mentioned here that the prototypes were implemented on the FX12 Minimodule development kit, provided by Avnet, which features a Virtex4 FX12 FPGA and several external components described on the manufacturer's website. The block diagram of the HPSoC developed is shown in Figure 1.

5.1. Development of the SoC embedded system

During the development of the embedded design, support was provided for all the physical hardware devices of the development kit, using the hardware IP cores required for the embedded system to operate.
desenrrollado de bucles consiste en expandir el conjunto de
Para implementar el diseño embebido se utilizó la
iteraciones consideradas por el bucle y reacomodar el herramienta EDK de Xilinx descrita en [10]. El procesador
algoritmo para que estas puedan ser realizadas en paralelo y
elegido para el diseño embebido fue un recurso de hardware
en una sola iteración del bucle. El desarrollo de pipelines
que posee la FPGA elegida, es decir el hardcore de un
consiste en dividir el trabajo a procesar en subtareas, a
PowerPC 405 (PPC). Mediante esta herramienta se
modo de que a medida que van entrando los datos a
desarrollo un sistema embebido que permitió comunicar el
procesar, cada subtarea pueda ir procesando en forma procesador PPC con los dispositivos externos del kit de
concurrente un diferente set de datos. Entonces, si cada desarrollo, tales como la Memoria RAM, la memoria
iteración del bucle requiere ejecutar N subtareas, en una
FLASH, el puerto UART, la PHY de Gigabit Ethernet así
implementación sin pipeline, el bucle realizara una cantidad
como también implementar componentes necesarios para
(N * cantidad_elementos_de_datos) de iteraciones para
volver al sistema embebido y su procesador una plataforma
completar su trabajo. En cambio en una implementación
de computo funcional. Además la herramienta permitió,
con pipeline, la totalidad de datos serán procesados en una desarrollar un canal de comunicación de alta velocidad
cantidad (N + 1) de iteraciones. La teoría de pipelines y entre el componente acelerador implementado en la lógica
desenrollado de bucles ha sido extensamente desarrollada
programable y el microprocesador del sistema embebido.
en [5] y [6].
Se genero el soporte necesario para que el PPC pueda
comunicarse a través de los buses PLB, OPB, FCB, OCM y
5. V. PRUEBA DE CONCEPTO – DCR a los distintos dispositivo. Estos buses pertenecen a la
IMPLEMENTACIÓN DE UN HPSOC familia de buses CrossConnect y están descriptos en [10].
CRIPTOGRAFICO Cabe aclarar que este procesador solo soporta conexión
directa a los buses PLB, OCM, DCR y FCB, por lo que los
A modo de evaluar las mejoras obtenidas a través de la dispositivos atrás del bus OPB se alcanzaran a través de un
implementación de un HPSoC acelerado por hardware, se bridge PLB2OPB.
desarrollo un SoC prototipo sin aceleración Una vez definida la arquitectura de buses, sus tamaños y
(implementación solo por software), y dos versiones de un frecuencias de trabajo, así como también los elementos
110
[Figure: block diagram with the blocks block RAM, PowerPC 405 (I cache, D cache), FCB bus, hardware accelerator component, arbiter, bridge, and the PHY, DDR, UART, GPIO and HWICAP controllers.]

Fig. 2. Query flow for booting the operating system.
d) The maximum synthesis effort level was used: This was achieved by editing the file etc/fast_runtime.opt.
e) The data transfer size was increased to use the maximum bandwidth provided by the APU channel, that is, 64 data bits.

TABLE 1. IMPLEMENTATION RESULTS OF THE CRYPTOGRAPHIC HPSOC (HPSoC TripleDES)

Implementation       Operating frequency   Throughput (userspace application)   Gain
Software             300 MHz               42.096 Kbps                          1X
Hardware ImpulseC    50 MHz                17.929 Mbps                          415X
Hardware VHDL        50 MHz                19.280 Mbps                          458X

TABLE 2. SYNTHESIS RESULTS OF THE CRYPTOGRAPHIC HPSOC (HPSoC with the component developed in ImpulseC)

Resource         Utilization         Usage
BUFGs            11 out of 32        34%
DCM_ADVs         2 out of 4          50%
ILOGICs          29 out of 320       9%
External IOBs    73 out of 240       30%
LOCed IOBs       73 out of 73        100%
OLOGICs          54 out of 320       16%
PPC405_ADVs      1 out of 1          100%
RAMB16s          26 out of 36        72%
SLICES           5470 out of 5472    99%
SLICEMs          355 out of 2736     12%
table lookup operations in parallel.

c) Pipelines: To increase the data processing throughput, the algorithm was subdivided into stages that can operate in isolation. Most of these stages combine the SP boxes with the input data. The pipeline allows several data items to be processed at the same time.

6. RESULTS

The metrics obtained from running the hardware accelerator components, as well as the software version of the application, are shown in Table 1.

The table shows that implementing part of the application's algorithm in an HPSoC with an accelerator component yields a speedup of around 400X in both cases (415X and 458X). The reported throughput corresponds to the measured execution time of the software function that sends the data to the accelerator component.

Table 2 shows an excerpt of the synthesis report for the most significant accelerator component (the one developed in ImpulseC), with the percentage of FPGA resources used.

7. CONCLUSIONS

This work presented the performance benefits obtained by implementing cryptographic accelerators with HPSoCs.

The implementations show that the performance of a software application running on an embedded system can be increased by more than two orders of magnitude. The comparative results showed that applying the proposed methodology produced a gain of around 400X in both cases.

This work also used state-of-the-art programmable logic development tools to implement the embedded system and the accelerator components.

This work concludes that using an HPSoC is a technically viable alternative for improving the performance of today's embedded digital systems.

8. REFERENCES

[1] Wohlmuth, Otto, "High performance computing based on FPGAs", IEEE Field Programmable Logic and Applications (FPL), 2008.
[2] Ramakrishna, Rau and Fisher, Joseph, "Instruction-level parallel processing: History, overview, and perspective", The Journal of Supercomputing, Volume 7, Numbers 1-2.
[3] Meyer-Baese, Uwe, "Digital signal processing with field programmable gate arrays", Third Edition, Springer, page 589.
[4] D. Pellerin and S. Thibault, "Practical FPGA Programming in C", Prentice Hall Professional Technical Reference, 2005.
[5] Pai, Vijay and Adve, Sarita, "Code transformations to improve memory parallelism", Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, pages 147-155, 1999.
[6] Wolf, M.E, Chen, Ding-Kai, "Combining loop transformations considering caches and scheduling", 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996.
[7] Bruce Schneier, "Applied Cryptography Second Edition", John Wiley, 2004.
[8] Federal Information Processing Standards Publication, "DATA ENCRYPTION STANDARD (DES)", FIPS PUB 46-3.
[9] PK Yuen, "Practical Cryptology and Web Security", Pearson Education Limited, Chap 4, 2006.
[10] Xilinx Documentation files, "EDK Concepts, Tools, and Techniques".
[11] Shenoy, "Accelerating Software Applications Using the APU Controller and C-to-HDL Tools", Xilinx Application Note XAPP 901.
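The pipeline timing argument from the discussion of loop operations can be illustrated with a minimal Python sketch. It assumes an ideal pipeline in which every subtask takes exactly one iteration and a new data element can enter on every iteration; under that assumption the pipelined count is N + M - 1 for N subtasks and M data elements. The function names are hypothetical and are not part of the accelerator described in the paper.

```python
def sequential_iterations(n_subtasks, n_items):
    """Without a pipeline, every item occupies the loop for all N subtasks."""
    return n_subtasks * n_items


def pipelined_iterations(n_subtasks, n_items):
    """Step through an ideal pipeline one iteration at a time and count iterations."""
    stages = [None] * n_subtasks  # stages[k] holds the item currently in subtask k
    next_item = 0
    finished = 0
    iterations = 0
    while finished < n_items:
        # Feed a new item into the first stage (if any remain) and shift the rest.
        new = next_item if next_item < n_items else None
        stages = [new] + stages[:-1]
        if new is not None:
            next_item += 1
        iterations += 1
        # An item sitting in the last stage completes at the end of this iteration.
        if stages[-1] is not None:
            finished += 1
    return iterations


if __name__ == "__main__":
    for n, m in [(4, 1), (4, 100), (10, 1000)]:
        print(n, m, sequential_iterations(n, m), pipelined_iterations(n, m))
```

For a single data element both versions need N iterations; the benefit of the pipeline appears only when many elements are streamed through it.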
HIGH THROUGHPUT 4X4 AND 8X8 SATD SIMILARITY CRITERIA ARCHITECTURES
FOR VIDEO CODING APPLICATIONS
Julio S. Dominges Jr., Vinicius N. Possani, Dieison S. Silveira,
Leomar S. da Rosa Jr., Luciano V. Agostini
4x4 and 8x8 samples using the 2-D Hadamard transform,
Table 1. Comparison of PSNR and bitrate among the SATD, SSD and SAD similarity criteria

                    SATD      SSD       SAD
Mobile
  PSNR (dB)         33.807    33.790    33.754
  Bitrate (kbit/s)  325.28    328.02    328.72
Foreman
  PSNR-Y (dB)       36.790    36.713    36.641
  Bitrate (kbit/s)  120.02    121.61    120.48
Carphone
  PSNR (dB)         37.396    37.341    37.296
  Bitrate (kbit/s)  99.09     100.73    100.00

4. DESIGNED ARCHITECTURES

Two different architectures were designed in VHDL for the SATD calculation: one for 4x4 blocks and another for 8x8 blocks. The main objective was to find the best hardware solution for calculating this similarity criterion. As equations (1) and (2) show, the matrices involved in the Hadamard calculations have only two possible values (1 and -1), so the 2-D Hadamard calculation requires only additions and subtractions. A division by two or four is also applied to the final result for the 4x4 and 8x8 Hadamard transforms, respectively. These divisions are easily converted to a right shift of one or two binary positions, which is very simple to implement in hardware.

4.1. SATD 4x4 Architecture

Based on the 4x4 2-D Hadamard formula defined in (1), the process was divided into four steps in order to better expose parallelizable operations. These calculations are listed in Table 2 [3].

Table 2. Algorithm for the 4x4 2-D Hadamard calculation

a0  = w0 + w4      b0  = a0 + a1      c0  = b0 + b1      S0  = c0 + c1
a1  = w8 + w12     b1  = a2 + a3      c1  = b2 + b3      S1  = c0 - c1
a2  = w1 + w5      b2  = a4 + a5      c2  = b0 - b1      S2  = c2 - c3
a3  = w9 + w13     b3  = a6 + a7      c3  = b2 - b3      S3  = c2 + c3
a4  = w2 + w6      b4  = a0 - a1      c4  = b4 + b5      S4  = c4 + c5
a5  = w10 + w14    b5  = a2 - a3      c5  = b6 + b7      S5  = c4 - c5
a6  = w3 + w7      b6  = a4 - a5      c6  = b4 - b5      S6  = c6 - c7
a7  = w11 + w15    b7  = a6 - a7      c7  = b6 - b7      S7  = c6 + c7
a8  = w0 - w4      b8  = a8 - a9      c8  = b8 + b9      S8  = c8 + c9
a9  = w8 - w12     b9  = a10 - a11    c9  = b10 + b11    S9  = c8 - c9
a10 = w1 - w5      b10 = a12 - a13    c10 = b8 - b9      S10 = c10 - c11
a11 = w9 - w13     b11 = a14 - a15    c11 = b10 - b11    S11 = c10 + c11
a12 = w2 - w6      b12 = a8 + a9      c12 = b12 + b13    S12 = c12 + c13
a13 = w10 - w14    b13 = a10 + a11    c13 = b14 + b15    S13 = c12 - c13
a14 = w3 - w7      b14 = a12 + a13    c14 = b12 - b13    S14 = c14 - c15
a15 = w11 - w15    b15 = a14 + a15    c15 = b14 - b15    S15 = c14 + c15

The architecture takes two 4x4 blocks as input, the current block and the candidate block, each with 16 samples. The first module calculates the difference between the two blocks, subtracting them sample by sample. The subtraction results are sent to the unit that generates the absolute values. These values are then applied to the 2-D Hadamard module. After the transform, the values are summed by an adder tree to obtain the final SATD value. Fig. 1 illustrates the block diagram of the 4x4 SATD architecture, showing the four designed modules.

Figure 1. Block diagram of the 4x4 SATD architecture.

Several pipelined versions of this architecture were designed to find the best trade-off between processing rate and hardware use. The solution presented in this paper is the one with the highest throughput among all investigated solutions. This version uses a pipeline with 10 stages: one stage for the difference calculation, one stage for the absolute value generation, four stages for the Hadamard calculations and four stages for the output adder tree.

4.2. SATD 8x8 Architecture

The structure of the 8x8 SATD architecture is similar to that of the 4x4 SATD architecture. The algorithm for calculating the 8x8 Hadamard is more complex because of the larger number of input samples. To avoid a large increase in hardware resources, the 8x8 Hadamard was designed by exploiting the separability of 2-D transforms: two 1-D transforms are applied to the input data to generate the final result. The two 1-D transforms are identical, but the 2-D output block of the first transform must be transposed before being used as input for the second 1-D transform. Each 1-D transform must process eight input samples to finish its calculations (one row or one column of the 8x8 input block), and the process must be repeated eight times to process one complete 8x8 block.

An algorithm was extracted from equation (2) to support the hardware design. This algorithm calculates the 1-D Hadamard and is shown in Table 3. It is presented in simplified form for readability, to avoid a large number of rows in Table 3. The calculations shown in Table 3 are expanded eight times, and the index "i" is incremented by 8 at each expansion. After the eight expansions the algorithm is complete. The hardware implementation of this
algorithm allows the processing of 64 samples in parallel or, in other words, one complete 8x8 block can be processed at each clock cycle.

Table 3. Simplified algorithm for the 8x8 Hadamard

a0 = Wi + Wi+4      b0 = ai + ai+2      c0 = bi + bi+1
a1 = Wi+1 + Wi+5    b1 = ai+1 + ai+3    c1 = bi - bi+1
a2 = Wi+2 + Wi+6    b2 = ai - ai+2      c2 = bi+2 + bi+3
a3 = Wi+3 + Wi+7    b3 = ai+1 - ai+3    c3 = bi+2 - bi+3
a4 = Wi - Wi+4      b4 = ai+4 + ai+6    c4 = bi+4 + bi+5
a5 = Wi+1 - Wi+5    b5 = ai+5 + ai+7    c5 = bi+4 - bi+5
a6 = Wi+2 - Wi+6    b6 = ai+4 - ai+6    c6 = bi+6 + bi+7
a7 = Wi+3 - Wi+7    b7 = ai+5 - ai+7    c7 = bi+6 - bi+7

The 8x8 SATD was also designed with different numbers of pipeline stages, and the best solution in terms of processing rate is presented in this paper. The difference calculation must handle two 8x8 input blocks, one for the current block and one for the candidate block. The absolute value generation must process 64 input samples. This means that the first two modules are four times larger than those of the 4x4 SATD version. The 1-D transform architecture was duplicated to increase the processing rate, as shown in Fig. 2. The same hardware is therefore not reused, which avoids data dependencies. With this solution the 8x8 SATD architecture is able to process one new input block at each clock cycle.

Figure 2. Block diagram of the 8x8 SATD architecture.

5. RESULTS

The architectures were described in VHDL, then synthesized and validated using the Xilinx ISE 10.1 CAD tool. The Virtex2P FPGA family was used and the XC2VP30 device was selected [6]. The synthesis results are presented in Tables 4 and 5. Table 4 shows the results for the 2-D Hadamard transform alone, for 4x4 and 8x8 block sizes. Table 5 presents the complete SATD architectures, also for 4x4 and 8x8 block sizes.

Table 4. Synthesis results of the 4x4 and 8x8 Hadamard architectures

                     Hadamard 4x4    Hadamard 8x8
# Slices             433 (3%)        402 (2%)
# Slice Flip Flops   656 (2%)        736 (2%)
# 4 input LUTs       672 (2%)        481 (1%)
Minimum Period       2.774 ns        2.738 ns
Frequency            360.458 MHz     365.263 MHz

Table 5. Synthesis results of the SATD 4x4 and SATD 8x8 architectures

                     SATD 4x4        SATD 8x8
# Slices             858 (6%)        4067 (29%)
# Slice Flip Flops   1535 (5%)       3265 (11%)
# 4 input LUTs       1120 (4%)       6459 (32%)
Minimum Period       2.800 ns        6.549 ns
Frequency            357.079 MHz     152.685 MHz

All solutions achieved high operating frequencies. The hardware resource consumption of the 8x8 SATD was relatively high. Such consumption can be a limiting factor in hardware implementations of the complete encoder or decoder, where more than one SATD unit is required.

6. CONCLUSION

This paper presented two SATD architectures, one for 4x4 blocks and another for 8x8 blocks. The SATD was evaluated through comparisons with other criteria (SSD and SAD) and presented the best trade-off between bitrate and quality. The architectures designed in this work were described in VHDL and synthesized to Xilinx Virtex2P FPGAs. The designed SATD modules can be introduced into the inter prediction or intra prediction modules of the H.264/AVC standard, and they can also be used in older standards. Because the SATD is more complex than the other criteria, it consumes more area. However, using the SATD it is possible to achieve higher quality in compressed video without reducing the compression rate, or to achieve higher compression rates without significant degradation in video quality. Thus, from the software evaluation results and the hardware design results it is possible to conclude that the SATD is a good solution for hardware implementations of video coders.

7. REFERENCES

[1] ITU-T Recommendation H.264/AVC (03/05): advanced video coding for generic audiovisual services, 2005.
[2] RICHARDSON, I. H.264/AVC and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia. Chichester: John Wiley and Sons, 2003.
[3] Omitted to allow blind review.
[4] KUHN, P. Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Boston: Kluwer Academic Publisher, 1999.
[5] VCEG. JM Reference Software 17.2. Available at: <http://iphome.hhi.de/suehring/tml>. Accessed August 2010.
[6] XILINX INC. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet. [S.l.], 2005. Available at: <www.xilinx.com>. Accessed August 2010.
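The four add/subtract stages of Table 2 can be checked with a short Python sketch. It assumes that the samples w0..w15 are the 4x4 block in row-major order and that the transform matrix is the 4x4 Hadamard matrix H4 written below; both are inferred from the sign patterns of Table 2, since equation (1) is not reproduced here. The function names are hypothetical, and the final division by two is left out.

```python
# Assumed 4x4 Hadamard matrix, inferred from the sign patterns of Table 2.
H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def hadamard4x4_butterfly(w):
    """Apply the four add/subtract stages of Table 2 to 16 samples w[0..15]."""
    if len(w) != 16:
        raise ValueError("expected 16 samples")
    # Stage a: a0..a7 are sums, a8..a15 are differences of the same sample pairs.
    pairs = [(0, 4), (8, 12), (1, 5), (9, 13), (2, 6), (10, 14), (3, 7), (11, 15)]
    a = [w[i] + w[j] for i, j in pairs] + [w[i] - w[j] for i, j in pairs]
    # Stage b: combine neighbouring a values as listed in the second column.
    b = ([a[2 * k] + a[2 * k + 1] for k in range(4)]        # b0..b3
         + [a[2 * k] - a[2 * k + 1] for k in range(4)]      # b4..b7
         + [a[8 + 2 * k] - a[9 + 2 * k] for k in range(4)]  # b8..b11
         + [a[8 + 2 * k] + a[9 + 2 * k] for k in range(4)]) # b12..b15
    # Stages c and S repeat the same pattern on each group of four values.
    c = [0] * 16
    s = [0] * 16
    for base in range(0, 16, 4):
        c[base] = b[base] + b[base + 1]
        c[base + 1] = b[base + 2] + b[base + 3]
        c[base + 2] = b[base] - b[base + 1]
        c[base + 3] = b[base + 2] - b[base + 3]
    for base in range(0, 16, 4):
        s[base] = c[base] + c[base + 1]
        s[base + 1] = c[base] - c[base + 1]
        s[base + 2] = c[base + 2] - c[base + 3]
        s[base + 3] = c[base + 2] + c[base + 3]
    return s


def hadamard4x4_matrix(w):
    """Reference result H4 * W * H4, with W taken row-major from w."""
    return [
        sum(H4[k][r] * w[4 * r + col] * H4[j][col]
            for r in range(4) for col in range(4))
        for k in range(4) for j in range(4)
    ]


if __name__ == "__main__":
    block = [5, -3, 7, 0, 2, 2, -1, 4, 9, -6, 3, 1, 0, 8, -2, 6]
    print(hadamard4x4_butterfly(block) == hadamard4x4_matrix(block))
```

The butterfly needs 64 additions or subtractions in four stages of 16, which is consistent with the four pipeline stages the paper assigns to the Hadamard calculation.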
VIDEO ACQUISITION UNDER THE ITU-R BT.656-4 STANDARD USING PROGRAMMABLE LOGIC
Table 1. Structure of the AVCODE

Bit no.    Bit name   Description
7 (MSB)    1          Constant
6          F          Even/odd field
5          V          Field blanking
4          H          SAV/EAV
3          P3         Protection bit
2          P2         Protection bit
1          P1         Protection bit
0          P0         Protection bit
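
The bit layout of the AVCODE can be illustrated with a short Python decoder. The table only says that P3..P0 are protection bits; the sketch assumes the relations defined in ITU-R BT.656 (P3 = V xor H, P2 = F xor H, P1 = F xor V, P0 = F xor V xor H) and that H = 0 marks SAV and H = 1 marks EAV. The function name and the returned dictionary are hypothetical.

```python
def decode_avcode(byte):
    """Split an AVCODE byte into its F, V and H flags and check the protection bits."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("AVCODE must fit in one byte")
    if not byte & 0x80:
        raise ValueError("bit 7 of the AVCODE is always 1")
    f = (byte >> 6) & 1  # field flag
    v = (byte >> 5) & 1  # blanking flag
    h = (byte >> 4) & 1  # 0 = SAV, 1 = EAV
    # Protection bits recomputed from F, V and H (relations assumed from BT.656).
    expected = ((v ^ h) << 3) | ((f ^ h) << 2) | ((f ^ v) << 1) | (f ^ v ^ h)
    return {
        "F": f,
        "V": v,
        "H": h,
        "is_eav": bool(h),
        "protection_ok": (byte & 0x0F) == expected,
    }


if __name__ == "__main__":
    for code in (0x80, 0x9D, 0xAB, 0xF1, 0x81):
        print(hex(code), decode_avcode(code))
```

A byte whose low nibble does not match the recomputed value, such as 0x81, is reported with protection_ok set to False.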
elimination, etc.

Figure 3 shows a simplified block diagram of the implemented system. The "UART", "I2C", "Ethernet" and "PowerPC" blocks and the PLB bus are provided by the EDK. The "Adquisición de Video" IP (video acquisition, the main subject of this paper) and the "Transformada Wavelet" IP (wavelet transform) were developed in-house.

To include custom designs in the system, the import tool provided by the EDK is used. This tool simplifies the interconnection of the peripheral with the system through what Xilinx calls the IPIF. The IPIF contains pre-assembled modules for exchanging data between the PLB and the peripheral, as well as FIFO memories and control lines, which greatly simplifies the import task and ensures compatibility within the system.

4. IP DESIGN

To synchronize the signals coming from the A/D converter with the FPGA clock signal, a component called "Sincronizador" was designed. This component synchronizes the 13.5 MHz LLC2 clock with the 50 MHz CLK clock signal. The circuit described by this block is shown in Figure 5. The simulation of the circuit's input and output signals can be seen in Figure 6.

Fig. 5. Synchronizer circuit.
stores the data so that it can be separated; this component is called "Secuenciador". Its simulation can be seen in Figure 7. Once a 64-bit word has been formed, the data is transferred to the intermediate buffers called "Buffer". There are six of them: SAV, Y1, Cr, Y2, Cb and EAV. Because of the 4:2:2 video format, two buffers are needed for the luminance, Y1 and Y2.

Table 2. Operation of the multiplexer

C2   C1   C0   Output
0    0    0    SAV
0    0    1    Y1
0    1    0    Cr
0    1    1    Y2
1    0    0    Cb
1    0    1    EAV

The last step is storage in the FIFO memory. This is done by the "Organizador" component, which consists of a Moore state machine that generates a binary sequence when it is notified that the "Buffer" corresponding to SAV, Cb or EAV is full. The sequence runs from zero to six and controls a multiplexer to which the 64-bit buses of the "Buffer" components are connected. Table 2 shows the operation of the multiplexer. The "Organizador" component also sends a signal to the FIFO so that the data present at the multiplexer output is stored. Table 3 shows the order in which the data is stored in the FIFO.

Table 3. Storage in the FIFO

FIFO 64-bit word                                   Word no.
AV CODE (SAV)                                      1
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8                            2
Cr1 Cr2 Cr3 Cr4 Cr5 Cr6 Cr7 Cr8                    3
Y9 Y10 Y11 Y12 Y13 Y14 Y15 Y16                     4
Cb1 Cb2 Cb3 Cb4 Cb5 Cb6 Cb7 Cb8                    5
...                                                ...
Y705 Y706 Y707 Y708 Y709 Y710 Y711 Y712            178
Cr353 Cr354 Cr355 Cr356 Cr357 Cr358 Cr359 Cr360    179
Y713 Y714 Y715 Y716 Y717 Y718 Y719 Y720            180
Cb353 Cb354 Cb355 Cb356 Cb357 Cb358 Cb359 Cb360    181
AV CODE (EAV)                                      182

5. SYNTHESIS AND IMPLEMENTATION

The synthesis of the IP produced the results shown in Table 4.

Table 4. Synthesis results of the IP

Description          Used    Total    %
No. of Slices        466     13696    3%
No. of Slice FFs     649     27392    2%
No. of 4-input LUTs  364     27392    1%
No. of IOBs          85      556      15%
No. of IOB FFs       8       -        -
No. of GCLKs         5       16       31%

6. CONCLUSION AND FUTURE WORK

The present design is already part of the video compression system. It has been successfully embedded in the platform, and verification tests are under way.

The method for incorporating peripherals proposed by Xilinx in the EDK, using an IPIF as the interface, proves to be robust and has a very wide range of possible applications. Work is currently under way to optimize data movement between the FIFO and the DDR of the compression system by combining the chrominance and luminance signals, in order to reduce the number of data movements during compression.

7. REFERENCES

[1] "Xilinx University Program Virtex-II Pro Development System" - Hardware Reference Manual - March 2005.
[2] AN9728.2 Intersil Application Note - "BT.656 Video Interface for ICs", July 2002.
[3] Recommendation ITU-R BT.656-4. "INTERFACES FOR DIGITAL COMPONENT VIDEO SIGNALS IN 525-LINE AND 625-LINE TELEVISION SYSTEMS OPERATING AT THE 4:2:2 LEVEL OF RECOMMENDATION ITU-R BT.601 (PART A)".
[4] AN-10 Digital Creation Labs "Digital Video Overview", Rev 1.0, April 2004.
[5] Datasheet "Multiformat SDTV Video Decoder ADV7183B" - Rev. B, 2005.
[6] "Capturing Higher Quality Video" - Justin A. Horn, Student Member, IEEE, James Y. Hu, and Bryce C. Orgill.
[7] "Real Time Video Processing on FPGA Using on the Fly Partial Reconfiguration" - Sheetal U. Bhandari, Shaila Subbaraman, Shashank S. Pujari, Rashmi Mahajan.
[8] "FPGA-Based Design Of a High-Performance and Modular Video Processing Platform" - Christophe Desmouliers, Erdal Oruklu and Jafar Saniie.
[9] DS448 Xilinx Product Specification "PLB IPIF (v2.01a)", August 2004.
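The FIFO storage order of Table 3, together with the multiplexer select codes of Table 2, can be reproduced with a minimal Python sketch. It assumes 720 luminance samples per active line and eight samples per 64-bit word, as the last rows of Table 3 indicate; the function name and the label format are hypothetical and only model the ordering, not the hardware.

```python
# Multiplexer select codes (C2 C1 C0) taken from Table 2.
MUX_SELECT = {"SAV": 0b000, "Y1": 0b001, "Cr": 0b010, "Y2": 0b011, "Cb": 0b100, "EAV": 0b101}


def fifo_word_order(y_samples=720, samples_per_word=8):
    """Return (select code, label) for every 64-bit word stored for one active line."""
    n = samples_per_word
    words = [(MUX_SELECT["SAV"], "AV CODE (SAV)")]
    # Each group stores two luminance words and one word per chrominance component.
    for group in range(y_samples // (2 * n)):
        y0 = 2 * n * group + 1  # first luminance sample of the group
        c0 = n * group + 1      # first chrominance sample of the group
        words.append((MUX_SELECT["Y1"], f"Y{y0}..Y{y0 + n - 1}"))
        words.append((MUX_SELECT["Cr"], f"Cr{c0}..Cr{c0 + n - 1}"))
        words.append((MUX_SELECT["Y2"], f"Y{y0 + n}..Y{y0 + 2 * n - 1}"))
        words.append((MUX_SELECT["Cb"], f"Cb{c0}..Cb{c0 + n - 1}"))
    words.append((MUX_SELECT["EAV"], "AV CODE (EAV)"))
    return words


if __name__ == "__main__":
    order = fifo_word_order()
    print(len(order))
    for number in (1, 2, 3, 4, 5, 178, 181, 182):
        print(number, order[number - 1])
```

With the default parameters the list has 182 entries: 90 luminance words, 45 words for each chrominance component, and the two AV codes, which matches the word numbers in Table 3.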
Sponsors