2011 VII Designer Forum (DF)

Editors
Jorge M. Finochietto
Gustavo D. Sutter
Orlando Micolini
Pablo Recabarren
Proceedings of the
2011 VII Designer Forum

Córdoba, Argentina
April 13 – 15, 2011

Organized by
Digital Communications Research Lab
School of Exact, Physical and Natural Sciences
National University of Córdoba

ISBN: 978-84-614-7682-4

Preface

These Proceedings contain the technical papers presented at the 2011 VII Designer Forum,
organized within the 2011 VII Southern Conference on Programmable Logic (SPL), held in
Córdoba, Argentina, from April 13th to 15th, 2011. The SPL Conference is the Southern
Hemisphere's largest and most comprehensive conference focused on reconfigurable technology
(i.e., FPGA) and its applications.
The history of SPL started in 2005. The Joint Latin American FPGA Laboratories Project
(SURLAB) was financed by Banco Santander Central Hispano of Spain. Its aim was to create
a network of Latin American laboratories to spread FPGA as a key technology for industry,
updating university curricula to include related subjects. The original partners were the Universidad
Autónoma de Madrid, the Instituto Tecnologico de Monterrey, the University of Lima in Peru, and
the Argentinean Universities of Mar del Plata, Salta, Tandil, and CAECE.
The first SPL Conference, which started in March 2005, was attended by more than 60 people
from Argentina, Brazil, Costa Rica, and Peru. This 5-day workshop, held in the unique atmosphere
of the one-hundred-year-old CAECE University building, introduced students, professors and
engineers to the FPGA state of the art.
In 2006, more than 80 engineers attended the 2nd SPL, and more than 50 papers from
Argentina, Brazil, Costa Rica, Peru, Spain, United Kingdom, Uruguay, and USA were selected.
In 2007, the 3rd SPL Conference was sponsored by IEEE for the first time, receiving more than
90 papers from 24 countries: Argentina, Australia, Bangladesh, Belgium, Brazil, Colombia, Costa
Rica, Czech Republic, France, Germany, Greece, Hong Kong, India, Italy, Mexico, Netherlands,
Paraguay, Peru, Portugal, Singapore, Spain, Taiwan, UK, and USA.
In 2008, the 4th SPL Conference moved from Mar del Plata to San Carlos de Bariloche,
situated in the Andes foothills. A total of 29 full papers, 23 short papers and 20 Designer Forum
papers were selected from around one hundred submissions, including authors from the following
countries: Argentina, Australia, Brazil, China, Canada, Colombia, France, Germany, Hong Kong,
Mexico, Peru, Portugal, Romania, Spain, United Kingdom, and USA.
In 2009, the 5th SPL Conference, sponsored again by IEEE, moved out of Argentina to Sao
Carlos, Brazil. Ninety papers were submitted from many countries; 26 were accepted as full papers,
12 as short papers, and 8 as Designer Forum papers.
In 2010, the 6th SPL Conference, sponsored by IEEE, moved to the Northeastern Coast of
Brazil to the well known Porto de Galinhas Beach, near the city of Recife. This central location in
a relaxed atmosphere, combined with the fast-paced economic growth in this part of Brazil, was a
great site to discuss advanced technology. SPL2010 received submissions from Argentina, Brazil,
Canada, China, France, Iran, Italy, Mexico, Netherlands, Pakistan, Peru, Poland, Portugal, Spain,
United Kingdom, and United States. A total of 53 papers were selected: 22 full papers, 13 short
papers, and 18 Designer Forum papers.
In 2011, the 7th SPL Conference, sponsored as usual by IEEE, has moved to Córdoba,
the second-largest city in Argentina, and it will be hosted at the National University of
Córdoba, one of the oldest universities in America. Paper submissions were received from the
following countries: Argentina, Belgium, Brazil, Colombia, Finland, France, Germany, Greece,
India, Mexico, Portugal, Spain, Sweden, United Kingdom, United States of America and Uruguay.
From 99 submissions, a total of 50 regular papers were selected: 24 for oral presentation and 21
for poster presentation.
A total of 25 papers were selected for inclusion in the Proceedings of the Designer Forum, which
demonstrates the increasing relevance of this forum within the SPL conference. The goal of the
Designer Forum is to give exposure to ongoing research, academic experiences, and industrial
designs in order to get feedback from experienced researchers and industrial partners. The
Designer Forum was born with the Southern Conference on Programmable Logic (SPL) in 2005 and
has become an important part of it. It promotes the participation of new researchers and advanced
students from the conference region. Owing to the regional scope of the Designer Forum, its papers
may also be written in Spanish or Portuguese.
This year, two one-week intensive courses were held to foster digital hardware design skills
among advanced students and professionals, thus maintaining the spirit of spreading FPGA
technology knowledge in the Southern Hemisphere. In addition, four tutorials, given by both
industry and academic experts, have been organized for conference attendees.
This year over 150 participants are expected from more than 40 universities, technological
institutions and companies all around the world.
The topics in this year's program include: Embedded Processors and IP Cores, System-on-Chip,
Computer Arithmetic, Image Processing and Vision, FPGA Architectures for Specific
Applications, Fault Tolerance, and Test & Verification. SPL has an excellent track record and is
becoming an important forum for discussion on FPGA technology and its applications.
We would like to express our gratitude to the many people who have contributed to the high
quality of the technical program. Special thanks go to those who chaired or were members of the
various committees, particularly the Program Committee, whose careful review has helped to
maintain the high quality of SPL.
Finally, we would like to thank our sponsors: Altera, ClariPhy Argentina, Fundación Tarpuy,
National Agency for the Scientific and Technologic Promotion (Agencia), National Scientific and
Technical Research Council (CONICET), and Synopsys.
A special thanks to the School of Exact, Physical and Natural Sciences (National University
of Córdoba) and Universidad Autónoma de Madrid for their support.

The Editors
Córdoba, Argentina, April 2011

Table of Contents

Executive Committee

Forum Committee

Poster Session 1

IP core MAC Ethernet
Rodrigo Melo, Salvador Tropea
Autonomous Intelligent Wireless Network accessible via IP
María Isabel Schiavon, Daniel Crepaldo
Multi-Level Synthesis on the Example of a Particle Filter
Jan Langer, Daniel Frob, Enrico Billich, Marko Robler, Ulrich Heinkel
Layered testbench for assertion based verification
Jose Mosquera, Sol Pedre, Patricia Borensztejn
Development and Implementation of an Adaptive Narrowband Active Noise Controller
Fernando González, Roberto Rossi, German Rodrigo Molina, Gustavo Parlanti
Bio-inspired hardware system based in animals of cold and hot blood
Pablo Salvadeo, Rafael Castro López, Ángel Veca, Elvo Morales
Análise Comparativa e Qualitativa de Ferramentas de Desenvolvimento de FPGA
Gabriel da Silva, Maximiliam Luppe
Generación automática de VHDL a partir de una Red de Petri. Análisis comparativo de los resultados de síntesis.
Roberto Martinez, Javier Belmonte, Rosa Corti, Estela D’Agostino, Enrique Giandoménico
Using a WII remote and a FPGA to drive a mechanical arm to aid physicaly challenged people
Emerson Pedrino, Valentin Roda, Bruno Martins
Systolic Matrix-Vector Multiplier for a High-Throughput N-Continuous OFDM Transmitter
Enrique Lizarraga, Victor Sauchelli
Synthesis of the Hartley Transform with a Hadamard-based matrix architecture
Edval JP Santos, Gilson Alves
Implementación de MODBUS en FPGA mediante VHDL - Capa de Enlace -
Luis Guanuco, Jonatan Panozzo Zenere, Sergio Olmedo, Agustin Rubio

Poster Session 2

Music sequencer on a FPGA board
Matías López-Rosenfeld, Francisco Laborda, Patricia Borensztejn
Flexible Platform for Real-time Video and Image Processing
Paulo Da Cunha Possa, Zied El Hadhri, Laurent Jojczyk, Carlos Valderrama
SoPC platform for real-time DVB-T modulator debugging
Armando Astarloa, Jesus Lázaro, Unai Bidarte, Aitzol Zuloaga, Mikel Idirin
High reliability capture core for data acquisition in System on Programmable Chips
Jesus Lázaro, Armando Astarloa, Aitzol Zuloaga, Jaime Jimenez, Unai Bidarte, Jose Martín
Desarrollo de una plataforma genérica para sistemas de visión basada en arquitectura CoreConnect
Luis Pantaleone, Lucas Leiva, Martín Vazquez
Prototipado rápido de un IP para aplicar la transformada Wavelet en imágenes
Hugo Melo, Alejandro Perez, Guillermo Gutierrez, Rodolfo Cavallero
Cortex-M0 implementation on a Xilinx FPGA
Pedro Martos, Fabricio Baglivo
Digitally Configurable Platform for Power Quality Analysis
Bruno Falduto, Ricardo Cayssials, Edgardo Ferro
Solar Tracker for Compact Linear Fresnel reflector using PicoBlaze
Maiver Villena, Daniel Hoyos, Carlos Cadena, Victor Serrano, Telmo Moya, Marcelo Gea
Toolbox NURBS and Visualization System Via FPGA
Luiz Marcelo Silva, Maria Paiva
Una Metodología para el Desarrollo de Sistemas en Chip de Alta Performance
Marcos Oviedo, Pablo Ferreyra
High Throughput 4x4 and 8x8 SATD Similarity Criteria Architectures for Video Coding Applications
Luciano Agostini, Julio Saracol Domigues, Dieison Soares Silveira, Leomar Soares da Rosa, Vinicius Possani
Adquisición de Vídeo Bajo Estándar ITU-R BT.656-4 Mediante Lógica Programable
Juan Carlos Contreras, Guillermo Gutierrez, Emilio Kowalski, Rodolfo Cavallero

Executive Committee

General Chairs

Jorge M. Finochietto
Universidad Nacional de Córdoba – CONICET, Argentina
Gustavo Sutter
Universidad Autónoma de Madrid, Spain

Forum Chairs

Orlando Micolini
Universidad Nacional de Córdoba, Argentina
Pablo Recabarren
Universidad Nacional de Córdoba – CONICET, Argentina

Tutorial Chair

Graciela Corral-Briones
Universidad Nacional de Córdoba, Argentina

Local Chair

Carmen Rodriguez
Universidad Nacional de Córdoba, Argentina

Financial Chair

Ramiro Calderón
Fundación Tarpuy, Argentina

Executive Secretary

María José Agazzi


Universidad Nacional de Córdoba, Argentina

Publicity Chair

Eduardo Boemo
Universidad Autónoma de Madrid, Spain
Edval Santos
Universidade Federal de Pernambuco, Brazil
Valentin Obac Roda
Universidade de Sao Paulo, Brazil
Elias Todorovich
Universidad Nacional del Centro, Argentina
Luciano Agostini
Universidade Federal de Pelotas, Brazil

Forum Committee

Carlos Valderrama, Université de Mons Polytech Mons, Belgium


Luciano Agostini, Universidade Federal de Pelotas, Brazil
Ali Akoglu, University of Arizona, USA
Fadi Aloul, American University of Sharjah, UAE
Cristiano Araujo, UFPE, Brazil
Edna Barros, Centro de Informatica - UFPE, Brazil
Gabriel Caffarena, Universidad San Pablo-CEU, Spain
João Cardoso, University of Porto, Portugal
Hugo Carrer, Universidad Nacional de Cordoba, Argentina
Jorge Castiñeira, Universidad Nacional de Mar del Plata, Argentina
Ricardo Cayssials, Universidad Nacional del Sur, Argentina
Scott Chin, University of British Columbia, Canada
Juan Cousseau, Universidad Nacional del Sur, Argentina
Angel de Castro, Universidad Autonoma de Madrid, Spain
Helio de Oliveira, Federal University of Pernambuco, Brazil
Debatosh Debnath, Oakland University, USA
Jean-Pierre Deschamps, Universidad Rovira i Virgili, Spain
Yongfeng Gu, The Mathworks, USA
Eduardo Romero, Universidad Tecnológica Nacional, Argentina
Guillermo Guichal, Universidad Tecnologica Nacional, Argentina
Reiner Hartenstein, TU Kaiserslautern, Germany
Juan P. Olivier, Universidad de la República, Uruguay
Valentin Obac Roda, Universidade de Sao Paulo, Brazil
Victor Grimblatt, Synopsys, Chile
Damián Morero, Universidad Nacional de Cordoba, Argentina
Carol Marsh, Selex Galileo, UK
Wolfgang Klingauf, Xilinx, USA
Gustavo Parlanti, Motorola, Argentina
René Cumplido, INAOE, Mexico
Esam El-Araby, The Catholic University of America, USA
Altamiro Susin, UFRGS, Brazil
Gabriela Peretti, Universidad Tecnológica Nacional, Argentina
Martín del Barco, ClariPhy, Argentina
J. Ignacio Alvarez-Hamelin, ITBA-UBA, Argentina
Neil Bergmann, University of Queensland, Australia
Philip Leong, The University of Sydney, Australia
Sergio Lopez-Buedo, Universidad Autonoma de Madrid, Spain
Norian Marranghello, Sao Paulo State University - Unesp, Brazil
Seda Memik, Northwestern University, USA
Ruben Milocco, Universidad Nacional del Comahue, Argentina
Rolf Molz, UNISC - Universidade de Santa Cruz do Sul, Brazil
Carlos Muravchik, Universidad Nacional de La Plata, Argentina
Horacio Neto, INESC-ID, Portugal
Felix Palumbo, CONICET - CNEA, Argentina
Michele Petracca, Columbia University, USA
Sébastien Pillement, IRISA, France
Salvatore Pontarelli, University of Rome Tor Vergata, Italy
Jose Saito, Universidade Federal de São Carlos, Brazil
Kentaro Sano, Tohoku University, Japan
Marco Domenico Santambrogio, MIT, USA
Edval JP Santos, Universidade Federal de Pernambuco, Brazil
Pete Sedcole, Viotech Communications, France
Cristian Sisterna, Universidad Nacional de San Juan, Argentina
Julio Pérez Acle, Universidad de la República, Uruguay
Jose Soares Augusto, Universidade de Lisboa, Portugal
Guillermo Jaquenod, JaqTek, Argentina
Dominique Lavenier, IRISA, France
Alfonso Chacon Rodriguez, Instituto Tecnologico, Costa Rica
Maria Jose Moure, Universidad de Vigo, Spain
Diego Crivelli, ClariPhy, Argentina
Pablo Ferreyra, Universidad Nacional de Cordoba, Argentina
Raoul Velazco, TIMA, France
Samir Belkacemi, General Electric, USA
Paulo Flores, INESC-ID, Portugal
Yana Krasteva, Universidad Politecnica de Valencia, Spain
Victoria Rodellar, Universidad Politecnica de Madrid, Spain
María Liz Crespo, ICTP, Italy

IP CORE MAC ETHERNET

Ing. Rodrigo A. Melo, Ing. Salvador E. Tropea

Instituto Nacional de Tecnología Industrial
Centro de Electrónica e Informática
Laboratorio de Desarrollo Electrónico con Software Libre
Email: {rmelo,salvador}@inti.gob.ar

ABSTRACT

Ethernet technology provides communication between PCs and devices that operate
autonomously, either locally or across the Internet. In this paper we present a core that
implements the Ethernet MAC layer, is simple to use, offers several configurations, and
occupies few FPGA resources. The design was simulated with free software tools and verified
in hardware on a Virtex 4 FPGA.

1. INTRODUCTION

Our team develops embedded systems that in most cases need to communicate with a PC.
Although we have developed cores that cover this need, such as the USB core [1], this kind of
connection is no longer sufficient for the many applications that require autonomous operation
beyond a local setting. Ethernet technology, present in its several variants in most devices
that connect to a LAN (Local Area Network), together with the use of the Internet, provides the
most convenient solution to this problem.

A search was made for available Ethernet cores that were free to use and described in VHDL,
since these conditions are part of our laboratory's line of work. Few results were found, the
most notable being the GReth core [2], which belongs to GRLib [3]. However, its FPGA area, its
complex usage model and the fact that it can only be used through an AMBA bus [4] exceeded
the desired characteristics.

In this paper we present an Ethernet MAC (Media Access Controller) core that grew out of what
we learned by studying the GReth core. It is compact, easy to use, and can be used in FPGAs
from any vendor.

2. GRETH CORE

2.1. Introduction

GRLib is a library of IP cores distributed under a dual-licensing scheme: commercial and
GPL [5]. GReth provides an interface between an AMBA bus and an Ethernet network (10/100
Mb/s, full and half duplex). It implements the 802.3-2002 standard, without support for the
optional control layer.

2.2. Architecture

The block diagram of GReth is shown in Fig. 1.

Fig. 1. Block diagram of GReth.

- The AMBA buses used are APB (Advanced Peripheral Bus), for handling the configuration and
  control registers, and AHB (Advanced High-performance Bus), for the data flow, which is
  carried through DMA (Direct Memory Access) channels for transmission and reception.
- It connects to an external PHY through the MII (Media Independent Interface) or RMII
  (Reduced MII) interfaces for data exchange, and through MDIO (Management Data
  Input/Output) to access configuration and status.
- The EDCL (Ethernet Debug Communication Link) interface provides read/write access to the
  AHB bus over Ethernet.
- The core has three clock domains: the transmit and receive domains, provided by the
  external PHY, and the domain of the remaining components and the AMBA buses.

2.3. Hardware description

GRLib is described using the so-called two-process method [6]: with two processes per entity,
one containing all the combinational logic and the other all the sequential logic, the complete
algorithm can be coded in the combinational process, while the sequential process contains only
register assignments. This method abstracts the hardware description, making it resemble
software development.

2.4. Usage

The core is controlled through APB using 32-bit registers:
- Registers 0 and 1: control/status.
- Registers 2 and 3: MAC address.
- Register 4: control/status of the MDIO interface.
- Registers 5 and 6: memory address of the transmit and receive descriptor tables.

Descriptors are 32-bit words transferred over AHB. For both transmission and reception there
are two contiguous descriptors:
- Descriptor 0: made up of control and status bits. It uses 11 bits to specify the number of
  bytes to transfer.
- Descriptor 1: a 30-bit pointer to the memory area where the data is stored or fetched.
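The descriptor pair described above can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions, not the GReth register layout: the text gives only the field widths (11 bits for the byte count, 30 bits for the pointer), so the bit positions, the `flags` field and the function name `make_descriptor_pair` are hypothetical. It assumes that the byte count sits in the low 11 bits of descriptor 0 and that the 30-bit pointer is the upper part of a word-aligned 32-bit address.

```python
LENGTH_BITS = 11                       # from the text: 11 bits hold the byte count
LENGTH_MASK = (1 << LENGTH_BITS) - 1   # 0x7FF, i.e. at most 2047 bytes


def make_descriptor_pair(length, buffer_addr, flags=0):
    """Return (descriptor 0, descriptor 1) as two 32-bit integers.

    Assumptions (not taken from the paper): the length occupies the low
    11 bits of descriptor 0, the control/status bits ("flags") sit above
    it, and the 30-bit pointer is a word-aligned 32-bit byte address.
    """
    if not 0 <= length <= LENGTH_MASK:
        raise ValueError("length does not fit in 11 bits")
    if not 0 <= buffer_addr < 2**32 or buffer_addr % 4 != 0:
        raise ValueError("buffer address must be a word-aligned 32-bit address")
    if not 0 <= flags < (1 << (32 - LENGTH_BITS)):
        raise ValueError("flags do not fit in the remaining bits")
    word0 = (flags << LENGTH_BITS) | length   # control/status bits plus byte count
    word1 = buffer_addr                        # two low bits are always zero
    return word0, word1
```

An 11-bit count allows up to 2047 bytes per descriptor, which is enough for a maximum-size untagged Ethernet frame of 1514 bytes before the CRC is appended.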
2.4.1. Transmission

Through the AHB, the data is placed starting at the address pointed to by descriptor 1. The
data must contain the destination and source MAC addresses and the type/length field. The
4-byte CRC (cyclic redundancy check) is appended automatically.

Next, the address of descriptor 0 is written to register 5. GReth starts the transmission when
instructed to through register 0.

When the transmission ends, GReth writes status information to register 1 and descriptor 0.
Finally, it points to the next descriptor pair and is ready for the next operation.

2.4.2. Reception

The address of descriptor 0 is written to register 6. GReth reads the descriptors when
instructed to through register 0 and waits for an incoming packet. The packet is accepted when
its destination MAC address is the one set in registers 2 and 3 or the broadcast address, or
when the core has promiscuous mode enabled. In any other case it is discarded.

When reception ends, status information is written to register 1 and descriptor 0, and the
received data is available starting at the address pointed to by descriptor 1.

2.4.3. MDIO

This interface gives access to between 1 and 32 PHYs, each containing between 1 and 32
16-bit registers. Its control and status are accessible through register 4.

A write is started by specifying the data, the PHY number and the register number and setting
the write bit to '1', whereas a read requires the PHY number and the register number and is
started by setting the read bit to '1'.

3. TESTING GRETH

In order to detect any error introduced while simplifying the core, a testbench was designed
for GReth. This also gave us a better understanding of how it works, particularly given
GReth's use of the two-process method.

The test consisted of instantiating GReth together with a description called FakePHY, which
mimics a PHY, and, from the AMBA interfaces, performing MDIO writes and reads and MII
transmissions and receptions, verifying that what was sent and what was received matched,
and aborting otherwise.

To drive AMBA, a library called AMBA Handler was developed for simulation purposes. It
implements eight procedures that represent the combinations of write or read, to a master or
slave, over APB or AHB.

Fig. 2. Instantiation schemes of GReth (left) and MAC (right).

4. THE DEVELOPED CORE: MAC Ethernet

4.1. Introduction

Fig. 2 shows two summarized schemes of component instantiation, for the core we started
from (left) and for the core we obtained (right).

The top level of GReth instantiates the transmit and receive FIFOs and the ethc0 component,
and implements descriptor-based handling of the core, MDIO communication and part of
AMBA, the EDCL interface, and the synchronization between clock domains. The ethc0
component instantiates the components that handle transmission and reception over
MII/RMII and a component that handles the other part of the AMBA communication.

The developed MAC core has a purely structural top level, which only instantiates the
so-called transmit and receive channels and, optionally, the MDIO interface. These channels
internally instantiate dual-port RAMs, the components that handle transmission and reception
over MII, and components for synchronization between clock domains.

4.2. Implementation

The developed core was written in standard VHDL 93. The tools and guidelines recommended
by the FPGALibre project [7] were used for its development.

With respect to GReth, certain features were removed, some descriptions were replaced, and
others were partly or completely modified.

The following features were removed:
- Use of AMBA buses.
- Descriptor-based handling.
- EDCL interface.
- RMII support.

The generic FIFOs used in GReth were replaced by the laboratory's own FIFOs, implemented
with dual-port RAM. In addition, they are now instantiated inside the new transmit and receive
channels, which implement the communication between the MAC and a higher-level application
in a much simpler way.

The MDIO functionality was extracted from the complex description where it resided and
became an independent component.

The components that handled transmission and reception over MII are, together with MDIO,
the only ones that retain part of the original description and the use of the two-process
method. They underwent changes such as: removal of RMII support; removal or simplification
of states in their FSMs (finite state machines); removal or modification of control and status
signals; removal of the component that filtered possible glitches on the reset signal; etc.

Synchronization between clock domains previously took place between the FIFOs and the
transmit and receive components, and now takes place between the write and read ports of
the dual-port RAMs. Moreover, it used to be functionality scattered across several parts of the
description, whereas it now uses a new component developed for that purpose.

4.3. Architecture

Fig. 3 shows a block diagram of the core, in which the three clock domains the system works
with can be seen.

Fig. 3. Block diagram of the MAC core.

Transmission consists of an FSM that, depending on input signals, writes data to a FIFO
implemented with a dual-port RAM. When the data transfer to the FIFO finishes, the wr_end
signal is generated; after being synchronized, it is detected by the FSM that reads the data
from the FIFO and transmits it over MII. Once all the data has been read, the rd_end signal
returns the write FSM to its initial state.

Reception is similar to transmission, except that the data written to the FIFO is obtained
from MII and the data read from the FIFO is made available for use. To avoid losing packets
because the application has not finished retrieving the received data, a scheme with multiple
FIFOs was implemented. The number of FIFOs is configurable and their management is
handled exclusively by the core.

4.4. Usage

The core offers several configurations through generics, the most notable being:
- TXFIFOSIZE and RXFIFOSIZE: specify the storage capacity of the FIFOs in bytes.
- RX_CHANNELS: number of receive channels to use. Each channel implies the use of one FIFO.
- ENABLE_MDIO: indicates whether the MDIO module is used.

It also has control lines to:
- Enable or disable the transmit and receive channels.
- Enable or disable interrupt signals.
- Select half or full duplex.
- Specify the MAC address.
- Enable promiscuous mode.

4.4.1. Transmission

Start and end are indicated with separate dedicated signals. Data is committed by means of a
write signal. The channel provides a busy indication and reports errors for memory overrun or
for reaching the limit of transmission retries on the bus.

4.4.2. Reception

Available data is signaled by driving a signal high, which stays high until all the data has been
read. Reads are confirmed with the read signal, or aborted if the packet is to be discarded.
The errors reported are: data memory overrun; received packet shorter/longer than the
minimum/maximum supported by Ethernet; alignment or CRC error; amount of data received
not matching the value in the length field of the received packet.

4.4.3. MDIO

It has features similar to GReth's, but a new interface. It has signals to specify the PHY
number, the register number, and separate input and output data. Individual signals indicate
whether the operation is a write or a read. Finally, it has a busy signal and a
communication-failure signal.

5. VALIDATION OF THE DEVELOPED CORE

5.1. Simulation

GHDL [8] 0.28 was used for simulation.
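When checking simulation output, it can help to have a software reference for the receive-side decisions described in Sections 2.4.2 and 4.4.2: destination address filtering, frame size limits, CRC checking and the length-field comparison. The following Python sketch is not part of the original work and does not describe the authors' testbench. The function names and returned strings are hypothetical, the 64 and 1518 byte limits are the standard sizes of untagged Ethernet frames including the CRC, and the order of the checks is an assumption. It relies on the fact that the Ethernet frame check sequence is the same CRC-32 that `zlib.crc32` computes, transmitted least significant byte first.

```python
import zlib

BROADCAST = bytes([0xFF] * 6)
MIN_FRAME = 64     # bytes, including the 4-byte FCS (untagged IEEE 802.3 frame)
MAX_FRAME = 1518   # bytes, including the 4-byte FCS (untagged IEEE 802.3 frame)


def fcs(data: bytes) -> bytes:
    """Ethernet FCS: CRC-32 of the frame, least significant byte first."""
    return zlib.crc32(data).to_bytes(4, "little")


def build_frame(dst: bytes, src: bytes, payload: bytes) -> bytes:
    """Build a frame with a length field, pad it to the minimum size, add the FCS."""
    body = dst + src + len(payload).to_bytes(2, "big") + payload
    body = body.ljust(MIN_FRAME - 4, b"\x00")   # pad to 60 bytes before the FCS
    return body + fcs(body)


def check_frame(frame: bytes, own_mac: bytes, promiscuous: bool = False) -> str:
    """Return 'ok', 'discard' or the name of the first error found.

    The order of the checks is an assumption made for this sketch.
    """
    if len(frame) < MIN_FRAME:
        return "too_short"
    if len(frame) > MAX_FRAME:
        return "too_long"
    if fcs(frame[:-4]) != frame[-4:]:
        return "crc_error"
    length = int.from_bytes(frame[12:14], "big")
    available = len(frame) - 18                 # bytes between header and FCS
    # Values above 1500 denote an EtherType rather than a length, so the
    # comparison only applies to length-style frames (padded up to 46 bytes).
    if length <= 1500 and available != max(length, 46):
        return "length_mismatch"
    dst = frame[:6]
    if not (promiscuous or dst == own_mac or dst == BROADCAST):
        return "discard"
    return "ok"
```

For example, a frame built with `build_frame` for a given destination is reported as `ok` by `check_frame` when `own_mac` matches, and as `discard` for any other station unless `promiscuous` is set.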

A testbench was built in which the FakePHY core is instantiated again, this time together with
our MAC. Unlike the GReth test, this one is more rigorous and includes features such as:
- It implements separate processes for transmission and reception, instead of using a single
  sequential one.
- It verifies the operation of the error indications.
- The three clocks it uses are not exact multiples of each other, which allows a better
  simulation of the synchronization between signals.

In addition, a core called Replies was developed, which answers ARP (Address Resolution
Protocol) and ICMP (Internet Control Message Protocol) requests. It should be noted that the
mechanisms it uses do not follow what is specified for these two protocols; they are shortcuts
for testing purposes. This core was used in a testbench together with real Ethernet frames
captured with the wireshark software [9], to recreate the execution of the ping command and
to inspect the waveforms and the data packets exchanged.

5.2. Hardware validation

Hardware validation was carried out using a Xilinx Virtex 4 FPGA and the ISE WebPack 11.3 -
L.57 software. The host was a personal computer running the Debian [10] GNU [11]/Linux
operating system.

The Replies core, which is synthesizable, was used as the application. Once the core passed
the testbench without reporting any error, multiple tests were run with the ping command,
lasting from hours to more than a week, with zero lost packets in every case. The wireshark
software was used again, this time to verify that the received packets were correctly formed.

[...] common usage, as well as the case of using a single receive channel, which may be
sufficient in many applications that do not require a continuous data flow.

Table 1. Synthesis results

GReth core
Configuration        LUTs   FFs   Slices   BRAMs
Without MDIO         1814   775   1099     2
With MDIO            2011   834   1220     2

MAC core
Configuration        LUTs   FFs   Slices   BRAMs
1 RX without MDIO     823   333    491     2
2 RX without MDIO     872   341    516     3
2 RX with MDIO       1016   381    591     3

7. CONCLUSIONS

Comparing the synthesis results shows that a more compact implementation than the starting
point was obtained. For equivalent usage configurations, our core uses less than 50 % of the
FPGA area used by GReth. It should also be considered that the GReth core requires memory
accessible through AMBA, in addition to all the support for descriptor handling, whereas our
core includes everything needed to be used directly.

As for usage, the developed core is simpler and does not depend on a particular bus, although
it can easily be adapted to whichever is needed, be it AMBA, WISHBONE [12] or another. The
simplified usage model and the change of architecture are the main reasons for the lower
FPGA resource usage.

Using standard VHDL 93 allows the core to be synthesized on an FPGA from any vendor.

The tools proposed by the FPGALibre project proved adequate for a project of this kind.

Future work could involve both lower layers, such as the implementation of an Ethernet PHY,
and higher-level applications that handle the IP (Internet Protocol) protocol.

8. REFERENCES

[1] S. E. Tropea and R. A. Melo, “USB framework - IP core and related software,” in XV
Workshop Iberchip, vol. 1, Buenos Aires, 2009, pp. 309–313.
[2] GRLIB IP Core User’s Manual, 1.0.19 ed. Gaisler Research, 2008, pp. 324–336.
[3] J. Gaisler, “An open-source VHDL IP library with plug&play configuration,” in IFIP
Congress Topical Sessions, R. Jacquart, Ed. Kluwer, 2004, pp. 711–718.
[4] ARM. (2010, Jun.) AMBA - Advanced Microcontroller Bus Architecture. [Online].
Available: http://www.arm.com/products/system-ip/amba/amba-open-specifications.php
[5] Free Software Foundation, Inc., “GNU General Public License,”
http://www.gnu.org/copyleft/gpl.html.
[6] J. Gaisler, “A structured VHDL design method,” http://www.gaisler.com/doc/vhdl2proc.pdf,
Jun. 2010.
[7] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, “FPGAlibre: Herramientas de software
libre para diseño con FPGAs,” in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL,
2006, pp. 173–180.
El PHY externo utilizado, fue el DP83847 de National Se- [8] T. Gingold. (2010, Jun.) A complete VHDL simulator. [Online].
miconductor. Las pruebas se realizaron usando una comunicación Available: http://ghdl.free.fr/
full-duplex de 100 Mb/s . [9] G. Combs and contributors. (2010, Jun.) Network protocol analyzer.
[Online]. Available: http://www.wireshark.org/
6. RESULTADOS [10] I. Murdock et al. (2010, Jun.) Debian GNU/Linux operating system.
[Online]. Available: http://www.debian.org/
En el Cuadro 1 pueden observarse los resultados de la síntesis [11] R. M. Stallman et al. (2010, Jun.) The GNU project. [Online].
de los cores GReth y MAC, para una Virtex 4. Available: http://www.gnu.org/
En el caso del GReth, se sintetizaron las configuraciones más [12] Silicore and OpenCores.Org. (2010, Jun.) WISHBONE System-
comunes con y sin el uso de la interfaz MDIO, en ambos casos on-Chip (SoC) interconnection architecture for portable IP cores.
[Online]. Available: http://prdownloads.sf.net/fpgalibre/wbspec_b3-
con la interfaz EDCL deshabilitada. Para el MAC se sintetizaron
2.pdf?download
las mismas opciones, siendo dos canales de recepción el caso más

4
ETHERNET MAC IP CORE

Ing. Rodrigo A. Melo, Ing. Salvador E. Tropea

Instituto Nacional de Tecnología Industrial
Centro de Electrónica e Informática
Laboratorio de Desarrollo Electrónico con Software Libre
Email: {rmelo,salvador}@inti.gob.ar

ABSTRACT
Ethernet technology provides communication between PCs and autonomously operating devices, either locally or across the Internet. In this work we present a core that implements the Ethernet MAC layer. It is simple to use, offers several configurations and occupies few FPGA resources. The design was simulated with Free Software tools and verified in hardware on a Virtex 4 FPGA.

1. INTRODUCTION

Our team develops embedded systems that in most cases need to communicate with a PC. Although we have developed cores that cover this need, such as the USB core [1], that kind of connection is no longer sufficient for the many applications that require autonomous operation beyond a local setting. Ethernet technology, present in its different variants in most devices that can connect to a LAN (Local Area Network), combined with the use of the Internet, provides the most convenient solution to this problem.

We searched for available Ethernet cores that were free to use and described in VHDL, since these conditions are part of our laboratory's line of work. The results were few, the most notable being the GReth core [2], which belongs to GRLib [3]. However, its FPGA area, its complex usage mode and the fact that it can only be used through an AMBA bus [4] went beyond what we were looking for.

In this work we present an Ethernet MAC (Media Access Controller) core that grew out of what we learned by studying the GReth core. It is compact, easy to use and can be used on FPGAs from any vendor.

2. THE GRETH CORE

2.1. Introduction
GRLib is a library of IP cores distributed under a dual licensing scheme: commercial and GPL [5]. GReth provides an interface between an AMBA bus and an Ethernet network (10/100 Mb/s, full and half duplex). It implements the 802.3-2002 standard, without support for the optional control layer.

2.2. Architecture
The block diagram of GReth is shown in Fig. 1.

Fig. 1. Block diagram of GReth.

- The AMBA buses used are APB (Advanced Peripheral Bus), for handling configuration and control registers, and AHB (Advanced High-performance Bus), for the data flow, which goes through DMA (Direct Memory Access) channels for transmission and reception.
- It connects to an external PHY through the MII (Media Independent Interface) or RMII (Reduced MII) interfaces for data exchange, and through MDIO (Management Data Input/Output) to access configuration and status.
- The EDCL (Ethernet Debug Communication Link) interface provides read/write access to the AHB bus over Ethernet.
- The core has three clock domains: the transmit and receive domains, both supplied by the external PHY, and the domain of the remaining components and the AMBA buses.

2.3. Hardware description
GRLib is described using the so-called two-process method [6]: with two processes per entity, one containing all the combinational logic and the other all the sequential logic, the complete algorithm can be coded in the combinational process, while the sequential process only contains register assignments. This method abstracts the hardware description, making it resemble software development.

2.4. Usage
The core is controlled through APB using 32-bit registers:
- Registers 0 and 1: control/status.

- Registers 2 and 3: MAC address.
- Register 4: control/status of the MDIO interface.
- Registers 5 and 6: memory addresses of the transmit and receive descriptor tables.

Descriptors are 32-bit words transferred over AHB. For both transmission and reception there are two contiguous descriptors:
- Descriptor 0: made up of control and status bits. It uses 11 bits to specify the number of bytes to transfer.
- Descriptor 1: a 30-bit pointer to the memory area where the data is stored or read from.
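The descriptor pair described above can be illustrated with a minimal Python sketch. The text only states the field widths (11 bits of length in descriptor 0, a 30-bit pointer in descriptor 1); the bit positions used here, the `ENABLE_BIT` flag and the function names are assumptions made for illustration, and the actual layout should be taken from the GRLIB manual [2].

```python
LENGTH_BITS = 11                       # field width stated in the text
LENGTH_MASK = (1 << LENGTH_BITS) - 1   # 0x7FF, so at most 2047 bytes per descriptor
ENABLE_BIT = 1 << 11                   # ASSUMPTION: position of one control bit
POINTER_MASK = 0xFFFFFFFC              # ASSUMPTION: the 30-bit pointer sits in bits 31..2


def make_descriptor_pair(buffer_address: int, length: int, enable: bool = True) -> tuple[int, int]:
    """Build the two contiguous 32-bit descriptor words for one frame."""
    if not 0 <= length <= LENGTH_MASK:
        raise ValueError("length does not fit in 11 bits")
    if not 0 <= buffer_address <= 0xFFFFFFFF:
        raise ValueError("address does not fit in 32 bits")
    if buffer_address & 0x3:
        raise ValueError("buffer address must be 32-bit word aligned")
    word0 = length | (ENABLE_BIT if enable else 0)  # descriptor 0: control bits + length
    word1 = buffer_address & POINTER_MASK           # descriptor 1: pointer to the data
    return word0, word1


def descriptor_length(word0: int) -> int:
    """Extract the byte count from descriptor 0."""
    return word0 & LENGTH_MASK
```

An 11-bit length field is enough for a maximum-size Ethernet frame, and a 30-bit pointer is consistent with word-aligned buffers in a 32-bit address space, which is why the sketch rejects unaligned addresses.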
2.4.1. Transmission
The data is placed through AHB starting at the address pointed to by descriptor 1. The data must contain the destination and source MAC addresses and the type/length field. The 4-byte CRC (Cyclic Redundancy Check) is appended automatically.

Next, the address of descriptor 0 is written to register 5. GReth starts the transmission when instructed to do so through register 0.

When the transmission ends, GReth writes status information to register 1 and to descriptor 0. Finally, it points to the next descriptor pair and is ready for the next operation.

2.4.2. Reception
The address of descriptor 0 is written to register 6. GReth reads the descriptors when instructed through register 0 and waits for an incoming packet. The packet is accepted when its destination MAC address is the one set in registers 2 and 3 or the broadcast address, or when the core has promiscuous mode enabled. In any other case it is discarded.

When reception ends, status information is written to register 1 and descriptor 0, and the received data is accessible starting at the address pointed to by descriptor 1.

2.4.3. MDIO
This interface gives access to 1 to 32 PHYs, each containing 1 to 32 16-bit registers. Its control and status are accessible through register 4.

A write is started by specifying the data, the PHY number and the register number, and setting the write bit to '1'. A read requires the PHY number and the register number, and is started by setting the read bit to '1'.

3. TESTING GRETH

In order to detect any error introduced while simplifying the core, a testbench was designed for GReth. This gave us a better understanding of how it works, particularly considering its use of the two-process method.

The test consisted of instantiating GReth together with a description called FakePHY, which emulates a PHY; performing MDIO writes and reads and MII transmissions and receptions from the AMBA interfaces; and checking that what was sent matched what was received, aborting otherwise.

To drive AMBA, a library called AMBA Handler was developed for simulation purposes. It implements eight procedures that represent the combinations of write or read, to a master or a slave, over APB or AHB.

4. THE DEVELOPED CORE: MAC Ethernet

4.1. Introduction
Fig. 2 shows two summarized schemes of component instantiation: that of the core we started from (left) and that of the core we obtained (right).

Fig. 2. Instantiation scheme of GReth (left) and MAC (right).

The top level of GReth instantiates the transmit and receive FIFOs and the ethc0 component, and implements descriptor-based handling of the core, MDIO communication, part of the AMBA communication, the EDCL interface and the synchronization between clock domains. The ethc0 component instantiates the components that handle transmission and reception through MII/RMII, and a component that handles the other part of the AMBA communication.

The MAC core we developed has a purely structural top level that only instantiates the so-called transmit and receive channels and, optionally, the MDIO interface. These channels internally instantiate dual-port RAMs, the components that handle transmission and reception through MII, and components for synchronization between clock domains.

4.2. Implementation
The developed core was written in standard VHDL 93. It was developed with the tools and guidelines recommended by the FPGALibre project [7].

With respect to GReth, some features were removed, some descriptions were replaced and others were partially or completely modified.

The following features were removed:
- Use of AMBA buses.
- Descriptor-based handling.
- EDCL interface.
- RMII support.

The generic FIFOs used in GReth were replaced by FIFOs developed in our laboratory, implemented with dual-port RAM. In addition, they are now instantiated

inside the new transmit and receive channels, which implement the communication between the MAC and a higher-level application in a much simpler way.

The MDIO functionality was extracted from the complex description in which it was embedded and became an independent component.

The components that handled transmission and reception through MII are, together with MDIO, the only ones that keep part of the original description and the use of the two-process method. They underwent changes such as: removal of RMII support; removal or simplification of states in their FSMs (Finite State Machines); removal or modification of control and status signals; and removal of the component that filtered possible glitches on the reset signal.

Synchronization between clock domains used to take place between the FIFOs and the transmit and receive components; it now takes place between the write and read ports of the dual-port RAMs. Moreover, it used to be functionality scattered across several parts of the description, whereas it now uses a new component developed for that purpose.

4.3. Architecture
Fig. 3 shows a block diagram of the core, in which the three clock domains the system works with can be seen.

Fig. 3. Block diagram of the MAC core.

Transmission consists of an FSM that, depending on input signals, writes data to a FIFO implemented with a dual-port RAM. When the data transfer to the FIFO ends, the wr_end signal is generated; after being synchronized, it is detected by the FSM that reads the data from the FIFO and transmits it through MII. Once all the data has been read, the rd_end signal returns the write FSM to its initial state.

Reception is similar to transmission, except that the data written to the FIFO is the data obtained from MII, and the data read from the FIFO is made available for use. To avoid losing packets because the application has not finished retrieving the received data, a multiple-FIFO scheme was implemented. The number of FIFOs is configurable and they are managed exclusively by the core.

4.4. Usage
The core offers several configurations based on generics, the most relevant being:
- TXFIFOSIZE and RXFIFOSIZE: storage capacity of the FIFOs, in bytes.
- RX_CHANNELS: number of receive channels to use. Each channel implies the use of one FIFO.
- ENABLE_MDIO: indicates whether the MDIO module is used.

It also has control lines to:
- Enable or disable the transmit and receive channels.
- Enable or disable interrupt signals.
- Select half or full duplex.
- Specify the MAC address.
- Enable promiscuous mode.

4.4.1. Transmission
The start and end of a transfer are indicated with separate, dedicated signals. Data is confirmed with a write signal. The channel provides a busy indication and reports memory overrun errors and the case where the limit of transmission retries on the bus has been reached.

4.4.2. Reception
Available data is signalled by driving a signal high, which stays high until all the data has been read. Reads are confirmed with the read signal, or aborted if the application decides to discard the packet. The errors it reports are: overrun of the data memory; received packet shorter or longer than the minimum or maximum supported by Ethernet; alignment or CRC error; and amount of received data not matching the value in the length field of the received packet.

4.4.3. MDIO
It has features similar to those of GReth, but a new interface. It has separate signals for the PHY number, the register number, and the input and output data. Individual signals indicate whether the operation is a write or a read. Finally, it has a busy signal and a communication-failure signal.

5. VALIDATION OF THE DEVELOPED CORE

5.1. Simulation
GHDL [8] 0.28 was used for simulation.

A testbench was built in which the FakePHY core is instantiated again, this time together with our MAC. Unlike the GReth test, this one is more rigorous and includes features such as:
- It implements separate processes for transmission and reception, instead of a single sequential one.
- It verifies that error indication works.
- The three clocks it uses are not exact multiples of each other, which allows a better simulation of the synchronization between signals.

In addition, a core called Replies was developed, which answers ARP (Address Resolution Protocol) and ICMP (Internet Control Message Protocol) requests. Note that the mechanisms it uses do not follow what is specified for these two protocols; they are shortcuts for testing purposes. This core was used in a testbench together with real Ethernet frames captured with the wireshark software [9], to recreate the execution of the ping command and to visualize the waveforms and the data packets exchanged.

5.2. Hardware validation
Hardware validation was carried out on a Xilinx Virtex 4 FPGA with the ISE WebPack 11.3 - L.57 software. The host was a personal computer running the Debian [10] GNU [11]/Linux operating system.

The Replies core, which is synthesizable, was used as the application. Once the core passed the testbench without reporting any error, multiple tests were run with the ping command, lasting from hours to more than a week, with zero lost packets in every case. The wireshark software was used again, this time to verify that the received packets were correctly formed.

The external PHY used was the DP83847 from National Semiconductor. The tests used 100 Mb/s full-duplex communication.

6. RESULTS

Table 1 shows the synthesis results of the GReth and MAC cores for a Virtex 4.

Table 1. Synthesis results

GReth core
Configuration        LUTs   FFs   Slices   BRAMs
Without MDIO         1814   775     1099       2
With MDIO            2011   834     1220       2

MAC core
Configuration        LUTs   FFs   Slices   BRAMs
1 RX without MDIO     823   333      491       2
2 RX without MDIO     872   341      516       3
2 RX with MDIO       1016   381      591       3

For GReth, the most common configurations were synthesized, with and without the MDIO interface, in both cases with the EDCL interface disabled. For the MAC, the same options were synthesized with two receive channels, which is the most common use case, and also with a single receive channel, which can be enough for many applications that do not require a continuous data flow.

7. CONCLUSIONS

Comparing the synthesis results shows that the implementation obtained is more compact than the one we started from. For equivalent configurations, our core uses less than 50 % of the FPGA area used by GReth when measured in slices (516 versus 1099 without MDIO, and 591 versus 1220 with MDIO). It should also be considered that the GReth core requires memory accessible through AMBA, in addition to all the support for descriptor handling, whereas our core includes everything needed to be used directly.

As for usage, the developed core is simpler and does not depend on a particular bus, although it can easily be adapted to whichever bus is needed, be it AMBA, WISHBONE [12] or another. The simplified usage and the change of architecture are the main reasons for the lower FPGA resource usage.

The use of standard VHDL 93 allows the core to be synthesized on an FPGA from any vendor.

The tools proposed by the FPGALibre project proved adequate for a project of this kind.

Future work could address both lower layers, such as implementing an Ethernet PHY, and higher-level applications that handle the IP (Internet Protocol) protocol.

8. REFERENCES

[1] S. E. Tropea and R. A. Melo, “USB framework - IP core and related software,” in XV Workshop Iberchip, vol. 1, Buenos Aires, 2009, pp. 309–313.
[2] GRLIB IP Core User’s Manual, 1.0.19 ed. Gaisler Research, 2008, pp. 324–336.
[3] J. Gaisler, “An open-source VHDL IP library with plug&play configuration,” in IFIP Congress Topical Sessions, R. Jacquart, Ed. Kluwer, 2004, pp. 711–718.
[4] ARM. (2010, Jun.) AMBA - Advanced Microcontroller Bus Architecture. [Online]. Available: http://www.arm.com/products/system-ip/amba/amba-open-specifications.php
[5] Free Software Foundation, Inc., “GNU General Public License,” http://www.gnu.org/copyleft/gpl.html.
[6] J. Gaisler, “A structured VHDL design method,” http://www.gaisler.com/doc/vhdl2proc.pdf, Jun. 2010.
[7] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, “FPGAlibre: Herramientas de software libre para diseño con FPGAs,” in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 173–180.
[8] T. Gingold. (2010, Jun.) A complete VHDL simulator. [Online]. Available: http://ghdl.free.fr/
[9] G. Combs and contributors. (2010, Jun.) Network protocol analyzer. [Online]. Available: http://www.wireshark.org/
[10] I. Murdock et al. (2010, Jun.) Debian GNU/Linux operating system. [Online]. Available: http://www.debian.org/
[11] R. M. Stallman et al. (2010, Jun.) The GNU project. [Online]. Available: http://www.gnu.org/
[12] Silicore and OpenCores.Org. (2010, Jun.) WISHBONE System-on-Chip (SoC) interconnection architecture for portable IP cores. [Online]. Available: http://prdownloads.sf.net/fpgalibre/wbspec_b3-2.pdf?download

AUTONOMOUS WIRELESS INTELLIGENT NETWORK ACCESSIBLE VIA IP

María Isabel Schiavon, Daniel Alberto Crepaldo

Laboratorio de Microelectrónica
Universidad Nacional de Rosario, Argentina
[email protected], [email protected]

ABSTRACT
An autonomous wireless intelligent network is presented. Its intended function is to sense meteorological data in the field. A minimum, dedicated set of Internet Protocol rules was selected for communications, so that the network can be accessed remotely from an Ethernet wireless local area network. The internal intelligence of the network is centered on dynamic topology reconfiguration according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration.

1. INTRODUCTION

An autonomous wireless intelligent network (AWIN) is presented. It is defined as a wireless Ethernet local area network. All communications, internal and external, are made via the Internet Protocol (IP). Remote access to the stations via wireless Ethernet is enabled for the reset process or for data gathering. The protocol for wireless Ethernet networks is defined in the IEEE 802.11 standard [1] [2]. Its rules are independent of technology and internal structure. The minimum necessary subset of these rules was selected to implement the node communication module. The network has one IP address; all nodes share this IP address and have their own physical (MAC) address.

Internal network intelligence is centered on dynamic reconfiguration of the architecture according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration. BGP was developed to allow effective all-to-all interconnection between autonomous systems via IP [3]. As BGP capabilities exceed the needs of an autonomous network, only the capabilities needed for this specific application were selected. Performing dynamic reconfiguration in a simple way, adding or removing nodes and changing the communication path without affecting network performance, presented an interesting trade-off. The design goals were high performance, low cost and minimum power consumption. Figure 1 shows a fourteen-node network before the communication architecture has been built.

Figure 1. Fourteen-node network with unbuilt architecture (1 IP address, 14 MAC addresses)

The network is identified by one IP address, so all the nodes share it and have the same structure and capabilities, but each of them is identified by a different MAC address.

The network builds its communication architecture autonomously. As each wireless network node can communicate only with the nodes that are within range of its transmitter, communication inside the network must go from neighbor node to neighbor node, or "mouth to mouth". Once the communication path is defined, as shown in figure 2, the network is ready and the programmed process starts.

Node deployment is not fixed and may change over time. Nodes are battery powered, so the transmitter range is affected by the state of battery charge. This or another cause of failure, such as environmental or electronic hazards or accidental destruction, can put some nodes out of service. If one or more nodes stop working, the network must reconfigure itself to keep communication alive, as shown in figure 3. An architecture check is done periodically and, when necessary, the communication path is reconfigured.

When external access is required, the request can be received by many nodes; the first node that answers assumes the role of hub node. The hub node is responsible for wireless communication with the external Ethernet network, and all other nodes must report to it using intermediate nodes as repeaters.

Figure 2. Fourteen-node network communication path

Figure 3. Fourteen-node network communication path with node C out of service

2. NODE DESCRIPTION

A typical network node block diagram is shown in figure 4. Two subsystems can be distinguished: one for communication and the other to manage sensor activity and configuration.

Figure 4. Network node block diagram (communication subsystem: TRANSMITTER/RECEIVER to/from the Ethernet network, PROTOCOL CODE/DECO, COMMUNICATION MEMORY; sensor subsystem: SENSOR SUBSYSTEM CONTROL, SENSOR MEMORY, SENSOR)

The communication subsystem has three blocks. The first is a wireless Ethernet-compatible transmitter/receiver. The second (PROTOCOL CODE/DECO) is a dedicated communication module. In a reception process it interprets the message according to the IP protocol, stores in memory the fields it needs to keep and passes the data to the sensor subsystem; in a transmission process it shapes the frame according to the Ethernet protocol, retrieving from memory the fields needed to build the outgoing message. The last is a memory block (COMMUNICATION MEMORY).

The sensor subsystem is composed of three blocks: one to manage all subsystem activities (SENSOR SUBSYSTEM CONTROL), a memory block to store data and configuration parameters (SENSOR MEMORY) and the sensor itself (SENSOR).

The transmitter/receiver to be used in this application will be a wireless Ethernet IEEE 802.11 compatible transmitter/receiver, and its description is outside the scope of this paper.

The dedicated communication module block (PROTOCOL CODE/DECO) was designed on the basis of earlier works [4] [5]. The internal working frequency of the system was set at 100 MHz, and part of the Ethernet manager works at 50 MHz. It is a bidirectional block that manages data transmission and reception. As a receiver, it recognizes, decodes and processes the incoming frame according to the Ethernet rules. In data transmission, the reverse process is managed.

The block selects between a transmission and a reception process. In a transmission process, the output frame is shaped by assembling the data coming from the sensor subsystem with the destination and origin MAC and IP addresses and the control bits. Before transmission starts, channel occupancy is checked; when the channel is free, transmission is enabled.

In a reception process, reception starts when a valid data frame is detected. The incoming frame is processed according to the protocol and the destination IP address is checked against the network address; if they do not match, the frame is discarded. If the origin MAC address matches one of the MAC addresses of the network nodes, an internal network message is identified; otherwise an external communication is detected.

In both cases, the decoding process is carried out and redundancies are checked through a feedback shift register that was proposed in XILINX application notes [6]. The origin and destination MAC and IP addresses are extracted and stored in COMMUNICATION MEMORY to be used in building the answer message, and the data is submitted to the sensor subsystem with a special bit code that identifies the communication as external or internal. COMMUNICATION MEMORY was implemented as a memory with two read/write ports.
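The reception decisions described above can be summarized in a short Python sketch. It is a software illustration of the decision order, not the hardware implementation: the names `Verdict` and `classify_frame` are hypothetical, the redundancy check is assumed to be a CRC-32 (computed here with `zlib.crc32` instead of a feedback shift register), and discarding a frame whose check fails is also an assumption, since the paper only states that redundancies are checked.

```python
import zlib
from enum import Enum


class Verdict(Enum):
    DISCARD = "discard"
    INTERNAL = "internal"   # message from another node of the network
    EXTERNAL = "external"   # message from outside the network


def classify_frame(dst_ip: str, src_mac: str, data: bytes, received_crc: int,
                   network_ip: str, node_macs: set[str]) -> Verdict:
    """Decide what to do with a received frame."""
    # ASSUMPTION: CRC-32 redundancy check; a failing frame is dropped.
    if zlib.crc32(data) != received_crc:
        return Verdict.DISCARD
    # The destination IP address must match the single address of the network.
    if dst_ip != network_ip:
        return Verdict.DISCARD
    # A known origin MAC address marks an internal message; any other is external.
    if src_mac in node_macs:
        return Verdict.INTERNAL
    return Verdict.EXTERNAL
```

Because every node stores the MAC addresses of all the others, the internal or external decision reduces to a set membership test on the origin address.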

Sensor Subsystem has three blocks: the SENSOR Once received the KEEPALIVE message, the hub
SUBSYSTEM CONTROL (SSC), a memory block to store node emits an UPDATE message to notifying its
sensor data and address and configuration parameters neighbors MAC addresses. Neighbor nodes receives
(SENSOR MEMORY) and the sensor it self. SSC has the message and emits an UPDATE message to announce
responsibility of management all sensor subsystem their own neighbor addresses and the route to reach hub
activities. node. Every node that receives the message repeat the
operation announcing its MAC neighbor addresses and
3.NETWORK OPERATION. the route to reach the hub node, and information goes
spreading for the network.
Network operations are differenced in five categories. When all nodes have been reached and the path
communication information has been stored in all of
Three of tem are defined for external communication
them, the net architecture is completely configured and
(shown in figure 1) and they are identified as Network Set
sensors start DR process. KEEPALIVE messages will be
Up, Network Programming and Data Gathering.
The fourth category corresponds to an internal periodically exchanged to ensure that the relationship
communication process of the net and it is defined as Network Configuration, and the last, which is identified as Data Recollecting, is defined for storing data collected by the sensor in the sensor memory.

Network Set Up (NSU) is the starting process. Assuming the network has a predefined number of nodes, each of them identified by a different MAC address, and each node has stored the addresses of all the others, an external NSU message is required to start network operation. When the NSU message is received, the node that first answers the request assumes the role of hub node, and the Network Configuration process (NCP) is started (figure 5).

A dedicated protocol based on BGP was developed for NCP. Devices that can communicate directly are defined as neighbors, and the first step is to detect them. The hub node sends a START message to all the other nodes; the nodes that answer the message are taken as neighbors and their MAC addresses are stored as neighbor addresses. After a prefixed time without receiving answer messages, the hub node assumes its neighbor table is complete, sends an OPEN message to each of its neighboring nodes, and waits for a KEEPALIVE message that only includes the BGP header. Each of the nodes carries out the same procedure to identify its neighbors.

Figure 5. NSU message reception (diagram: REMOTE STATION, HUB NODE and the remaining nodes; 1 IP address, 14 MAC addresses)

continues established. If a node goes out of service, a communication break is reported and the routes that include this node are reconfigured by generating UPDATE messages.

The Network Programming (NP) and Data Gathering (DG) processes start with the corresponding external messages. When an NP or a DG external message is received, all the nodes are able to receive it; the one that first answers the request assumes the role of hub node to receive and retransmit information.

NP is the process that programs the sensor parameters. The information spreads through the network, and each sensor is reprogrammed when the information is stored in the sensor memory of its node. DG is the process that transfers the data stored in the sensors out of the network. When the hub node sends a data request message, data travel from node to node until they reach the hub node, which transmits them to the external network.

Data Recollecting (DR) is an internal node process whose periodicity is programmed during the NP process.

4. CONCLUSIONS

The node structure and operation of an autonomous wireless intelligent network remotely reachable via the Internet were presented. The specific application is sensing meteorological data in the field.

The structure is the same for all nodes. All nodes have the same capabilities, share the same IP address and have different MAC addresses. The minimum necessary subset of the IEEE 802.11 standard rules was selected to implement the node communication module. Internal network intelligence is centered on dynamic topology reconfiguration according to the physical location of the nodes. The Border Gateway Protocol (BGP) was adapted to allow dynamic reconfiguration.
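As an illustration of the neighbor detection used by the Network Configuration process (nodes that answer START are stored by MAC address, and a node that goes out of service is removed so that the affected routes can be updated), a minimal C sketch could look as follows. The type and function names (neighbor_table, nt_add, nt_remove) and the table capacity are assumptions made for this example; they are not taken from the implemented nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6
#define MAX_NEIGHBORS 16 /* assumed capacity, not from the paper */

/* Hypothetical neighbor table: one MAC address per directly reachable node. */
typedef struct {
    uint8_t mac[MAX_NEIGHBORS][MAC_LEN];
    int count;
} neighbor_table;

void nt_init(neighbor_table *t)
{
    assert(t != NULL);
    t->count = 0;
}

/* Returns the index of mac in the table, or -1 if it is not a neighbor. */
int nt_find(const neighbor_table *t, const uint8_t mac[MAC_LEN])
{
    for (int i = 0; i < t->count; i++)
        if (memcmp(t->mac[i], mac, MAC_LEN) == 0)
            return i;
    return -1;
}

/* Called when a node answers START. Returns true if it was newly stored. */
bool nt_add(neighbor_table *t, const uint8_t mac[MAC_LEN])
{
    if (nt_find(t, mac) >= 0 || t->count == MAX_NEIGHBORS)
        return false;
    memcpy(t->mac[t->count++], mac, MAC_LEN);
    return true;
}

/* Called when a neighbor goes out of service. A true result means the
 * caller should generate UPDATE messages for routes through that node. */
bool nt_remove(neighbor_table *t, const uint8_t mac[MAC_LEN])
{
    int i = nt_find(t, mac);
    if (i < 0)
        return false;
    t->count--;
    if (i != t->count) /* move the last entry into the freed slot */
        memcpy(t->mac[i], t->mac[t->count], MAC_LEN);
    return true;
}
```

In such a sketch the hub node would call nt_add for every answer received before the START timeout expires, and nt_remove when a communication break is reported.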

Two prototype nodes were implemented on XILINX SPARTAN III field programmable logic devices available on Digilent S3 SKB development boards [7].

The design was validated with successful communication tests made in the laboratory. For the tests, the connection between nodes was implemented as a wired 10BASE-T connection synchronized at 10 Mb/s. Current work is the analysis and selection of an RF transmitter to implement wireless communication.

5. REFERENCES

[1] IEEE, IEEE STD 802.11-2007, “Revision of IEEE STD 802.11-1999”, June 2007.
[2] Waisbrot, J. “Request For Comments: 791”, http://www.rfc-es.org/rfc/rfc0826-es.txt
[3] Rekhter Y., Li T., Hares S. “Request for Comments 4271: A Border Gateway Protocol 4 (BGP-4)” http://www.ietf.org/rfc/rfc4271.txt
[4] Schiavon M. I., Crepaldo D., Martín R. L., Varela C. “Dedicated system configurable via Internet embedded communication manager module”, V Southern Conference on Programmable Logic, San Carlos, Brasil (2009) pp 193-197.
[5] Schiavon M. I., Crepaldo D., Martín R. L. “Wireless Internet configurable network module”, VI Southern Conference on Programmable Logic, Puerto Galhinas, Brasil (2010) pp
[6] Borrelli C. “IEEE 802.3 cycle redundancy check”, XILINX, App. Note XAPP209. March, 2001.
[7] Digilent S3 SKB development boards, SPARTAN 3 FPGA, and ISE platform, http://www.xilinx.com

Multi-Level Synthesis on the Example of a Particle Filter
Jan Langer, Daniel Froß, Enrico Billich, Marko Rößler, Ulrich Heinkel
Chemnitz University of Technology
Chemnitz, Germany
{laja,daf,ebi,marr,heinkel}@hrz.tu-chemnitz.de

Abstract—In this paper we compare two high level synthesis approaches on the example of a particle filter design. First, a C synthesis is used to transform C code into RT level VHDL. The second method employs the tool vhisyn to compile a set of operation properties written in ITL into RTL code. A particle filter component has been implemented using both methods and the resulting designs were synthesized and run on an FPGA board. The corresponding synthesis results have been compared to a hand coded design.
This work focuses on the comparison of two high level design methods starting from different levels of abstraction and hand coded VHDL. As a result, the resource utilization and timing of the high level designs are not prohibitively high. In particular, it is interesting to classify operation properties as an efficient prototyping and design method in certain application areas.
In general, high-level design methods are applied when a more abstract, concise and maintainable system description is required and only a short design time is allowed. Operation properties represent a compromise between abstract C based methods and classical RT design.

This research work was supported in part by the German Federal Ministry of Education and Research (BMBF) in the project HERKULES under the contract number 01 M 3082 and the project InnoProfile under contract number 03 IP 505.

I. INTRODUCTION

High level synthesis (HLS) raises the level of designing a system from the traditional register transfer (RT) level up to higher levels of abstraction. This step helps to improve both design productivity and achieved verification quality. In this paper, two very different approaches to HLS and a hand coded design on RT level are evaluated by means of a case study in performance and efficiency. A particle filter algorithm is used as an application example. The particle filter is an estimation technique for Bayesian models that is particularly well suited for localization purposes. Furthermore, the particle filter is a good example to illustrate certain aspects of the different design approaches of this work.

The first HLS approach is the generation of RT hardware based on a system description written in an augmented C language that will be translated into synthesizable VHDL. The resulting hardware implementation exploits coarse-grained parallelism on process level and low level parallelism on instruction level.

A fundamentally different approach is to utilize the InTerval Language (ITL), which was originally used as a formal verification technique. A system description is created as a set of Operation Properties that split the system’s behavior into operations of fixed length, which are connected by a property graph. Using ITL has been proposed as an intermediate HLS methodology that compensates for specific drawbacks of the previous approach.

This paper is structured as follows. First, an overview of previous work in the field of HLS is given. Section III describes the specification of the particle filter design. In sections IV and V, we provide some details about the high level design methodologies we have used. The paper concludes with a presentation of the design results and the respective performance of the two implementations compared to a hand coded VHDL design.

II. PREVIOUS WORK

High-level synthesis raises the design level with the objective to improve verification and system design productivity. Related work dates back 30 years, starting from algorithmic level [1] and moving up to system level. ANSI C/C++ and derivatives of them like SystemC, Single Assignment-C (SA-C) [2] and Handel-C [3] provide functionality similar to languages like Verilog and VHDL and aim at a unified hardware-software representation. Commercial and academic C to VHDL compilers like CatapultC, C-to-Silicon [4], Cyber [5] and others generate intermediate RT level code, which can be processed by logic synthesis tools afterwards [6]. C2H [7], Streams-C [8] and CoDeveloper [9] combine HLS and hardware software co-design. Tools for compiling other languages like Java [10] or Matlab to hardware appeared recently.

In general, it is a well understood process to generate executable and even synthesizable models from single temporal properties or sets of properties. Those models can be either used as monitors in system simulation and emulation or they form abstractions for early prototypes in system verification. Synthesizing temporal properties has mostly focused on Linear Time Logic (LTL) as implemented in PSL or SVA [11]–[14]. However, all those methods can only handle a subset of the operators of the property language or they can only process problems of very small complexity. Another problem

is ambiguity. In most cases, a property or a set of properties is satisfied by more than one exact behaviour. Thus, the synthesis method can either create a general solution that contains all consistent behaviour or an arbitrarily chosen specific solution.

In contrast to PSL or SVA, the synthesis of models from complete sets of ITL properties can profit from additional constraints that are not present in pure LTL properties. For one, the property graph connecting the operations imposes structural information that is used during synthesis. Furthermore, the special syntax of ITL (in many aspects more restricted than general LTL) and the assertions obtained during the check for completeness simplify the synthesis process and allow a much higher complexity to be handled. Thus, in [15] a tool vhisyn has been proposed to translate ITL descriptions to VHDL. This work uses the tool to generate the operation property based design to be compared to the other two design approaches.

Similar to this paper, [16] also uses ITL properties to generate executable models, called Cando objects. The algorithm does not employ the property graph structure, and on one hand is more general than our approach, but on the other hand less able to handle complex property descriptions.

Case studies of HLS tools are available (e.g. in [6], [17]–[19]), but limit the comparison exclusively to either programming language based HLS approaches or to RT level designs. To the best of our knowledge, there is presently no comprehensive case study available that comparatively qualifies the results of synthesizing a complete design of a complex algorithm at these levels of abstraction.

III. PARTICLE FILTER

This section presents a particle filter for localization estimation as a possible specification for a hardware implementation. The filter estimates an object’s three-dimensional position by incorporating distance measurements to reference points of known position. The localization problem is similar to that of the global positioning system (GPS).

The particle filter has been chosen as an example for this comparative work, because it can be described as a short, well-understood piece of C code that will be used as a starting point for C based synthesis. Furthermore, the particle filter’s behavior can be split into meaningful operational properties, making it a feasible target of property based synthesis. However, despite these characteristics, a specific hardware implementation of this design on register-transfer-level requires a lot of work. Considering these facts, the particle filter appears as an ideal candidate for a study to compare the design approach using operational properties with both a higher level method based on C and a lower level manual implementation.

A particle filter is a nonparametric implementation of the Bayes filter algorithm, where the posterior distribution is approximated by a set of random state samples (particles). The likelihood of the true system’s state is proportional to the density with which a region of the state space is populated by particles. See [20] for a comprehensive introduction to particle filters. For reasons of approximation accuracy the number of particles has to be large, depending on the problem to be estimated. As a consequence, a software implementation on an embedded microprocessor platform is infeasible due to low update rates. This has made a hardware implementation necessary. In our case, the state to be estimated is the unknown position (x, y, z) of the object. Thus, every particle represents one possible position hypothesis

$$p^{[m]} = (x^{[m]}, y^{[m]}, z^{[m]}), \tag{1}$$

where m is the running index in the particle set. A filter update at time t consists of the following steps:

1) Prediction. A hypothetical position $p_t^{[m]}$ for each particle is predicted at the current timestep t based on its former position $p_{t-1}^{[m]}$. Therefore, every new particle has to be sampled from a proposal distribution that is based on a given state transition or motion model. In our case, the mobile node is assumed to move without any favored direction. Hence, this distribution is modeled symmetrically around $p_{t-1}^{[m]}$ as a three dimensional normal distribution with identical variances $\sigma_p^2 \Delta t$ for x, y and z. Due to the fact that positional uncertainty increases with time, the variance values are scaled with the time difference $\Delta t$ between the current time and the time of the last filter update.

2) Weight Calculation. The next step consists of calculating a weight $w_t^{[m]}$ for each particle $p_t^{[m]}$ by incorporating a distance measurement $d_t$ between the object and an anchor position $p_a$. This weight is the probability of the distance measurement under the particle $p_t^{[m]}$. In our case the weight is given by

$$w_t^{[m]} = \frac{k}{k + |\Delta d|}, \quad k > 0 \tag{2}$$

$$\Delta d = d_t - |p_t^{[m]} - p_a| \tag{3}$$

where $\Delta d$ is the difference between the measured distance $d_t$ and the expected distance (Euclidean distance between particle and anchor position). The scaling constant k characterizes the quality of distance information. If predicted and measured distance match exactly, the weight reaches its maximum of one. With increasing difference the weight decreases asymptotically to zero according to the value of k.

3) Resampling. The final particle set is generated through a resampling procedure of the hypothetical set from step 1). The probability of drawing each particle from the set is given by its weight. The resulting particle set possesses duplicates of particles with large weights while particles of lower weight have been replaced. Thus, the resulting particle set focuses on regions with high posterior probability. In our implementation, a so-called low variance sampler from [20] is deployed. In

a first step, a single random number r in the interval $[0; \bar{w}_t)$ is chosen, where $\bar{w}_t$ is the arithmetic mean of all particle weights. In the following steps the algorithm selects particles by repeatedly adding $\bar{w}_t$ to r and by choosing the particle that corresponds to the resulting value. Figure 1 illustrates this resampling method.

4) Density Extraction. Finally, based on the discrete particle set maintained by the filter, a continuous density is estimated. We compute the mean and the covariances over all particles assuming them to be normally distributed. The probability density at any position can then be calculated by a normal distribution using the obtained mean vector and covariance matrix.

Fig. 1. Low variance resampling procedure (the weights $w_t^{[1]}, w_t^{[2]}, \ldots$ laid end to end and sampled at $r$, $r + \bar{w}_t$, $r + 2\bar{w}_t$, ...).

To compare both high level design approaches to a hand coded VHDL design, the particle filter has been implemented using all three methods. All designs are structured similarly as shown in Fig. 2. The three blocks prediction, weight calculation and resampling correspond to the update rules 1) to 3) above. The resampling block will not start operating until the cumulative sum of all particle weights is available. Therefore, the weights and positions of one complete set (M = 8192) of particles need to be stored in a FIFO that is located at the input of the resampling block. As soon as the resampled particles drop out of the resampler, they are processed by the prediction and weight calculation and again pushed into the FIFOs. The statistics block corresponds to update step 4) and calculates mean and covariance parameters over all particles.

Fig. 2. Block diagram of the particle filter design (blocks: Prediction, Weight Calculation, Weight FIFO and Position FIFO with at least M entries, Resampling, Statistics, PowerPC; signals: measurement, weights, particles, mean / covariance).

IV. C-BASED SYNTHESIS

To synthesize hardware from derivatives of the sequential software programming language C, several problems have to be considered. The programming model of pure C does not define certain aspects of the concurrency model, data types, timing specifications, memories, communication patterns and other constraints [21]. However, these aspects are crucial to synthesize the corresponding hardware structures. Handling of these issues differs between the available C synthesis tools and there appears to be no clear winning solution.

Nevertheless, all tools share a more or less semi-automated way to handle the various levels of parallelism to generate hardware with reasonable performance. For the work in this paper, the tool CoDeveloper by Impulse Accelerated Technologies has been used. It is the commercial successor of the Streams-C compiler. In general, the principles described in this paper also apply to other synthesis tools based on C that do not depend on explicit annotation of concurrency on a fine grained level.

On the lowest level, blocks of C code, bounded by control statements (e.g. case, if, for, ...), are automatically processed to exploit parallelism. Data dependencies between instructions are analyzed to extract implicit concurrency. Simple operations (e.g. addition of fixed point values) are directly mapped to the corresponding HDL statement, whereas more complex instructions are mapped to specific components from a library. The following allocation step decides how many operators will be instantiated and how memory access and data operations are scheduled into fixed time slices according to their estimated execution time.

The automatic transformation of loops and control structures generally results in state machines. Loops are either unrolled, with each step executed concurrently to minimize computational delay, or the steps are pipelined for area efficiency. Unrolling and pipelining span a rather large design space bound by the required speed (frequency) and size (area) of the chip. A constraint driven synthesis process explores solutions to meet the restrictions defined by the designer.

The original resampling algorithm of the particle filter is shown in the left part of Fig. 3. The resulting scheduling is annotated in the right part. The initialization phase takes two cycles due to a memory read. Loop conditions consume one cycle and the loop bodies two cycles each due to data dependencies and memory accesses.

    C-Code                         Cycle   Block
    U = rand() % step;             0       Block1
    i = 0;                         0
    j = 0;                         0
    c = M*weight[i];               0-1
    for (j=0; j<M; j++) {          2       Loop1
      while (U>=c) {               3       Loop2
        i++;                       4
        c += M*weight[i]; }        4-5
      state2[j] = state1[i];       6-7     Block2
      U += step; }                 6

Fig. 3. C code of the low variance resampling algorithm.

Pure C language is not especially well-suited to specify hardware. Therefore, a designer is forced to guide the synthesis

process in order to achieve the best possible performance. Guidelines by tool vendors and the research community include combining loops, combining or splitting memories, and marking loops for pipelining or unrolling. In general, it is necessary to review the synthesis results in order to optimize critical code sections. The resulting C code might be less efficient when run in software but more suitable for hardware synthesis. Fig. 4 shows an optimized version of the resampling algorithm of the particle filter. Rewriting the algorithm and advising loop pipelining reduced the latency in each path to two cycles.

    C-Code                         Cycle   Block
    U = rand() % step;             0       Block1
    i = 0;                         0
    j = 0;                         0
    c = M*weight[i];               0-1
    while (j < M) {                1       Loop1
    #pragma CO PIPELINE
      if (U>=c)                    1
        c += M*weight[++i];        2-3     Block2
      else {
        U += step;                 4
        state2[j++] = state1[i];   4-5     Block3
      }
    }

Fig. 4. Optimized C code of the resampling algorithm.

All C-synthesis tools require a manual definition of parallelism on the coarse grained level. This is often achieved by processes or threads. In particular, the fundamental unit of concurrently executed computation in CoDeveloper is called a process. Streams, signals, registers and shared memories are provided to synchronize processes and to extract the global data path. The implementation of the particle filter in Fig. 2 uses processes for weight calculation, prediction and resampling. Global arrays are used to buffer the particles between the processes, whereas all remaining communication utilizes streams.

V. OPERATION PROPERTIES

The commercial tool 360MV™ by OneSpin Solutions [22] introduces a Gap Free Verification methodology based on operation properties. It provides a special property syntax known as InTerval Language (ITL). A set of additional rules helps to write a complete set of properties that explicitly covers the design intent for every valid sequence of input values. The tool employs a powerful engine to prove the completeness of the property set as well as the correctness of each individual property with respect to the design. A property set is complete if the conjunction of the properties alone is able to map every valid sequence of input data to exactly one corresponding sequence of output data [23]. The completeness of a property set can be proven without the need of an actual design.

To illustrate the property-based design, we want to show one property of the resampling component. The resampler’s behavior is first split into distinct operations based on the specification. It turns out that exactly five operations are needed, as shown in Fig. 5. The reset property sets the component’s state variables to defined values after a system reset has occurred. Furthermore, it defines the values of all output signals in this phase. The idle property is activated in the time between subsequent update cycles of the filter and sets the output values to zero. In case a new update cycle is started, the start property applies and prepares the internal variables for the following resampling process. The two properties read and write alternate according to the received particle weights. As soon as all particles have been read and as many particles have been written, the idle property is activated again.

Fig. 5. Property graph of the resampler (nodes: reset, idle, start, read, write).

The resampling component’s read operation picks the state and weight of the next particle from the FIFO and does not write a new particle to the output. This operation is shown in Fig. 6. The corresponding timing diagram visualizes the behavior. The expressions in the assume part of the operation form the antecedent of the property and indicate the activation conditions. In this case, the read property is executed as long as variable U is greater than or equal to variable c and the particle read count i is smaller than the total number of particles M. The prove part forms the consequent of the operation and sets the output and internal signals to their new values.

    property read is
    assume:
      at t   : U >= c;
      at t   : i < M;
    prove:
      at t+2 : i = prev(i,2)+1;
      at t+2 : c = prev(c + weight,2);
      at t+2 : U = prev(U,2);
      during[t+1,t+2] : wr_en = '0';
      during[t+1,t+2] : state2 = 0;
      at t+1 : rd_en = '1';
      at t+2 : rd_en = '0';
      ...
    end property;

Fig. 6. ITL code and timing diagram of the read property (timing diagram over t, t+1, t+2 for i, c, U, wr_en, state2 and rd_en not reproduced).

In contrast to high level synthesis approaches based on algorithmic descriptions like the C language, the properties contain no loops. The user has to encode loop-like behavior implicitly in the sequence of the operations allowed by the property graph. Furthermore, the properties are designed such that they can partly overlap and therefore exploit a pipelining behavior in the resulting design.

The length of the read operation is two cycles. So, during read’s third cycle at t + 2, the following property can be activated and the two properties overlap for one cycle. In general, the use of operations is more beneficial for properties

of lengths greater than one, since there is no need for the designer to define an FSM for the substates of the operation.

The process of designing the properties is supported by the 360MV verification tool. Its completeness check guarantees that for each specific input sequence there is exactly one unique sequence of properties defining a unique and deterministic output sequence. When the completeness check is employed during the design process, fewer errors will remain undetected, and the design quality improves.

In [15], it is argued that operation properties are an alternative, and in some application areas more convenient, design approach. That means operation properties are not suitable for all kinds of designs. However, in cases that are dominated by "sequences" of behavior that are activated under certain conditions, operation properties are a very natural design method. For an experienced designer of the future, it is most useful to know a couple of different design paradigms from which to choose the most suitable for the design task at hand. We propose operation properties as one of those paradigms.

The tool vhisyn derives a cycle-accurate register transfer model from a given specification based on a set of operation properties. It processes a set of ITL properties and outputs a VHDL model. The model is synthesizable and exactly implements the properties’ behavior as long as they satisfy the completeness check of the 360MV tool.

VI. RESULTS

Fig. 7 is a plot of a random movement of an object, whose position is estimated by the particle filter. Although the design estimates 3D positions, the plot displays only two dimensions. It can be seen that the estimated position follows the real position very closely, except near the border of the playground. The ellipse that is plotted at equidistant intervals of the estimation path represents the covariance matrix of the particle set. It indicates the certainty of the estimation. A sparse distribution of particles results in a large ellipse, whereas a dense particle cloud results in a small ellipse, i.e. the filter more strongly "believes" in its estimation.

Fig. 7. A random walk of an object in the playground and the corresponding estimated position of the particle filter (both axes from 0 to 16000; legend: real position, estimated position, anchors, covariance ellipse).

One interesting aspect of comparing the three approaches is the effort to describe the design. As shown in Table I, the property-based ITL design took about 3 days to write the properties. Most of this time has been spent on implementing arithmetic operations, such as a square root algorithm, which could be reused as library functions in later designs. Furthermore, the ITL code contains not only the functional design description, but also the necessary code to conduct a complete formal verification.

TABLE I
DESIGN DESCRIPTION AND SYNTHESIS RESULTS

                              Hand coded     vhisyn        CoDeveloper
    lines of code             2138 (vhdl)    1243 (vhi)    447 (C)
    estimated design effort   1-2 weeks      3 days        2 days
    slices                    3855 (28%)     6011 (43%)    4603 (33%)
    slice FF                  5924 (21%)     5120 (18%)    5286 (19%)
    4 input LUT               3552 (12%)     8930 (32%)    6387 (23%)
    BRAM                      70 (51%)       69 (50%)      82 (60%)
    MULT18x18                 18 (13%)       23 (16%)      29 (21%)
    max. freq. (in MHz)       182            25            113
    avg. cycles per particle  2              3             66

In contrast to that, the hand coded VHDL design took between one and two weeks of pure coding on RT-level. Since most of the implementation level decisions have already been explored and specified during the property-based design, the VHDL design could follow a fairly straightforward path. This also illustrates the use of operation property based design as a prototyping method that allows design decisions to be explored quickly while still resulting in an actual design that is ready for simulation or logic synthesis.

The ImpulseC design methodology using CoDeveloper can be divided into two phases. First, the ImpulseC code has been implemented based on the pseudo-code reference implementation given in [20]. This step has been completed within just a few hours. However, the synthesis result was some orders of magnitude slower than the results obtained with the other methods. Consequently, it has been necessary to optimize critical code sections and to partition the design functionality into smaller blocks to improve the coarse grain parallelism.

Additionally, Table I shows the results of the synthesis of all three designs. It can be seen that the design sizes have the same order of magnitude. As expected, the hand coded design is the smallest, whereas the vhisyn approach uses nearly twice the resources. Furthermore, the timing of the designs is considerably different. There are several reasons for this:

1) First, in contrast to the hand coded design the two generated designs do not use the highly optimized and efficiently pipelined components generated by the

Xilinx Coregen Tool. This applies for example to the various arithmetic operators with large bit widths such as division, square root and multipliers.

2) The design generated by CoDeveloper is of moderate size and reasonably fast but it needs about 66 cycles to process one particle. In particular, CoDeveloper fails to implement a pipelined division and employs a sequential component that needs 64 cycles for one operation.

The runtime of the synthesis tools themselves has been negligible. The vhisyn tool runs for about 16 seconds to generate the particle filter design. The time scales linearly with the amount of hardware generated. It has been intended as a prototyping platform and offers a lot of room for speed improvements. In general, when developing vhisyn, it has been a major point not to include algorithms that do not scale well with big, industrial strength design blocks. By far the largest runtime is consumed by the tools that process the generated VHDL code and generate a bitstream file for the FPGA.

VII. CONCLUSION

In this paper, we used two high level design approaches to implement a particle filter design. We compared the two generated designs to a hand coded VHDL design of the same functionality. As expected, the hand coded design leads in terms of resource utilization and maximum frequency. However, when considering the improved ease of use and much lower code maintenance costs of both the property and the C code approach, the higher resource requirements and lower maximum frequency seem to be acceptable.

Furthermore, it can be seen that the C based methodology is more abstract than the property based method, which results in a very low implementation effort but reduced control over the cycle accurate behaviour.

One of the most important aspects of the property based design effort has been the constant use of formal verification, which provides the designer with information about the design quality. Such measures are the determination of all output signals at each time step, the absence of deadlocks in the control flow automaton and the unambiguous design behavior for every possible sequence of valid input data.

The paper classifies operation properties as an intermediate level of description for hardware blocks that offers a valuable design approach for certain applications. In a future development environment or even single hardware description

[2] W. A. Najjar, W. Böhm, B. A. Draper, J. Hammes, R. Rinker, J. R. Beveridge, M. Chawathe, and C. Ross, “High-level language abstraction for reconfigurable computing,” Computer, vol. 36, no. 8, pp. 63–69, Aug. 2003.
[3] Celoxica Limited, “Handel-C Language Reference Manual,” 2005. [Online]. Available: www.celoxica.com
[4] Cadence Design Systems Inc., “Cadence C-to-Silicon Compiler Delivers On The Promise Of High-level Synthesis,” 2008.
[5] K. Wakabayashi, “C-based synthesis experiences with a behavior synthesizer, ‘Cyber’,” in Design, Automation, and Test in Europe (DATE). Munich: IEEE Comput. Soc, 1999, pp. 390–393.
[6] O. Hammami, Z. Wang, V. Fresse, and D. Houzet, “A Case Study: Quantitative Evaluation of C-Based High-Level Synthesis Systems,” EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[7] Altera Corporation, “Nios II C2H Compiler User Guide,” 2009. [Online]. Available: www.altera.com
[8] M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski, “Stream-Oriented FPGA Computing in the Streams-C High Level Language,” in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM). Washington, DC, USA: IEEE Computer Society Press, 2000, p. 49.
[9] M. Rößler, H. Wang, N. Engin, W. Drescher, and U. Heinkel, “Rapid Prototyping of a DVB-SH Turbo Decoder Using High-Level-Synthesis,” in Forum on Specification & Design Languages (FDL), Sophia Antipolis, France, Sep. 2009.
[10] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah, “Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary,” in Object-Oriented Programming (ECOOP). Springer, 2008, pp. 76–103.
[11] Y. Abarbanel, I. Beer, L. Gluhovsky, S. Keidar, and Y. Wolfsthal, “FoCs - Automatic Generation of Simulation Checkers from Formal Specifications,” in Computer Aided Verification. Berlin / Heidelberg: Springer, 2000, pp. 538–542.
[12] M. Boule and Z. Zilic, “Efficient Automata-Based Assertion-Checker Synthesis of SEREs for Hardware Emulation,” in Asia South Pacific Design Automation Conference (ASP-DAC). IEEE, 2007, pp. 324–329.
[13] R. Bloem, S. Galler, B. Jobstmann, N. Piterman, A. Pnueli, and M. Weiglhofer, “Specify, Compile, Run: Hardware from PSL,” Electronic Notes in Theoretical Computer Science, vol. 190, no. 4, pp. 3–16, 2007.
[14] K. Morin-Allory and D. Borrione, “Proven correct monitors from PSL specifications,” in Design, Automation, and Test in Europe (DATE), 2006, pp. 1246–1251.
[15] J. Langer and U. Heinkel, “High Level Synthesis Using Operational Properties,” in Forum on Specification & Design Languages (FDL), Sep. 2009, pp. 1–6.
[16] M. Schickel, “Applications of Property-Based Synthesis in Formal Verification,” Ph.D. thesis, Technische Universität Darmstadt, 2009.
[17] E. El-Araby, M. Taher, M. Abouellail, T. El-Ghazawi, and G. B. Newby, “Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study,” in Southern Conference on Programmable Logic (SPL). Mar del Plata: IEEE, Feb. 2007, pp. 99–106.
[18] S. Ahuja, S. T. Gurumani, C. Spackman, and S. K. Shukla, “Hardware Coprocessor Synthesis from an ANSI C Specification,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 58–67, Jul. 2009.
[19] L. Piga and S. Rigo, “Comparing RTL and high-level synthesis methodologies in the design of a theora video decoder IP core,” in Southern
language, algorithmic descriptions, operations and traditional Conference on Programmable Logic (SPL). Sao Carlos: IEEE, Apr.
RT level design will coexist and the developer chooses the 2009, pp. 135–140.
[20] S. Thrun, W. Burgard, and D. Fox, “The Particle Filter,” in Probabilistic
most appropriate design method for each individual block. In Robotics. MIT Press, 2005, ch. 4.3, pp. 96–113.
certain cases, even a mixture of different methods might be [21] S. A. Edwards, “The Challenges of Synthesizing Hardware from C-Like
applied. Languages,” IEEE Design & Test of Computers, vol. 23, no. 5, pp. 375–
386, 2006.
R EFERENCES [22] (2010) OneSpin Solutions. [Online]. Available: http://www.
onespin-solutions.com
[1] M. C. McFarland, A. C. Parker, and R. Camposano, “Tutorial on
[23] J. Bormann, “Vollständige funktionale Verifikation,” Ph.D. thesis, Uni-
high-level synthesis,” in Design Automation Conference (DAC). Los
versität Kaiserslautern, 2009.
Alamitos, CA, USA: IEEE Computer Society Press, 1988, pp. 330–336.

LAYERED TESTBENCH FOR ASSERTION BASED VERIFICATION

José Mosquera, Sol Pedre and Patricia Borensztejn

Departamento de Computación
Facultad de Ciencias Exactas y Naturales
Universidad de Buenos Aires
email: [email protected], {spedre, patricia}@dc.uba.ar

ABSTRACT

In this paper we present the use of an assertion based verification technique in combination with our own layered testbench environment to dynamically verify a design implemented on an FPGA. We use PSL (Property Specification Language) to build a set of assertions that ensure the consistency between requirements and implementation. The layered simulation environment provides a higher level of abstraction, making the verification process easier and more robust than a monolithic testbench approach.

1. INTRODUCTION

In this paper we present a functional verification through simulation following the layered testbench methodology [1].

Monolithic testbenches are written to verify a specific functionality of a Device Under Verification (DUV), and the testbench designer has to take care of all aspects of the simulation at the same time, including low level details such as signal values and transitions. If another functionality needs to be verified, a new testbench has to be written.

In contrast, with layered testbenches it is possible to split the test environment into smaller pieces or components, allowing the designer to concentrate on a specific part of the environment. Each component has its own aim, such as stimulus generation, stimulus injection, response capture or automated response checking.

In an automated verification environment there is a need for a tool to measure the correctness of the behavior of the DUV. The Assertion Based Verification (ABV) [2] technique permits the construction of such a metric. ABV decomposes the intent of the design into properties that must not be violated during the simulation. We have chosen PSL [3] in its Verilog flavor to write a set of assertions and coverage points that should be satisfied during the simulation.

Despite the availability of verification frameworks like OVM/UVM [4], we decided to make our own implementation for two main reasons: budget constraints and our own goal of capturing the benefits of a layered testbench environment with ABV.

This article is organized as follows: section 2 describes the DUV, section 3 lists the chosen testcases, section 4 explains the design of the layered testbench, and section 5 introduces the assertion based verification and the coverage points used as a metric of verification progress. The last section presents the conclusions and future work.

2. DEVICE UNDER VERIFICATION (DUV)

We have chosen as DUV the Verilog module “AC97 controller” described in a previous work [5], which implements a subset of the AC97 protocol [6] to gather microphone samples and load them into an asynchronous FIFO for later transmission over an Ethernet interface.

The audio samples come serially into the DUV through the signal AC97_SDATA_IN in the form of 256 bit frames. The start of each frame is signaled by AC97_SYNC. The AC97 controller takes the 20 bits of time slot 3, converts them into bytes and sends them out through the DATA_OUT signal. In case of a FIFO full event, the AC97 controller flushes the FIFO by asserting the RESTART signal during 6 AC97 bit clocks.

3. FUNCTIONAL VERIFICATION OF DUV

The subset of the AC97 protocol implemented by the DUV includes the reception of AC97 frames considering time slot 0 (control information) and time slot 3 (microphone samples). All other information in the frame is discarded. On the other side, the DUV fills a FIFO with microphone samples, so the DUV should monitor the state of that FIFO before writing to it.

The functional verification test cases selected are the following: “Valid Frame”, “Valid Time Slot”, “FIFO full” and the combination of them. These cases are the golden reference against which we have defined what counts as success or failure.

During simulation, the execution of assertions is monitored inside the DUV and on its external interfaces. Whenever an assertion is violated, it is reported. Internal observability for bug detection, isolation and notification is thereby improved. Each assertion specifies an illegal behavior of a circuit structure inside the design.
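The frame handling described in section 2 can be illustrated with a minimal Python sketch. It is not the DUV's RTL. It assumes the AC97 frame layout of a 16 bit slot 0 followed by twelve 20 bit slots, and it assumes, purely for illustration, that a 20 bit sample is zero-padded to 24 bits before being split into three bytes; the source does not state how the controller packs the bytes. The helper names (`slot_bits`, `sample_to_bytes`, `make_frame`) are hypothetical.

```python
FRAME_BITS = 256   # one AC97 frame, as stated in section 2
TAG_BITS = 16      # width of slot 0 (assumed from the AC97 frame layout)
SLOT_BITS = 20     # width of slots 1..12 (assumed from the AC97 frame layout)

def slot_bits(frame, slot):
    """Return the bits of one time slot from a list of 256 values in {0, 1}."""
    if len(frame) != FRAME_BITS:
        raise ValueError("an AC97 frame has 256 bits")
    if slot == 0:
        return frame[:TAG_BITS]
    start = TAG_BITS + (slot - 1) * SLOT_BITS
    return frame[start:start + SLOT_BITS]

def bits_to_int(bits):
    """Interpret a list of bits as an unsigned integer, most significant bit first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def sample_to_bytes(sample20):
    """Split a 20 bit sample into three bytes, MSB first.
    Assumption: the sample is zero-padded on the right to 24 bits."""
    padded = sample20 << 4
    return [(padded >> 16) & 0xFF, (padded >> 8) & 0xFF, padded & 0xFF]

def make_frame(mic_sample):
    """Hypothetical helper: build a frame whose slot 3 carries mic_sample."""
    frame = [0] * FRAME_BITS
    start = TAG_BITS + 2 * SLOT_BITS   # slot 3 starts after slot 0, 1 and 2
    for i in range(SLOT_BITS):
        frame[start + i] = (mic_sample >> (SLOT_BITS - 1 - i)) & 1
    return frame

frame = make_frame(0xABCDE)
sample = bits_to_int(slot_bits(frame, 3))
print(hex(sample), [hex(b) for b in sample_to_bytes(sample)])
```

A monitor in the testbench would perform the inverse of `sample_to_bytes` to rebuild the 20 bit sample from the bytes seen on DATA_OUT.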

Fig. 1. Design Under Verification: AC97 controller.

We have also introduced properties to verify that the design does everything it is supposed to do. The collection of these additional verification properties represents the functional coverage model of the DUV. The properties covered during the simulation provide a metric of verification progress.

4. LAYERED TESTBENCH

Based on the layered testbench methodology, and having previously defined the functional verification cases (section 3) of the DUV (section 2), we have built a test environment divided into smaller components.

The design of the layered testbench followed the bottom-up practice, i.e. from the DUV to the test cases, splitting the simulation environment as represented in Fig. 2.

Hence we have defined 4 layers: signal, command, functional and scenario. The Signal layer contains the DUV described in section 2. The Scenario layer contains the test cases described in section 3. The Functional and Command layers provide the interface between the highest and lowest layers.

The Functional layer translates each test case provided by the Scenario layer into a transaction to be checked. A transaction in this layer is an AC97 frame. The Command layer takes the transaction and drives the proper signals to the DUV (Signal layer).

4.1. Signal Layer

The Signal layer contains the DUV, i.e. the “AC97 controller” module, and the signals that connect it to the simulated AC97 audio codec and asynchronous FIFO modules.

The RTL code has the PSL assertions and coverage points embedded in it. Both PSL assertions and coverage points are depicted as small circles in the DUV module of Fig. 2, reporting to the environment the success or failure of the tested functionality during the simulation. Section 5 describes assertions and coverage points in more detail.

4.2. Command Layer

The Command layer has two components, Driver and Monitor.

4.2.1. Driver
The Driver translates the different commands received from the Agent into the proper stimulus, and notifies back the execution of each command based on the “AC97_SYNC” signal.
The Driver was divided internally into two sub-drivers, the “ac97_driver” and the “fifo_driver”. The latter sets/resets the “FIFO full” signal and the former serially injects the 256 bit audio frame into the DUV.

4.2.2. Monitor
The Monitor takes the 8 bit “DATA_OUT” signal each time the “LOAD” signal is asserted, and reports back to the Checker the obtained 20 bits of sample data.

4.3. Functional Layer

4.3.1. Agent
The Agent translates the transaction received from the Scenario layer into the right commands for the Driver. The possible transactions are the combination of “Valid Frame”, “Valid Time Slot”, “FIFO full” and a set of generated microphone samples.
The end of the transaction is parameterized at the Agent in order to send a desired number of microphone samples through the DUV.

4.3.2. Scoreboard
Based on the same command sent to the Driver, the Scoreboard generates the golden reference of the expected DUV response. In our case, the Scoreboard notifies the Checker when the Monitor response should be checked against the output generated by the sub-agent “mic sample_gen”.

4.3.3. Checker
Based on the current testcase scenario, the Checker compares the monitored response of the DUV with the expected result generated by the Scoreboard, e.g. in the scenario of “Valid Frame” and “Valid Time Slot”, the 20 bit microphone sample generated by the Agent is compared against the 20 bits sampled by the Monitor.

4.4. Scenario Layer

4.4.1. Generator
We have implemented the Generator as a state machine, which sends in sequence the proper transactions to the Agent to simulate each testcase scenario defined in section 3.
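The flow of a transaction through the four layers can be sketched in a few lines of Python. This is only an illustration of the structure, not the authors' Verilog environment: the DUV is replaced by a behavioral stand-in (`duv_model`), the Driver and Monitor are collapsed into one function, and the rule that a sample is loaded only for a valid frame, a valid time slot and a non-full FIFO is an assumption drawn from the test case names. All function names are hypothetical.

```python
import random

def generator():
    """Scenario layer: one transaction description per directed test case."""
    yield {"valid_frame": True, "valid_slot": True, "fifo_full": False}
    yield {"valid_frame": True, "valid_slot": False, "fifo_full": False}
    yield {"valid_frame": False, "valid_slot": True, "fifo_full": False}
    yield {"valid_frame": True, "valid_slot": True, "fifo_full": True}

def agent(transaction, rng):
    """Functional layer: turn a transaction into a command carrying a 20 bit sample."""
    command = dict(transaction)
    command["sample"] = rng.getrandbits(20)
    return command

def scoreboard(command):
    """Functional layer: expected response (None means nothing should be loaded)."""
    if command["valid_frame"] and command["valid_slot"] and not command["fifo_full"]:
        return command["sample"]
    return None

def duv_model(command):
    """Signal layer stand-in for the RTL (assumed behavior, not the real DUV)."""
    if command["valid_frame"] and command["valid_slot"] and not command["fifo_full"]:
        return command["sample"]
    return None

def driver_and_monitor(command, duv):
    """Command layer: drive the command into the DUV and capture its response."""
    return duv(command)

def checker(observed, expected):
    """Functional layer: compare the monitored response with the golden reference."""
    return observed == expected

def run(duv=duv_model, seed=0):
    """Run every scenario once and return one pass/fail flag per test case."""
    rng = random.Random(seed)
    results = []
    for transaction in generator():
        command = agent(transaction, rng)
        expected = scoreboard(command)
        observed = driver_and_monitor(command, duv)
        results.append(checker(observed, expected))
    return results

print(run())
```

Because each layer only talks to its neighbours, a new scenario is added by extending `generator` alone, which is the reuse benefit the layered approach aims for.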

Fig. 2. Internal modules of layered testbench

5. ASSERTIONS AND COVERAGE

The PSL specification defines four layers: the Boolean layer, which has HDL boolean expressions; the Temporal layer, which is the core of PSL and provides temporal relationships between boolean expressions; the Verification layer, which directs the use of properties for coverage or assertions; and the Model layer, which has statements to model the environment.

We have written PSL embedded in Verilog comments as a method to introduce assertions in the simulation environment.

We have introduced assertions at the AC97 interface based on the AC97 protocol specification; hence we have added properties to verify the duration of a frame (Example 1) and the duration of the “AC97_SYNC” signal in case of “Valid Frame” and “Valid Time Slot”.

Example 1:
// psl property frame_len = always {rose(ac97_sync)} |-> {1'b1[*256];rose(ac97_sync)};
// psl assert frame_len;

At the FIFO interface we have added properties to verify that no new data is loaded if the FIFO is full (Example 2), and to ensure the right duration of the restart signal.

Example 2:
// psl property fifo_full_load = always {full==1'b1}|=>{!rose(load)};
// psl assert fifo_full_load;

We have introduced coverage points to verify that the design does everything it is supposed to do. Based on the DUV's specification and the list of directed testcases, we have created a set of properties that report the functionality tested.

The AC97 controller module is based on the FSMD methodology, i.e. it consists of a data path controlled by an FSM. So, we have added coverage to each of the possible states (Example 3 shows some covered states) to ensure that the simulation covers all of them. Also, we have added assertions to verify that the control signals are valid at the right moment.

Example 3:
// psl sequence fsm_idle = {state_reg==Idle_state};
// psl cover fsm_idle;
// psl sequence fsm_sync = {state_reg==Sync_state};
// psl cover fsm_sync;
// psl sequence fsm_valid_frame = {state_reg==ValidFrame_state};
// psl cover fsm_valid_frame;
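To make the meaning of the properties in Examples 1 and 2 concrete, the following Python sketch checks the same conditions on recorded signal traces (lists of 0/1 values, one entry per clock cycle). It is an informal reading of the properties, not a PSL implementation. Two simplifications are assumptions of this sketch: cycle 0 is never treated as a rising edge, and an obligation that runs past the end of the trace is not reported as a failure. The helper `make_sync` is hypothetical.

```python
def rose(sig, t):
    """True when sig changes from 0 to 1 at cycle t (cycle 0 is never an edge here)."""
    return t > 0 and sig[t] == 1 and sig[t - 1] == 0

def check_frame_len(sync, length=256):
    """Example 1: every rising edge of sync must be followed by another
    rising edge exactly 'length' cycles later. Returns the failing start cycles."""
    failures = []
    for t in range(len(sync)):
        if rose(sync, t) and t + length < len(sync) and not rose(sync, t + length):
            failures.append(t)
    return failures

def check_fifo_full_load(full, load):
    """Example 2: in the cycle after full is 1, load must not rise.
    Returns the cycles where load rose illegally."""
    failures = []
    for t in range(len(full) - 1):
        if full[t] == 1 and rose(load, t + 1):
            failures.append(t + 1)
    return failures

def make_sync(starts, n, width=16):
    """Hypothetical helper: a sync trace of n cycles, high for 'width' cycles at each start."""
    sync = [0] * n
    for s in starts:
        for i in range(s, min(s + width, n)):
            sync[i] = 1
    return sync

print(check_frame_len(make_sync([1, 257, 513], 600)))   # well-formed frames
print(check_frame_len(make_sync([1, 250], 300)))        # second frame starts early
print(check_fifo_full_load([0, 1, 1, 0, 0], [0, 0, 1, 0, 1]))
```

The coverage points of Example 3 correspond, in the same style, to checking that each state value appears at least once in a recorded `state_reg` trace.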

6. CONCLUSION

This paper is a step towards adopting innovative tools and methodologies applied to testing and verification.

We have found that the time spent implementing the layered testbench environment is of the same order as the time spent on each testcase with the monolithic approach. Hence, the automation of directed testcases is reflected as a productivity increase in the verification process.

Assertions and coverage properties offer a higher level of abstraction because they are closer to the specification than traditional testbenches. They not only increase productivity, but also improve the robustness of verification.

Having implemented our own testbench framework entirely in Verilog, as a next step we are going further in the adoption of innovative tools and methodologies, such as SystemVerilog and constrained-random testcases, with the intention of adopting the OVM/UVM framework in the future.

7. REFERENCES

[1] C. Spear, “SystemVerilog for Verification: A Guide to Learning the Testbench Language Features”, 2nd edition, Springer, 2008.
[2] H. D. Foster, A. C. Krolnik, and D. J. Lacey, “Assertion-Based Design”, 2nd edition, Springer, 2004. (ISBN: 1402080271).
[3] Property Specification Language (PSL), Accellera, www.eda.org/vfv
[4] Open Verification Methodology, http://www.ovmworld.org/
[5] J. Mosquera, A. Stoliar, S. Pedre, M. Sacco, and P. Borensztejn, “Audio sobre Ethernet: Implementación utilizando FPGA”, in Designer Forum 2010, Proceedings of SPL Southern Programmable Logic Conference 2010, Rima Editora, ISBN: 978-85-7656-171-2, pp. 13-18.
[6] Audio Codec ‘97, Revision 2.3 Revision 1.0, Intel, April 2002.

DEVELOPMENT AND IMPLEMENTATION OF AN ADAPTIVE NARROWBAND
ACTIVE NOISE CONTROLLER

Fernando A. González, Roberto R. Rossi, Germán R. Molina, Gustavo F. Parlanti

Digital Signal Processing Laboratory
Universidad Nacional de Córdoba
Córdoba, Argentina
email: [email protected], [email protected], [email protected], [email protected]

ABSTRACT

This paper presents the development and implementation of an adaptive feedback Active Noise Control (ANC) system based on a commercial Digital Signal Processor (DSP). The system aims to cancel the low frequency narrowband noise remaining inside a headset shell. This low frequency noise is particularly difficult to cancel by passive acoustic means. Using active techniques such as the one presented here, appropriate levels of noise attenuation are achieved in an efficient way suitable for commercial use.

1. INTRODUCTION

In industrial environments, in cabins near engines (as in a car or an airplane) or in noisy environments in general, the headsets used for passive noise cancellation are useful only for frequencies over 500 Hz. As a complementary solution to passive headsets, Active Noise Control (ANC) [1], [2] systems aim to cancel the remaining low frequency noise components. The main objective of ANC is to produce, inside the headset, a signal of equal amplitude but opposite phase to the remaining noise. This signal is often called “anti-noise”.

The noise inside the headset may vary over time because the external noise has changed, or because the headset has moved, changing its acoustic transfer function. The ANC system must be able to adapt to these changes, modifying the produced anti-noise signal accordingly, thus “learning” from its errors [3]. In environments with engines, turbines, air-conditioning, sirens, etc., the noise to be cancelled will be mainly narrowband. This means that the noise spectrum will be concentrated around well defined frequencies, and the noise signal will be periodic in time. In this case the so called “feedback” ANC implementations can be used without causality constraints. The ANC adaptive filters are often made of a Finite Impulse Response (FIR) filter with varying coefficients. With the availability of fast, high processing power Digital Signal Processors (DSP), both the coefficient update and the ANC input signal processing can be done in real time.

In feedback ANC for headsets, the FIR filter cannot drive the anti-noise acoustic signal directly, but has to act through physical elements such as a Digital to Analog Converter (DAC) and a loudspeaker inside the headset. Similarly, the acoustic error has to be captured by microphones, converted to the electrical domain and then digitized by an Analog to Digital Converter (ADC) to become the digital error signal. Besides these elements, amplifiers and attenuators are often needed. All these additional elements are usually represented by a single transfer function called the “secondary path”, S(z), in series with the adaptive FIR, W(z). To compensate for the effect of S(z), the input to the adaptive algorithm has to be filtered in the same way as the controller's output. The modified algorithm is then called the Filtered-x Least Mean Squares (FxLMS) algorithm [4]. As S(z) is unknown and may vary over time, it is often identified by a second adaptive system, and a copy of its transfer function, $\hat{S}(z)$, is introduced to filter the input to the adaptive algorithm and to produce a replica of the noise to cancel from the controller's output. $\hat{S}(z)$ can be obtained prior to the W(z) adaptation process (“off-line” estimation) or at the same time as it (“on-line” estimation). This paper presents the implementation details of an ANC system built with a commercial DSP, which produces active cancellation of the remaining narrowband low frequency noise inside a commercial headset, using off-line secondary path estimation. Fig. 1 shows the general architecture of the system. The main equations governing the adaptation process [5], including a normalized version of FxLMS, are summarized as follows:

$$\mu(n) = \frac{\alpha}{L\,P_{x'}(n)} \qquad (1)$$

$$w_k(n+1) = w_k(n) + \mu(n)\,e(n)\,x'(n-k) \qquad (2)$$

where n is the iteration number, µ(n) is the variable step size or adaptation speed, L is the number of coefficients in the FIR filter, α is a constant, $P_{x'}(n)$ is the power of x'(n), $w_k(n)$ are the FIR coefficients and k = 0, 1, 2, …, L-1 is the coefficient index.
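As an illustration of equations (1) and (2), the following minimal floating-point Python sketch performs one coefficient update. It is not the DSP code described later in the paper, which uses fixed-point arithmetic; the function name `fxlms_step` and its argument layout are hypothetical.

```python
def fxlms_step(w, x_filt, e, power, alpha):
    """One normalized FxLMS coefficient update.

    w      : list of L coefficients w_k(n)
    x_filt : list [x'(n), x'(n-1), ..., x'(n-L+1)] of filtered reference samples
    e      : error sample e(n)
    power  : estimate of the power of x'(n), must be positive
    alpha  : constant of equation (1)
    """
    if len(w) != len(x_filt):
        raise ValueError("w and x_filt must have the same length L")
    L = len(w)
    mu = alpha / (L * power)                                  # equation (1)
    return [wk + mu * e * xk for wk, xk in zip(w, x_filt)]    # equation (2)

# Example with L = 2 coefficients and arbitrary illustrative values.
w_next = fxlms_step([0.0, 0.0], [1.0, 0.5], e=0.1, power=0.5, alpha=0.2)
print(w_next)
```

Dividing by L times the signal power makes the effective step size independent of the level of x'(n), which is the motivation for the normalized variant.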

Fig. 1. Adaptive feedback ANC system with FxLMS.

2. SYSTEM IMPLEMENTATION

The ANC applied to a headset was implemented using a high performance StarCore MSC7116 DSP from Freescale Semiconductor Inc. The StarCore MSC7116 is a low cost, 16 bit word-length, fixed point DSP with four Arithmetic and Logic Units (ALU). It can deliver 1000 MMACS at 266 MHz. Due to its high processing power, complex calculations such as those required by the adaptive filters of Fig. 1 can be completed for both audio channels within one sampling period (“single-sample real time processing”).

The DSP program runs over SmartDSP, the Real Time Operating System (RTOS) designed for the StarCore family. The SmartDSP Application Program Interface (API) [6], made up of functions developed in the C language, allows easy configuration and use of the DSP peripherals. The API has a driver for every peripheral type, allowing the application program to communicate with it. The Time Division Multiplexing (TDM) peripheral driver was used for input and output of both DSP audio channels. Data come in and out at the sampling rate, which was selected to be 8 kHz. This sampling rate was considered sufficient for this application, which aims to produce ANC at frequencies below 500 Hz.

The application program was fully developed in the C language, using the so called “intrinsic” functions to optimize the adaptive filter routines. These functions are also written in C and belong to the compiler's domain [7] rather than to the RTOS's API. They are designed to perform fractional operations and take advantage of the DSP parallel processing capabilities. The intrinsic functions are inserted directly within the C language code, allowing the programmer to closely match the efficiency of the DSP assembler language.

The data precision was defined to be 16 bits word length for most of the data, using the fixed point format. Most data multiplications were then 16 by 16 bits, which is optimized on the DSP architecture. The only exception was in the filter adaptation process, whose precision was improved by using 32 bits for the result of µ(n) times e(n), and then performing a 32 by 16 bits multiplication for the W(z) coefficient update in (2). This prevented the adaptation process from stalling due to lack of precision, resulting in a performance improvement. The block diagram of one ANC audio channel is shown in Fig. 2.

Fig. 2. Block diagram of the experimental model.

2.1. DSP evaluation board

The MSC711xEVMT kit [8] is an evaluation board for applications using the StarCore MSC711x DSP. It was used to program and evaluate the DSP from a PC. The board also integrates the stereo 16 bit CODEC AK4554 from AKM Semiconductor Inc., which was used to handle the analog input and output signals of the electro-acoustic transducers. Besides the ADC and DAC for both channels, the CODEC provides the required anti-aliasing and reconstruction low-pass filters. The TDM peripheral inside the DSP communicates with the CODEC to produce the input and output of both data channels.

2.2. The acoustic system

The headset used was the circumaural stereo SHP1900 from Philips. On circumaural headsets, the user's ears are covered by the ear-cups, leaving small acoustic cavities near each ear. The electret type omnidirectional microphone ECM-30 was used as error sensor microphone. It has sensitivity, bandwidth, signal to noise ratio and physical size appropriate for ANC applications.

The error microphone position within the headset determines the “quiet zone”. The selected position follows the ideal placement suggested by the authors of [9] and [10]. This place is the nearest possible to the user's ear canal, and produces the flattest possible frequency response of the secondary path.

2.3. The amplifier's board

The amplifier board hosts two preamplifiers for the microphones and two power amplifiers for the loudspeakers. The preamplifiers are transistor based and were designed to match the low level microphone signals to the A/D input level requirements. The power amplifiers are based on LM386 chips and deliver the power required by the loudspeakers.

3. RESULTS

The resulting data were exported from the DSP to MATLAB, where different metrics can be easily analyzed and plotted.

3.1. Secondary path S(z) identification

To identify S(z) off-line, $\hat{S}(z)$ was built as an adaptive FIR structure with length Ls = 128 coefficients. This length was enough to reproduce most of the $\hat{S}(z)$ impulse response. The $\hat{S}(z)$ coefficients were updated using the original LMS algorithm. White noise with zero mean and 0.03 variance was generated inside the DSP and used as the learning signal. An LMS adaptation step µ = 0.01 was used.

Fig. 3 shows the resulting impulse response after $\hat{S}(z)$ convergence. It shows a delay of approximately 40 coefficients, or 5 ms at a sampling rate of 8 kHz. The CODEC's datasheet reports a fixed delay of 36 coefficients, which explains almost all of the $\hat{S}(z)$ delay. If needed, a smaller time delay can be achieved simply by raising the sampling rate.

Fig. 3. $\hat{S}(z)$ impulse response.

The first 2 seconds (16000 samples) of the $\hat{S}(z)$ learning process are shown in Fig. 4. There, the white noise learning signal filtered by S(z) is shown in blue, the white noise filtered by $\hat{S}(z)$ is shown in green and the identification error is shown in red.

Fig. 4. Time evolution of S(z) identification.

3.2. The controller's performance

Two different testing noises were generated in MATLAB to evaluate the performance of the controller W(z). The testing noises were then reproduced by a portable PC loudspeaker. A user wearing the headset was seated in front of the PC loudspeaker to get uniform sound on both ears.

The $\hat{S}(z)$ previously obtained off-line was used in the tests. The normalized version of the FxLMS algorithm was used to update the W(z) coefficients. An adaptive FIR structure with length Lw = 160 coefficients was used to build W(z). In (1), α = 0.016 was used. In order to efficiently calculate an approximation of $P_{x'}$, we used an exponential window of the form [5]:

$$\hat{P}_{x'}(n) = (1-\beta)\,\hat{P}_{x'}(n-1) + \beta\,x'^2(n) \qquad (3)$$

In (3), only one value, $\hat{P}_{x'}(n-1)$, has to be kept between iterations. The factor β determines how fast changes in x'(n) are reflected in $\hat{P}_{x'}(n)$. A value of β = 0.002 was used as a compromise between closely following changes in x'(n) and the stability of $\hat{P}_{x'}(n)$. As values of x'(n) near zero would produce a very large µ(n) in (1), a lower limit of 0.01 was enforced in software on the estimate $\hat{P}_{x'}(n)$ in (3).

The first testing noise generated with MATLAB was a single tone of 200 Hz. A large amplitude was selected in order to produce distortion at the output of the portable PC loudspeakers, thus generating harmonic components. Fig. 5 shows in blue the generated noise spectrum, which is represented by signal d(n) in Fig. 1, and in green the attenuated noise, signal e(n) in Fig. 1. This result was obtained 3 seconds (24000 samples) after starting the W(z) learning process. In Fig. 5 we can see the 200 Hz tone with its harmonics and a single 50 Hz tone, also present in the circuit. The attenuation achieved was 33.5 dB at 50 Hz, 51 dB at 200 Hz, 44 dB at 400 Hz and 35.5 dB at 600 Hz.

The second testing noise was a typical engine sound [11], also generated by MATLAB, and reproduced without distortion by a portable PC loudspeaker.
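The running power estimate of equation (3), together with the lower limit described above, can be sketched in Python as follows. This is a floating-point illustration rather than the fixed-point DSP routine; the function name `update_power` is hypothetical, and applying the limit directly to the stored estimate is an assumption of this sketch.

```python
def update_power(prev_power, x_filt_n, beta=0.002, floor=0.01):
    """Exponential-window power estimate, equation (3), with a lower limit.

    prev_power : previous estimate of the power of x'
    x_filt_n   : current filtered reference sample x'(n)
    beta       : forgetting factor (0.002 in the paper)
    floor      : lower limit of the estimate (0.01 in the paper)
    """
    estimate = (1.0 - beta) * prev_power + beta * x_filt_n * x_filt_n
    return max(estimate, floor)   # assumed: the limit is applied to the stored estimate

# With a silent input the estimate stays at the floor instead of decaying to zero.
p = 0.01
for _ in range(1000):
    p = update_power(p, 0.0)
print(p)

# With the paper's alpha = 0.016 and Lw = 160, the floor bounds the step of equation (1).
mu_max = 0.016 / (160 * 0.01)
print(mu_max)
```

With these values the step size of equation (1) cannot exceed 0.016 / (160 · 0.01) = 0.01, regardless of how quiet the filtered reference becomes.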

This noise is a combination of 12 components at multiples of 60 Hz, with different amplitudes. Most of the power is concentrated around the 240, 300, 360, 420 and 480 Hz components. Data was also collected 3 seconds (24000 samples) after the start of the controller W(z) learning process. Fig. 6 shows the noise power spectrum and the error signal. The noise attenuation for the main engine noise components was 31.3 dB at 240 Hz, 46.7 dB at 300 Hz, 35.2 dB at 360 Hz, 38.7 dB at 420 Hz and 25.6 dB at 480 Hz.

Fig. 5. Noise power spectrum of a distorted 200 Hz tone (blue line) and error signal (green line).

Fig. 6. Engine noise power spectrum (blue line) and error signal (green line).

4. CONCLUSION AND FUTURE DIRECTIONS

From the analysis and the results presented in this paper, the following conclusions can be summarized:

The designed and implemented ANC system attenuates periodic (narrowband) noises in the frequencies of interest to an acceptable level.

The precision and computational power of the DSP used are enough to process both independent audio channels simultaneously in real time.

For the different signals tested, the user reports comfortable levels of remaining noise inside the headset shell cavity, significantly lower than those without the ANC.

Future work will focus on improving the feedback ANC performance, and on broadband noise ANC within a headset. Different learning algorithms will also be investigated, analyzed and implemented under real conditions with commercially available components.

5. ACKNOWLEDGMENTS

The authors wish to acknowledge the Science and Technology Office of the National University of Córdoba (Secretaría de Ciencia y Tecnología de la Universidad Nacional de Córdoba) and Freescale Semiconductor, Inc.

6. REFERENCES

[1] S. M. Kuo and D. R. Morgan, “Active noise control: a tutorial review”, Proceedings of the IEEE, vol. 87, no. 6, pp. 943-973, Jun. 1999.
[2] S. J. Elliott and P. A. Nelson, “Active Noise Control”, IEEE Signal Processing Magazine, vol. 10, no. 4, pp. 12-35, Oct. 1993.
[3] B. Widrow et al., “Adaptive Noise Cancelling”, Proceedings of the IEEE, vol. 63, no. 12, pp. 1692-1716, Dec. 1975.
[4] B. Widrow, D. Shur and S. Shaffer, “On adaptive inverse control”, in Proc. 15th Asilomar Conf., 1981, pp. 185-189.
[5] S. M. Kuo and D. R. Morgan, Active Noise Control Systems: Algorithms and DSP Implementations, New York: Wiley, 1996.
[6] SmartDSP OS Reference Manual, Rev. 1.42, Metrowerks, Austin, TX, Sept. 2005.
[7] CodeWarrior Development Studio for StarCore DSP Architectures: C Compiler User Guide, Freescale, Austin, TX, Aug. 2009.
[8] MSC711XEVM User’s Guide, Rev. 0, Freescale, Austin, TX, Apr. 2005.
[9] W. S. Gan and S. M. Kuo, “Adaptive Feedback Active Noise Control Headset: Implementation, Evaluation and Its Extensions”, IEEE Transactions on Consumer Electronics, vol. 51, no. 3, pp. 975-982, Aug. 2005.
[10] S. M. Kuo and W. S. Gan, “Active Noise Control System for Headphone Applications”, IEEE Transactions on Control Systems Technology, vol. 14, no. 2, pp. 331-335, Mar. 2006.
[11] MathWorks. Filter design toolbox. Active Noise Control Using a Filtered-X LMS FIR Adaptive Filter. [Online]. Available: http://www.mathworks.com/products/filterdesign/demos.html?file=/products/demos/shipping/filterdesign/adaptfxlmsdemo.html

BIO-INSPIRED HARDWARE SYSTEM BASED IN ANIMALS OF COLD AND HOT BLOOD

Pablo A. Salvadeo
Laboratorio de Computación Reconfigurable
FRM – UTN
Rodriguez 273, CP M5502AJE, Mendoza, Arg.
[email protected]

Rafael Castro López
Instituto de Microelectrónica de Sevilla
CNM – CSIC
Avda. Américo Vespucio S/N, CP 41092, Sevilla, Esp.
[email protected]

Ángel C. Veca
Instituto de Automática
FI – UNSJ
Av. San Martin Oeste 1112, CP J5400ARL, San Juan, Arg.
[email protected]

Elvo H. Morales
INDEA
FRM – UTN
Rodriguez 273, CP M5502AJE, Mendoza, Arg.
[email protected]

ABSTRACT

This paper discusses a way to create a bio-inspired, temperature-sensitive hardware system using a hardware description language and digital reconfigurable devices. In addition, the system is self-contained in the device employed. An FPGA (Field Programmable Gate Array) is used for the implementation, together with VHDL (VHSIC Hardware Description Language) for the description. Two systems are also described whose biological inspiration comes from the animals near the two extremes of the spectrum of their kingdom, the cold and the hot.

1. INTRODUCTION

Animals are biological systems self-contained in their own bodies. Through its body, an animal is able to detect changes in the environment and take them into account to modify its behavior. A bio-inspired electronic system should likewise be self-contained: its sensors should be intrinsic to the system, which should not rely on elements external to it.

The objective of this work is to obtain a system capable of perceiving a physical magnitude and of modifying its behavior as a function of it, much as happens in Nature. In this study, temperature is chosen as the magnitude to follow, and the problem consists of doing this job from within the hardware device. To solve this problem, a digital circuit whose dynamics depend on the temperature will be used.

Digital design techniques attempt to abstract the designer from the analog behavior of the devices, behavior that would be useful when interacting with physical magnitudes. This is very much the approach followed when a hardware description language is used.

Fig. 1 shows the proposed system, which is composed of two main parts: the sensor circuit and the block sensitive to it, henceforth named the sub-system (SS). Both parts are designed with the same techniques, which are applicable within a temperature range where the abstractions they are based on remain valid. Nevertheless, some digital circuits exist whose nature makes them depend on physical magnitudes even though the mentioned rules are applied in their realization. One of these circuits is chosen to perform the input sensor function of the system.

[Figure: FPGA containing the Sensor(TE) block and the Sub-System block]
Fig. 1. The self-contained system.

Such a transducer should have the temperature as its input variable and, as its output, a digital signal whose parameters are proportional to the input value. If the output signal is periodic, it will be characterized by two

parameters: phase and frequency. As the signal is digital, its amplitude can only take two values, 0 and 1, and therefore carries no useful information. In this case the frequency of the signal is used, so that it varies with the chosen physical magnitude. A circuit that satisfies these specifications is the so-called Ring Oscillator (RO).

2. RING OSCILLATOR

The RO, in its simplest form, is a combinational digital circuit consisting of a NOT gate with a feedback loop closed between its output and its input. After power-up it begins to oscillate, delivering a signal whose frequency depends on the delay time of the gate, and this delay varies as a function of the temperature.

If this circuit is described and implemented on a reconfigurable hardware device, the behavior obtained, in our experience, is not the expected one. Therefore, another description of the RO is used, based on a two-input NAND gate. One input is used as the “enable” line and the other is connected to the output. This version has the additional advantage that it can be controlled by the sub-system via “enable”.

To describe this circuit in VHDL it is necessary to define an entity with an input port and an output port. The output port must be of the buffer type, because it is both read and written, and the architecture is as simple as: “output <= nenable nand output;”.

If the FPGA architecture in [1] is taken as the design reference, where each island is a LAB (Logic Array Block) formed by a set of ALMs (Adaptive Logic Modules), the implementation of an RO occupies only one of these basic blocks. The feedback is built using one of the LCs (Local Connections) of the LAB, as shown in Fig. 2.

Fig. 2. Implementation of the RO. In blue, the LCs used for the feedback. In green, the ALM used by the logic.

All outputs used to implement the RO are combinational, so the frequency of oscillation will be higher than the maximum specified by the manufacturer of the FPGA. That specification states the maximum frequency allowed for a clock signal. Its calculation takes into account the transit time through the sequential output, including the delay introduced by the flip-flop of that block. Hence, the output can be used directly if the sub-system is combinational. If it is sequential, however, the frequency should be reduced to the value recommended in the device datasheet. One or more counters can be used to divide the signal and thus obtain the desired value. Furthermore, if the transducer is built as a chain of an odd number of gates, a lower frequency is obtained because of the delay added by the additional gates. Finally, as shown in [2], the relationship between frequency and temperature is approximately linear, close to a straight line with negative slope within the operating range specified by the manufacturer. In [3] it is observed that when the number of gates in the ring is increased, the rate of change of the output diminishes along with the decrease in frequency.

3. BEHAVIOR

Following this proposal, a bio-inspired hardware system is built that is sensitive to temperature and self-contained in a device, its body. This section describes the observable behavior that appears in the animal kingdom in response to temperature changes, and the way to emulate that behavior with the proposed system.

In Nature, animals tend to keep the temperature of their bodies constant; in the same way, a reconfigurable device can operate within a closed temperature interval that depends on its fabrication technology. It therefore seems reasonable to explore the mechanisms animals use for thermal adaptation [4].

In the animal kingdom, the behaviors related to temperature are spread over a wide spectrum whose extremes are called cold blood and hot blood. The first refers to the lack of internal mechanisms to stabilize body temperature. The second refers to the capability of keeping it constant by using such mechanisms. Thus, animals closer to one extreme than to the other behave differently when the environment changes.

Near the cold blood side, a significant part of the time is spent looking for different places in the habitat in which to stay at particular hours of the day, and thus keep body temperature regulated. Near the other extreme, time is spent on such activities only occasionally. Furthermore, the first ones do not use their metabolism to cool down or warm up, while the second ones do. Hence, for animals with identical body weight but at opposite extremes of the spectrum, the cold-blooded ones need less energy than the warm-blooded ones, due to the lower energy consumption of

the first ones.

Close to the cold extreme, the metabolism is composed of various reactions that are activated at different temperature thresholds. Near the hot side, on the other hand, only one reaction, or a few, is needed. Thus, hot-blooded animals stabilize their temperature to optimize a simple metabolism, while cold-blooded animals possess a complex metabolism composed of several reactions that are optimal at different temperatures. Metabolic complexity is therefore traded for energy consumption, and this trade-off confers advantages and disadvantages on different animals in specific situations.

Based on these observations, two plausible systems are considered: one near the cold extreme and the other close to the hot extreme.

3.1. Cold Blood System

If the system has no internal mechanism that stabilizes its body at the optimum working temperature, the device is composed of several sub-systems, each of which exhibits maximum performance over a limited portion of the operating range. This can be implemented in two different ways, depending on the portion of the area occupied by each sub-system. If each one occupies a small portion of the total, they can all coexist (see Fig. 3.1a). In this case, depending on the frequency of the transducer, one sub-system is activated while the rest remain inactive. However, if the SS occupies a large part of the chip, the device can be reconfigured according to changes in the period of the transducer signal (see Fig. 3.1b). To do so, the configurations of the sub-systems for each temperature range must be stored and loaded according to the current temperature.

[Figure, panel (a): FPGA containing Sensor(TE), Sub-System(T0 – T1), Sub-System(T1 – T2) and Sub-System(T2 – T3). Panel (b): FPGA containing Sensor(TE = T01) and Sub-System(T0 – T1), with the stored Configurations(T01, T12, T23).]
Fig. 3.1. The cold blood system: (a) small and (b) large SS.

3.2. Hot Blood System

If, on the contrary, the system has mechanisms to heat up or cool down, it consists of three parts: the transducer, the SS, and a circuit that varies the temperature of the device. Its diagram is shown in Fig. 3.2. The behavior is as follows: changes in the frequency of the transducer's output signal are due to temperature changes. Such changes act on the thermal control circuit, modifying its set-point so as to cause an effect contrary to the one initiated by the environment. In this way the working conditions of the sub-system are kept constant, ensuring maximum performance.

[Figure: FPGA containing Sensor(TE), Sub-System and Thermic Controller(TE)]
Fig. 3.2. The hot blood system.

The thermal circuit must be able to change the system temperature, offsetting the changes in the environment. To do this, energy consumption is used to vary the system temperature, in the same way as hot-blooded animals do. One option is a controller composed of an oscillator whose frequency increases in proportion to the decrease in temperature. In this scheme, the increase in the number of switching events per unit of time raises the rate at which electric energy is transformed into heat, increasing consumption. If heat generation is greater than its dissipation, the temperature will rise. In this case, as in Nature, it is simpler to hold the device's temperature above that of the environment. When the ambient temperature rises, the circuit generates less heat by consuming less energy. If the temperature falls, the controller maintains the performance of the sub-system at the expense of higher consumption.

Finally, two improvements are proposed to obtain more uniform heating. The first is to distribute the signal of the thermal controller through the channels around the islands used by the SS. The second is to place slave switches in some islands that follow the thermal rhythm set by the controller.

4. CONCLUSION

A technique was shown that obtains a bio-inspired, temperature-sensitive hardware system without using elements external to the device. For this system, a Ring Oscillator, a combinational circuit that has proven suitable as a temperature sensor

in other applications, was used. A way to steer the biological inspiration toward emulating the behavior of the animals near the extremes of the spectrum of their kingdom, the cold and the hot, was also shown. For this reason, it is believed that this work has presented concrete options that can be useful when implementing bio-inspired hardware systems.

5. ACKNOWLEDGMENTS

This work was carried out during 2010 with the partial support of BINID – UTN. Thanks to Ángel C. Veca for the invitation to participate in research and development, and to Eduardo Zavalla, INAUT – FI – UNSJ, for collaborating in the grammatical revision of this paper.

6. REFERENCES

[1] Altera Corp., Stratix II Device Handbook, vol. 1, sec. 2, pp. 1–106, May 2007.
[2] S. Lopez-Buedo, J. Garrido, and E. I. Boemo, “Dynamically Inserting, Operating, and Eliminating Thermal Sensors of FPGA-Based Systems”, IEEE Trans. Components and Packaging Technologies, vol. 25, no. 4, pp. 561–566, Dec. 2002.
[3] S. K. Yoo, D. Karakoyunlu, B. Birand, and B. Sunar, “Improving the Robustness of Ring Oscillator TRNGs”, ACM Trans. Reconfigurable Technology and Systems, vol. 3, no. 2, art. 9, pp. 1–30, May 2010.
[4] M. S. Blumberg, Body Heat: Temperature and Life on Earth, Cambridge, MA: Harvard University Press, pp. 1–69, 2002.
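
As an illustration of the behavior described in Sections 2 and 3 of this paper, the following Python sketch models the Ring Oscillator as a straight line with negative slope, selects a sub-system by temperature band (cold blood scheme) and computes a heater oscillator frequency proportional to the temperature drop (hot blood scheme). It is only a behavioral sketch: the reference frequency, the slope, the band limits, the set-point, the gain and every function name are assumptions made for the example and do not come from the paper or from any device datasheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingOscillatorModel:
    """Linear frequency/temperature model of the RO (all values hypothetical)."""
    f_ref_hz: float = 300e6         # assumed frequency at the reference temperature
    t_ref_c: float = 25.0           # assumed reference temperature in Celsius
    slope_hz_per_c: float = -0.3e6  # assumed negative slope

    def frequency(self, temp_c: float) -> float:
        """Frequency the RO would deliver at temp_c."""
        return self.f_ref_hz + self.slope_hz_per_c * (temp_c - self.t_ref_c)

    def temperature(self, freq_hz: float) -> float:
        """Invert the line: temperature estimated from a measured frequency."""
        return self.t_ref_c + (freq_hz - self.f_ref_hz) / self.slope_hz_per_c


def select_subsystem(temp_c, thresholds):
    """Cold blood scheme: index of the band [T_i, T_i+1) that contains temp_c.

    Returns None when the temperature lies outside every band.
    """
    for i in range(len(thresholds) - 1):
        if thresholds[i] <= temp_c < thresholds[i + 1]:
            return i
    return None


def heater_frequency(temp_c, set_point_c, gain_hz_per_c, f_max_hz):
    """Hot blood scheme: heater oscillator frequency proportional to the
    temperature drop below the set-point, clamped to [0, f_max_hz]."""
    drop = set_point_c - temp_c
    return min(max(gain_hz_per_c * drop, 0.0), f_max_hz)


if __name__ == "__main__":
    ro = RingOscillatorModel()
    bands = [0.0, 25.0, 50.0, 85.0]   # hypothetical T0..T3 in Celsius
    measured = ro.frequency(60.0)     # stands in for a counter measurement
    estimate = ro.temperature(measured)
    print("estimated temperature:", estimate)
    print("active sub-system index:", select_subsystem(estimate, bands))
    print("heater frequency:", heater_frequency(estimate, 70.0, 1e6, 50e6))
```

In a real device the frequency would be measured with a counter over a fixed time window, and the line coefficients would have to be calibrated for each chip and for each RO placement.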

COMPARATIVE AND QUALITATIVE ANALYSIS OF FPGA DEVELOPMENT TOOLS

Gabriel Santos da Silva / Maximiliam Luppe

Departamento de Engenharia Elétrica / Escola de Engenharia de São Carlos


Universidade de São Paulo
Av. Trabalhador São Carlense, 400 – São Carlos – SP – Brasil – 13566-590

ABSTRACT

This work presents a study of the development tools of the main FPGA manufacturers currently on the market, in order to carry out a comparative and qualitative analysis of them. The study is based on an undergraduate research project implemented on FPGA that covered synthesis, simulation and IP-core generation tools.

1. INTRODUCTION

In recent years, the growth of reconfigurable devices and of their development tools, both in diversity and in density, has favored the implementation of complex and complete systems in integrated programmable logic (SoC – System on Chip). Altera [1], Lattice [2] and Xilinx [3] are examples of companies that develop solutions in the field of digital reconfigurable systems, each with its own development tool: Quartus II, Diamond and ISE, respectively.

The advantages of working with FPGAs [4] lie in the possibility of developing soft-cores [5] that are reusable (the same soft-core can be used in several projects, with no additional cost or design time) and portable (they can be adapted to several development platforms for reconfigurable devices). For this reason, the study of hardware description languages and of the platforms available on the market is extremely important.

The hardware description language Verilog [6] was chosen for the implementation of the undergraduate research project because it is easier to learn than VHDL, since it closely resembles the widely known C language, while the platforms were chosen from the companies that currently stand out in the field.

2. LOGIC ANALYZER AND FIFO MEMORY

The undergraduate research project mentioned above comprised the development of a logic analyzer [7] for on-chip analysis of digital systems implemented on FPGA. It consists of an open-source device for on-chip analysis of digital signals, designed to have an unrestricted number of input channels, which allows it to work with more complex circuits, and not to be tied to the FPGAs of any particular manufacturer.

Fig. 1 illustrates the block diagram of a logic analyzer, showing its main functions.

Fig. 1. Block diagram of the logic analyzer

The Time Base block defines whether data acquisition and storage use a clock signal coming from the device under analysis or an external one. The Trigger Stage block starts the data capture process and offers two options: internal trigger, in which the acquired data are compared with a previously defined information word (input data), and external trigger, in which capture starts after a pulse is recognized on a specific external input. The Memory block represents a FIFO (First In, First Out) memory, responsible for storing the acquired data. The last block, Interface, responsible for how the data are presented to the user, was not addressed in this project.

3. DEVELOPMENT TOOLS

To develop the logic analyzer one must, in addition to implementing the required soft-cores, synthesize and

simulate them to guarantee that they work correctly. For this purpose an IDE (Integrated Development Environment) is used, a development tool containing the applications responsible for the required processes: design, synthesis, place-and-route and verification, as illustrated in Fig. 2.

Fig. 2. IDEs and their processes

Since this project aims at making the logic analyzer modules generic, they must be debugged and simulated in different IDEs (Quartus II, ISE Design Suite and Lattice Diamond), checking whether the responses obtained are the same in all cases. In this way the information needed for an analysis of these tools can be obtained.

3.1. Quartus II 9.0

The Quartus II 9.0 IDE belongs to Altera, the company that launched the first complex reprogrammable logic device (CPLD) in 1984 and that holds second place in the reconfigurable logic device market.

This IDE has an integrated synthesis tool, so no other tool is needed for this process, although using one is an option. The tool supports the two most widely used HDLs, VHDL and Verilog, as well as AHDL (a hardware description language proprietary to Altera).

Quartus II 9.0 also has an integrated simulator (which allows the use of Vector Waveform files), but it can also work with other tools such as ModelSim. As of its latest version, 10.0, the integrated simulator has been removed.

Through its MegaWizard Plug-In Manager tool, Altera provides parameterizable IP-cores, or “megafunctions”, that are optimized for the architectures of its own devices. These functions offer more efficient logic synthesis and can reduce the design time spent on coding. The tool lets the user configure several parameter options of these functions.

A device of the Cyclone II family was adopted for this undergraduate research project. The architecture of this family contains a two-dimensional array of LABs (Logic Array Blocks), each containing 16 logic elements (LEs), small logic units responsible for implementing the user's logic functions, with a four-input LUT (Look-Up Table), a programmable register, etc. Also present in this architecture are memory blocks called M4K, capable of implementing several types of memory (single-port RAM, ROM, FIFO), and multiplier blocks optimized for digital signal processing (DSP).

3.2. ISE Design Suite 11.1

The ISE Design Suite 11.1 IDE belongs to Xilinx, the largest manufacturer of reprogrammable logic devices, which has led this market since the 1990s and is the inventor of the FPGA.

This software can synthesize an HDL design using the Synthesize-XST (Xilinx Synthesis Technology), Synplify/Synplify Pro or Precision tools; the first is the default option for the synthesis process, being Xilinx's own.

Unlike Quartus II, ISE has no integrated simulator; the IDE generates a file to be used in another application, such as ISim, Xilinx's own simulator, which is installed automatically with the software. To simulate the design, this IDE uses a testbench file, which is responsible for generating signals and the initial values of some input vectors. With this kind of test file, together with a simulation tool, the output waveforms can be used to check whether the device under test (DUT) works as intended.

Xilinx also has an IP-core generator tool, the IP (CORE Generator & Architecture Wizard). It is an interactive graphical design tool that allows the creation of high-level modules such as memory elements, mathematical and communication functions, and I/O interface cores. These modules can be customized and optimized by means of pre-defined modules, in order to take advantage of the inherent technical characteristics of the Xilinx FPGA architectures.

The Spartan-3 architecture, the generation of the device adopted for the project, consists of five fundamental programmable elements: CLBs (Configurable Logic Blocks), formed by slices containing LUTs, which can be used for logic implementation and data storage; input/output blocks, which control the data flow between the I/O pins and the internal logic of the device; RAM blocks, which store data in blocks of 18 Kbit; multiplier blocks; and DCMs (Digital Clock Managers). This generation has a rich network of

traces that interconnect these functional elements and carry signals between them. Each of these elements has an associated switch matrix that allows multiple connections in routing.

3.3. Lattice Diamond 1.0

Lattice Diamond 1.0 belongs to Lattice Semiconductor, a pioneer of the ISP programming system and one of the three largest manufacturers of reconfigurable ICs in the international market.

This IDE includes Synopsys Synplify Pro as its integrated synthesis tool which, unlike in the other IDEs, is an application from another company, Synopsys [8]. Its advantage is support for the synthesis of mixed Verilog and VHDL designs.

For simulation, this IDE uses an external tool that requires its own project, Active-HDL Lattice WebEdition 8.2, an application which, like the synthesis application, belongs to another company, Aldec [9]. It also stands out for its ability to simulate mixed VHDL and Verilog code, in addition to advanced verification and many debugging features.

Like the other IDEs, it has its own IP-core generator tool, IPexpress. This application gathers several functional modules that help generate VHDL or Verilog code and can be reused as the user needs, speeding up the design and improving its results. The modules provide I/O, arithmetic and memory functions, among others.

Each device of the LatticeXP2 family, the family representing Lattice in the project, has an array of logic blocks surrounded by PICs (Programmable I/O Cells). Between the rows of logic blocks there are rows of EBRs (Embedded Block RAM), 18-Kbit memory blocks (RAM, ROM or FIFO), and a row of DSP (Digital Signal Processing) blocks. There are two types of logic blocks: the PFU (Programmable Functional Unit), responsible for logic, arithmetic, RAM and ROM functions, and the PFF (Programmable Functional Unit without RAM), responsible for logic, arithmetic and ROM functions. Both have four interconnected slices (four-input LUTs and two registers, or LUTs only).

4. RESULTS

The synthesis process, in all the IDEs, is responsible for checking the code syntax, compiling it (translating and optimizing it into a set of components that can be recognized) and mapping it (converting the components from the compilation phase into primitive components of the target technology).

When synthesizing the soft-core implemented during the undergraduate research project, it can be seen that, by choosing the implemented memory, the logic analyzer module ends up limited to small values of input parameters, data width and number of words. This does not happen when the memory obtained from an IP-core is used, because the synthesis process then uses memory blocks instead of logic elements.

To carry out a comparative analysis of the three synthesis processes, the reports they produce were examined, using both the implemented soft-core and the generated IP-core. This analysis is quite difficult because of the different architectures adopted by each device. From the analyzed reports, Table 1 was built; it presents comparative data on the two types of memory synthesized by the three tools covered in this project. For a better understanding of the table, some remarks on the reports and on the items presented follow.

The report provided by the Altera IDE, the Analysis & Synthesis Summary Report, includes, among other information, the total number of logic elements (including the total of combinational functions and of dedicated logic registers), the total number of registers, and the total number of memory bits used and available. Table 1 shows the data for logic elements, registers and memory bits.

The report provided by the ISE IDE, the Synthesis Report, takes a different approach: it analyzes the use of cells in the synthesis process, dividing them into BELS (basic logic elements such as inverters, LUTs and muxes), flip-flops/latches and buffers. After examining the whole report, flip-flops/latches are taken to be the registers and LUTs the logic elements. To analyze the use of memory blocks, the reports generated by the Map process must also be examined. Table 1 shows the data for LUTs, registers and memory blocks. The reports of this tool define a memory block as a set of 18 Kbits of memory.

The Diamond IDE provides the Resource Usage Report, which also expresses the items to be compared in terms of LUTs and register bits. As with ISE, the reports generated by the Map process were used to analyze the memory blocks. Table 1 shows only the data for LUTs that can be used as RAM, register bits and memory blocks. The reports of this tool define a memory block as a set of 18 Kbits of memory. Mainly because it has these two types of logic blocks, with and without RAM, the synthesis tool achieves very good results, optimizing the use of registers and other elements.
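
The limitation of the implemented memory can be made concrete with a rough flip-flop estimate. The sketch below assumes that a memory described as plain registers costs one flip-flop per stored bit plus one read pointer and one write pointer; this is a generic assumption about register-based FIFOs, not a description of the authors' soft-core, and the function names are hypothetical. The capacities used in the usage note are the register totals printed in Table 1.

```python
def register_fifo_flipflops(data_width: int, depth: int) -> int:
    """Lower bound on flip-flops for a FIFO built only from registers.

    Assumes one flip-flop per stored bit plus a read and a write pointer,
    each wide enough to address `depth` words.
    """
    storage_bits = data_width * depth
    pointer_bits = (depth - 1).bit_length()
    return storage_bits + 2 * pointer_bits


def fits(required: int, available: int) -> bool:
    """True when the device offers at least the required flip-flops."""
    return required <= available


if __name__ == "__main__":
    needed = register_fifo_flipflops(data_width=8, depth=1024)
    print("flip-flops needed:", needed)
    print("fits in 18752 registers:", fits(needed, 18752))
    print("fits in 1408 registers:", fits(needed, 1408))
```

With the parameters of Table 1 (8 bits by 1024 words) the storage alone is 8192 bits, which is of the same order as the register counts reported by Quartus II and ISE for the implemented memory, and the same 8192 bits appear as memory bits when the IP-core is used.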

Table 1. Comparison chart (parameters: data width of 8 bits and 1024 data words in total)

Software | Quartus II | ISE | Diamond
Family | Cyclone II | Spartan3A(N) | LatticeXP2
Device | EP2C20F484C7 | XC3S50A-5TQ144 | LFXP2-5E-6TN144C
Maximum operating frequency (MHz) | 127.03 | 128.259 | 186.2
Implemented memory: logic elements | 23078/18752 (123%) | 15,157/1,584 (1076%) | 960/810 (119%)
Implemented memory: registers | 8251/18752 (44%) | 8,264/1,408 (586%) | 163/4752 (3%)
Generated memory (IP-core): logic elements | 79/18752 (<1%) | 58/1,408 (4%) | 6/810 (1%)
Generated memory (IP-core): registers | 53/18752 (<1%) | 67/1408 (4%) | 33/4752 (1%)
Generated memory (IP-core): memory blocks | 8192/239616 (3%) | 1/3 (33%) | 1/9 (11%)
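
The percentages in Table 1 are the ratio of used to available resources, so they can be recomputed directly from the printed counts. The following sketch does that and lists the rows whose printed percentage is more than one point away from the recomputed value. The row labels, the one-point tolerance and the function names are choices made for this example; the two Quartus II rows printed as "<1%" are left out because they give only a bound.

```python
# (label, used, available, percentage printed in Table 1)
TABLE_1 = [
    ("Quartus II implemented logic elements", 23078, 18752, 123),
    ("Quartus II implemented registers", 8251, 18752, 44),
    ("ISE implemented logic elements", 15157, 1584, 1076),
    ("ISE implemented registers", 8264, 1408, 586),
    ("Diamond implemented logic elements", 960, 810, 119),
    ("Diamond implemented registers", 163, 4752, 3),
    ("ISE IP-core logic elements", 58, 1408, 4),
    ("ISE IP-core registers", 67, 1408, 4),
    ("Diamond IP-core logic elements", 6, 810, 1),
    ("Diamond IP-core registers", 33, 4752, 1),
    ("Quartus II IP-core memory bits", 8192, 239616, 3),
    ("ISE IP-core memory blocks", 1, 3, 33),
    ("Diamond IP-core memory blocks", 1, 9, 11),
]


def utilization_percent(used: int, available: int) -> float:
    """Resource utilization as a percentage of what the device offers."""
    return 100.0 * used / available


def inconsistent_rows(rows, tolerance=1.0):
    """Labels of rows whose printed percentage differs from the recomputed
    one by at least `tolerance` percentage points."""
    return [
        label
        for label, used, available, printed in rows
        if abs(utilization_percent(used, available) - printed) >= tolerance
    ]


if __name__ == "__main__":
    for label in inconsistent_rows(TABLE_1):
        print("check this row:", label)
```

Under these assumptions only the ISE logic-element row for the implemented memory stands out: the printed 1076% is reproduced with an available count of 1,408, the figure used in the other ISE rows, rather than with the printed 1,584. The text does not say which of the two numbers is the misprint.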

5. ACKNOWLEDGMENTS

The authors, Gabriel Santos da Silva and Maximiliam Luppe, thank FAPESP for the support granted (2009/08512-5).

6. REFERENCES

[1] Altera, www.altera.com
[2] Lattice Semiconductor, www.latticesemi.com
[3] Xilinx, www.xilinx.com
[4] FPGA, http://pt.wikipedia.org/wiki/FPGA
[5] Soft-Core and IP-Core, http://en.wikipedia.org/wiki/Semiconductor_intellectual_property_core
[6] HDLs – Verilog and VHDL, http://en.wikipedia.org/wiki/Hardware_description_language; Palnitkar, Samir, Verilog HDL, A guide to Digital Design and Synthesis, Sunsoft Press, 1996; Strozek, Lukasz, Verilog Tutorial, Edited for CS141, October 8, 2005; Tala, Deepak Kumar, Verilog Tutorial, October 25, 2003.
[7] Logic analyzer: Introdução ao Analisador Lógico, www.prof2000.pt/users/lpa; Logic analyzer, http://en.wikipedia.org/wiki/Logic_analyzer
[8] Synopsys, www.synopsys.com
[9] Aldec, www.aldec.com
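
The capture behavior described in Section 2 of this paper (Trigger Stage followed by a FIFO Memory) can be sketched behaviorally as follows. This is not the authors' Verilog soft-core: the class and method names are hypothetical, one call to `sample` stands for one tick of the Time Base clock, and storing the triggering word itself is a design choice made for the example.

```python
from collections import deque


class LogicAnalyzerModel:
    """Behavioral model: trigger stage plus a bounded FIFO memory."""

    def __init__(self, depth: int, trigger_word: int):
        self.depth = depth                # number of words the FIFO can hold
        self.trigger_word = trigger_word  # word compared by the internal trigger
        self.fifo = deque()
        self.triggered = False

    def sample(self, word: int, external_trigger: bool = False) -> None:
        """Process one acquired word on one tick of the time base."""
        if not self.triggered and (external_trigger or word == self.trigger_word):
            self.triggered = True
        # Capture starts with the triggering word and stops when the FIFO is full.
        if self.triggered and len(self.fifo) < self.depth:
            self.fifo.append(word)

    def read(self):
        """Return the oldest stored word (first in, first out), or None if empty."""
        return self.fifo.popleft() if self.fifo else None


if __name__ == "__main__":
    analyzer = LogicAnalyzerModel(depth=3, trigger_word=0xA5)
    for word in [0x01, 0x02, 0xA5, 0x07, 0x08, 0x09]:
        analyzer.sample(word)
    print([hex(w) for w in analyzer.fifo])
```

Words that arrive before the trigger are discarded, and words that arrive after the FIFO is full are dropped, which mirrors the fixed word count used as a parameter in Table 1.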

AUTOMATIC GENERATION OF VHDL FROM A PETRI NET. COMPARATIVE ANALYSIS OF THE SYNTHESIS RESULTS

Roberto Martínez, Javier Belmonte, Rosa Corti, Estela D’Agostino, Enrique Giandoménico

Facultad de Ciencias Exactas, Ingeniería y Agrimensura


Universidad Nacional de Rosario – (FCEIA/UNR)
Avenida Pellegrini 250, (2000) Rosario, Argentina
email: romamar, belmonte, rcorti, estelad, [email protected]

ABSTRACT

Petri nets (PNs) provide a formalism for modeling systems characterized by parallelism and by collaboration in the use of resources. This paper presents a software module that automates the method of direct translation from a PN to synthesizable VHDL code. A comparative analysis of resource usage and operating speed is also carried out, comparing the synthesis of the solutions obtained with the proposed methodology against those obtained using finite state machines. The study indicates that PN-based models generally require more resources than those built with finite state machines. However, the proposed methodology greatly reduces development time and prevents coding errors, which makes it very convenient when resource requirements are not critical.

1. INTRODUCTION

In industrial systems it is common to find several processes evolving in parallel that often need to synchronize with one another, communicate and/or share some resource. Petri nets (PNs) provide a formalism for modeling systems characterized by parallelism and by collaboration in the use of resources. In addition, this formalism adds to its advantages the ease of representing largely unspecified systems [1].

Modeling at a high level of abstraction and the use of formal description techniques from System Level Design (SLD) [2] enable rapid prototyping methods based on the creation of libraries and the reuse of hardware/software components [3].

HDLs (Hardware Description Languages) allow digital circuits to be designed at a high level of abstraction. These languages, originally oriented to hardware description and simulation, are now used in the automatic synthesis of circuits on reconfigurable logic devices.

EDA (Electronic Design Automation) environments, which integrate in a single framework the tools for description, synthesis, simulation and implementation of digital systems, include synthesis tools that recognize logic structures, among them those associated with finite state machines (FSMs). The formats proposed for coding these machines optimize their synthesis, in either occupied area or speed.

This paper presents a software module, MakeVHDL, that automates the method described in [4] for direct translation of a PN into synthesizable VHDL code. A comparative analysis of the results, in terms of resources and speed, is also carried out between the synthesis of code generated by MakeVHDL and that of the code obtained from the FSM format of the Xilinx synthesis tool XST (Xilinx Synthesis Technology).

The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the MakeVHDL software module; and Section 4 presents the comparative analysis of the synthesis results. Finally, Section 5 gives the conclusions.

2. RELATED WORK

Several approaches have been proposed for translating a model represented as a Petri net into code in a hardware description language. In [5] a (closed-source) software tool is reported that translates a model in PNML (Petri Net Markup Language, an international standard that defines a transfer syntax for different versions of Petri nets) into C and VHDL code. The implementation strategy is based on analyzing each node and establishing the correspondences place-register and transition-combinational logic. Other authors, in [6], report the development of an application, also closed-source, called HILECOP, used in the medical domain to generate VHDL code from an interpreted Petri net. An interesting contribution of

this latter work is the possibility of applying activity control to the VHDL components, to save energy consumed by the device, using the principle of “activity propagation”. In [7], the authors decompose the model into basic structural blocks of a PN, each composed of one place and one transition, and then implement each one in a configurable logic block (CLB) of an FPGA. The authors of [8] report the development of Animator4FPGA, a closed-source tool that allows controllers to be described by means of PNs and then generates the corresponding VHDL. In [9], it is proposed that PNs be used as a specification language in the hardware/software co-design of embedded systems, subject to conditions such as that code can be generated from this specification for different platforms and used for simulation, verification and implementation. The work described in [10] reports a comparative study of the resources used in FSM synthesis for different machine description styles and state encoding methods. The authors propose a methodology based on the analysis of synthesis reports, counting slices, flip-flops and

Fig. 1. PIPE screen with the MakeVHDL module.

the mathematical rules that support the description hold; they also allow simulations to be run to verify its behavior.

One of the many graphical tools available for building and simulating PNs is PIPE (Platform Independent Petri Net Editor) [11], which is open-source and developed in Java. PIPE is structured so that specific features can be added through modules that plug into its interface. For VHDL generation, we implemented a module (MakeVHDL) that directly translates the PN drawn in PIPE into VHDL code following the method described in [4]. The module performs the translation from a global perspective of the system, starting from the matrix representation of the associated PN, which limits the complexity of the resulting VHDL description. The implementation of the net architecture consists of three blocks that communicate through signals. The first determines which transitions are enabled to fire, the second defines the new marking, and the third assigns the outputs.

Fig. 1 shows a PIPE screen with the MakeVHDL module added. This module includes facilities
LUTs. También se analiza la frecuencia teórica máxima de para la identificación de las entradas y salidas del sistema.
reloj que se estima en los reportes. Además, permite agregar condiciones lógicas a las
El trabajo aquí presentado se basa en una descripción transiciones y definir salidas condicionadas. La
matricial del modelo de RdeP, y a diferencia de los metodología de traducción propuesta en [4] se amplió
descriptos, está basado en una herramienta open source de agregando los elementos mencionados. Se logró por tanto
libre disponibilidad. la generación completa del código VHDL a partir de la
RdeP, incluyendo la creación de entidades, arquitecturas,
3. GENERACIÓN AUTOMÁTICA DE VHDL señales, puertos de entrada y salida y demás elementos
necesarios para obtener una descripción VHDL sintetizable.
La realización de un modelo mediante Rde P de un sistema, El código VHDL generado por MakeVHDL puede ser
guardado como un archivo o copiado y pegado en el
cualquiera sea la índole de éste, consiste en la realización
ambiente de diseño elegido. En nuestro caso, para verificar
de grafos o diagramas de diferentes estilos, conforme al tipo
de RdeP utilizado para la modelización. También es posible el código obtenido, y realizar el análisis de los resultados de
utilizar directamente una RdeP como forma de especificar síntesis, se trabajó con ISE 8.2i de Xilinx.
al sistema. En cualquier caso, subyacente a dicho diagrama,
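The three-block structure mentioned in this section (enabled transitions, new marking, outputs) can be illustrated in software from the matrix form of a Petri net. The Python sketch below is only an illustration under stated assumptions and is not the code produced by MakeVHDL: the `PRE` and `POST` matrices, the rule of firing every enabled transition in the same step (valid only for nets without conflicts), the output decoding, and the example net (two parallel processes that restart once both have finished, in the spirit of Fig. 2 (a)) are all hypothetical choices made for the example.

```python
# Rows are places, columns are transitions (assumed example net).
# Places: 0 = A running, 1 = A done, 2 = B running, 3 = B done.
# Transitions: 0 = A finishes, 1 = B finishes, 2 = both restart.
PRE = [   # tokens each transition consumes from each place
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]
POST = [  # tokens each transition produces in each place
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]
M0 = [1, 0, 1, 0]  # initial marking: both processes running


def enabled(pre, marking):
    """Block 1: a transition is enabled if every input place holds enough tokens."""
    n_places, n_trans = len(pre), len(pre[0])
    return [
        all(marking[p] >= pre[p][t] for p in range(n_places))
        for t in range(n_trans)
    ]


def next_marking(pre, post, marking, fire):
    """Block 2: state equation m' = m + (post - pre) * fire."""
    return [
        marking[p]
        + sum((post[p][t] - pre[p][t]) * int(fire[t]) for t in range(len(fire)))
        for p in range(len(marking))
    ]


def outputs(marking):
    """Block 3: outputs decoded from the marking (assumed: one flag per marked place)."""
    return [tokens > 0 for tokens in marking]


def step(pre, post, marking):
    """One clock step: fire all enabled transitions (assumes a conflict-free net)."""
    fire = enabled(pre, marking)
    new_marking = next_marking(pre, post, marking, fire)
    return new_marking, outputs(new_marking)
```

Calling `step(PRE, POST, M0)` moves both processes to their "done" places, and a second call fires the join transition and restores the initial marking.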

Fig. 2. (a) Concurrent processes. (b) Shared resources.

Fig. 3. Resources reported for parallel processes (flip-flops and slices of the Petri and FSM models versus number of processes: 2, 3, 6, 8, 10).

Fig. 4. Maximum frequency for parallel processes (MHz versus number of processes: 2, 3, 6, 8, 10; Petri and FSM models).

Fig. 5. Resources reported for shared resources (flip-flops and slices of the Petri and FSM models versus number of processes: 2, 3, 6).
4. COMPARATIVE ANALYSIS OF THE SYNTHESIS RESULTS

Two case studies were analyzed in which modeling with Petri nets (PN) is more advantageous than modeling with finite state machines (FSM). On the other hand, the XST synthesis tool optimizes the implementation of designs that are described using the format recommended for FSMs. Our goal in comparing the synthesis results of the VHDL code obtained from both models was to measure the impact of using PNs on operating frequency and resource usage. The analysis was based on the synthesis reports, since they are a key indicator of how the tool interprets the code. The VHDL code was obtained with MakeVHDL when working with PNs, whereas for FSMs it was written following the two-process format proposed by XST.

4.1. Concurrent processes

A PN is advantageous for modeling a system that evolves in parallel and is composed of several processes that cooperate to achieve a common goal.

Fig. 2 (a) shows the Petri diagram of two parallel processes that restart once both have finished executing.

The FSM-based model of the same problem was modularized, using one machine for each process. The problem was solved for two, three, six, eight and ten processes. Fig. 3 shows the number of flip-flops (FF) and slices used in the synthesis for both representation models, with the area optimization option. Fig. 4 shows the corresponding values of the maximum operating frequency. In the worst case the difference between the maximum frequency values reaches approximately 25%, which is not significant in industrial systems.

Regarding resource usage, the synthesis of the Petri model uses three times as many FFs as the FSM, while the number of slices used is similar for both.

4.2. Shared resources

Systems in which several processes share one or more resources can be represented using a PN

as shown in Fig. 2 (b), which depicts two processes A and B that share the resource R.

This type of system was modeled for two, three and six processes that share a single resource. Fig. 5 compares the number of FFs and slices inferred by XST during synthesis for the two representation models used. Fig. 6, in turn, shows the maximum frequency values. When shared resources are introduced, the FSM uses 50% fewer slices. As for FF usage, the comparison between the two models shows that the FSM adds one more FF than Petri for each process added.

Fig. 6. Maximum frequency for shared resources (MHz versus number of processes: 2, 3, 6; Petri and FSM models).

5. CONCLUSIONS

The analysis shows that using PNs to model the proposed systems has a synthesis cost, in terms of resources used and operating speed, that is generally higher than that of modeling with FSMs. However, the methodology proposed in this work, outlined in Fig. 7, automatically translates the graphical Petri model into synthesizable code in all cases and eliminates the possibility of coding errors. Moreover, when the physical system is modified, describing it with a PN is notably simpler than with an FSM. The software module developed is based on an open-source tool and is therefore freely available. Finally, it can be concluded that if the design requirements are not critical with regard to chip resource usage, the proposed method is very convenient.

Fig. 7. Automatic generation of VHDL code.

6. REFERENCES

[1] M. Uzam and A.H. Jones, "Design of a Discrete Event Control System for a Manufacturing System Using Token Passing Ladder Logic", Proc. of the CESA'96 IMACS Multiconference, Symposium on Discrete Events and Manufacturing Systems, July 1996, pp. 513-518.
[2] I. Viskic, D. Rainer, "A Flexible, Syntax Independent Representation (SIR) for System Level Design Models", 9th EUROMICRO Conference on Digital System Design (DSD'06), 2006, pp. 288-294.
[3] K. Keutzer, S. Malik, R. Newton, J. Rabaey and A. Sangiovanni-Vincentelli, "System level design: Orthogonalization of concerns and platform-based design", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 19 (12), Dec. 2000.
[4] R. Martínez, J. Belmonte, R. Corti, E. D'Agostino, E. Giandoménico, "Descripción en VHDL de un sistema digital a partir de su modelización por medio de una red de Petri", in Proc. V Southern Conference on Programmable Logic, Apr. 2009, pp. 7-11.
[5] L. Gomes, A. Costa, J.P. Barros, P. Lima, "From Petri net models to VHDL implementation of digital controllers", The 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON), pp. 94-99, Taiwan, Nov. 2007.
[6] D. Andreu, G. Souquet, Thierry Gil, "Petri Net Based Rapid Prototyping of Digital Complex System", 2008 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 405-410, 2008.
[7] E. Soto, M. Pereira, "Implementing a Petri net specification in a FPGA using VHDL", Int. Workshop on Discrete-Event System Design, Przytok, Poland, June 27-29, 2001.
[8] F. Moutinho, L. Gomes, "From Models to Controllers Integrating Graphical Animation in FPGA through Automatic Code Generation", IEEE International Symposium on Industrial Electronics (ISIE 2009), 2009, pp. 712-717.
[9] L. Gomes, J.P. Barros, A. Costa, R. Pais, F. Moutinho, "Towards Usage of Formal methods within Embedded Systems Co-design", Proc. of the 2005 IEEE Conference on Emerging Technologies and Factory Automation, Vol. 2, pp. 284-287.
[10] Nader I Rafla, Brett LaVoy Davis, "A Study of Finite State Machine Coding Styles for Implementation in FPGAs", 2006 IEEE International Midwest Symposium on Circuits and Systems, pp. 337-341.
[11] P. Bonet, C.M. Llado, R. Puijaner and W.J. Knottenbelt, Platform Independent Petri net Editor 2, http://pipe2.sourceforge.net/ (accessed 10/10/10).
USING A WII REMOTE AND A FPGA TO DRIVE A MECHANICAL ARM TO
AID PHYSICALLY CHALLENGED PEOPLE

Bruno Seiva Martins, Emerson Carlos Pedrino
Computing Department
Federal University of São Carlos
Rod. Washington Luiz, km 235, São Carlos, SP, Brazil
CEP: 13565-905, Caixa Postal: 676
email: [email protected], [email protected] [email protected]

Valentin Obac Roda
Department of Electrical Engineering
Federal University of Rio Grande do Norte
Caixa Postal 1524 - Campus Universitário Lagoa Nova
CEP 59072-970 Natal/RN - Brazil
email: [email protected]

ABSTRACT

The Nintendo Wii Remote videogame controller introduced an innovative way of playing videogames, using simple hand movements together with simple button presses as input commands. This type of controller may be used as an input device for several applications, such as robotic control. This paper presents a way to interface the controller with Altera's DE2 development board, which contains several devices that can be controlled by the Cyclone II FPGA chip on the board. The DE2 board is configured with the Nios II softcore processor, and the uClinux operating system is installed. Communication between the board and the controller takes place over the Bluetooth protocol. We thus propose a system that takes and interprets input from the Wiimote controller to control a mechanical arm. The system is designed to be operated by physically challenged people, in order to make their lives easier.

1. INTRODUCTION

The main goal of the proposed system is to take input from a controller and interpret it to drive a mechanical arm accordingly. The controller used is the Nintendo Wii Remote (or Wiimote, for short), which is connected to an interpreter system. This interpreter system is built on an Altera DE2 development board, which has several peripherals ready to use and is well suited to prototyping. Once interpreted, the gesture made with the Wiimote is reproduced by the mechanical arm. The mechanical arm is a Lynxmotion AL5A. An overview of the system is given in Fig. 1.

Fig. 1. Overview of the proposed system.

In Fig. 1 the Wiimote (1) communicates with the DE2 board (3) through the Bluetooth protocol (2). This is the system proposed in this article. The board (3) then translates the commands and transmits them to the AL5A mechanical arm (5) over a serial communication protocol (4).

The Wiimote controller has twelve buttons, an infrared camera and three accelerometers (one for each Cartesian axis). The controller also has a low-quality speaker and is powered by two AA batteries. The accelerometers are in a free fall frame of reference [1]. Fig. 2 illustrates the orientation of the axes and of the angles measured between those axes.

Fig. 2. Orientation of the three angles and axes.

The set of data produced by the controller is sent through the Bluetooth protocol to a previously paired device, which is a system built using the DE2 board.
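Since the accelerometers measure relative to free fall, a controller held still reports the reaction to gravity, and static tilt angles can be estimated from the three axis readings. The Python sketch below illustrates that standard calculation only. The axis convention (X across the controller, Y along it, Z through the button face), the sign of the angles and the function name `tilt_from_gravity` are assumptions made for illustration and are not taken from the paper or from any Wiimote library.

```python
import math


def tilt_from_gravity(ax, ay, az):
    """Estimate (roll, pitch) in degrees from accelerometer readings taken at rest.

    ax, ay, az are readings in units of g under the assumed axis convention.
    The estimate is only meaningful while the controller is not accelerating.
    """
    # Roll: rotation about the long (Y) axis, from the X and Z components.
    roll = math.degrees(math.atan2(ax, az))
    # Pitch: elevation of the Y axis above the plane spanned by X and Z.
    pitch = math.degrees(math.atan2(ay, math.hypot(ax, az)))
    return roll, pitch
```

With the controller lying flat (reading close to `(0, 0, 1)`) both angles are near zero; rolling it a quarter turn moves the gravity reading onto the X axis and the roll estimate to about 90 degrees.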

The DE2 board is configured with the Nios II softcore general-purpose processor, along with the required peripherals, such as memory chips and communication port controllers. Given the overall complexity of the system, it was more efficient to build it in layers: the board is configured with an operating system (uClinux), and the programs that handle communication and data processing run on top of it.

The mechanical arm is a simple model manufactured by Lynxmotion, with four degrees of freedom and a serial interface for receiving commands.

We searched broadly through many paper databases and found plenty of work on using the Wiimote as a motion capture device for a variety of purposes. However, none of the works we found uses an FPGA system to gather and interpret the data. We therefore believe that the research done to build this kind of system, integrating an FPGA system with a Bluetooth device, is completely new.

2. PROPOSED SYSTEM

The proposed system was built using the DE2 board. Fig. 3 gives an overview of the board with the peripherals attached and the hardware and software configured. Note that the USB HUB, Bluetooth USB adapter and USB Flash drive boxes represent physical devices, whereas the C Software, uClinux OS and Nios II Processor boxes represent logical layers: the OS and the C programs are stored in the SDRAM memory chip, and the processor is configured in the FPGA chip.

Fig. 3. Overview of the proposed system.

A USB hub was attached to the USB port to allow the use of multiple USB devices. One of these devices is the Bluetooth USB dongle, which gives the board Bluetooth capabilities. The other is a USB flash drive, which was used to store the many versions of the test program. The SDRAM memory was loaded with the uClinux operating system, so that the system could run programs written in C in an operating system environment.

The Cyclone II FPGA chip was configured with the Nios II processor, and all the other board devices needed were attached to the internal bus. This was done in the Quartus II software provided by Altera. The focus of the proposed system was to establish communication with the Wiimote controller, so only a few devices were required. The process of choosing and connecting modules using the SOPC Builder tool inside Quartus II is described in [2].

The main devices used by the system were: the FPGA chip, which hosts the Nios II processor; the SDRAM memory chip, which is loaded with the uClinux operating system; the USB controller, to which the Bluetooth adapter and the flash drive are attached; and the serial UART controller, which drives the mechanical arm.

The Nios II processor version used was the fast (/f core) version, which provides the best performance but uses more FPGA resources [3].

After the system hardware was set up, the uClinux distribution was configured. A Linux system was chosen because it is open source, highly configurable and actively maintained by its developers. A Linux system also allows one to read the full source code and make suitable modifications. The version of uClinux-dist used in the system is dated July 30th, 2009, and is hosted on the Nios II Community FTP site [4]. This distribution was made by the community of Nios II users and targets Altera boards (including the DE2 board). The whole set of tools and source code is called uClinux-dist.

The uClinux compilation parameters are set using the make menuconfig command, which opens a screen listing all the programs, libraries and options that can be included or changed. The settings are divided into two categories: Kernel Settings and Application/Library Settings. Once the configuration is finished, a simple make command compiles the source code into an image file. Fig. 4 shows a typical configuration menu screen.

Fig. 4. Typical uClinux-dist configuration screen. Here it is possible to enable Bluetooth support in the kernel.

With the board attached to the computer, the image is uploaded through the Nios II Embedded Design Suite (EDS) software, but only after the FPGA has been configured with the .sof file. The process is illustrated in Fig. 5.

Fig. 5. Upload flow. Both files are uploaded using Nios II EDS, first the .sof and then the zImage.

Using the same program (Nios II EDS), it is possible to see the output from the board and give input to it once it is properly configured. Fig. 6 shows the initial screen once the OS has booted.

Fig. 6. Screen after boot.

After testing the default configuration, we modified several settings in order to support the devices of the board and the software to be run on the operating system. We enabled kernel support for specific USB devices (USB flash drives and the USB Bluetooth adapter), the Bluetooth protocol and FAT filesystems. Other settings that are enabled by default were disabled, such as network communication over TCP/IP, since we had no use for them. Several modules were also disabled, such as the ifconfig and dhcp configuration tools, among others, to save space. We had to add some tools to handle Bluetooth communication and the compilation of Bluetooth code, namely the BlueZ library tools [5], and this turned out to be one of our main difficulties. The BlueZ library has three main versions, and only the oldest one would run on our system. Some source code analysis and rewriting were needed to get the tools to compile and run successfully. Once Bluetooth was up and running, we searched for Wiimote libraries that we could use, and discovered one called Wiiuse [6], which, like BlueZ, is open source. This library is well documented and well written, which makes modifications painless. With Wiiuse, we were able to run the sample program included with it and modify it to perform some tests.

3. RESULTS

The proposed system produced several results. First, we were able to read the Wiimote data on the screen in real time, showing, for instance, the accelerometer readings (Fig. 7) and the buttons pressed (Fig. 8).

Fig. 7. Real-time data from the Wiimote, showing the current accelerometer data.

Fig. 8. Real-time data from the Wiimote, showing the buttons currently being pressed.

Then we tested the integration between the software and the board hardware by lighting LEDs according to the rotation of the controller. We wrote a program that retrieved the X-axis accelerometer data and displayed it using eight red LEDs. The result is shown in Figs. 9 through 11.
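The LED test can be pictured as a mapping from the controller's rotation to one of eight LED positions. The Python sketch below is a hypothetical illustration, not the authors' C program: it assumes that the rotation is available as an angle between -90 degrees (turned fully left) and +90 degrees (turned fully right), that the range is divided into equal bins, and that LED 1 is the rightmost and LED 8 the leftmost. The function names `led_index` and `led_mask` are invented for the example, and the paper does not state the exact bin boundaries.

```python
def led_index(roll_deg, n_leds=8):
    """Return the LED number (1 = rightmost, n_leds = leftmost) for a rotation angle.

    roll_deg is assumed to be -90 for a full left turn and +90 for a full right turn;
    values outside that range are clamped.
    """
    clamped = max(-90.0, min(90.0, roll_deg))
    # Fraction of the way from full right (0.0) to full left (1.0).
    fraction = (90.0 - clamped) / 180.0
    # Equal-width bins; the upper edge is folded into the last LED.
    return min(n_leds, int(fraction * n_leds) + 1)


def led_mask(roll_deg, n_leds=8):
    """Return a bit mask with only the selected LED set (bit 0 = LED 1)."""
    return 1 << (led_index(roll_deg, n_leds) - 1)
```

Under these assumptions a full right turn selects LED 1 (mask `0x01`) and a full left turn selects LED 8 (mask `0x80`); on the board, such a mask would be written to whatever register drives the red LEDs.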
Fig. 9. With the controller turned 90º to the left, the leftmost LED turns on (8th LED).

Fig. 10. With the controller just past the center position (when it is facing up), the corresponding LED turns on (3rd LED).

Fig. 11. As expected, when the controller is turned 90º to the right, the rightmost LED turns on (1st LED).

The proposed system is currently under development. The mechanical arm that will be driven by the system is under study, and the software to control it is being developed. Our goal is to allow physically challenged people to control a robotic arm with ease, so that the arm can perform simple tasks such as pushing a heavy object around or reaching objects that are normally out of reach. Although this idea is not new [7], the use of an FPGA to gather the data produced by the Wiimote and to drive the mechanical arm is original. Using embedded devices for this task instead of personal computers represents a new branch of research, allowing real-time responsiveness and portability for the system.

4. CONCLUSIONS

The DE2 evaluation board is very versatile, as it allows virtually any kind of system to be implemented. One of its strengths is the variety of devices, already wired and ready to be used, which gives great prototyping power to the developer using the board. This power was used in our system, which interprets commands from the Nintendo Wiimote videogame controller and is able to drive a mechanical arm attached to the board according to the commands received. The integration accomplished by our work can serve as a base for developing a number of systems that can benefit from Wiimote input. Using embedded systems for this job may be more efficient, given the real-time focus of such systems compared to general-purpose personal computers. The ultimate goal of our work is to enable this system to be used smoothly by physically challenged people in order to make their lives easier, so that they can control a mechanical arm to do everyday tasks, such as moving relatively heavy objects or reaching objects stored too high.

5. ACKNOWLEDGEMENTS

We wish to thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for its financial and institutional support of this research, registered under process number 2010/07179-8. Emerson C. Pedrino is also grateful to FAPESP for process 2009/17736-4.

6. REFERENCES

[1] Wiimote – Wiibrew <http://wiibrew.org/wiki/Wiimote>, Retrieved on October 14th, 2010.
[2] J. O. Hamblen, T. S. Hall, M. D. Furman, "Tutorial IV: Nios II Processor Hardware Design", in Rapid Prototyping of Digital Systems SOPC Edition, Springer, pp. 352-370 (2008).
[3] "Altera's Embedded Processors", Accessed Oct 14, 2010. [Online]. Available: http://www.altera.com/products/ip/processors/nios2/ni2-index.html
[4] "Nios II Community FTP", Accessed Oct 14, 2010. [Online]. Available: http://www.niosftp.com/pub/
[5] "BlueZ", Accessed Oct 15, 2010. [Online]. Available: http://www.bluez.org/
[6] "wiiuse – The Wiimote C Library", Accessed Oct 15, 2010. [Online]. Available: http://www.wiiuse.net/
[7] C. Smith, H. I. Christensen, "Wiimote Robot Control Using Human Motion Models", The 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, USA (2009).
SYSTOLIC MATRIX-VECTOR MULTIPLIER FOR A HIGH-THROUGHPUT
N-CONTINUOUS OFDM TRANSMITTER

Enrique Mariano Lizárraga
CONICET - F.T.yCs.Ap.
Universidad Nacional de Catamarca
4700 Catamarca, Argentina
email: [email protected]

Victor Hugo Sauchelli
F.C.E.F.yN.
Universidad Nacional de Cordoba
5000 Cordoba, Argentina
email: [email protected]

ABSTRACT

Digital systems frequently have high-speed operation requirements, and their processing is often based on arithmetic operations. In this work, we consider a strongly resource-demanding application, matrix-vector multiplication, in the context of N-Continuous OFDM signal generation. We propose a systolic architecture that performs operations in parallel and reduces the processing time according to a design parameter. Our results expose the benefits derived from a critical path treatment, and show an attractive simplicity owing to circular data shifting in a systolic approach.

1. INTRODUCTION

High-speed operation is a necessary feature in current VLSI designs; the bottleneck in the capacity to support certain applications is frequently the timing performance. Such applications may be found in devices ranging from smartphones to complex computers. We therefore focus on an efficient architecture for digital signal processing. This area has enabled applications such as high data rate wireless communications [1], image recognition [2], and biomedical processing [3]. In these designs, matrix operations are frequently required, but they imply a large number of elemental (scalar) multiplications. Even simple scalar multipliers are resource demanding [4]. Therefore, area, power and speed costs increase strongly for the global system if the design is not optimized.

In this work, we consider a matrix-vector multiplication where the matrix elements are constant, so they can be stored in a memory beforehand. This case arises in several applications such as generalized DFT, coordinate rotations, coding, etc. We base the design on a bank of multipliers that allows parallel processing. In this way, the operation time is reduced by a design parameter. We present particular consequences of the parallelism concept and a detailed description of the architecture.

The proposed architecture is tested for a wireless communication transmitter, showing good performance and achieving the required bandwidth.

The rest of the paper is organized as follows. Section 2 presents the fundamentals of the matrix-vector multiplication algorithm and discusses alternatives; Section 3 explains the architecture and gives delay considerations; Section 4 presents results obtained from simulation and synthesis for FPGA. Conclusions are given in Section 5.

2. ALGORITHM BASICS

2.1. Matrix-Vector Multiplication Fundamentals

Let $M$ be an $N \times N$ matrix, and $v$ an $N \times 1$ column vector. The result of the matrix-vector multiplication is the $N \times 1$ column vector $r$ defined by

$$r_i = \sum_{j=0}^{N-1} M_{ij}\, v_j, \qquad i = 0, \ldots, N-1, \tag{1}$$

where $M_{ij}$ and $v_j$ are the elements of $M$ and $v$, respectively. From this expression it follows that $N^2$ individual multiplications must be solved to complete the operation. Although in many cases these may be complex multiplications, the results obtained in this work remain valid.

Regarding the calculation of the result, on the one hand, a simple approach is to use combinatorial logic, but the area requirement is high and a slow clock is necessary. On the other hand, an alternative is to solve every multiplication sequentially. However, a large number of clock cycles is then required to complete the processing. In addition, many architectures for real or complex number multipliers present latency, which may reduce the global performance. Pipelining may be included to mitigate the delay drawback. Even so, the global processing time is still governed by the expression $N^2$. However, the benefit of this approach is that it uses only one multiplier. Then, the

43
processing time is given by For the N-Continuous OFDM transmitter described for
analysis we can express the bandwidth as
ρN 2
Tseq = (2)
f (1 + GI)N
B= . (7)
where ρ represents the processing time in the elemental mul- Tpar
tiplier and may be fixed at one if pipelining is applied, f
represents the system clock rate. By selecting 32 parallel multipliers, K = 32, the obtained
Since N-Continuous technique has been proposed for bandwidth is 5.72 MHz, according to the resulting process-
out-of-band power reduction in OFDM systems [5], and it is ing time of 56.25 µs for the matrix-vector multiplier. These
based on a correction vector obtained by means of a matrix- values achieve the required bandwidth in [6]. Also, the pro-
vector multiplication. We analyze the sequential operation cessing period for the matrix-vector multiplier does not cover
in the correction calculation. In this case, N is defined by the whole OFDM symbol duration; then, a fraction of the
the subcarriers number of the OFDM system. Based on the symbol transmission period may be used for another spe-
3GPP E-UTRA/LTE specification for wireless communica- cific OFDM processing.
tions N = 300 is chosen [6]. Then, a typical clock rate
for wireless communication architectures implemented in 3. ARCHITECTURE DESIGN
FPGA is considered, f = 50 MHz. According to a typical
complex multiplier scheme, which operates in four cycles, A processing unit fed by the v vector is considered. We
a straightforward pipelining is considered, ρ = 1. So, the suppose that the elements of the M matrix have been pre-
global processing time is Tseq = 1.8 ms. If Guard Interval viously stored in an internal memory. Then, the objective
(GI) is applied with fraction GI = 22/300 [6], the transmit- is to present the result of the calculation as fast as possible.
ter can achieve a final bandwidth of According to (5), we can build a parallel multipliers bank
composed by K elemental multipliers. Since each one has
(1 + GI)N
B= = 179 KHz (3) two inputs, a and b, we can join all the multiplier inputs and
Tseq form two buses, A and B, which are fed by v and M, re-
at most. Unfortunately, this bandwidth is less than the one spectively. This scheme is depicted in Fig. 1. According to
specified in [6]. Also, a practical OFDM transmitter in- the parallel concept, the bus A is fed by the k-th fraction of
cludes other operations that can further reduce the presented the v vector in each cycle. In this way, the complete load of
speed performance. v requires L cycles. As stated in (5), this fraction of the v
vector is multiplied by the k-th fraction of the i-th row of M.
2.2. Parallel Operation Performance
In the application considered, the timing requirement may not be met by the single-multiplier scheme discussed above. An alternative is to use K multipliers and synchronize them for simultaneous operation. Based on this approach, (1) can be turned into an L-element addition

\[ r_i = \sum_{k=0}^{L-1} r'_{i,k}, \qquad i = 0, \ldots, N-1 \tag{4} \]

where L = N/K, and r'_{i,k} represents the k-th partial addition

\[ r'_{i,k} = \sum_{j=0}^{K-1} M_{i,j+kL} \, v_{j+kL}, \qquad i = 0, \ldots, N-1. \tag{5} \]

The matrix element selection in the column direction is obtained from j + kL, where k = 0, ..., L − 1. Thus, the model in (4) indicates L steps and K elemental multipliers for the proposed architecture.

The processing time for the complete calculation is reduced depending on the K parameter. The calculation takes N²/K clock cycles, so that

\[ T_{par} = \rho N^2 / (K f). \tag{6} \]

Fig. 1. Functional Diagram

Note that the multipliers bank output represents the elements to be summed in (4), then r'_{i,k} is computed. Since the values of r'_{i,k} are generated sequentially, each one of the set k = 0, ..., L − 1 must be stored. The value of r_i may then be computed by means of a new addition once the L partial additions have been obtained.

A special consideration is the need to feed the two K-length vectors represented by {M_{i,j+kL}} and {v_{j+kL}}, for j = 0, 1, ..., K − 1, simultaneously. This requirement allows every element of r to be calculated in only L clock cycles. However, this configuration implies special memory units to accomplish the described behavior for v on the input bus. Also, it is observed that after L cycles, each fraction of the vector v, i.e. {v_{j+kL}} for j = 0, 1, ..., K − 1, is required again for calculation. We can therefore feed it back by means of a simple circular buffer connected to A. This is a consequence of the systolic approach in the proposed system and allows an important simplification of the design.
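The partitioning in (4) and (5) can be checked with a small software model. The following Python sketch is illustrative only and is not the authors' VHDL: the function names `blocked_matvec` and `t_par` are hypothetical, block k is assumed to cover columns kK, ..., kK + K − 1 so that the L blocks tile all N columns, and ρ is passed in as a plain parameter because it is defined outside this section. Each inner iteration stands for one clock cycle in which the K multipliers work simultaneously.

```python
def blocked_matvec(M, v, K):
    """Software model of r = M v computed with K parallel multipliers.

    Assumption: block k covers columns k*K .. k*K + K - 1.
    Returns the result vector and the number of modelled clock cycles.
    """
    N = len(v)
    if N % K != 0:
        raise ValueError("K must divide N")
    L = N // K              # partial additions per output element, L = N/K
    r = []
    cycles = 0
    for i in range(N):
        partials = []       # storage for r'_{i,k}, k = 0..L-1
        for k in range(L):
            base = k * K
            # One modelled clock cycle: K products and their sum, as in (5).
            partials.append(sum(M[i][base + j] * v[base + j] for j in range(K)))
            cycles += 1
        # Final addition of the L stored partial sums, as in (4).
        r.append(sum(partials))
    return r, cycles


def t_par(N, K, f, rho=1.0):
    """Processing time from (6); rho is treated as a given constant."""
    return rho * N * N / (K * f)


M = [[1, 2, 3, 4],
     [0, 1, 0, 1],
     [2, 0, 2, 0],
     [1, 1, 1, 1]]
v = [1, 2, 3, 4]
r, cycles = blocked_matvec(M, v, K=2)
print(r, cycles)
```

With N = 4 and K = 2 the model counts N²/K = 8 cycles, which is the cycle count that (6) converts into time.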
If we select R bits for resolution, then the buses A and B must be sized as R × K bits. In turn, bus B needs to be connected to a memory holding the N²/K words of R × K bits that completely represent M.

This analysis remains valid whether fixed-point or floating-point number representation is used. For complex number representation, the storage units and the adder stages may process real and imaginary parts independently.

3.1. Data Propagation Optimization

Although the former section presents system-level requirements, this section discusses the datapath of the design. If we consider the r'_{i,k} calculation, the implementation may be based on tree adders, as shown in Fig. 2. We can use K − 1 two-input adders and finally obtain r'_{i,k} through a cascaded connection. Unfortunately, it was shown that this scheme may strongly affect the global timing performance because of the critical path extension. This is a consequence of the extensive combinatorial logic inferred by the adders, which defines a long propagation path. According to [4], the critical path in a VLSI circuit is established by the interconnection between latches. Then, as K increases, more combinatorial adders are inserted into the path between two latches and the performance becomes poor.

Fig. 2. Cascaded addition for Adder 1

Although the technique in the previous section was to increase K to improve the speed performance, it is possible to obtain the opposite effect because of the critical path extension. An appropriate parameter selection criterion may be stated. On the one hand, K may be chosen as large as possible, limited by the area resources. On the other hand, this selection affects the speed performance negatively if the critical path extension becomes too high. This behavior was observed in the wireless communication application considered, for K = 32. Nevertheless, other (N, K) settings may be located at a beneficial point of the parameter space.

Based on [4], we include flip-flops which interrupt the propagation paths and shorten them, which produces a pipelined architecture for our design. Although delay cycles are introduced into the system as a result of this technique, the global performance is improved since K is sufficiently large and the operation of the transmitter is periodic.

3.2. Final Settings

The complete design is depicted in Fig. 3, where a pipelined architecture is used for Adder 1. This way, the speed bottleneck imposed by high values of K, as desired for high-parallelism operation, is relieved and the required operation frequency is achieved.

In this scheme, K − 2 delay units are inserted into the connection between the multipliers bank output and the elemental adders of the tree. The delay value is fixed at one for the output in position K − 3 and is increased by one as the position decreases down to position 0 in Fig. 3. This tree in Adder 1 represents the simplest approach and achieves good performance in the mentioned application; however, it may be improved further by defining a symmetric tree adder [4].

In our design, r'_{i,k} is available K − 2 cycles after the multipliers bank produces its output. In turn, once the first r'_{i,k} element is calculated, it must be stored until the complete set for k = 0, ..., L − 1 is available. As K increases, L becomes lower. Then, if the value of L is small, we can synchronize the r'_{i,k} contributions to r_i by means of a new set of delay units without affecting the area performance. In other cases, a memory-based subsystem may replace it, and address logic needs to be appended. In the case of delay units, their values are represented as

\[
\begin{array}{ccccc}
r_{i,0} & r_{i,1} & \cdots & r_{i,L-2} & r_{i,L-1} \\
X & r_{i,0} & \cdots & r_{i,L-3} & r_{i,L-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
X & X & \cdots & r_{i,0} & r_{i,1} \\
X & X & \cdots & X & r_{i,0}
\end{array}
\tag{8}
\]

where each column represents a different clock cycle t_0, t_1, ..., t_{L−1} from left to right, so it is a classical S/P unit. After this operation, we use a new tree adder fed by the entire set r'_{i,k} in parallel. In this case, the propagation path extension does not significantly affect the performance because of the small value of L. Nevertheless, a more sophisticated critical path treatment is still possible. Depending on whether a specific system meets its timing constraints or not, a pipelined tree adder similar to Adder 1 may replace Adder 2.

According to the error computation for the analyzed OFDM transmitter, where fixed-point number representation is chosen, 2R bits are used for real and imaginary parts independently at the adder inputs. Thus, truncation is not applied at the multipliers output. Based on numerical simulation, the adder outputs are defined as 2R-bit. A truncation unit is placed in the last stage, and the output bus represents the results in R bits for real and imaginary parts, independently.

Fig. 3. Complete matrix-vector multiplier architecture

4. SIMULATION RESULTS

The proposed architecture has been tested on an Altera® EP2C70F672C6 device, for which a VHDL specification was developed. Debugging was performed by means of a fixed-point simulator built on Matlab®. It was complemented by a special unit for connecting the test board to a PC through an Ethernet port. The final performance is summarized in Table 1.

Table 1. Synthesis Results
Resource               Utilization   %
LEs                    3129          4.6
LABs                   215           5
Registers              530           0.88
Memory Bits            5490          0.49
Hardware Multipliers   191           0.64

These values were obtained for standard synthesis effort in Quartus II® software, and the maximum operation frequency is found to be 50.1 MHz.

5. CONCLUSION

Based on the requirement of high-speed processing for arithmetic calculation units, we focused on the matrix-vector multiplication issue. Several implementation considerations are analyzed for the case of parallel elemental multipliers. Although these multipliers may be real or complex, and may accept fixed-point or floating-point number representations, the presented architecture remains valid. The proposed scheme allows reducing the processing time from N² to N²/K clock cycles. The design is tested for N-Continuous OFDM signal generation, and the performance achieved is sufficient for implementing an N-Continuous OFDM transmitter following the LTE standard.

6. REFERENCES

[1] T. Onizawa, A. Ohta, and Y. Asai, “Experiments on fpga-implemented eigenbeam mimo-ofdm with transmit antenna selection,” Vehicular Technology, IEEE Transactions on, vol. 58, no. 3, pp. 1281–1291, March 2009.
[2] P.-Y. Chen, C.-Y. Lien, and C.-P. Lu, “Vlsi implementation of an edge-oriented image scaling processor,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 17, no. 9, pp. 1275–1284, Sept. 2009.
[3] L. Androuchko and I. Nakajima, “Developing countries and e-health services,” in Enterprise Networking and Computing in Healthcare Industry, 2004. HEALTHCOM 2004. Proceedings. 6th International Workshop on, 28-29 2004, pp. 211–214.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
[5] J. van de Beek and F. Berggren, “N-continuous OFDM,” Communications Letters, IEEE, vol. 13, no. 1, pp. 1–3, 2009.
[6] Physical Channels and Modulation (Release 8), 3GPP Std. TSG RAN TS 36.211, v8.4.0., 2008.

SYNTHESIS OF THE HARTLEY TRANSFORM WITH A HADAMARD-BASED MATRIX
ARCHITECTURE

Gilson J. Alves, Member, IEEE, and Edval J. P. Santos, Senior Member, IEEE

Laboratory for Devices and Nanostructures, Electronics and Systems Department,


Universidade Federal de Pernambuco. Rua Academico Helio Ramos, s/n, Varzea,
50740-530, Recife, PE, Brasil.

ABSTRACT

Hadamard matrices are used to synthesize the Hartley Transform. This approach allows for the implementation of the Hartley transform in a scalable format. For comparison, the transform has also been implemented via the matrix definition. Tests were carried out with a vector simulating the input signal. The outputs of both implementations are compared. The FPGA device used is a Xilinx® Spartan 3E, XC3S500e.

1. INTRODUCTION

Integral transforms, especially the Fourier Transform, play an important role in several engineering fields, with special emphasis on Digital Signal Processing (DSP) in Optics, Voice and Image Recognition, and Telecommunication [1]. Application examples are Image Compression [2], Content Based Image Retrieval [3], ADSL modems [4] and Communication Systems with multiple access (CDMA) [5]. However, they have the disadvantage of requiring a large hardware area to be implemented. The Hartley Transform is an integral transform closely related to the Fourier Transform, with the advantage that its result for a real input signal does not contain complex numbers, which implies simpler arithmetic operations [6, 7, 8] that can be implemented in smaller areas.

There are several algorithms which can be used to implement integral transforms. The Least Mean Squared (LMS) has been widely used in DSP [9]. However, the LMS algorithm suffers from high computational complexity. Many techniques have been proposed to reduce the computational complexity. Merched et al. [10] have introduced the implementation of the Hartley transform using LMS - HBNLMS. The HBNLMS involves extending the data matrices to circulant symmetric matrices and then using the Hartley transform to diagonalize them, with the purpose of finding the commonality between adaptive filters using multidelay concepts and those using filterbanks. Efficient implementations for the Discrete Hartley Transform of N-length (N-DHT) have been proposed, such as the PFA (Prime Factor Algorithm), where N is decomposable into prime factors, presented in [11, 12], and the WFTA (Winograd Fourier Transform Algorithm), presented in [13]. Marchesi [14] describes the N-DHT using CORDIC processors and systolic shuffle units, when N is a power of 2, but this implementation uses a large area and is slow. More recently, H. M. de Oliveira and Renato S. Cintra have proposed the use of a Hadamard matrix based architecture to implement the Hartley transform in a more scalable format. This is the approach this paper has selected to synthesize. The implementation was analyzed with the Simulink MatLab® tool, and synthesized with Xilinx® ISE 11.1.

The paper is divided into five sections: the first is this introduction; the second presents a brief overview of the Hartley transform; the third presents the Hadamard-based matrix architecture; the fourth describes the methodology for implementation, with specification, HDL description, simulations and synthesis. Another implementation of the 16-DHT, via the Hartley matrix definition, is also presented. Tests are carried out with both implementations, the Hadamard-based and the matrix definition, and the results are compared. The last section is the conclusion.

The authors thank professor H. M. de Oliveira for suggesting the implementation of this algorithm.

2. A BRIEF HARTLEY TRANSFORM OVERVIEW

The Hartley Transform is an integral transformation that maps a real-valued temporal or spatial function into a real-valued frequency function via the kernel cas(ωx) = cos(ωx) + sin(ωx). This symmetrical formulation of the traditional Fourier transform, attributed to Ralph Vinton Lyon Hartley in 1942, leads to a parallelism that exists between the function of the original variable and that of its transform [7]. This transform remained in a quiescent state for over 40 years, and was rediscovered by Bracewell [7].

Fig. 1. The Self-inverse transform

2.1. Definition

The Hartley transform of a function f(x) can be expressed as the pair

\[ H(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(x)\,\mathrm{cas}(\omega x)\,dx \tag{1} \]

\[ f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} H(\omega)\,\mathrm{cas}(\omega x)\,d\omega \tag{2} \]

where the angular frequency variable ω is related to the frequency variable f by ω = 2πf, and the integral kernel is the cas function, defined as cas(i) = cos(i) + sin(i).

2.2. Hartley and Fourier Transforms

The Hartley transform is closely related to the Fourier transform, and this relationship can be expressed as follows:

The Fourier Transform:

\[ F(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(x)\,e^{-j\omega x}\,dx \tag{3} \]

From equation (3), it is easily shown that the Hartley transform presented in equation (1) can be rewritten as

\[ H(\omega) = \Re\{F(\omega)\} - \Im\{F(\omega)\} \tag{4} \]

The Hartley transform (HT) has structural advantages over the Fourier transform (FT), owing to the fact that it is a real transform that also has the self-inverse property, as can be seen in Fig. 1. So, if resources are scarce, the HT can be a way of solving the problem, since a single algorithm suffices for computing both the direct and the inverse Hartley transform.

2.3. The Discrete Hartley Transform

The discrete version of the Hartley transform for discrete signals of N-length is the Discrete Hartley Transform for N-length (N-DHT), and is defined by [6] as the pair

\[ H_k = \sum_{n=0}^{N-1} h_n \,\mathrm{cas}\!\left(\frac{2\pi k n}{N}\right), \qquad k = 0, 1, \ldots, N-1 \tag{5} \]

\[ h_n = \frac{1}{N} \sum_{k=0}^{N-1} H_k \,\mathrm{cas}\!\left(\frac{2\pi k n}{N}\right), \qquad n = 0, 1, \ldots, N-1 \tag{6} \]

and cas(i) = cos(i) + sin(i).

The existence of fast algorithms for computing discrete transforms (FTAs) is one of the main reasons for their wide application [15]. Fast Hartley transforms are closely tied to N-DHT applications [16], which has made the N-DHT an efficient tool.

The transform presented in equation (5) can be expressed with a matrix linear operator as

\[ H = \mathcal{H}_N \cdot h \tag{7} \]

where \mathcal{H}_N is the Hartley matrix of N-length, whose elements h_{kn} are given by cas(2πkn/N).

For this work, N = 16. In this case, equation (7) becomes

\[
\begin{bmatrix} H_0 \\ H_1 \\ H_2 \\ \vdots \\ H_{15} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & 1.3066 & 1.4142 & \cdots & 0.5412 \\
1 & 1.4142 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 0.5412 & 0 & \cdots & 1.3066
\end{bmatrix}
\begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ \vdots \\ h_{15} \end{bmatrix}
\tag{8}
\]

3. THE HADAMARD-BASED MATRIX ARCHITECTURE

If equation (8) is used for a direct implementation of the DHT, the area problem cited in the introduction remains. It was therefore used only in the second approach, for comparison with the first one, the Hadamard-based matrix architecture. For better results, fast transform algorithms (FTAs) are commonly used, because they achieve minimal multiplicative complexity. Among others, multiplicative complexity is one of the main criteria used to analyze the efficiency of an algorithm. From a mathematical viewpoint, there are basically three methods for obtaining better transform algorithms: index rebuilding, matrix operations, or the use of the convolution theorem [17]. This article uses the second technique.

The DHT and the DFT of a real discrete signal s_i (i = 0, 1, ..., N − 1), where s = f refers to Fourier and s = h refers to Hartley, can be written as:

DFT: f_i ↔ F_k and DHT: h_i ↔ H_k.

A relationship between DHT and DFT is expressed by:

\[ F_k = \frac{1}{2}\left[(H_k + H_{N-k}) - j\,(H_k - H_{N-k})\right] \tag{9} \]

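The definitions in (5) to (9) can be exercised numerically. The following Python sketch is a plain software check, not the authors' implementation: all function names are hypothetical, the test signal is an illustrative rectified sine, and the DFT is assumed to use the unnormalized convention F_k = Σ f_n e^{−j2πkn/N}, under which (9) holds with the index N − k wrapping to 0 for k = 0.

```python
import cmath
import math


def cas(x):
    """Hartley kernel: cas(x) = cos(x) + sin(x)."""
    return math.cos(x) + math.sin(x)


def hartley_matrix(N):
    """N x N Hartley matrix with elements cas(2*pi*k*n/N), as in (7)."""
    return [[cas(2 * math.pi * k * n / N) for n in range(N)] for k in range(N)]


def dht(h):
    """Direct N-DHT by matrix definition, as in (5)."""
    N = len(h)
    return [sum(row[n] * h[n] for n in range(N)) for row in hartley_matrix(N)]


def idht(H):
    """Inverse N-DHT, as in (6): the same kernel, scaled by 1/N."""
    return [x / len(H) for x in dht(H)]


def dft_from_dht(H):
    """DFT coefficients recovered from DHT coefficients, as in (9)."""
    N = len(H)
    out = []
    for k in range(N):
        Hk, Hnk = H[k], H[(N - k) % N]   # index N-k wraps to 0 when k = 0
        out.append(0.5 * complex(Hk + Hnk, -(Hk - Hnk)))
    return out


def dft(f):
    """Reference DFT by direct summation, used only for comparison."""
    N = len(f)
    return [sum(f[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


# Illustrative input: 16 samples of a rectified sine wave of amplitude 10.
signal = [abs(10 * math.sin(2 * math.pi * n / 16)) for n in range(16)]
H = dht(signal)
roundtrip = idht(H)      # should reproduce the input signal
F = dft_from_dht(H)      # should match dft(signal)
```

Because the direct and inverse transforms share the same kernel, `idht` simply reuses `dht` with a 1/N scaling, which is the practical meaning of the self-inverse property mentioned in Section 2.2.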
The Turbo Hartley Transform (THT) for short block lengths presented by De Oliveira, Cintra and Campello [18] was used for this approach. In this method, the technique of decomposition into layer matrices is used [17].

In equation (7), note that

\[ \mathrm{cas}\!\left(\frac{2\pi k (n + \frac{N}{2})}{N}\right) = \mathrm{cas}\!\left(\frac{2\pi k n}{N} + \pi k\right) = (-1)^{k}\,\mathrm{cas}\!\left(\frac{2\pi k n}{N}\right) \tag{10} \]

Applying this to the Hartley matrix \mathcal{H}_N when N = 16, according to the method shown by De Oliveira et al. [18] and Cintra [17], the 16-DHT can be computed via four intermediate matrix layers. Each layer is derived from the matrix representation of the previous layer, and all the arithmetic operations already executed in the previous layer can be reused in the next one. Thus, all operations are computed just once and reused every time they are needed. So, if the N-DHT scheme is known, it can be used for computing the (N+1)-DHT, which means that the (N+1)-DHT scheme encapsulates the N-DHT scheme, as can be seen in Fig. 4. The result is that the effort for computing a DHT is reduced, with a much more concise block of operations.

4. DESIGN METHODOLOGY

The design was developed in the following steps: Specification, HDL Description, Behavioral Simulation and Hardware Implementation. The next subsections refer to the first approach, the Hadamard-based matrix architecture.

4.1. Specification

The project aims to implement the Discrete Hartley Transform of length 16, the 16-DHT, in a low-cost FPGA module, in accordance with Fig. 4 and Fig. 3. Fig. 2 summarizes the design conception.

The discrete Signal-In is a vector of sixteen 14-bit samples. The samples are stored serially in an input memory. After that, the memory vector is multiplied by the elements of the Hartley transformation matrix, in a 4-layer operation according to the scheme of Fig. 4 and the explanation in the previous section (3). The 16-DHT response is the discrete Signal-Out, a vector of length 16, where each component is represented in a 12-bit word.

Fig. 2. Design Conception

4.2. HDL Description

The Matlab® Simulink software can be used to simulate a Hartley transform execution system [19]. In this design, Matlab® was used to implement the length-16 Hartley matrix with a Hadamard-based matrix architecture, which was converted into Simulink blocks, as shown in Fig. 4. The computation of the Hartley transform is executed according to the description in Section 3, summarized in Fig. 4, adapted from [17]. The HDL code was generated in VHDL, to realize the 16-DHT conception.

Fig. 3. 16-DHT block conception

Fig. 4. 16-DHT algorithm conception

4.3. Behavioral Simulation

The simulation was carried out to check the system response. The simulation environment used was ModelSim-XE® with the Xilinx® ISE tool. Tests were carried out with a vector simulating the input signal, in two situations. In the first, the input signal simulates a rectified sine wave, and the response is shown in Fig. 6 and Fig. 7. The Signal-In vector (entrada_th) is an integer approximation of a sine wave with amplitude 10 and positive rectification, and the response, the Signal-Out (saida_th), is presented as integer numbers due to tool limitations. In the second situation, the input signal is a gate function, and the response is shown in Fig. 5. The time to compute the 16-DHT of a signal is 3 μs. This makes it feasible for a range of applications, such as audio and image processing [20].

4.4. Hardware Implementation

As the simulation results were as expected, the synthesis was executed for a Xilinx® Spartan-3E, XC3S500E-4fg320, via Xilinx® ISE 11.1, with a previous RTL (Register Transfer Level) generation. Due to characteristics of the test platform used, the auxiliary clock frequency was set to 6.25 MHz, but it can be set up to 50 MHz, depending on the FPGA platform.

For comparison purposes, another synthesis of the 16-DHT was carried out, via its matrix definition algorithm, according to equation (7). In this situation, although the response is the same, the hardware consumption is a strong disadvantage, as can be seen in Table 1, where the hardware characteristics are summarized.

Table 1. 16-DHT: Device utilization summary
Logic          Available   Matrix definition        Hadamard-based
                           Used    Utilization      Used    Utilization
SLICES         4656        4125    88%              617     13%
SLICE FF       9312        295     3%               174     1%
4 INPUT LUT    9312        6813    73%              953     10%
BONDED IOB     232         17      7%               17      7%
MULT18X18SIO   20          19      95%              13      65%
GCLK           24          2       8%               2       8%

The synthesis response of the 16-DHT of a rectified sine wave computed in MatLab® (Fig. 7) is the discrete signal presented in Fig. 8.

Fig. 5. 16-DHT of a gate function - Synthesized [plot: 16-DHT versus t]

Fig. 6. 16-DHT of the rectified sine wave

Fig. 7. MatLab® 16-DHT of a rectified sine wave [plot: amplitude versus t]

Fig. 8. Synthesis of the 16-DHT of a rectified sine wave

5. CONCLUSION

Hadamard-based matrix implementation is useful for a wide range of applications. This paper presented a methodology for fast implementation of the discrete Hartley transform with a Hadamard-based matrix architecture. The design was implemented in a low-cost FPGA module, the Spartan-3E XC3S500E-4fg320. With a low auxiliary clock of 6.25 MHz, the time required for computing the 16-DHT is 3 μs. The area consumption of the Hadamard-based architecture is better, using, for example, about 85% fewer slices than the matrix definition architecture, as can be seen in Table 1. The results point to the feasibility of applications such as audio and image processing for teleconferencing, distance learning and medical investigations. Improvements that include an analog interface for response visualization are currently under way.

6. REFERENCES

[1] K. J. Olejniczak and G. T. Heydt, “Special section on the hartley transform,” Proceedings of the IEEE, vol. 82, pp. 372–447, Mar. 1994.
[2] P. Meher, T. Srikanthan, J. Gupta, and H. K. Agarwal, “Near lossless image compression using lossless hartley like transform,” ICICS - PCM, vol. 19, Dec. 2003.
[3] P. Rajavel, “Directional hartley transform and content based image retrieval,” Elsevier - SigPro, vol. 90, pp. 1267–1278, Nov. 2009.
[4] J. I. Guo, “An efficient design for one-dimensional discrete hartley transform using parallel additions,” IEEE Transactions on Signal Processing, vol. 48, 2000.
[5] H. Bogucka, “Effective implementation of the ofdm/cdma base station transmitter using joint fht and ifft,” Proc. IEEE Workshop Signal Process. Adv. Wireless Commun., pp. 162–165, 1999.
[6] R. V. L. Hartley, “A more symmetrical fourier analysis applied to transmission problems,” Proc. IRE, vol. 30, pp. 144–150, Mar. 1942.
[7] R. N. Bracewell, “Discrete hartley transform,” J. Opt. Soc. Amer., vol. 73, pp. 1832–1835, Dec. 1983.
[8] R. N. Bracewell, “The fast hartley transform,” Proc. IEEE, vol. 72, pp. 1010–1018, Aug. 1984.
[9] R. Vasanthan, K. Prabhu, and P. Sommen, “An analysis of real-fourier domain-based adaptative algorithms implemented with the hartley transform using cosine-sine symmetries,” IEEE Transactions on Signal Processing, vol. 53, no. 2, Feb. 2005.
[10] R. Merched and A. H. Sayed, “An embedding approach to frequency domain and subband adaptive filtering,” IEEE Transactions on Signal Processing, vol. 48, no. 9, pp. 2607–2619, Sep. 2000.
[11] C. Chakrabarti and J. JaJa, “Systolic architectures for the computation of the discrete hartley and the discrete cosine transforms based on prime factor decomposition,” IEEE Transactions on Computers, vol. 39, no. 11, pp. 1359–1368, Nov. 1990.
[12] D. Yang, “Prime factor fast hartley transform,” Elect. Letters, vol. 26, no. 2, pp. 119–121, Jan. 1990.
[13] S. Winograd, “On computing the discrete fourier transform,” Math. Comp., vol. 32, pp. 175–199, 1978.
[14] M. Marchesi, G. Orlandi, and F. Piazza, “A systolic circuit for fast hartley transform,” Proceedings of the (ISCAS’88), vol. 39, no. 11, pp. 2685–2688, 1988.
[15] R. Blahut, Fast Algorithms for Digital Signal Processing. Addison-Wesley, 1985.
[16] G. Bi and Y. Chen, “Fast dht algorithms for length n=q*2m,” IEEE Trans. on Signal Processing, vol. 47, no. 3, pp. 900–903, Mar. 1999.
[17] R. Cintra, “Transformada rapida de hartley: Novas fatoraçoes e um algoritmo aritmetico,” Dissertaçao de Mestrado - UFPE - CTG, 2001.
[18] H. de Oliveira, R. Cintra, and R. Campello, “Multilayer hadamard decomposition of discrete hartley transforms,” SBrT 2000 - XVIII Simposio Brasileiro de Telecomunicaçoes, Set 2000.
[19] R. C. de Oliveira, H. M. de Oliveira, R. Campello, and E. Santos, “A flexible implementation of a matrix laurent series-based 16-point fast fourier and hartley transforms,” IEEE Proceedings of VI Southern Programmable Logic Conference, pp. 175–178, Mar. 2010.
[20] S. A. Parthasarathy Ranganathan and N. P. Jouppiy, “Performance of image and video processing with general-purpose processors and media isa extensions,” Proceedings of the IEEE, Aug. 2002.

IMPLEMENTATION OF MODBUS ON FPGA USING VHDL - DATA-LINK LAYER -

Guanuco Luis, Panozzo Zenere Jonatan, Olmedo Sergio, Rubio Agustin*

Centro Universitario de Desarrollo en Automación y Robótica “CUDAR”
FRC / UTN
Córdoba, Argentina

ABSTRACT

Hardware description through VHDL (VHSIC Hardware Description Language) programming allows wide flexibility in digital circuit design. This article presents a description of the communication between Programmable Logic Devices (PLDs) according to the MODBUS protocol. This widely accepted communication standard defines protocols for the “Application,” “Data-link,” and “Physical” layers. This document explains the development of said standard in the Data-link layer, and offers a summary of the way this layer interacts with the other two. To this end, the descriptions of the main blocks, synthesis, simulation and, finally, the implementation on FPGA (Field-Programmable Gate Array) devices are included.

1. INTRODUCTION

Developing any communication protocol requires considering levels of abstraction for the handling of information, as well as different forms of implementation in both hardware and software. The OSI model is used to define these design guidelines.
The OSI (Open System Interconnection) model is a reference framework for defining interconnection architectures of communication systems, developed by the International Organization for Standardization [1]. It allows the developer to follow a given structure for handling the information in the network, Fig. 1.
Each level of this model is governed by the specifications of the protocol. The model imposes a level of abstraction in which communication takes place between layers of the same level in two or more devices. However, communication actually exists only between adjacent layers of the same device, which connects to another device solely through the physical layers.

Fig. 1. OSI model with its different levels.

1.1. Data-Link Layer

MODBUS [2] [3] defines a protocol in this layer for serial communication between a single Master device and from one (point-to-point connection) to 247 Slaves (multipoint connection).
A communication is always initiated by a Master, so a Slave only transmits information after a request; it follows that direct communication between Slaves is not possible. Each of these devices has a specific address that distinguishes it.
The Master device can transmit data in two different modes: Unicast or Broadcast. The first consists of a request from the Master to a specific Slave, always followed by that Slave's response. The second is a transmission from the Master to all Slaves at the same time, with no response from any of them.
MODBUS allows the information on the network to be encoded in two different ways, RTU and ASCII [3].
RTU (Remote Terminal Unit): a synchronous encoding in which the data are presented as consecutive bits forming data frames, whose start and end are indicated by time intervals.
ASCII (American Standard Code for Information Interchange): characterized by being asynchronous, with the information encoded in ASCII characters. The data frame begins and ends with defined characters.
The transmission and reception times of a frame differ greatly between these two encoding modes. In ASCII mode, the data must be converted into their corresponding characters, expressed in hexadecimal format. In RTU mode, by contrast, the information takes the form of consecutive bits, allowing a greater flow of information over the network in the same time than in ASCII mode.

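The difference in throughput between the two encodings can be made concrete by counting transmitted characters. The sketch below is illustrative: the function names are hypothetical, the field sizes (one address byte, one function byte, a two-byte CRC in RTU, a two-character LRC in ASCII) are taken from the MODBUS serial line specification rather than from this text, and the idle intervals that delimit RTU frames are ignored.

```python
def rtu_frame_bytes(n_data):
    """Bytes in an RTU frame: address + function + data + 16-bit CRC."""
    return 1 + 1 + n_data + 2


def ascii_frame_chars(n_data):
    """Characters in an ASCII frame.

    ':' start character, two hex characters for each address, function
    and data byte, two hex characters of LRC, then CR and LF.
    """
    return 1 + 2 * (1 + 1 + n_data) + 2 + 2


# Compare both encodings for a few payload sizes (in data bytes).
for n in (0, 4, 16):
    print(n, rtu_frame_bytes(n), ascii_frame_chars(n))
```

Under these assumptions an ASCII frame always carries twice the RTU length plus one character, which is consistent with the statement that RTU moves more information in the same time.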
1.2. ASCII Encoding

ASCII mode is chosen for the present development because of the better legibility of the information. In this mode, the frame travelling on the bus can be observed by connecting to it a device capable of interpreting ASCII characters. This is a fundamental feature when an analysis is to be carried out at any point of a network where MODBUS is applied.
Encoding in ASCII mode uses a frame delimited by one start character “:” and two end characters “CR (Carriage Return) – LF (Line Feed)”. The message lies between these characters, distributed as shown in Fig. 2. The four fields that make up the message are:
• Address: of the slave device taking part in the communication.
• Function Code: codes pre-established by MODBUS that determine the operations the slave must carry out.
• Data: the information.
• CRC/LRC: field used for error detection (not correction).

Fig. 2. MODBUS frame in ASCII mode [3].

2. DESIGN

Implementing a MODBUS protocol on FPGAs requires a design in some hardware description language, based largely on the development of finite state machines.
The generation of a frame begins with the sending of a character that marks its start. The address, function, data and LRC error-check fields are then transmitted consecutively, followed by the end-of-frame characters. Frame reception is approached in a similar way. The frame encoding/decoding states are defined in general form in Fig. 3.
The Reception and Transmission blocks are considered the most relevant specific blocks of the design; they are defined as state machines, at component level, within the main VHDL description.
State machines are classified into two types: “Moore” and “Mealy” [4]. They differ in whether or not the outputs depend on the state of the inputs.
Given the requirements of the MODBUS data-link layer, Mealy-type state machines are chosen for the implementation, Fig. 4.

Fig. 3. State diagram for transmission and reception in ASCII mode [3].

2.1. RAM Block

The information contained in the MODBUS frame must be stored in registers so that it can be handled at the different levels. In this sense, this block acts as a bridge between the “Data-link” and “Application” layers.
The former stores in the RAM the data received in the message, preparing the service of the application layer. The latter takes the data from the RAM, processes them, and writes into it the information to be transmitted.
It is possible to use either a RAM block already embedded in the logic device (FPGA) or a descriptive RAM block.
RAM blocks embedded in FPGAs, also called primitive RAM blocks, are physically present on the chip [6]; they consist of inputs/outputs, an address bus and control signals. The limitation in using them is that no descriptive model of them is available, which restricts the design, since the use of physical resources cannot be reduced, in addition to the dependence on the hardware to be used.
With a descriptive RAM block, timing analysis can be performed and the number of logic blocks can be reduced according to the needs of the implementation. For this reason, and for research purposes, this type of memory block is adopted in the present work. However, the overall design occupies more resources, given that the primitive RAMs are in any case built into the chip and available.

Fig. 4. Mealy state machine diagram [5].

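The ASCII frame layout of Section 1.2 (start character, address, function code, data, error check, CR-LF) can be modelled in a few lines of software. The following Python sketch is not the VHDL design described here: the function names are hypothetical, and the LRC rule used (two's complement of the 8-bit sum of the address, function and data bytes) is taken from the MODBUS serial line specification, since this text does not spell it out.

```python
def lrc(payload: bytes) -> int:
    """LRC check byte: two's complement of the 8-bit sum of the payload."""
    return (-sum(payload)) & 0xFF


def build_ascii_frame(address: int, function: int, data: bytes = b"") -> bytes:
    """Encode address, function and data as ':' + hex characters + LRC + CR LF."""
    payload = bytes([address, function]) + data
    body = (payload + bytes([lrc(payload)])).hex().upper()
    return (":" + body + "\r\n").encode("ascii")


def parse_ascii_frame(frame: bytes):
    """Decode a frame and verify its delimiters and LRC.

    Returns (address, function, data); raises ValueError on any error.
    """
    text = frame.decode("ascii")
    if not (text.startswith(":") and text.endswith("\r\n")):
        raise ValueError("missing start or end-of-frame characters")
    raw = bytes.fromhex(text[1:-2])      # two hex characters per byte
    if len(raw) < 3:
        raise ValueError("frame too short")
    payload, check = raw[:-1], raw[-1]
    if lrc(payload) != check:
        raise ValueError("LRC mismatch")
    return payload[0], payload[1], payload[2:]


# Illustrative request: slave 0x11, function 0x03, four data bytes.
frame = build_ascii_frame(0x11, 0x03, bytes([0x00, 0x6B, 0x00, 0x03]))
print(frame)
print(parse_ascii_frame(frame))
```

A hardware receiver would walk through the same fields one character at a time in a state machine; the sketch only checks the framing and the error-detection field in one pass.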
2.2. Transmitter and Receiver

The Transmitter must generate the frame to be sent, in both the Master and the Slaves. It is designed as a state machine that waits for the application layer to finish writing the RAM block. The state machine then performs successive reads from the RAM block, sending the characters one by one and respecting the start-of-frame and end-of-frame markers.

Reception, like transmission, uses a state machine, which must comply with the specifications of the encoding mode. In this case the information arrives serially from the “Physical” layer. The data are stored in the RAM block, and during that time the reception block has exclusive write control over the memory.

Consequently, an access controller for the RAM block is required, since several components need to write to and/or read from it.

2.3. UART

MODBUS defines the “MODBUS over Serial Line” protocol for layers 1 and 2 of the OSI model [3]. This implies the use of a UART (Universal Asynchronous Receiver Transmitter) to transmit and receive data serially.

The UART thus connects the “Link” layer to the “Physical” layer. The latter can be any serial communication standard, such as RS232 or RS485, which is the one adopted in this development.

Like the other blocks, this one is described in VHDL. In general terms, it presents serially received data as a parallel output. Conversely, it receives the data to be transmitted in parallel and sends the information bits serially, observing the selected speed settings and the conditions set by the MODBUS protocol for the composition of the word to be sent: start, data, parity, and stop bits [3].

3. SYNTHESIS AND IMPLEMENTATION

The implementation is carried out on a Xilinx Spartan 2E XC2S200E FPGA [6]. The synthesizer is XST (Xilinx® Synthesis Technology) [7], a tool that is part of the Xilinx ISE WebPack package [8] available at the research center where the development takes place.

The FPGA has a large number of physical resources (Fig. 5). The main ones are listed below:

• Input/output blocks.
• Configurable logic block.
• RAM blocks.
• Clock distribution: DLL (Delay-Locked Loop).
• Boundary Scan.

Fig. 5. Block diagram of the Spartan-IIE FPGA family [6].

With the design guidelines already presented, and the blocks that make up our description identified, the synthesis result is presented in Table 1.

Table 1. Resource utilization summary
FPGA device: 2S200EPQ208-6Q

Resource          Used   Available   Percentage
Slices            194    2352        8%
Flip Flops        239    4704        5%
LUTs (logic)      334    4704        7.1%
LUTs (RAM)        8      4704        0.1%
Inputs/Outputs    37
Connected IOBs    37     142         26%
GCLKs             1      4           25%

Table 1 shows that few resources are used, since the device has a large number of CLBs (Configurable Logic Blocks). Simulation, verification, and subsequent simplification of the description are nevertheless very important to make better use of resources with a view to implementing the design on different logic devices.

A more detailed analysis highlights the importance of not implementing primitive elements in the current project. In the case of the descriptive RAM block, only a small, fixed number of elements is needed, which allows it to be instantiated even on smaller logic devices, for example CPLDs. If a larger RAM block is needed, the use of primitive RAM blocks should be considered, after a prior study of the device to be used. As with the previous consideration, all the resources required by the project and those available in the target hardware must be taken into account.
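
The word format mentioned in Section 2.3 can be modelled in a few lines of Python. The sketch assumes the usual MODBUS ASCII character format (one start bit, seven data bits sent least significant bit first, one even-parity bit, one stop bit); the function names and defaults are illustrative and are not taken from the VHDL description.

```python
def uart_frame(value, data_bits=7, even_parity=True):
    """Serialize one character into the bit sequence placed on the line."""
    data = [(value >> i) & 1 for i in range(data_bits)]  # LSB first
    parity = sum(data) % 2  # even parity: total number of ones becomes even
    if not even_parity:
        parity ^= 1
    return [0] + data + [parity] + [1]  # start bit, data, parity, stop bit


def uart_decode(bits, data_bits=7, even_parity=True):
    """Recover the character from a received bit sequence, checking framing."""
    if len(bits) != data_bits + 3 or bits[0] != 0 or bits[-1] != 1:
        raise ValueError("framing error")
    data = bits[1:1 + data_bits]
    expected = sum(data) % 2
    if not even_parity:
        expected ^= 1
    if bits[1 + data_bits] != expected:
        raise ValueError("parity error")
    return sum(bit << i for i, bit in enumerate(data))


# The ':' character (0x3A) that opens a MODBUS ASCII frame.
bits = uart_frame(0x3A)
print(bits, hex(uart_decode(bits)))
```

Each character therefore occupies ten bit times on the line, which is the unit the transmitter and receiver state machines have to count.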

Using a single clock to synchronize the CLBs is more flexible in the design than having several external clocks connected to the FPGA. However, this comes at the cost of physical resources, since a clock divider implemented with logic blocks is synthesized as a counter.
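
To make the cost of a logic-based clock divider concrete, the following Python sketch models a divider as a counter and estimates how many flip-flops the counter needs. The class name, the 50 MHz input clock, and the 9600 Hz target are hypothetical values chosen only for illustration; they are not taken from the design.

```python
class ClockDivider:
    """Behavioural model of a counter-based clock divider."""

    def __init__(self, divisor):
        self.divisor = divisor
        self.count = 0

    def tick(self):
        """Advance one input clock cycle; return True when an output pulse fires."""
        self.count += 1
        if self.count == self.divisor:
            self.count = 0
            return True
        return False


def counter_bits(divisor):
    """Flip-flops needed to hold counter values 0 .. divisor - 1."""
    return max(1, (divisor - 1).bit_length())


divider = ClockDivider(4)
pulses = [divider.tick() for _ in range(8)]
print(pulses)

# Hypothetical case: deriving a 9600 Hz tick from a 50 MHz clock.
divisor = 50_000_000 // 9600
print(divisor, counter_bits(divisor))
```

The register width grows with the logarithm of the division ratio, which is the resource consumption the paragraph above refers to.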
The RTL (Register Transfer Level) view provides a graphical representation of the design described in VHDL, showing the final components (Fig. 6).

Fig. 6. Final RTL representation.

4. SIMULATION

Simulation is fundamental to the synthesis and implementation process. Based on the specifications of the MODBUS “Link” layer, the possible communication cases are presented for both transmission and reception. The simulation process is structured according to the scheme in Fig. 7. This creates a loop that leads to the correct operation of the digital system.

Fig. 7. Validation process for a digital electronic design using VHDL [9].

Simulation not only provides useful information for correcting synthesis problems, it also allows the frame to be validated, as shown in Fig. 8.

Fig. 8. Simulation of a transmission and reception frame in the MODBUS Link layer.

5. CONCLUSION

Based on the specifications of the MODBUS protocol, a fully descriptive VHDL development has been achieved. This provides flexibility in the design of digital systems as well as portability in the implementation on PLDs.

Research into the implementation of MODBUS in embedded systems shows a preference for the use of microcontrollers. Technological progress, its evolution into new architectures, and software tools have strengthened this trend. Even so, the abstraction of the VHDL language has made it possible to satisfy the specifications of MODBUS embedded in an FPGA.

6. REFERENCES

[1] MODBUS-IDA.ORG, “Modelo OSI”. http://es.wikipedia.org/. 2010.
[2] MODBUS-IDA.ORG, “MODBUS application protocol specification”, V1.1b. http://www.MODBUS.org, 2010.
[3] MODBUS-IDA.ORG, “MODBUS over serial line specification and implementation guide”, V1.02. http://www.MODBUS.org, 2010.
[4] K. Kuusilinna, V. Lahtinen, T. Hämäläinen, J. Saarinen, “Finite state machine encoding for VHDL synthesis”, IEEE Proc.-Comput. Digit. Tech, Vol. 148, No. 1, January 2001.
[5] A. Iborra and J. Suardiaz, “Diseño de Sistemas Electrónicos-DB4”, Diseño Basado en Máquinas de Estado Finitas, Uni. 8. May 2003.
[6] Xilinx® Inc., “Spartan-IIE 1.8V FPGA Family: Functional Description”, v2.1, Product Specification. July 2003.
[7] ©2002-2008 Xilinx, “ISE 10.1 Quick Start Tutorial”. http://www.xilinx.com/. August 2010.
[8] Xilinx® Inc. “ISE WebPACK Design Software”. http://www.xilinx.com/. August 2010.
[9] J. Jiménez, E. Fernández, J. Martin, U. Bidarte, A. Zuloaga. “Simulation environment to verify industrial communication circuits”. University of the Basque Country, Department of Electronics and Telecommunications, 2002.

MUSIC SEQUENCER ON A FPGA BOARD

Matías López-Rosenfeld, Patricia Borensztejn
Departamento de Computación
Facultad de Ciencias Exactas y Naturales
Universidad de Buenos Aires
email: {mlopez, patricia}@dc.uba.ar

Francisco Laborda
Instituto de Ciencias
Universidad Nacional de General Sarmiento
email: [email protected]

ABSTRACT

In this paper we present an application implemented on a Xilinx FPGA. The application plays polyphonic musical pieces by synthesizing digital audio signals. It was designed to run on a Digilent Spartan 3E Starter Kit. It is built on the ideas of the MIDI protocol for its input, and its main output is a PWM (pulse width modulation) bit, which is amplified using another Digilent module (PMOD-AMP1).

1. INTRODUCTION

In the late 1960s, the use of digital synthesizers became widespread in popular music. In 1983, the MIDI (Musical Instrument Digital Interface) protocol [1] standardized communication between different brands of synthesizers and made it possible to program them. This revolutionized the way music was made, since anyone could program music even without being a good performer.

It should be noted that MIDI does not transmit audio signals, but event data and controller messages that can be interpreted according to the programming of the receiving device. In other words, MIDI is a kind of “score” containing instructions, as numeric values (0-127), about when to generate each note and the characteristics it should have. The device that receives this score turns it into audible music.

In this work we created an application that serves as a starting point for the development of a polyphonic sequencer based on the principles of the MIDI protocol.

This paper is organized as follows: Section 2 describes the project, Section 3 explains the system design, Section 4 describes each module individually, Section 5 discusses the verification of the project, and Section 6 presents the synthesis information. Finally, Section 7 presents the conclusions.

2. PROJECT DESCRIPTION

The project consists of the Verilog implementation of a polyphonic music sequencer on an FPGA (field-programmable gate array). Given a sequence of NoteOn and NoteOff events (as in the MIDI protocol), it plays a musical piece with sounds synthesized on the board itself.

The synthesized audio waveforms are discrete-valued sawtooth waves. Polyphony was achieved by time-multiplexing the different voices.

The project was implemented on the Starter Kit from Digilent Inc., donated by Xilinx through the Xilinx University Program [2]. The kit contains a Xilinx Spartan-3E XC3S500E FPGA [3]. In addition, a Digilent Inc. PMOD-AMP1 [4] was used to convert the modulated digital signal into an analog signal that can be played through a speaker.

A similar, complete implementation of a monophonic synthesizer on an FPGA can be found in the project [5].

2.1. Input

Just as a musician reads a score, so does our project. It has an input from which it can read when to start playing a note and when to stop playing it.

[Figure 1: block diagram showing the Metronome (50 MHz clock input), the PC register, the score table, four voices (Voz1 to Voz4, each with a tone table and an NCO with a 22-bit output), and the Mixer with its 22-bit and 1-bit PWM outputs.]

Fig. 1. Data Path.

In our implementation we did not focus on how the performance data are entered. We store the data needed to play a musical piece in a table, which stands in for a real input. This leaves the way open to entering data by other means (in real time over a serial port from a PC or a MIDI controller, through a PS/2 keyboard, etc.).

We refer to this table, which stands in for the input, as the “Score Table” (Tabla Partitura).

2.2. Audio output

There are two audio outputs, which carry the digital audio resulting from playing the score taken at the input.

One of these outputs is a discrete digital sawtooth signal on a 22-bit bus, which cycles at the frequency associated with the tone to be played. In our project this output is not used, but it remains available for any other digital-to-analog conversion.

The other output is the pulse-width-modulated (PWM) version of the output mentioned above. It is a 1-bit signal that alternates between 0 and 1 in cycles at the frequency associated with the tone to be played. This output is used: it is received by the PMOD-AMP1 and converted into audio that can be played through any speaker.

3. DESIGN

The Data Path (see Fig. 1) of the project consists of the following modules:

• The Metronome, which is a divider of the board's clock frequency.
• The PC Register, which indexes the Score Table.
• The Score Table, which contains the performance data of the musical piece.
• The Tone Table, which translates each musical note to be played into the value the Oscillator needs to generate the corresponding signal.
• The Oscillator (NCO), which generates a sawtooth signal with numerically controlled frequency.
• The Mixer, which combines the signals of the different oscillators into one.

4. IMPLEMENTATION

4.1. Metronome module

This module is simply a divider of the board's clock frequency. Its output goes from 0 to 1 to indicate that one time unit of the musical performance has elapsed.

In future implementations, it will be possible to change the tempo of the piece during playback simply by changing the value by which the frequency is divided in this module.

4.2. PC Register

This register works as a counter of the number of pulses emitted by the Metronome. Its value is used to index the Score Table.
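
The interaction between the Metronome, the PC Register, and the score table can be sketched as a small behavioural model in Python. This is only an illustration under stated assumptions: the `Event` fields follow the row layout described in the paper, but the function name, the way events are looked up, and the example divisor are hypothetical and do not reproduce the Verilog implementation.

```python
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: int   # value of the PC register at which the event applies
    note: int        # MIDI-style note number (0-127)
    note_on: bool    # True for NoteOn, False for NoteOff
    voice: int       # voice that plays the note (0-3)


def play(score, clock_cycles, divisor, voices=4):
    """Return the list of (pc, active notes per voice) seen at each metronome pulse."""
    active = [None] * voices
    pc = 0       # PC register: counts metronome pulses
    count = 0    # Metronome: divides the board clock by `divisor`
    history = []
    for _ in range(clock_cycles):
        count += 1
        if count < divisor:
            continue
        count = 0  # metronome pulse
        for event in score:
            if event.timestamp == pc:
                active[event.voice] = event.note if event.note_on else None
        history.append((pc, tuple(active)))
        pc += 1
    return history


score = [
    Event(0, 69, True, 0),
    Event(1, 72, True, 1),
    Event(2, 69, False, 0),
]
for step in play(score, clock_cycles=12, divisor=4):
    print(step)
```

Changing `divisor` while the loop runs would change the tempo, which is the extension suggested for the Metronome module.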

4.3. Score Table

This table simulates an actual input. As mentioned in 2.1, in future versions it could be replaced by an input in another format.

Each of its rows (see Fig. 2) represents a musical performance event at a given moment, which we call the timestamp. The timestamp comes from the value stored in the register described in 4.2.

[Figure 2: row fields, in order: timestamp, note number, NoteOn/NoteOff, voice number; the bit widths 8 and 7 are marked in the figure.]

Fig. 2. Layout of a row of the Score Table.

An event consists of: the note involved; a voice (through which the sound will be generated); and a binary value indicating whether it represents the start or the end of the sound of that note on that voice (NoteOn/NoteOff).

At this first stage the application can only play up to 4 notes simultaneously. They start and stop sounding at the timestamp indicated in the score.

4.4. Tone Table

This table stores the precomputed values that serve as upper limits for the oscillator counters in order to obtain the desired frequencies. Since the counters are discrete, the generated frequencies may have an error, but it is negligible to the human ear.

The operation is simple to explain: since the internal clock runs at 50 MHz, the question is how many tics we need to count to slow this frequency down to that of the desired note. We then count from 0 up to that number over and over to obtain a signal at the frequency of that note.

The tics of the FPGA clock:

$$50\,\text{M cycles} = 1\ \text{s} \tag{1a}$$
$$1\ \text{cycle} = \frac{1}{50\,\text{M}}\ \text{s} \tag{1b}$$
$$1\ \text{tic}_{FPGA} = \frac{1}{50\,\text{M}}\ \text{s} \tag{1c}$$

The tics of the middle “La” (A):

$$440\ \text{cycles} = 1\ \text{s} \tag{2a}$$
$$1\ \text{cycle} = \frac{1}{440}\ \text{s} \tag{2b}$$
$$1\ \text{tic}_{LA} = \frac{1}{440}\ \text{s} \tag{2c}$$

FPGA clock tics needed to obtain a La:

$$\frac{1}{50\,\text{M}}\ \text{s} = 1\ \text{tic}_{FPGA} \tag{3a}$$
$$\frac{1}{440}\ \text{s} = x = \frac{50\,\text{M}}{440}\ \text{tics}_{FPGA} \approx 113636 = \texttt{0x1BBE4} \tag{3b}$$

In this way we store the precomputed values for the 128 notes of the musical range covered by the MIDI protocol.

[Figure 3: sawtooth of the counter value v over time t, rising from 0x0 to 0x1BBE4.]

Fig. 3. Tic counter to obtain a middle “La”.

[Figure 4: sawtooth of the counter value v over time t, between 0x1F220D and 0x20DDF2, centered on 0x200000.]

Fig. 4. Tic counter centered on 0x200000 to obtain a middle “La”.

4.5. Oscillator module (NCO)

This module is responsible for generating the signal that represents a given frequency. Its output is a sawtooth signal that oscillates at a rate determined by its input. To generate the sawtooth, the module counts tics of the FPGA's internal clock. The input of this module is therefore the number corresponding to the desired note according to 4.4, and the output is the value of the counter (see Fig. 3).

When we say that a voice is activated on a note, we mean that one of the four Oscillators starts to output a sawtooth signal that “oscillates digitally” at the frequency associated with that note (for example, 440 Hz for the middle La). When we say that a voice is deactivated, the Oscillator associated with that voice holds a constant zero at its output.

This module also centers the signal. By centering we mean that the midpoint of our 22-bit representation is always reached at the middle of the range being traversed. Thus, instead of counting from 0 to n, we count from p to q, where p < q, p is the bitwise negation of q, and q − p ≅ n (see Fig. 4).
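
The tone-table values and the centered counting range can be checked numerically with a short Python sketch. Two points are assumptions rather than statements from the paper: the mapping from note number to frequency uses the common equal-tempered convention in which note 69 is the 440 Hz La, and the bounds are derived as q = 0x200000 + n // 2 with p the 22-bit negation of q, which happens to reproduce the values shown in Fig. 4. All function names are hypothetical.

```python
CLOCK_HZ = 50_000_000
WIDTH = 22
MASK = (1 << WIDTH) - 1      # 0x3FFFFF
MID = 1 << (WIDTH - 1)       # 0x200000


def note_frequency(note):
    """Equal-tempered frequency, assuming note 69 is the 440 Hz La."""
    return 440.0 * 2 ** ((note - 69) / 12)


def tics_for_note(note):
    """Number of clock tics in one period of the note (tone-table entry)."""
    return round(CLOCK_HZ / note_frequency(note))


def centered_bounds(n):
    """Return (p, q) with p the bitwise negation of q and q - p close to n."""
    if n > MASK:
        raise ValueError("n does not fit in the 22-bit range")
    q = MID + n // 2
    p = ~q & MASK
    return p, q


def sawtooth(n, samples):
    """First `samples` counter values of the centered sawtooth for period n."""
    p, q = centered_bounds(n)
    value = p
    out = []
    for _ in range(samples):
        out.append(value)
        value = p if value == q else value + 1
    return out


n = tics_for_note(69)
p, q = centered_bounds(n)
print(hex(n), hex(p), hex(q))
```

For the middle La this gives n = 0x1BBE4, p = 0x1F220D and q = 0x20DDF2, so the counter spends half of each period below 0x200000 and half at or above it.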

4.6. Mixer module

This module time-multiplexes its four inputs, which come from the oscillators, at 100 kHz. It has two outputs, one of 1 bit and one of 22 bits.

We chose to use the 1-bit output and send a PWM signal through it, since we had the PMOD-AMP1 available for our work.

To understand what this module does, let us look at what data we have at the input and what we need to do to get what we want at the output. The oscillator in 4.5 delivers a centered signal, a property we exploit to build our output bit. We send a 1 when the input signal is above the midpoint value and a 0 when it is below. This information is already given by the most significant bit of the input signal, so we only need to route this bit to the output; the PMOD-AMP1 can then build the signal from it and play the expected audio.

To mix the 4 voices into a single signal, we alternate between them. Technically, at most one voice sounds at any given moment. Because the switching from one voice to the next is so fast, the human ear cannot distinguish the changes, which creates the sensation of hearing a chord.

The second output of this module is a full 22-bit signal which, although not currently used in our implementation, could be routed to a DAC (digital-to-analog converter) to send it to a speaker. This signal is mixed in the same way as the 1-bit output, but for 22 bits.

5. VERIFICATION

Given the subjective nature of music and the difficulty of designing testbenches for oscillator outputs, we chose to verify the software experimentally, directly on the board.

Although the sounds obtained are pure tones (or combinations of them) and therefore not very pleasant to the ear, the musical pieces tested were successfully recognized by a variety of listeners.

6. SYNTHESIS

Table 1 shows the result of synthesizing our project with XST for the Spartan-3E (XC3S500E) board, without any score loaded. We present the synthesis without the score because the idea is for the score to move off the board and become an input of another kind, as explained in 2.1.

Table 1. Synthesis results.

Component           Used        Percentage
Slices:             150/4656    3%
Slice Flip Flops:   107/9312    1%
4 input LUTs:       264/9312    2%
IOs:                -           21 %
Bonded IOBs:        17/232      7%

7. CONCLUSION

A music sequencer was implemented that can play musical pieces with at most four voices sounding simultaneously.

A notable point is that it was implemented from scratch, without using pre-existing code, in an effort to understand the basic mechanisms involved in the reproduction and generation of sounds and, in this case, music.

A valuable aspect of the project is that it lays a foundation and a starting point for the development of several different projects. It could evolve along different paths into a programmable synthesizer or one that can be played live, a MIDI sequencer, a sequencer based on audio samples, or even an effects synthesizer or a drum machine, among other things.

8. PROJECT CONTEXT

This work was carried out as part of the course Diseño de Sistemas con FPGA, in the Departamento de Computación of the Facultad de Ciencias Exactas, Universidad de Buenos Aires, taught by Dr. Patricia Borensztejn during the first term of 2010.

9. REFERENCES

[1] MIDI Manufacturers Association Incorporated, “The complete midi 1.0 detailed specification,” http://www.midi.org/techspecs/midispec.php.
[2] Xilinx, “Spartan-3e fpga starter kit board user guide,” http://www.xilinx.com/support/documentation/boards_and_kits/ug230.pdf.
[3] Xilinx, “Xilinx university program,” http://www.xilinx.com/university/index.htm.
[4] Digilent Inc., “Pmodamp1TM speaker/headphone amplifier reference manual,” http://www.digilentinc.com/Data/Products/PMOD-AMP1/PmodAMP1_rm_RevB.pdf.
[5] S. Gravenhorst, “GateMan I,” http://www.fpga.synth.net/pmwiki/pmwiki.php?n=FPGASynth.GateManI.

FLEXIBLE PLATFORM FOR REAL-TIME VIDEO AND IMAGE PROCESSING

Paulo da Cunha Possa, Zied El Hadhri, Laurent Jojczyk and Carlos Valderrama

Department of Electronics and Microelectronics, University of Mons
Boulevard Dolez, 31, 7000 Mons, Belgium
{paulo.possa, zied.elhadhri, laurent.jojczyk, carlos.valderrama}@umons.ac.be

ABSTRACT

This work provides a platform for real-time image and video processing that enables the exploration and evaluation of different processing techniques. The goal of our approach is to provide a flexible environment for prototyping the different processing techniques on Field Programmable Gate Arrays (FPGAs), easily customizable to specific target applications and suitable for educational purposes. In this paper we give an overview of different requirements and techniques of video processing featuring FPGAs. Three real-time video processing algorithms were combined to show the advantages and characteristics of our approach. Within this system, the modules running in parallel can be easily selected at run time according to the application needs.

Index Terms: Video signal processing, field programmable gate array, tracking, object detection, embedded system.

1. INTRODUCTION

The performance requirements of image and video processing applications have driven increases in the computing power of implementation platforms, especially when real-time constraints need to be met [1]. Traditional implementations of image and video processing designs are based on Digital Signal Processors (DSPs) or Application Specific Integrated Circuits (ASICs). However, FPGAs have shown very high performance in many applications in this field [2][3].

FPGAs hold a clear advantage over conventional DSPs for digital signal processing: their scalability (the capacity to replicate functions as required) and their inherent parallelism. Also, current FPGA devices provide several attractive features for implementing DSP algorithms, e.g. high performance input and output pins, large memory blocks, embedded multipliers and microprocessors [4].

Many works on video and image processing on FPGAs can be found, covering the most diverse applications. Within those design environments, we are interested in how the different processing algorithms can be combined in a flexible way to satisfy specific application requirements. One example of this is the work [5] about a multipurpose reconfigurable platform, where more than one process can be applied simultaneously to the incoming video signal. In that approach, a user-specific functional module implements most of the required functionalities. This functional block can be extended depending on different application scenarios.

Nowadays, students should evolve from software development to architecture design in order to satisfy such requirements. They must master algorithm selection and processing power requirements while working in a design framework built around the latest video/image standards. By adopting a video processing platform based on FPGAs, we can provide real-time exploration in a flexible and evolving environment, populated by a growing set of technology bricks. In this context, the objective of this work is to provide a flexible environment for prototyping the different processing techniques on FPGAs, easily customizable to specific target applications and suitable for educational purposes.

In the following section, we detail the characteristics of our video processing platform. We also describe the processing algorithms implemented in our system in order to evaluate it in terms of performance and effectiveness. Section 3 presents the results obtained in the evaluation. Finally, in Section 4, we conclude with an analysis of the results and provide future directions for our work.

2. THE VIDEO PROCESSING PLATFORM

The board chosen to host our system is the Altera DE2 Development and Education Board (Fig. 1). This choice was motivated by the educational purpose of the DE2 board, which offers accessible components for debugging (e.g. toggle switches, debounced pushbutton switches, and LEDs) and a complete set of peripherals, including a 24-bit audio codec, a USB host/slave controller, a 10/100 Ethernet controller, 8 MB of SDRAM, a TV decoder, and a 10-bit VGA DAC. The FPGA on the DE2 board is Altera's low-cost EP2C35 device from the Cyclone II family. The EP2C35 contains 33,216 Logic Elements, 105 M4K RAM blocks, 35 embedded multipliers, and 4 PLLs.

The proposed video processing platform takes advantage of the already available TV Decoder (ADV7181B) and Video DAC (ADV7123) to create a low-cost environment for video processing. Fig. 2 shows a

simplified diagram of the video framework architecture.

[Figure 1: photograph of the board with labelled components: TV Decoder ADV7181B, Video DAC ADV7123, Cyclone II EP2C35, SDRAM 8 MB, SRAM 512 kB.]

Fig. 1. Altera DE2 Development and Education Board.

[Figure 2: Video Camera, TV Decoder ADV7181, Cyclone II 2C35 (Video Input, Video Processing, Video Output), Video DAC ADV7123, VGA Monitor.]

Fig. 2. Simplified diagram of the video platform architecture.

The Video Input and Output modules (Fig. 2) are based on the DE2 TV box demonstration supplied with the DE2 board by Altera. Customized video processing operators can be placed between these two modules.

In the Video Input module (Fig. 3), the ITU-R 656 Decoder block extracts YUV 4:2:2 video signals from the ITU-R 656 data stream. As the input is an interlaced video signal, a Deinterlacer block is employed to convert the input video stream into a progressive format. The Deinterlacer has an interface to the on-board SDRAM, where the two fields (F0 and F1) of the interlaced frame are stored. After that, the chroma components of the video stream are up-sampled by the YUV 4:2:2 to 4:4:4 block. Finally, the video stream is converted from the YUV colour format to RGB, which is the input format of the next module (Video Processing).

The Video Output module is basically an interface between the Video Processing module and the Video DAC. This block is responsible for generating all synchronization signals needed by the Video DAC and the VGA output, e.g. the vertical and horizontal synchronization signals. It also generates two signals corresponding to the pixel coordinates.

[Figure 3: Input, ITU-R 656 Decoder, Deinterlacer with SDRAM Interface, YUV 4:2:2 to 4:4:4, YUV to RGB, Output.]

Fig. 3. Diagram of the Video Input module.

Customized video processing modules can be easily placed between these two modules (Video Input and Video Output). A basic scalable architecture was used to create a complete video application. Fig. 4 shows a diagram of the video processing module created to evaluate our platform.

[Figure 4: Input, Mirroring, Background Subtraction, Tracking, Output, Control Bus.]

Fig. 4. Diagram of the Video Processing Module.

The main components of the Video Processing Module are the processing algorithm blocks, the multiplexers, and the Control Bus. The multiplexers can either bypass the processing algorithm blocks or route the signal through them. The selection between their inputs can be made at run time through the Control Bus. The Control Bus can also share control data among the blocks and a microcontroller. In this test we did not add a microcontroller. Instead, we used the available on-board switches to control the multiplexers. Below, we describe the three processing algorithms implemented in the Video Processing Module.

2.1. Mirroring

The Mirroring algorithm is based on three line buffers using the concept of a Last In First Out (LIFO) structure. Each line buffer stores one of the three input colours. They are based on a dual port RAM block embedded in the

FPGA device. The dual port RAM allows storing data in support a large number of real-time video applications in a
one address and read data from another address at the same VGA standard resolution (640 × 480 pixels @ 60 fps). In
time. Using the LIFO structure, the Mirroring block creates terms of internal memory resources, our design reaches
a mirror effect in the output video. 15% of utilization. Next, Table 1 summarizes the FPGA
resource usage by our system and Fig. 5 shows the
2.2. Background Subtraction Cyclone II EP2C35 floorplan after the fitting process with
the main blocks location.
The Background Subtraction module extracts in real-time
the background of a frame, highlighting new objects on the Table 1. FPGA resource usage by the Video Processing
frame. Background subtraction is a commonly used class Platform.
of techniques for segmenting out objects of interest in a Modules
Logic
Memory Bits
Embedded
PLLs
scene for applications such as surveillance [6]. In our Elements Multipliers
approach, we store a specific part of a frame into an Video In 1550 53184 9 1
Processing
external SRAM. After that, we compare each pixel, from Module
671 19200 0 0
the next frames, with the buffered pixels. As result of the Video Out 86 0 0 0
comparison algorithm (1), each pixel is classified as a Total 2307/33216 72384/483840 9/35 1/4
background pixel or a foreground pixel. In the output, the Percentage 7% 15% 26% 25%
background pixels will appear black and the foreground
pixels will appear as in the input, i.e. in the output we will
see only what is new in the frame.

IF ( P > (Pbuffer + τ) OR P < (Pbuffer - τ) )


Pclass <= foreground; (1)
ELSE
Pclass <= background; Processing
END Module
Video
Input
In (1), P is the current pixel, Pbuffer is the
correspondent buffered pixel, and τ is threshold level.
In order save memory resources, we are buffering only
part of the frame (320 × 240 pixels), which results in a
partial background subtraction. Also, for the same reason, Video
we are working with only one colour channel carrying the Output
greyscale component.
2.3. Tracking

The Tracking algorithm analyses the addresses of the pixels classified as foreground by the Background Subtraction block in each frame. The objective of this address analysis is to locate the limits of the foreground object. With this information, the block draws a green square enclosing the object. The centre position of the square and the area occupied by the object are also made available on the Control Bus. For the evaluation we used an extra module to show the position and area information on the on-board seven-segment displays.

3. RESULTS

The first aspect to consider is the resource utilization of our design. All internal modules, including decoders, controllers, and the customized processing blocks, utilize 7% of all logic elements available in the EP2C35 Cyclone II. This shows that even a relatively small FPGA device can

Fig. 5. Cyclone II 2C35 floorplan.

Regarding system performance, our system can operate at a maximum clock frequency of 43.95 MHz. This result was obtained with Altera's Time Analyser tool. As our architecture can process 1 pixel per clock cycle, it can achieve a maximum performance of 143 fps (2).

    Frame fmax = Pixel fmax / Frame resolution
    Frame fmax = 43.95 MHz / (640 × 480)               (2)
    Frame fmax = 143 fps

The maximum data rate achieved is 439.5 Mbit/s per colour channel (3) or 1.32 Gbit/s for RGB (4).

    Data Rate max = Pixel resolution × Pixel fmax
    Data Rate max = 10 bits × 43.95 MHz                (3)
    Data Rate max = 439.5 Mbit/s
    Data Rate max = 10 bits × 3 × 43.95 MHz            (4)
    Data Rate max = 1.32 Gbit/s
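
The arithmetic behind (2), (3) and (4) can be checked with a few lines of Python. This is only a worked example using the figures quoted in the text (43.95 MHz pixel clock, 640 × 480 frame, 10 bits per sample, three colour channels); the variable names are ours.

```python
PIXEL_CLOCK_HZ = 43.95e6      # maximum clock frequency reported in the text
FRAME_WIDTH, FRAME_HEIGHT = 640, 480
BITS_PER_SAMPLE = 10          # pixel resolution per colour channel
COLOUR_CHANNELS = 3           # R, G and B

# Equation (2): one pixel per clock cycle, so frames/s = clock / pixels per frame.
frames_per_second = PIXEL_CLOCK_HZ / (FRAME_WIDTH * FRAME_HEIGHT)

# Equations (3) and (4): bits per second for one channel and for RGB.
channel_rate_bps = BITS_PER_SAMPLE * PIXEL_CLOCK_HZ
rgb_rate_bps = COLOUR_CHANNELS * channel_rate_bps

print(f"{frames_per_second:.1f} fps")
print(f"{channel_rate_bps / 1e6:.1f} Mbit/s per channel")
print(f"{rgb_rate_bps / 1e9:.2f} Gbit/s for RGB")
```

The results agree with the rounded values given above (143 fps, 439.5 Mbit/s and 1.32 Gbit/s).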

Fig. 6. Video Processing Module results: (a) bypassing the background subtraction/tracking block; (b) bypassing the mirroring block, without foreground objects; (c) bypassing the mirroring block, with a foreground object.

In relation to resource usage and system performance, Altera's PowerPlay tool estimated a power dissipation of 235.7 mW for the FPGA device in our system.
The experimental results demonstrate the effectiveness of our platform. Fig. 6 illustrates the platform video output with different multiplexer settings in the Video Processing Module. Fig. 6a shows the result of the mirroring block without background subtraction/tracking. In Fig. 6b and 6c, only the mirroring block is bypassed and we can see the result of the background subtraction/tracking blocks. In Fig. 6b, the output shows an empty space in the centre of the image. This space is where the background subtraction/tracking block is active. In the next image (Fig. 6c), an object is added to the environment. We can see the new object without the background information, together with a square enclosing it. At the same time, the on-board seven-segment displays show the centre position of the square and the object area in the image.

4. CONCLUSIONS

In this paper, we present a platform for real-time image and video processing applications. The objective of this framework is to allow engineering students to design, explore and evaluate different image and video processing modules.
As mentioned before, we used a low cost Altera FPGA device, the EP2C35 from the Cyclone II family. This device is embedded in a development board that is also low cost, the DE2. The DE2 has the advantage of containing a video input and output based on a TV input decoder (ADV7181B) and a Video DAC output (ADV7123). Also, the DE2 was developed especially for educational purposes, which is our focus.
We implemented three basic processing algorithms in our system in order to validate the entire system and test its flexibility and performance. The results showed that even a relatively small FPGA device can support a large number of real-time video applications at standard VGA resolution.
In future work, we intend to add extra memory to the DE2 board through daughter boards connected to its expansion connector. This will allow us to implement the multiple frame buffers required for more complex algorithms. Also, we want to use a digital video source (for example the Terasic D5M digital camera) instead of the analog one that we used. This will simplify the Video Input Module and save FPGA resources. Moreover, we will migrate our platform to a more powerful development board, aiming at applications at Full HD resolution.
ACKNOWLEDGEMENT

This work is supported by the French Community of Belgium under the Research Action ARC-OLIMP (Optimization for Live Interactive Multimedia Processing 2008-2013). Also, we would like to thank Altera University Program for providing the development boards.

REFERENCES

[1] M. Akil, “Special issue on reconfigurable architecture for real-time image processing,” Journal of Real-Time Image Processing, volume 3(3), pages 117-118, 2008.

[2] J.A. Kalomiros, J. Lygouras, “Design and evaluation of a hardware/software FPGA-based system for fast image processing,” Microprocessors & Microsystems, volume 32(2), pages 95-106, 2008.

[3] S. Asano, T. Maruyama, Y. Yamaguchi, “Performance comparison of FPGA, GPU and CPU in image processing,” International Conference on Field Programmable Logic and Applications, pages 126-131, 2009.

[4] N. Lawal, B. Thornberg, M. O'Nils, “Power-aware automatic constraint generation for FPGA based real-time video processing systems,” Norchip, 2007.

[5] J. Li, H. He, H. Man, S. Desai, “A general-purpose FPGA-based reconfigurable platform for video and image processing,” International Symposium on Neural Networks, pages 299-309, 2009.

[6] A.M. McIvor, “Background subtraction techniques,” In Proc. of Image and Vision Computing, 2000.

SOPC PLATFORM FOR REAL-TIME DVB-T MODULATOR DEBUGGING

Armando Astarloa, Jesús Lázaro, Unai Bidarte, Aitzol Zuloaga ∗
Department of Electronics and Telecommunications,
University of the Basque Country
Alameda Urquijo s/n
48013 Bilbao - Spain
email: [email protected]

Mikel Idirin †
System-on-Chip engineering S.L.
Zitek Bilbao - ETSI
48013 - Bilbao
email: [email protected]

∗ This work has been partially supported by the research program DIPE-BEAZ 2009 (DIPE09/02).
† This work has been partially supported by the Government of the Basque Country within the research program NETS (project IN-2010/0000012).

ABSTRACT

The debugging of DVB-T FPGA based systems is not a trivial task. The large bandwidth requirements, in combination with the massive storage needed for further analysis of the video frames, require an ad hoc solution. This article presents a SoPC architecture specifically designed to capture frames of a Digital Television modulator IP core in real time. All the required processing (video, communications, TCP-IP encapsulation, etc.) is managed by the FPGA, and the frames can be captured between any stages of the hardware processing pipeline of the DVB-T modulator IP core. As a result, a powerful tool for Digital Television hardware debugging is obtained.

1. INTRODUCTION

Recent years have witnessed the development of technology in several digital areas. This evolution has led to the need to replace existing technology in the field of broadcasting, which has been mostly analog until recently. This evolution concerns not only the TV and radio end user but also the RF links between intermediate equipment. An example is the communication between a camera and the production center within the context of the broadcast of a sports event.
In an attempt to solve the shortcomings of previous analog systems, a digital broadcasting service for TV and radio has emerged. In order to organize this evolution, a European standard for digital television [1] has been set.
The basis of the new digital technology is digital compression of the image. Digital sound was addressed early, but real-time moving images pose many more problems, which have held back digital video systems. The transmission of digitized images without compression at the speed required by television requires too much bandwidth, something intolerable given the congested spectrum. It was therefore necessary to compress the digital signal, sending no more than what is needed to reconstruct the image at the receiver. This compression technique was developed by MPEG (Moving Picture Experts Group). In this regard, the MPEG2 image compression system is used as a reference for the European Digital TV standard [2].
The flexibility and computing power required for Digital Television hardware processing are best met using reconfigurable logic. In fact, the state of the art regarding processing hardware modules for DTT (DVB-T modulator/demodulator cores) shows how many companies offer specialized IP cores for integration into FPGAs. Additionally, the latest platform FPGAs have enabled the integration of whole digital systems in a single device [3]: hardware cores, microprocessors, on-chip buses, etc. G. Martin, in the chapter “The History of the SoC Revolution” (2003) [4], emphasized how core-based design with commercial reconfigurable FPGA platforms was a strong reality in System-on-Chip (SoC) [5] design, and that it would continue in the future. This prediction has been fulfilled and nowadays SoCs are widespread, especially the SoCs implemented in reconfigurable logic: the SoPCs. Regarding methods and tools for the debug of high performance systems, most of the work has been done in recent years. FPGAs have become popular as a valuable resource for the debug and verification of those complex high-performance embedded systems. With current FPGA technology, it has become possible to control and manage several different real-time and high bandwidth interfaces simultaneously. In this way, in [6] an FPGA is used to build a general purpose full observability cosimulation platform. As another example, in [7], a JTAG compatible logic analyzer core is presented, which
makes the real-time debug of FPGAs easier. [8] and [9] show two other examples of different FPGA based architectures used to facilitate the debug of systems with demanding requirements: a CCD image processing system and a wireless network node, respectively.
The remainder of this article is organized into four sections. Section 2 presents the system architecture of the SoPC debug platform and summarizes the architecture of the DVB-T modulator. Section 3 summarizes the implementation results of one configuration of the IP core and the SoPC debug platform. Section 4 concludes this paper and presents the future work in this field.

2. SYSTEM ARCHITECTURE

The computation scheme of a DVB-T modulator fits into a pipelined bus architecture. Figure 1 shows the architecture of the DVB-T modulator IP core under verification. Video frames are processed sequentially, starting at the transport stream interface core (transport_stream_interface) and ending at the core that prepares the I+Q output for the DAC (DAC_core). Table 1 summarizes the modules that compose the DVB-T modulator IP. The name of each one identifies the computation that it performs. It is worth mentioning that, as the input and output data bus widths of each module differ, the speed of the data transfer in each point-to-point link is different as well.

Table 1. Input and output data bus width of the DVB-T IP internal modules.

    Module name            Input bus data width   Output bus data width
    transport_stream_if    8                      9
    randomized             9                      9
    reed_salomon           9                      9
    external_interleaver   9                      8
    viterbi_puncture       8                      2
    internal_interleaver   2                      33
    pilot_and_tps          33                     33
    ifft_ig                33                     32
    dac_core               32                     16 (DDR)

While the system runs, the dataflow inside the modulator core cannot be stopped, because video Transport Stream frames enter the modulator at the transport_stream_if module at a given sampling rate. Thus, the proposed debug solution must be able to extract and transmit to the host the data that is communicated between any pair of cores.
Taking into account the huge data volumes that are involved, the debug task in the development of FPGA based DVB-T modulator systems requires the storage of large volumes of real-time video frames. In order to deal with this, it is necessary to design appropriate SoPC architectures and technologies. This paper presents a solution based on a SoPC system that can extract real-time information to a host via a 1 Gbps Ethernet TCP-IP connection. The useful information throughput will be above 200 Mbps (payload).
To meet this challenge, the key technological elements in the system are:

• High-end Virtex-5 FPGA (XC5VFX70T-FFG1136).

• A hard core Power-PC 440 processor, integrated into the FPGA silicon.

• A hard core high performance Gigabit Ethernet controller, integrated into the FPGA silicon.

• A hard core DDR2 controller embedded inside the Power-PC processor.

The debug set-up establishes a point-to-point link between the capture system and a PC. The data throughput that the application demands is very high and, therefore, the transmission through the Gigabit Ethernet channel must be optimized. Apart from the FPGA side (hard Gigabit Ethernet controller), a high-performance PC must be selected in order to avoid creating bottlenecks on the host side. The quality of its network card, CPU, RAM memory and hard disk recording speed must therefore be taken into account.
From the standpoint of architectural design, to fulfill these demands, the following elements are considered:

• Optimized FIFO with a dedicated on-chip bus: Data capture is done through a FIFO. This memory is written with the information to be analyzed. A simple FSM is in charge of copying the data from the point-to-point link to the memory. The Power-PC processor reads the FIFO through a PLB to FSL bridge. The FSL is a bus optimized for direct FIFO connections.

• DMA transfers between memory and Gigabit Ethernet controller: Once it has acquired data from the FIFO, the Power-PC processor prepares TCP/IP packets in the dynamic RAM memory included in the system. These packets are transferred to the Gigabit Ethernet controller using DMA. This solution optimizes PLB on-chip bus transfers and substantially improves the performance of the TCP-IP transmission (see Section 3).

• Use of hard checksum functions provided by TMAC: The TMAC Gigabit Ethernet controller provides functions for checksum generation and verification. This implementation takes advantage of those features. This releases the lwIP stack, in charge of TCP-IP packet
composition, from the software execution of these tasks and allows a substantial improvement in the final performance of the communication.

• TCP-IP and lwIP parameters optimization: Substantial performance improvements in communication can be achieved by modifying some parameters of the TCP-IP stack in combination with some size optimizations of the transmission and reception FIFOs. The most significant parameters are the following:

    – Maximum Segment Size (TCP_MSS): 1,460 bytes.
    – TCP Transmission Buffer (TCP_SND_BUF): 16,384 bytes.
    – TCP Window (TCP_WND): 4,096 bytes.
    – TMAC transmission and reception FIFO: 4,096 bytes.

Fig. 1. DVB-T modulator IP core block diagram.
[Block diagram inside the FPGA: MPEG2 input, 1 TRANSPORT_STREAM_IF, 2 RANDOMIZED, 3 REED_SALOMON, 4 EXTERNAL_INTERLEAVER, 5 VITERBI_PUNCTURE, 6 INTERNAL_INTERLEAVER, 7 PILOT_AND_TPS, 8 IFFT_IG, 10 DAC_CORE, output to the DACs. Consecutive modules are linked through FIFO IF (M)/(S) interfaces; each module has a WB IF (S) port connected to the WB IF (M) of 11 CTRL, which also provides the debug UART IF.]

Figure 2 shows the block diagram of the proposed SoPC for real-time debug. It has been implemented on a Virtex-5 FPGA. In addition to the critical modules mentioned above, the following additional cores are present in the system:

• 16 Kbytes of internal RAM memory built using dedicated block RAM modules.

• SRAM controller for external memory.

• Interrupt controller.

• UART for debug purposes.

• Auxiliary modules for clock, reset and JTAG management.

3. IMPLEMENTATION RESULTS

In order to obtain the debug system as fast as possible, both the IP core and the SoPC have been implemented on a ML507 Xilinx Virtex-5 evaluation board. This board carries a XC5VFX70T-FFG1136 device and has all the means needed for real-time operation: DDR2 external memory, SRAM memory and a Gigabit Ethernet physical link.
Figure 3 shows the block diagram of the whole system. Both the IP and the SoPC have been implemented inside the FPGA. In this set-up, the SoPC is capturing the data between the output of the FFT and the input of the DAC module. FSM Ctrl. is the Finite State Machine that controls the data transfer between the DVB-T modulator IP core and the FIFO stored in the CAPTURE_FSL_MASTER_OUT IP.
Table 2 summarizes the implementation results. The first column describes the FPGA resource type. Columns 2 and 3 summarize, respectively, the FPGA occupation for the SoPC alone and in combination with the IP core under test, in this case the DVB-T modulator. It is worth noting that a huge FPGA like the one used for this implementation, allows easy
fast-prototyping for complex debug systems. Only 33% of the general purpose resources of the FPGA are used and all timing constraints are easily met.

Fig. 2. Block diagram of the SoPC real-time debug.
[Block diagram: the data to be captured enters the CAPTURE_FSL_MASTER_OUT FIFO, which is connected through FSL to the PLB2FSL bridge on the PLB V46 on-chip bus. Also on the bus: PPC440 with its DDR2 controller, internal RAM memory, interrupt controller, SRAM interface, GP IOs (LEDs, buttons), debug UART, and the TMAC Gigabit Ethernet controller (LLDMA, LL_IN), which drives the Ethernet PHY towards the debug host.]

Fig. 3. Block diagram of the SoPC in combination with DVB-T transmitter IP core for real-time debug.
[Block diagram: on the board, the FPGA contains the DVB-T modulator IP core of Fig. 1 and the SoPC of Fig. 2; FSM Ctrl. copies data from the modulator pipeline into the CAPTURE_FSL_MASTER_OUT FIFO. External elements: MPEG2 Transport Stream input, SRAM, DDR2, Ethernet PHY and the PC host used for debug.]

Table 2. Implementation results of the SoPC designed for real-time debug of a DVB-T transmitter IP core (data for a Virtex-5 XC5VFX70T-FFG1136 FPGA).

    FPGA resource type        SoPC system    IP core under analysis and SoPC system
    4 input LUTs              4,850 (10%)    5,762 (12%)
    Slice Flip-Flops          5,221 (11%)    6,851 (15%)
    Virtex-5 Slices           3,008 (26%)    3,762 (33%)
    36K BlockRAM              17 (11%)       23 (15%)
    Hard Power-PC processor   1 (100%)       1 (100%)
    TMAC Gigabit Ethernet     1 (50%)        1 (50%)

------------------------------------------------------------
Server listening on TCP port 2000
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1856] local 192.168.1.50 port 2000 connected with 192.168.1.105 port 4097
[ ID] Interval Transfer Bandwidth
[1856] 0.0- 2.0 sec 30.8 MBytes 129 Mbits/sec
[1856] 2.0- 4.0 sec 30.8 MBytes 129 Mbits/sec
[1856] 4.0- 6.0 sec 36.5 MBytes 153 Mbits/sec
[1856] 6.0- 8.0 sec 35.0 MBytes 147 Mbits/sec
[1856] 8.0-10.0 sec 37.1 MBytes 156 Mbits/sec
[1856] 10.0-12.0 sec 37.3 MBytes 156 Mbits/sec
[1856] 12.0-14.0 sec 33.6 MBytes 141 Mbits/sec
[1856] 14.0-16.0 sec 31.1 MBytes 130 Mbits/sec
[1856] 16.0-18.0 sec 31.1 MBytes 130 Mbits/sec
[1856] 18.0-20.0 sec 35.7 MBytes 150 Mbits/sec
[1856] 20.0-22.0 sec 34.2 MBytes 144 Mbits/sec
[1856] 22.0-24.0 sec 35.5 MBytes 149 Mbits/sec
[1856] 24.0-26.0 sec 32.0 MBytes 134 Mbits/sec
[1856] 26.0-28.0 sec 36.6 MBytes 154 Mbits/sec
[1856] 28.0-30.0 sec 36.5 MBytes 153 Mbits/sec
[1856] 30.0-32.0 sec 29.9 MBytes 126 Mbits/sec
[1856] 32.0-34.0 sec 33.1 MBytes 139 Mbits/sec
[1856] 34.0-36.0 sec 36.4 MBytes 153 Mbits/sec
[1856] 36.0-38.0 sec 37.2 MBytes 156 Mbits/sec
[1856] 38.0-40.0 sec 34.3 MBytes 144 Mbits/sec

Fig. 4. Communication performance between fast prototyping board (ML507) and PC host. Data provided by Iperf tool.

Figure 4 shows a screenshot of the real-time communication between the ML507 evaluation board used to implement the presented platform and a PC, through a point-to-point Gigabit Ethernet communication link. An Iperf server runs on the PC and evaluates the actual data flow during the transfer. The program used to capture the TCP-IP packets is Wireshark. It is in charge of saving the reconstructed frames on the PC hard disk for further analysis. Those frames are captured and stored in real time; however, they are analyzed off-line, when they are compared with the ones generated by the DVB-T modulator reference model (implemented in C language). As can be seen, for the chosen communication parameters the obtained throughput is around 140 Mbps. This transfer bandwidth is enough to acquire data for debugging purposes at any interface in the DVB-T modulator flow chain.
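
The "around 140 Mbps" figure can be cross-checked against the per-interval bandwidth values listed in Fig. 4. The short Python sketch below simply averages those twenty values as transcribed from the Iperf listing; it is our own check, not part of the original measurement procedure.

```python
# Per-interval bandwidth in Mbit/s, transcribed from the Iperf output of Fig. 4
# (twenty consecutive 2-second intervals, 0 s to 40 s).
bandwidth_mbps = [
    129, 129, 153, 147, 156, 156, 141, 130, 130, 150,
    144, 149, 134, 154, 153, 126, 139, 153, 156, 144,
]

mean_mbps = sum(bandwidth_mbps) / len(bandwidth_mbps)
min_mbps = min(bandwidth_mbps)
max_mbps = max(bandwidth_mbps)

print(f"mean {mean_mbps:.2f} Mbit/s, min {min_mbps} Mbit/s, max {max_mbps} Mbit/s")
```

The mean over the 40 s window is close to 144 Mbit/s, with individual intervals ranging from 126 to 156 Mbit/s.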

4. CONCLUSIONS

The main contribution of this work is the report of a fast prototyping SoPC debug system for real-time video applications. Sometimes the debug of these systems requires saving a huge amount of real-time data for further analysis. Normally, it is not easy to find commercial instrumentation that fits these needs.
As has been shown in the presented case, this is possible thanks to the portability of the RTL code. This allows the migration of an IP core targeted at a low cost FPGA to a high-end one, in combination with a SoPC that enables high bandwidth communication with a remote host.
Future work in this field includes better automation of all the processes involved in the experimental set-up and partial reconfiguration support that allows dynamic interchange of the IP core under test.

5. REFERENCES

[1] European Broadcasting Union, “Digital Video Broadcasting (DVB). ETS300744. Framing structure, channel coding and modulation for digital terrestrial television,” 1997.
[2] European Broadcasting Union, “Digital Video Broadcasting (DVB). ETR290. Framing structure, channel coding and modulation for digital terrestrial television,” 1997.
[3] Xilinx Corp., “Xilinx Platform Studio and EDK,” Xilinx Documentation, http://www.xilinx.com, 2009.
[4] G. Martin and H. C. (Eds.), Winning the SoC Revolution: Experiences in Real Design. Massachusetts, USA: Kluwer Academic Publishers, 2003.
[5] R. A. Bergamaschi, S. Bhattacharya, R. Wagner, C. Fellenz, and M. Muhlada, “Automating the Design of SOCs Using Cores,” IEEE Design & Test of Computers, vol. 18, no. 5, pp. 32–45, 2001.
[6] X. Cheng, A. W. Ruan, Y. B. Liao, P. Li, and H. C. Huang, “A run-time RTL debugging methodology for FPGA-based co-simulation,” in 2010 International Conference Communications, Circuits and Systems (ICCCAS), 2010, pp. 891–895.
[7] Z. K. Baker and J. S. Monson, “In-situ FPGA debug driven by on-board microcontroller,” pp. 219–222, 2009.
[8] F. Zhang, Q.-Z. Wu, and G.-Q. Ren, “A real-time capture and transport system for high-resolution measure image,” vol. 1. Los Alamitos, CA, USA: IEEE Computer Society, 2010, pp. 306–309.
[9] Q. Wang, L. Wang, and J. He, “A New Simulation Scheme for Testing and Debugging Wireless Sensor Networks,” in 5th International Conference on Wireless Communications, Networking and Mobile Computing (WiCom09), 2009, pp. 1–4.

HIGH RELIABILITY CAPTURE CORE FOR DATA ACQUISITION IN SYSTEM ON
PROGRAMMABLE CHIPS

Jesús Lázaro, Armando Astarloa, Aitzol Zuloaga, Jaime Jimenez, Unai Bidarte, José Luis Martín

Department of Electronics and Telecommunications,
University of the Basque Country
Alameda Urquijo s/n 48013 Bilbao - Spain
email: [email protected]

This work has been partially supported by the Government of the Basque Country within the research program SAIOTEK (project SAI09/17).

ABSTRACT

This paper presents both a standalone capture core and a SoPC system. It also presents a simulation framework and a practical implementation of a high reliability filter. The implementation uses FIR filters, although it can be extended to IIR filters or any other kind of mathematical circuit. The circuit makes use of triple redundancy and voter circuits to obtain a correct filter output in the presence of a failure either in the conversion circuit or in the FPGA. The system presented in this paper is not a substitute for a traditional triple redundancy circuit but an addition to it. The SoPC includes the standalone core and adds memory and communications cores to process the data and transfer it.

1. INTRODUCTION

Data acquisition is a key component in modern control systems. A data acquisition system is in charge of taking an analog signal and passing it to a digital processing system. In this process, the analog signal must be filtered, digitized and digitally filtered before the data is transferred to the processing unit.
Traditionally the processing unit has been built around a DSP (digital signal processor). In recent years there has been a change from the DSP to the FPGA (field programmable gate array), the main reasons being:

• FPGAs are more affordable every day

• FPGAs have increased their signal processing capabilities

• The parallel processing capabilities of FPGAs surpass the sequential processing power of DSPs

Data acquisition is used in many critical systems [1, 2, 3]. These systems require a high degree of reliability, because both human lives and great economic loss can be at risk. One of the traditional ways of dealing with reliability is redundancy. Tripling the system assures greater reliability and protects against a failure in one of the systems.
FPGAs are no strangers to failure. Like any electronic circuit, they suffer failures due to temperature, age, humidity, shock, etc. One difference between conventional circuits and FPGAs is their reconfiguration capability. SRAM (static random access memory) based FPGAs can suffer an upset both in the operational circuit and in their configuration memory [4, 5, 6]. This leads to both temporary failures (an upset in the circuit) and permanent ones (a failure in the configuration will remain there until the device is configured again).
This article presents a redundancy scheme for filters inside the FPGA. The article is focused on FIR (finite impulse response) filters because they suit the internal structure of FPGAs better than IIR (infinite impulse response) filters. Specifically, the fixed point implementation of FIR filters requires much less hardware than the floating point implementation of IIR filters. IIR filters are not normally implemented using fixed point, in order to avoid stability problems [7, 8]. The redundancy scheme is prepared to detect errors not only in the FIR circuit but in the conversion circuitry as well. The design presented in the paper is the first stage of a more complex system, as it is in charge of combining the outputs of different conversion circuits into a single value. After the proposed system, a more conventional triple redundancy [9] signal processing circuit is likely to appear before the data is passed to a processing unit. In fact, the whole vote and mean circuit should be tripled in a high reliability scheme.
The article is organized as follows. First the filter structure is presented, including the simulation framework. Then the simulation results and the hardware resource utilization are given, followed by some final conclusions and future work.

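
As a minimal illustration of the triple-redundancy idea introduced above, the following Python sketch shows a one-bit majority vote over three replicas and a helper that reports which replica, if any, disagrees with the other two. It is only a software sketch with hypothetical function names; the actual voter used in this work is the hardware circuit described in Section 3.3, whose truth table uses the same 0/1/2/3 encoding.

```python
def majority(a, b, c):
    """Bitwise majority of three replicas: each output bit takes the value
    held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def dissenting_replica(a, b, c):
    """Return 0 when the three one-bit inputs agree, otherwise 1, 2 or 3
    to flag A, B or C as the replica that differs from the other two."""
    if a == b == c:
        return 0
    if b == c:
        return 1  # A is the odd one out
    if a == c:
        return 2  # B is the odd one out
    return 3      # A == B, so C is the odd one out


# Enumerate all one-bit input combinations.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", dissenting_replica(a, b, c), majority(a, b, c))
```

With a single faulty replica the majority output is unaffected, which is the property that tripling the system relies on.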
2. OVERALL STRUCTURE

The Capture core is in charge of receiving data from the ADC, deciding the correct value and making it available through a PLB compatible core. The SoPC is composed of several cores (one being the capture core) and a microprocessor that will use the captured data.

Fig. 1. Overall circuit and interconnection (external sensor, redundant ADC, and the FPGA containing the Capture Core and a PowerPC connected through the PLB).

The system is built using XSG (Xilinx System Generator) [10] and XPS [11]. XSG is a Simulink toolbox [12] and it is capable of linking an FPGA description to Matlab sinks and sources. It is also capable of creating a custom core compatible with the PLB bus standard [13] so that it can be used inside XPS.
The overall project can be divided into two main parts:

• Capture core

• Complete SoPC

3. CAPTURE CORE STRUCTURE

3.1. Simulation framework

The overall simulation framework (see figure 2) is composed of the following elements:

• Input signal. It is composed of two different sinusoidal signals, one of which should be filtered out.

• Error generation block. A white noise signal can be added to the input signal, simulating an error.

• Hardware FIR filter. Three identical filters designed using the FDATool (filter generation wizard).

• Voter and mean. This block is in charge of detecting any non-working filter+ADC and of giving the correct output signal.

• Reference filter. FDATool filter to be used as baseline.

• Spectrum analyzer. This block is used to compare the different outputs: ideal, output and single hardware filter output.

Fig. 2. Overall system, depicting inputs, filters and voting circuitry.

3.2. Structure

The block in charge of combining the filter outputs in order to give the correct answer is built around the following blocks:

• Voter. This block is in charge of deciding which filter, if any, is giving a corrupted output.

• Fault counter. This block counts how many errors are found in each of the filters in a given time.

• Disabling circuit. Knowing the number of errors of each filter, this circuit disables the one with the most errors (if all have the same number of errors, circuit C is disabled).

• Gate circuit. According to the disabling signal, it will output the filter value or 0 towards the mean calculator.

• Mean calculator. This circuit adds all three gate outputs and divides them by two, a simple shift.

Fig. 3. Voter and mean calculator. A majority voter decides which filter output is probably failing. Several counters count how many failures happen in a given time. A third block disables a core if an anomalous condition is found. The fourth block sets the output of the failing filter to 0, while the last block calculates the mean of the outputs of the filters.

3.3. Voter

The voter is in charge of deciding which filter, if any, is giving a corrupted output. Table 1 shows the basic output of a majority voter [14]. This circuit is capable of finding which one of the three inputs is giving a value different from the other two. If all three inputs are the same, the circuit outputs ‘0’.

    A   B   C   Error
    0   0   0   0
    0   0   1   3
    0   1   0   2
    0   1   1   1
    1   0   0   1
    1   0   1   2
    1   1   0   3
    1   1   1   0

Table 1. Truth table of the voter circuit. For every input combination the circuit outputs where the error is. An output of 0 means that no error was found.

In our circuit, the voter tests the most significant bit of the output of the filter. This has been done because the system is tuned to use the full dynamic range of input values. If such
a thing is not possible, bits with lower binary weight should hardware cost, the circuit has been designed to add the three
be use. outputs from the filters and divide them by two.
Contrary to conventional voting circuit. The current out-
put of the voter is not used, but the average of errors is used.
This way the voting circuit needs not to be perfectly tuned,
since there is margin for spurious outputs.

3.4. Error counter


In order to deal with spurious outputs from the voter circuit,
an error counter is added. This blocks counts the number of
errors from each filter in a given time. This time reference
is given using the terminal count from a counter. The size
of the counter is a parameter that can be changed to suit
the applications. The smaller the counter, the quicker will it
react to changes of the inputs but it will also react to short Fig. 4. Mean calculator. Dividing by two is very hardware
term spurious errors. The bigger the counter, the slower will efficient. In case of no failures in any filter, one filter is
react to errors in the inputs but will filter short term spurious disabled so the mean is always calculated the same way.
signals from the voter. It must be noted that any change of
the active filters will produce a small transient in the output, The operation performed can be seen in equation 1. It
so bigger counters can be desirable. In our test case, a 8 bit must be taken into account that one of the outputs will al-
counter has been used. ways be zero. Making a simpler circuit than a divide by
three circuit.
3.5. Disabling circuitry
V V V
The truth table of this simple circuit can be found in 2. This A dA + B dB + C dC def ault A+B
F = −−−−−→ (1)
circuit decides which filter output not to use in the mean 2 2
calculation. To to so, three comparator compare the differ-
ent input values and by means of a truth table, which filter 3.7. Simulation results
output should not be used is decided.
Figure 5 depicts the simulation results. Channel 1 is the
output of the ideal filter using the FDATool. Cannel 2 is
A>B A>C B>C disable the output of the real system while one of the filters is not
0 0 0 C working properly. Channel 3 is the output of the faulty filter.
0 0 1 B The fault is injected using a white noise generator simulating
0 1 0 X a faulty ADC. As it can be seen, the erroneous filter does not
0 1 1 B interfere with the correct functioning of the overall system.
1 0 0 C
1 0 1 X 3.8. Hardware results
1 1 0 A
1 1 1 A In table 4 we can see the resources needed for the system.
The target FPGA has been a Xilinx Spartan 3A-DSP [15].
Table 2. Truth table of the disabling circuitry. Depending This FPGA is well suited for signal processing since it has
on the number of errors, A, B or C filter is disabled. X marks plenty of DSP48A resources as well as internal RAM and lo-
a non possible situation. gic. The overall system uses a mere 3% of the FPGA when
implementing a 50 order lowpass equiripple FIR filter. The
After deciding which output not to use, this output is input data rate is been set to 1MHz and 16 bit. The system
converted to zero so it does not interfere in the addition. runs 64 times faster, at 64 MHz. This increases speed allows
to perform 64 operations in each DSP before a new data ar-
rives. This means that each filter only requires a single DSP
3.6. Mean calculator
(in fact, a 63 order filter could be implemented with minor
The mean calculator is composed of two adders and a divide extra hardware requirements). Since the filter is tripled, 3
by two circuit. Since division by an arbitrary number is an DSP48 [16] are required as well as e little bit of extra logic
expensive operation and, division by power of two has zero for the comparators, adders and counters. This means that,

in a worst case scenario, the hardware overhead is limited to 2 DSP48A blocks and less than 2% of the FPGA. This implementation is useful for a standalone version of the core. The resources used in the final FPGA are slightly bigger, since the interconnection logic has to be added. It may seem that it requires fewer slices, but Virtex5 slices are twice the size of those of a Spartan 3.

Fig. 5. Spectrum result of the filters (magnitude-squared in dB versus frequency in kHz). Channel 1 depicts the ideal floating point filter. Channel 2 depicts the output of the real system. Channel 3 depicts the output of the non-working filter.

Table 3. Resource summary report for a Spartan 3A-DSP. Timing constraints set to 64 MHz, allowing a 1 MHz input sampling rate.

            Quantity   % of FPGA
DSP48As        3          3%
Slice        648          3%

Table 4. Resource summary report for a Virtex5. Look-up tables in Virtex5 devices are bigger than in Spartan 3 and the number of flip-flops is also bigger.

            Quantity   % of FPGA
DSP48E         3          2%
Slice        431          3%

4. SOPC STRUCTURE

The capture core seen in the previous section can be used standalone but, using the export Pcore feature, it can also be used inside a SoPC. In this kind of system, all the elements of the circuit are integrated inside an FPGA. The capture core has to be slightly modified; specifically, the interconnection interface with the PLB bus has to be defined. In our case, this connection is done through a FIFO-style shared memory. This memory is written by the core and read through the PLB.

4.1. Hardware structure

The main elements inside the FPGA are:

• Capture core
• PowerPC hard processor
• TEMAC: Hard Ethernet MAC
• Memory cores: DDR2 interface core, Flash interface core

The capture core has been explained in previous sections, so we will focus on the rest of the cores.

4.2. PowerPC hard processor

The IBM PowerPC 440 core is a hard 32-bit RISC CPU block built into the fabric of select Virtex series FPGAs to implement high-performance embedded applications. The combination of hard cores with integrated co-processing capability enables a wide range of performance optimization options.

The PowerPC 440 processor is supported in Virtex-5 FXT FPGAs by a sophisticated CPU/APU controller and a high-bandwidth crossbar switch. The crossbar switch enables high-throughput 128-bit interfaces and point-to-point connectivity. Integrated DMA channels, a dedicated memory interface and Processor Local Bus (PLB) interfaces minimize logic utilization, reduce system latency and optimize performance. Simultaneous I/O and memory access maximizes data transfer rates.

4.3. TEMAC: Hard Ethernet MAC

TEMAC is an acronym for Tri-Mode Ethernet Media Access Controller and refers to the three speeds (10, 100 and 1000 Mb/s) supported by the Ethernet MAC function available in this core. This core is based on the Xilinx hard silicon Ethernet MAC in the Virtex-5 FXT.

This core provides some very advanced capabilities:

• DMA transfers between memory and the Gigabit Ethernet controller.
• Hard checksum functions. This releases any IP stack in charge of packet composition from the software execution of these tasks and allows a substantial improvement in the final performance of the communication.
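
To give an idea of the work that hardware checksum support takes off the processor, the following Python sketch computes the 16-bit one's-complement Internet checksum defined in RFC 1071, which a software IP stack would otherwise evaluate for every packet. The text above does not state which checksums the TEMAC computes, so the choice of this checksum is an assumption made for illustration, and the function name is hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum (RFC 1071) over a byte string."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


# Example IPv4 header with its checksum field set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))  # prints 0xb861
```

In software this loop runs over every transmitted and received packet, which is the per-packet cost that a hardware implementation removes.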

Table 5. Resource summary report for a Virtex5 70fxt. Timing constraints set for a 100 MHz bus speed to allow high-speed communications.

            Quantity   % of FPGA
PPC440         1        100%
TEMAC          1         50%
Slice       3495         31%
BRAM          15         10%
DSP48As        3          2%

4.4. Memory cores

The system has two memory interfaces, one for DDR2 and another for Flash. The combination of these memories allows the use of complex software schemes such as operating systems and IP stacks, allowing the system to transfer any data using standard protocols.

4.5. Hardware results

Table 5 presents a summary of the required resources. The system is built around the high-performance Virtex5 70fxt. The PowerPC runs at 400 MHz to provide maximum performance. The presented system has only a single capture core, but there is plenty of room both for more capture cores and for a more complex SoC.

5. CONCLUSIONS AND FUTURE WORK

This paper presents both a simulation framework and a practical implementation of a high-reliability filter. The implementation uses FIR filters, although it can be extended to IIR filters or any other kind of mathematical circuit.

In systems where FPGA failure is of concern, the vote and mean circuitry should also be tripled, as well as any following signal processing circuitry.

The system can be upgraded to detect an error both in the input (analog-to-digital converter) and in the output (result of the filtering). This way action can be taken to try to solve the problem. If the error is in the input, not much can be done but, if the error is inside the FPGA, some action can be taken. This can range from resetting the offending circuit to full FPGA reconfiguration, with partial reconfiguration as the middle point.

6. REFERENCES

[1] B. McHarg, “Control, data acquisition, and remote participation for fusion research,” Fusion Engineering and Design, vol. 71, no. 1-4, pp. 1–3, 2004, 4th IAEA Technical Meeting on Control, Data Acquisition, and Remote Participation for Fusion Research. [Online]. Available: http://www.sciencedirect.com/science/article/B6V3C-4CGNSF2-1/2/bfd1aabcaa30ed6414008b4742affb1a
[2] K. Nurdan, H. Besch, B. Freisleben, T. Conka-Nurdan, N. Pavel, and A. Walenta, “Development of a Compton Camera Data Acquisition System Using FPGAs,” in Proceedings of the 2003 International Signal Processing Conference, 2003.
[3] H. I. Schlaberg, D. Li, Y. Wu, and M. Wang, “FPGA Based Data Acquisition and Processing for Gamma Ray Tomography,” AIP Conference Proceedings, vol. 914, no. 1, pp. 831–837, 2007. [Online]. Available: http://link.aip.org/link/?APC/914/831/1
[4] P. Adell and G. Allen, “Assessing and mitigating radiation effects in Xilinx FPGAs,” JPL, Tech. Rep., 2008. [Online]. Available: http://hdl.handle.net/2014/40763
[5] R. Baumann, “Soft errors in advanced semiconductor devices-part I: the three radiation sources,” Device and Materials Reliability, IEEE Transactions on, vol. 1, no. 1, pp. 17–22, Mar. 2001.
[6] R. Baumann and E. Smith, “Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices,” 2000, pp. 152–157.
[7] M. Bellanger, Digital Processing of Signals: Theory and Practice. John Wiley & Sons Ltd., 2000.
[8] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Prentice Hall, 2009.
[9] Xilinx, “TMRTool Product Brief,” http://www.xilinx.com/publications/prod_mktg/XTMRTool_ssht.pdf.
[10] Xilinx, “Xilinx System Generator for DSP,” http://www.xilinx.com/tools/sysgen.htm.
[11] Xilinx, “Xilinx Platform Studio,” http://www.xilinx.com/tools/xps.htm.
[12] The MathWorks, “Simulink - Simulation and Model-Based Design,” http://www.mathworks.com/products/simulink/.
[13] Xilinx, “Processor Local Bus (PLB) v4.6,” http://www.xilinx.com/support/documentation/ip_documentation/ds531.pdf.
[14] R. Perez, “Methods for Spacecraft Avionics Protection Against Space Radiation in the Form of Single-Event Transients,” Electromagnetic Compatibility, IEEE Transactions on, vol. 50, no. 3, pp. 455–465, Aug. 2008.
[15] Xilinx, “Spartan-3A DSP FPGA Family: Complete Data Sheet,” http://www.xilinx.com/support/documentation/data_sheets/ds610.pdf, Mar. 2009.
[16] Xilinx, “XtremeDSP DSP48A for Spartan-3A DSP FPGAs User Guide,” http://www.xilinx.com/support/documentation/user_guides/ug431.pdf, Jul. 2008.
DEVELOPMENT OF A GENERIC PLATFORM FOR VISION SYSTEMS BASED ON THE CORECONNECT ARCHITECTURE

Pantaleone Luis M., Leiva Lucas E., Vazquez Martín

INCA/INTIA
Universidad Nacional del Centro de la Provincia de Buenos Aires
Paraje Arroyo Seco, Tandil, Buenos Aires Province, Argentina
email: [email protected], {lleiva,mvazquez}@exa.unicen.edu.ar

ABSTRACT

This work presents the implementation and analysis of a generic platform for image acquisition, processing, display and transmission. The platform is based on an SoC (System on a Chip) implemented on Xilinx FPGAs using the CoreConnect architecture. Its advantage is the ease with which additional cores can be attached to the platform and image-processing algorithms can be added on the microprocessor. A core was developed that controls a camera and acquires images from it, writing them to an external memory, together with the software that processes and displays them. This software runs on the system's embedded microprocessor.

Fig. 1. Generic platform of the vision system (block diagram: PPC, camera core, VGA and memory IP cores connected through the PLB bus inside the FPGA).

1. INTRODUCTION

It is increasingly common to find systems dedicated to video processing for specific purposes. Among these are uses related to measurement, inspection, recognition and guidance, as well as dedicated systems for industrial applications [1].

Vision systems can be implemented with different technologies, from simple computers to the most complex ASIC systems, by way of FPGAs, microcontrollers, DSPs, etc. An SoC is an integrated circuit that includes a processor, a bus and other elements on a single chip. SoC designs can be implemented in different technologies, such as ASICs, FPGAs, microcontrollers, DSPs, etc.

When developing an SoC on FPGAs, several architectures are available, among which CoreConnect [2] stands out; it is oriented to interconnecting cores by means of buses. A core is a device or peripheral connected to the bus. Cores are of two types, softcores and hardcores. Hardcores are physical resources of the FPGA, such as embedded microprocessors, block RAMs, multipliers, etc., whereas a softcore is a synthesizable resource; examples are microprocessors such as MicroBlaze, OpenSPARC and Nios II, processing filters, etc. Cores can be attached to the bus as masters or as slaves. To communicate over the bus they must implement the corresponding communication protocol. A core is a master when it is the one that initiates a request to another core, and a slave when it only receives requests. A core can be master and slave at the same time, provided it has enough logic to fulfil both roles.

2. PLATFORM DESIGN

The goal is to design a system (Fig. 1) that stores in RAM the images received from a camera for later processing, display and transmission. The system must be easily scalable and portable.

To this end the CoreConnect architecture provided by Xilinx was chosen; it was developed by IBM for use with the PowerPC microprocessor. This architecture is based on the use of cores and buses. The version of the architecture is 4.6 [3], which is integrated in EDK 10.1 [4].

The system has a core in charge of controlling the camera, receiving its data and writing them to memory. The camera is a 5 MP CMOS sensor manufactured by Micron, part number MT9P001 [5]. The sensor is mounted on a development board (headboard) fitted with a Navitar lens with adjustable aperture and focal length. Its resolution is 2592 (horizontal) x 1944 (vertical) pixels. Each pixel has a depth of 12 bits. The sensor uses the Bayer pattern [6].

Image processing is carried out on the embedded PowerPC (or MicroBlaze) microprocessor by running a program written in C. The microprocessor also transmits the images both to a monitor and to the serial interface.

The processed images are stored in external memory, in three separate areas. One memory area belongs to the video core, a second is where the camera core writes the images, and a third holds the processed image to be displayed or transmitted later.

3. SOC ARCHITECTURE

Figure 2 shows the architecture with the components that make up the system.

Fig. 2. System architecture

The IP cores used are mainly block RAM controllers (bram_block and xps_bram_if_cntrl), RAM memory controllers (mpmc), a UART controller (xps_uartlite), a video driver (xps_tft) and the PowerPC controller (ppc405). The bus used for communication is PLB version 4.6.

One bus connects the peripherals in charge of the UART, the external memory, the RAM blocks and the external switches.

The video controller IP core is both a master and a slave. It is connected to two different buses: the master side is attached to a bus of its own to communicate with memory, and the slave side to the bus where the rest of the peripherals are connected. It is connected to the memory controller through a dedicated bus because the peripheral needs high bandwidth. This way it avoids sharing the bus with other peripherals and always has access to it.

The IP core that controls the PowerPC has four ports, two dedicated to data and two to instructions, named D0 and D1 for data and I0 and I1 for instructions. Ports I0 and D0 are connected to the same bus as the rest of the peripheral cores so that the processor can interact with them, while D1 and I1 are connected to memory through a dedicated bus to obtain fast access to it without competing with other cores for the bus.

3.1. Camera core

As mentioned above, the core (Fig. 3) was developed following the layered architecture for core development [7]. In its lowest layer (called IPIF) this architecture communicates with the PLB bus and thereby provides a simplified interface, called IPIC, to the upper layer, called User Logic. The User Logic is where the logic of the core is placed.

The core was developed in VHDL. It is portable to other systems, provided they use version 4.6 of the PLB bus.

It consists mainly of two modules: one in charge of receiving the data and configuring the camera (driver), and another in charge of sending the data over the bus to memory for later processing.

The driver handles the camera configuration and obtains the intensity value of the pixels with a resolution of 8 bits. The camera is configured through the I2C protocol.

The User Logic implements the controller logic. This entity communicates with the camera driver and sends the data to memory over the bus.

This component passes the data from the driver to the IPIF by setting the signals and addresses for the data transfer. Because the driver works with 8 bits of resolution per pixel and the bus is 32 bits wide, the data are stored in a buffer and sent as 32-bit packets (4 pixels). The start address of the write area is configurable through core parameters.
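
The grouping of four 8-bit pixels into one 32-bit bus word can be sketched in software as follows. This Python fragment is only an illustration of the idea, not the VHDL of the core: the byte order inside the word (first pixel in the most significant byte) and the example base address are assumptions, and the function names are hypothetical.

```python
def pack_pixels(pixels):
    """Group 8-bit pixels into 32-bit words, first pixel in the most significant byte."""
    if len(pixels) % 4:
        raise ValueError("pixel count must be a multiple of 4")
    words = []
    for i in range(0, len(pixels), 4):
        p0, p1, p2, p3 = (p & 0xFF for p in pixels[i:i + 4])
        words.append((p0 << 24) | (p1 << 16) | (p2 << 8) | p3)
    return words


def write_addresses(base_address, word_count):
    """Byte address of each 32-bit word, starting at the configurable base address."""
    return [base_address + 4 * n for n in range(word_count)]


words = pack_pixels([0x10, 0x20, 0x30, 0x40, 0xAA, 0xBB, 0xCC, 0xDD])
print([hex(w) for w in words])                                # ['0x10203040', '0xaabbccdd']
print([hex(a) for a in write_addresses(0x10000000, len(words))])
```

Packing four pixels per word means the bus is used once for every four camera samples instead of once per pixel.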

Fig. 3. Architecture of the camera core

The camera runs at 24 MHz and the core at 100 MHz; besides storing the data, the buffer synchronizes them so that both can work at their own frequencies. Data are written to the buffer when the driver indicates so. The buffer is written at 24 MHz and read at 100 MHz. On average, the data-sending protocol takes approximately 11 system clock cycles. In this way the system is able to send the data as they arrive.

The buffer was implemented with the srl_fifo_16 component obtained from OpenCores. Additional logic was added to this component so that it can fulfil the functions described. The buffer does not use block RAM but LUT-based registers. This leaves the block RAM free for other uses and improves portability, since it does not consume resources that vary significantly from one FPGA family to another.

4. EXPERIMENTAL RESULTS

The system was implemented on the Xilinx University Program (XUP) development kit, which carries a Virtex-II Pro (XC2VP30 [8]). The kit is known as XUPV2P [9] and is manufactured by Digilent Inc. The FPGA has two embedded PowerPC 405 microprocessors [10]; the MicroBlaze softcore processor can be used as an alternative to them. This version of the PowerPC differs in some respects from version 440 [11], used in some Virtex-5 families.

The platform required a total of 6609 LUTs (24 %). The camera core needed 1108 LUTs (4 %). The maximum frequency at which the platform can run was 106 MHz, and 163 MHz for the camera core. The system runs at 100 MHz (including the PowerPC microprocessor).

Several Bayer pattern decoding algorithms were implemented, which convert the input images into true-color images. The decoding algorithms tested were simple interpolation, bilinear interpolation and gradient-based interpolation [12][6].

The measurements were taken on these decoding algorithms running on the microprocessor. The times reported for each algorithm include the time to copy the image from memory to a temporary memory and the time to copy it to the memory of the VGA video output core.

In a second stage, a block RAM was attached to the DSOCM bus [13] connected to the microprocessor, to act as a data cache.

Table 1 shows the execution times of the different algorithms with and without the cache.

Table 1. Execution times

Algorithm           Without cache   With cache   Speedup
Simple interp.         251 ms         225 ms      11.6 %
Bilinear interp.       276 ms         270 ms       2.2 %
Gradient interp.       647 ms         602 ms       7.5 %

The difference in time between the three algorithms is mainly due to the difference in the number of memory accesses. Each read of a given pixel of the image is a memory access.

5. CONCLUSIONS

A generic platform for FPGA-based vision systems was developed using the CoreConnect architecture. The advantage of this platform is the ease of adding image-processing algorithms, whether implemented on the microprocessor or as a core connected to the bus. Another advantage is the ease of adding various cores to the platform. It is possible to work with more than one camera on the platform by adding one camera core per connected camera and configuring them to write to different memory areas.

A core was developed to control the camera and receive its data. It is portable, provided version 4.6 of the PLB bus is used. The write start address is parameterizable.

Measurements were made on the algorithms running on the microprocessor. The difference in time between the algorithms was found to be due to the different number of memory accesses each performs per pixel. Adding a cache to the microprocessor reduced the execution times.

6. FUTURE WORK

As future work, implementing the platform on a Virtex-5 FX FPGA is proposed. This FPGA has an embedded PowerPC 440 microprocessor, which communicates directly with the memory controller.

As a complement to serial transmission, routines are proposed for the program running on the microprocessor to send the processed images over Ethernet, using a network protocol such as TCP/IP or PPPoE (Point-to-Point Protocol over Ethernet), along with the development of a core that sends images over a high-speed serial interface.

To increase processing capacity, several microprocessors could be used to run different processing tasks. One alternative would be to run different algorithms on the different microprocessors. Another would be for the microprocessors to work cooperatively (running the same algorithm on different areas of the image), with one acting as master and the other as slave. On the Virtex-II Pro, which has two embedded PowerPC microprocessors, these would be used, whereas on the Virtex-5, which has only one, it would be combined with a softcore microprocessor (possibly MicroBlaze).

7. REFERENCES

[1] EMVA, “An introduction to machine vision,” 2010.
[2] IBM, CoreConnect Bus Architecture.
[3] Xilinx, Processor Local Bus (PLB) v4.6 (v1.04a), 2009.
[4] Xilinx, EDK Concepts, Tools, and Techniques, 2008.
[5] Micron, 1/2.5-Inch 5-Megapixel CMOS Digital Image Sensor, 2005.
[6] S. Imaging, RGB Bayer Color and MicroLenses, 2010.
[7] Xilinx, PLB IPIF (v1.00f), 2007.
[8] Xilinx, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, 2007.
[9] Xilinx, Xilinx University Program Virtex-II Pro Development System - Hardware Reference Manual, 2005.
[10] Xilinx, PowerPC 405 Processor Block Reference Guide, 2010.
[11] Xilinx, Embedded Processor Block in Virtex-5 FPGAs, 2010.
[12] R. Ramanath, W. E. Snyder, G. L. Bilbro, and W. A. Sander III, “Demosaicking methods for Bayer color arrays,” Journal of Electronic Imaging, vol. 11, pp. 306–315, 2002.
[13] Xilinx, Data Side OCM Bus v1.0, 2007.

RAPID PROTOTYPING OF AN IP CORE FOR APPLYING THE WAVELET TRANSFORM TO IMAGES

MELO Hugo Maximiliano (email: [email protected])
PEREZ Alejandro (email: [email protected])
GUTIÉRREZ Francisco (email: [email protected])
CAVALLERO Rodolfo (email: [email protected])

Centro Universitario de Desarrollo en Automación y Robótica (CUDAR)
Universidad Tecnológica Nacional, Facultad Regional Córdoba
M.M. Lopez esq. Cruz Roja Argentina – Ciudad Universitaria - Córdoba

ABSTRACT

A methodology is presented for implementing functional prototypes for embedded systems based on FPGA technology. The high-level tools used are described, and a filter bank for the 2-D Wavelet transform is developed and then incorporated into the embedded system.

1. INTRODUCTION

One of the projects developed at CUDAR of the Universidad Tecnológica Nacional, Facultad Regional Córdoba, is Video Compression with Wavelets in Programmable Logic. The project uses the XILINX Virtex-II Pro FPGA, which has two PowerPC-type processors embedded in silicon. The system uses one of these processors and incorporates several peripherals in the form of IP cores that configure the FPGA logic. The XILINX XPS was used as the development platform.

Video compression involves mathematical operations and algorithms, several of which were implemented as hardware functions. During development, different methods for rapid prototyping of functions were investigated and tested in order to evaluate and verify algorithms while avoiding the time required by conventional description methods. The method described in this work uses one of the most widespread programs in the mathematical field, MatLab from MathWorks, and its associated module SimuLink, a versatile platform for the design and simulation of dynamic systems. This allowed project members with little experience in programmable-logic development methodology to try out ideas and run simulations that can be taken to hardware with little effort.

XILINX has developed System Generator, which includes a set of blocks for use with SimuLink. The rapid prototyping method uses the concepts of "black boxes" and hardware abstraction to ease the conceptual handling of the modules.

Once the prototype has been implemented and simulated, the system must be downloaded to the hardware platform. XILINX proposes the EDK (Embedded Development Kit) as a high-level tool for developing embedded systems. This software platform consists of XPS (Xilinx Platform Studio) for hardware development and the SDK (Software Development Kit) for software development. These tools are designed to shorten development time. XPS allows the system architecture to be designed graphically. It works with functional blocks, each representing a hardware device distributed in the form of an IP (Intellectual Property) core. These use the cells of the FPGA to generate the physical device.

The auxiliary MatLab code will be replaced by program code implemented on the PowerPCs available on the platform.

2. WAVELETS

Wavelets are families of functions defined in space that are used for analysis; they examine the signal of interest to obtain certain characteristics of space, size and direction. The family is defined by:

\[ h_{a,b}(x) = \frac{h\left(\frac{x-b}{a}\right)}{\sqrt{|a|}} \qquad \forall a, b \in \mathbb{R} \wedge a \neq 0 \]

The Wavelet family is generated from a mother function h(x), which is modified by the variables a and b to obtain translations and temporal scaling. In this way the best concentration of information in time and frequency is achieved [1].

Wavelet transforms are classified into Discrete Wavelet Transforms (DWT) and Continuous Wavelet Transforms (CWT).

2.1. 2-D Discrete Wavelet Transform

Analysis by the Discrete Wavelet Transform (DWT) can be implemented with banks of low-pass and high-pass filters followed by down-sampling stages. For synthesis, filter banks and up-sampling of the signal are also used. Fig. 1 is a diagram of the analysis process.

Decimation (down-sampling) and interpolation (up-sampling) denote a decrease or an increase, respectively, in the number of samples, which is achieved by removing a sample or by inserting a zero between samples [2].

Figure 1. Simple decomposition

An image is a data matrix in which each element represents a pixel; a color image can be represented by its RGB or YCrCb components. To apply the Wavelet transform in two dimensions using the separable-filter method, the matrix must be traversed in two ways, first by rows and then by columns, as shown in Fig. 2.

Figure 2. 2-D Wavelet

The normalized energy of a sub-image formed by N Wavelet coefficients is defined as:

\[ E_{ni} = \frac{1}{N} \sum_{j,k} \left[ D_{ni}(b_j, b_k) \right]^2 \]

The Wavelet energy feature {E_ni}, n = 1...d, i = H, V, D, reflects the distribution of energy along the frequency axis over a given scale and orientation.

The energy of images is concentrated in the low frequencies. An image has a spectrum that decays as frequency increases. These properties are reflected in the Discrete Wavelet Transform of the image [3].

In compression and in some other applications of the transform, a multilevel technique is needed. It is obtained by successively applying the transforms to the approximation part of the previous stage. Fig. 3 shows a classic representation of the result of the multilevel Wavelet transform, in which the dimensions of the matrix are the same as those of the original image.

The nomenclature is read as follows: the first letter indicates the direction of the detail, or the approximation (V = Vertical, D = Diagonal, H = Horizontal, A = Approximation); the number represents the transform level to which it corresponds.

Figure 3. 3 levels of 2-D Wavelet

3. IMPLEMENTATION

The separable-filter method was chosen to implement the 2-D Wavelet transform. As a proof of concept, the different stages of the transform were implemented modularly. Each stage is presented as an independent module, and the modules are joined by MatLab code that will later be replaced in the FPGA by data-handling routines executed on the embedded processors. The method used and the results obtained are described below.

3.1. Characteristics

The FIR filters to be implemented are those that allow the Wavelet transform to be carried out by the filter-bank method. Their coefficients are obtained from the MatLab command window with the following expression:

[LO_D,HI_D,LO_R,HI_R]= WFILTERS('db3');
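
One level of the filter-bank analysis (filtering followed by down-sampling) can be sketched outside MatLab as follows. This Python fragment is an illustration only, not the SimuLink/System Generator model used in this work. It assumes the Daubechies-3 low-pass decomposition coefficients rounded to ten decimals, derives the high-pass filter from them by the quadrature-mirror relation, and uses plain full convolution; the function names are hypothetical.

```python
import math

# Daubechies-3 low-pass decomposition filter (6 coefficients), rounded to 10 decimals.
LO_D = [0.0352262919, -0.0854412739, -0.1350110200,
        0.4598775021, 0.8068915093, 0.3326705530]
# High-pass decomposition filter from the quadrature-mirror relation.
HI_D = [(-1) ** (k + 1) * c for k, c in enumerate(reversed(LO_D))]


def convolve(signal, taps):
    """Full linear convolution: len(signal) + len(taps) - 1 output samples."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out


def analyze(signal):
    """One decomposition level: filter, then keep every second sample."""
    approx = convolve(signal, LO_D)[1::2]
    detail = convolve(signal, HI_D)[1::2]
    return approx, detail


row = [1.0] * 100                      # one 100-pixel row of constant intensity
approx, detail = analyze(row)
print(len(approx), len(detail))        # 52 52
print(round(approx[10], 6), round(abs(detail[10]), 6))  # about sqrt(2) and 0.0
```

For a 100-sample row each branch yields 52 values, two more than the 50 expected from halving the input, because the full convolution lengthens the signal by the filter order before decimation.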

The 'db3' argument tells the MATLAB function that the mother Wavelet is a Daubechies 3. The resulting filters for this Wavelet are of order 5, with a total of 6 coefficients.

The schematic of "Fig. 4" is then built in the Simulink environment, using native Simulink blocks and System Generator blocks.

Figure 4. First simple decomposition

The input is the red chrominance component (Cr) of a 100 x 100 pixel image.

After the decimation of the first simple decomposition, the filters deliver N+2 coefficients with a numeric value, where N is the number expected after a decimation. These two values can be interpreted as extras, resulting from:

    N + 2 = \frac{\text{input data}}{2} + \left\lfloor \frac{\text{filter order}}{2} \right\rfloor

3.2. Implementation proposal

This work proposes to implement quickly in hardware a functional prototype of the system that allows a conceptual and functional evaluation of the design, without yet addressing its optimization.

The different scans of the data matrix are performed by direct programming in MATLAB, and the signal is applied to Simulink directly as a vector, which makes the matrix traversal transparent to Simulink.

MATLAB is designed to work with matrices, so operations on this kind of data array are extremely simple. Most of them reduce to operators, such as the one that returns a vector by traversing the matrix by columns. The horizontal and vertical scans are obtained with the same function, applied either to the original matrix or to its transpose. For this method to be usable, the matrix must be square and must also preserve the appearance of the original image.

As a test, the following proposition was studied: the first decomposition outputs 5002 data values, which is not compatible with a rectangular matrix. To obtain a rectangular matrix, the following solutions were tried:

a) 98 zeros were appended to make the number of data values compatible with the rectangular matrix needed for the next stage. This altered the reconstructed image, because the zeros became embedded in the analysis.

b) The number of data values was truncated, taking 5000 values as valid. The tests showed that not just any value can be discarded, since this affects the results of the later reconstruction. For each simple decomposition, the first N values were taken as valid and M values of the decomposition were discarded. The value of M is the integer part of the following ratio:

    M = \left\lfloor \frac{\text{filter order}}{2} \right\rfloor

The number N of useful data values is calculated with the following formula:

    N = \frac{\text{input data}}{2}

This yields the data needed to form the rectangular matrix required by the following steps.

This method was used in both the decomposition stage and the reconstruction stage.

Note that, for a correct reconstruction with this method, the data trimming must be applied in the opposite way. That is, if the first N values were used in the decomposition stage, the last N' values must be used in the reconstruction stages:

    N' = (\text{input data}) \cdot 2

leaving out M output coefficients.

3.3. Distortion at the edges of an image

The filter-bank theory used to implement the Wavelet transform is formulated for, and works properly with, infinite signals, but distortions appear at the boundaries of finite signals, as is the case for an image [4].

Several methods have been proposed to solve this problem. All of them extend the signal in some way. The literature consulted proposes, among others, circular convolution and symmetric reflection, which are obtained by reflecting and symmetrically repeating the samples at the boundary. MATLAB also offers the option of zero padding.

These extensions are not arbitrary and depend exclusively on the filter order. Despite the extension, the output still shows distortion at the edges; however, the symmetric reflection is also fairly easy to see at the filter output. Removing this symmetric distortion yields the perfectly recovered output, which can be verified with MATLAB through:

[Ap De]= dwt (entrada, ‘db3’);

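The data counts quoted in Sections 3.1 and 3.2 can be reproduced with a few integer operations. The Python sketch below only checks that arithmetic and shows the two trimming directions; the function names are hypothetical and no filtering is performed.

```python
def decomposition_counts(n_input, filter_order):
    """Counts for one simple decomposition (Sections 3.1 and 3.2)."""
    n_useful = n_input // 2        # N: values expected after decimation by 2
    m_extra = filter_order // 2    # M: integer part of (filter order / 2)
    n_output = n_useful + m_extra  # values actually delivered by the filters
    return n_useful, m_extra, n_output


def zeros_to_fill_rows(n_values, row_length):
    """Zeros needed to complete the last row of a matrix (solution a)."""
    return (-n_values) % row_length


def trim_for_decomposition(values, n_useful):
    """Solution b: keep the first N values, dropping the trailing extras."""
    return values[:n_useful]


def trim_for_reconstruction(values, n_useful):
    """Inverse trimming: keep the last N' values, dropping the leading extras."""
    return values[len(values) - n_useful:]


# Cr component of a 100 x 100 image, db3 filters of order 5.
n, m, total = decomposition_counts(100 * 100, 5)
print(n, m, total)                     # 5000 2 5002
print(zeros_to_fill_rows(total, 100))  # 98
print(len(trim_for_decomposition(list(range(total)), n)))  # 5000
```

With these inputs the sketch gives the 5002 output values, the 98 padding zeros, and the 5000 valid values mentioned in the text.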
In this call, entrada is a vector of finite values, db3 is the wavelet type used to calculate the coefficients, and the outputs Ap and De are the Approximation and the Detail, respectively.

3.4. Result of the implementation

When the image reconstructed with the implemented method was compared with the original, errors were found in the upper left corner of the image, more specifically in an n x n submatrix, where "n" is the number of coefficients of the filter implemented for Daubechies 3.

To solve this problem, frames were added to the original image. In this rapid-prototyping experiment, a frame with a numeric value of one was used. This made it possible to see where the image begins once the frame ends and, although no boundary extension was used, a perfect reconstruction was achieved. The reason is that the errors of the information-truncation processes used to form the auxiliary traversal matrices, described above, fall inside the perimeter of the frame, "Fig. 5", which is later removed.

Figure 5. Frames added to the original image.

3.5. Implementation of the module

The system generated in VHDL by Simulink has as ports two 8-bit input/output registers, an input for the system clock, and a chip enable signal, which in this application will be tied to logic 1. To embed it, the "Create or import Peripheral" wizard is run.

The steps to implement an embedded system using the XPS environment are:

a) Graphically generate the base platform: microprocessor, bus, memory controllers, generic I/O peripherals, etc.

b) Generate a new IP to incorporate the top-level entity of the Wavelet blocks. This must be taken into account in the functions selected in the wizard that generates the intellectual property interface (IPIF): there must be two software-accessible registers, which are the same ones required by the module generated by System Generator.

c) Instantiate the sources of the files generated by System Generator in the files "user:logic.hdl" and "top_entidad.hdl".

Once the incorporation has been verified by synthesizing the sources, the IP is added from the repository in the EDK environment, connected to the corresponding bus, and the system memory addresses are generated.

Figure 5 EDK.

Once the filter is incorporated into the system, a routine will be created in SDK that writes into the filter input register the data sent over the serial port of a PC connected to the system. The resulting data in the output register are sent to the PC to be compared with the results of the simulations performed in Simulink.

4. CONCLUSION AND FUTURE WORK

The tests showed that the methodology used is effective for implementing IP modules, and that the interaction between the high-level tools is robust and reliable. Being able to use the power of MATLAB for algorithm development and verification opens the door to implementing hypotheses quickly for validation or refutation, shortening research and development times. The evolution of development tools for embedded systems on FPGA platforms shows significant and sustained growth of this field in this type of technology.

5. REFERENCES

[1] Jalali, Payman. "Wavelets and application." Energy Technology Department. Lappeenranta University of Technology.
[2] Burrus, Sidney; Gopinath, Ramesh; Guo, Haitao. "Introduction to Wavelets and Wavelet Transforms". Electrical and Computer Engineering Department. Rice University. Houston, Texas.
[3] Borja José García Menéndez; Eva Mancilla Ambrona; Ruth Montes Fraile. "Optimización de la transformada Wavelet Discreta". Universidad Complutense de Madrid, Facultad de Informática, 2004-2005.
[4] Strang, Gilbert; Nguyen, Truong. "Wavelets and Filter Banks".

Cortex-M0 Implementation on a Xilinx FPGA
Pedro Ignacio Martos y Fabricio Baglivo

Laboratorio de Sistemas Embebidos


Facultad de Ingeniería - UBA
Buenos Aires, Argentina
[email protected] / [email protected]

ABSTRACT

This paper presents an implementation of the recently available ARM Cortex™-M0 soft core processor on a small Xilinx FPGA. Processors of this kind are oriented to mixed-signal and microcontroller devices with low cost and low power consumption. At present, Xilinx does not offer a Cortex-M0 soft core solution. By contrast, Actel and Altera offer Cortex™-M1 cores in FPGA solutions. The aim of this work is to connect the ARM Cortex-M0 soft core to a synthesized code memory using the Advanced Microcontroller Bus Architecture (AMBA®) AHB-Lite interface in a Xilinx FPGA, in order to evaluate how many fabric resources are needed to implement this soft core processor. For this purpose, we have designed a small system that receives the data request from the processor, sends it to the synthesized memory, and returns the data obtained to the processor. In this implementation, the proper operation of the system is monitored with Chipscope Pro, a Xilinx in-system debugging tool.

1. INTRODUCTION

Within the 32-bit ARM processor family, shown in Figure 1, Cortex-A processors support high-performance applications with embedded operating systems and real-time applications. Cortex™-R processors are oriented toward low-cost controllers with deterministic, fixed-latency interrupt handling. These processors are also intended for high-performance, real-time applications. Cortex-M is the lowest member of the Cortex family. This family is an option for simple, low-cost, low-power and low-performance designs. The Cortex-M0 is, nowadays, an alternative to 8-bit microcontrollers, with the advantage of high processing capacity. It is built on a high-performance processor core, with a 3-stage pipelined Von Neumann architecture. This processor implements the ARMv6-M architecture, which supports the ARMv6-M Thumb® instruction set, including Thumb-2 technology. This provides the exceptional performance expected of a modern 32-bit architecture, with a higher code density than other 8-bit and 16-bit microcontrollers, and is capable of achieving performance figures around 0.9 DMIPS/MHz.

Figure 1 ARM Cortex Family

1.1. Cortex-M0 DesignStart

The Cortex-M0 DesignStart (M0DS) processor, shown in Figure 2, is a fixed configuration of the Cortex-M0 (M0) processor. It is delivered by ARM as a pre-configured, obfuscated, but synthesizable Verilog netlist of the full Cortex-M0 processor. The main differences between them are: a) M0 has a full AMBA interface that provides MASTER and SLAVE support, while the M0DS provides an AMBA Lite interface with only MASTER support; b) M0 can be implemented with either a high-speed multiplier (1 clock cycle) or a slow-speed multiplier (32 clock cycles), while the M0DS only allows the slow-speed multiplier; c) M0 can handle 32 interrupts, while M0DS can handle 16 interrupts; and d) M0 includes an optional wake-up interrupt controller, architectural clock gating and hardware debugging, while M0DS does not include these options.

The M0DS processor is distributed as a single zipped tar file bundle, containing the release notes, synthesized

Verilog, and test-bench code. The test-bench instantiates the Cortex-M0 DS module and connects it in a minimal way to a memory model and to clock and reset generators. It also provides a means of sending information from the processor to the Verilog simulator's console.

The aims of this work are to synthesize the Cortex-M0 DS in a real FPGA, connect it through the AMBA-Lite interface to a memory holding a small program, evaluate how many FPGA fabric resources this requires, and assess its applicability in small-footprint systems.

1.2. AHB AMBA-Lite overview

This protocol defines the data and address buses and all the control signals for high-performance synthesizable designs that require high bandwidth. First, the address bus, HADDR[31:0], is a MASTER output to a SLAVE device. Data transfers are performed using two buses: a read bus, HRDATA[31:0], which is a SLAVE output to a MASTER input, and a write bus, HWDATA[31:0], which is a MASTER output to a SLAVE. In this work we use only HRDATA. Finally, the protocol specifies control signals. HBURST[2:0] indicates whether the transfer is a single transfer or part of a burst. HMASTLOCK indicates whether the current transfer is part of a locked sequence. HPROT[3:0] provides additional information about a bus access and is primarily intended for modules that want to implement some level of protection. HSIZE[2:0] indicates the size of the transfer, e.g., byte, half word, or word. HTRANS[1:0] indicates the type of the current transfer (IDLE, BUSY, NONSEQUENTIAL, SEQUENTIAL). HWRITE indicates the transfer direction: HIGH indicates a write transfer and LOW a read transfer.

2. THE IMPLEMENTATION

2.1. The board

For this paper we used a Digilent Spartan 3E Starter FPGA board. This digital system design platform is built around a Xilinx Spartan 3E FPGA. It has 16 megabytes of fast SDRAM and 16 megabytes of flash ROM. It also has a 50MHz oscillator, plus a socket for a second oscillator. The board contains a USB2 port that provides board power, programming, and data transfers. Some peripherals are also included on the board, such as an LCD display, a set of LEDs, switches, etc. In particular, we use LEDs as event indicators. One of the major advantages of this board is that it is possible to use the Xilinx software tools (Impact, Chipscope Pro, xmd, etc.) by downloading software (Adept) from the Digilent web page. This is very important for our purpose because it makes it easy to program the board and observe its state.

The Xilinx S3E500-4 is the FPGA included on the board. It has 500K gates, 10,500 logic cells, 20 hardware multipliers, 360Kbits of dedicated RAM, 73 Kbits of distributed RAM, 4 clock handlers, and a maximum clock frequency of 300MHz.

Figure 2 Cortex-M0 implementation

2.2. Software tools

We used the Xilinx® Integrated Software Environment (ISE™) as our design suite. The ISE project navigator allowed us to manage the project and run synthesis. Core Generator is the tool we used to generate the ROM memory and the reset generator. ISIM was used for simulation. We also used Chipscope Pro for online debugging; this program let us see the state of the system. To compile C code, we used the ARM Microcontroller Development Kit by Keil™.

The ARM deliverables package contains a logical folder with synthesizable code and test-bench code. The test-bench project has a Verilog implementation of the processor, Cortexm0ds_tb.v, prepared for simulation. It also includes a HelloWorld.c program with a makefile containing the compilation parameters. The result of this compilation is a .bin file. This file is a memory

image that is loaded by the Verilog test-bench at the beginning of the simulation.

Figure 3 Cortex M0 Design test-bench Schematic

2.3. System Implementation

The first step of the implementation process was the replication of the Hello World project. We found that the makefile did not work correctly with Keil or IAR, so we decided to start a new project with Keil. We ran the compilation and compared the resulting .bin file with the one ARM provides in the test-bench package. After that, we began the simulation with ISIM. There we saw that the data bus did not contain valid values. The .bin file was not correctly loaded by the Verilog code because it was assumed that the Verilog instruction $fread() made double-word accesses, whereas it actually made byte accesses. After this modification the test-bench worked as expected.

The next step was the implementation of the test-bench code as synthesizable VHDL code (see Figure 3). An external 50MHz oscillator was used as the external clock. We used a synthesized DCM to generate the 10MHz system clock. The reset generator was implemented using the Xilinx architecture wizard. A reset pulse of two clock cycles is needed to initialize the processor. The pre-initialized 1Kx32bit RAM was created using the Core Generator tool. The processor performs 32-bit data accesses, so we had to shift the RAM address bus by 2 positions, so that processor addr[2] is connected to RAM addr[0]. Some LEDs were connected to internal signals, namely LOCKED and SLEEPING. The address and data buses of the AMBA-Lite interface and HWRITE were connected directly to the memory. HREADY was fixed to '1' and HRESP was fixed to '0'. Other signals of the interface were connected to internal signals for debugging purposes. A bus signal detector was developed to compare HWRITE bus information with two patterns, one to turn the LED on and the other to turn it off. The user constraint file (.ucf) was defined using the Plan Ahead tool.

A "Toggle LED" project was created. Its aim was to turn one of the board's LEDs on and off with the Cortex processor. The .bin file was generated using the Keil tools. The Core Generator tool allowed us to fill the memory with an image. The required file format was .coe. A script was written to transform the .bin file into a .coe file. Once the project was synthesized, IMPACT was used to program the board.

3. RESULTS

As mentioned before, the Chipscope Pro package was used to see the transitions on the AMBA AHB-Lite interface signals and for debugging. With this tool, we could verify that the interface worked as expected. All memory accesses were correctly synchronized and we saw that the LED on the board blinked. We therefore conclude that the implementation is correct and functional.

Figure 4 shows some results of the FPGA usage with the project implementation. Figure 5 shows the simulation of the memory accesses in the Keil software. Figure 6 shows the Chipscope Pro capture of the AMBA bus.

Figure 4 FPGA use with the Project Implementation.

Figure 5 Simulation in Keil

Timing reports (post place & route) gave a maximum clock speed of 40MHz. This value could be improved by applying timing and placement constraints.

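The .bin to .coe conversion script mentioned in Section 2.3 is not listed in the paper. A minimal Python sketch of such a conversion is given below, together with the two-position address shift used for the 32-bit-wide RAM. It assumes that the image is little-endian and is stored as 32-bit words in hexadecimal radix; the function names and the sample bytes are hypothetical.

```python
import struct


def bin_to_coe(image):
    """Return .coe text holding `image` as 32-bit little-endian words."""
    # Pad with zero bytes so the image is a whole number of 32-bit words.
    image = image + b"\x00" * ((-len(image)) % 4)
    words = struct.unpack("<%dI" % (len(image) // 4), image)
    header = ["memory_initialization_radix=16;",
              "memory_initialization_vector="]
    body = ",\n".join("%08x" % w for w in words)
    return "\n".join(header) + "\n" + body + ";\n"


def ram_word_address(byte_address):
    """Word index in a 32-bit-wide RAM for a processor byte address."""
    return byte_address >> 2


# Two illustrative words (hypothetical values, not the paper's program).
sample = bytes([0x00, 0x04, 0x00, 0x20, 0xC1, 0x00, 0x00, 0x00])
print(bin_to_coe(sample))
print(ram_word_address(0x0000000C))  # 3
```

In a real flow, the bytes would be read from the compiled image with open(path, "rb").read() and the returned text written to the .coe file given to Core Generator.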
Figure 6 Capture of the AMBA Lite bus

4. CONCLUSIONS

The most remarkable conclusion is that it is possible to implement the M0DS in a low-range FPGA. With this result, Xilinx, Actel and Altera (the three most important FPGA manufacturers) can support this core, making it a considerable alternative when portability between these three FPGA types is a design requirement.

As an improvement, it would be useful to have a complete test bench that allows us to generate the bin file from the source code. That is not possible right now, and it would shorten development time.

In future work, other peripherals will be connected to the AMBA bus in order to increase the processor's capacity. Going further, a Linux operating system can be investigated on this processor, obtaining a Linux implementation in a small-footprint design.

I. REFERENCES

[1] ARM Ltd, "ARM DDI 0419C ARMv6-M Architecture Reference Manual", September 2010.
[2] ARM Ltd, "ARM IHI 0033A AMBA 3 AHB-Lite Protocol V.1 Specification", June 2006.
[3] ARM Ltd, "AT510-DC-80001-r0p0-00-rel0 ARM Cortex M0 DesignStart Release Note", August 2010.
[4] ARM Ltd, "ARM DDI 0432C Cortex M0 Revision r0p0 Technical Reference Manual", November 2009.
[5] ARM Ltd, "ARM DUI 0497A Cortex M0 Devices Generic User Guide", October 2009.
[6] Xilinx, "DS312 Spartan-3E FPGA Family: Datasheet", August 2009.
[7] Digilent, "Digilent Spartan 3E Starter Kit Reference Manual", June 2008.

II. ACKNOWLEDGMENTS

To William Hohl, Joe Bungo, Fiona Cole, the people at the AR University Program and the people at the Xilinx University Program (XUP) for their support and cooperation.

III. TRADEMARKS AND COPYRIGHTS

The information about ARM processor families was mainly extracted from the ARM Ltd web site (www.arm.com), as published in October 2010.

ARM, CORTEX, CORTEX-M, AMBA, AMBA-LITE, and other designated brands included herein are trademarks of ARM Limited.

Xilinx, Spartan, ISE, and other designated brands included herein are trademarks of Xilinx Inc.

Digilent, Spartan3E Starter Kit, Adept, and other designated brands included herein are trademarks of Digilent Inc.

All other trademarks are the property of their respective owners.

Figures 1 to 3 are copyright ARM Ltd. Reproduced with permission.

References [1] to [5] are copyright ARM Ltd.

Reference [6] is copyright Xilinx Ltd.

Reference [7] is copyright Digilent Inc.

DIGITALLY CONFIGURABLE PLATFORM FOR POWER QUALITY ANALYSIS

B. Falduto, E.Ferro, R. Cayssials1

Universidad Nacional del Sur – CONICET1


Department of Electrical Engineering
Bahía Blanca – Argentina
email: [email protected]

ABSTRACT (1)

Nowadays, power quality requires sophisticated approaches to achieve an efficient utilization of both the electrical energy and the electrical installations and equipment. Most power quality devices are based on analogue circuits to synchronise the processing stage with both the voltage and the current of the power line. Because of the analogue components used, these approaches must be precisely tuned and calibrated.

On the other hand, power quality analysis is covered by several standards. These standards define the parameters and measures that have to be satisfied in order to guarantee adequate power quality. As new parameters or perturbations appear, new standards or modifications can be proposed to adapt them to new needs.

The proposed digital architecture aims to integrate the synchronization stage and the processing unit in a peripheral core compatible with a processor bus. This core can be easily modified to include future power quality standards.

(1) This work was supported by the Technological Modernization Program under Grant BID1728/OC-AR-PICT2005 Number 38162 and by the project "Digital processing platform for Active Power Line Filters" granted by Fundación Hermanos Agustín y Enrique Rocca.

1. INTRODUCTION

In recent years, power quality (PQ) has become a significant issue for both power suppliers and customers. There have been important changes in relation to power quality. First of all, load characteristics have become so complex that the voltage and current of the power line connected to these loads are easily distorted. Lately, for example, non-linear loads with power electronic interfaces that generate large harmonic currents have greatly increased in power systems ([1]).

Power quality disturbances can range from impulses with rise times in the microsecond range to long-duration variations in voltage magnitude.

To solve various quality problems, PQ should be measured accurately according to the exact definition of each power quality category, and then evaluated and diagnosed in versatile ways. Power quality monitoring systems (PQMS), which characterize power quality events and variations, have progressed rapidly using high-tech detection functions. PQMS vary widely in structure, but the recent trend is the permanent type, installed at one site to assess power quality 24 hours a day without a break. Moreover, in many cases, to efficiently manage local power systems that have more than one PQMS, the analyzed data of each PQMS are sent to supervision controllers over network connections ([22]).

On the other hand, the electric market defines a set of economic conditions that have to be taken into account in order to achieve a cost-effective utilization of electrical energy. As power quality becomes an economic, ecological and electrical-efficiency concern, several standards have been proposed ([5, 6, 7, 8]).

Embedded systems are used to implement most power quality equipment and instrumentation. These systems have to be designed to meet the power quality standards, as well as to be compatible with other instruments. Compatibility issues may become a concern, since most power quality devices should work synchronised to meet the requirements defined by the standards.

The measurement of harmonics is already defined in [6] and requires processing the voltage and current information. In this paper, we propose a digital architecture to process the power quality parameters according to IEC 61000-4-7. The only analogue components required are the attenuators and the A/D converters. All synchronization and processing are performed digitally, and consequently there is no need for calibration.

This paper is organized as follows: Section 2 describes the main concepts in PQ. Section 3 explains the need for a variable sampling frequency. Section 4 presents

the FFT analysis of harmonic and interharmonic waves. In Section 5, we describe the proposed platform. In Section 6, the architecture is analysed. Conclusions are drawn in Section 7.

2. POWER QUALITY CONCEPTS

Power quality is determined by a set of measurements performed on the voltage and current waveforms. The main purpose of these measurements is to determine both (1) how efficiently the electrical energy is utilized and (2) how good the energy provision is.

Earlier electrical applications consisted of linear and balanced loads. With such loads, power quality analyses were confined to determining the phase angle between the voltage and current waveforms. The cosine of this phase angle, denoted cos(φ), gives the ratio between the electrical energy effectively utilized (active energy) and the electrical energy supplied (apparent energy).

Nowadays, the characteristics of the loads are different. Most electrical loads use semiconductor devices that produce non-linear behaviour and consequently introduce perturbations into the power line that worsen the PQ of the system.

The perturbations defined in [3], and supported by most modern PQMS, are:
• Swell: an increase in the A/C voltage, with a duration that may range from half a cycle to a few seconds.
• Sag: the same as a swell, but with a reduction of the voltage.
• Flicker: a momentary interruption of the electrical energy.
• Undervoltage: a reduction of the voltage lasting more than 1 minute.
• Overvoltage: an increase of the voltage lasting more than 1 minute.
• Interruption: a reduction of the voltage below 10%.
• Harmonics: voltage and current components with a frequency different from the frequency of the power line.
• Frequency deviation: the difference between the measured frequency and the theoretical frequency of the power line.

Similarly, and according to IEC 61000-4-30, a Power Quality Analyzer should analyze and evaluate these quantities: power frequency, magnitude of the supply voltage, flicker, harmonics and interharmonics, supply voltage unbalance, rapid voltage changes, and voltage dips, swells and interruptions. The standard suggests monitoring and analysing the current as well, and it specifies a measurement uncertainty better than 0.1% for the complete instrument, including sensors (e.g. current clamps).

An adequate measurement is the basis for any other power quality device. Modern PQ monitoring systems range from traditional watt-hour meters or digital protection relays in which the PQ analysis algorithms are inserted, to complex devices that deal with PQ parameters and events. Most of these devices have to be configured in order to implement adequate strategies to guarantee an acceptable power quality. PQ strategies determine the actions to take for the different PQ events. Possible actions include modifying the regime of the electrical loads, connecting compensators, switching off secondary components, etc.

3. VARIABLE SIGNAL SAMPLING

An important parameter for power quality is the harmonic content of the power line supply and load. The specification of the measurement and analysis is well defined in [5] and [8].

In [8], it is specified that the measurement interval shall be 10 cycles for 50Hz or 12 cycles for 60Hz. The standard defines the time period to be measured and how the measured values are aggregated. The interval is not fixed in time but varies as the fundamental frequency of the power changes. This kind of measurement requires synchronization with the power line in order to adapt the sampling interval accordingly. The easiest way to achieve an adaptive sampling frequency is to use a PLL (Phase-Locked Loop).

A PLL is an electronic feedback system that generates a signal whose phase is locked to the phase of an input or reference signal. This is accomplished in a common negative feedback configuration by comparing the output of a voltage controlled oscillator with the input reference signal using a phase detector.

Analog PLLs are generally built of a phase detector, a low pass filter and a voltage-controlled oscillator (VCO) placed in the forward path of a negative feedback closed-loop configuration. Figure 1 shows a block diagram of a basic PLL structure.

[Figure 1 blocks: reference input, phase detector (PD), low pass filter (LPF), voltage controlled oscillator (VCO), VCO output]

Figure 1: Block diagram of a basic PLL structure

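To make the need for an adaptive sampling frequency concrete, the short Python sketch below computes the window duration and the sampling rate that keeps a fixed number of samples in each 10-cycle or 12-cycle window as the fundamental drifts. It is only an arithmetic illustration and not the PLL itself. The function names are hypothetical, and the default of 2048 samples per window is assumed here because it is the DFT length the paper adopts in Section 4.

```python
def window_cycles(nominal_hz):
    """Cycles per measurement window: 10 for 50 Hz systems, 12 for 60 Hz."""
    cycles = {50: 10, 60: 12}
    if nominal_hz not in cycles:
        raise ValueError("nominal frequency must be 50 or 60 Hz")
    return cycles[nominal_hz]


def window_duration(actual_hz, nominal_hz):
    """Duration in seconds of one window at the measured fundamental."""
    return window_cycles(nominal_hz) / actual_hz


def sampling_frequency(actual_hz, nominal_hz, samples_per_window=2048):
    """Sampling rate that keeps exactly samples_per_window in each window."""
    return samples_per_window * actual_hz / window_cycles(nominal_hz)


# A 50 Hz network whose fundamental drifts by half a hertz either way.
for f in (49.5, 50.0, 50.5):
    print(f, window_duration(f, 50), sampling_frequency(f, 50))
```

At exactly 50 Hz (or 60 Hz with 12 cycles) the window lasts 200 ms and the rate is 10240 Hz; any drift of the fundamental changes both values, which is why the sampling clock has to track the power line.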
Analogue PLL circuits should be calibrated in order to achieve adequate response times when the input frequency changes. Moreover, including additional analogue components in a system requires a more careful design to avoid interference between the analogue and digital circuits.

On the other hand, software PLL circuits require strict temporal requirements to be met. The processing time required by this kind of algorithm is very demanding and hard to meet on most embedded processors without exclusive or prioritised use. When these temporal constraints are not met, harmonic components are introduced in the frequency analysis, and consequently measurement errors.

In this paper, we used the digital PLL circuit for power line applications proposed in [9]. This PLL is implemented as a digital circuit that produces a fast response when the power frequency changes. This PLL circuit does not require processing time from the system processor.

The PLL synchronises with the power frequency and generates the adequate sampling frequency to meet the harmonic analysis specified in [5] and [8]. The goal of an adequate synchronization for the harmonic analysis is to reduce the spectral leakage effect. Besides, the PLL used computes the sine and cosine of the power line frequency, which are used to easily detect voltage and current perturbations as well as to determine reactive, active and apparent powers and energies.

4. FFT FOR HARMONICS AND INTERHARMONICS ANALYSIS

The power frequency is called the fundamental frequency. A sinusoidal wave with a frequency k times the fundamental (k is an integer) is called a harmonic wave, or harmonic for short. A sinusoidal wave whose frequency cannot be expressed as an integer multiple of the fundamental is called an interharmonic wave, or interharmonic for short.

The principle of harmonic and interharmonic measurement is specified in [6]: a 200 ms window (10 periods of a 50Hz signal or 12 periods of a 60Hz signal) is used in the DFT calculation, resulting in a 5Hz increment in the frequency spectrum.

For power quality measurements, the analysis of harmonics is usually limited to the 50th harmonic (i.e. to 2500Hz for a 50Hz signal).

The Fourier transform converts a time-sampled signal into its frequency spectrum. For discrete-time applications, the appropriate transform is the DFT (Discrete Fourier Transform), which the FFT (Fast Fourier Transform) computes efficiently.

With these specifications, the sampling frequency must be at least 100 samples per period to get the 50th harmonic frequency as the highest that can be measured. Besides, the 10/12 cycle window requires at least 1200 samples per window. Because of the radix-2 factor of the DFT transform, a length of 2048 was chosen, with a sampling frequency varying from 9kHz to 13kHz.

5. PLATFORM PROPOSED

Power quality measurement and analysis require strict temporal constraints to be met. On the other hand, they also require processing and storing a large amount of information. We can define two kinds of functions that a power quality device has to carry out: (1) power quality synchronization, measurement, monitoring and analysis and (2) processing, storage and communication of the data and information of the system.

When the first kind of function is implemented in software, perturbations are introduced when the processor's time is shared among all the functions of the system. These perturbations are produced because the real-time features of a Real-Time Operating System are not well matched to the temporal constraints that power quality analysis requires.

We propose a platform based on an FPGA device that implements two main units: (1) the Power Quality unit (PQU) and (2) the SoPC unit with Linux. Both units communicate through a bus that maps the PQU into the memory map of the SoPC unit.

The PQU contains the acquisition, synchronisation and DFT stages. Voltage and current signals are attenuated, isolated, filtered, converted from analogue to digital and transformed. Figure 2 shows a scheme of the PQU.

[Figure 2 blocks: Bus Interface, DFT, Dual Port Memory, LP Filter, PLL, Sampling Generator, clk, ADC LM12L458, Isolation, Attenuation (Analogue Stage), inputs I1, I2, I3 and V1, V2, V3]

Figure 2: PQU Architecture

Whilst the power quality measurement needs dedicated hardware to meet the strict timing constraints, the rest of the functions required to communicate, store or process a great deal of information may be

93
implemented on a processor. For this reason, a soft-processor FPGA-based system was found to be a suitable alternative. The System on Programmable Chip (SoPC) gives the system great flexibility as well as a friendly environment in which to build the embedded system architecture. We use the NIOS II soft-processor from Altera, implemented on a Cyclone III FPGA device. The system runs the µCLinux operating system to support the software applications.

Figure 3 shows the architecture of the FPGA-Linux board. µCLinux was chosen because of its wide support for communication and storage. The embedded system offers native server communications through Ethernet and serial interfaces.

Figure 3. Architecture of the FPGA-Linux board (applications: Modbus RTU protocol, Modbus TCP protocol, waveform conversion, perturbations-to-events processing, data exchange; µCLinux drivers; NIOS processor with serial interface as secondary communication link, Ethernet interface to a PC with Matlab/Simulink, and PQU for voltages and currents)

The NIOS processor and the PQ unit are connected through the Avalon interface. A µCLinux device driver was written to give software applications easy access to the PQ unit.

The FPGA-Linux board was implemented on an Altera DE2 board with a Cyclone® II 2C35 FPGA. The board includes Ethernet and serial ports used to communicate with a supervisor PC. The drivers and protocols for these communication links are easily implemented as Linux applications.

6. RESULT ANALYSIS

Power quality standards do not prescribe protocols or experiments that have to be carried out to meet the different Class requirements. Instead, they define the measurements and parameters for power quality analysis and monitoring. This makes it difficult to assure that a certain instrument, device or platform meets the power quality specification of the standards.

Several simulations have been carried out considering different perturbation scenarios, finding processing errors within the boundaries of the standard. However, we cannot assure that this performance is achieved for all the possible perturbations considered for power quality analysis. We need to define a testing protocol to compare the architecture with a Class A certified analyser.

However, we can assure that the digital architecture of the proposed PQ unit allows us to configure the synchronisation, transform and analysis parameters to optimise the performance of the unit for different power line perturbations. This feature improves the flexibility of the architecture.

The use of an FPGA device running µCLinux reduces the design complexity. Linux reduces the complexity of implementing the data processing and data communication functions, since they are programmed as applications that use already tested drivers.

The processing and communication speed reached is adequate to measure and analyse the power quality parameters with harmonics up to the 60th order.

7. CONCLUSIONS

Modern loads use semiconductor devices whose non-linear behaviour introduces perturbations in the power line. Such perturbations can reduce the efficiency of energy utilisation as well as damage the equipment connected to the power line. Several standards have been published to define the parameters that must be taken into account to assure a good quality of service.

Power quality requires the processing of the voltage and current of the power line. Analogue and software approaches have been proposed for this purpose. Whilst the analogue ones require precise tuning and calibration for each device, the software ones require a great deal of processing time from the system processor.

We proposed a power quality platform implemented on an FPGA. The power quality measurement, synchronisation and analysis are performed by the Power Quality unit. This unit may be changed and modified in order to incorporate new power quality specifications.

On the other hand, the processing, storage and communication are implemented on a NIOS II soft-processor running a Linux version for software support.

In this way, the platform is highly flexible in both the power quality unit and the SoPC unit. Changes made in one unit do not affect the other, making design and adaptation easy.

8. REFERENCES

[1] B. H. Chowdhury, "Power Quality," IEEE Potentials, vol. 20, pp. 5-11, 2001.
[2] I.-Y. C.-J. Won, J.-M. Kim, S.-J. Ahn, S.-I. Moon, J.-C. Seo, and J.-W. Choe, "Development of Power Quality Diagnosis System for Power Quality Improvement," presented at Power Engineering Society General Meeting, 2003.
[3] IEEE Std 1100-1992, "IEEE Recommended Practice for Powering and Grounding Sensitive Electronic Equipment" (IEEE Emerald Book), ISBN 1-55937-231-1, 1992.
[4] D.-J. Won, I.-Y. Chung, J.-M. Kin, S.-J. Ahn, S.-I. Moon, J.-C. Seo, and J.-W. Choe, "Power Quality Monitoring System with a New Distributed Monitoring Structure," KIEE International Transactions on PE, vol. 4A, pp. 214-220, 2004.
[5] EN 50160: Voltage characteristics of electricity supplied by public distribution systems.
[6] IEC 61000-4-7 Amend.1 to Ed.2: Electromagnetic compatibility (EMC): Testing and measurement techniques - General guide on harmonics and interharmonics measurements and instrumentation, for power supply systems and equipment connected thereto.
[7] IEC 61000-4-15 Electromagnetic compatibility (EMC): Testing and measurement techniques - Flickermeter - Functional and design specifications.
[8] IEC 61000-4-30 Electromagnetic compatibility (EMC): Testing and measurement techniques - Power quality measurement methods.
[9] Ricardo Cayssials, Omar Alimenti, Edgardo Ferro, "A Digital PLL Circuit for AC Power Lines with Instantaneous Sine and Cosine Computation", IV IEEE Southern Conference on Programmable Logic, San Carlos de Bariloche, Argentina, ISBN 978-1-4244-1992-0, March 26-28, 2008.
SOLAR TRACKER FOR COMPACT LINEAR FRESNEL REFLECTOR USING PICOBLAZE

Daniel Hoyos, Maiver Villena, Carlos Cadena ∗
INENCO, Universidad Nacional de Salta
Av. Bolivia 5150 - Salta (Argentina)
email: {hoyosd, maiver, cadena}@unsa.edu.ar

Victor Serrano, Telmo Moya, Marcelo Gea †
Departamento de Física, Universidad Nacional de Salta
Av. Bolivia 5150 (Salta)
email: {serranovh, tmoya, geam}@asades.unsa.edu.ar

∗ Instituto de Investigación en Energías no Convencionales
† Departamento de Física - Universidad Nacional de Salta.

ABSTRACT

This paper describes a distributed control system for a Compact Linear Fresnel Reflector using a combination of chronological and light-sensing tracking techniques. The system uses LabVIEW at the controller stage, ZigBee for wireless communications and Spartan 3 FPGAs at the input/output stages.

1. INTRODUCTION

One of the most interesting options for generating electric energy from renewable sources is to heat water using Fresnel solar concentrators, which are mirror arrays that send sunbeams to an absorber elevated above them. The reflectors are low-curvature parabolic cylinders. They are installed at floor level and track the apparent solar path by rotating about horizontal axes. The reflectors concentrate the direct solar radiation on an absorber fixed some meters above the floor.

The absorber is a linear tower with a cavity in its lower face. This kind of concentrator must orient each mirror to reflect the sun's rays onto a concentrator located at a height of 10 meters. The mirrors have to turn following the sun's path to hold the reflected rays on the absorber. The mirror located under the absorber must be at 45 degrees East at sunrise and 45 degrees West at sunset, so it has to sweep 90 degrees during daytime.

In order to establish the mirror's speed we must compute sunrise and sunset times for each day of the year. Sunrise time and daytime duration also vary with the seasons, so to track the sun the mirror must always start its movement at 45 degrees, but at a different clock time every day, and its speed will also differ depending on the day of the year. Mirrors located at other positions must start moving at the same time but from different angles.

The concentrator is located at a height of 10 m and is 0.4 meters wide, implying that the sunbeams must be concentrated within 2.29 degrees. This precision of two degrees at the concentrator implies one degree of precision for the mirror movement. Mirrors must be rotated following the sun's path to keep the reflected rays on the absorber: they should be placed below 45 degrees east at sunrise and finish below 45 degrees west at sunset. This means a run of 90 degrees over the whole day.

Fig. 1. General Scheme.

2. DESCRIPTION OF MOTION

This device consists of a set of mirrors and an absorber located 10 meters above it, both of them North-South oriented. The mirrors should concentrate sunlight on the absorber. The absorber is considered to be exactly above the mirrors. The rays of the rising sun are parallel to the surface of the earth, so the mirrors should be at 45 degrees; at solar noon they should be vertical, and in the evening they should be at an angle of −45 degrees, because at sunset the sunbeams are again parallel to the horizontal. The mirrors must track the sun from sunrise to sunset; these times depend on the position of the sun for each day of the year [1].

Solar declination is given by (1)
$$ d = 23.45 \sin\left(360\,\frac{284 + n}{365}\right) \qquad (1) $$

Solar time does not coincide with local clock time. Converting standard time to solar time takes two corrections. First, there is a constant correction for the difference between the observer's longitude and the standard longitude of the country. The second correction comes from the equation of time, which takes into account the perturbations in the rotation of the earth, as shown in (2):

$$ \text{Solar time} - \text{Standard time} = 4\,(L_{st} - L_{loc}) + E \qquad (2) $$

where E is given by (3) and B is given by (4):

$$ E = 229.2\,(0.000075 + 0.001868 \cos B - 0.032077 \sin B - 0.014615 \cos 2B - 0.04089 \sin 2B) \qquad (3) $$

$$ B = (n - 1)\,\frac{360}{365} \qquad (4) $$

where n is the day number, L_st is the standard longitude of the country and L_loc is the longitude of the place in question [2].

To protect the mirrors at night they must be placed facing down, so the device must travel 135 degrees (7500 steps). To go back to the start position at sunrise it must return 12500 steps. The speed of this movement is limited by the motor's maximum speed and the system's inertia. It was experimentally determined that 100 Hz pulses give fault-free motor operation, so the time needed to put the system in repose mode is about one minute. Later, at sunrise, the system is relocated in about two minutes.

3. SYSTEM CONTROL

The system has a central control, a communications network and one controller for each mirror. The central control performs the more complex calculations, such as sunrise time, sunset time and day duration, and sends them to the set of controllers. It also verifies controller operation, updates the system time and orders the system into protection mode in bad weather. This central control was implemented with LabVIEW running on an embedded PC (PXI8155) sending data through a serial port [3]. The communications subroutine is shown in Fig. 2.

Fig. 2. LabVIEW communications subroutine.

A simple three-byte protocol was defined for control orders, containing the instruction in the first byte and data in the others. The instruction byte is composed of 0xF in the high nibble and the instruction code in the low nibble. The instruction set is shown in Table 1.

Table 1. Protocol codes.
Order            Code      Operation
Start            11110000  Start daily controller routine
Time update      11110001  Set controller time
Time check       11110010  Check controller's time
Position check   11110011  Check controller's position
Position change  11110100  Order position change
Save             11110101  Save
Relocate         11110110  Relocate
Echo order       11110111  Echo request
Status order     11111000  Status request
Time btw steps   11111001  Set time between steps
Id request       11111010  Request controller identification

The controller located at each mirror drives its movement according to the orders received from the central control and the position sensor data. The controllers are independent of one another. IEEE 802.15.4 (ZigBee) modules are used for RF communications, working in the 2.4 GHz band [4]. A module configured as Coordinator is connected to the PC serial port, and one module configured as End Device is connected to each controller.

3.1. Mirror Control

The controller stage uses a Xilinx FPGA with PicoBlaze, an embedded soft processor that performs the overall control. The tasks required by the controller are implemented on the FPGA, including motor control, a real-time clock, analog-to-digital converters for the sensors and a UART to drive the ZigBee module. As those devices are connected to PicoBlaze's input/output ports, it can access them through configuration registers.

Motor control is implemented as a state machine that compares the position register data with the internal current position and sets the direction of movement and the number of steps. This control has two control registers, CE to enable/disable and Reset to clear; one input register, Position, to set the desired position; and one output register with the current position.

The RTC module has three input registers to set the system time and a control register with CE (enable/disable), Reset (clear RTC) and Act (update time). Three output registers show the current time. The UART and ADC used are the ones proposed by the manufacturer, with all the proposed registers implemented. These registers are connected to the PicoBlaze processor through two multiplexers on its input and output ports [5].

Fig. 3. Controller System.

3.2. Engine Control

Engine control is implemented as a state machine that compares the position register with the actual position register. If the difference is zero it performs no action. If the difference is greater than zero it increases the current position register and increases the position register that is connected to the output decoder, which generates the control signals for the stepper motor. When the difference is less than zero the movement must be in the opposite direction and the current position register is decreased.

Fig. 4. Engine Control.

3.3. Sensors

To verify proper motion of the system, two LDR sensors are placed looking at the edge of the mirrors. When the sun rays do not impinge on the absorber, one of the sensors is illuminated, indicating that the system is out of focus and in which direction the blur lies. The sensor circuit consists of a resistor in series with the LDR, and this signal is connected to the ADC input.

3.4. PicoBlaze Software

The PicoBlaze processor receives orders from the central control through the implemented UART and enters a menu that determines the actions to follow [7]. The processor uses the time between engine steps to increase the position register of the stepper motor, waits a second and checks whether the position sensors are illuminated. If so, the sensors indicate in which direction the engine should move, so the processor moves the stepper motor, waits 0.1 seconds and repeats this routine up to ten times. This program follows the sun's path using the sensors when the day is clear and, when the day is cloudy, roughly, by assuming that the sun moves at constant speed.

3.5. ZigBee Considerations

As a first approach, broadcast ZigBee transmissions were used in order to simplify frame composition [8]. The results were satisfactory for early tests, but for a complete system the broadcast delays are unacceptable and unicast ZigBee transmissions are needed. This implies composing the frames with the individual End Device ZigBee addresses at the Coordinator level. As the central control is implemented in LabVIEW, the corresponding libraries were designed to allow unicast management and obtain the desired performance.
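As an illustration of the three-byte order format described in Section 3 and Table 1, the following Python sketch encodes and decodes one order. It is only a minimal sketch under stated assumptions: the names `OPCODES`, `encode_order` and `decode_order` are hypothetical, and the layout of the two data bytes (one 16-bit unsigned value, most significant byte first) is an assumption, since the paper only states that data follows the instruction byte.

```python
# Low-nibble instruction codes taken from Table 1 (the high nibble is always 0xF).
OPCODES = {
    "start": 0x0,
    "time_update": 0x1,
    "time_check": 0x2,
    "position_check": 0x3,
    "position_change": 0x4,
    "save": 0x5,
    "relocate": 0x6,
    "echo": 0x7,
    "status": 0x8,
    "time_between_steps": 0x9,
    "id_request": 0xA,
}
NAMES = {code: name for name, code in OPCODES.items()}


def encode_order(name, data=0):
    """Build a three-byte order: instruction byte followed by two data bytes.

    Assumption: the two data bytes carry one 16-bit unsigned value,
    most significant byte first. The paper does not specify this layout.
    """
    if not 0 <= data <= 0xFFFF:
        raise ValueError("data must fit in two bytes")
    return bytes([0xF0 | OPCODES[name], data >> 8, data & 0xFF])


def decode_order(frame):
    """Split a three-byte order back into (name, data)."""
    if len(frame) != 3 or frame[0] >> 4 != 0xF:
        raise ValueError("not a valid order frame")
    return NAMES[frame[0] & 0x0F], (frame[1] << 8) | frame[2]


if __name__ == "__main__":
    # Order a position change to step 7500 (the night-protection travel).
    frame = encode_order("position_change", 7500)
    print(frame.hex())
    print(decode_order(frame))
```

Keeping the high nibble fixed at 0xF gives the receiver a cheap validity check on the first byte, which is why `decode_order` rejects any frame whose first byte does not start with it.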
Fig. 5. Mirror with controller.

4. RESULTS

The tested system consists of a PC with a PXI8255 embedded data acquisition board, a USB-RS232 module, two X-Bee modules and a Spartan 3A FPGA, all of them connected to a power interface and a stepper motor. A real-scale mirror was built, as shown in Fig. 5, and different motors and gearboxes were tested to optimize the system.

The described system went through various stages. In the first stage the embedded PC algorithms were developed and tested; then the stepper motor control, the PicoBlaze control on the driver and finally the I2C and ZigBee networks were tried. Different control strategies were also tested.

A sensor-based control strategy was tested; the system experienced problems in cloudy conditions and was slow to refocus. Using only the sun's motion equations, the system is out of focus at noon (maximum radiation) and remained focused for the rest of the day. Using a combined control strategy, as described in this paper, the system does not experience problems at noon or on a cloudy day.

5. CONCLUSION

The use of an FPGA allows quick reconfiguration of the control system using a minimum of discrete components for optimization, and PicoBlaze simplifies the control program. The developed system works acceptably, and further development is to share information between mirrors to reduce the temperature of the absorber if necessary, for example by blurring a mirror.

6. REFERENCES

[1] Y. J. Huang, B. C. Wu, C. Y. Chen, C. H. Chang, and T. C. Kuo, "Solar Tracking Fuzzy Control System Design using FPGA," IAENG Proceedings of the World Congress on Engineering 2009, WCE London, vol. I, Jul. 2009.
[2] J. A. Duffie, W. A. Beckman, "Solar Engineering of Thermal Processes, Second Edition," John Wiley & Sons, Inc., 1980.
[3] National Instruments, "LabVIEW Data Acquisition Basic Manual," NI.com, 2000.
[4] C. Evans-Pughe, "ZigBee wireless standard," IEEE Review, vol. 49, iss. 3, pp. 28-31, Mar. 2003.
[5] Xilinx, "PicoBlaze 8-bit Embedded Microcontroller User Guide for Spartan-3, Spartan-6, Virtex-5, and Virtex-6 FPGAs," xilinx.com, Jan. 2010.
[6] J. Logue, "Virtex Analog to Digital Converter," Xilinx XAPP 155, Sep. 1999.
[7] D. Antonio-Torres, D. Villanueva-Perez, E. Sanchez-Canepa, "PicoBlaze-Based Embedded System for Monitoring Applications," CONIELECOMP 2009, pp. 173-177, Feb. 2009.
[8] Shahin Farahani, "ZigBee Wireless Networks and Transceivers," Elsevier, pp. 16-23, 47-78, 2008.
[9] J.A. Beltran, J.L.S. Gonzalez Rubio, C.D. Garcia-Beltran, "Solar Tracker (seguidor solar de lazo abierto con look up table de datos precalculados)," Design, Manufacturing and Performance Test of a Solar Tracker Made by a Embedded Control, CERMA, pp. 129-134, Sep. 2007.
TOOLBOX NURBS AND VISUALIZATION SYSTEM VIA FPGA

Luiz Marcelo Chiesse da Silva*
Electrical Engineering Department
Federal University of Technology – Paraná
Cornélio Procópio – Paraná – Brazil
[email protected]

Maria Stela Veludo de Paiva
Electrical Engineering Department
USP – University of São Paulo
São Carlos – São Paulo – Brazil
[email protected]

*sponsored by Teacher Qualification Institutional Program – Coordination for the Improvement of Higher Level Personnel (Capes).

ABSTRACT

NURBS curves and surfaces are widely used in CAD/CAM, reverse engineering and rapid prototyping systems to represent almost any shape adequately and compactly. For this reason, NURBS is included in standards and implemented in CAD/CAM systems and graphics processing units (GPUs). Data acquisition and manufacturing systems make use of NURBS implemented in embedded systems, and work using FPGAs is still incipient as an alternative to other technologies based on microcontrollers or dedicated integrated circuits. This work proposes the implementation of a NURBS interpolation and visualization SoC (System on a Chip) using an FPGA, aiming to implement an embedded system for manufacturing and CAM systems for tool positioning and toolpath simulation.

1. INTRODUCTION

Curve and surface interpolation is a fundamental task in graphic systems such as CAD/CAM, where the resolution between the design and the manufacturing systems should be adjusted [1]. When a set of points is given or received from a graphical unit and it is essential to fit this set with a curve or straight lines coincident with the given points, interpolation is performed (if the curve need not pass through the points, the fitting is made by approximation) [2]. There are several methods for the interpolation of a set of points, ranging from simple and efficient triangulations [3] to modified methods using RBF [4] and others [5]. Among these methods, NURBS (Non-Uniform Rational B-Splines) are adopted in graphic standards such as IGES [6], STEP [7] and OpenGL [8] (PHIGS) for curve and surface representation between graphical systems. The main advantages of rational B-splines (invariance under affine transformations, for example) make them the most suitable choice for standardization, despite the lack of compression in the representation of conic sections [9], and they are also widely used in generic mathematical applications. The use of piecewise polynomials requires a minimal number of procedures, namely orderly parameterization, linear system solution and curve/surface fitting. Manufacturing systems, such as CNC, and 3D data acquisition systems make use of NURBS to provide greater efficiency, implemented in embedded systems [10] based on microcontrollers, DSPs or an application-specific integrated circuit. FPGA technology provides an all-in-one chip solution for data pre-processing and control in embedded systems, with an architecture specified by the system designer and reconfigurable logic capable of implementing a custom processor [11] for a specific task.

This work proposes a SoC in an FPGA system, with NURBS local curve and surface interpolation cores based on the fast Cox-de Boor implementation [12], and a basic graphic pipeline [13] for visualization purposes. Optionally, two cores are included for the generation of straight lines and circles [14]. To connect and synchronize the cores, a Wishbone-based bus [15] is used, following its conventions and being an open source logic bus.

The use of FPGAs in the area of video and image processing is consolidated [16]. Despite new technologies in this application area, such as ad hoc integrated circuits like the Cell processor [17-18], there is a gap in graphics processing that leads to applications making use of mixed technologies, like GPGPU and FPGAs.

2. NURBS

A NURBS curve of degree p is a piecewise polynomial curve defined as:

$$ C(u) = \sum_{i=0}^{n} w_i P_i N_{i,p}(u) \qquad (1) $$

where u is a parameter value, P_i form the so-called control polygon points, weighted by w_i, and N_{i,p}(u), i = 0,...,n, are the B-spline basis functions defined over a knot vector, where:

$$ U = \{u_0, \ldots, u_m\}, \quad u_i \le u_{i+1}, \quad i = 0, \ldots, m-1 \qquad (2a) $$
$$ N_{i,0}(u) = \begin{cases} 1 & \text{if } u_i \le u < u_{i+1} \\ 0 & \text{otherwise} \end{cases} \qquad (2b) $$

$$ N_{i,p}(u) = \frac{u - u_i}{u_{i+p} - u_i}\, N_{i,p-1}(u) + \frac{u_{i+p+1} - u}{u_{i+p+1} - u_{i+1}}\, N_{i+1,p-1}(u) \qquad (2c) $$

We assume throughout this paper that the knot vector has the following form:

$$ U = \{\underbrace{a, \ldots, a}_{p+1},\, u_{p+1}, \ldots, u_{m-p-1},\, \underbrace{b, \ldots, b}_{p+1}\} \qquad (3) $$

where, in most practical applications, a = 0 and b = 1.

A NURBS surface of degree (p, q) is defined similarly as:

$$ S(u,v) = \sum_{i=0}^{n} \sum_{j=0}^{m} w_{i,j} P_{i,j} N_{i,p}(u) N_{j,q}(v) \qquad (4) $$

where u and v are the parameter values in the longitudinal and isoparametric directions of surface construction, P_{i,j}, i = 0,...,n; j = 0,...,m, form the so-called control net defined by a set of points weighted by w_{i,j}, and the basis functions N_{i,p}(u), i = 0,...,n, and N_{j,q}(v), j = 0,...,m, are defined as above (the construction of N_{j,q}(v) is similar), over the knot vectors:

$$ U = \{u_0, \ldots, u_r\}, \quad u_i \le u_{i+1}, \quad i = 0, \ldots, r-1 \qquad (5a) $$

$$ V = \{v_0, \ldots, v_s\}, \quad v_j \le v_{j+1}, \quad j = 0, \ldots, s-1 \qquad (5b) $$

2.1. NURBS INTERPOLATION

NURBS interpolation can be divided into local and global interpolation methods. The first constructs a curve from rational segments (rational polynomials), or a surface from rational patches, such that the endpoints of each segment are the given data points. Neighbouring segments are joined with some level of continuity at the junctions, with the curve construction proceeding segment by segment. Global interpolation builds the curve as a whole, using all the given points in a matrix calculation, and the control points are obtained by inverting the matrix form of the NURBS equation (1). Both methods require the points to be interpolated (data points), the number of control points (segments), the knot vector and the parameter values. In figure 1, D_i are the interpolated data points, and the knots are calculated from the chord length between data points, given by the modulus of the vectors q_i. Another method of interpolation is to satisfy the conditions given for the curve tangent vectors T_i at each data point D_i.

Fig. 1. Data points, junctions (knots), distance and tangent vectors in a NURBS curve.

For a given set of data points, the best method to set up the knots is to calculate initial parameter values given by the chord length method:

$$ t_0 = 0 \qquad (6a) $$

$$ t_k = \frac{1}{L} \sum_{i=1}^{k} \left| D_i - D_{i-1} \right| \qquad (6b) $$

$$ t_n = 1 \qquad (6c) $$

where t is the parameter value, L the total chord length between the data points, |D_i − D_{i−1}| the chord length between two adjacent data points, k the parameter index and n the number of data points. For the knot vector, the averaging technique is recommended:

$$ u_0 = \ldots = u_p = 0; \quad u_{m-p} = \ldots = u_m = 1 \qquad (7a) $$

$$ u_{j+p} = \frac{1}{p} \sum_{i=j}^{j+p-1} t_i; \quad j = 1, \ldots, n-p \qquad (7b) $$

where t is the parameter in equation (6b). The values u_0 to u_p and u_{m−p} to u_m reflect the knot multiplicities required for the spline beginning and end conditions. The local control property, given by the support region of the basis functions, which restricts the influence of each basis function to a limited number of piecewise polynomials, allows local interpolation. Thus, many polynomial segments can be computed concurrently to generate the final curve, given the desired continuity at the polynomial junctions (knots), a feature exploited by the Cox-de Boor algorithm given in equation (8), where t is the parameter, u are the knots, P the control points for each layer (data points in the first layer), i the control point index in the layer, k the order of the polynomial segments and j the layer number.

$$ C_i^j(t) = \left(1 - \frac{t - u_i}{u_{i+k-j} - u_i}\right) P_{i-1}^{j-1} + \left(\frac{t - u_i}{u_{i+k-j} - u_i}\right) P_i^{j-1} \qquad (8) $$
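The layered recurrence of equation (8) can be sketched in software as follows. This is a minimal Python sketch and not the authors' hardware core: the function names `find_span` and `de_boor` are hypothetical, the control points are treated as plain (non-rational) coordinates, and it is assumed that weights, if needed, would be handled by passing homogeneous coordinates and dividing by the last component afterwards. The order k of equation (8) corresponds to `degree + 1` here.

```python
from bisect import bisect_right


def find_span(knots, degree, t):
    """Return the index s with knots[s] <= t < knots[s + 1].

    The result is clamped so that t equal to the last knot maps to the
    last non-empty span of a clamped knot vector.
    """
    last = len(knots) - degree - 2  # index of the last control point
    if t >= knots[last + 1]:
        return last
    return max(degree, bisect_right(knots, t) - 1)


def de_boor(knots, ctrl, degree, t):
    """Evaluate the B-spline curve point C(t) with the Cox-de Boor recurrence."""
    s = find_span(knots, degree, t)
    # Layer 0: the degree + 1 control points that influence this span.
    d = [list(ctrl[j + s - degree]) for j in range(degree + 1)]
    for layer in range(1, degree + 1):
        # Go from high to low index so d[j - 1] still holds the previous layer.
        for j in range(degree, layer - 1, -1):
            i = j + s - degree
            alpha = (t - knots[i]) / (knots[i + degree + 1 - layer] - knots[i])
            d[j] = [(1.0 - alpha) * a + alpha * b for a, b in zip(d[j - 1], d[j])]
    return tuple(d[degree])


if __name__ == "__main__":
    # Degree 2 curve on a clamped knot vector with no interior knots.
    knots = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    ctrl = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
    print(de_boor(knots, ctrl, 2, 0.5))
```

With this knot vector the curve reduces to a quadratic Bézier segment, so the call above evaluates to the point (1.0, 1.0). Within one layer each new point depends only on two points of the previous layer, which is the property that allows the points of a layer to be computed in parallel.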
Layer 0 Layer 1 Layer 2 Ve rtex R e ad er

P00 V ie w p ort Transfo rm atio n

P01 Prim itive D raw e r


P10 P02 = C(u)
Vide o M em ory W riter
P11
P20 Fig. 3. Basic 3D visualization pipeline.

video DAC. This pipeline could be used for the 2D or 3D


Fig. 2. Cox-de Boor algorithm control point layers. view modifing the viewport transformation stage. Figures 4
and 5 shows examples of NURBS curve and surface
The control point in a b-spline curve is the convex obtained from the video memory system.
combination of another two control points in the previous
layer, as illustrated in figure 2, and its influence in the
4. FPGA IMPLEMENTATION
curvature is given by the term in brackets in equation (8). If
a control point and the respective knot are repeated, the
Given the set of data points to be interpolated, the initial
curvature tends towards its position, until the curve pass
parameters for the knots are calculated by equations (7a-
through the point, resulting in the interpolation.
07c) using the city block distance, since the chord length is
relative between each adjacent point, and the knot vector is
3. GRAPHIC PIPELINE calculated by the equation (8a-8b). For each parameter t is
given a core that recursively calculates the point C in the
A basic visualization pipeline is used, and by the sequential curve by the Cox-de Boor algorithm, with the recurrence
nature, this process is divided in serial stages, whose for the interpolation of the data points. Figure 4 shows a
number could vary between implementations, but follows simplified diagram with the cores for the chord length and
the general arrangement given by fig 3. In computer the averaging techniques for the knot vector generation,
graphics systems, the last three stages are managed by a excluding the enabling gates from the bus signals, that
API – Application Programming Interface. increments in two clock cicles each core processing. In
The vertex reader reads 3D data obtained from a "cloud of points", in the form of triples giving the Euclidean coordinates of the points. This data is stored in a RAM memory and, according to the needs of the primitive drawer cores, the required points are buffered in the FPGA. In the viewport transformation stage, the computation of each transformation matrix could be parallelized, but the transformations are serialized with respect to each other. The primitive drawer includes the NURBS, straight line and circle generation cores. The video memory is implemented in built-in RAM, in order to support different display resolutions and faster data transfer, with the scanning to a D/A video converter executed independently of the remaining systems.

The vertex reader is the data input of the pipeline, obtaining the data in the form of coordinates in Euclidean space from another built-in memory of the FPGA. The data could be inserted during the synthesis process, mapped from a graphical user interface to a memory initialization file. The primitive drawer is responsible for the effective data processing that originates the graphics, while the viewport transformation maps the data to fit the video memory. The video memory writer sends the data in 10-bit size for the XGA video resolution (1024 columns by 768 lines), generating the timing synchronization signals for the

figure 5 there is an example for the generation of one point by the Cox-de Boor algorithm for a degree 3 NURBS, resulting in 3 layers of cores. The cores between dashed lines are parallelized, showing that the degree determines the number of layers and the number of clock cycles needed to generate the point.

5. CUDA GPU SYSTEM

Graphics Processing Units were created with multicore processing capability, but the first ones were made for computer graphics processing applications only. Currently, they have an architecture made for multidata processing, open to programmers and suitable for high performance computing, so that manufacturers now offer GPGPUs with more than 500 cores devoted to general purpose applications. The use of the GPU here is based on the CUDA implementation of NURBS, which consists of multithreaded processing of the Cox-de Boor and knot addition algorithms to generate each parameter in the curve/surface, regardless of the dedicated circuits on the board. The performance is compared with that of the FPGA implementation, with up to 16 cores, the code being written in the C language. The CUDA implementation follows these items:

Fig. 4. Chord length and averaging cores for the knot vector generation.

[Figure 5 diagram: control points d0, d1, d2 are blended in successive layers (labelled layer 0, layer 1, layer 2) into P01, P11 and finally P02 = C(0).]

Fig. 5. Cores for the generation of one point of a degree 3 NURBS curve, with p = 3 layers.
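As an illustration of the layered evaluation shown in Fig. 5, the following is a minimal Python sketch of de Boor's algorithm for a non-rational B-spline (a NURBS evaluation would apply the same recurrence to weighted, homogeneous control points). It is not the authors' FPGA core or CUDA kernel; the function name `de_boor` and its arguments are assumptions made for illustration. Each pass of the outer loop corresponds to one layer, and the blends inside a layer depend only on the previous layer, which is what allows them to be computed in parallel.

```python
def de_boor(k, t, knots, ctrl, p):
    """Evaluate one curve point C(t) of a degree-p B-spline.

    k     : knot span index, with knots[k] <= t < knots[k + 1]
    knots : non-decreasing knot vector
    ctrl  : control points as coordinate tuples
    """
    # Layer input: the p + 1 control points that influence span k.
    d = [tuple(ctrl[j + k - p]) for j in range(p + 1)]
    for r in range(1, p + 1):          # one layer per pass, p layers in total
        nxt = list(d)                  # separate buffer: a layer only reads the previous one
        for j in range(r, p + 1):      # blends within a layer are independent
            i = j + k - p
            alpha = (t - knots[i]) / (knots[i + p - r + 1] - knots[i])
            nxt[j] = tuple((1 - alpha) * a + alpha * b
                           for a, b in zip(d[j - 1], d[j]))
        d = nxt
    return d[p]


# Cubic example: with this clamped knot vector the curve is a Bezier segment.
knots = [0, 0, 0, 0, 1, 1, 1, 1]
ctrl = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
point = de_boor(3, 0.5, knots, ctrl, 3)
```

For this illustrative input, `point` is (2.0, 1.5), the midpoint of the corresponding cubic Bezier segment, obtained after p = 3 layers.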

- dividing the task into blocks, each of them consisting of threads, with 8 to 32 threads defined for each block;
- passing the serial processing to the computer processor;
- 16-bit data size (compatible with the FPGA system).

The GPU used has 16 multicores with 4 processors each (enabling at most 64 processors). The NURBS is implemented by multithreading each layer of the Cox-de Boor algorithm, just like the FPGA implementation, within the hardware limitation.

6. RESULTS

The NURBS local interpolation algorithm and the visualization system for FPGAs are compared with a single GPU, comparing the number of clock cycles, separated
The GPU processing still results in a more efficient manner to deal with the data interpolation, being a dedicated circuit designed to optimally perform graphic functions. The cores are synthesized to perform functionality similar to that of the GPU, with 16 simultaneous threads and an independent clock counter synchronized with the beginning and end of the process.

7. CONCLUSIONS

The use of FPGAs in computer graphics is still incipient. This was confirmed here, despite the existence of circuits dedicated to this aim, and leads to the following items:
1. while the GPU is a highly specialized processor that can reach great performance (for a specific subset of problems), currently most GPUs are less suitable than FPGAs for embedded applications due to their power dissipation, sometimes requiring more cooling than computer processors;

2. the GPU is limited by its built-in hardware and firmware, despite its multiprocessing power;
3. processors based on the traditional computer architecture are restricted by the demand for a higher clock frequency, giving rise to multicore processors. PLA technologies did not reach such high frequencies, but were developed with higher densities as well.

Currently, General Purpose GPUs (GPGPUs) provide more flexibility to the system designer, although they remain locked to the hardware architecture, while some operations, such as fixed-point operations, are done efficiently in FPGAs. Future work is planned to combine reconfigurable systems with GPGPUs.

Table 1. Total clock cycles for 32 interpolation points and 100 parameters (p is the NURBS degree).
                            CPU (vectorized code)   GPU (CUDA)   FPGA
NURBS interpolation         47p(p+1)                16p          8p
Visualization pipeline      12p+2                   4p           10p+20
Knot addition (16 knots)    10p   16p

Table 2. Cores and respective number of logic elements.
Core                                      LEs
Vertex reader                             57
Viewport transformation                   110
Primitive drawer: NURBS interpolation     148
Primitive drawer: Straight line           24
Primitive drawer: Circle                  48
Video memory writer                       62
Wishbone                                  54

Fig. 6. NURBS curve (100 points, degree 3) produced by the visualization system, control points (square dots) and control polygon - (a) integer 16 bits, (b) single precision.

Fig. 7. NURBS Cube surface with 160x110 points, made by the system from 16x11 control points.

8. REFERENCES

[1] M. C. Tsai, C. W. Cheng, M. Y. Cheng, "A real-time NURBS surface interpolator for precision three-axis CNC machining," International Journal of Machine Tools & Manufacture, vol. 43, no. 12, pp. 1217-1227, May 2003.
[2] L. Piegl, W. Tiller, "The NURBS Book," 2nd ed., Springer, New York, 1997.
[3] S. Mann, M. Lounsbery, C. Loop, D. Meyers, J. Painter, T. DeRose, K. Sloan, "A Survey of Parametric Scattered Data Fitting Using Triangular Interpolants," Curve and Surface Design, chapter 8, H. Hagen, SIAM, 1992.
[4] J. C. Carr, R. K. Beatson, J. B. Cherrie, T. J. Mitchell, W. R. Fright, B. C. McCallum, T. R. Evans, "Reconstruction and Representation of 3D Objects with Radial Basis Functions," Proc. 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 67-76, 2001.
[5] S. K. Lodha, R. Franke, "Scattered Data Techniques for Surfaces," Proc. Conference on Scientific Visualization, pp. 181, 1997.
[6] IGES/PDES Organization, "Initial Graphics Exchange Specification - IGES 5.3," ANSI 1996, U. S. Product Data Association, Sep. 1996.
[7] ISO 10303, "Industrial automation systems and integration - Product data representation and exchange, multipart standard," International Organization for Standardization (ISO).
[8] OpenGL Architecture Review Board, D. Shreiner, M. Woo, J. Neider, T. Davis, "The OpenGL Programming Guide: The Official Guide to Learning OpenGL - Version 2.1," 6th ed., Addison-Wesley Professional, Boston, Massachusetts, 2008.
[9] L. Piegl, "On NURBS: A Survey," IEEE Computer Graphics & Applications, pp. 55-71, Jan. 1991.
[10] P. Orenstein, "High-Speed CAM of 3-D Sculpted Surfaces," Time Compression Magazine, Mar. 2002.
[11] H. Styles, W. Luk, "Customising graphics applications: techniques and programming interface," IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 77-87, Apr. 2000.
[12] H. T. Yau, M. T. Lin, M. S. Tsai, "Real-Time NURBS interpolation using FPGA for high speed motion control," Computer-Aided Design, no. 38, pp. 1123-1133, 2006.
[13] N. Knutsson, "An FPGA-based 3D Graphics System," Master's thesis in Electronics Systems, Linkoping Institute of Technology, 2005.
[14] J. E. Bresenham, "Algorithm for Computer Control of a Digital Plotter," IBM Systems Journal, vol. 4(1), pp. 25-30, 1965.
[15] OpenCores Organization, "WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores," Revision: B.3, 2002.
[16] B. Cope, P. Y. K. Cheung, W. Luk, S. Witt, "Have GPUs made FPGAs redundant in the field of Video Processing?," Proc. IEEE International Conference on Field-Programmable Technology, pp. 111-118, Dec. 2005.
[17] M. L. Stokes, "A Brief Look at FPGAs, GPUs and Cell Processors," ITEA Journal, pp. 09-11, Jun./Jul. 2007.
[18] L. W. Howes, P. Price, O. Mencer, O. Beckmann, "FPGAs, GPUs and the PS2 - A Single Programming Methodology," 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 313-314, Apr. 2006.

A METHODOLOGY FOR THE DEVELOPMENT OF HIGH-PERFORMANCE SYSTEMS ON CHIP

Marcos J. Oviedo
Facultad de Ingeniería
Instituto Universitario Aeronáutico
Córdoba, AR
email: [email protected]

Pablo A. Ferreyra
Facultad de Ciencias Exactas, Físicas y Naturales
Universidad Nacional de Córdoba
Posgrado en Sistemas Embebidos
Instituto Universitario Aeronáutico
Córdoba, AR
email: [email protected]

ABSTRACT

High-performance data processing has become a challenge for the technology of current embedded systems. This work describes a methodology for designing high-performance embedded systems through the use of hardware accelerators implemented in programmable logic. As a proof of concept, the TripleDES symmetric encryption algorithm is implemented in a high-performance embedded system.

1. INTRODUCTION

Advances in the semiconductor industry have made it possible to implement complex digital systems using programmable logic components. A programmable system on chip (SoC) is a digital system that implements, on a single silicon die, an embedded system, application-specific devices, and application and control software.

One of the most powerful concepts behind SoC design is that the system functionality can be specified and assigned not only to the software running on the processor, but also to the hardware components that make up the system. This makes it possible, for processing acceleration purposes, to implement high-performance processing units in charge of performing certain types of computational operations optimally, and therefore faster than the processor of the embedded system, thus helping to increase system performance.

Due to the growing technological demand of modern society, the embedded systems that make up current electronic equipment must be able to evolve constantly in order to support the growing demand for processing capacity to which they are subjected. To address this, there are alternatives to traditional embedded systems, such as the development of embedded systems in a high-performance SoC (HPSoC).

In an HPSoC, the hardware architecture is optimized so that its computing platform works cooperatively with hardware accelerators. In the course of developing an HPSoC, its functionalities are first defined in software, and part of them is later moved to hardware. Hardware implementation through programmable logic is a valid alternative for achieving substantial performance increases, given the high level of parallelism that can be obtained with a programmable logic platform.

This work presents, in theoretical form, a methodology for developing an HPSoC, and concludes with a performance comparison between the results obtained from two alternative implementations of the TripleDES symmetric encryption algorithm in an HPSoC. The implementation was carried out on a programmable logic platform containing a Xilinx Virtex 4 FPGA.

This paper is organized as follows. Section 2 presents theoretical information on the limitations of processor-based systems that motivate this work. Section 3 presents the proposed working methodology for developing this type of digital system. Section 4 presents a level-by-level view of the techniques and mechanisms available for increasing the performance of an HPSoC. Section 5 presents two HPSoC implementations that provide cryptographic accelerators developed following the proposed methodology. Section 6 shows and compares the results obtained. Finally, Section 7 summarizes the results obtained and presents the conclusions of this work.

2. LIMITATIONS OF PROCESSOR-BASED SYSTEMS

Processors were conceived to perform general-purpose computing. This design decision means that processors are not efficient at specific computing tasks and, therefore, cannot deliver the processing performance demanded by some current embedded systems.

Following Moore's law, alternatives have been sought over the years to improve the performance of processor-based computing platforms. However, these alternatives have been neither efficient nor applicable in many scenarios where performance was a requirement. As stated in [1], this is mainly because there are physical limitations inherent to processors that, in many cases and given current technology, prevent these alternatives from being applied arbitrarily:

- Increasing the number of transistors and the frequency at which they operate introduces serious heat dissipation problems (power wall).

- The frequency cannot be increased arbitrarily, not only because of the power wall, but also because of an inherent physical limitation in the switching times of the transistors used in the microprocessor design (frequency wall).

- In a current computing system, the bandwidth of the microprocessor is generally 70 times higher than that of external memory, which turns memory access into a bottleneck. The use of complex hierarchies of memories local to the microprocessor (caches) considerably reduces data access time, but since it is technologically impossible to increase cache size arbitrarily, memory access remains a real problem (memory wall).

- Finally, processors themselves have a fundamental limitation: a design based on serial execution, which makes it extremely difficult to extract parallelism from an instruction execution stream. As mentioned in [2], there are complex designs and techniques in current processor architectures that attempt to extract instruction-level parallelism and mitigate this limitation.

The use of programmable logic and the realization of an HPSoC is a valid alternative for achieving substantial performance increases in systems where performance is the main requirement. This work demonstrates the implementation of an HPSoC that makes it possible to create a specific, application-oriented computing platform, in order to optimize the data execution path (by extracting and implementing parallelism), optimize memory usage (increasing locality and access), reduce power dissipation (specific hardware requires fewer transistors) and reduce the working frequency (possible because multiple operations are performed in each clock cycle).

3. HPSOC DEVELOPMENT METHODOLOGY

In the proposed methodology, the design of an HPSoC consists of two separate areas that nevertheless require interaction between them. One of these areas is the creation of the support needed to implement a microprocessor-based embedded system in the FPGA, and the other is the optimization of the application with a view to a later implementation based on a hardware-software codesign. In this codesign, the hardware component is implemented as an accelerator component that communicates with the software component through the embedded design.

First, the embedded system is developed using EDA tools that make it possible to interconnect, through a hierarchy of interconnection buses, a microprocessor (which can be a soft core or a hard core, as mentioned in [3]) with a set of devices that turn the embedded system into a functional computing platform. Embedded system development is not standardized and varies depending on the FPGA vendor used. In this work, Xilinx FPGAs were used, so the Xilinx development ecosystem was used to implement the embedded system. This consisted of the EDK and ISE tools and the XilinxProcessorIPLib hardware component libraries.

Second, work must be done on the application to be optimized. For this, a software prototype of the application or algorithm to be implemented in the HPSoC must be built. This prototype is then characterized and evaluated with tools such as profilers and code analyzers, in order to detect the segments or areas where most of the processing takes place (performance-critical sections). With this information, and following a top-down approach, the algorithm that defines the application is studied in order to refactor it so that the critical sections can be optimized and isolated for implementation in hardware. Implementing the critical sections of the application in hardware allows the computational operations to be represented in hardware

description languages and, through a level-based optimization strategy, makes it possible to implement hardware accelerator components, that is, specific processing hardware that performs highly efficient, high-performance computation.

To develop the accelerator component, a hardware description language such as VHDL or an ESL tool such as ImpulseC [4] can be used. The accelerator component must also be integrated into the embedded design of the HPSoC, so a high-speed communication channel between hardware and software must be developed as well.

It is also worth mentioning that the software component of the HW/SW codesign can run directly on the processor or under the control and support of an operating system (as one more user-space application). Given the benefits an operating system provides, our methodology provides operating system support for the software component.

Once these two parts of the HPSoC are complete, the hardware design has to be translated and mapped onto the fabric of an FPGA, and the binary images of the corresponding software have to be stored in the corresponding memories for later evaluation.

4. PERFORMANCE OPTIMIZATION IN AN HPSOC DESIGN

There are several factors that can be modified and techniques that can be applied in the architecture of an HPSoC in order to increase its overall performance. These factors can be grouped into three areas, which we will call optimization levels.

4.1. System-level optimizations

We define the system as the physical platform on which the application is implemented. System-level optimizations are tied to the way applications can be implemented on this platform, and to the modifications that can be made to it so that applications run faster and throughput is higher.

The trivial optimization is to modify the hardware design of the computing platform so that its components run at the maximum admissible speed. In addition, it is best to establish high-speed communication channels between the components most frequently used by the processor, for example the memory banks or the communication with the FPGA fabric. The use of memory caches (preferably block RAM) can increase data locality and thus improve performance.

Likewise, the synthesis tools that synthesize the hardware design make it possible to configure timing constraints and raise the effort level, in order to improve performance by reducing data propagation time through the hardware.

Furthermore, at the system level, data processing can be parallelized at the accelerator component level. Provided the algorithm allows it, that is, provided it works with data sets that are independent of one another, and provided enough resources are available in the FPGA used, more than one accelerator component can be implemented, so that several data sets are processed in parallel.

4.2. Application-level optimizations

We define the application as the computational algorithm that meets a certain set of requirements in order to implement, in software or hardware, the main functionality of the HPSoC.

The goal of application-level optimizations is to study the algorithm that defines the performance-critical operations of the application, in order to detect its inherent parallelism and optimize the processing.

It should be noted that the optimizations are applied to the high-level details of the algorithm, not to the low-level details that define its implementation. Thus, if possible parallelizations are detected, and always in compliance with the initial functional requirements, the necessary modifications are made to the algorithm code so that it abandons its serial execution flow and adopts a parallel model of operation.

Besides detecting levels of parallelization and optimizations in the execution flow of the algorithm, another interesting technique for optimizing performance at the application level is data precomputation. This basically consists of narrowing the range of action of the algorithm by making assumptions about its working space, in order to precompute and simplify its operations and thereby speed up its execution.

4.3. Microarchitecture-level optimizations

We define the microarchitecture as the programmable logic components that implement the low-level details of the algorithm defining the application that will run on the HPSoC. Some techniques for improving the performance of the accelerator component's microarchitecture are the following:

1) Replicating the arrays or memory banks that hold the data: One of the most important advantages of programming in hardware is the

ability to access multiple memory banks in a single clock cycle. Unlike a software implementation, in which a CPU is connected to one or more physical memory devices always through a single bus, a hardware implementation offers the flexibility to generate an arbitrary connection topology, in which a set of operations can access data distributed across several memory banks in a single clock cycle. For this reason, an important factor to keep in mind is that, to achieve optimal results, the data set must be replicated in different memory banks. This yields separate memory banks, each with its own read/write port, which can be accessed in parallel for subsequent operation/processing.

2) Operations on loops: In an algorithm, loops are one of the constructs with a high degree of inherent parallelism and are therefore one of the constructs targeted for optimization. Loops generally perform repetitive operations on a data set. If each of the loop operations does not depend on data computed in previous iterations, that is, if each iteration can operate on independent data sets, the degree of parallelism that can be obtained is high. There are two techniques for optimizing operations on loops: loop unrolling and the generation of "assembly lines", better known as pipelines. Loop unrolling consists of expanding the set of iterations covered by the loop and rearranging the algorithm so that they can be performed in parallel in a single iteration of the loop. Pipelining consists of dividing the work to be processed into subtasks, so that, as the data to be processed comes in, each subtask concurrently processes a different data set. Thus, if each loop iteration requires executing N subtasks, in an implementation without a pipeline the loop performs (N * number_of_data_elements) steps to complete its work. In a pipelined implementation, in contrast, all the data is processed in about (N + number_of_data_elements - 1) steps, because the subtasks overlap. The theory of pipelines and loop unrolling is extensively developed in [5] and [6].

5. PROOF OF CONCEPT: IMPLEMENTATION OF A CRYPTOGRAPHIC HPSOC

In order to evaluate the improvements obtained by implementing a hardware-accelerated HPSoC, a prototype SoC without acceleration (software-only implementation) and two versions of an HPSoC with hardware acceleration were developed. In the latter two cases, the accelerator component was developed in VHDL and with the high-level synthesis tool ImpulseC, respectively.

The cryptographic application consisted of fetching a data set from memory and encrypting it with the TripleDES symmetric encryption algorithm. TripleDES was used in ECB mode, following the guidelines given in [7] and [8]. Following the methodology proposed in this work, after designing and implementing the embedded system in the SoC, a non-optimized prototype of the algorithm was developed in software. This prototype was used to study and characterize the algorithm. With the data obtained, and evaluating the optimization techniques listed in [9], the accelerator components were developed and the optimization levels described above were applied.

The prototypes were implemented on the FX12 Minimodule development kit, supplied by Avnet, which includes a Virtex4 FX12 FPGA and several external components described on the manufacturer's website. The block diagram of the HPSoC developed is shown in figure 1.

5.1. Development of the embedded system of the SoC

During the development of the embedded design, support was provided for all the physical hardware devices of the development kit, using the hardware IP cores needed for the embedded system to operate.

The embedded design was implemented with the Xilinx EDK tool described in [10]. The processor chosen for the embedded design was a hardware resource of the selected FPGA, namely the PowerPC 405 (PPC) hard core. With this tool, an embedded system was developed that allowed the PPC processor to communicate with the external devices of the development kit, such as the RAM memory, the FLASH memory, the UART port and the Gigabit Ethernet PHY, and that implemented the components needed to turn the embedded system and its processor into a functional computing platform. The tool also made it possible to develop a high-speed communication channel between the accelerator component implemented in the programmable logic and the microprocessor of the embedded system.

The support needed for the PPC to communicate with the various devices through the PLB, OPB, FCB, OCM and DCR buses was generated. These buses belong to the CoreConnect bus family and are described in [10]. It should be noted that this processor only supports direct connection to the PLB, OCM, DCR and FCB buses, so the devices behind the OPB bus are reached through a PLB2OPB bridge.

Once the bus architecture, its sizes and working frequencies, as well as the additional elements

needed for its correct operation were defined, the devices of the development kit to be supported were chosen, along with the way they would be connected to the bus hierarchy so that they are visible to the PPC processor. The configuration of each device is case-specific and depends on the design adopted, although in general it includes aspects such as the type of DMA it will use, its operating speed, the type of clock it will be connected to, the pins and nets it will be connected to, the number and type of interrupts it will generate, and the memory areas reserved in the system memory map for its registers. Each IP core in the XilinxProcessorIPLib hardware library has a datasheet detailing its possible configuration parameters.

[Figure 1 block diagram. Blocks: block RAM, Power PC 405 (I cache, D cache), FCB bus, HW accelerator component, PLB bus, OPB bus, two arbiters, bridge, and PHY, DDR, UART, GPIO and HWICAP controllers.]

Fig. 1. Block diagram of the HPSoC architecture developed.

The high-speed communication channel was implemented by means of the APU controller, a feature of the PPC processor described in [11]. The APU controller provides a flexible, high-speed communication interface between the FPGA fabric and the PPC processor. This interface connects the instruction pipeline of the PPC directly to one or more hardware accelerator components.

5.2. Development of the HPSoC control software support

The software component of the HPSoC application runs with operating system support. Linux was chosen as the supporting operating system. To this end, the Linux operating system was prepared to control the various hardware components of the embedded system.

A root filesystem was also generated with the applications needed to make the system functional. In addition, a mechanism was developed so that the kernel can be loaded into execution memory using XMD, a debugger provided by EDK that can download and run, over JTAG, ELF binaries compiled for the PPC processor. This allowed the kernel to run on the embedded system, initialize itself, detect the hardware it runs on, configure the network interfaces, autoconfigure its network address through DHCP and boot a remote root filesystem through NFS. Figure 2 shows the interaction of protocols during kernel boot.

Fig. 2. Query flow for booting the operating system.

The Linux version used is 2.6.20. Extensive modifications were made to the kernel code, using code provided by the manufacturer and ad hoc modifications.

In addition, to support the APU communication channel, bit 6 of the MSR (Machine-State Register) of the PPC processor had to be enabled. This register defines the operating state of the processor and must be configured at operating system initialization time. In this operating mode, the processor can use vector instructions to transmit data through the high-performance communication channel between hardware and software.

5.3. Development of the HPSoC application

As mentioned, three versions of the application were developed. One was developed entirely in software and served as the starting point for studying the encryption algorithm. From this prototype, two versions of accelerator components were developed: one using the ImpulseC tool and another developed in VHDL. The same optimizations, described below, were applied to both versions:

1) System-level optimizations: The system-level optimizations performed on the platform are listed below.

a) The clock speed of the embedded processor was increased to 200 MHz.
b) The speeds of the PLB and OPB buses were increased.
c) The speed of the accelerator component was increased.

d) The maximum effort level was used in synthesis: this was achieved by editing the file etc/fast_runtime.opt.
e) Data transfers were widened to use the maximum bandwidth provided by the APU channel, that is, 64 bits of data.

TABLE 1. IMPLEMENTATION RESULTS OF THE CRYPTOGRAPHIC HPSOC
HPSoC TripleDES
Implementation       Operating frequency   Throughput (userspace application)   Gain
Software             300 MHz               42.096 Kbps                          1X
Hardware ImpulseC    50 MHz                17.929 Mbps                          415X
Hardware VHDL        50 MHz                19.280 Mbps                          458X

TABLE 2. SYNTHESIS RESULTS OF THE CRYPTOGRAPHIC HPSOC
HPSoC with component developed in ImpulseC
Resource         Utilization         Percentage of use
BUFGs            11 out of 32        34%
DCM_ADVs         2 out of 4          50%
ILOGICs          29 out of 320       9%
External IOBs    73 out of 240       30%
LOCed IOBs       73 out of 73        100%
OLOGICs          54 out of 320       16%
PPC405_ADVs      1 out of 1          100%
RAMB16s          26 out of 36        72%
SLICES           5470 out of 5472    99%
SLICEMs          355 out of 2736     12%
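The gain column of Table 1 can be cross-checked from the listed throughputs. The following is a small Python sketch of that arithmetic, assuming that 1 Mbps equals 1000 Kbps and that gain is the plain ratio of throughputs against the software implementation; the dictionary keys and the `gain` helper are names chosen here for illustration. Under these assumptions the VHDL figure comes out at about 458, in line with the table, while the ImpulseC ratio comes out at about 426 rather than the listed 415X, so the listed value may derive from a different measurement or rounding.

```python
# Throughput figures as listed in Table 1, converted to Kbps.
# Assumption: 1 Mbps = 1000 Kbps.
THROUGHPUT_KBPS = {
    "software": 42.096,
    "hardware_impulsec": 17.929 * 1000,
    "hardware_vhdl": 19.280 * 1000,
}


def gain(implementation, baseline="software"):
    """Return the throughput ratio of an implementation over the baseline."""
    return THROUGHPUT_KBPS[implementation] / THROUGHPUT_KBPS[baseline]


# Gain of every implementation relative to the software-only prototype.
gains = {name: gain(name) for name in THROUGHPUT_KBPS}
```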

2) Application-level optimizations: At the application level, the TripleDES algorithm itself was optimized. As described in [9], changes can be made at several points of the algorithm that drastically improve its performance through data precomputation and through optimization of the logical permutation operations and lookup tables (operations on the S-boxes).

a) Subkey precomputation: The subkeys needed to run the TripleDES algorithm were precomputed, which makes them immediately accessible. Precomputing them in advance also allowed their values to be replicated spatially so that they can be used efficiently in the operations of the TripleDES algorithm. Precomputation is possible because the key value is assumed not to be changed by the user in the future. The subkeys are generated and precomputed externally, and the resulting code was embedded in the hardware description.

b) Precomputation and combination of the permutation table P with the S-boxes: The DES algorithm provides an information expansion mechanism through a predefined set of bits known as S-boxes. The algorithm also includes bit permutation tables, known as the P table, which, for performance reasons and with certain modifications to the implementation, can be combined with the S tables to precompute later processing. These combined boxes are called SP boxes.

c) Embedded E expansion table: The E expansion table is used in DES to take a 32-bit data block to a 48-bit format, which is needed to XOR the subkeys with this expanded value. For performance reasons, the E-table expansion process was embedded in the translation operations of the SP boxes.

3) Microarchitecture-level optimizations: At the microarchitecture level, the TripleDES algorithm itself was not optimized. Instead, the aim was to optimize how the operations are carried out, together with the implementation of pipelines, loop unrolling, and data replication to favor the parallelization of operations.

a) Implementation of the permutation sequences technique: To implement the initial and final bit permutations in an optimized way, the "permutation sequences" technique, also described in [9], was used. This technique is widely used in various data encryption algorithms as well as in error correction algorithms in industry.

b) Data replication: Replicated code was generated for the memory entities that store the data of the 8 SP boxes. This made it possible to perform
table lookup operations in parallel.

c) Pipelines: To increase the data processing throughput, the processing of the algorithm was subdivided into different stages that can operate in isolation. Most of these stages combine the SP boxes with the input data. The pipeline allowed several data items to be processed at the same time.

6. RESULTS OBTAINED

The metrics obtained from running the hardware accelerator components that were developed, as well as the software version of the application, are shown in Table 1.
The table shows that the speedup obtained by implementing part of the application's algorithm in an HPSoC with an accelerator component is around 400X in both cases. The reported throughput corresponds to the measured execution time of the SW function that sends the data to the accelerator component.
Table 2 also shows an excerpt of the synthesis report for the most significant accelerator component (developed in ImpulseC), giving the percentage of FPGA resources used.

7. CONCLUSIONS

This work presented the performance benefits obtained by implementing cryptographic accelerators using HPSoCs.
The implementations show how the performance of a software application running on an embedded system can be increased by more than two orders of magnitude. The comparative results showed that applying the proposed methodology yielded a gain of around 400X in both cases.
This work also used state-of-the-art tools for programmable logic development to implement the embedded system and the accelerator components.
This work concludes that using an HPSoC is a technically viable alternative for improving the performance of today's embedded digital systems.

8. REFERENCES

[1] Wohlmuth, Otto, "High performance computing based on FPGAs", IEEE Field Programmable Logic and Applications (FPL), 2008.
[2] Ramakrishna, Rau and Fisher, Joseph, "Instruction-level parallel processing: History, overview, and perspective", The Journal of Supercomputing, Volume 7, Numbers 1-2.
[3] Meyer-Baese, Uwe, "Digital signal processing with field programmable gate arrays", Third Edition, Springer, page 589.
[4] D. Pellerin and S. Thibault, "Practical FPGA Programming in C", Prentice Hall Professional Technical Reference, 2005.
[5] Pai, Vijay and Adve, Sarita, "Code transformations to improve memory parallelism", Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 147-155, 1999.
[6] Wolf, M.E., Chen, Ding-Kai, "Combining loop transformations considering caches and scheduling", 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996.
[7] Bruce Schneier, "Applied Cryptography, Second Edition", John Wiley, 2004.
[8] Federal Information Processing Standards Publication, "DATA ENCRYPTION STANDARD (DES)", FIPS PUB 46-3.
[9] PK Yuen, "Practical Cryptology and Web Security", Pearson Education Limited, Chap. 4, 2006.
[10] Xilinx Documentation files, "EDK Concepts, Tools, and Techniques".
[11] Shenoy, "Accelerating Software Applications Using the APU Controller and C-to-HDL Tools", Xilinx Application Note XAPP 901.
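Optimization 2b) of the TripleDES paper above (combining the permutation table P with the S-boxes) relies on the fact that a bit permutation distributes over the OR of disjoint bit fields. The following minimal Python sketch illustrates that idea only. It is not the authors' hardware implementation: the tables are toy values generated from a fixed seed, not the real DES tables, and all names (`SBOX`, `P`, `permute`, `round_direct`, `round_sp`) are hypothetical.

```python
import random

# Toy tables (NOT the real DES tables): 8 S-boxes mapping 6 bits to 4 bits,
# and a permutation P of the 32 bit positions, generated from a fixed seed.
rng = random.Random(0)
SBOX = [[rng.randrange(16) for _ in range(64)] for _ in range(8)]
P = list(range(32))
rng.shuffle(P)


def permute(word):
    """Move bit i of the 32-bit input word to bit position P[i]."""
    out = 0
    for i in range(32):
        if (word >> i) & 1:
            out |= 1 << P[i]
    return out


def round_direct(chunks):
    """Reference path: 8 S-box lookups, concatenation, then the permutation."""
    word = 0
    for i, x in enumerate(chunks):
        word |= SBOX[i][x] << (4 * i)
    return permute(word)


# Precomputed SP boxes: the permutation is folded into every table entry.
SP = [[permute(SBOX[i][x] << (4 * i)) for x in range(64)] for i in range(8)]


def round_sp(chunks):
    """Optimized path: 8 lookups combined with OR, no separate permutation."""
    out = 0
    for i, x in enumerate(chunks):
        out |= SP[i][x]
    return out


chunks = [5, 63, 0, 17, 42, 8, 31, 60]  # eight 6-bit inputs, arbitrary values
print(round_direct(chunks) == round_sp(chunks))
```

Because each S-box output occupies its own 4-bit field, the eight SP lookups can run independently, which is consistent with the data replication and parallel lookups described in optimization 3b).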
HIGH THROUGHPUT 4X4 AND 8X8 SATD SIMILARITY CRITERIA ARCHITECTURES
FOR VIDEO CODING APPLICATIONS

Julio S. Dominges Jr., Vinicius N. Possani, Dieison S. Silveira,
Leomar S. da Rosa Jr., Luciano V. Agostini

Group of Architectures and Integrated Circuits - GACI
Center of Technological Development - CDTec
Federal University of Pelotas - UFPel
{jsdominges, vnpossani, dssilveira, leomarjr, agostini}@inf.ufpel.edu.br

ABSTRACT

This paper presents hardware solutions for calculating the SATD (Sum of Absolute Transformed Differences) similarity criterion using the 2-D Hadamard transform. Two SATD versions were designed: one for 4x4 blocks and another for 8x8 blocks. The design focuses on the H.264/AVC video coding standard. The SATD criterion was compared with two other criteria commonly used in video coding: Sum of Squares of Differences (SSD) and Sum of Absolute Differences (SAD). The results of this evaluation showed that SATD is a good solution, especially when high-motion videos are being encoded. The designed architectures target real-time processing of high-resolution videos, so a high level of parallelism was adopted. The results encourage the use of SATD in video coding systems.

1. INTRODUCTION

The transmission and storage of digital videos becomes a serious issue as video resolution and quality increase. Thus, video coding has become a large field of research, in which algorithms and techniques for data compression have been developed so that digital videos can be represented with a much smaller amount of data and with minimal or imperceptible quality losses. In this context, the H.264/AVC standard emerged as the newest video coding standard [1-2], offering a significant increase in compression rate while maintaining the same quality as previous standards. This is because the standard introduces variable block sizes, intra-frame prediction performed in the spatial domain, and multiple reference frames for inter-frame prediction, among other tools [3].

The main objective of the H.264/AVC standard was to combine high compression rates with high quality in compressed videos. This is achieved through the use of extremely complex algorithms. With current technology, this complexity makes the implementation of encoders and decoders of this standard prohibitive in software, particularly when high-resolution videos are processed under real-time constraints (24 to 30 frames per second). Another problem with software implementations is their power and energy consumption. These factors motivate the design of hardware solutions for video coding systems.

The H.264/AVC standard exploits the redundancies present in digital videos. It supports two types of prediction: inter-frame prediction, which deals with temporal redundancy among the frames of a video (similarities between neighboring frames), and intra-frame prediction, which exploits spatial redundancy (areas with pixels of similar colors and shades within the same frame). In these processes, the frame is divided into blocks of predefined sizes, and then the prediction scheme is applied. The prediction searches the already processed information for the best way to represent the current block (the block being processed) from blocks that were already coded. In this process, it is necessary to define a criterion that expresses the similarity between the reference block and the original one. This similarity criterion has a direct impact on the video quality and on the size of the generated bitstream [4]. The block chosen in this process (the predicted block) is then subtracted from the current block, generating a residue. Other steps, such as transforms and quantization, also contribute to reducing spatial redundancy, now operating on the residues of the prediction step.

The literature presents several similarity criteria, the most used being Mean Absolute Error (MAE), Sum of Squares of Differences (SSD), Sum of Absolute Differences (SAD), and the criterion explored in this work, the Sum of Absolute Transformed Differences (SATD) [4]. The SATD is the similarity criterion that gives the best results in the encoding process, allowing higher compression rates or higher quality of the compressed video. This is because a transform is applied to the calculated residue, allowing a better decision about the similarity. This paper presents different architectural solutions with different pipeline stages to calculate the SATD with blocks of size
4x4 and 8x8 samples using the 2-D Hadamard transform, as defined by the H.264/AVC standard. The goal of this work is to investigate the advantages and disadvantages of using SATD as a similarity criterion in video coding with the H.264/AVC standard. With SATD it is possible to obtain higher compression rates without significant loss of quality, and this does not require a large increase in hardware complexity compared to other solutions.

2. SIMILARITY CRITERIA AND HADAMARD TRANSFORM

The similarity criteria are used in the intra- and inter-prediction processes to indicate how similar the candidate block is to the current block. To calculate the SATD, the current block is subtracted from the candidate block, sample by sample. The absolute results of this subtraction are then applied to the 2-D Hadamard transform. The results of the 2-D Hadamard transform are added to generate the similarity of the candidate block in relation to the original block. The 2-D Hadamard transform was selected for the SATD application in order to achieve even greater video compression. Equations (1) and (2) are used for the calculation of the 2-D Hadamard transform. These equations are based on the H.264/AVC standard [1], since the Hadamard transforms are used in the forward and inverse transform modules of this standard, with 4x4 and 8x8 blocks as inputs.

Equation (1) represents the matrices used in the Hadamard calculation for 4x4 blocks, and expression (2) is used when 8x8 blocks are targeted. The calculations for both sizes need the same kinds of components. However, in the second case the number of samples processed is four times larger, which increases the number of components used. As defined by the H.264/AVC standard, the values in equations (1) and (2) are integer approximations of the floating point values of the transform, which simplifies the necessary operations. This approximation does not bring major changes compared to the results with floating point values [2].

Before the Hadamard calculation, equation (4) is applied, where S represents each sample resulting from the absolute difference of the blocks applied to the Hadamard transform.

[Equation (4) defining S: garbled in the source]

This paper also considers two of the most widely used metrics: the Sum of Absolute Differences (SAD) and the Sum of Squared Differences (SSD). These similarity criteria were compared to the SATD through software evaluations to analyze the impact of each evaluated criterion on the quality of the encoded videos and on the compression rate achieved. The SAD result is obtained by adding the absolute differences between each pair of corresponding samples from the candidate and current blocks. The SAD definition is shown in Equation (3), where the matrix O represents the original block and the matrix R represents the candidate block [4].

$$\mathrm{SAD} = \sum_{i,j} \left| O_{i,j} - R_{i,j} \right| \qquad (3)$$

The SSD degree of similarity is calculated as the sum of the squared differences between each sample of the original block and the corresponding sample of the candidate block. This criterion uses more complex arithmetic functions (such as power), which increases the complexity of the algorithms used to calculate the SSD. The mathematical expression that defines the SSD is shown in equation (4) [3].

$$\mathrm{SSD} = \sum_{i,j} \left( O_{i,j} - R_{i,j} \right)^{2} \qquad (4)$$

3. SOFTWARE EVALUATION

The SATD was compared to the two other similarity criteria using the H.264/AVC reference software [5]. The goal of this software evaluation was to demonstrate the impact of the selected similarity criterion on the global encoder results. The similarity criteria were evaluated through PSNR and bitrate results. The criteria evaluated were SATD, SSD and SAD, each one used to encode three different QCIF videos. The selected videos are widely used by the video coding community. The H.264/AVC reference software was configured to use all encoding tools and a QP of 38.

As can be seen in Table 1, the PSNR and bitrate gains are better when using SATD. In some cases the SATD results are worse than those reached by other similarity criteria, but the results achieved by SATD present the best tradeoff between PSNR value (higher quality) and bitrate (smaller amount of data) for the evaluated videos. This means that it is possible to achieve a higher compression rate as well as a higher quality when using SATD. Based on the results of this comparison, we decided to develop this work.
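The SAD and SSD criteria described in Section 2 can be sketched in a few lines of Python. This is only an illustration of the definitions referenced as equations (3) and (4): the function names and the 2x2 sample blocks are hypothetical and are not taken from the H.264/AVC reference software.

```python
def sad(original, candidate):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(abs(o - r)
               for row_o, row_r in zip(original, candidate)
               for o, r in zip(row_o, row_r))


def ssd(original, candidate):
    """Sum of Squared Differences between two equally sized blocks."""
    return sum((o - r) ** 2
               for row_o, row_r in zip(original, candidate)
               for o, r in zip(row_o, row_r))


# Hypothetical 2x2 blocks; the sample differences are 1, 0, -3 and 4.
O = [[10, 12], [14, 16]]
R = [[9, 12], [17, 12]]
print(sad(O, R))  # 1 + 0 + 3 + 4 = 8
print(ssd(O, R))  # 1 + 0 + 9 + 16 = 26
```

The squaring in `ssd` is the "more complex arithmetic" the text refers to: in hardware it implies multipliers, whereas `sad` needs only subtractors and adders.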
Table 1. Comparison of PSNR and bitrate among the SATD, SSD and SAD similarity criteria

| Video | Metric | SATD | SSD | SAD |
|---|---|---|---|---|
| Mobile | PSNR (dB) | 33.807 | 33.790 | 33.754 |
| Mobile | Bitrate (kbit/s) | 325.28 | 328.02 | 328.72 |
| Foreman | PSNR - Y (dB) | 36.790 | 36.713 | 36.641 |
| Foreman | Bitrate (kbit/s) | 120.02 | 121.61 | 120.48 |
| Carphone | PSNR (dB) | 37.396 | 37.341 | 37.296 |
| Carphone | Bitrate (kbit/s) | 99.09 | 100.73 | 100.00 |

4. DESIGNED ARCHITECTURES

Two different architectures were designed in VHDL for the SATD calculation: one for 4x4 blocks and another for 8x8 blocks. The main objective was to find the best hardware solution for calculating this similarity criterion. As equations (1) and (2) show, the matrices involved in the Hadamard calculations have only two possible values (1 and -1), so the calculation of the 2-D Hadamard transform requires only additions and subtractions. A division by two or by four is also applied to the final result for the 4x4 and 8x8 Hadamard transforms, respectively. These divisions are easily converted to a right shift of one or two binary positions, which is very simple to implement in hardware.

4.1. SATD 4x4 Architecture

Based on the 4x4 2-D Hadamard formula defined in (1), the process was divided into four steps in order to better expose parallelizable operations. These calculations are listed in Table 2 [3].

Table 2. Algorithm for the 4x4 2-D Hadamard calculation

| Step 1 | Step 2 | Step 3 | Step 4 |
|---|---|---|---|
| a0 = w0 + w4 | b0 = a0 + a1 | c0 = b0 + b1 | S0 = c0 + c1 |
| a1 = w8 + w12 | b1 = a2 + a3 | c1 = b2 + b3 | S1 = c0 - c1 |
| a2 = w1 + w5 | b2 = a4 + a5 | c2 = b0 - b1 | S2 = c2 - c3 |
| a3 = w9 + w13 | b3 = a6 + a7 | c3 = b2 - b3 | S3 = c2 + c3 |
| a4 = w2 + w6 | b4 = a0 - a1 | c4 = b4 + b5 | S4 = c4 + c5 |
| a5 = w10 + w14 | b5 = a2 - a3 | c5 = b6 + b7 | S5 = c4 - c5 |
| a6 = w3 + w7 | b6 = a4 - a5 | c6 = b4 - b5 | S6 = c6 - c7 |
| a7 = w11 + w15 | b7 = a6 - a7 | c7 = b6 - b7 | S7 = c6 + c7 |
| a8 = w0 - w4 | b8 = a8 - a9 | c8 = b8 + b9 | S8 = c8 + c9 |
| a9 = w8 - w12 | b9 = a10 - a11 | c9 = b10 + b11 | S9 = c8 - c9 |
| a10 = w1 - w5 | b10 = a12 - a13 | c10 = b8 - b9 | S10 = c10 - c11 |
| a11 = w9 - w13 | b11 = a14 - a15 | c11 = b10 - b11 | S11 = c10 + c11 |
| a12 = w2 - w6 | b12 = a8 + a9 | c12 = b12 + b13 | S12 = c12 + c13 |
| a13 = w10 - w14 | b13 = a10 + a11 | c13 = b14 + b15 | S13 = c12 - c13 |
| a14 = w3 - w7 | b14 = a12 + a13 | c14 = b12 - b13 | S14 = c14 - c15 |
| a15 = w11 - w15 | b15 = a14 + a15 | c15 = b14 - b15 | S15 = c14 + c15 |

The architecture takes as input two 4x4 blocks, the current block and the candidate block, each with 16 samples. The first module calculates the difference between the two blocks, subtracting each sample of the current block from the corresponding sample of the candidate block. The subtraction results are sent to the unit that generates the absolute values. These values are then applied to the 2-D Hadamard module. After the transform, the values are added through an adder tree to obtain the final SATD value. Fig. 1 illustrates the block diagram of the 4x4 SATD architecture, showing the four designed modules.

Figure 1. Block diagram of the 4x4 SATD architecture.

Several pipelined versions of this architecture were designed to find the best relation between processing rate and hardware use. The solution presented in this paper is the one with the highest throughput among all investigated solutions. This version was designed as a pipeline with 10 stages: one stage for the difference calculation, one stage for the absolute value generation, four stages for the Hadamard calculations, and four stages for the output adder tree.

4.2. SATD 8x8 Architectures

The structure used in the 8x8 SATD architecture is similar to that of the 4x4 SATD architecture. The algorithm for calculating the 8x8 Hadamard transform is more complex due to the larger number of input samples. To avoid a large increase in the use of hardware resources, the 8x8 Hadamard transform was designed by exploiting the separability principle of 2-D transforms: two 1-D transforms are applied to the input data to generate the final result. The two 1-D transforms are identical, but the 2-D output block of the first transform must be transposed before being used as input for the second 1-D transform. In this case, each 1-D transform must process eight input samples to finish its calculations (one row or one column of the 8x8 input block), and the process must be repeated eight times to process one complete 8x8 block.

An algorithm was extracted from equation (2) to support the hardware design. This algorithm calculates the 1-D Hadamard transform and is illustrated in Table 3. It is shown in simplified form for better understanding, which avoids a large number of lines in Table 3. The calculations shown in Table 3 are expanded eight times, and the index "i" is incremented by 8 units at each expansion. Thus, after the eight expansions, the algorithm is complete. The hardware implementation of this
algorithm allows the processing of 64 samples in parallel or, in other words, one complete 8x8 block can be processed at each clock cycle.

Table 3. Simplified algorithm for the 8x8 Hadamard

| Step 1 | Step 2 | Step 3 |
|---|---|---|
| a0 = Wi + Wi+4 | b0 = ai + ai+2 | c0 = bi + bi+1 |
| a1 = Wi+1 + Wi+5 | b1 = ai+1 + ai+3 | c1 = bi - bi+1 |
| a2 = Wi+2 + Wi+6 | b2 = ai - ai+2 | c2 = bi+2 + bi+3 |
| a3 = Wi+3 + Wi+7 | b3 = ai+1 - ai+3 | c3 = bi+2 - bi+3 |
| a4 = Wi - Wi+4 | b4 = ai+4 + ai+6 | c4 = bi+4 + bi+5 |
| a5 = Wi+1 - Wi+5 | b5 = ai+5 + ai+7 | c5 = bi+4 - bi+5 |
| a6 = Wi+2 - Wi+6 | b6 = ai+4 - ai+6 | c6 = bi+6 + bi+7 |
| a7 = Wi+3 - Wi+7 | b7 = ai+5 - ai+7 | c7 = bi+6 - bi+7 |

The 8x8 SATD was also designed with different numbers of pipeline stages, and the best solution in terms of processing rate is presented in this paper. The difference calculation must be able to process two 8x8 input blocks, one for the current block and another for the candidate block. The absolute value generation must process 64 input samples. This means that the first two modules are four times bigger than those of the 4x4 SATD version. The 1-D transform architecture was duplicated to increase the processing rate, as shown in Fig. 2, so the same hardware is not reused and data dependencies are avoided. With this solution the 8x8 SATD architecture is able to process one new input block at each clock cycle.

Figure 2. Block diagram of the 8x8 SATD architecture.

5. RESULTS

The architectures were described in VHDL, synthesized and validated using the Xilinx ISE 10.1 CAD tool. The Virtex-II Pro FPGA family was used and the XC2VP30 device was selected (XILINX INC, 2010). The synthesis results are presented in Tables 4 and 5. Table 4 shows the results only for the 2-D Hadamard transform for the 4x4 and 8x8 block sizes. Table 5 presents the complete SATD architectures, also for the 4x4 and 8x8 block sizes.

Table 4. Synthesis results of 4x4 and 8x8 Hadamard architectures

| | Hadamard 4x4 | Hadamard 8x8 |
|---|---|---|
| # Slices | 433 (3%) | 402 (2%) |
| # Slice Flip Flops | 656 (2%) | 736 (2%) |
| # 4-input LUTs | 672 (2%) | 481 (1%) |
| Minimum Period | 2.774 ns | 2.738 ns |
| Frequency | 360.458 MHz | 365.263 MHz |

Table 5. Synthesis results of the SATD 4x4 and SATD 8x8 architectures

| | SATD 4x4 | SATD 8x8 |
|---|---|---|
| # Slices | 858 (6%) | 4067 (29%) |
| # Slice Flip Flops | 1535 (5%) | 3265 (11%) |
| # 4-input LUTs | 1120 (4%) | 6459 (32%) |
| Minimum Period | 2.800 ns | 6.549 ns |
| Frequency | 357.079 MHz | 152.685 MHz |

All solutions reached high operating frequencies. The hardware resource consumption of the 8x8 SATD was relatively high. Such consumption can be a limiting factor in hardware implementations of the complete encoder or decoder, where more than one SATD unit is required.

6. CONCLUSION

This paper presented two SATD architectures, one for 4x4 blocks and another for 8x8 blocks. The SATD was evaluated through comparisons with other criteria (SSD and SAD), and it presented the best tradeoff between bitrate and quality. The architectures designed in this work were described in VHDL and synthesized for Xilinx Virtex-II Pro FPGAs. The designed SATD modules can be introduced into the inter-prediction or intra-prediction modules of the H.264/AVC standard, and they can also be used in older standards. Because the SATD has a higher complexity than other criteria, it consumes more area. However, using the SATD it is possible to achieve higher quality in compressed video without reducing the compression rate, or to achieve higher compression rates without significant degradation in video quality. Thus, from the software evaluation results and the hardware design results, it is possible to conclude that the SATD is a good solution for hardware implementations of video coders.

7. REFERENCES

[1] ITU-T Recommendation H.264/AVC (03/05): advanced video coding for generic audiovisual services, 2005.
[2] RICHARDSON, I. H.264/AVC and MPEG-4 Video Compression - Video Coding for Next-Generation Multimedia. Chichester: John Wiley and Sons, 2003.
[3] Omitted to allow blind review.
[4] KUHN, P. Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Boston: Kluwer Academic Publisher, 1999.
[5] VCEG. JM Reference Software 17.2. Available at http://iphome.hhi.de/suehring/tml. Accessed August 2010.
[6] XILINX INC. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet. [S.l.], 2005. Available at www.xilinx.com. Accessed August 2010.
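The four-step algorithm of Table 2 (Section 4.1 of the SATD paper above) can be checked in software against a plain matrix formulation. The sketch below is a minimal Python model, not the authors' VHDL. It assumes the 16 input samples w0..w15 are stored in row-major order, and the matrix `H4` is inferred from the sign pattern of Table 2. The `satd4x4` helper follows the order described in Section 2 (absolute value before the transform). Summing the absolute values of the transformed samples and applying the division by two as a final right shift are assumptions.

```python
# 4x4 Hadamard matrix inferred from the sign pattern of Table 2 (assumption).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def hadamard4x4_butterfly(w):
    """Four add/subtract steps of Table 2; w holds 16 samples, row-major."""
    a, b, c, s = [0] * 16, [0] * 16, [0] * 16, [0] * 16
    for j in range(4):  # step 1: a0..a15
        a[2 * j] = w[j] + w[j + 4]
        a[2 * j + 1] = w[j + 8] + w[j + 12]
        a[2 * j + 8] = w[j] - w[j + 4]
        a[2 * j + 9] = w[j + 8] - w[j + 12]
    for j in range(4):  # step 2: b0..b15
        b[j] = a[2 * j] + a[2 * j + 1]
        b[j + 4] = a[2 * j] - a[2 * j + 1]
        b[j + 8] = a[2 * j + 8] - a[2 * j + 9]
        b[j + 12] = a[2 * j + 8] + a[2 * j + 9]
    for k in range(0, 16, 4):  # steps 3 and 4, one group of four at a time
        c[k] = b[k] + b[k + 1]
        c[k + 1] = b[k + 2] + b[k + 3]
        c[k + 2] = b[k] - b[k + 1]
        c[k + 3] = b[k + 2] - b[k + 3]
        s[k] = c[k] + c[k + 1]
        s[k + 1] = c[k] - c[k + 1]
        s[k + 2] = c[k + 2] - c[k + 3]
        s[k + 3] = c[k + 2] + c[k + 3]
    return s


def hadamard4x4_matrix(w):
    """Reference result S = H4 * W * H4, computed entry by entry."""
    return [sum(H4[k][r] * w[4 * r + col] * H4[m][col]
                for r in range(4) for col in range(4))
            for k in range(4) for m in range(4)]


def satd4x4(current, candidate):
    """SATD sketch: |difference| -> Hadamard -> sum of magnitudes -> shift."""
    diffs = [abs(x - y) for x, y in zip(current, candidate)]
    return sum(abs(v) for v in hadamard4x4_butterfly(diffs)) >> 1


w = list(range(16))
print(hadamard4x4_butterfly(w) == hadamard4x4_matrix(w))  # True
print(satd4x4(w, [0] * 16))  # 120
```

Each of the four steps uses only independent additions and subtractions, which is why the paper can map each step onto one pipeline stage of the Hadamard module.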
VIDEO ACQUISITION UNDER THE ITU-R BT.656-4 STANDARD USING PROGRAMMABLE LOGIC

CONTRERAS Juan Carlos, KOWALSKI Emilio, GUTIÉRREZ Guillermo, CAVALLERO Rodolfo

CUDAR - Centro Universitario de Desarrollo en Automación y Robótica -
Universidad Tecnológica Nacional - Facultad Regional Córdoba -
MM López esq. Cruz Roja Argentina - Ciudad Universitaria - Córdoba - Argentina

ABSTRACT

This paper describes the design and implementation of an IP core for acquiring video signals from an analog-to-digital converter whose output follows the ITU-R BT.656-4 standard. It describes the signals generated by the converter and the method implemented to acquire them and store them in a FIFO memory, which serves as the interface with the embedded system of which this IP is a part.

1. INTRODUCTION

The Centro Universitario de Desarrollo en Automación y Robótica (CUDAR) is developing a wavelet-based video compression project in programmable logic, using as its hardware platform the Xilinx® University Program Virtex-II Pro Development System (XUPV2P) board from Digilent [1].
The project involves the acquisition and processing of video signals. A Samsung SDC-415 camera, which provides a composite video (CVBS) output under the PAL standard, is used as the analog video source. This signal is received and converted into digital signals under the ITU-R BT.656-4 YCrCb 4:2:2 standard [2] [3] [4] by the ADV7183B converter [5], which is located on Digilent's VDEC1 video acquisition board. This board is connected to the development platform as an expansion.
A review of similar existing papers such as [6], [7] and [8] showed that video acquisition is performed by modules provided by Xilinx, which did not suit the needs of our project. It was therefore decided to design a custom IP that is flexible enough to be used in other applications as well.
The IP captures, acquires and organizes the digital data stream coming from the converter and then stores it in a FIFO memory. The information held in the FIFO is made available to one of the two PowerPC processors embedded in the silicon of the FPGA. This processor manages the data exchanged among the different IP modules of the system that make the acquisition, compression and transmission of digital video possible.

2. ADV7183B CONVERTER / ITU-R BT.656-4 STANDARD

The ADV7183B converter accepts at its inputs composite video (CVBS), S-Video and component video (YPrPb), among other formats, under the NTSC, PAL and SECAM analog video standards. The digital video output signal complies with the ITU-R BT.656-4 YCrCb 4:2:2 standard [3], which specifies an 8-bit data bus at 27 MHz with synchronization embedded in the data stream.
The converter has several operating modes that are configurable through registers accessible over I2C. The mode selected for this application delivers the output data on a 16-bit bus at 13.5 MHz, with the LLC2, HS, VS and FIELD synchronization signals on individual pins.

Fig. 1. ADV7183B data stream in 16-bit output mode for one horizontal line.
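As a quick check of the two bus configurations mentioned in Section 2, the short Python sketch below computes the raw bus throughput of each mode from the widths and clock frequencies quoted in the text. The function and variable names are hypothetical. The point is only that the 16-bit mode at 13.5 MHz carries the same raw data rate as the 8-bit interface at 27 MHz, at half the clock frequency.

```python
def bus_rate_mbit_s(width_bits, clock_mhz):
    """Raw bus throughput in Mbit/s: one word transferred per clock cycle."""
    return width_bits * clock_mhz


standard_mode = bus_rate_mbit_s(8, 27.0)    # 8-bit ITU-R BT.656 interface
selected_mode = bus_rate_mbit_s(16, 13.5)   # 16-bit ADV7183B output mode
print(standard_mode, selected_mode)  # 216.0 216.0
```

A slower 13.5 MHz source clock is also easier to sample from a faster FPGA system clock, which is relevant to any design that must bring the converter signals into its own clock domain.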
Table 1. AVCODE structure

| Bit number | Bit name | Description |
|---|---|---|
| 7 (MSB) | 1 | Constant |
| 6 | F | Even/odd field |
| 5 | V | Blanking field |
| 4 | H | SAV/EAV |
| 3 | P3 | Protection bit |
| 2 | P2 | Protection bit |
| 1 | P1 | Protection bit |
| 0 | P0 | Protection bit |

Table 2.

Fig. 2. Distribution of horizontal lines across the different fields under the PAL standard.

The data stream is shown in Figure 1: the luminance data Y occupy the upper part of the bus, that is, lines 15 to 8, and the chrominance data Cr and Cb alternate on the lower part, lines 7 to 0.
The first data items Cb0, Y0, Cr0 make up the first active video pixel, while the next pixel is formed by Y1 and the chrominance of the previous pixel. The 720 active video pixels of a line are represented in this way.
The SAV (Start Active Video) code marks the start of a line and the EAV (End Active Video) code marks its end. These codes are called AVCODEs. Each consists of two 16-bit words, where the first is 0x00-0xFF (0x denotes hexadecimal notation) and the second is 0x[AV]-0x00. The part labeled AV is defined in Table 1. Bit F indicates whether the line belongs to an even or an odd field. Bit V indicates whether the line lies within a blanking field or within active video. Bit H indicates whether the code is a start of line (SAV) or an end of line (EAV). Finally, P0-P3 are protection bits, which are generated according to the following equations:

$$P_3 = V \oplus H, \quad P_2 = F \oplus H, \quad P_1 = F \oplus V, \quad P_0 = F \oplus V \oplus H \qquad (1)$$

Recommendation ITU-R BT.656-4 [2] [3] [4] defines 625 video lines for the PAL standard, as illustrated in Figure 2, which also shows the values taken by the V, H and F bits of the AVCODEs for the different fields.

3. SYSTEM DESCRIPTION

The compression system uses one of the FPGA's PowerPC processors to coordinate the internal subsystems: data transfer, communications handling, acquisition control, etc. The system was built with Xilinx's EDK development platform, using XPS[x] for the hardware and SDK[x] for the software.

Fig. 3. Simplified block diagram of the system.

The system architecture uses a processor connected to the rest of the peripherals through a PLB bus [9]. This bus has 64 data bits and 32 address bits and, like the peripherals, is configured using the resources available in the FPGA. The development platform automates several of these processes, such as bus configuration, peripheral mapping, peripheral management, addition or
removal, etc.
Figure 3 shows a simplified block diagram of the implemented system. The "UART", "I2C", "Ethernet" and "PowerPC" blocks and the PLB bus are provided by EDK. The "Adquisición de Video" (Video Acquisition) IP, which is the main subject of this paper, and the "Transformada Wavelet" (Wavelet Transform) IP are in-house developments.
To include in-house developments in the system, the import tool provided by EDK is used. This tool eases the interconnection of the peripheral with the system through what is called the IPIF. The IPIF contains pre-assembled modules for exchanging data between the PLB and the peripheral, as well as FIFO memories and control lines, which greatly simplifies the import task and ensures compatibility within the system.

4. IP DESIGN

To capture the video data stream, a synchronous system based on a 50 MHz clock was designed. Figure 4 shows the block diagram of the system.

Fig. 4. Block diagram of the IP.

To synchronize the signals coming from the A/D converter with the FPGA clock signal, a component called "Sincronizador" (Synchronizer) was designed. This component synchronizes the 13.5 MHz LLC2 clock with the 50 MHz CLK clock signal. The circuit described in this block is shown in Figure 5. The simulation of the input and output signals of the circuit can be seen in Figure 6.

Fig. 5. Synchronizer circuit.

Fig. 6. Simulation of the synchronizer circuit.

When capturing a line of active video, it was decided to discard the horizontal blanking portion and to store SAV, EAV and the active video separately. This selection is performed by the component called "Habilitador" (Enabler), a Moore state machine that identifies the different parts of the data stream. It in turn sends an enable signal to the three blocks called "Puente" (Bridge). When this signal is received, the video bus data and the previously synchronized LLC2 signal are passed on to the "Conversores 8-64" (8-to-64 converters). This block groups the 8-bit video data into 64-bit words, since that is the width selected for the FIFO memory.
There are two models of the 8-to-64 converter, one used for the AVCODEs (SAV, EAV) and another used for the luminance (Y) and the chrominances (Cr, Cb). The component in charge of the AVCODEs receives two 8-bit words from the lower part of the 16-bit bus, groups them and fills the remaining 48 bits with zeros to form a 64-bit word. The component for the remaining parts receives eight 8-bit data items and groups them into another 64-bit word. The luminance always uses the upper part of the 16-bit bus and the chrominance the lower part.

Fig. 7. Simulation of the sequencer circuit.

For Cr and Cb, a component is needed that alternates the enable signal that
121
almacena los datos para que puedan ser separados, este 5. SÍNSTESIS E IMPLEMENTACIÓN
componente se llama “Secuenciador”. En la Figura 7 se
puede ver su simulación. Al formar la palabra de 64 bits La síntesis del IP arrojó los resultados de la Tabla 4.
los datos son transferidos a los buffers intermedios
llamados “Buffer”. Existen seis de ellos: SAV; Y1; Cr; Y2; Resultados de la síntesis del IP
Tabla 4.
Cb; EAV. Dado el formato 4:2:2 de video, se hace Descripción Utilizado Total %
necesario dos buffer para la luminancia Y1 e Y2.
Nº de Slices 466 13696 3%
Tabla 2. Funcionamiento del multiplexor Nº Slices de FF 649 27392 2%
C2 C1 C0 Salida
Nº LUTs 4 input 364 27392 1%
0 0 0 SAV
Nº de IOBs 85 556 15%
0 0 1 Y1
Nº de FF IOB 8 - -
0 1 0 Cr
Nº de GCLKs 5 16 31%
0 1 1 Y2
1 0 0 Cb 6. CONCLUSIÓN Y FUTUROS TRABAJOS
1 0 1 EAV El presente desarrollo ya forma parte del sistema de
compresión de video, ha sido embebido con éxito en la
El último paso es el almacenamiento en la memoria plataforma y se están realizando pruebas de verificación.
FIFO. Esto lo realiza el componente “Organizador”, el cual El método de incorporación de periféricos propuesto por
se compone de una máquina de estados de Moore, que al Xilinx en el EDK, utilizando un IPIF como interface,
recibir el aviso de llenado del “Buffer” correspondiente a demuestra ser robusto y con un campo muy amplio de
SAV, Cb o EAV, generará una secuencia binaria. posibles aplicaciones. En estos momentos se está buscando
optimizar el movimiento de datos entre la FIFO y la DDR
Tabla 3. Almacenamiento en FIFO del sistema de compresión combinando las señales de
FIFO 64-bit word Nº dato crominancia y luminancia para optimizar el número de
AV CODE (SAV) 1 movimientos de datos durante la compresión.

Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 2 7. REFERENCIAS
Cr1 Cr2 Cr3 Cr4 Cr5 Cr6 Cr7 Cr8 3
Y9 Y10 Y11 Y12 Y13 Y14 Y15 Y6 4 [1] “Xilinx University Program Virtex-II Pro Development
System” - Hardware Reference Manual - Marzo 2005.
Cb1 Cb2 Cb3 Cb4 Cb5 Cb6 Cb7 Cb8 5 [2] AN9728.2 Intersil Aplication Note. - “BT.656 Video
----------------------------------- --- Interfce for ICs” Julio 2002.
[3] Recommendation ITU-R BT.656-4. “INTERFACES
Y705 Y706 Y707 Y708 Y709 Y710 Y711 FOR DIGITAL COMPONENT VIDEO SIGNALS IN
178
Y712 525-LINE AND 625-LINE TELEVISION SYSTEMS
Cr353 Cr354 Cr355 Cr356 Cr357 Cr358 OPERATING AT THE 4:2:2 LEVEL OF
179
Cr359 Cr360 RECOMMENDATION ITU-R BT.601 (PART A)”.
Y713 Y714 Y715 Y716 Y717 Y718 Y719 [4] AN-10 Digital Creation Labs “Digital Video
180
Y720 Overview” Rev 1.0 Abril 2004.
Cb353 Cb354 Cb355 Cb356 Cb357 Cb358 [5] Datasheet “Multiformat SDTV Video Decoder
181
Cb359 Cb360 ADV7183B”- Rev.B 2005.
AV CODE (EAV) 182 [6] “Capturing Higher Quality Video” - Justin A. Horn,
Student Member, IEEE, James Y. Hu, and Bryce C. Orgill.
La misma va desde cero a seis para controlar un [7] “Real Time Video Processing on FPGA Using on the
multiplexor en el cual están conectados los diferentes Bus Fly Partial Reconfiguration” - Sheetal U. Bhandari, Shaila
de 64bit de los componentes “Buffer”. En la Tabla 2 puede Subbaraman, Shashank S. Pujari, Rashmi Mahajan.
verse el funcionamiento del multiplexor. El componente [8] “FPGA-Based Design Of a High-Performance and
“Organizador” también envía una señal a la FIFO para que Modular Video Processing Platform” Christophe
se almacene el dato puesto en la salida del multiplexor. En Desmouliers, Erdal Oruklu and Jafar Saniie
la Tabla 3 se puede observar el orden en que se almacenan [9] DS448 Xilinx Product specification “PLB IPIF
los datos en la FIFO. (v2.01a)” Agosto 2004.
