

Computer Science/Computer Engineering/Computing

Chapman & Hall/CRC
Computer & Information Science Series

Multicore Computing: Algorithms, Architectures, and Applications
focuses on the architectures, algorithms, and applications of multicore
computing. It will help you understand the intricacies of these architectures
and prepare you to design efficient multicore algorithms.

Contributors at the forefront of the field cover the memory hierarchy for
multicore and manycore processors, the caching strategy Flexible Set
Balancing, the main features of the latest SPARC architecture specification,
the Cilk and Cilk++ programming languages, the numerical software library
Parallel Linear Algebra Software for Multicore Architectures (PLASMA), and
the exact multipattern string matching algorithm of Aho-Corasick. They also
describe the architecture and programming model of the NVIDIA Tesla GPU,
discuss scheduling directed acyclic graphs onto multi/manycore processors,
and evaluate design trade-offs among Intel and AMD multicore processors,
the IBM Cell Broadband Engine, and NVIDIA GPUs. In addition, the book
explains how to design algorithms for the Cell Broadband Engine and how to
use the backprojection algorithm for generating images from synthetic
aperture radar data.

Features
• Equips you with the foundation to design efficient multicore algorithms
• Addresses challenges in parallel computing
• Covers many techniques, tools, and algorithms for solving big data
problems, including PLASMA, Cilk, the Aho-Corasick algorithm, sorting
algorithms, a modularized scheduling method, and the backprojection
algorithm
• Describes various architectures, such as SPARC and the NVIDIA Tesla GPU
• Includes numerous applications and extensive experimental results

Edited by
Sanguthevar Rajasekaran
Lance Fiondella
Mohamed Ahmed
Reda A. Ammar

K12518


Multicore
Computing
Algorithms, Architectures,
and Applications
CHAPMAN & HALL/CRC
COMPUTER and INFORMATION SCIENCE SERIES

Series Editor: Sartaj Sahni

PUBLISHED TITLES

ADVERSARIAL REASONING: COMPUTATIONAL APPROACHES TO READING
THE OPPONENT’S MIND
Alexander Kott and William M. McEneaney

DELAUNAY MESH GENERATION
Siu-Wing Cheng, Tamal Krishna Dey, and Jonathan Richard Shewchuk

DISTRIBUTED SENSOR NETWORKS, SECOND EDITION
S. Sitharama Iyengar and Richard R. Brooks

DISTRIBUTED SYSTEMS: AN ALGORITHMIC APPROACH
Sukumar Ghosh

ENERGY-AWARE MEMORY MANAGEMENT FOR EMBEDDED MULTIMEDIA
SYSTEMS: A COMPUTER-AIDED DESIGN APPROACH
Florin Balasa and Dhiraj K. Pradhan

ENERGY EFFICIENT HARDWARE-SOFTWARE CO-SYNTHESIS USING
RECONFIGURABLE HARDWARE
Jingzhao Ou and Viktor K. Prasanna

FUNDAMENTALS OF NATURAL COMPUTING: BASIC CONCEPTS,
ALGORITHMS, AND APPLICATIONS
Leandro Nunes de Castro

HANDBOOK OF ALGORITHMS FOR WIRELESS NETWORKING AND
MOBILE COMPUTING
Azzedine Boukerche

HANDBOOK OF APPROXIMATION ALGORITHMS AND METAHEURISTICS
Teofilo F. Gonzalez

HANDBOOK OF BIOINSPIRED ALGORITHMS AND APPLICATIONS
Stephan Olariu and Albert Y. Zomaya

HANDBOOK OF COMPUTATIONAL MOLECULAR BIOLOGY
Srinivas Aluru

HANDBOOK OF DATA STRUCTURES AND APPLICATIONS
Dinesh P. Mehta and Sartaj Sahni

HANDBOOK OF DYNAMIC SYSTEM MODELING
Paul A. Fishwick

HANDBOOK OF ENERGY-AWARE AND GREEN COMPUTING
Ishfaq Ahmad and Sanjay Ranka

HANDBOOK OF PARALLEL COMPUTING: MODELS, ALGORITHMS
AND APPLICATIONS
Sanguthevar Rajasekaran and John Reif

HANDBOOK OF REAL-TIME AND EMBEDDED SYSTEMS
Insup Lee, Joseph Y-T. Leung, and Sang H. Son

HANDBOOK OF SCHEDULING: ALGORITHMS, MODELS, AND
PERFORMANCE ANALYSIS
Joseph Y.-T. Leung

HIGH PERFORMANCE COMPUTING IN REMOTE SENSING
Antonio J. Plaza and Chein-I Chang

HUMAN ACTIVITY RECOGNITION: USING WEARABLE SENSORS
AND SMARTPHONES
Miguel A. Labrador and Oscar D. Lara Yejas

INTRODUCTION TO NETWORK SECURITY
Douglas Jacobson

LOCATION-BASED INFORMATION SYSTEMS:
DEVELOPING REAL-TIME TRACKING APPLICATIONS
Miguel A. Labrador, Alfredo J. Pérez, and Pedro M. Wightman

METHODS IN ALGORITHMIC ANALYSIS
Vladimir A. Dobrushkin

MULTICORE COMPUTING: ALGORITHMS, ARCHITECTURES,
AND APPLICATIONS
Sanguthevar Rajasekaran, Lance Fiondella, Mohamed Ahmed,
and Reda A. Ammar

PERFORMANCE ANALYSIS OF QUEUING AND COMPUTER NETWORKS
G. R. Dattatreya

THE PRACTICAL HANDBOOK OF INTERNET COMPUTING
Munindar P. Singh

SCALABLE AND SECURE INTERNET SERVICES AND ARCHITECTURE
Cheng-Zhong Xu

SOFTWARE APPLICATION DEVELOPMENT: A VISUAL C++®, MFC,
AND STL TUTORIAL
Bud Fox, Zhang Wenzu, and Tan May Ling

SPECULATIVE EXECUTION IN HIGH PERFORMANCE COMPUTER
ARCHITECTURES
David Kaeli and Pen-Chung Yew

VEHICULAR NETWORKS: FROM THEORY TO PRACTICE
Stephan Olariu and Michele C. Weigle
Multicore
Computing
Algorithms, Architectures,
and Applications

Edited by
Sanguthevar Rajasekaran
Lance Fiondella
Mohamed Ahmed
Reda A. Ammar
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2014 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20130808

International Standard Book Number-13: 978-1-4398-5435-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
Dedication

To my teachers,

Esakki Rajan, P. S. Srinivasan, V. Krishnan, and John H. Reif

—Sanguthevar Rajasekaran

To my son,

Advika

—Lance Fiondella

To my wife,

Noha Nabawi

my parents, and my advisors

Professors Sanguthevar Rajasekaran and Reda Ammar

—Mohamed F. Ahmed

To my family,

Tahany Fergany, Rabab Ammar, Doaa Ammar, and Mohamed Ammar

—Reda A. Ammar

An indestructible and impeccable treasure to one is learning;
all the other things are not wealth.

Thiruvalluvar (circa 100 B.C.)


(Thirukkural; Section - Wealth; Chapter 40 - Education)
Contents

Preface xvii

Acknowledgements xxi

List of Contributing Editors xxiii

List of Contributing Authors xxv

1 Memory Hierarchy for Multicore and Many-Core Processors 1


Mohamed Zahran and Bushra Ahsan
1.1 Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Latency and Bandwidth . . . . . . . . . . . . . . . . . 5
1.1.2 Power Consumption . . . . . . . . . . . . . . . . . . . 6
1.2 Physical Memory . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Cache Hierarchy Organization . . . . . . . . . . . . . . . . . 8
1.3.1 Caches versus Cores . . . . . . . . . . . . . . . . . . . 9
1.3.1.1 Technological and usage factors . . . . . . . 9
1.3.1.2 Application-related factors . . . . . . . . . . 9
1.3.2 Private, Shared, and Cooperative Caching . . . . . . . 11
1.3.3 Nonuniform Cache Architecture (NUCA) . . . . . . . 13
1.4 Cache Hierarchy Sharing . . . . . . . . . . . . . . . . . . . . 16
1.4.1 At What Level to Share Caches? . . . . . . . . . . . . 17
1.4.2 Cache-Sharing Management . . . . . . . . . . . . . . . 18
1.4.2.1 Fairness . . . . . . . . . . . . . . . . . . . . . 18
1.4.2.2 Quality of Service (QoS) . . . . . . . . . . . 20
1.4.3 Configurable Caches . . . . . . . . . . . . . . . . . . . 21
1.5 Cache Hierarchy Optimization . . . . . . . . . . . . . . . . . 23
1.5.1 Multilevel Inclusion . . . . . . . . . . . . . . . . . . . 23
1.5.2 Global Placement . . . . . . . . . . . . . . . . . . . . . 25
1.6 Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.6.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.6.2 Protocols for Traditional Multiprocessors . . . . . . . 31
1.6.3 Protocols for Multicore Systems . . . . . . . . . . . . 32
1.7 Support for Memory Consistency Models . . . . . . . . . . . 36
1.8 Cache Hierarchy in Light of New Technologies . . . . . . . . 37
1.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 37


2 FSB: A Flexible Set-Balancing Strategy for Last-Level Caches 45
Mohammad Hammoud, Sangyeun Cho, and Rami Melhem
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.2 Motivation and Background . . . . . . . . . . . . . . . . . . 48
2.2.1 Baseline Architecture . . . . . . . . . . . . . . . . . . 48
2.2.2 A Caching Problem . . . . . . . . . . . . . . . . . . . 49
2.2.3 Dynamic Set-Balancing Cache and Inherent Shortcomings . . . . 49
2.2.4 Our Solution . . . . . . . . . . . . . . . . . . . . . . . 52
2.3 Flexible Set Balancing . . . . . . . . . . . . . . . . . . . . . . 54
2.3.1 Retention Limits . . . . . . . . . . . . . . . . . . . . . 54
2.3.2 Retention Policy . . . . . . . . . . . . . . . . . . . . . 55
2.3.3 Lookup Policy . . . . . . . . . . . . . . . . . . . . . . 57
2.3.4 FSB Cost . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . 58
2.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.2 Comparing FSB against Shared Baseline . . . . . . . . 59
2.4.3 Sensitivity to Different Pressure Functions . . . . . . . 62
2.4.4 Sensitivity to LPL and HPL . . . . . . . . . . . . . . . 63
2.4.5 Impact of Increasing Cache Size and Associativity . . 64
2.4.6 FSB versus Victim Caching . . . . . . . . . . . . . . . 66
2.4.7 FSB versus DSBC and V-WAY . . . . . . . . . . . . . 66
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . 69

3 The SPARC Processor Architecture 73


Simone Secchi, Antonino Tumeo, and Oreste Villa
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 The SPARC Instruction-Set Architecture . . . . . . . . . . . 75
3.2.1 Registers and Register Windowing . . . . . . . . . . . 76
3.3 Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3.1 MMU Requirements . . . . . . . . . . . . . . . . . . . 79
3.3.2 Memory Models . . . . . . . . . . . . . . . . . . . . . 79
3.3.3 The MEMBAR instruction . . . . . . . . . . . . . . . 81
3.4 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.5 The NIAGARA Processor Architecture . . . . . . . . . . . . 82
3.6 Core Microarchitecture . . . . . . . . . . . . . . . . . . . . . 84
3.7 Core Interconnection . . . . . . . . . . . . . . . . . . . . . . 86
3.8 Memory Subsystem . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.1 Cache-Coherence Protocol . . . . . . . . . . . . . . . . 87
3.8.1.1 Example 1 . . . . . . . . . . . . . . . . . . . 87
3.8.1.2 Example 2 . . . . . . . . . . . . . . . . . . . 88
3.9 Niagara Evolutions . . . . . . . . . . . . . . . . . . . . . . . 88

4 The Cilk and Cilk++ Programming Languages 91


Hans Vandierendonck
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 The Cilk Language . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.1 Spawning and Syncing . . . . . . . . . . . . . . . . . . 93
4.2.2 Receiving Return Values: Inlets . . . . . . . . . . . . . 94
4.2.3 Aborting Threads . . . . . . . . . . . . . . . . . . . . 95
4.2.4 The C Elision . . . . . . . . . . . . . . . . . . . . . . . 96
4.2.5 Cilk++ . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.1 The Cilk Model of Computation . . . . . . . . . . . . 97
4.3.2 Cactus Stacks . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.3 Scheduling by Work Stealing . . . . . . . . . . . . . . 99
4.3.4 Runtime Data Structures . . . . . . . . . . . . . . . . 100
4.3.5 Scheduling Algorithm . . . . . . . . . . . . . . . . . . 103
4.3.6 Program Code Specialization . . . . . . . . . . . . . . 105
4.3.7 Efficient Multi-way Fork . . . . . . . . . . . . . . . . . 105
4.4 Analyzing Parallelism in Cilk Programs . . . . . . . . . . . . 107
4.5 Hyperobjects . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.1 Reducers . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5.2 Implementation of Views . . . . . . . . . . . . . . . . 112
4.5.3 Holder Hyperobjects . . . . . . . . . . . . . . . . . . . 113
4.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.6.1 Further Reading . . . . . . . . . . . . . . . . . . . . . 115

5 Multithreading in the PLASMA Library 119


Jakub Kurzak, Piotr Luszczek, Asim YarKhan, Mathieu Faverge, Julien
Langou, Henricus Bouwmeester, and Jack Dongarra
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.1.1 PLASMA Design Principles . . . . . . . . . . . . . . . 120
5.1.2 PLASMA Software Stack . . . . . . . . . . . . . . . . 121
5.1.3 PLASMA Scheduling . . . . . . . . . . . . . . . . . . . 122
5.2 Multithreading in PLASMA . . . . . . . . . . . . . . . . . . 123
5.3 Dynamic Scheduling with QUARK . . . . . . . . . . . . . . 124
5.4 Parallel Composition . . . . . . . . . . . . . . . . . . . . . . 126
5.5 Task Aggregation . . . . . . . . . . . . . . . . . . . . . . . . 130
5.6 Nested Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 133
5.6.1 The Case of Partial Pivoting . . . . . . . . . . . . . . 133
5.6.2 Implementation Details of Recursive Panel Factorization . . . . 135
5.6.3 Data Partitioning . . . . . . . . . . . . . . . . . . . . . 135
5.6.4 Scalability Results of the Parallel Recursive Panel Kernel . . . . 138

5.6.5 Further Implementation Details and Optimization Techniques . . . . 138

6 Efficient Aho-Corasick String Matching on Emerging Multicore Architectures 143
Antonino Tumeo, Oreste Villa, Simone Secchi, and Daniel Chavarría-Miranda
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.3.1 Aho-Corasick String-Matching Algorithm . . . . . . . 148
6.3.2 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.4 Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . 152
6.4.1 GPU Algorithm Design . . . . . . . . . . . . . . . . . 155
6.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 158
6.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . 158
6.5.2 GPU Optimizations . . . . . . . . . . . . . . . . . . . 159
6.5.3 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

7 Sorting on a Graphics Processing Unit (GPU) 171


Shibdas Bandyopadhyay and Sartaj Sahni
7.1 Graphics Processing Units . . . . . . . . . . . . . . . . . . . 172
7.2 Sorting Numbers on GPUs . . . . . . . . . . . . . . . . . . . 175
7.2.1 SDK Radix Sort Algorithm . . . . . . . . . . . . . . . 176
7.2.1.1 Step 1: Sorting tiles . . . . . . . . . . . . . . 177
7.2.1.2 Step 2: Calculating histogram . . . . . . . . 179
7.2.1.3 Step 3: Prefix sum of histogram . . . . . . . 179
7.2.1.4 Step 4: Rearrangement . . . . . . . . . . . . 179
7.2.2 GPU Radix Sort (GRS) . . . . . . . . . . . . . . . . . 180
7.2.2.1 Step 1: Histogram and ranks . . . . . . . . . 181
7.2.2.2 Step 2: Prefix sum of tile histograms . . . . . 184
7.2.2.3 Step 3: Positioning numbers in a tile . . . . . 185
7.2.3 SRTS Radix Sort . . . . . . . . . . . . . . . . . . . . . 185
7.2.3.1 Step 1: Bottom-level reduce . . . . . . . . . . 187
7.2.3.2 Step 2: Top-level scan . . . . . . . . . . . . . 188
7.2.3.3 Step 3: Bottom-level scan . . . . . . . . . . . 188
7.2.4 GPU Sample Sort . . . . . . . . . . . . . . . . . . . . 188
7.2.4.1 Step 1: Splitter selection . . . . . . . . . . . 189
7.2.4.2 Step 2: Finding buckets . . . . . . . . . . . . 190
7.2.4.3 Step 3: Prefix sum . . . . . . . . . . . . . . . 190
7.2.4.4 Step 4: Placing elements into buckets . . . . 190
7.2.5 Warpsort . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.2.5.1 Step 1: Bitonic sort by warps . . . . . . . . . 191
7.2.5.2 Step 2: Bitonic merge by warps . . . . . . . . 192

7.2.5.3 Step 3: Splitting long sequences . . . . . . . 193


7.2.5.4 Step 4: Final merge by warps . . . . . . . . . 193
7.2.6 Comparison of Number-Sorting Algorithms . . . . . . 194
7.3 Sorting Records on GPUs . . . . . . . . . . . . . . . . . . . . 194
7.3.1 Record Layouts . . . . . . . . . . . . . . . . . . . . . . 194
7.3.2 High-Level Strategies for Sorting Records . . . . . . . 195
7.3.3 Sample Sort for Sorting Records . . . . . . . . . . . . 196
7.3.4 SRTS for Sorting Records . . . . . . . . . . . . . . . . 197
7.3.5 GRS for Sorting Records . . . . . . . . . . . . . . . . 198
7.3.6 Comparison of Record-Sorting Algorithms . . . . . . . 199
7.3.7 Runtimes for ByField Layout . . . . . . . . . . . . . . 199
7.3.8 Runtimes for Hybrid Layout . . . . . . . . . . . . . . 201
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

8 Scheduling DAG-Structured Computations 205


Yinglong Xia and Viktor K. Prasanna
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.4 Lock-Free Collaborative Scheduling . . . . . . . . . . . . . . 209
8.4.1 Components . . . . . . . . . . . . . . . . . . . . . . . 210
8.4.2 An Implementation of the Collaborative Scheduler . . 212
8.4.3 Lock-Free Data Structures . . . . . . . . . . . . . . . . 213
8.4.4 Correctness . . . . . . . . . . . . . . . . . . . . . . . . 214
8.4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . 216
8.4.5.1 Baseline schedulers . . . . . . . . . . . . . . 216
8.4.5.2 Data sets and task types . . . . . . . . . . . 218
8.4.5.3 Experimental results . . . . . . . . . . . . . . 219
8.5 Hierarchical Scheduling with Dynamic Thread Grouping . . 223
8.5.1 Organization . . . . . . . . . . . . . . . . . . . . . . . 223
8.5.2 Dynamic Thread Grouping . . . . . . . . . . . . . . . 224
8.5.3 Hierarchical Scheduling . . . . . . . . . . . . . . . . . 225
8.5.4 Scheduling Algorithm and Analysis . . . . . . . . . . . 226
8.5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . 230
8.5.5.1 Baseline schedulers . . . . . . . . . . . . . . 230
8.5.5.2 Data sets and data layout . . . . . . . . . . . 231
8.5.5.3 Experimental results . . . . . . . . . . . . . . 231
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

9 Evaluating Multicore Processors and Accelerators for Dense Numerical Computations 241
Seunghwa Kang, Nitin Arora, Aashay Shringarpure, Richard W. Vuduc,
and David A. Bader
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.2 Interarchitectural Design Trade-offs . . . . . . . . . . . . . . 245

9.2.1 Requirements for Parallelism . . . . . . . . . . . . . . 245


9.2.2 Computation Units . . . . . . . . . . . . . . . . . . . . 247
9.2.3 Start-up Overhead . . . . . . . . . . . . . . . . . . . . 248
9.2.4 Memory Latency Hiding . . . . . . . . . . . . . . . . . 248
9.2.5 Control over On-Chip Memory . . . . . . . . . . . . . 249
9.2.6 Main-Memory Access Mechanisms and Bandwidth Utilization . . . . 249
9.2.7 Ideal Software Implementations . . . . . . . . . . . . . 250
9.3 Descriptions and Qualitative Analysis of Computational Statistics Kernels . . . . 251
9.3.1 Conventional Sequential Code . . . . . . . . . . . . . . 252
9.3.2 Basic Algorithmic Analysis . . . . . . . . . . . . . . . 253
9.4 Baseline Architecture-Specific Implementations for the Computational Statistics Kernels . . . . 254
9.4.1 Intel Harpertown (2P) and AMD Barcelona (4P) Multicore Implementations . . . . 254
9.4.2 STI Cell/B.E. (2P) Implementation . . . . . . . . . . 255
9.4.3 NVIDIA Tesla C1060 Implementation . . . . . . . . . 255
9.4.4 Quantitative Comparison of Implementation Costs . . 256
9.5 Experimental Results for the Computational Statistics Kernels 257
9.5.1 Kernel1 . . . . . . . . . . . . . . . . . . . . . . . . . . 257
9.5.2 Kernel2 . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9.5.3 Kernel3 . . . . . . . . . . . . . . . . . . . . . . . . . . 261
9.6 Descriptions and Qualitative Analysis of Direct n-Body Kernels 263
9.6.1 Characteristics, Costs, and Parallelization . . . . . . . 263
9.7 Direct n-Body Implementations . . . . . . . . . . . . . . . . 265
9.7.1 x86 Implementations . . . . . . . . . . . . . . . . . . . 265
9.7.2 PowerXCell8i Implementation . . . . . . . . . . . . . . 266
9.7.2.1 Parallelization strategy . . . . . . . . . . . . 266
9.7.2.2 Data organization and vectorization . . . . . 266
9.7.2.3 Double buffering the DMA . . . . . . . . . . 267
9.7.2.4 SPU pipelines . . . . . . . . . . . . . . . . . 267
9.7.3 NVIDIA GPU Implementation . . . . . . . . . . . . . 268
9.7.3.1 Parallelization strategy . . . . . . . . . . . . 268
9.7.3.2 Optimizing the implementation . . . . . . . . 268
9.8 Experimental Results and Discussion for the Direct n-Body
Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 270
9.8.1 Performance . . . . . . . . . . . . . . . . . . . . . . . 270
9.8.1.1 CPU performance . . . . . . . . . . . . . . . 270
9.8.1.2 GPU performance . . . . . . . . . . . . . . . 271
9.8.1.3 PowerXCell8i performance . . . . . . . . . . 274
9.8.1.4 Overall performance comparison . . . . . . . 275
9.8.2 Productivity and Ease of Implementation . . . . . . . 277
9.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

10 Sorting on the Cell Broadband Engine 285


Shibdas Bandyopadhyay, Dolly Sharma, Reda A. Ammar, Sanguthevar Rajasekaran, and Sartaj Sahni
10.1 The Cell Broadband Engine . . . . . . . . . . . . . . . . . . 286
10.2 High-Level Strategies for Sorting . . . . . . . . . . . . . . . . 286
10.3 SPU Vector and Memory Operations . . . . . . . . . . . . . 288
10.4 Sorting Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 291
10.4.1 Single SPU Sort . . . . . . . . . . . . . . . . . . . . . 291
10.4.2 Shellsort Variants . . . . . . . . . . . . . . . . . . . . . 291
10.4.2.1 Comb and AA sort . . . . . . . . . . . . . . 292
10.4.2.2 Brick sort . . . . . . . . . . . . . . . . . . . . 294
10.4.2.3 Shaker sort . . . . . . . . . . . . . . . . . . . 296
10.4.3 Merge Sort . . . . . . . . . . . . . . . . . . . . . . . . 296
10.4.3.1 Merge Sort Phase 1—Transpose . . . . . . . 297
10.4.3.2 Merge Sort Phase 2—Sort columns . . . . . . 298
10.4.3.3 Merge Sort Phase 3—Merge pairs of columns 299
10.4.3.4 Merge Sort Phase 4—Final merge . . . . . . 302
10.4.4 Comparison of Single-SPU Sorting Algorithms . . . . 304
10.4.5 Hierarchical Sort . . . . . . . . . . . . . . . . . . . . . 305
10.4.6 Master-Slave Sort . . . . . . . . . . . . . . . . . . . . 308
10.4.6.1 Algorithm SQMA . . . . . . . . . . . . . . . 308
10.4.6.2 Random Input Integer Sorting with Single
Sampling & Quick Sort (RISSSQS) . . . . . 309
10.4.6.3 Random Input Integer Sorting with Single
Sampling using Bucket Sort (RISSSBS) . . . 310
10.4.6.4 Algorithm RSSSQS . . . . . . . . . . . . . . 311
10.4.6.5 Randomized Sorting with Double Sampling
using Quick Sort (RSDSQS) . . . . . . . . . 311
10.4.6.6 Randomized Sorting with Double Sampling
using Merge Sort (SDSMS) . . . . . . . . . . 312
10.4.6.7 Evaluation of SQMA, RISSSQS, RISSSBS,
RSSSQS, RSDSQS, and SDSMS . . . . . . . 312
10.4.6.8 Results . . . . . . . . . . . . . . . . . . . . . 313
10.4.6.9 Analysis . . . . . . . . . . . . . . . . . . . . 326
10.4.6.10 Conclusion . . . . . . . . . . . . . . . . . . . 327
10.5 Sorting Records . . . . . . . . . . . . . . . . . . . . . . . . . 328
10.5.1 Record Layout . . . . . . . . . . . . . . . . . . . . . . 328
10.5.2 High-Level Strategies for Sorting Records . . . . . . . 328
10.5.3 Single-SPU Record Sorting . . . . . . . . . . . . . . . 329
10.5.4 Hierarchical Sorting for Records . . . . . . . . . . . . 330
10.5.4.1 4-way merge for records . . . . . . . . . . . . 330
10.5.4.2 Scalar 4-way merge . . . . . . . . . . . . . . 332
10.5.4.3 SIMD 4-way merge . . . . . . . . . . . . . . 333
10.5.5 Comparison of Record-Sorting Algorithms . . . . . . . 334
10.5.5.1 Runtimes for ByField layout . . . . . . . . . 335

10.5.5.2 Runtimes for ByRecord layout . . . . . . . . 338


10.5.5.3 Cross-layout comparison . . . . . . . . . . . 340
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

11 GPU Matrix Multiplication 345


Junjie Li, Sanjay Ranka, and Sartaj Sahni
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
11.2 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . 347
11.3 Programming Model . . . . . . . . . . . . . . . . . . . . . . . 349
11.4 Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
11.5 Single-Core Matrix Multiply . . . . . . . . . . . . . . . . . . 354
11.6 Multicore Matrix Multiply . . . . . . . . . . . . . . . . . . . 355
11.7 GPU Matrix Multiply . . . . . . . . . . . . . . . . . . . . . . 357
11.7.1 A Thread Computes a 1 × 1 Submatrix of C . . . . . 358
11.7.1.1 Kernel code . . . . . . . . . . . . . . . . . . . 358
11.7.1.2 Host code . . . . . . . . . . . . . . . . . . . . 359
11.7.1.3 Tile/block dimensions . . . . . . . . . . . . . 360
11.7.1.4 Runtime . . . . . . . . . . . . . . . . . . . . 361
11.7.1.5 Number of device-memory accesses . . . . . 362
11.7.2 A Thread Computes a 1 × 2 Submatrix of C . . . . . 365
11.7.2.1 Kernel code . . . . . . . . . . . . . . . . . . . 365
11.7.2.2 Number of device-memory accesses . . . . . 367
11.7.2.3 Runtime . . . . . . . . . . . . . . . . . . . . 367
11.7.3 A Thread Computes a 1 × 4 Submatrix of C . . . . . 368
11.7.3.1 Kernel code . . . . . . . . . . . . . . . . . . . 368
11.7.3.2 Runtime . . . . . . . . . . . . . . . . . . . . 370
11.7.3.3 Number of device-memory accesses . . . . . 370
11.7.4 A Thread Computes a 1 × 1 Submatrix of C Using
Shared Memory . . . . . . . . . . . . . . . . . . . . . . 371
11.7.4.1 First kernel code and analysis . . . . . . . . 371
11.7.4.2 Improved kernel code . . . . . . . . . . . . . 373
11.7.5 A Thread Computes a 16 × 1 Submatrix of C Using
Shared Memory . . . . . . . . . . . . . . . . . . . . . . 376
11.7.5.1 First kernel code and analysis . . . . . . . . 376
11.7.5.2 Second kernel code . . . . . . . . . . . . . . . 379
11.7.5.3 Final kernel code . . . . . . . . . . . . . . . . 379
11.8 A Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 382
11.8.1 GPU Kernels . . . . . . . . . . . . . . . . . . . . . . . 382
11.8.2 Comparison with Single-Core and Quadcore Code . . 386
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . 390

12 Backprojection Algorithms for Multicore and GPU Architectures 391
William Chapman, Sanjay Ranka, Sartaj Sahni, Mark Schmalz, Linda
Moore, Uttam Majumder, and Bracy Elton
12.1 Summary of Backprojection . . . . . . . . . . . . . . . . . . 392
12.2 Partitioning Backprojection for Implementation on a GPU . 394
12.3 Single-Core Backprojection . . . . . . . . . . . . . . . . . . . 395
12.3.1 Single-Core Cache-Aware Backprojection . . . . . . . 396
12.3.2 Multicore Cache-Aware Backprojection . . . . . . . . 398
12.4 GPU Backprojection . . . . . . . . . . . . . . . . . . . . . . 398
12.4.1 Tiled Partitioning . . . . . . . . . . . . . . . . . . . . 398
12.4.2 Overlapping Host–Device Communication with Computation . . . . 404
12.4.3 Improving Register Usage . . . . . . . . . . . . . . . . 406
12.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 410

Index 413
Preface

We live in an era of big data. Every area of science and engineering has
to process voluminous data sets. Numerous problems in such critical areas
as computational biology are intractable, and exact (or even approximation)
algorithms for solving them take time that is exponential in some of the
underlying parameters. As a result, parallel computing has become inevitable.
Parallel computing has been made very affordable with the advent of multicore
architectures such as the Cell and Tesla. On the other hand, programming
these machines is considerably more difficult owing to the idiosyncrasies of
these architectures. This volume addresses different facets of multicore
computing and offers insights into them. The chapters in this handbook will
help readers understand the intricacies of these architectures and prepare
them to design efficient multicore algorithms. Topics covered span
architectures, algorithms, and applications.

Chapter 1 covers the memory hierarchy for multicore and many-core processors.
The performance of a computer system depends on both the memory and the
processor. In the beginning, the speed gap between processor and memory was
narrow; the honeymoon ended when the number of transistors per chip began
increasing almost exponentially (the famous Moore's law). That transistor
budget translated into processor performance, at least until about a decade
ago, while memory-system performance improved at a much slower pace. When
designs shifted from a single core to multicore, the memory system faced even
more challenges. The challenges facing memory-system designers, how to deal
with them, and the future of this field are some of the issues discussed in
this chapter.

In Chapter 2, the authors present Flexible Set Balancing (FSB), a caching
strategy that exploits the large asymmetry in the usage of cache sets on
tiled chip multiprocessors (CMPs). FSB retains cache lines evicted from
highly pressured sets in underutilized sets, using a very flexible
many-from-many sharing scheme, so as to satisfy far-flung reuses. Simulation
results show that FSB reduces the L2 miss rate by an average of 36.6% for the
tested benchmarks, which translates to an overall performance improvement of
13%. In addition, results show that FSB compares favorably with three closely
related schemes while incurring only minor storage, area, and energy
overheads.
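
To make the retention idea concrete, the toy sketch below retains the victim
of a pressured set in the least-pressured set of a small software cache, and
lets lookups consult both the home set and the retention set. This is only a
loose software analogue written for this preface; the class name, the
miss-counter pressure heuristic, and the eviction policy are invented for
illustration and are not the authors' FSB hardware design.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Toy set-associative cache that retains victims of pressured sets in
// underutilized sets (a loose software analogue of set balancing; the
// class and heuristics here are illustrative, not the FSB hardware).
class SetBalancingCache {
public:
    SetBalancingCache(size_t numSets, size_t ways)
        : sets_(numSets), misses_(numSets, 0), ways_(ways) {}

    // Returns true on a hit. On a miss, the line is installed in its home
    // set; the evicted victim may be retained in a lightly used set.
    bool access(uint64_t line) {
        size_t home = line % sets_.size();
        if (find(home, line)) return true;
        auto it = retainedIn_.find(line);       // check any retention set
        if (it != retainedIn_.end() && find(it->second, line)) return true;
        ++misses_[home];                        // crude pressure estimate
        insert(home, line);
        return false;
    }

private:
    bool find(size_t s, uint64_t line) {
        auto &set = sets_[s];
        for (auto it = set.begin(); it != set.end(); ++it)
            if (*it == line) { set.splice(set.begin(), set, it); return true; }
        return false;
    }

    void insert(size_t s, uint64_t line) {
        auto &set = sets_[s];
        set.push_front(line);                   // MRU position
        if (set.size() <= ways_) return;
        uint64_t victim = set.back();           // LRU victim of pressured set
        set.pop_back();
        // Retain the victim in the set with the fewest misses so far.
        size_t target = 0;
        for (size_t i = 1; i < sets_.size(); ++i)
            if (misses_[i] < misses_[target]) target = i;
        if (target != s && sets_[target].size() < ways_) {
            sets_[target].push_front(victim);
            retainedIn_[victim] = target;       // stale entries are harmless:
        }                                       // a later find() simply misses
    }

    std::vector<std::list<uint64_t>> sets_;     // per-set LRU stacks
    std::vector<uint64_t> misses_;              // per-set pressure counters
    std::unordered_map<uint64_t, size_t> retainedIn_;
    size_t ways_;
};
```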


In Chapter 3, the authors describe the main features of the latest SPARC
architecture specification, SPARCv9, and try to motivate the different design
decisions behind them. They also look at each architectural feature in the
context of a multicore processor implementation of the architecture. After
describing the SPARC architecture, they present in detail one of its most suc-
cessful implementations, the Sun UltraSPARC T1 (also known as Niagara)
multicore processor.

Chapter 4 presents the Cilk and Cilk++ programming languages, which


raise the level of abstraction of writing parallel programs. Organized around
the concept of tasks, Cilk allows the programmer to reason about what set
of tasks may execute in parallel. The Cilk runtime system is responsible for
mapping tasks to processors. This chapter presents the Cilk language and
elucidates the design of the Cilk runtime scheduler. Armed with an under-
standing of how the scheduler works, this chapter then continues to explain
how to analyze the performance of Cilk programs. Finally, it introduces
hyperobjects, a powerful abstraction of common computational patterns.
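
For readers unfamiliar with the fork-join style that Cilk supports, the
sketch below expresses the classic recursive Fibonacci example in standard
C++; in Cilk, the two recursive calls would be marked with spawn (cilk_spawn
in Cilk++) and joined with sync (cilk_sync), and the work-stealing runtime
would map the resulting tasks to processors far more cheaply than std::async
does here. The serial cutoff is an illustrative tuning choice.

```cpp
#include <cstdio>
#include <future>

// Fork-join Fibonacci in standard C++. In Cilk, the recursive calls would
// be written with cilk_spawn and joined with cilk_sync; the runtime's
// work-stealing scheduler then load-balances the resulting task tree.
long fib(long n) {
    if (n < 2) return n;
    if (n < 20) return fib(n - 1) + fib(n - 2);  // serial cutoff (tuning choice)
    // Fork: evaluate fib(n-1) as an asynchronous task.
    std::future<long> x = std::async(std::launch::async, fib, n - 1);
    long y = fib(n - 2);   // keep working in the current thread
    return x.get() + y;    // join: wait for the spawned task
}

int main() {
    std::printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```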

Chapter 5 introduces Parallel Linear Algebra Software for Multicore Ar-


chitectures (PLASMA), a numerical software library for solving problems in
dense linear algebra on systems of multicore processors and multisocket sys-
tems of multicore processors. PLASMA relies on a variety of multithreading
mechanisms, including static and dynamic thread scheduling. PLASMA’s su-
perscalar scheduler, QUARK, offers powerful tools for parallel task composi-
tion, such as support for nested parallelism and provisions for task aggregation.
The dynamic nature of PLASMA's operation exposes its user to an array of new
capabilities, such as an asynchronous mode of execution, in which library
function calls can be invoked in a non-blocking fashion.
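
The sketch below illustrates what such a non-blocking calling convention
looks like from the caller's side: work is submitted and the call returns
immediately, and a later synchronization point waits for completion. The
`submit`/`sync` names and the one-future-per-call design are invented for
this example; they are not the PLASMA or QUARK API.

```cpp
#include <functional>
#include <future>
#include <vector>

// Minimal illustration of an asynchronous, non-blocking calling style:
// each "library call" is submitted and returns at once; sync() waits for
// all outstanding work. This mimics the usage pattern only, not the
// actual PLASMA/QUARK interfaces.
class AsyncLibrary {
public:
    void submit(std::function<void()> task) {
        pending_.push_back(std::async(std::launch::async, std::move(task)));
    }
    void sync() {                       // block until all submitted work is done
        for (auto &f : pending_) f.get();
        pending_.clear();
    }
private:
    std::vector<std::future<void>> pending_;
};

int main() {
    AsyncLibrary lib;
    std::vector<double> a(1000, 1.0), b(1000, 2.0);
    // Two independent "factorization-like" calls proceed concurrently.
    lib.submit([&] { for (auto &x : a) x *= 2.0; });
    lib.submit([&] { for (auto &x : b) x += 1.0; });
    // ... the caller may do unrelated work here ...
    lib.sync();                         // results in a and b are now ready
    return 0;
}
```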

Chapter 6 discusses Aho-Corasick, an exact multipattern string-matching
algorithm that performs the search in time linearly proportional to the
length of the input text, independent of the pattern-set size. In practice,
however, software implementations suffer significant performance variability
with large pattern sets because of unpredictable memory latencies and caching
effects. This chapter presents a study of the behavior of the Aho-Corasick
string-matching algorithm on a set of modern multicore and multithreaded
architectures. The authors discuss the implementation and the performance of
the algorithm on modern x86 multicores, multithreaded Niagara 2 processors,
and GPUs from the previous and current generations.
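
For reference, the sketch below is a compact, textbook-style Aho-Corasick
automaton: patterns are inserted into a goto trie, failure links are computed
by breadth-first search, and a single pass over the text reports every match.
It assumes a small lowercase alphabet for brevity; the memory-layout concerns
that dominate Chapter 6's performance study are deliberately ignored here.

```cpp
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Aho-Corasick over a lowercase alphabet: build the goto trie and failure
// links, then scan the text once, reporting every pattern occurrence.
struct AhoCorasick {
    static const int A = 26;
    struct Node {
        int next[A];              // goto transitions (-1 = absent while building)
        int fail = 0;             // failure link
        std::vector<int> out;     // ids of patterns ending at this state
        Node() { for (int i = 0; i < A; ++i) next[i] = -1; }
    };
    std::vector<Node> t;
    AhoCorasick() : t(1) {}       // state 0 is the root

    void addPattern(const std::string &p, int id) {
        int v = 0;
        for (char ch : p) {
            int c = ch - 'a';
            if (t[v].next[c] < 0) { t[v].next[c] = (int)t.size(); t.emplace_back(); }
            v = t[v].next[c];
        }
        t[v].out.push_back(id);
    }

    void build() {                // BFS turns the trie into a full automaton
        std::queue<int> q;
        for (int c = 0; c < A; ++c) {
            int u = t[0].next[c];
            if (u < 0) t[0].next[c] = 0; else { t[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int c = 0; c < A; ++c) {
                int u = t[v].next[c];
                if (u < 0) { t[v].next[c] = t[t[v].fail].next[c]; continue; }
                t[u].fail = t[t[v].fail].next[c];
                // Inherit matches that end at the failure state.
                for (int id : t[t[u].fail].out) t[u].out.push_back(id);
                q.push(u);
            }
        }
    }

    void search(const std::string &text) {      // assumes lowercase input
        int v = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            v = t[v].next[text[i] - 'a'];
            for (int id : t[v].out)
                std::printf("pattern %d ends at position %zu\n", id, i);
        }
    }
};

int main() {
    AhoCorasick ac;
    ac.addPattern("he", 0); ac.addPattern("she", 1); ac.addPattern("hers", 2);
    ac.build();
    ac.search("ushers");     // finds "she", "he", and "hers"
    return 0;
}
```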

In Chapter 7, the authors first describe the architecture of the NVIDIA Tesla
GPU. They then describe some of the principles for designing efficient
algorithms for GPUs. These principles are illustrated using recent parallel
algorithms for sorting numbers on a GPU, and these number-sorting algorithms
are then extended to sort large records. The authors also describe efficient
strategies for moving records within GPU memory under the various layouts of
a record in memory. Lastly, experimental results comparing the performance of
these algorithms for sorting records are presented.
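
Several of the GPU number-sorting algorithms covered (the SDK radix sort,
GRS, and SRTS) share a per-digit skeleton: build a histogram of digit values,
take a prefix sum of the histogram, then scatter keys stably to their final
positions. The sequential sketch below shows that skeleton for 8-bit digits
of 32-bit keys; on a GPU, each of the three steps is itself parallelized
across tiles, as the chapter details.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One LSD radix-sort pass per 8-bit digit: histogram, exclusive prefix
// sum, then a stable scatter. The GPU variants parallelize each of these
// steps across thread blocks.
void radixSortU32(std::vector<uint32_t> &a) {
    std::vector<uint32_t> tmp(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t count[257] = {0};
        for (uint32_t x : a) ++count[((x >> shift) & 0xFF) + 1];  // histogram
        for (int d = 1; d <= 256; ++d) count[d] += count[d - 1];  // prefix sum
        for (uint32_t x : a)                                      // stable scatter
            tmp[count[(x >> shift) & 0xFF]++] = x;
        a.swap(tmp);
    }
}

int main() {
    std::vector<uint32_t> a = {170, 45, 75, 90, 802, 24, 2, 66};
    radixSortU32(a);
    for (uint32_t x : a) std::printf("%u ", x);
    std::printf("\n");
    return 0;
}
```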

Chapter 8 discusses scheduling Directed Acyclic Graphs (DAGs) onto
multi/many-core processors, which remains a fundamental challenge in parallel
computing. The chapter uses exact inference as a running example of
scheduling techniques on multi/many-core processors. The authors introduce a
modularized scheduling method for general-purpose multicore processors and
develop lock-free data structures that reduce the overhead due to contention.
They then extend the scheduling method to many-core processors using dynamic
thread grouping, which dynamically adjusts the number of threads used for
scheduling and task execution; because it adapts to the input task graph, it
improves overall performance.
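
A minimal sketch of the underlying idea, dependency counting, appears below:
a task becomes ready when its last predecessor finishes, and worker threads
repeatedly pull ready tasks. The mutex-guarded ready queue here is exactly
the kind of contention point the chapter's lock-free structures are designed
to remove; the `Dag`/`runDag` names are invented for illustration.

```cpp
#include <atomic>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Skeleton of DAG scheduling by dependency counting: a task becomes ready
// when its in-degree drops to zero. This sketch favors brevity over
// scalability; a real scheduler replaces the locked queue.
struct Dag {
    std::vector<std::function<void()>> work;   // task bodies
    std::vector<std::vector<int>> succ;        // edges: task -> successors
    std::vector<std::atomic<int>> indeg;       // remaining predecessors
};

void runDag(Dag &g, int numThreads) {
    std::mutex m; std::queue<int> ready;
    std::atomic<int> remaining{(int)g.work.size()};
    for (size_t i = 0; i < g.work.size(); ++i)
        if (g.indeg[i].load() == 0) ready.push((int)i);

    auto worker = [&] {
        while (remaining.load() > 0) {
            int tsk = -1;
            { std::lock_guard<std::mutex> lk(m);
              if (!ready.empty()) { tsk = ready.front(); ready.pop(); } }
            if (tsk < 0) { std::this_thread::yield(); continue; }
            g.work[tsk]();                              // execute the task
            for (int s : g.succ[tsk])                   // release successors
                if (g.indeg[s].fetch_sub(1) == 1) {
                    std::lock_guard<std::mutex> lk(m); ready.push(s);
                }
            remaining.fetch_sub(1);
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < numThreads; ++i) pool.emplace_back(worker);
    for (auto &th : pool) th.join();
}

int main() {
    Dag g;
    g.work = { []{ std::puts("A"); }, []{ std::puts("B"); },
               []{ std::puts("C"); }, []{ std::puts("D"); } };
    g.succ = { {1, 2}, {3}, {3}, {} };               // A -> B, C;  B, C -> D
    g.indeg = std::vector<std::atomic<int>>(4);
    g.indeg[1] = 1; g.indeg[2] = 1; g.indeg[3] = 2;  // A starts ready
    runDag(g, 2);
    return 0;
}
```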

Chapter 9 evaluates design trade-offs among Intel and AMD multicore
processors, the IBM Cell Broadband Engine, and NVIDIA GPUs, and their impact
on dense numerical computations (kernels from computational statistics and
the direct n-body problem). The chapter compares the core architectures and
memory subsystems of these platforms; illustrates the software implementation
process on each platform; measures and analyzes the performance, coding
complexity, and energy efficiency of each implementation; and discusses the
impact of different architectural design choices on each implementation.
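
As an indication of what the n-body benchmark computes, the sketch below
shows the direct O(n^2) force-accumulation loop, written here for
gravitational interactions with Plummer softening, a common formulation; the
structure and names are illustrative rather than taken from the chapter. It
is this doubly nested loop that each platform vectorizes and parallelizes in
its own way.

```cpp
#include <cmath>
#include <vector>

struct Body { float x, y, z, mass; };

// Direct O(n^2) n-body force evaluation with softening eps: every pair of
// bodies interacts (the i == j term contributes zero). Platforms typically
// parallelize the outer loop and vectorize the inner one.
void accumulateForces(const std::vector<Body> &b,
                      std::vector<float> &fx, std::vector<float> &fy,
                      std::vector<float> &fz, float G, float eps) {
    const size_t n = b.size();           // fx, fy, fz must have size n
    for (size_t i = 0; i < n; ++i) {
        float ax = 0, ay = 0, az = 0;
        for (size_t j = 0; j < n; ++j) {
            float dx = b[j].x - b[i].x, dy = b[j].y - b[i].y,
                  dz = b[j].z - b[i].z;
            float r2  = dx * dx + dy * dy + dz * dz + eps * eps;
            float inv = 1.0f / std::sqrt(r2);
            float s   = G * b[j].mass * inv * inv * inv;   // G*m_j / r^3
            ax += s * dx; ay += s * dy; az += s * dz;
        }
        fx[i] = b[i].mass * ax; fy[i] = b[i].mass * ay; fz[i] = b[i].mass * az;
    }
}
```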

In Chapter 10, the authors look at designing algorithms for the Cell
Broadband Engine, a heterogeneous multicore processor on a single chip.
First, they describe the architecture of the Cell processor. They then
describe the opportunities and challenges associated with programming the
Cell, illustrated with different parallel algorithms for sorting numbers.
Later, they extend these algorithms to sort large records; this latter
discussion illustrates how to hide the memory latency associated with moving
large records. The authors end the chapter by comparing different algorithms
for sorting records stored using different layouts in memory.
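
The standard latency-hiding device on the Cell is double buffering: while the
SPU processes the chunk held in one local-store buffer, a DMA transfer fills
the other. The portable C++ sketch below mimics that overlap, with an
asynchronous copy standing in for the DMA; the chunked interface and the
trivial per-element computation are invented for illustration.

```cpp
#include <algorithm>
#include <cstring>
#include <future>
#include <vector>

// Double buffering: overlap "transfer" of chunk k+1 with processing of
// chunk k. On a Cell SPU the copy would be an asynchronous DMA from main
// memory into the local store; std::async stands in for it here.
// dst must be at least as large as src.
void processAll(const std::vector<float> &src, std::vector<float> &dst,
                size_t chunk) {
    std::vector<float> buf[2] = {std::vector<float>(chunk),
                                 std::vector<float>(chunk)};
    auto fetch = [&](size_t off, std::vector<float> &b) {
        size_t n = std::min(chunk, src.size() - off);
        std::memcpy(b.data(), src.data() + off, n * sizeof(float));
        return n;
    };
    // Prime the pipeline: start fetching the first chunk.
    std::future<size_t> inflight =
        std::async(std::launch::async, fetch, size_t{0}, std::ref(buf[0]));
    for (size_t off = 0, cur = 0; off < src.size(); cur ^= 1) {
        size_t n = inflight.get();                  // wait for current chunk
        size_t nextOff = off + n;
        if (nextOff < src.size())                   // start fetching next chunk
            inflight = std::async(std::launch::async, fetch, nextOff,
                                  std::ref(buf[cur ^ 1]));
        for (size_t i = 0; i < n; ++i)              // process current chunk
            dst[off + i] = buf[cur][i] * 2.0f;      // stand-in computation
        off = nextOff;
    }
}
```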

Chapter 11 begins by reviewing the architecture and programming model of the
NVIDIA Tesla GPU. The authors then develop an efficient matrix-multiplication
algorithm for this GPU through a series of intermediate algorithms, beginning
with a straightforward GPU implementation of the single-core CPU algorithm.
Extensive experimental results show the impact of the various optimization
strategies (e.g., tiling, padding to eliminate shared-memory bank conflicts,
and coalesced I/O from/to global memory) and demonstrate that the most
efficient GPU algorithm for matrix multiplication is three orders of
magnitude faster than the classical single-core algorithm.
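
The starting and ending points of that progression can be seen on the CPU
side in the sketch below: the classical triple loop, and a cache-blocked
(tiled) variant that reuses each loaded element many times, which is the same
reuse argument the GPU kernels make with shared memory. The tile size is an
illustrative tuning choice.

```cpp
#include <algorithm>

// C = A * B for n x n row-major matrices: the classical triple loop and a
// tiled variant. Tiling keeps a T x T working set resident in cache (or,
// on a GPU, in shared memory) so each loaded element is reused.
void matmulNaive(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k) s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

void matmulTiled(const float *A, const float *B, float *C, int n) {
    const int T = 32;                              // tile edge (tuning choice)
    for (int i = 0; i < n * n; ++i) C[i] = 0.0f;
    for (int ii = 0; ii < n; ii += T)              // loop over tiles
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < std::min(ii + T, n); ++i)
                    for (int k = kk; k < std::min(kk + T, n); ++k) {
                        float a = A[i * n + k];    // reused across the j loop
                        for (int j = jj; j < std::min(jj + T, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```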

Chapter 12 addresses Backprojection, an algorithm that generates images from
Synthetic Aperture Radar (SAR) data. SAR data is collected by a radar device
that moves around an area of interest, transmitting pulses and collecting the
responses as a function of time. Backprojection produces each pixel of the
output image by independently determining the contribution of every pulse,
yielding high-quality imagery at the cost of significant data movement and
computation. These costs can be mitigated through the use of Graphics
Processing Units, as Backprojection is easily decomposed along its input and
output dimensions.
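
In outline, Backprojection accumulates, for every output pixel, the pulse
sample whose round-trip delay matches the pixel-to-antenna range. The heavily
simplified scalar sketch below keeps only that data-access pattern; real SAR
backprojection adds phase correction and interpolation, and the data layout
and names here are invented for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Pulse {
    float antX, antY, antZ;            // antenna position when pulse fired
    std::vector<float> samples;        // echo magnitude sampled in fast time
};

// Simplified backprojection: each output pixel independently sums, over
// all pulses, the sample at the bin corresponding to its round-trip range.
// Pixels are independent, which is what makes the GPU decomposition easy.
void backproject(const std::vector<Pulse> &pulses, std::vector<float> &image,
                 int width, int height, float pixelSpacing,
                 float rangeBinSize, float nearRange) {
    for (int py = 0; py < height; ++py)
        for (int px = 0; px < width; ++px) {
            float x = px * pixelSpacing, y = py * pixelSpacing;
            float acc = 0.0f;
            for (const Pulse &p : pulses) {        // every pulse contributes
                float dx = x - p.antX, dy = y - p.antY, dz = -p.antZ;
                float range = std::sqrt(dx * dx + dy * dy + dz * dz);
                std::ptrdiff_t bin =
                    (std::ptrdiff_t)((range - nearRange) / rangeBinSize);
                if (bin >= 0 && (size_t)bin < p.samples.size())
                    acc += p.samples[bin];
            }
            image[(size_t)py * width + px] = acc;
        }
}
```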
Acknowledgements

We are very thankful to the authors for having contributed their chapters in
a timely manner. We also thank the staff of Chapman & Hall/CRC. In addi-
tion, we gratefully acknowledge the partial support from the National Science
Foundation (CCF 0829916) and the National Institutes of Health (NIH R01-
LM010101).

Sanguthevar Rajasekaran
Lance Fiondella
Mohamed F. Ahmed
Reda A. Ammar

List of Contributing Editors

Sanguthevar Rajasekaran received his M.E. degree in Automation from


the Indian Institute of Science (Bangalore) in 1983, and his Ph.D. degree
in Computer Science from Harvard University in 1988. Currently, he is the
UTC Chair Professor of Computer Science and Engineering at the University
of Connecticut and the Director of Booth Engineering Center for Advanced
Technologies (BECAT). Before joining UConn, he served as a faculty mem-
ber in the CISE Department of the University of Florida and in the CIS
Department of the University of Pennsylvania. During 2000–2002 he was the
Chief Scientist for Arcot Systems. His research interests include Bioinformat-
ics, Parallel Algorithms, Data Mining, Randomized Computing, Computer
Simulations, and Combinatorial Optimization. He has published over 250 re-
search articles in journals and conferences. He has coauthored two texts on
algorithms and coedited five books on algorithms and related topics. He is a
Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the
American Association for the Advancement of Science (AAAS). He is also an
elected member of the Connecticut Academy of Science and Engineering.

Lance Fiondella received a B.S. in Computer Science from Eastern Con-


necticut State University and his M.S. and Ph.D. degrees in Computer Sci-
ence and Engineering from the University of Connecticut. He is presently an
assistant professor in the Department of Electrical and Computer Engineering
at the University of Massachusetts Dartmouth. His research interests include
algorithms, reliability engineering, and risk analysis. He has published over 40
research papers in peer-reviewed journals and conferences.

Mohamed F. Ahmed received his B.Sc. and M.Sc. degrees from the American
University in Cairo, Egypt, in May 2001 and January 2004, respectively. He
received his Ph.D. degree in Computer Science and Engineering from the
University of Connecticut in September 2009. Dr. Ahmed served as an Assistant
Professor at the German University in Cairo from September 2009 to August
2010 and as an Assistant Professor at the American University in Cairo from
September 2010 to January 2011. Since 2011, he has served as a Program
Manager at Microsoft. His research interests include multi/many-core
technologies, high-performance computing, parallel programming, cloud
computing, and GPU programming. He has published many papers in these areas.
