GPU Architecture Ebook
Programming
Serial execution: instructions are executed one after another on a single Central Processing Unit (CPU).
Problems:
• More expensive to produce
• More expensive to run
• Bus speed limitation
Parallel Computing
Official-sounding definition: the simultaneous use of multiple compute resources to solve a computational problem.
Benefits:
• Economical – requires less power and is cheaper to produce
• Better performance – avoids the bus/bottleneck issue
Limitations:
• New architecture – Von Neumann is all we know!
• New debugging difficulties – e.g., cache consistency issues
Processes and Threads
• Traditional process
– One thread of control through a large, potentially sparse address space
– Address space may be shared with other processes (shared memory)
– Collection of system resources (files, semaphores)
• Thread (lightweight process)
– A flow of control through an address space
– Each address space can have multiple concurrent control flows
– Each thread has access to the entire address space
– Potentially parallel execution, minimal state (low overheads)
– May need synchronization to control access to shared variables
Threads
• Each thread has its own stack, PC, registers
– Share address space, files,…
Flynn’s Taxonomy
Classification of computer architectures, proposed by Michael J. Flynn
• SISD – traditional serial architecture found in most computers.
• SIMD – parallel computer. One instruction is executed many times with different data (think of a for loop indexing through an array).
• MISD – multiple processing units operate on a single data stream via independent instruction streams. Not really used in practice.
• MIMD – each processing unit runs its own instruction stream on its own data. Fully parallel and the most common form of parallel computing.
What is GPGPU ?
• General Purpose computation using GPU
in applications other than 3D graphics
– GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra,
convolution, correlation, sorting
• GPU – graphics processing unit
(CPU: ~150 GFLOPS vs. GPU: ~1.3 TFLOPS)
CUDA Processor Terminology
• SPA
– Streaming Processor Array (variable across GeForce 8-series, 8 in
GeForce8800)
• TPC
– Texture Processor Cluster (2 SM + TEX)
• SM
– Streaming Multiprocessor (8 SP)
– Multi-threaded processor core
– Fundamental processing unit for CUDA thread block
• SP
– Streaming Processor
– Scalar ALU for a single CUDA thread
Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP), each a fully pipelined scalar processor
– 2 Special Function Units (SFU)
– 16 KB shared memory
– Instruction L1 and data L1 caches
GPU computing
GPU: Graphics Processing Unit
Traditionally used for real-time rendering
High computational density and memory bandwidth
Throughput processor: 1000s of concurrent threads to hide latency
SCC CPU: 2.6 x 4 x 16 = 166.4 Gigaflops double precision
SCC GPU: 1.66 Teraflops double precision
A kernel launch creates a grid of thread blocks (e.g., Kernel 2 executes as Grid 2, which contains blocks such as Block (1, 1)).
CUDA - Memory Units Description
• Registers:
o Fastest.
o Only accessible by a thread.
o Lifetime of a thread.
• Shared memory:
o Can be as fast as registers if there are no bank conflicts or when all threads read from the same address.
o Accessible by any thread within the block where it was created.
o Lifetime of a block.
CUDA - Memory Units Description (continued)
• Global Memory:
o Up to 150x slower than registers or shared memory.
o Accessible from either host or device.
o Lifetime of the application.
• Local Memory:
o Resides in global memory. Can be 150x slower than registers and shared memory.
o Accessible only by a thread.
o Lifetime of a thread.
NVCC compiler
• Compiles C or PTX code (PTX is the CUDA instruction set architecture)
• Compiles to either PTX code or binary (a cubin object)
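As an illustration, a typical compile might look like the following (the file and output names are just examples):

nvcc -o vectorAdd vectorAdd.cu      # compile a CUDA source file into an executable
nvcc -ptx vectorAdd.cu -o vectorAdd.ptx   # stop at PTX to inspect the intermediate code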
Development: Basic Idea
1. Allocate memory of equal size on both the host and the device
2. Transfer data from host to device
3. Execute the kernel to compute on the data
4. Transfer the results back to the host
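A minimal sketch of these four steps, using the increment_gpu kernel that appears later in this chapter (the array size, block size, and the added constant are illustrative choices):

#include <stdio.h>
#include <stdlib.h>

/* kernel shown later in this chapter: adds b to every element of a */
__global__ void increment_gpu(float *a, float b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] + b;
}

int main(void) {
    int N = 1024;
    size_t nbytes = N * sizeof(float);

    /* 1. Allocate memory of equal size on host and device */
    float *h_a = (float *)malloc(nbytes);
    for (int i = 0; i < N; i++) h_a[i] = 0.0f;
    float *d_a = 0;
    cudaMalloc((void **)&d_a, nbytes);

    /* 2. Transfer data from host to device */
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);

    /* 3. Execute the kernel to compute on the data */
    int blocksize = 256;
    increment_gpu<<<(N + blocksize - 1) / blocksize, blocksize>>>(d_a, 1.0f, N);

    /* 4. Transfer the results back to the host */
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    printf("h_a[0] = %f\n", h_a[0]);   /* expect 1.0 */

    cudaFree(d_a);
    free(h_a);
    return 0;
}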
Kernel Function Qualifiers
• __device__ – executed on the device, callable only from the device
• __global__ – a kernel: executed on the device, callable from the host
• __host__ – executed on the host, callable only from the host (the default)
Example in C:
CPU program
void increment_cpu(float *a, float b, int N)
CUDA program
__global__ void increment_gpu(float *a, float b, int N)
Variable Type Qualifiers
• Specify how a variable is stored in memory
• __device__ – resides in global memory on the device
• __shared__ – resides in the on-chip shared memory of a thread block
• __constant__ – resides in constant memory
Example:
__global__ void increment_gpu(float *a, float b, int N)
{
    extern __shared__ float shared[];   /* size supplied at kernel launch */
}
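A small sketch showing all three qualifiers in use (kernel and variable names are hypothetical, not taken from the course examples):

/* resides in device global memory, visible to all kernels */
__device__ float d_scale = 2.0f;

/* resides in constant memory, read-only inside kernels */
__constant__ float c_offset = 1.0f;

__global__ void scale_and_offset(float *a, int N) {
    /* statically sized per-block shared memory
       (assumes the kernel is launched with at most 256 threads per block) */
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        tile[threadIdx.x] = a[idx];
        a[idx] = tile[threadIdx.x] * d_scale + c_offset;
    }
}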
Calling the Kernel
• Calling a kernel function is quite different from calling a regular function: the <<< >>> execution configuration specifies how many blocks and how many threads per block to launch
void main(){
    int blocks = 256;
    int threadsperblock = 512;
    mycudafunc<<<blocks, threadsperblock>>>(/* some parameter */);
}
CUDA: Hello, World! example
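The full listing is not reproduced in this text; a minimal sketch that would produce output like the one shown below (the kernel name and the 3-block x 8-thread launch are assumptions inferred from the output):

#include <stdio.h>

/* each GPU thread prints its own thread and block index */
__global__ void hello_gpu(void) {
    printf("Hello from GPU: thread %d and block %d\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello Cuda!\n");

    /* launch 3 blocks of 8 threads each */
    hello_gpu<<<3, 8>>>();

    /* wait for the kernel (and its printf output) to finish */
    cudaDeviceSynchronize();

    printf("Welcome back to CPU!\n");
    return 0;
}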
Sample output:
Hello Cuda!
Hello from GPU: thread 0 and block 0
Hello from GPU: thread 1 and block 0
. . .
Hello from GPU: thread 6 and block 2
Hello from GPU: thread 7 and block 2
Welcome back to CPU!
Note: threads are scheduled on a "first come, first served" basis, so no particular ordering of the GPU output lines can be expected.
GPU Memory Allocation / Release
Host (CPU) manages GPU memory:
• cudaMalloc (void ** pointer, size_t nbytes)
• cudaMemset (void * pointer, int value, size_t count);
• cudaFree (void* pointer)
void main(){
    int n = 1024;
    int nbytes = n * sizeof(int);
    int *d_a = 0;
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemset( d_a, 0, nbytes );
    cudaFree(d_a);
}
Memory Transfer
cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
• blocks the CPU thread: control returns only after the copy is complete
• the copy doesn't start until all previous CUDA calls have completed
enum cudaMemcpyKind
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
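The device-to-device direction is worth a quick illustration, since it never stages data through host memory (buffer names and sizes are illustrative):

int main(void) {
    int n = 1024;
    size_t nbytes = n * sizeof(float);
    float *d_src = 0, *d_dst = 0;
    cudaMalloc((void **)&d_src, nbytes);
    cudaMalloc((void **)&d_dst, nbytes);
    cudaMemset(d_src, 0, nbytes);

    /* copy between two device buffers without going through the host */
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}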
Host Synchronization
All kernel launches are asynchronous
• control returns to CPU immediately
• kernel starts executing once all previous CUDA calls have completed
Memcopies are synchronous
• control returns to CPU once the copy is complete
• copy starts once all previous CUDA calls have completed
cudaThreadSynchronize()
• blocks until all previous CUDA calls complete
Asynchronous CUDA calls provide:
• non-blocking memcopies
• ability to overlap memcopies and kernel execution
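A small sketch of these rules (the kernel name and sizes are illustrative):

__global__ void do_work(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * 2.0f;
}

int main(void) {
    int N = 1 << 20;
    float *d_a = 0;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMemset(d_a, 0, N * sizeof(float));

    /* kernel launch is asynchronous: control returns to the CPU immediately */
    do_work<<<(N + 255) / 256, 256>>>(d_a, N);

    /* block the host until all previously issued CUDA calls have completed */
    cudaThreadSynchronize();   /* cudaDeviceSynchronize() in newer CUDA versions */

    cudaFree(d_a);
    return 0;
}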
The Big Difference

CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

GPU program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
CUDA: Vector Addition example
/* Main function, executed on host (CPU) */
int main(void) {
    /* ... */
    return(0);
}
Device memory is released at the end of main():
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
The vector elements are distributed across thread blocks (Block #0, Block #1, ...), one element per thread:
/* CUDA Kernel */
__global__ void vectorAdd( const float *A,
                           const float *B,
                           float *C,
                           int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
        C[i] = A[i] + B[i];
}
Full source: intro_gpu/vectorAdd/vectorAdd.cu
Mechanics of Using Shared Memory
• __shared__ type qualifier required
• Must be allocated from a global/device function, or as "extern"
• Examples:

/* a form of dynamic allocation: */
/* MEMSIZE is the size of the per-block shared memory */
extern __shared__ float d_s_array[];

__host__ void outerCompute() {
    compute<<<gs,bs,MEMSIZE>>>();
}

__global__ void compute() {
    d_s_array[i] = …;
}

/* statically sized allocation inside the kernel: */
__global__ void compute2() {
    __shared__ float d_s_array[M];

    /* create or copy from global memory */
    d_s_array[j] = …;

    /* write result back to global memory */
    d_g_array[j] = d_s_array[j];
}
Optimization using Shared Memory
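The original material for this heading is not reproduced here; as one common illustration of the idea, here is a sketch of a block-level sum reduction that stages data in shared memory so that each block performs a single global-memory write (all names and sizes are assumptions):

#define BLOCKSIZE 256

/* each block computes a partial sum of its slice of the input,
   using shared memory to avoid repeated global-memory traffic;
   assumes the kernel is launched with blockDim.x == BLOCKSIZE (a power of two) */
__global__ void partial_sums(const float *in, float *block_sums, int N) {
    __shared__ float cache[BLOCKSIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();

    /* tree reduction within the block, entirely in shared memory */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    /* one global-memory write per block instead of one per thread */
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = cache[0];
}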
10x GPU Computing Growth
• Tesla GPUs: 6,000 (2008) → 450,000 (2015)
• CUDA downloads: 150K (2008) → 3M (2015)
• Supercomputing Teraflops: 77 (2008) → 54,000 (2015)
• University Courses: 60 (2008) → 800 (2015)
• Academic Papers: 4,000 (2008) → 60,000 (2015)
GPU Acceleration
Three ways to accelerate applications:
• GPU-accelerated libraries
• OpenACC directives
• Programming languages
An application is split between the two processors: the compute-intensive functions are parallelized and run on the GPU, while the rest of the sequential code keeps running on the CPU.
Will Execution on a GPU Accelerate My Application?
• C – OpenACC, CUDA
• Python – PyCUDA