GPT 2 - Learning 5
The input tensor x has shape (B, T), where B is the number of sequences and T is the sequence length. The
goal is to extend each sequence to a specified maximum length (max_length).
In a loop, the model performs a forward pass to obtain logits (unnormalized scores over the vocabulary for each
position) without computing gradients, as specified by torch.no_grad(). The logits tensor, initially of shape
(B, T, vocab_size), is reduced to (B, vocab_size) by selecting the logits of the last token in each sequence.
These logits are then converted to probabilities using the softmax function.
To introduce randomness and diversity in the generated sequences, top-k sampling is employed. The top
50 token probabilities (topk_probs) and their corresponding indices (topk_indices) are extracted. A
token is randomly selected from the top-k probabilities for each sequence using torch.multinomial,
and the chosen token indices are gathered and appended to the sequences.
This process continues until the sequences reach the desired length. Finally, the generated sequences are
decoded back into text and printed. Each sequence is converted from token IDs to text and displayed
individually, demonstrating the model's ability to generate coherent continuations of the initial input.
```python
import torch
import torch.nn.functional as F

# model, enc, x, num_return_sequences, and max_length are defined earlier in these notes.
torch.manual_seed(42)
torch.cuda.manual_seed(42)
print(x.size())

while x.size(1) < max_length:
    with torch.no_grad():
        logits = model(x)                          # (B, T, vocab_size)
        logits = logits[:, -1, :]                  # keep only the last position: (B, vocab_size)
        probs = F.softmax(logits, dim=-1)          # convert logits to probabilities
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1)      # sample one token from the top 50
        xcol = torch.gather(topk_indices, -1, ix)  # map back to vocabulary indices
        x = torch.cat((x, xcol), dim=1)            # append the sampled token

for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()
    decoded = enc.decode(tokens)
    print(">", decoded)
```
Despite the high speed of tensor cores in GPUs, their performance can be limited by memory bandwidth—
the speed at which data is transferred to the cores. Achieving even 60% utilization of tensor cores is
considered excellent due to these constraints.
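To make this constraint concrete, here is a rough roofline-style check. The peak throughput and bandwidth figures are assumptions taken from the A100 datasheet referenced later in this section, not numbers stated here:

```python
# Back-of-the-envelope roofline check for an A100 (peak datasheet numbers, assumed):
peak_tensor_flops = 312e12   # FLOP/s, dense FP16/BF16 tensor core throughput
peak_hbm_bandwidth = 1.6e12  # bytes/s, roughly 1.6 TB/s of HBM2 bandwidth (40 GB card)

# To keep the tensor cores fully busy, every byte fetched from memory must
# feed roughly this many floating point operations:
flops_per_byte = peak_tensor_flops / peak_hbm_bandwidth
print(f"~{flops_per_byte:.0f} FLOPs needed per byte moved")  # ~195

# Elementwise ops and small matmuls fall far below this arithmetic intensity,
# so the tensor cores end up waiting on memory rather than computing.
```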
Matrix multiplications dominate our operations, especially in the linear layers, where tensor cores excel.
Operations like GELU, LayerNorm, and softmax are comparatively lightweight. Notably, the single most
computationally intensive matrix multiplication is the final projection from the 768-dimensional embedding
to the 50257-token vocabulary.
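A back-of-the-envelope FLOP count per token illustrates this; the GPT-2-small sizes (n_embd = 768, MLP width 3072, vocabulary 50257) are assumed here:

```python
# Approximate FLOPs per token for a single dense layer of shape (n_in, n_out):
def matmul_flops(n_in, n_out):
    return 2 * n_in * n_out  # one multiply and one add per weight

layers = {
    "qkv projection (768 -> 2304)":  matmul_flops(768, 3 * 768),
    "attention output (768 -> 768)": matmul_flops(768, 768),
    "MLP up (768 -> 3072)":          matmul_flops(768, 3072),
    "MLP down (3072 -> 768)":        matmul_flops(3072, 768),
    "final lm_head (768 -> 50257)":  matmul_flops(768, 50257),
}
for name, flops in layers.items():
    print(f"{name:32s} {flops / 1e6:6.1f} MFLOPs per token")
# The 768 -> 50257 classifier is by far the largest single matrix multiplication.
```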
The concept of TensorFloat-32 (TF32), introduced in the NVIDIA Ampere architecture, is key to this performance leap.
TF32 is a 19-bit floating point representation (1 sign bit, 8 exponent bits, 10 mantissa bits) used by tensor cores,
designed for rapid matrix multiply-accumulate operations of the form a * b + c, where a, b, and c are small
matrices (4x4 in the original tensor core instruction). Although this format slightly reduces precision, it remains
adequate for training and lets the tensor cores run matrix multiplications up to eight times faster than standard FP32.
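In PyTorch, TF32 matmuls are an opt-in switch. A minimal sketch, assuming an Ampere-or-newer GPU; the tensor sizes are arbitrary:

```python
import torch

# "high" lets FP32 matmuls run internally in TF32 on tensor cores;
# weights and activations are still stored as ordinary float32.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b          # dispatched to TF32 tensor cores on Ampere-class hardware
print(c.dtype)     # still torch.float32
```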
In summary, leveraging lower precision formats and understanding the computational constraints can
significantly enhance model training efficiency, particularly when utilizing advanced hardware like NVIDIA's
tensor cores.
Floating Point in NVIDIA A100
The NVIDIA A100 Tensor Core GPU introduces several enhancements for AI and HPC workloads, providing
significant performance improvements:
TF32 (TensorFloat-32): A new precision format that accelerates single-precision dense
matrix multiplications. TF32 maintains the range of FP32 while providing improved performance.
FP16 and Mixed Precision: A100 supports FP16 for faster computation. Combined with automatic
mixed precision, this allows training with FP32-level accuracy at FP16 speed (a short PyTorch sketch
follows after the reference below).
Double Precision (FP64): Enhanced FP64 Tensor Cores deliver significant performance
improvements for HPC applications.
Reference: NVIDIA A100 Datasheet
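The mixed-precision bullet above corresponds to the standard PyTorch autocast/GradScaler pattern: keep FP32 master weights, run most of the forward pass in FP16, and scale the loss so small gradients do not underflow. A minimal sketch follows; the tiny linear model and random data are placeholders, not part of these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda"
model = nn.Linear(768, 50257).to(device)        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()            # handles FP16 loss scaling

x = torch.randn(8, 768, device=device)
y = torch.randint(0, 50257, (8,), device=device)

for step in range(5):
    optimizer.zero_grad()
    # Ops that benefit from FP16 run in half precision; numerically sensitive
    # ops and the master weights stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = F.cross_entropy(logits, y)
    scaler.scale(loss).backward()   # scale up the loss to avoid gradient underflow
    scaler.step(optimizer)          # unscale gradients and apply the update
    scaler.update()
```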
Tensor Cores in NVIDIA Ampere Architecture
Tensor Cores, introduced with the Volta architecture, have been significantly improved in the Ampere
architecture:
Sparse Tensor Cores: These allow for up to 2x performance improvement by leveraging sparsity in
models.
Enhanced Precision: Supports multiple precisions, including FP16, BFLOAT16, TF32, INT8, and
FP64, optimizing for both training and inference workloads (a small dtype example follows after the reference below).
Reference: NVIDIA Ampere Architecture Whitepaper
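As a small illustration of the precision support listed above, the sketch below runs the same matrix multiplication in the half-precision formats that Ampere tensor cores accelerate; the shapes are arbitrary and the INT8 and FP64 paths are omitted:

```python
import torch

a32 = torch.randn(2048, 2048, device="cuda")
b32 = torch.randn(2048, 2048, device="cuda")

# The same multiplication in the reduced-precision formats that Ampere
# tensor cores accelerate for training and inference.
for dtype in (torch.float16, torch.bfloat16):
    c = a32.to(dtype) @ b32.to(dtype)
    print(dtype, c.dtype, c.shape)
```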
Programming Tensor Cores with CUDA 9