
Introduction to Large Language Models

Dr. Aijun Zhang

October 2024


Recommended Texts

• Tunstall, L., von Werra, L. and Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly.
• Bishop, C. M. and Bishop, H. (2023). Deep Learning: Foundations and Concepts. Springer.
• Alammar, J. and Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.
• Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.


Outline

1 LLMs - A Quick Overview

2 Attention is All You Need (2017)


Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs

3 Interpreting Contextual Embeddings

4 Text Classification and Fine-Tuning


Landscape of Artificial Intelligence

Source: Raschka (2024)


Evolution of Language Models

Source: Alammar and Grootendorst (2024)


Growing Scales of Language Models

Scaling laws: increasing parameters, data quality, and compute resources generally improves LLM performance.

Click into VizSweet


Key Tasks of Large Language Models

• Text Classification: Sentiment analysis, Spam detection, Named Entity Recognition (NER), Natural Language Inference (NLI)

• Text Generation: Creative writing, Chatbot, Translation, Coding

• Summarization: News article summaries, Research paper abstracts, Document condensation

• Question Answering: Customer support queries, Educational FAQs

• Knowledge Integration: Retrieval-Augmented Generation (RAG) for up-to-date responses, evidence-based information generation

• Reasoning: Logical deduction, Mathematics problem solving, Chain-of-Thought analysis


Outline

1 LLMs - A Quick Overview

2 Attention is All You Need (2017)


Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs

3 Interpreting Contextual Embeddings

4 Text Classification and Fine-Tuning


Attention is All You Need (2017)


• By Google Brain: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. (NIPS 2017)
• It revolutionized neural networks by introducing the Transformer, a
highly flexible architecture enabling LLMs like BERT, GPT, and T5.
• The core innovation of self-attention allows each token to capture
global context efficiently, eliminating the need for recurrence.
• It sparked groundbreaking models across NLP and other domains, with
applications extending to vision, audio, and multimodal tasks.


Transformer Architecture

• Encoder processes the input

• Decoder generates the output, predicting the next token auto-regressively

• Feed-forward network (deep learning)

• Multi-head self-attention

• Masked attention for the decoder

• Positional encoding

• Check the PyTorch class: Docs > torch.nn > Transformer (see the sketch below)
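
A minimal sketch of instantiating PyTorch's built-in torch.nn.Transformer, to make the encoder-decoder pieces concrete; the hyperparameters and random inputs below are illustrative, not values from the slides.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder Transformer via the built-in PyTorch class.
# Hyperparameters are illustrative defaults, not taken from the slides.
model = nn.Transformer(
    d_model=512,          # embedding dimension
    nhead=8,              # number of attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.rand(2, 10, 512)   # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)    # (batch, target length, d_model)

# Causal mask so the decoder cannot attend to future target tokens
tgt_mask = model.generate_square_subsequent_mask(7)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 7, 512])
```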


Attention Mechanism

Diagrams: Single-head Attention and Multi-head Attention


Scaled Dot-Product Self-Attention

Each output embedding is a weighted sum of the input embeddings x_{(1)}, ..., x_{(N)}:

$$ y_{(k)} \leftarrow \sum_{i=1}^{N} \alpha_{ki}\, x_{(i)}, \qquad
   \alpha_{ki} = \frac{\exp\big(x_{(k)} x_{(i)}^{\mathsf T}\big)}{\sum_{j=1}^{N} \exp\big(x_{(k)} x_{(j)}^{\mathsf T}\big)} $$

Expressed in matrix form:

$$ Y = \mathrm{Softmax}\big[X X^{\mathsf T}\big]\, X $$

(Q, K, V)-parameterization with weight matrices $W^{(q)}, W^{(k)}, W^{(v)} \in \mathbb{R}^{D \times D}$:

$$ Y = \mathrm{Softmax}\big[X W^{(q)} (X W^{(k)})^{\mathsf T}\big]\, X W^{(v)}
   \equiv \mathrm{Softmax}\big[Q K^{\mathsf T}\big]\, V $$

Scaled self-attention:

$$ Y = \mathrm{Softmax}\!\left[\frac{Q K^{\mathsf T}}{\sqrt{D}}\right] V
   \equiv \mathrm{Attention}(Q, K, V) $$
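
A minimal PyTorch sketch of the scaled formula above (single head, no masking or batching); the shapes and random weights are purely illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (N, D) token embeddings; Wq, Wk, Wv: (D, D) projection matrices.
    Implements Y = Softmax(Q K^T / sqrt(D)) V from the slide.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    D = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / D**0.5   # (N, N) attention logits
    weights = F.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ V                          # (N, D) contextual outputs

# Toy usage with random embeddings (shapes only, illustrative)
N, D = 5, 16
X = torch.randn(N, D)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
Y = scaled_dot_product_self_attention(X, Wq, Wk, Wv)
print(Y.shape)  # torch.Size([5, 16])
```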


Specialized Transformer LLMs

• Encoder-Only Models: BERT (Bidirectional Encoder Representations from Transformers), DistilBERT, RoBERTa, DeBERTa, etc. Ideal for understanding tasks such as text classification, sentiment analysis, and named entity recognition, with bidirectional attention capturing full context.

• Decoder-Only Models: GPT (Generative Pretrained Transformer), GPT-2, GPT-3, and beyond. Optimized for generation tasks, with unidirectional attention. Live example: ChatGPT (GPT-3.5+)

• Encoder-Decoder Models: T5 (Text-to-Text Transfer Transformer), BART. Balance input understanding with output generation, suitable for tasks like translation, summarization, and paraphrasing.

Highlight: Encoder-only models (BERT) are for understanding tasks, while decoder-only models (GPT) are for generative tasks.


Transformer Explainer: Interactive Learning of GPT-2, by Cho et al. (2024)


Live Demo: BertViz

BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)


https://github.com/jessevig/bertviz

Try it on Google Colab: BertViz Interactive Tutorial
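
A minimal sketch of running BertViz's head view in a notebook, following the pattern in the project's README; the checkpoint and example sentence are illustrative.

```python
# Run inside a Jupyter/Colab notebook (BertViz renders interactive HTML).
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view

model_name = "bert-base-uncased"  # any BERT-family checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat because it was tired.",
                   return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)  # interactive attention-head view
```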


Outline

1 LLMs - A Quick Overview

2 Attention is All You Need (2017)


Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs

3 Interpreting Contextual Embeddings

4 Text Classification and Fine-Tuning


Static and Contextual Embeddings

• Embeddings provide a way to represent textual data as dense, continuous vectors in a high-dimensional space, capturing the semantic meaning of words, phrases, and documents.

• Traditional Static Embeddings: Word2Vec, GloVe, etc. Easy to compute and able to capture basic semantic relationships. However, each word has a single, fixed embedding regardless of context.

• Contextual Embeddings: BERT, GPT, etc. Generate dynamic, context-dependent embeddings for each token. BERT is the most popular, as it captures both the left and right context of a word in a sentence (see the sketch below).
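
A minimal sketch contrasting contextual with static behavior: the same word receives different BERT embeddings in different sentences, whereas a static embedding would assign a single vector. The checkpoint and example sentences are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word "bank" appears in two different senses (illustrative sentences)
sentences = ["She sat by the river bank.", "He deposited cash at the bank."]
embs = []
for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    embs.append(hidden[tokens.index("bank")])           # contextual vector

cos = torch.nn.functional.cosine_similarity(embs[0], embs[1], dim=0)
print(f"cosine similarity of 'bank' across contexts: {cos.item():.3f}")
```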


Interpreting Contextual Embeddings

• Interpretability matters: it is crucial to understand how LLMs work and make decisions, fostering transparency, trust, and reliability.

• Challenges in Interpreting LLMs:
  – Complexity of contextual embeddings, high dimensionality
  – Black-box nature: millions/billions of parameters, highly nonlinear
  – Semantic ambiguity: polysemy (words with multiple meanings) in diverse contexts

• A structured process to interpret contextual embeddings (sketched below):
  1. Dimensionality reduction for effective clustering
  2. Extract topics for each cluster
  3. Further dimensionality reduction for 2D or 3D visualization
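
A hedged end-to-end sketch of these three steps, using UMAP for reduction, k-means for clustering, and per-cluster TF-IDF keywords as topics. The libraries, Rotten Tomatoes corpus, and parameter choices are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
import umap
from datasets import load_dataset
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# Illustrative corpus: a subset of Rotten Tomatoes reviews
docs = load_dataset("rotten_tomatoes", split="train")["text"][:2000]

# 0. Contextual embeddings (illustrative model choice)
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 1. Dimensionality reduction for effective clustering
reduced = umap.UMAP(n_components=5, random_state=42).fit_transform(embeddings)

# 2. Cluster, then extract TF-IDF keywords per cluster as "topics"
labels = KMeans(n_clusters=8, random_state=42).fit_predict(reduced)
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = np.array(vectorizer.get_feature_names_out())
for c in np.unique(labels):
    cluster_mean = tfidf[labels == c].mean(axis=0).A1   # mean TF-IDF per term
    print(c, terms[cluster_mean.argsort()[::-1][:5]])   # top-5 keywords

# 3. Further reduction to 2D for visualization
coords_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
```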


Interpreting Contextual Embeddings

Source: Alammar and Grootendorst (2024)


Dimensionality Reduction Techniques


• PCA (Linear). Strength: captures maximum variance. Weakness: assumes linearity, less flexible. Best use: medium data with linear relationships.
• t-SNE (Nonlinear). Strength: preserves local structure. Weakness: computationally expensive. Best use: small data, ideal for 2D/3D visualization.
• UMAP (Nonlinear). Strength: balances local & global structure, scalable. Weakness: requires tuning. Best use: clustering in large, complex datasets.
• Random Projection (Linear, random). Strength: fast, scalable, preserves distances. Weakness: lower interpretability. Best use: high-dimensional data with simple structure.
• AE (Auto-Encoders) (Nonlinear). Strength: learns nonlinear relationships, customizable. Weakness: requires tuning & significant data. Best use: complex datasets with nonlinear patterns.
• VAE (Variational AE) (Nonlinear, probabilistic). Strength: captures data variability, generates new data. Weakness: complex training, requires tuning. Best use: when data variability and generation are important.
• MDS (Nonlinear). Strength: flexible metrics, preserves distances. Weakness: computationally intensive. Best use: semantic similarity in embeddings.


Clustering Techniques

• k-Means Clustering: minimizes distances to centroids. Advantages: simple, efficient, scalable. Limitations: requires k in advance; assumes spherical clusters. Best use: when clusters are approximately spherical and equally sized.
• Agglomerative/Hierarchical Clustering: merges closest clusters iteratively, forming a hierarchy. Advantages: captures hierarchical structure, no need for k. Limitations: computationally intensive, sensitive to noise. Best use: data with inherent hierarchical structure.
• DBSCAN: groups densely packed points; labels sparse points as outliers. Advantages: handles non-convex shapes, detects outliers. Limitations: sensitive to parameters, not good for varying densities. Best use: arbitrary shapes, outlier detection, unknown number of clusters.
• Spectral Clustering: uses a similarity matrix and eigenvalues for clustering. Advantages: good for non-convex clusters, adaptable similarity measures. Limitations: computationally expensive, requires k. Best use: small data with complex relationships.


Topic Extraction Techniques

• TF-IDF (Term Frequency – Inverse Document Frequency): identifies important terms by comparing term frequency in a document to its frequency in the corpus; high scores indicate terms important to the document but uncommon in the corpus. Use in clusters: applied to each cluster to find keywords that best represent its content.
• KeyBERT (Keyword Extraction with BERT): a transformer-based method using BERT embeddings to extract semantically relevant keywords. Use in clusters: identifies representative keywords or phrases, providing rich, contextually accurate topic descriptions for each cluster.
• LDA (Latent Dirichlet Allocation): a topic modeling technique that assumes documents are mixtures of topics, each with a unique word distribution. Use in clusters: extracts coherent topics from clusters, revealing specific themes within broader topics.
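
For example, a minimal KeyBERT sketch; the cluster text and parameters are made up for illustration, and KeyBERT downloads a sentence-transformers backbone by default.

```python
from keybert import KeyBERT

# Illustrative: extract keywords for one cluster's concatenated documents
cluster_text = " ".join([
    "the film is a moving portrait of grief and recovery",
    "a quietly devastating drama anchored by a superb lead performance",
])

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    cluster_text,
    keyphrase_ngram_range=(1, 2),  # single words and bigrams
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (keyword, relevance score) pairs
```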


BERTopic Pipeline for Interpretable Topic Modeling

https://github.com/MaartenGr/BERTopic

Source: Alammar and Grootendorst (2024)
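
A minimal sketch of running the BERTopic pipeline on the Rotten Tomatoes reviews used in the demo; the defaults (embedding model, UMAP, HDBSCAN, c-TF-IDF) are BERTopic's own, and the dataset call is an illustrative assumption.

```python
from datasets import load_dataset
from bertopic import BERTopic

# Illustrative corpus: Rotten Tomatoes movie reviews (text field only)
docs = load_dataset("rotten_tomatoes", split="train")["text"]

# BERTopic chains embeddings -> UMAP -> HDBSCAN -> c-TF-IDF under the hood
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # topic sizes and top keywords
print(topic_model.get_topic(0))             # top terms for topic 0
```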


Live Demo: BERTopic

Try it on Google Colab: BERTopic Demo with Rotten Tomatoes Dataset


Outline

1 LLMs - A Quick Overview

2 Attention is All You Need (2017)


Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs

3 Interpreting Contextual Embeddings

4 Text Classification and Fine-Tuning


Text Classification with Pretrained Transformers

Source: Tunstall et al. (2022)

1. Select a pretrained transformer model, e.g. SBERT or DistilBERT;
2. Prepare the labeled text data, split into training and validation sets;
3. Add a classifier layer with sigmoid/softmax activation;
4. Freeze the transformer model parameters, train only the classifier (see the sketch below);
5. Evaluate performance and perform outcome analysis.
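
A hedged sketch of this frozen-encoder recipe: DistilBERT supplies [CLS] features and only a classifier head is trained on top. The checkpoint, dataset call, and the logistic-regression head (standing in for the sigmoid/softmax classifier layer) are illustrative assumptions.

```python
import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased").to(device).eval()

def cls_features(texts, batch_size=32):
    """Frozen-encoder features: the [CLS] hidden state for each text."""
    feats = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            out = encoder(**batch).last_hidden_state[:, 0]  # [CLS] position
        feats.append(out.cpu())
    return torch.cat(feats).numpy()

train = load_dataset("rotten_tomatoes", split="train")
test = load_dataset("rotten_tomatoes", split="test")

# Train only the classifier head; the transformer stays frozen
clf = LogisticRegression(max_iter=1000)
clf.fit(cls_features(train["text"]), train["label"])
print("test accuracy:", clf.score(cls_features(test["text"]), test["label"]))
```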


Fine-Tuning Transformers for Classification

Source: Tunstall et al. (2022)

• Pros: Fine-tuning the entire model adapts fully to the task, yielding
higher accuracy and flexibility (see the sketch below).
• Cons: Increased computational demands, potential risk of overfitting.
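
A hedged sketch of full fine-tuning with the Hugging Face Trainer API; the checkpoint, dataset call, epochs, and batch size are illustrative assumptions rather than the exact settings behind the AUC table on the next slide.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import roc_auc_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

dataset = load_dataset("rotten_tomatoes").map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(logits[:, 0] - logits[:, 1]))  # P(class 1)
    return {"auc": roc_auc_score(labels, probs)}

args = TrainingArguments(output_dir="distilbert-rt",
                         num_train_epochs=2,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer)   # enables dynamic padding
trainer.train()
print(trainer.evaluate(dataset["test"]))  # reports eval_auc on the test split
```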


Live Demo: DistilBERT

• DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base.

• Check the Hugging Face Transformers page: DistilBERT

Table: DistilBERT Classification on Rotten Tomatoes Data


            DistilBERT-raw   DistilBERT-finetuned
train-AUC   0.918097         0.980089
test-AUC    0.882378         0.911010

Try it on Google Colab: Text Classification using DistilBERT


Thank you!
https://www.linkedin.com/in/ajzhang/
