Lecture 16
Dr. Karthik Mohan
(Univ. of Washington, Seattle) EEP 596: LLMs: From Transformers to GPT ∥ Lecture 16, February 29, 2024
Deep Learning and Transformers References
Deep Learning
Great reference for the theory and fundamentals of deep learning: Book by
Goodfellow, Bengio et al.
Deep Learning History
Embeddings
SBERT and its usefulness
SBERT Details
Instacart Search Relevance
Instacart Auto-Complete
Attention
Illustration of attention mechanism
Generative AI References
Prompt Engineering
Prompt Design and Engineering: Introduction and Advanced Methods
Generative AI References
Stable Diffusion
The Original Stable Diffusion Paper
Reference: CLIP
Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Diffusion Explainer Demo
The Illustrated Stable Diffusion
U-Net
GenAI Evaluation and Annotation References
Previous Lecture
This Lecture
Adversarial Attacks on LLMs
Costly Mistakes
LLM Jailbreaks
Jailbreak
Bypassing the safety measures built into an LLM so that it behaves in
ways outside its intended use, e.g. producing toxic output or engaging
in sensitive discussions.
LLM Jailbreak - Example with GPT-3.5
LLM Jailbreak - Example with GPT-4
Automated Jailbreaks
Adversarial suffixes
Train a model to generate a prompt add-on (suffix) that, when appended to
a request, increases the probability that the target LLM engages in the
desired objectionable behavior.
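The search for a suffix can be illustrated with a toy sketch. Note the assumptions: instead of training a generator model as the slide describes, this sketch swaps in a plain greedy coordinate search, and the target LLM is replaced by a hypothetical `toy_score` function that stands in for "probability of the desired behavior". The vocabulary, the hidden string, and all function names are made up for illustration.

```python
# Toy vocabulary (assumption); a real search would range over an LLM's tokens.
VOCAB = list("abcdefghijklmnopqrstuvwxyz")


def toy_score(prompt: str, suffix: str) -> float:
    """Hypothetical stand-in for the target model's probability of the
    desired output. Here it just rewards suffixes close to a hidden string."""
    hidden = "unlock"
    matches = sum(a == b for a, b in zip(suffix, hidden))
    return matches / len(hidden)


def search_suffix(prompt: str, length: int = 6) -> tuple[str, float]:
    """Greedy coordinate search: change one position at a time and keep
    any substitution that raises the score."""
    suffix = ["a"] * length
    best = toy_score(prompt, "".join(suffix))
    for pos in range(length):
        for tok in VOCAB:
            candidate = suffix.copy()
            candidate[pos] = tok
            score = toy_score(prompt, "".join(candidate))
            if score > best:  # keep only strict improvements
                suffix, best = candidate, score
    return "".join(suffix), best


suffix, score = search_suffix("some user request")
print(suffix, score)
```

The point of the sketch is only the loop structure: propose a change to the suffix, score it against the target, keep what helps. In a real setting the score comes from the target model and is far less well behaved than this separable toy objective.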
LLM Jailbreaks
Downstream Impact
Jailbreaks affect not only the LLM itself but also the downstream
components that depend on it. Think of LLM agents that coordinate with
each other to produce a response: an attack on one component can
adversely affect the behavior of the whole system.
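A minimal sketch of how a compromise can propagate through a chain of agents. Everything here is hypothetical: the three agent functions, the `Message` type, and the `compromised` flag are illustrative stand-ins, not part of any real agent framework. Each agent consumes the previous agent's output verbatim, which is the assumption that lets one attacked component taint the final response.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    compromised: bool = False  # illustrative taint flag (assumption)


def retriever_agent(query: str, attacked: bool) -> Message:
    # Hypothetical first agent; when attacked, its output carries injected text.
    if attacked:
        return Message("[injected instruction] " + query, compromised=True)
    return Message(f"facts about {query}")


def summarizer_agent(msg: Message) -> Message:
    # Uses upstream text as-is, so it inherits any compromise.
    return Message(f"summary({msg.text})", compromised=msg.compromised)


def responder_agent(msg: Message) -> Message:
    # Final agent; also trusts its input without checking it.
    return Message(f"answer based on {msg.text}", compromised=msg.compromised)


def run_pipeline(query: str, attacked: bool = False) -> Message:
    return responder_agent(summarizer_agent(retriever_agent(query, attacked)))


print(run_pipeline("llm safety").compromised)
print(run_pipeline("llm safety", attacked=True).compromised)
```

Only the first agent is attacked, yet the final response is marked compromised because no stage validates what it receives.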
LLM Jailbreak - Violent Example with GPTs
GPT-3.5
GPT-4
LLM Jailbreak - Violent Example with GPTs
GPT-4 Example 1
GPT-4 Example 2
Game on Adversarial Attack - Level 1
Game on Adversarial Attack - Level 2
ICE #1: Adversarial Game with LLMs
Adversarial Game
Based on your tryout of the game: what would be a way to automate
the process of cracking each level?
Toxic System Prompting
ICE #2: Play around with adversarial role-playing for GPT
Adversarial Attacks Benchmarks
Adversarial Attacks