Lecture 10 - Knowledge and Reasoning - 2025 - LLM (1)

The document discusses Large Language Models (LLMs), focusing on their architecture, training processes, and applications in Knowledge Representation and Reasoning. It covers key concepts such as transformer architecture, self-attention mechanisms, and word embeddings, highlighting their importance in understanding and generating human-like text. The content is presented in a structured format with learning outcomes and acknowledgments from various professors.

Uploaded by

ragingtuna58

COMP-2024 / COMP-2039

Artificial Intelligence
Methods

Lecture 10b
Large Language Model
Learning Outcomes

IDENTIFY

1. Types of Knowledge
2. Representation Methods
3. Case-Based Reasoning
4. Decision Making, Decision Support
5. LLMs in Knowledge Representation and Reasoning

ACK: Prof. Ender Özcan, UNUK, Prof. Tomas Maul & Dr Chen ZhiYuan, UNM

COMP2024 / COMP2039 Knowledge and Reasoning 2


Learning Outcomes

IDENTIFY

1. Introduction to LLM
2. Transformer Architecture
3. Training process for LLMs
4. Fine Tuning
5. Model Evaluation
6. Capabilities and Roles of LLM in Knowledge and Reasoning



Introduction to LLM
Large Language Model (LLM)

▪ A Large Language Model is a type of AI model that is trained to understand and generate human-like text. It is built on a deep learning architecture called the transformer, and it is trained on massive amounts of text data from books, websites, articles, and more.

Simon Lau Boung Yew

COMP 2024 / COMP 2039 Components of Heuristics Search Methods and Hill Climbing 5
Large Language Model (LLM)

The Evolution of Large Language Models

Introduction to LLMs

▪ Large Language Models (LLMs), such as GPT and BERT, are increasingly being leveraged in the domain of Knowledge Representation and Reasoning (KRR) due to their ability to process, understand, and generate human-like text.

▪ Their applications in this field range from encoding knowledge in textual formats to reasoning over facts, concepts, and relationships.


Probability-based language models

▪ Probability-based language models analyze sequences of words.
▪ They can predict the next word given previous words.
▪ This prediction forms the basis of text generation, as used in AI-powered chatbots and language processing tools.
▪ This is based on the chain rule of probability.
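
The chain rule factorizes the probability of a whole word sequence into a product of next-word probabilities. In standard notation (the symbols $w_1, \dots, w_n$ for the words are an assumption of this sketch, not taken from the slide):

```latex
\begin{align}
P(w_1, w_2, \dots, w_n) &= \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \\
P(\text{The}, \text{students}, \text{were}) &= P(\text{The}) \, P(\text{students} \mid \text{The}) \, P(\text{were} \mid \text{The}, \text{students})
\end{align}
```

Each factor is exactly the "next word given previous words" prediction described above.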

Probability-based language models

• Language models can evaluate the probability of a given text occurring in a particular language.

• Predicting the Next Word: Given a sentence, enabling a language model to predict the most likely next word equips it with language generation capabilities.
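
Next-word prediction is commonly written as picking the highest-probability word. The form below is a standard formulation, assumed here with $w_1, \dots, w_t$ denoting the words seen so far:

```latex
\hat{w}_{t+1} = \arg\max_{w \in V} P(w \mid w_1, \dots, w_t)
```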

where V is the vocabulary.


Probability-based language models - Neural Language Models

[Figure: large language model architecture; a language model followed by probability sampling, generating tokens up to <EOS>]
Types of Probability-Based Models

▪ N-gram Models: Use a sliding window of n words, predicting the next word from the previous n-1 words.
▪ Hidden Markov Models (HMMs): Model sequences where there's an underlying hidden state (e.g., grammatical roles) generating the visible output (e.g., words).
▪ Neural Language Models (early ones): Combine word embeddings (e.g., Word2Vec, GloVe) with neural networks. Predict next words using a small neural net trained on large corpora. https://lena-voita.github.io/nlp_course/word_embeddings.html
▪ Transformers and self-attention mechanisms: Modern LLMs like GPT and BERT → handle longer contexts and understand semantic relationships
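
As a minimal sketch of the simplest model in the list, the snippet below estimates bigram (n = 2) probabilities from counts and predicts the most likely next word. The toy corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each previous word (n = 2).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Return P(next | prev) estimated from bigram counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def predict_next(prev):
    """Return the most likely next word after `prev` (must be a seen word)."""
    probs = next_word_probs(prev)
    return max(probs, key=probs.get)

print(next_word_probs("the"))
print(predict_next("the"))
```

A larger n gives more context per prediction but needs far more data, which is one reason neural models replaced count-based ones.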
Transformer
Architecture
Language Model and Transformer

The goal of language models is to estimate the joint probability of the whole sequence:

for example:
Transformer

• Translation model

[Figure: Input → Embedding → Encoder → Attention mechanism → Decoder → Output]

Input: 學生都快睡著了,因為他們聽老師講課很無聊 ("The students were almost falling asleep because they found the teacher's lecture boring")
Output, generated one word at a time: The → The students → The students were → The students were almost → The students were almost falling → The students were almost falling asleep…
Encoder and Decoder

▪ Encoder takes an input sequence and returns an internal state (an embedding vector); Decoder starts with the internal state generated by the encoder to predict the output sequence.
▪ Encoder: Processes (understands) the input sequence.
▪ If the input is: "The students were almost falling asleep."
The encoder converts this into a numerical representation capturing meanings and relationships between words.
▪ Tokenized (words) → word embeddings
▪ Decoder: Generates the output sequence.
▪ Given the encoded input "The students were almost...", the decoder predicts the next word ("falling") and continues generating.
Transformer: Encoder and Decoder

After the Transformer was proposed in 2017, Google and OpenAI each leveraged one part of it to develop the BERT and GPT models, leading to significant achievements in the NLP field.

▪ BERT (Encoder): Pre-trained model for various NLP tasks. Bidirectional: looks at both past and future context to understand meaning.
▪ GPT (Decoder): Text generation model. Unidirectional: predicts the next word based only on past words.
Understanding the Transformer Structure

BERT is great for understanding text (NLP tasks like search and classification).
GPT is powerful for generating text (chatbots, creative writing).

Encoder

Embedding

▪ Transformers process words in parallel, so they don’t inherently understand which word comes first, second, etc.
▪ Word order matters in language—for example:
▪ “The cat chased the dog.”
▪ “The dog chased the cat.”
▪ These two sentences have different meanings, but a Transformer without
positional encoding would treat them the same!
▪ Each word in a sentence gets an extra positional vector added to its
embedding.
• Word embedding → Represents the meaning of the word.
• Positional encoding → Tells the model where the word is in the sentence.
Word Embedding

▪ An embedding is a dense vector representation of a word, token, or subword in a continuous space.
▪ Example: "cat" → [0.25, -1.03, ..., 0.78] (e.g., 512 dimensions)
▪ Embeddings capture semantic meaning → Similar words have
similar embeddings: "king" and "queen" might be closer in
embedding space than "king" and "table".

Input Embedding
Embedding dimension is the length of the vector used to represent each token.
Word / Token Embedding and Word2vec

▪ Word Embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in vector space with several dimensions. Word embeddings can be generated using various methods like neural networks.
▪ Word2Vec allows words to be represented as vectors in a continuous vector space.
Embedding

1 apple
2 green
3 blue
4 cat
5 fish
6 sky
7 grass
8 glass

Embeddings in LLMs are numerical representations of text that capture semantic meaning in a high-dimensional vector space.
Similar concept to latitude and longitude (written here as [ latitude, longitude ], with east positive)

• Washington DC is at [ 38.9, -77.0 ]
• New York is at [ 40.7, -74.0 ]
• London is at [ 51.5, -0.1 ]
• Paris is at [ 48.9, 2.4 ]
• Taipei is at [ 25.0, 121.6 ]

Distance, Position ...
Word2vec - Shallow neural network models used to learn word embeddings

Each word of "I am studying NLP now" is looked up in the embedding table:

Word      Embedding
I          0.1   0.4   0.9   0.7
am         0.5  -0.2  -0.2   0.3
studying   0.4   0.6  -0.1   0.8
NLP       -0.2  -0.4   0.9  -0.3
now        0.1   0.5  -0.7   0.8

Two training setups:
▪ Predict the target word from surrounding context words (CBOW)
▪ Predict surrounding context words from the target word (skip-gram)
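
The two setups differ only in how training pairs are cut from a sentence. They are commonly called CBOW (context → target) and skip-gram (target → context). The sketch below builds both kinds of pairs for the example sentence; the function names and the window size of 1 are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: predict each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(context words, target) pairs: predict the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = ["I", "am", "studying", "NLP", "now"]
print(skipgram_pairs(sentence)[:3])
print(cbow_pairs(sentence)[2])
```

In Word2vec these pairs are then fed to a shallow network whose learned input weights become the embedding table.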
Word2vec

Embedding Property

• Calculate similarity
• The similarity between "man" and "woman" is higher than the similarity between "man" and "apple."

• Perform numerical calculations directly

• woman + ( king - man ) ≈ queen

• Analyze the meaning

[Figure: embedding space showing the points queen, woman, king, man, and apple]
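
Both properties can be checked with a few lines of arithmetic. The sketch below uses hand-picked 3-dimensional vectors, which are hypothetical and chosen only so the result is easy to follow; real embeddings are learned and much longer.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
emb = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# woman + (king - man), computed element by element.
target = [w + (k - m) for w, k, m in zip(emb["woman"], emb["king"], emb["man"])]
nearest = max(emb, key=lambda word: cosine(emb[word], target))

print(cosine(emb["man"], emb["woman"]) > cosine(emb["man"], emb["apple"]))
print(nearest)
```

With learned embeddings the arithmetic only lands near "queen", so the answer is found by a nearest-neighbour search as done here.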
Embedding Projector
https://projector.tensorflow.org/
Transformer Architecture

▪ Transformer architecture addressed the issue of preserving long-term dependencies (vanishing gradient issue in RNN, LSTM) by leveraging
▪ self-attention mechanisms, which retain word-to-word relations by weighing the importance of different words in a sentence.
▪ positional encodings to represent each word’s position.
▪ This enables parallel computation over the entire text, rather than sequential processing, without losing the word order. Transformers outperform traditional sequence models (e.g., RNN, LSTM).
Positional Encoding
- "What is the order of the words?"

▪ Transformers don’t have loops or recurrence, so they have no idea about word order unless we explicitly tell them by adding the positional information.
▪ The positional encoding is a fixed-length vector (computed with sinusoids) added to word i's embedding xᵢ to enable parallel processing of input sequences without losing the word order.
▪ This allows the model to learn relative positions, like: Next word, Previous word and Distance between words
▪ Final Input = Token Embedding + Positional Encoding
Positional Encoding

Positional Encoding:

sine and cosine functions to generate unique position values
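
The sinusoidal encoding from the original Transformer paper (Vaswani et al., 2017) is usually written as below, and it appears to be the form this slide refers to. Here $pos$ is the token position, $i$ indexes a sine/cosine pair of dimensions, and $d_{\text{model}}$ is the embedding dimension:

```latex
\begin{align}
PE_{(pos,\, 2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
PE_{(pos,\, 2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\end{align}
```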

Positional Encoding

• By using sine and cosine functions, each position has a unique encoding, which allows the model to differentiate between words based on their order.
• The continuous nature of sine and cosine functions provides smooth transitions between positional encodings, which can help the model learn relationships between different positions more effectively.

i: the index of the dimension in the embedding vector, indicating which sine or cosine function to use.
Example: Positional Encoding

Each word gets a unique pattern of numbers that helps the Transformer understand its position.
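
A minimal sketch of how such a pattern can be generated, assuming the standard sinusoidal scheme with base 10000 from the original Transformer paper. The tiny sizes (4 positions, 4 dimensions) are chosen only to keep the printout readable.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings as a list of rows, one per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            i = k // 2                      # index of the sine/cosine pair
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

Each row would be added element by element to the token embedding at that position.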

Self-Attention
- "Who should I pay attention to?"

▪ In a sentence, not all words are equally important to each other. Self-attention
lets the model look at all the words in the sentence at once, and decide
which ones matter most for understanding a specific word.
▪ The “self” part of self-attention refers to the "egocentric" focus of each token in a sequence. Effectively, on behalf of each token of input, self-attention asks, "How much does every other token of input matter to me?"
▪ "The cat sat on the mat." → To understand "sat" in context, the model might pay attention to:
▪ "cat" (to know who sat)
▪ maybe "mat" (to know where it sat)
▪ The self-attention mechanism computes how much attention "sat" should give to
each word (including itself) — and forms a weighted sum of all word vectors.
Self-Attention

▪ Self-attention calculates attention scores between different parts of the sentence. This helps the model decide which words to focus on when processing input sequences.
▪ Each input word embedding x is transformed into 3 vectors:
• Query (Q)
• Key (K)
• Value (V)
• The query Q of a specific word is matched against the keys K of all words (including itself) to decide how much of each word's value V to retrieve.
• Attention Score = Q · Kᵀ (dot product), scaled by √dₖ and passed through a softmax so that the weights for each word sum to 1
▪ Weighted sum for word i = Σⱼ attention_weight(i, j) × Vⱼ (value vector) → words j that are more relevant to word i contribute more to its new representation
▪ This process happens for every word, at the same time, hence "self"-attention.
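
The computation can be sketched in plain Python, following the scaled dot-product form from the original Transformer paper (scores divided by √dₖ, then softmax, then a weighted sum of values). The Q, K and V vectors below are hypothetical; in a real model they come from learned linear projections of the embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score of this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional Q, K, V vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = self_attention(Q, K, V)
for w in weights:
    print([round(x, 3) for x in w])
```

Each printed row shows how one token distributes its attention over all three tokens.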
Single Head Attention

▪ It learns to focus on one type of relationship between words.

Multi-Head Attention

▪ The multi-head attention mechanism performs the self-attention operation multiple times in parallel, each time with its own set of weights.
▪ Each "head":
• Sees the input slightly differently
• Focuses on different types of relationships
▪ For example, with 4 attention heads:
• Head 1: Focuses on subject-verb
• Head 2: Focuses on positional alignment
• Head 3: Focuses on adjective-noun
• Head 4: Focuses on long-distance dependencies
Multi-Head Attention

Multi-Head Attention

Multi-head attention:

head function:

concatenation of h heads:
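
For reference, the standard definitions from the original Transformer paper are given below. The projection matrices $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are learned, $h$ is the number of heads, and $d_k$ is the key dimension:

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\end{align}
```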
Multi-Head Attention

▪ Multi-Head Attention runs multiple attention layers in parallel.
• Different heads focus on different aspects of the input.
• The results are concatenated and passed through a linear layer.
• Handles long-range dependencies better than RNNs.
• Parallelizable, unlike sequential models.
Feed Forward Layer

▪ Two fully connected layers (also called dense layers), with a ReLU activation function and dropout in between, that operate independently on each position (token) in the sequence to transform and refine the output representations
▪ Applies a transformation to each token vector independently
▪ Helps the model learn non-linear combinations of features
▪ Acts like a feature refiner after attention
Decoder

• In the decoder, the generated text sequences are produced one word at a time. Each output word is considered as the new input, which is then passed through the decoder's attention mechanisms (masked self-attention, followed by attention over the encoder's output). After N decoder stacks, a softmax output predicts the most probable next word.
• The process may seem sequential. During real training, the decoder can process the prediction of next words in the entire sequence in parallel.

The self-attention layer in the decoder also employs the attention mechanism, but with future word masks to prevent access to future information. Thus, it is also called the causal self-attention layer.
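
The future-word mask can be sketched as a lower-triangular table of allowed positions. In the example below all raw attention scores are set equal (an assumption made only so the effect of the mask is easy to see), and masked positions receive zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight to disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "students", "were", "almost"]
mask = causal_mask(len(tokens))
for i, tok in enumerate(tokens):
    # Equal raw scores, so only the mask shapes the weights.
    weights = masked_softmax([0.0] * len(tokens), mask[i])
    print(tok, [round(w, 2) for w in weights])
```

Because every row is masked independently, all positions can be trained in parallel while each one still sees only its own past.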
What is the training
process for LLMs
Deep Learning Training Process

Data Collection → Data Preprocess → Model Building → Model Training → Model Evaluation → Model Optimization → Deployment

[Figure: model input and output can be image, audio, numerical, text, …]
[Figure: for text, the input passes through Tokenization and Embedding; the output is text]
Vector Database
Why is Vector Database Necessary?

Question: What is artificial intelligence?

[Figure: the question is turned into an embedding, which is difficult to search and compare in a database]

Consideration:
● Storage format
● Query operations
● Performance requirements
What is Vector Database?

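Question: What is artificial intelligence?

[Figure: the question is converted to an embedding and compared with the embeddings stored in the vector database]
Example embedding vectors: [0.1, 0.3, -0.2, …], [0.9, -0.1, -0.4, …], [0.8, 0.7, -0.6, …]

Vector similarity search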
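
A minimal sketch of vector similarity search, using brute-force cosine similarity over an in-memory dictionary. The texts, the 3-dimensional vectors and the `search` function are hypothetical; a real system would obtain vectors from an embedding model and rely on a vector database with an approximate nearest-neighbour index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical stored documents with hand-picked embeddings.
store = {
    "AI is the simulation of human intelligence by machines.": [0.9, -0.1, -0.4],
    "Neural networks are a family of machine learning models.": [0.8, 0.7, -0.6],
    "Paris is the capital of France.": [-0.6, 0.2, 0.7],
}

def search(query_vec, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)
    return scored[:k]

query = [0.8, 0.1, -0.5]   # hypothetical embedding of the question
for score, text in search(query):
    print(round(score, 3), text)
```

Returning only the top hit or the top k hits corresponds to the first two return methods on the use-case slide.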


Use case

[Figure: documents are used to create a document database; users' immediate questions are answered by an enterprise-specific LLM]

The possible return methods for actual applications:
1. Return the highest-scoring result directly.
2. Return multiple similar results.
3. Return a summary or answer after querying through FFM (Feature-Fused Model).
Common Vector Databases

▪ Pinecone
▪ FAISS (Facebook AI Similarity Search)
▪ Chroma
▪ Milvus
▪ Redis

How to train/fine-tune
your own LLM
Pretrained LLMs

▪ Pretrained Large Language Models (LLMs) are advanced neural network architectures pretrained (offline) on vast amounts of text data, allowing them to capture the intricacies of language, context, and meaning.
▪ Popular Pretrained LLMs
▪ Google BERT (Bidirectional Encoder Representations from
Transformers)
▪ OpenAI GPT (Generative Pretrained Transformer)
▪ T5 (Text-to-Text Transfer Transformer)
▪…
Pre-trained LLM

Learning of LLMs

▪ Pre-training model: Billions
▪ Instruction Tuning: 10K~M
▪ Alignment: K
▪ Prompting: O(1)

Number of parameters - learnable weights in a neural network


Size: Number of Parameters

▪ In a Transformer-like LLM, parameters come from:


1. Token embedding matrix
▪ Vocabulary size = 50,000
▪ Embedding dimension = 4,096
▪ [50,000 × 4,096] = 204,800,000 parameters
▪ ≈ 205M just for embeddings!
2. Attention layers (queries, keys, values, output)
3. Feed-forward layers
4. Layer norms, biases, projection heads, etc.
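
The embedding count above is a single multiplication, and the same style of arithmetic gives a rough size for one Transformer layer. The per-layer estimate below rests on common assumptions that are not from the slide: four d_model × d_model attention matrices and a feed-forward hidden size of 4 × d_model, with biases and layer norms ignored.

```python
vocab_size = 50_000
d_model = 4_096

# Token embedding matrix: one d_model-sized vector per vocabulary entry.
embedding_params = vocab_size * d_model
print(f"{embedding_params:,}")   # 204,800,000

# Assumed layer shape (not from the slide): Q, K, V and output projections,
# plus two feed-forward matrices with hidden size 4 * d_model.
attention_params = 4 * d_model * d_model
ffn_params = 2 * d_model * (4 * d_model)
per_layer = attention_params + ffn_params
print(f"{per_layer:,}")          # 201,326,592
```

Under these assumptions a single layer is about as large as the whole embedding table, which is why deep models reach billions of parameters.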

Language Model Size

Model size

Dataset
Computing resources

Time

Source: https://epochai.org/blog/tracking-large-scale-ai-models
Pretraining

• When you first train a model, its parameters are randomly initialized and adjusted during training. Pretraining involves training the model on a large corpus of text without specific tasks in mind. The goal is to learn patterns, structures, and semantics of language.
• Common objectives include:
• Masked Language Modeling (MLM): Predicting missing words in a sentence (used in models like BERT).
• Next Sentence Prediction (NSP): Determining if one sentence follows another (also used in BERT).
• Causal Language Modeling: Predicting the next word in a sequence (used in models like GPT).

▪ Large Scale
• These models typically have millions to billions of parameters, allowing them to learn complex patterns and relationships in the data.
• The size of the model contributes significantly to its performance, but it also requires substantial computational resources for training and inference.
How to Train a Deep Learning Model

[Figure: training loop. Forward pass → Prediction → Evaluation (loss) → Backward pass]
Fine Tuning
What is Transfer Learning?
• Transfer existing knowledge to a new domain without the need to relearn!

Knowledge transfer

Transfer learning (TL) is a machine learning (ML) technique that uses a model
pre-trained on one task to improve its performance on a related task.
This technique is used to retrain existing models with new data instead of training
a new model from scratch.
Fine Tuning

• After pretraining, the model can be fine-tuned on specific tasks (e.g., sentiment analysis, question answering) using smaller, task-specific datasets.
• Fine-tuning helps the model adapt its general language knowledge to specific knowledge for particular applications.
▪ Fine-tuning is the process of adjusting the parameters of a pre-trained model to adapt it for a specific task or domain. In simpler terms, it’s like taking a model that has learned general patterns (from a large, diverse dataset) and then making it better at a particular task by fine-tuning its weights with a more specific, task-related dataset.
Fine Tuning

Fine Tuning

• Full Fine-Tuning
• Fine-tuning with limited resources
• In-context learning: Instruction Tuning (Hard Prompt, strictly not fine tuning)
• Parameter-Efficient Fine-Tuning (PEFT)
• Distilled Training
• Data Efficient Training
• Alternative: Retrieval Augmented Generation (RAG)

Source: https://developers.google.com/machine-learning/crash-course/llm/tuning
Full Fine-Tuning

▪ This approach involves fine-tuning the entire pre-trained LLM on a specific task or dataset.
▪ Fine-tune all the parameters of the model. In this case, all layers of the model undergo small adjustments based on the task-specific dataset.
▪ This technique is effective but can be computationally expensive and time-consuming, especially for large-scale models.


Full Fine-Tuning

1. Dataset preparation (specific-domain dataset)
2. Data preprocess
3. Model Building (starting from the pretrained LLM)
4. Model Training
5. Model evaluation
6. Iterate until the goal is achieved

[Figure: Pretrained LLM → Fine-tuned LLM; users send a Query and receive a Response]
Soft vs Hard Prompts

▪ Hard prompt: The actual written or spoken phrases or questions that a user inputs directly into the model during its use.
▪ Soft prompt: The model weights of the LM are frozen, and separate learnable tensors are concatenated with the input embeddings and trained for the specific downstream task.
In-context Learning: Hard Prompt

• LLMs have the ability to quickly learn and apply new concepts or skills based on the provided context, without requiring explicit fine-tuning or retraining.

• Hard Prompting can help activate and utilize this in-context learning capability.


Instruction Tuning

• Instruction tuning involves fine-tuning the model parameters on a dataset where tasks are described in natural language, like: "Translate this sentence," "Summarize the paragraph," or "Answer the question."

• The key idea behind Instruction Tuning is to provide the language model with clear, structured, and tailored instructions or prompts that guide the model's behavior and output towards the desired task or objective.


Instruction Tuning
Parameter-Efficient Fine-Tuning (PEFT)

• PEFT is a more efficient fine-tuning approach that aims to minimize the number of parameters that need to be updated during the fine-tuning process (only a subset of parameters is modified).
• Instead of updating the entire model, PEFT introduces additional trainable parameters, such as adapter modules or soft prompts (prompt tuning), while keeping the majority of the pre-trained parameters frozen.
• This approach can significantly reduce the computational and memory requirements for fine-tuning, making it more efficient for large-scale models.


Parameter-Efficient Fine-tuning (PEFT)

1. Dataset preparation (specific-domain dataset)
2. Data preprocess
3. Model building (extra architecture or part of pretrained model)
4. Model Training
5. Model evaluation
6. Iterate until the goal is achieved

[Figure: Pretrained LLM plus extra architecture or part of pretrained model → Fine-tuned LLM; users send a Query and receive a Response]
PEFT

PEFT Techniques

▪ Soft Prompting: P-tuning, prefix tuning, and prompt tuning all modify how the model processes input, but they do so in different ways.
▪ Adapters: Introduces small additional layers while keeping the base model frozen.
▪ LoRA (Low-Rank Adaptation): Injects small trainable matrices into transformer layers, reducing memory overhead.
▪ QLoRA – Quantized LoRA

Hugging Face’s PEFT: Supports LoRA, Adapters, and Prefix Tuning.
Soft Prompting

▪ Soft prompts are learnable continuous embeddings (vectors) rather than hard-coded text (like "Translate this sentence:").
Soft Prompting

▪ Prompt Tuning

▪ Prefix Tuning

▪ P-Tuning

LoRA (Low-Rank Adaptation)

▪ Update only a small subset of parameters while keeping the majority of the model's parameters frozen.
▪ During the training, only the low-rank update matrices A and B are updated, while the rest of the model remains frozen.
▪ Rank = number of linearly independent rows (or columns) in the matrix
▪ The key idea is to represent changes to the model's parameters in a low-rank form, to ensure that the added parameters are efficient in terms of both computation and memory.
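
A minimal sketch of the low-rank update, assuming the usual LoRA form W + BA, where the frozen weight W is d × d, B is d × r and A is r × d. The tiny sizes and the specific numbers are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

d, r = 4, 1   # hypothetical sizes: d x d weight, rank r

# Frozen pre-trained weight (identity here, purely for illustration).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# Trainable low-rank factors: B is d x r, A is r x d.
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]

delta_W = matmul(B, A)   # d x d update whose rank is at most r
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta_W)]

full_params = d * d            # parameters touched by full fine-tuning
lora_params = d * r + r * d    # parameters trained by LoRA
print(W_adapted[0])
print(full_params, lora_params)
```

With d = 4096 and r = 8 the same arithmetic gives 16,777,216 parameters for the full matrix versus 65,536 trainable LoRA parameters.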
LoRA (Low-Rank Adaptation)

Quantization-Aware Fine-Tuning - QLoRA

▪ Reducing the precision of model weights (e.g., from FP32 to INT8 or INT4) lowers memory usage and speeds up training.

• QLoRA: Combines quantization with LoRA for efficient fine-tuning on consumer GPUs.
What is Quantization?

▪ Quantization is a technique used to reduce the memory and computational cost of running large language models (LLMs) by reducing the precision of numerical values (weights and activations) from higher precision (e.g., FP32) to lower precision (e.g., INT8, INT4).
▪ Example:
Original Model Weights in FP32 (32-bit floating point)
Weights (FP32): [ 0.15234375, -0.83789062, 1.12890625, -0.23046875 ]
Quantized Weights in INT8 (8-bit integer, range -128 to 127)
Scaling factor = max|w| / 127 = 1.12890625 / 127 ≈ 0.0088890
Weights (INT8): [ 17, -94, 127, -26 ]
Reconstructed Weights: [ 17 × 0.0088890, -94 × 0.0088890, 127 × 0.0088890, -26 × 0.0088890 ] ≈ [ 0.151, -0.836, 1.129, -0.231 ]
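
The round trip can be reproduced in a few lines. This sketch assumes symmetric quantization with a scaling factor of max|w| / 127, chosen so that every quantized value fits the signed INT8 range of -128 to 127; other schemes (per-channel scales, zero points) exist.

```python
weights = [0.15234375, -0.83789062, 1.12890625, -0.23046875]

# Symmetric INT8 quantization: map the largest magnitude to 127.
scale = max(abs(w) for w in weights) / 127

# Round to the nearest integer and clamp to the signed 8-bit range.
quantized = [max(-128, min(127, round(w / scale))) for w in weights]

# Dequantize and measure the rounding error.
reconstructed = [q * scale for q in quantized]
errors = [abs(w - r) for w, r in zip(weights, reconstructed)]

print(quantized)
print([round(r, 3) for r in reconstructed])
print(max(errors))
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most half the scaling factor.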
Distilled Training

▪ LLM distilled training refers to the process of training a smaller model (student) to mimic a larger model (teacher), retaining key capabilities while reducing size.
▪ Knowledge Distillation: The student model learns from the soft probabilities of the teacher model.
▪ Examples of compact models
• DistilBERT (distilled from BERT)
• TinyLLaMA (LLaMA architecture at a smaller scale; pre-trained from scratch rather than distilled)
• T5-Small (smallest T5 variant; pre-trained rather than distilled)
Knowledge Distillation Process

1. A large teacher model (e.g., GPT-4, LLaMA 3) generates responses, logits (unnormalized scores over tokens), or intermediate representations.
2. A smaller student model (e.g., DistilBERT, TinyLLaMA) is trained to mimic the teacher’s outputs.
3. The student model learns from both hard labels (ground truth data) and soft labels (teacher’s predictions).
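
Step 3 can be sketched as a loss that mixes a soft-label term and a hard-label term. This is a simplified form under stated assumptions: the logits, the temperature T and the mixing weight alpha are hypothetical, and many implementations use a KL divergence scaled by T² for the soft term instead of the plain cross-entropy shown here.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of the prediction against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

# Hypothetical logits over a 3-token vocabulary.
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [2.5, 2.0, 1.0]
hard_label = [1.0, 0.0, 0.0]   # ground-truth token is index 0

T = 2.0       # temperature that softens both distributions
alpha = 0.5   # weight between the soft and hard losses

soft_loss = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
hard_loss = cross_entropy(hard_label, softmax(student_logits))
loss = alpha * soft_loss + (1 - alpha) * hard_loss
print(round(soft_loss, 3), round(hard_loss, 3), round(loss, 3))
```

Training the student means adjusting its weights to reduce this combined loss over many examples.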
Knowledge Distillation Process

Knowledge Distillation Process

Logits: unscaled predictions (scores) for each token, which can then be transformed into probabilities by passing them through a softmax function.
Data Efficient Training

▪ Optimize dataset selection and preprocessing to minimize training time.
• Few-shot Fine-tuning: Use a small, high-quality dataset → leverages general knowledge from pre-training, specializes in a specific task or domain with minimal data.
• Active Learning: Select only the most informative (uncertain or difficult) samples for fine-tuning → minimizes the amount of labeled data required
Alternative to fine-tuning:
Retrieval Augmented Generation (RAG)

▪ Retrieval Augmented Generation (RAG) is a technique that


combines the strengths of retrieval-based systems and language
generation models.
▪ Instead of modifying the model’s parameters, RAG enhances an
LLM's responses by retrieving relevant external information from a
knowledge base or document store at runtime and then using that
information to guide the generation of the final output.
▪ This approach can enhance the factual accuracy and coherence of
the LLM's responses, as it can draw upon external (often more
up-to-date) knowledge in addition to its internal language modeling
capabilities.
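
The retrieve-then-generate flow described above can be sketched as follows. This is a toy illustration: retrieval here uses bag-of-words cosine similarity, whereas real systems typically use learned embeddings and a vector store. The `knowledge_base` contents and the prompt wording are hypothetical, and the final call to an LLM is omitted.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents,
                    key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents, top_k=1):
    """Place the retrieved passages in the prompt; model weights are untouched."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}\nAnswer:")

# Hypothetical document store.
knowledge_base = [
    "The library opens at 9 am on weekdays.",
    "Hill climbing is a local search method.",
    "The cafeteria serves lunch from noon until 2 pm.",
]
prompt = build_prompt("When does the library open?", knowledge_base)
print(prompt)
```

Updating the system's knowledge then only requires editing `knowledge_base`, not retraining the model.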
How RAG Differs from Fine-Tuning

When to Use RAG Instead of Fine-Tuning

▪ When your model needs real-time access to external or frequently
updated data.
▪ When you lack enough labeled training data for fine-tuning.
▪ When computational resources are limited, as RAG does not require
modifying model weights.

▪ You can fine-tune a model for better reasoning or response style
while still using RAG to provide external factual knowledge. This
hybrid approach can optimize both performance and efficiency.

Model Evaluation
Evaluation: Different Ways to Compute Metric Scores

Source:
https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

How to evaluate

https://huggingface.co/docs/evaluate/index

High-level categories of metrics

• Generic metrics, which can be applied to a variety of situations
and datasets, such as precision and accuracy.
• Task-specific metrics, which are limited to a given task, such as
Machine Translation (often evaluated using the BLEU or
ROUGE metrics) or Named Entity Recognition (often evaluated with
seqeval).
• Dataset-specific metrics, which aim to measure model
performance on specific benchmarks: for instance, the GLUE
benchmark has a dedicated evaluation metric.
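
The two generic metrics named above can be computed by hand for a binary classification task. This is a minimal sketch with hypothetical labels; in practice a library such as Hugging Face Evaluate would normally be used instead.

```python
def accuracy(references, predictions):
    """Fraction of predictions that match the reference labels."""
    correct = sum(r == p for r, p in zip(references, predictions))
    return correct / len(references)

def precision(references, predictions, positive=1):
    """Of the items predicted positive, the fraction that are truly positive."""
    tp = sum(1 for r, p in zip(references, predictions)
             if p == positive and r == positive)
    fp = sum(1 for r, p in zip(references, predictions)
             if p == positive and r != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical gold labels and model outputs.
references = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]
acc = accuracy(references, predictions)    # 4 of 6 correct
prec = precision(references, predictions)  # 2 of 3 predicted positives correct
print(acc, prec)
```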



Open LLM Leaderboard
Open LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots.
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Key Metrics in the Leaderboard

Capabilities and Roles
of LLM in Knowledge
and Reasoning
Instruction following



Capabilities of LLMs

• Natural Language Understanding: Ability to comprehend and
generate human-like text.
• Knowledge Representation: LLMs can encode knowledge in their
parameters and generate representations based on input queries.
• Reasoning Abilities: LLMs can, to varying extents, perform
reasoning tasks such as inference, summarization, and
question answering.

Chain-of-Thought

• LLMs can demonstrate step-by-step reasoning abilities, breaking
down problems into logical steps and articulating the thought
process.
• This technique allows the model to break down complex problems
into sequential steps, facilitating clearer reasoning and more
accurate conclusions.
• Prompting can encourage the model to generate such step-by-step
explanations, which can be valuable for tasks like problem-solving,
question answering, and knowledge-intensive applications.
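
The difference between a direct prompt and a chain-of-thought prompt can be sketched as plain string construction. The phrase "Let's think step by step." is a widely used cue, and the worked example is hypothetical. No model is called here, so this only shows what would be sent to an LLM.

```python
def direct_prompt(question):
    """Ask for the answer with no reasoning cue."""
    return f"Q: {question}\nA:"

def chain_of_thought_prompt(question, worked_example=None):
    """Add a step-by-step cue, optionally preceded by one worked example."""
    parts = []
    if worked_example is not None:
        ex_question, ex_reasoning, ex_answer = worked_example
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

# Hypothetical worked example used as a demonstration.
example = (
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    "5",
)
question = "A shelf holds 4 boxes with 6 books each. How many books are there?"
cot = chain_of_thought_prompt(question, worked_example=example)
print(cot)
```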



Role of LLMs in Knowledge Representation

▪ Implicit Knowledge Representation


• LLMs encode information in their parameters, learned from
extensive corpora. This enables them to store and retrieve
knowledge efficiently without explicit structuring.

• Knowledge is distributed and embedded as semantic
representations in vector spaces, which allows LLMs to perform
tasks like answering factual questions or summarizing
documents.
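
The idea of semantic representations in a vector space can be illustrated with a toy example in which related words receive nearby vectors. The three-dimensional vectors below are hand-picked assumptions for illustration; real LLM embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means the same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-made embeddings.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def nearest(word):
    """Return the other word whose vector is most similar to the given word's."""
    others = (w for w in toy_embeddings if w != word)
    return max(others,
               key=lambda w: cosine_similarity(toy_embeddings[word], toy_embeddings[w]))

print(nearest("cat"))  # dog
```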

Role of LLMs in Knowledge Representation

▪ Explicit Knowledge Representation


• LLMs can extract structured knowledge from unstructured data,
generating explicit representations such as:
• Knowledge Graphs (KGs): Extracting entities and
relationships from text to populate KGs.
• Ontologies: Recognizing hierarchical and semantic structures
to define relationships between concepts.
• Tabular Formats: Converting textual information into rows
and columns for analysis.
• Examples include applications in natural language interfaces for
structured query generation or database population.
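
The explicit representations listed above can be sketched as simple data structures. The triples below are written by hand to stand in for the output of a hypothetical LLM extraction step, and the relation names are assumptions. The same facts are stored as a small knowledge graph and as tabular rows.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, standing in for
# what an extraction step might produce from unstructured text.
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Warsaw", "capital_of", "Poland"),
]

def build_graph(triples):
    """Knowledge graph as an adjacency list: subject -> [(relation, object), ...]."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def to_rows(triples):
    """Tabular format: one row (dict) per fact."""
    return [{"subject": s, "relation": r, "object": o} for s, r, o in triples]

graph = build_graph(triples)
rows = to_rows(triples)
print(graph["Marie Curie"])
print(rows[2])
```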
Reasoning with Knowledge

▪ LLMs can perform various reasoning tasks, including:


▪ Deductive Reasoning
• Drawing logical conclusions from explicit premises, such as
understanding syllogisms or resolving logical contradictions.
▪ Inductive Reasoning
• Generalizing patterns from specific instances, like summarizing
trends or identifying novel patterns in datasets.
▪ Abductive Reasoning
• Inferring the most likely explanation for observed phenomena,
such as identifying causes or generating plausible hypotheses.
Reasoning with Knowledge

▪ LLMs can perform various reasoning tasks, including:


▪ Commonsense Reasoning
• Leveraging pre-trained knowledge to answer questions or
resolve ambiguities based on real-world norms and
expectations.
▪ Counterfactual Reasoning
• Analyzing hypothetical scenarios to reason about "what-if"
situations, aiding in simulations and predictive
modeling.
Applications

▪ Knowledge Augmentation
• LLMs can integrate external knowledge bases during inference, combining
their implicit knowledge with structured information for better accuracy and
interpretability.
▪ Question Answering (QA) Systems
• Applications like ChatGPT demonstrate how LLMs perform reasoning to
answer queries, synthesize knowledge, or even offer multi-hop reasoning
over connected facts.
▪ Automated Reasoning
• Assisting in formal logic tasks like theorem proving, solving puzzles, or
verifying logical consistency in complex systems.
▪ Semantic Search and Retrieval
• Supporting search engines and recommendation systems through
context-aware retrieval and ranking.

Limitations of LLMs

• Generation is based on next-token prediction, not on solid
facts or logical inference
→ Hallucinations
• Its knowledge is up to the date of training data
→ Knowledge cutoff
• Lack of long-term memory and learning

LLM                  Knowledge cutoff date   Provider
GPT-4o               October 2023            OpenAI
GPT-3.5              January 2022            OpenAI
Google Gemini Pro    April 2023              Google
Llama 3-70B          December 2023           Meta
Claude 3             August 2023             Anthropic
Mistral-7B           August 2021             Mistral

Emergent Abilities of LLMs

• In-context learning
• Instruction following
• Step-by-step reasoning (chain-of-thought)



Summary

1. Types of Knowledge
2. Representation Methods
3. Case-Based Reasoning
4. Decision Making, Decision Support
5. LLMs in Knowledge
Representation and Reasoning

ACK: Prof. Ender Özcan, UNUK, Prof. Tomas Maul & Dr Chen ZhiYuan, UNM

NEXT:

Modelling and
Simulation
