Lecture 10 - Knowledge and Reasoning - 2025 - LLM (1)

The document discusses Large Language Models (LLMs), focusing on their architecture, training processes, and applications in Knowledge Representation and Reasoning. It covers key concepts such as transformer architecture, self-attention mechanisms, and word embeddings, highlighting their importance in understanding and generating human-like text. The content is presented in a structured format with learning outcomes and acknowledgments from various professors.


COMP-2024 / COMP-2039

Artificial Intelligence Methods

Lecture 10b
Large Language Model
Learning Outcomes

IDENTIFY

1. Types of Knowledge
2. Representation Methods
3. Case-Based Reasoning
4. Decision Making, Decision Support
5. LLMs in Knowledge Representation and Reasoning

ACK: Prof. Ender Özcan, UNUK, Prof. Tomas Maul & Dr Chen ZhiYuan, UNM

COMP2024 / COMP2039 Knowledge and Reasoning 2

Learning Outcomes

IDENTIFY

1. Introduction to LLM
2. Transformer Architecture
3. Training process for LLMs
4. Fine Tuning
5. Model Evaluation
6. Capabilities and Roles of LLM in Knowledge and Reasoning


Introduction to LLM
Large Language Model (LLM)

▪ A Large Language Model is a type of AI model trained to understand and generate human-like text. It uses a deep learning architecture called the Transformer, and it is trained on massive amounts of text data from books, websites, articles, and more.

Simon Lau Boung Yew

COMP 2024 / COMP 2039 Components of Heuristics Search Methods and Hill Climbing 5
The Evolution of Large Language Models

Introduction to LLMs

▪ Large Language Models (LLMs), such as GPT and BERT, are increasingly used in Knowledge Representation and Reasoning (KRR) because they can process, understand, and generate human-like text.

▪ Their applications in this field range from encoding knowledge in textual formats to reasoning over facts, concepts, and relationships.

Probability-based language models

▪ Probability-based language models analyze sequences of words.
▪ They can predict the next word given the previous words.
▪ This prediction forms the basis of text generation, as used in AI-powered chatbots and language processing tools.
▪ This is based on the chain rule of probability.
• Language models can evaluate the probability of a given text occurring in a particular language.

• Predicting the next word: given a sentence, enabling a language model to predict the most likely next word equips it with language generation capabilities:

$$\hat{w}_{t} = \arg\max_{w \in V} P(w \mid w_1, \dots, w_{t-1})$$

where V is the vocabulary.

Probability-based language models - Neural Language Models

(Figure: a large language model outputs a probability distribution over the next token; tokens are sampled one at a time until <EOS>.)
Types of Probability-Based Models

▪ N-gram Models: Use the previous n-1 words (a sliding window of n words) to predict the next word.
▪ Hidden Markov Models (HMMs): Model sequences in which an underlying hidden state (e.g., grammatical role) generates the visible output (e.g., words).
▪ Neural Language Models (early ones): Combine word embeddings (e.g., Word2Vec, GloVe) with neural networks, predicting next words with a small neural net trained on large corpora. https://lena-voita.github.io/nlp_course/word_embeddings.html
▪ Transformers and self-attention mechanisms: Modern LLMs like GPT and BERT handle longer contexts and capture semantic relationships.
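To make the simplest of these model types concrete, here is a minimal sketch of a bigram (n = 2) model that estimates next-word probabilities by counting. The three-sentence corpus, the `<s>` / `</s>` boundary markers, and all function names are illustrative assumptions, not part of the lecture material.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev)."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

def sentence_prob(counts, sentence):
    """Chain rule with a bigram approximation: multiply P(w_t | w_{t-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p *= next_word_probs(counts, prev).get(nxt, 0.0)
    return p

# Hypothetical toy corpus, chosen only for illustration.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
print(sentence_prob(model, "the cat sat"))
```

A sentence containing an unseen word pair gets probability zero here; real n-gram models add smoothing to avoid that, which this sketch leaves out.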
Transformer
Architecture
Language Model and Transformer

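The goal of a language model is to estimate the joint probability of the whole sequence, which the chain rule factorizes into next-word probabilities:

$$P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$$

for example:

$$P(\text{the cat sat}) = P(\text{the}) \, P(\text{cat} \mid \text{the}) \, P(\text{sat} \mid \text{the cat})$$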
Transformer

• Translation model

(Figure: an encoder-decoder translation model with embedding layers and an attention mechanism. Input: 學生都快睡著了，因為他們聽老師講課很無聊 ("The students were almost falling asleep because they found the teacher's lecture boring"). The decoder builds the output one word at a time: "The", "The students", "The students were", "The students were almost", "The students were almost falling asleep…")
Encoder and Decoder

▪ The encoder takes an input sequence and returns an internal state (an embedding vector); the decoder starts from the internal state generated by the encoder and predicts the output sequence.
▪ Encoder: Processes (understands) the input sequence.
▪ If the input is: "The students were almost falling asleep."
The encoder converts this into a numerical representation capturing meanings and relationships between words.
▪ Tokenized (words) → word embeddings
▪ Decoder: Generates the output sequence.
▪ Given the encoded input and the words generated so far ("The students were almost..."), the decoder predicts the next word ("falling") and continues generating.
Transformer: Encoder and Decoder

After the Transformer was proposed in 2017, Google and OpenAI each built on one part of it to develop BERT and GPT respectively, leading to significant achievements in the NLP field.

BERT (Encoder)
• Pre-trained model for various NLP tasks
• Bidirectional: Looks at both past and future context to understand meaning.

GPT (Decoder)
• Text generation model
• Unidirectional: Predicts the next word based only on past words.
Understanding the Transformer Structure

BERT is great for understanding text (NLP tasks like search and classification).
GPT is powerful for generating text (chatbots, creative writing).

Encoder

Embedding

▪ Transformers process words in parallel, so they don’t inherently understand which word comes first, second, etc.
▪ Word order matters in language. For example:
▪ “The cat chased the dog.”
▪ “The dog chased the cat.”
▪ These two sentences have different meanings, but a Transformer without positional encoding would treat them the same!
▪ Each word in a sentence gets an extra positional vector added to its embedding.
• Word embedding → Represents the meaning of the word.
• Positional encoding → Tells the model where the word is in the sentence.
Word Embedding

▪ An embedding is a dense vector representation of a word, token, or subword in a continuous space.
▪ Example: "cat" → [0.25, -1.03, ..., 0.78] (e.g., 512 dimensions)
▪ Embeddings capture semantic meaning → Similar words have similar embeddings: "king" and "queen" might be closer in embedding space than "king" and "table".
Input Embedding
Embedding dimension is the length of the vector used to represent each token.
Word / Token Embedding and Word2vec

▪ Word embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated using various methods, such as neural networks.
▪ Word2Vec allows words to be represented as vectors in a continuous vector space.
Embedding

1 apple
2 green
3 blue
4 cat
5 fish
6 sky
7 grass
8 glass
⋮
Embeddings in LLMs are numerical representations of text that capture semantic meaning in a high-dimensional vector space.
Similar concept to latitude and longitude (written here as [latitude, longitude], with east longitude positive):

• Washington DC is at [ 38.9, -77 ]
• New York is at [ 40.7, -74 ]
• London is at [ 51.5, -0.1 ]
• Paris is at [ 48.9, 2.4 ]
• Taipei is at [ 25, 121.6 ]

Such vectors make it possible to compute distance, position, ...
Word2vec - Shallow neural network models used to learn word embeddings

Each word of the input "I am studying NLP now" is looked up in the embedding table:

Word      Embedding
I         [ 0.1,  0.4,  0.9,  0.7]
am        [ 0.5, -0.2, -0.2,  0.3]
studying  [ 0.4,  0.6, -0.1,  0.8]
NLP       [-0.2, -0.4,  0.9, -0.3]
now       [ 0.1,  0.5, -0.7,  0.8]

Two Word2vec training objectives:
• CBOW: Predict the target word from surrounding context words
• Skip-gram: Predict surrounding context words from the target word
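The two Word2vec objectives differ only in which side of the (context, target) relation is the input. The sketch below generates the training pairs each objective would see for the sentence "I am studying NLP now"; the window size of 1 and the function names are assumptions made for illustration, and no neural network is trained here.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: predict each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(context_words, target) pairs: predict the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["I", "am", "studying", "NLP", "now"]
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
print(sg)
print(cb)
```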
Embedding Property

• Calculate similarity
• The similarity between "man" and "woman" is higher than the similarity between "man" and "apple."

• Perform numerical calculations directly
• woman + ( king - man ) ≈ queen

• Analyze the meaning

(Figure: "queen", "woman", "king", "man", and "apple" plotted in embedding space.)
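Both properties can be checked with a few lines of arithmetic. The sketch below uses made-up 2-dimensional vectors (real embeddings have hundreds of dimensions and are learned, not hand-written) to compute cosine similarity and the "woman + (king - man)" analogy; every number in `emb` is an assumption chosen so the example works.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any trained model.
emb = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}

sim_man_woman = cosine(emb["man"], emb["woman"])
sim_man_apple = cosine(emb["man"], emb["apple"])

# Analogy: woman + (king - man), then find the nearest remaining word.
target = [w + k - m for w, k, m in zip(emb["woman"], emb["king"], emb["man"])]
candidates = [word for word in emb if word not in {"woman", "king", "man"}]
best = max(candidates, key=lambda word: cosine(emb[word], target))
print(sim_man_woman, sim_man_apple, best)
```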
Embedding Projector
https://projector.tensorflow.org/
Transformer Architecture

▪ The Transformer architecture addresses the problem of preserving long-term dependencies (the vanishing gradient issue in RNNs and LSTMs) by leveraging:
▪ self-attention mechanisms, which relate every word to every other word and weigh the importance of different words in a sentence.
▪ positional encodings, which represent each word’s position.
▪ This enables parallel computation over the entire text, rather than sequential processing, without losing word order. Transformers outperform traditional sequence models (e.g., RNN, LSTM) on many NLP tasks.
Positional Encoding
- "What is the order of the words?"

▪ Transformers don’t have loops or recurrence, so they have no idea about word order unless we explicitly tell them by adding positional information.
▪ The positional encoding is a fixed-length vector (computed with sinusoids) added to the embedding xᵢ of word i, enabling parallel processing of input sequences without losing the word order.
▪ This allows the model to learn relative positions, such as next word, previous word, and distance between words.
▪ Final Input = Token Embedding + Positional Encoding
Positional Encoding: sine and cosine functions generate unique position values

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here pos is the position of the word in the sequence, d_model is the embedding dimension, and i is the index of the dimension in the embedding vector, indicating which sine or cosine function to use.

• By using sine and cosine functions, each position has a unique encoding, which allows the model to differentiate between words based on their order.
• The continuous nature of sine and cosine functions provides smooth transitions between positional encodings, which can help the model learn relationships between different positions more effectively.
Example: Positional Encoding

Each word gets a unique pattern of numbers that helps the Transformer understand its position.

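As a rough illustration of how those position patterns can be produced, the sketch below computes sinusoidal encodings in the style of the original Transformer paper (base 10000, sine on even dimensions, cosine on odd ones) and adds them to token embeddings. The 4-dimensional token embeddings are invented for the example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Even dimensions use sin, odd dimensions use cos of the same angle."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for dim in range(0, d_model, 2):
            # dim plays the role of 2i in the usual formula.
            angle = pos / (10000 ** (dim / d_model))
            pe[pos][dim] = math.sin(angle)
            if dim + 1 < d_model:
                pe[pos][dim + 1] = math.cos(angle)
    return pe

# Hypothetical token embeddings; tokens 0 and 2 are the same word.
token_embeddings = [
    [0.5, -0.2, 0.1, 0.9],
    [0.3, 0.8, -0.5, 0.0],
    [0.5, -0.2, 0.1, 0.9],
]
pe = positional_encoding(len(token_embeddings), 4)

# Final Input = Token Embedding + Positional Encoding
final_input = [
    [e + p for e, p in zip(tok, pos)]
    for tok, pos in zip(token_embeddings, pe)
]
for row in final_input:
    print([round(v, 3) for v in row])
```

The same word at positions 0 and 2 ends up with different input vectors, which is what lets the model tell the two occurrences apart.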
Self-Attention
- "Who should I pay attention to?"

▪ In a sentence, not all words are equally important to each other. Self-attention lets the model look at all the words in the sentence at once and decide which ones matter most for understanding a specific word.
▪ The "self" part of self-attention refers to the "egocentric" focus of each token in the input sequence. Effectively, on behalf of each token of input, self-attention asks, "How much does every other token of input matter to me?"
▪ "The cat sat on the mat." → To understand "sat" in context, the model might pay attention to:
▪ "cat" (to know who sat)
▪ maybe "mat" (to know where it sat)
▪ The self-attention mechanism computes how much attention "sat" should give to each word (including itself), and forms a weighted sum of all word vectors.
Self-Attention

▪ Self-attention calculates attention scores between different parts of a sentence. This helps the model decide which words to focus on when processing input sequences.
▪ Each input word embedding x is transformed into 3 vectors:
• Query (Q)
• Key (K)
• Value (V)
• The query Q of a specific word is compared with the keys K of all words (including itself) to decide how much of each value V to retrieve.
• Attention Score = Q · Kᵀ (dot product), scaled by √dₖ and normalized with a softmax so that the weights for each word sum to 1
▪ Weighted sum = Σⱼ attention_weight(i, j) × Vⱼ → the new representation of word i, dominated by the words j most relevant to it
▪ This process happens for every word at the same time; it is "self"-attention because the queries, keys, and values all come from the same sequence.
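The computation can be sketched in plain Python as scaled dot-product attention, the common formulation in which scores are divided by the square root of the key dimension and passed through a softmax. The three 2-dimensional token vectors and the identity projection matrices are assumptions picked to keep the numbers easy to follow; a trained model learns its projections.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return (outputs, attention weights) for one attention head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted sum of value vectors

# Hypothetical inputs: 3 tokens, 2 dimensions, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```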
Single Head Attention

▪ It learns to focus on one type of relationship between words.

Multi-Head Attention

▪ The multi-head attention mechanism performs the self-attention operation multiple times in parallel, each with its own set of weights.
▪ Each "head":
• Sees the input slightly differently
• Focuses on different types of relationships
▪ For example, with 4 attention heads (an illustration; the roles are learned, not assigned):
• Head 1: Focuses on subject-verb
• Head 2: Focuses on positional alignment
• Head 3: Focuses on adjective-noun
• Head 4: Focuses on long-distance dependencies
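Multi-head attention:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

head function:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

concatenation of h heads:

$$\mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) = [\mathrm{head}_1; \mathrm{head}_2; \dots; \mathrm{head}_h]$$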
Multi-Head Attention

▪ Multi-Head Attention runs multiple attention layers in parallel.

• Different heads focus on different aspects of the input.
• The results are concatenated and passed through a linear layer.

• Handles long-range dependencies better than RNNs.

• Parallelizable unlike sequential models.

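A minimal sketch of the "run several heads, concatenate, then apply a linear layer" idea follows. The sizes (3 tokens, model dimension 4, 2 heads) and the random weights are assumptions for illustration; each head here gets its own query, key, and value projection, as in the standard formulation.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) projection triples, one per head."""
    outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs token by token, then mix them with w_o.
    concat = [[value for head in outputs for value in head[t]] for t in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_tokens, d_model, n_heads = 3, 4, 2
d_head = d_model // n_heads
x = rand_matrix(n_tokens, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
w_o = rand_matrix(n_heads * d_head, d_model, rng)
out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))
```

Because the heads do not depend on each other, they can be computed in parallel; the list comprehension here evaluates them one after another only for simplicity.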
Feed Forward Layer

▪ Two fully connected layers (also called dense layers) with a ReLU activation function and dropout in between, operating independently on each position (token) in the sequence to transform and refine the output representations
▪ Applies a transformation to each token vector independently
▪ Helps the model learn non-linear combinations of features
▪ Acts like a feature refiner after attention
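A position-wise feed-forward layer can be sketched as two dense layers with a ReLU between them, applied to each token vector separately. The weights below are small hand-picked numbers (an assumption for illustration), and dropout is left out because it only applies during training.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(v, weights, bias):
    """Dense layer: weights has one row per input feature, one column per output."""
    return [sum(x * w for x, w in zip(v, col)) + b for col, b in zip(zip(*weights), bias)]

def feed_forward(token, w1, b1, w2, b2):
    """Expand to the hidden size, apply ReLU, project back to the model size."""
    return linear(relu(linear(token, w1, b1)), w2, b2)

# Hypothetical sizes: model dimension 2, hidden dimension 4.
w1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [0.5, -1.0], [1.0, 2.0]]
outputs = [feed_forward(tok, w1, b1, w2, b2) for tok in sequence]
print(outputs)
```

Tokens 0 and 2 are identical and therefore get identical outputs: the layer looks at one position at a time and never mixes information across tokens.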
Decoder

• In the decoder, the generated text sequence is produced one word at a time. Each output word is appended to the decoder's input, which is then passed through the decoder's attention mechanism. After N decoder stacks, a softmax output predicts the most probable next word.
• The process may seem sequential. During real training, however, the decoder can process the predictions of the next words for the entire sequence in parallel.
• The self-attention layer in the decoder also employs the attention mechanism, but with future-word masks to prevent access to future information. Thus, it is also called a causal self-attention layer.
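The future-word mask can be illustrated with a lower-triangular pattern: position i may attend to positions 0..i and nothing later. In this sketch, masked scores are set to negative infinity before the softmax so they receive zero weight; the all-zero score matrix is just a convenient assumption that makes the resulting weights easy to read.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # hypothetical raw attention scores
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
```

With this mask all positions can be trained in one parallel pass, since no position can see the words it is supposed to predict.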
What is the training
process for LLMs
Deep Learning Training Process

Data Collection → Data Preprocess → Model Building → Model Training → Model Evaluation → Model Optimization → Deployment

Input / Output types: Image, Audio, Numerical, Text, …

For text input and output, preprocessing consists of tokenization and embedding.
Vector Database
Why is Vector Database Necessary?

Question: What is artificial intelligence?

In a conventional database, embeddings are difficult to search and compare.

Considerations:
● Storage format
● Query operations
● Performance requirements

What is Vector Database?

Question: What is artificial intelligence? → Embedding → vector similarity search against the embeddings stored in the vector database:
[0.1, 0.3, -0.2, …]
[0.9, -0.1, -0.4, …]
⋮
[0.8, 0.7, -0.6, …]

Use case: Enterprise-specific LLM
• Create a document database, then answer immediate questions from it.

The possible return methods for actual applications:
1. Return the highest-scoring result directly.
2. Return multiple similar results.
3. Return a summary or answer after querying through FFM (Feature-Fused Model).
Common Vector Databases

▪ Pinecone
▪ FAISS (Facebook AI Similarity Search)
▪ Chroma
▪ Milvus
▪ Redis

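The core operation these databases provide is vector similarity search. The sketch below is a toy in-memory stand-in, not the API of any of the products listed: `TinyVectorStore`, the document texts, and the 3-dimensional vectors are all hypothetical, and production systems use approximate nearest-neighbour indexes rather than a full scan.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and scans them all."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=1):
        """Return the k most similar entries as (score, text), best first."""
        scored = [(cosine(query_vector, vec), text) for text, vec in self.items]
        return sorted(scored, reverse=True)[:k]

store = TinyVectorStore()
store.add("Notes on baking bread", [0.1, 0.3, -0.2])
store.add("Introduction to artificial intelligence", [0.9, -0.1, -0.4])
store.add("History of computing", [0.8, 0.7, -0.6])

# Assumed embedding of the question "What is artificial intelligence?"
query = [0.8, -0.2, -0.5]
results = store.search(query, k=2)
for score, text in results:
    print(round(score, 3), text)
```

Returning `k=1` corresponds to "return the highest-scoring result directly", and a larger `k` to "return multiple similar results".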
How to train/fine-tune
your own LLM
Pretrained LLMs

▪ Pretrained Large Language Models (LLMs) are advanced neural network architectures pretrained (offline) on vast amounts of text data, allowing them to capture the intricacies of language, context, and meaning.
▪ Popular Pretrained LLMs
▪ Google BERT (Bidirectional Encoder Representations from Transformers)
▪ OpenAI GPT (Generative Pretrained Transformer)
▪ T5 (Text-to-Text Transfer Transformer)
▪ …
Learning of LLMs

Pre-training model: Billions → Instruction Tuning: 10K~M → Alignment: K → Prompting: O(1)

Size: Number of Parameters

Number of parameters: the learnable weights in a neural network

▪ In a Transformer-like LLM, parameters come from:

1. Token embedding matrix
▪ Vocabulary size = 50,000
▪ Embedding dimension = 4,096
▪ [50,000 × 4,096] = 204,800,000 parameters
▪ ≈ 205M just for embeddings!
2. Attention layers (queries, keys, values, output)
3. Feed-forward layers
4. Layer norms, biases, projection heads, etc.

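The embedding arithmetic above extends naturally to the other components. The sketch below gives a rough parameter count for a Transformer stack under stated assumptions: square query/key/value/output projections with biases, a feed-forward hidden size of 4 × the model dimension, two layer norms per layer, and a hypothetical depth of 32 layers. Real models differ in these details, so the total is an order-of-magnitude estimate only.

```python
def embedding_params(vocab_size, d_model):
    return vocab_size * d_model

def attention_params(d_model):
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    return 4 * (d_model * d_model + d_model)

def feed_forward_params(d_model, d_ff):
    # Two dense layers: d_model -> d_ff and d_ff -> d_model, with biases.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def layer_norm_params(d_model):
    return 2 * d_model  # scale and shift

def transformer_params(vocab_size, d_model, n_layers, d_ff):
    per_layer = (
        attention_params(d_model)
        + feed_forward_params(d_model, d_ff)
        + 2 * layer_norm_params(d_model)
    )
    return embedding_params(vocab_size, d_model) + n_layers * per_layer

vocab_size, d_model = 50_000, 4_096
n_layers, d_ff = 32, 4 * d_model  # assumed, not from the lecture
print(f"embeddings: {embedding_params(vocab_size, d_model):,}")
print(f"total:      {transformer_params(vocab_size, d_model, n_layers, d_ff):,}")
```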
Language Model Size

(Figure: model size, dataset, and computing resources plotted against time.)

Source: https://epochai.org/blog/tracking-large-scale-ai-models
Pretraining

• When you first train a model, its parameters are randomly initialized and adjusted during training. Pretraining involves training the model on a large corpus of text without specific tasks in mind. The goal is to learn patterns, structures, and semantics of language.
• Common objectives include:
• Masked Language Modeling (MLM): Predicting missing words in a sentence (used in models like BERT).
• Next Sentence Prediction (NSP): Determining if one sentence follows another (also used in BERT).
• Causal Language Modeling: Predicting the next word in a sequence (used in models like GPT).

▪ Large Scale
• These models typically have millions to billions of parameters, allowing them to learn complex patterns and relationships in the data.
• The size of the model contributes significantly to its performance, but it also requires substantial computational resources for training and inference.
How to Train a Deep Learning Model

(Figure: the training loop. Forward pass → prediction → evaluation (loss) → backward pass.)
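The forward / loss / backward cycle can be shown on the smallest possible model. This sketch fits a line y = w·x + b with gradient descent; the data, learning rate, and epoch count are assumptions chosen so the loop converges, and the gradients are written out by hand where a deep learning framework would compute them automatically.

```python
def train(xs, ys, lr=0.05, epochs=500):
    w, b = 0.0, 0.0  # parameters start from an arbitrary initial value
    loss = 0.0
    for _ in range(epochs):
        # Forward: compute predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Evaluation: mean squared error between predictions and targets.
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        # Backward: gradients of the loss with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = 2 * sum(errors) / len(xs)
        # Update: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, loss = train(xs, ys)
print(round(w, 3), round(b, 3), loss)
```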
Fine Tuning
What is Transfer Learning?
• Transfer existing knowledge to a new domain without the need to relearn!

Transfer learning (TL) is a machine learning (ML) technique that reuses a model pre-trained on one task to improve performance on a related task.
This technique retrains existing models with new data instead of training a new model from scratch.
Fine Tuning

• After pretraining, the model can be fine-tuned on specific tasks (e.g., sentiment analysis, question answering) using smaller, task-specific datasets.
• Fine-tuning helps the model adapt its general language knowledge to specific knowledge for particular applications.
▪ Fine-tuning is the process of adjusting the parameters of a pre-trained model to adapt it for a specific task or domain. In simpler terms, it’s like taking a model that has learned general patterns (from a large, diverse dataset) and then making it better at a particular task by adjusting its weights with a more specific, task-related dataset.
Fine Tuning

• Full Fine-Tuning
• Fine-tuning with limited resources
• In-context learning: Instruction Tuning (Hard Prompt, strictly not
fine tuning)
• Parameter-Efficient Fine-Tuning (PEFT)
• Distilled Training
• Data Efficient Training
• Alternative: Retrieval Augmented Generation (RAG)

https://developers.google.com/machine-learning/crash-course/llm/tuning
Full Fine-Tuning

▪ This approach involves fine-tuning the entire pre-trained LLM on a specific task or dataset.
▪ All the parameters of the model are fine-tuned: every layer undergoes small adjustments based on the task-specific dataset.
▪ This technique is effective but can be computationally expensive and time-consuming, especially for large-scale models.
The full fine-tuning workflow:

1. Dataset preparation (specific-domain dataset)
2. Data preprocess
3. Model building (starting from the pretrained LLM)
4. Model training
5. Model evaluation
6. Iterate until the goal is achieved

The resulting fine-tuned LLM then receives queries from users and returns responses.
Soft vs Hard Prompts

Hard prompt: The actual written or spoken phrases or questions that a user inputs directly into the model during its use.

Soft prompt: The model weights of the LM are frozen, and separate learnable tensors are concatenated with the input embeddings and trained for the specific downstream task.
In-context Learning: Hard Prompt

• LLMs have the ability to quickly learn and apply new concepts or skills based on the provided context, without requiring explicit fine-tuning or retraining.

• Hard Prompting can help activate and utilize this in-context learning capability.

Instruction Tuning

• Instruction tuning involves fine-tuning the model parameters on a dataset where tasks are described in natural language, like: "Translate this sentence," "Summarize the paragraph," or "Answer the question."

• The key idea behind Instruction Tuning is to provide the language model with clear, structured, and tailored instructions or prompts that guide the model's behavior and output towards the desired task or objective.
Parameter-Efficient Fine-Tuning (PEFT)

• PEFT is a more efficient fine-tuning approach that aims to minimize the number of parameters that need to be updated during the fine-tuning process (only a subset of parameters is modified).
• Instead of updating the entire model, PEFT introduces additional trainable parameters, such as adapter modules or soft prompts (prompt tuning), while keeping the majority of the pre-trained parameters frozen.
• This approach can significantly reduce the computational and memory requirements for fine-tuning, making it more efficient for large-scale models.

The PEFT workflow:

1. Dataset preparation (specific-domain dataset)
2. Data preprocess
3. Model building (the pretrained LLM plus extra architecture, or part of the pretrained model, as the trainable portion)
4. Model training
5. Model evaluation
6. Iterate until the goal is achieved

The resulting fine-tuned LLM then receives queries from users and returns responses.
PEFT Techniques

▪ Soft Prompting: P-tuning, prefix tuning, and prompt tuning all modify how the model processes input, but they do so in different ways.
▪ Adapters: Introduce small additional layers while keeping the base model frozen.
▪ LoRA (Low-Rank Adaptation): Injects small trainable matrices into transformer layers, reducing memory overhead.
▪ QLoRA: Quantized LoRA

Hugging Face’s PEFT: Supports LoRA, Adapters, and Prefix Tuning.
Soft Prompting

▪ Soft prompts are learnable continuous embeddings (vectors) rather than hard-coded text (like "Translate this sentence:").
Soft Prompting

▪ Prompt Tuning

▪ Prefix Tuning

▪ P-Tuning

LoRA (Low-Rank Adaptation)

▪ Train only a small number of added parameters while keeping the model's original parameters frozen.
▪ During training, only the low-rank update matrices A and B are updated, while the rest of the model remains frozen.
▪ Rank = number of linearly independent rows (or columns) in the matrix
▪ The key idea is to represent changes to the model's parameters in a low-rank form, so that the added parameters are efficient in terms of both computation and memory.
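A small sketch can show both the mechanics and the savings. Following the usual LoRA formulation, the frozen weight matrix W (d × k) is left untouched and a product B·A is added to it, where B is d × r and A is r × k with a small rank r. The tiny matrices and the 4,096 × 4,096 layer size used for the count are assumptions for illustration.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, scale=1.0):
    """h = W x + scale * B (A x); only a and b would be trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + scale * u for h, u in zip(base, update)]

def full_params(d, k):
    return d * k

def lora_params(d, k, r):
    return r * (d + k)  # B is d x r, A is r x k

# Hypothetical frozen layer (d = k = 3) with a rank-1 update.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[0.5, -0.5, 1.0]]          # r x k
b_zero = [[0.0], [0.0], [0.0]]  # d x r, zero at the start of training
b_trained = [[1.0], [0.0], [2.0]]
x = [1.0, 2.0, 3.0]

print(lora_forward(w, a, b_zero, x))     # identical to the frozen model
print(lora_forward(w, a, b_trained, x))  # shifted by the low-rank update
print(full_params(4096, 4096), lora_params(4096, 4096, 8))
```

Starting B at zero means the adapted model initially behaves exactly like the pretrained one, and training then moves it away only through the low-rank term.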
Quantization-Aware Fine-Tuning - QLoRA

▪ Reducing the precision of model weights (e.g., from FP32 to INT8 or INT4) lowers memory usage and can speed up computation.

• QLoRA: Combines quantization of the frozen base model with LoRA for efficient fine-tuning on consumer GPUs.
What is Quantization?

▪ Quantization is a technique used to reduce the memory and computational cost of running large language models (LLMs) by reducing the precision of numerical values (weights and activations) from higher precision (e.g., FP32) to lower precision (e.g., INT8, INT4).
▪ Example:
Original Model Weights in FP32 (32-bit floating point)
Weights (FP32): [ 0.15234375, -0.83789062, 1.12890625, -0.23046875 ]
Quantized Weights in INT8 (8-bit integer, range -128 to 127)
Scaling factor = max |weight| / 127 = 1.12890625 / 127 ≈ 0.008889
Weights (INT8): [ 17, -94, 127, -26 ]
Reconstructed Weights: [ 17 × 0.008889, -94 × 0.008889, 127 × 0.008889, -26 × 0.008889 ] ≈ [ 0.151, -0.836, 1.129, -0.231 ]
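The quantize / reconstruct round trip can be sketched with symmetric per-tensor quantization, one simple scheme among several: the scale is chosen so the largest absolute weight maps to 127, which guarantees every quantized value fits in a signed 8-bit integer. The function names are illustrative, and the weights are the four FP32 values from the example.

```python
def quantize_int8(weights):
    """Symmetric quantization: the largest |weight| maps to 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp as a safeguard so results always stay in the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.15234375, -0.83789062, 1.12890625, -0.23046875]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print([round(r, 3) for r in restored])
print(round(scale, 6), max_error)
```

The rounding error per weight is at most half the scale, which is the precision given up in exchange for storing 8 bits instead of 32.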
Distilled Training

▪ LLM distilled training refers to the process of training a smaller model (student) to mimic a larger model (teacher), retaining key capabilities while reducing size.
▪ Knowledge Distillation: The student model learns from the soft probabilities of the teacher model.
▪ Examples of Distilled Models
• DistilBERT (from BERT)
• TinyBERT (from BERT)
• DistilGPT2 (from GPT-2)
Knowledge Distillation Process

1. A large teacher model (e.g., GPT-4, LLaMA 3) generates responses, logits (scores that define a probability distribution over tokens), or intermediate representations.
2. A smaller student model (e.g., DistilBERT) is trained to mimic the teacher’s outputs.
3. The student model learns from both hard labels (ground truth data) and soft labels (teacher’s predictions).
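Step 3 is usually expressed as a weighted sum of two losses. The sketch below follows the common formulation: a cross-entropy term on the hard label plus a KL-divergence term between temperature-softened teacher and student distributions, with the soft term multiplied by the squared temperature. The logits, the temperature of 2.0, and the weight `alpha` are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """alpha weights the hard-label loss, (1 - alpha) the soft-label loss."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft labels: KL divergence from the teacher to the student distribution.
    soft_loss = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Hard labels: cross-entropy against the ground-truth token.
    hard_loss = -math.log(softmax(student_logits)[true_index])
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss

teacher = [4.0, 1.0, 0.5]          # hypothetical teacher logits
student_close = [3.5, 1.2, 0.4]    # student that roughly agrees
student_far = [0.5, 1.0, 4.0]      # student that disagrees
print(distillation_loss(student_close, teacher, true_index=0))
print(distillation_loss(student_far, teacher, true_index=0))
```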
Logits: unscaled predictions (scores) for each token, which can then be transformed into probabilities by passing them through a softmax function.
Data Efficient Training

▪ Optimize dataset selection and preprocessing to minimize training time.
• Few-shot Fine-tuning: Use a small, high-quality dataset → leverages general knowledge from pre-training and specializes in a specific task or domain with minimal data.
• Active Learning: Select only the most informative (uncertain or difficult) samples for fine-tuning → minimizes the amount of labeled data required.
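One common way to pick "the most informative" samples is uncertainty sampling: rank unlabeled examples by the entropy of the model's predicted class probabilities and send the highest-entropy ones for labeling. The sketch below assumes the probabilities are already available; the texts and numbers in `pool` are invented.

```python
import math

def entropy(probs):
    """Higher entropy means the model is less sure about the example."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool, k):
    """pool holds (text, predicted_probabilities) pairs."""
    ranked = sorted(pool, key=lambda item: entropy(item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical unlabeled pool with model confidence for two classes.
pool = [
    ("great movie", [0.95, 0.05]),
    ("it was fine I guess", [0.55, 0.45]),
    ("terrible", [0.02, 0.98]),
    ("not bad", [0.60, 0.40]),
]
to_label = select_most_uncertain(pool, k=2)
print(to_label)
```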
Alternative to fine-tuning:
Retrieval Augmented Generation (RAG)

▪ Retrieval Augmented Generation (RAG) is a technique that combines the strengths of retrieval-based systems and language generation models.
▪ Instead of modifying the model’s parameters, RAG enhances an LLM's responses by retrieving relevant external information from a knowledge base or document store at runtime and then using that information to guide the generation of the final output.
▪ This approach can enhance the factual accuracy and coherence of the LLM's responses, as it can draw upon external (often more up-to-date) knowledge in addition to its internal language modeling capabilities.
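The retrieve-then-generate flow can be sketched without any model at all. In the toy version below, retrieval is done by counting shared words (a stand-in for embedding similarity search), and the "generation" step stops at building the prompt that would be sent to an LLM. The documents, the prompt wording, and the function names are all assumptions.

```python
def tokenize(text):
    cleaned = text.lower().replace("?", "").replace(".", "")
    return set(cleaned.split())

def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Put the retrieved text in front of the question; weights stay untouched."""
    context = "\n".join(retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical knowledge base.
documents = [
    "The library opens at 9 am on weekdays.",
    "The cafeteria serves lunch from noon.",
    "Parking permits are renewed every September.",
]
question = "When does the library open?"
prompt = build_prompt(question, documents)
print(prompt)
```

Updating the knowledge base only means editing `documents`; nothing about the model has to be retrained.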
How RAG Differs from Fine-Tuning

When to Use RAG Instead of Fine-Tuning

▪ When your model needs real-time access to external or frequently
updated data.
▪ When you lack enough labeled training data for fine-tuning.
▪ When computational resources are limited, as RAG does not require
modifying model weights.

▪ You can fine-tune a model for better reasoning or response style
while still using RAG to provide external factual knowledge. This
hybrid approach can optimize both performance and efficiency.

Model Evaluation
Evaluation: Different Ways to Compute Metric Scores

Source:
https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

How to evaluate

https://huggingface.co/docs/evaluate/index

High-level categories of metrics

• Generic metrics, which can be applied to a variety of situations
and datasets, such as precision and accuracy.
• Task-specific metrics, which are limited to a given task, such as
Machine Translation (often evaluated using the metrics BLEU or
ROUGE) or Named Entity Recognition (often evaluated with
seqeval).
• Dataset-specific metrics, which aim to measure model
performance on specific benchmarks: for instance, the GLUE
benchmark has a dedicated evaluation metric.
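As an illustration of the generic metrics named above, accuracy and precision can be computed directly from predictions and reference labels. This is a plain-Python sketch with made-up binary labels; the function names are chosen for illustration, and libraries such as Hugging Face Evaluate provide ready-made implementations.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def precision(predictions, references, positive=1):
    """Of the items predicted positive, the fraction that are truly positive."""
    true_pos = sum(p == positive and r == positive
                   for p, r in zip(predictions, references))
    predicted_pos = sum(p == positive for p in predictions)
    return true_pos / predicted_pos if predicted_pos else 0.0

# Hypothetical binary labels (1 = positive class).
references  = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]

acc = accuracy(predictions, references)    # 4 of 6 labels match
prec = precision(predictions, references)  # 2 of 3 predicted positives are correct
print(acc, prec)
```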


Open LLM Leaderboard
Open LLM Leaderboard aims to track, rank and evaluate open LLMs and chatbots.
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Key Metrics in the Leaderboard

Capabilities and Roles of LLM in Knowledge and Reasoning
Instruction following


Capabilities of LLMs

• Natural Language Understanding: Ability to comprehend and
generate human-like text.
• Knowledge Representation: LLMs can encode knowledge in their
parameters and generate representations based on input queries.
• Reasoning Abilities: LLMs can, to varying extents, perform
reasoning tasks such as inference, summarization, and
question answering.

Chain-of-Thought

• LLMs can demonstrate step-by-step reasoning abilities, breaking
down problems into logical steps and articulating the thought
process.
• This technique allows the model to break down complex problems
into sequential steps, facilitating clearer reasoning and more
accurate conclusions.
• Prompting can encourage the model to generate such step-by-step
explanations, which can be valuable for tasks like problem-solving,
question answering, and knowledge-intensive applications.
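A prompt that encourages step-by-step explanations might be assembled as in the sketch below. The cue phrase "Let's think step by step." and the worked example are assumptions taken from common chain-of-thought prompting practice, not from the slides, and the function name is hypothetical. The sketch only builds the prompt text; sending it to a model is left out.

```python
def chain_of_thought_prompt(question, examples=()):
    """Build a prompt that asks for intermediate reasoning before the answer.

    Each example is a (question, reasoning, answer) triple shown to the model
    as a demonstration of the expected step-by-step format.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    # The final cue nudges the model to write out its steps first.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

# Hypothetical worked example used as a demonstration.
example = (
    "A shop has 3 boxes with 4 pens each. How many pens are there?",
    "Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12.",
    12,
)

prompt = chain_of_thought_prompt(
    "A class has 5 rows of 6 seats. How many seats are there?",
    examples=[example],
)
print(prompt)
```

Called without examples, the same function yields a zero-shot variant that relies on the cue phrase alone.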


Role of LLMs in Knowledge Representation

▪ Implicit Knowledge Representation
• LLMs encode information in their parameters, learned from
extensive corpora. This enables them to store and retrieve
knowledge efficiently without explicit structuring.
• Knowledge is distributed and embedded as semantic
representations in vector spaces, which allows LLMs to perform
tasks like answering factual questions or summarizing
documents.

Role of LLMs in Knowledge Representation

▪ Explicit Knowledge Representation

• LLMs can extract structured knowledge from unstructured data,
generating explicit representations such as:
• Knowledge Graphs (KGs): Extracting entities and
relationships from text to populate KGs.
• Ontologies: Recognizing hierarchical and semantic structures
to define relationships between concepts.
• Tabular Formats: Converting textual information into rows
and columns for analysis.
• Examples include applications in natural language interfaces for
structured query generation or database population.
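To make the explicit representations above concrete, the sketch below takes subject-relation-object triples and stores them both as a small knowledge graph and as table rows. The triples are hand-written stand-ins for what an LLM might extract from the sentence "Marie Curie was born in Warsaw, which is located in Poland, and won the Nobel Prize."; the extraction step itself is not shown, and all names are illustrative.

```python
from collections import defaultdict

# Hypothetical triples standing in for LLM extraction output.
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "won", "Nobel Prize"),
    ("Warsaw", "located_in", "Poland"),
]

def build_knowledge_graph(triples):
    """Map each entity to its outgoing (relation, entity) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)

def to_table(triples):
    """Tabular form: a header row followed by one row per triple."""
    header = ("subject", "relation", "object")
    return [header] + [tuple(t) for t in triples]

kg = build_knowledge_graph(triples)
table = to_table(triples)

print(kg["Marie Curie"])
for row in table:
    print(row)
```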
Reasoning with Knowledge

▪ LLMs can perform various reasoning tasks, including:

▪ Deductive Reasoning
• Drawing logical conclusions from explicit premises, such as
understanding syllogisms or resolving logical contradictions.
▪ Inductive Reasoning
• Generalizing patterns from specific instances, like summarizing
trends or identifying novel patterns in datasets.
▪ Abductive Reasoning
• Inferring the most likely explanation for observed phenomena,
such as identifying causes or generating plausible hypotheses.
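For reference, deductive reasoning over explicit premises can be written as a tiny symbolic forward-chaining loop. This is an assumed illustration of what a deductive conclusion looks like (the classic Socrates syllogism), not a description of how an LLM computes its answers; the rule format and function name are hypothetical.

```python
def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

# Premise: Socrates is a man. Rule: all men are mortal (instantiated for Socrates).
facts = {"Socrates is a man"}
rules = [
    (["Socrates is a man"], "Socrates is mortal"),
    (["Socrates is mortal"], "Socrates will die"),
]

conclusions = forward_chain(facts, rules)
print(sorted(conclusions))
```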
Reasoning with Knowledge

▪ LLMs can perform various reasoning tasks, including:

▪ Commonsense Reasoning
• Leveraging pre-trained knowledge to answer questions or
resolve ambiguities based on real-world norms and
expectations.
▪ Counterfactual Reasoning
• Analyzing hypothetical scenarios to reason about "what-if"
situations, aiding in simulations and predictive
modeling.
Applications

▪ Knowledge Augmentation
• LLMs can integrate external knowledge bases during inference, combining
their implicit knowledge with structured information for better accuracy and
interpretability.
▪ Question Answering (QA) Systems
• Applications like ChatGPT demonstrate how LLMs perform reasoning to
answer queries, synthesize knowledge, or even offer multi-hop reasoning
over connected facts.
▪ Automated Reasoning
• Assisting in formal logic tasks like theorem proving, solving puzzles, or
verifying logical consistency in complex systems.
▪ Semantic Search and Retrieval
• Supporting search engines and recommendation systems through
context-aware retrieval and ranking.
Limitations of LLMs

• Generation is based on next-token prediction, not on solid
facts or logical inference
→ Hallucinations
• The model's knowledge extends only to the date of its training data
→ Knowledge cutoff
• Lack of long-term memory and learning

LLMs                Knowledge cutoff date   Provider
GPT-4o              October 2023            OpenAI
GPT-3.5             January 2022            OpenAI
Google Gemini Pro   April 2023              Google
Llama 3-70B         December 2023           Meta
Claude 3            August 2023             Anthropic
Mistral-7B          August 2021             Mistral

Emergent Abilities of LLMs

• In-context learning
• Instruction following
• Step-by-step reasoning (chain-of-thought)


Summary

1. Types of Knowledge
2. Representation Methods
3. Case-Based Reasoning
4. Decision Making, Decision Support
5. LLMs in Knowledge Representation and Reasoning

ACK: Prof. Ender Özcan, UNUK, Prof. Tomas Maul & Dr Chen ZhiYuan, UNM

NEXT: Modelling and Simulation
