
Small Language Models (SLMs)
Arkaprava Roy
What are Small Language Models?
SLMs are compact versions of large language models (LLMs), with parameters in
the millions to a few billion, compared to LLMs with hundreds of billions.

• Efficiency: SLMs use less computational power and memory, making them ideal
for small devices and edge computing, enabling real-world applications like
on-device chatbots.

• Accessibility: With lower resource needs, SLMs are more accessible to developers and organisations, democratising AI for smaller teams.

• Customization: SLMs are easier to fine-tune for specific tasks, allowing for specialised models with higher accuracy in niche areas.
Model Architecture
The main architectural design approaches used to develop SLMs are:

• Lightweight Architecture

• Efficient Self-Attention Approximations

• Neural Architecture Search Techniques

• Small Multi-modal Models


Lightweight Architectures
Lightweight language model architectures are designed to
achieve efficient performance with fewer parameters and
reduced computational overhead, which is ideal for deployment
on resource-constrained devices such as mobile phones, edge
devices, and embedded systems. Representative lightweight
models typically follow either an encoder-only or a decoder-only
architecture, as sketched below.
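As a rough illustration (not taken from the slides), the following sketch shows what a small decoder-only configuration might look like in PyTorch; every hyperparameter value here is an illustrative assumption rather than a published model's settings.

```python
# Illustrative lightweight decoder-only LM (all hyperparameters are assumed values).
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class SLMConfig:
    vocab_size: int = 32_000
    d_model: int = 512       # hidden size
    n_layers: int = 8        # number of transformer blocks
    n_heads: int = 8         # attention heads
    d_ff: int = 2048         # feed-forward width
    max_seq_len: int = 2048


class TinyDecoderLM(nn.Module):
    """A decoder-only language model small enough for edge deployment.

    Positional encodings are omitted for brevity; a causal mask should be
    supplied at call time so the stack behaves autoregressively.
    """

    def __init__(self, cfg: SLMConfig):
        super().__init__()
        self.embed = nn.Embedding(cfg.vocab_size, cfg.d_model)
        block = nn.TransformerEncoderLayer(
            d_model=cfg.d_model, nhead=cfg.n_heads,
            dim_feedforward=cfg.d_ff, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=cfg.n_layers)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

    def forward(self, tokens, causal_mask=None):
        x = self.embed(tokens)
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)


model = TinyDecoderLM(SLMConfig())
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```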
Efficient Self-Attention Approximations
Deploying large language models can be challenging due to the
substantial number of parameters in the self-attention layers and
the computational cost of self-attention itself, which grows
quadratically with sequence length. Efficient approximations
replace full softmax attention with cheaper alternatives such as
linear or sparse attention.
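One such approximation is linear attention; the sketch below is an assumed example in the style of Katharopoulos et al. (not a method named in the slides), replacing quadratic softmax attention with a kernelized product whose cost is linear in sequence length. This is the non-causal variant, kept minimal for clarity.

```python
# Linear-attention sketch: O(n) in sequence length instead of O(n^2).
import torch


def feature_map(x):
    # ELU + 1 keeps the kernel features positive.
    return torch.nn.functional.elu(x) + 1.0


def linear_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim). Returns (batch, seq_len, dim)."""
    q, k = feature_map(q), feature_map(k)
    # Precompute K^T V once: (batch, dim, dim) -- cost is linear in seq_len.
    kv = torch.einsum("bnd,bne->bde", k, v)
    # Normaliser: phi(Q) . sum_n phi(K_n)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)


q = torch.randn(2, 128, 64)
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```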
Training Techniques
This section reviews the key training techniques used for language
model pre-training and fine-tuning. While SLMs use training
approaches similar to those of LLMs, the focus here is on efficient
techniques suited to learning with limited resources.

• Pre-training Techniques

• Fine-tuning Techniques
Pre-training Techniques
Pre-training language models (both SLMs and LLMs) efficiently requires
specialized techniques. One key method is mixed precision training, where
lower-precision numbers (FP16) are used for calculations while the
model’s weights are kept in higher precision (FP32). This speeds up
training without losing accuracy. Other techniques, like gradient
clipping (to prevent issues with large updates), and memory-efficient
optimizers (like Adafactor and Sophia) help improve stability and
performance. Additionally, distributed training methods, such as ZeRO and
FSDP, allow training to be spread across multiple machines, making it
faster and more scalable.
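A minimal sketch of how such a mixed-precision training step with gradient clipping can be written with PyTorch AMP; the model, optimizer, loss, and hyperparameters are placeholders, and a CUDA device is assumed.

```python
# Mixed-precision training step with gradient clipping (PyTorch AMP sketch).
import torch

model = torch.nn.Linear(512, 512).cuda()           # placeholder model; assumes a CUDA device
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                # scales the FP16 loss to avoid underflow


def training_step(batch, targets):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in lower precision while master weights stay in FP32.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(batch), targets)
    scaler.scale(loss).backward()
    # Unscale before clipping so the threshold applies to the true gradient norms.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```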
Fine-tuning Techniques
Fine-tuning adapts pre-trained models to specific tasks using smaller
datasets. Key techniques include:

• Parameter-Efficient Fine-Tuning (PEFT): Updates a small subset of parameters or adds lightweight modules, reducing costs and preventing overfitting (e.g., LoRA, Prompt Tuning); see the sketch after this list.

• Dynamic Adapters: Combine multiple adapters for multi-task learning and to prevent catastrophic forgetting.

• Data Augmentation: Enhances training data diversity and quality, improving generalization (e.g., AugGPT, Reflection-Tuning, FANNO).

These methods improve task adaptation with fewer resources and data.
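To make the PEFT idea concrete, here is a minimal LoRA-style adapter around a frozen linear layer; the rank, scaling, and layer sizes are illustrative assumptions rather than values from any specific recipe.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # the pre-trained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # W x + (B A) x * scaling -- only A and B are updated during fine-tuning.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling


layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # only the low-rank matrices
```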
Model Compression Techniques
Model compression techniques focus on reducing the size and
complexity of large pre-trained language models while
maintaining their performance. As a result, these methods are
a key approach to deriving SLMs from LLMs.

• Pruning Techniques

• Quantization

• Knowledge Distillation Techniques


Pruning Techniques
Weight pruning reduces the number of parameters to enhance efficiency and lower memory
usage while maintaining performance. Two approaches exist:

• Unstructured pruning removes individual weights, offering flexibility. SparseGPT optimises pruning via a sparse regression formulation with an ADMM algorithm, while n:m pruning (keeping n non-zero weights in each group of m) balances flexibility and efficiency. Unstructured pruning often requires specialised hardware to realise computational benefits.

• Structured pruning removes entire groups of parameters (for example, whole neurons) for more hardware-friendly implementation. Techniques like contextual sparsity and layer redundancy reduction help reduce GPU memory usage and improve speed.
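As a simple illustration of unstructured pruning (not the SparseGPT algorithm itself), the sketch below zeroes out low-magnitude weights using PyTorch's pruning utilities; the 30% sparsity level is an arbitrary example.

```python
# Unstructured magnitude pruning sketch with torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Zero out the 30% of weights with the smallest magnitude in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning mask permanent

zeros = sum((m.weight == 0).sum().item() for m in model if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model if isinstance(m, nn.Linear))
print(f"overall sparsity: {zeros / total:.2%}")
```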
Quantization
Quantization compresses large models by converting weights
and activations into lower-precision formats. Post-training methods
such as GPTQ and AWQ focus on weight quantization, while ZeroQuant
also quantizes activations. The challenges of activation
quantization are addressed by techniques like SmoothQuant and SpinQuant.
Quantization-aware training (QAT), such as LLM-QAT, improves
performance by training models with quantization in mind,
especially for deployment on mobile devices and FPGAs.
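To show the basic mechanism, here is a per-tensor symmetric INT8 weight quantization sketch; real methods such as GPTQ, AWQ, and SmoothQuant add calibration data and more careful error minimisation.

```python
# Per-tensor symmetric INT8 weight quantization sketch (no calibration data).
import torch


def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                      # map the largest weight to +/-127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale


w = torch.randn(4096, 4096)                            # a hypothetical weight matrix
q, scale = quantize_int8(w)
error = (w - dequantize(q, scale)).abs().mean()
print(f"int8 storage: {q.numel()} bytes vs fp32: {w.numel() * 4} bytes, "
      f"mean abs error: {error:.5f}")
```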
Knowledge Distillation Techniques
Knowledge distillation transfers knowledge from a larger model
(teacher) to a smaller one (student). Methods like BabyLlama show
that distillation from a robust teacher can outperform conventional pre-training.
Strategies such as sequence-level distillation and task-aware
filters improve distillation efficiency. Combining distillation
with pruning further reduces model size while preserving
performance. Recent research also explores using rationales and
reasoning chains for more efficient distillation, improving model
performance in reasoning tasks.
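A minimal sketch of the standard logit-distillation loss that underlies these methods, mixing a softened teacher/student KL term with ordinary cross-entropy; the temperature, mixing weight, and vocabulary size are illustrative.

```python
# Knowledge-distillation loss sketch: soft targets from the teacher + hard labels.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


student_logits = torch.randn(8, 32_000)     # hypothetical 32k-token vocabulary
teacher_logits = torch.randn(8, 32_000)
labels = torch.randint(0, 32_000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```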
Evaluation Metrics
The key metrics for evaluating SLMs across different settings are:

• Latency

• Memory

• Privacy

• Energy Optimisation

SLMs are evaluated under specific deployment conditions and benchmark datasets, and
statistics for these metrics are collected; a rough measurement sketch follows below.
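A rough sketch of how latency and peak-memory statistics might be gathered for one model under one setting; the model and input here are placeholders, and a CUDA device is assumed.

```python
# Sketch: measuring inference latency and peak GPU memory for a forward pass.
import statistics
import time

import torch

model = torch.nn.Linear(512, 512).cuda()    # placeholder for an SLM; assumes a CUDA device
inputs = torch.randn(1, 512).cuda()         # placeholder for one evaluation sample

torch.cuda.reset_peak_memory_stats()
latencies = []
with torch.no_grad():
    for _ in range(100):
        torch.cuda.synchronize()            # make timing bracket the GPU work
        start = time.perf_counter()
        model(inputs)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)

print(f"median latency: {statistics.median(latencies) * 1e3:.2f} ms")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
```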
Examples of SLMs
Applications of SLMs
• Real-Time Interaction

• Content Generation and Processing

• Edge Inference and Privacy


Real-Time Interaction
• GPT-4o: Released in May 2024, it processes text, vision, and audio inputs, offering faster performance than GPT-4 Turbo and human-like responsiveness in conversational interfaces.

• LLaMA-Omni: A combination of speech encoder, adaptor, LLM, and streaming decoder for
real-time speech input interactions. It uses LLaMA-3-8B-Instruct.

• EMOVA (Emotionally Omni-present Voice Assistant): Uses LLaMA-3.1-8B to generate poems and describe images based on user requests, functioning as an end-to-end speech model.

• Google’s Project Astra: Uses Gemini to process audio and video data from smart
devices like smartphones or glasses. It can respond to queries, solve math problems,
and memorise sequences of objects.
Content Generation and Processing
• LLMR (Language Model for Mixed Reality): Utilizes multiple
LLMs in mixed reality for generating and modifying 3D
scenes. It includes different GPT-based models for scene
analysis, code generation, and code inspection.

• DreamCodeVR: Assists users in editing VR applications built in Unity by generating code, enabling non-programmers to modify VR content.
Edge Inference and Privacy
• On-device LLMs: Focus on providing LLM capabilities on mobile devices,
reducing latency and maintaining usability even without internet
connectivity.
• MobileLLM: Improves performance on chat benchmarks and performs comparably to LLaMA-2-7B on API-calling tasks.
• Apple Intelligence: A 3B parameter model for tasks like summarisation, image
generation, and code completion, operating directly on devices.

• HuatuoGPT & BioMistral: Tailored LLMs for medical and biomedical tasks,
adhering to privacy regulations, which can run on devices without an
internet connection.
Edge Inference and Privacy (contd.)
• Mixture-of-Experts (MoE): Reduces inference costs by activating only a subset of experts per input. Examples include GLaM (Google) and EdgeMoE; the latter extends this concept to edge devices such as the Nvidia Jetson TX2 and Raspberry Pi 4B. A toy routing sketch follows below.
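To illustrate the sparse-routing idea behind Mixture-of-Experts, here is a toy top-1 router over a few expert MLPs (an assumed minimal example, not GLaM or EdgeMoE); only one expert's parameters are touched per token.

```python
# Toy Mixture-of-Experts sketch: a router activates only one expert per token.
import torch
import torch.nn as nn


class TopOneMoE(nn.Module):
    def __init__(self, d_model: int = 512, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                    # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        top_score, top_idx = scores.max(dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed by a single expert, so only a fraction of
        # the total parameters are used at inference time.
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * top_score[mask].unsqueeze(-1)
        return out


moe = TopOneMoE()
print(moe(torch.randn(16, 512)).shape)       # torch.Size([16, 512])
```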
Bibliography
• Most of this study is based on the survey: https://arxiv.org/abs/2410.20011

• Other sources:
• https://www.superannotate.com/blog/small-language-models#small-language-model-examples
• https://medium.com/@nageshmashette32/small-language-models-slms-305597c9edf2
Thank You for Listening
